We often want to know the probability of a likely event in an uncertain situation. Once we determine the chance of its occurrence, we can also estimate its frequency, i.e., the expected number of favourable cases, and so on. If we know the general form of the distribution of the random variable, it becomes much easier to work out the probability of that event. This is why the Normal and Laplace distributions (in the continuous case) and the Poisson, Binomial, and Geometric distributions (in the discrete case) are so popular for modelling under uncertain circumstances.
For theoretical distributions, several statistical measures help identify the general form. For example, when the Mean, Median, and Mode coincide, the data may follow a Normal distribution. Likewise, when Mean = Variance, the data are consistent with a Poisson distribution, while in the Negative Binomial distribution (NBD) the Variance is greater than the Mean. From such properties of the data, we can narrow down the general form of the distribution. But the converse is not true: from E(X) and V(X) alone we cannot reconstruct the probability distribution of X, and hence cannot compute quantities such as P[|X − E(X)| ≤ C] exactly. So there are situations where, knowing just the Mean and Variance, we cannot identify the underlying distribution, and a standard distribution may not even fit the data. However, even when we cannot connect the data to a known distribution, we can still give a very useful upper (or lower) bound on such probabilities.
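The point that mean and variance alone do not determine the distribution can be seen with a small sketch. The two discrete distributions below are illustrative values chosen here (not from the article): both have mean 8 and variance 4, yet they assign different probabilities to the same event.

```python
# Two different distributions sharing E(X) = 8 and V(X) = 4,
# showing that mean and variance alone cannot pin down the distribution.

def mean(d):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in d.items())

def var(d):
    """Variance of a discrete distribution given as {value: probability}."""
    m = mean(d)
    return sum(p * (x - m) ** 2 for x, p in d.items())

a = {6: 0.5, 10: 0.5}               # two-point distribution
b = {4: 0.125, 8: 0.75, 12: 0.125}  # three-point distribution

print(mean(a), var(a))  # 8.0 4.0
print(mean(b), var(b))  # 8.0 4.0

# Same mean and variance, yet P(X <= 6) differs:
print(sum(p for x, p in a.items() if x <= 6))  # 0.5
print(sum(p for x, p in b.items() if x <= 6))  # 0.125
```

Because events like P(X ≤ 6) differ while E(X) and V(X) agree, no formula in mean and variance alone can recover exact probabilities; the best we can hope for is a bound.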
Chebyshev’s Inequality enables us to bound the probability that a random variable deviates from its mean. It does not give the exact probability, but it guarantees a probability interval. The inequality is stated as follows.
P[ |X − E(X)| < ε ] ≥ 1 − Var(X)/ε², where ε = kσ, E(X) = Mean, and V(X) = σ².
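As a minimal sketch, the lower bound can be wrapped in a small helper (the function name is mine, not from the article); it clamps at zero since a probability bound below zero carries no information:

```python
def chebyshev_lower_bound(variance, eps):
    """Chebyshev lower bound on P(|X - E(X)| < eps): 1 - Var(X)/eps^2, clamped at 0."""
    return max(0.0, 1.0 - variance / eps ** 2)

# With sigma = 2 and eps = k*sigma for k = 2 (i.e. eps = 4):
print(chebyshev_lower_bound(variance=4, eps=4))  # 0.75

# For eps smaller than sigma the bound is vacuous (0.0):
print(chebyshev_lower_bound(variance=4, eps=1))  # 0.0
```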
Thus, by choosing an arbitrary value of k in the above inequality, we obtain a lower limit on the probability that the random variable falls within a desired interval around its mean. The inequality was proposed by the Russian mathematician Pafnuty Chebyshev, and it shows how the variance controls the spread of a random variable around its expected value.
For illustration, consider an experiment of tossing a fair coin 16 times and finding the chance that the number of Heads lies between 4 and 12. This can easily be determined by modelling with the Binomial distribution: the experiment has a fixed number of trials, the trials are independent, each outcome has exactly two classifications, and the probability of success remains the same. Using the MS Excel formula BINOM.DIST(12,16,0.5,TRUE) − BINOM.DIST(4,16,0.5,TRUE), we obtain P(4 < X ≤ 12) = 0.950958. However, suppose the underlying distribution were unknown; how would the solution look using Chebyshev’s Inequality?
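The same computation can be reproduced without Excel. The sketch below builds the Binomial cumulative probability from first principles with the standard library and mirrors the BINOM.DIST difference above:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed from the probability mass function."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# P(4 < X <= 12) for 16 fair-coin tosses, mirroring
# BINOM.DIST(12,16,0.5,TRUE) - BINOM.DIST(4,16,0.5,TRUE)
exact = binom_cdf(12, 16, 0.5) - binom_cdf(4, 16, 0.5)
print(round(exact, 6))  # 0.950958
```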
For the above problem, Mean = np = 8 and Standard Deviation = √(np(1 − p)) = 2, both derived from the Binomial distribution itself. By Chebyshev’s Inequality, we need P[ |X − E(X)| < kσ ] = P[ |X − 8| < 2k ] ≥ 1 − 1/k². Choosing k = 2, we get P[ 4 < X < 12 ] ≥ 1 − 1/4 = 1 − 0.25 = 75%. Therefore, the chance that the number of Heads in 16 tosses lies between 4 and 12 is at least 75%, which is the lower limit on the probability of success.
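A quick sketch confirms that the Chebyshev bound holds here, and also shows how conservative it is compared with the exact Binomial answer for the open interval 4 < X < 12 (i.e., X in {5, …, 11}):

```python
from math import comb

def pmf(x, n=16, p=0.5):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Exact P(4 < X < 12): sum the pmf over x = 5, ..., 11
exact = sum(pmf(x) for x in range(5, 12))

# Chebyshev lower bound with mean 8, sigma 2, k = 2 (so eps = 4)
bound = 1 - 1 / 2**2

print(bound)            # 0.75
print(round(exact, 4))  # 0.9232
```

The exact probability (about 92%) comfortably exceeds the 75% guarantee: Chebyshev trades tightness for the ability to work with any distribution.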
Thus, Chebyshev’s Inequality can be used for any set of observations, irrespective of the shape of the data. For a symmetric, bell-shaped (Normal) distribution, about 68% of the data lie within one standard deviation of the mean, about 95% within two standard deviations, and so on; this is widely known as the empirical rule. But the interesting fact about Chebyshev’s Inequality is that it applies to any form of distribution under any uncertain circumstances.
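To see this distribution-free property in action, the sketch below draws samples from a heavily skewed distribution (exponential, chosen here purely for illustration) and checks that the fraction within k standard deviations of the mean still meets the Chebyshev bound:

```python
import random

random.seed(0)

# Samples from a skewed, decidedly non-Normal distribution (exponential, mean 1)
xs = [random.expovariate(1.0) for _ in range(100_000)]

m = sum(xs) / len(xs)
v = sum((x - m) ** 2 for x in xs) / len(xs)
sd = v ** 0.5

for k in (2, 3):
    within = sum(1 for x in xs if abs(x - m) < k * sd) / len(xs)
    bound = 1 - 1 / k**2
    # The observed fraction meets the Chebyshev guarantee even though
    # the data are far from bell-shaped.
    print(f"k={k}: observed {within:.3f} >= bound {bound:.3f}")
```

The observed fractions exceed the 75% (k = 2) and roughly 89% (k = 3) guarantees, even though the empirical rule’s Normal-based percentages do not apply to skewed data.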
Found this interesting? Do you think you would like to learn more about such statistical tools and techniques that can be used in data analysis? Check out Jigsaw Academy’s Data Analytics courses and find out more about how you can make a career in data.