Variables possess certain properties, commonly referred to as their characteristics. Often, several variables share the same characteristics, and these shared characteristics can be expressed in a mathematical form known as the probability density function (PDF) or probability mass function (PMF) of the variable. Associating a variable with a known PDF or PMF has several advantages, and the most widely used such distribution is the Normal distribution. Most procedures in inferential statistics assume that the sample drawn from the population is normally distributed.
There are different ways of checking whether a given variable is normally distributed. One of the simplest is diagrammatic presentation. The most popular diagram is the histogram, which is based on the frequency distribution and is useful for identifying the shape of a distribution. If plotting the variable as a histogram produces a bell-shaped curve, we conclude that the variable has the normal property. More generally, histograms can be used to identify the distribution of any data from the shape they form: the purpose of preparing a histogram is to infer a known probability density function or probability mass function.
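As a minimal sketch of this histogram check in R (the sample here is simulated; replace x with your own numeric variable):

```r
# Histogram check for normality, using a simulated sample for illustration.
set.seed(3)
x <- rnorm(200)                      # 200 draws from a standard normal
h <- hist(x, breaks = "Sturges",     # Sturges' rule picks the interval widths
          main = "Histogram of x", xlab = "x")
# A roughly bell-shaped histogram suggests the normal property.
```

Note that the impression you get depends on the `breaks` argument, which is exactly the interval-width sensitivity discussed below.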
However, the histogram is not very useful for evaluating how well the chosen distribution fits. With a small number of data points, a histogram can be rather ragged, and our impression of the fit depends on the widths of the histogram intervals. Even when the intervals are well chosen, grouping the data into cells makes it difficult to compare a histogram to a continuous probability density function.
Q-Q plots, or quantile-quantile plots, overcome these limitations of the histogram. Let's take an example:
Consider a sample drawn from a population. Constructing a Q-Q plot starts with ordering the sample data from smallest to largest. Let k denote the rank, so k = 1 for the smallest observation and k = n for the largest, where n is the sample size. The Q-Q plot is based on the fact that the k-th ordered value is an estimate of the (k - 0.5)/n quantile of the underlying distribution; in other words, the ordered values lie close to F⁻¹((k - 0.5)/n), where F is the cumulative distribution function (CDF). If F belongs to the hypothesized distribution, a plot of the ordered sample values against these theoretical quantiles will form an approximately straight line. If the assumed distribution is inappropriate, the points will deviate from the straight line in a systematic manner. Therefore, if we assume a normal CDF and the plot forms a straight line, this supports the claim that the sample comes from a normal population.
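The construction above can be sketched directly in R. This is an illustration on a simulated sample; base R's qqnorm() performs the same steps for you:

```r
# Manual normal Q-Q plot, built from the (k - 0.5)/n plotting positions.
set.seed(1)
x <- rnorm(50)                  # sample data (simulated for illustration)
n <- length(x)
k <- 1:n
p <- (k - 0.5) / n              # plotting positions (k - 1/2)/n
theoretical <- qnorm(p)         # inverse normal CDF at those positions
empirical <- sort(x)            # ordered sample values
plot(theoretical, empirical,
     main = "Manual Normal Q-Q Plot",
     xlab = "Theoretical Quantiles",
     ylab = "Sample Quantiles")
abline(mean(x), sd(x))          # reference line implied by a normal fit
```

For n > 10, base R's ppoints(n) returns exactly these (k - 0.5)/n positions, so qnorm(ppoints(n)) reproduces the theoretical quantiles.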
The following R command produces a normal quantile-quantile plot for the variable Income, which is part of the state dataset:
qqnorm(state$Income, main = "Normal Q-Q Plot", xlab = "Theoretical Quantiles", ylab = "Income Quantiles")
If the points in the resulting graph fall along a straight line, we conclude that the variable Income follows a normal distribution. In a Q-Q plot we usually focus on the central part of the line rather than the extreme points: the tails of the normal curve are asymptotic to the X-axis, and the sparse data in the tails mean the extreme points often stray from the line even for normal data. In a similar manner, one can obtain quantile-quantile plots for other distributions, for example with base R's qqplot() function or the qqPlot() functions available in add-on packages.
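As a sketch of the non-normal case, base R's qqplot() compares a sample against the quantiles of any distribution you can evaluate. Here, simulated data are checked against an exponential reference (the sample and the rate parameter are illustrative):

```r
# Q-Q plot against an exponential reference distribution.
set.seed(2)
y <- rexp(100, rate = 1)            # simulated exponential sample
n <- length(y)
theo <- qexp(ppoints(n), rate = 1)  # theoretical exponential quantiles
qqplot(theo, y,
       main = "Exponential Q-Q Plot",
       xlab = "Theoretical Quantiles",
       ylab = "Sample Quantiles")
abline(0, 1)                        # 45-degree reference line
```

Points close to the 45-degree line support the exponential assumption; systematic curvature indicates a poor fit.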
In conclusion, empirical procedures such as histograms and Q-Q plots let us identify the general form of a distribution, and of the two, Q-Q plots perform well even for small sample sizes. Because most statistical methods depend on the normality assumption, checking the normality of the sample data is important: if the assumption is violated, the inferences drawn from the analysis may not be reliable.
Hope you found this interesting. If you like working with numbers, why don't you check out Jigsaw Academy's Analytics and Big Data courses? Do something you love… Let your career soar.