Explaining Correlation to a Newbie to Data Analytics

8 Jun 2015

“If two or more quantities vary in sympathy so that movements in one tend to be accompanied by the corresponding movement in the other(s), then they are said to be correlated” – Conner.

Now if that in itself sounds like a complicated definition, then let’s try this, “Correlation is a mutual relationship or connection between two or more things.”

Correlation is really one of the very basics of data analysis and is an important tool for a data analyst, as it can help define trends, make predictions and uncover root causes for certain phenomena.

Before determining correlation, it helps to distinguish two types of data you can work with. Correlation applies to the second:

Univariate data: In a simple setup we work with a single variable. We measure central tendency to find a representative value, dispersion to measure the spread of the data around that value, skewness to measure the asymmetry of the distribution, and kurtosis to measure how heavy its tails are (often loosely described as how peaked it is). Data relating to a single variable is called univariate data.
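To make these four summaries concrete, here is a minimal Python sketch that computes them for one variable. The function name `summarize` and the height values are made up for illustration, and the formulas use the simple population form (dividing by n), which is only one of several conventions.

```python
import math

def summarize(values):
    """Return mean, standard deviation, skewness and excess kurtosis.

    Uses population-style central moments (divide by n). Other
    conventions (e.g. sample corrections) give slightly different numbers.
    """
    n = len(values)
    mean = sum(values) / n
    # Second, third and fourth central moments.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,                          # central tendency
        "std_dev": math.sqrt(m2),              # dispersion
        "skewness": m3 / m2 ** 1.5,            # asymmetry
        "excess_kurtosis": m4 / m2 ** 2 - 3.0, # tail weight relative to a normal curve
    }

# Hypothetical heights in centimetres, chosen to be symmetric around 165.
heights_cm = [150, 155, 160, 165, 170, 175, 180]
summary = summarize(heights_cm)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Because the made-up heights are evenly spaced around their mean, the skewness comes out as zero; a lopsided data set would give a positive or negative value instead.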

Bivariate data: It often becomes essential in our analysis to study two variables simultaneously, for example (a) the height and weight of a person, or (b) age and blood pressure. Statistical data on two characteristics of the same individual, measured simultaneously, is termed bivariate data.
The two variables may or may not be associated with each other. Here the word ‘correlation’ means the extent of association between the two variables recorded on each individual.

Now let’s discuss the four types of correlation:

Positive correlation
Negative correlation
Zero correlation
Spurious correlation

Positive correlation: If an increase in one variable tends to be accompanied by an increase in the other, we say that the two are positively correlated.
For example, the height and weight of a person are positively correlated.

Negative correlation: If an increase in one variable tends to be accompanied by a decrease in the other, we say that the two are negatively correlated.
For example, the price and demand of a commodity are negatively correlated. When the price increases, the demand generally goes down.

Zero correlation: If there is no clear-cut trend between the two variables, i.e., a change in one tells us nothing about the direction of change in the other, the two are said to be uncorrelated, or to have zero correlation.
For example, qualities like affection and kindness are in most cases uncorrelated with academic achievement, and a person’s intellect is uncorrelated with their complexion.
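The three directions described so far can be illustrated with a short Python sketch. It uses Pearson's correlation coefficient, which this post has not defined, as one common way of putting a number on the association: values near +1 suggest positive correlation, near -1 negative, and near 0 no linear trend. The helper `pearson_r` and all the data values are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

# Hypothetical paired observations, one pair per individual or period.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 85]        # rises with height

price = [10, 12, 14, 16, 18]
demand = [95, 80, 72, 60, 41]           # falls as price rises

kindness_score = [1, 2, 3, 4, 5]
exam_score = [80, 60, 90, 60, 80]       # no trend with kindness

r_positive = pearson_r(height_cm, weight_kg)
r_negative = pearson_r(price, demand)
r_zero = pearson_r(kindness_score, exam_score)

print(f"height vs weight:  {r_positive:+.2f}")
print(f"price vs demand:   {r_negative:+.2f}")
print(f"kindness vs exams: {r_zero:+.2f}")
```

Real data is rarely this tidy, so in practice a coefficient close to zero is treated as "no clear linear trend" rather than proof that the two variables are unrelated.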

Spurious correlation: If the correlation arises from the influence of some other ‘third’ variable, the two variables are said to be spuriously correlated.
For example, “body control problems” and clumsiness in children have been reported as being associated with adult obesity. One plausible explanation is that clumsy children participate less in sports and outdoor activities, and that is the ‘third’ variable here. Often it is difficult to identify the ‘third’ variable, and even when it is identified, it is more difficult still to gauge the extent of its influence on the two primary variables.
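The effect of a ‘third’ variable can be sketched with simulated data. In the Python example below, two measures have no direct link to each other but both depend on the same hidden variable, so they appear strongly correlated; once the straight-line effect of that variable is removed from each, the correlation largely disappears. Everything here is an assumption made for illustration: the variable names, the simulated numbers, and the choice of a simple least-squares adjustment as the way to "control for" the third variable.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

def residuals(ys, zs):
    """Remove the straight-line effect of zs from ys (simple least squares)."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_z = sum(zs) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden driver, plus two measures that each depend on it
# and on their own independent noise, but not on each other.
third_variable = [rng.gauss(0, 1) for _ in range(n)]
measure_a = [2 * z + rng.gauss(0, 1) for z in third_variable]
measure_b = [2 * z + rng.gauss(0, 1) for z in third_variable]

r_raw = pearson_r(measure_a, measure_b)
r_controlled = pearson_r(residuals(measure_a, third_variable),
                         residuals(measure_b, third_variable))

print(f"correlation ignoring the third variable:      {r_raw:+.2f}")
print(f"correlation after adjusting for the variable: {r_controlled:+.2f}")
```

This only works because the simulation records the third variable. With real data the difficulty described above remains: the adjustment is impossible until the third variable has been identified and measured.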

Now that we have understood the basics, look out for my next post which will delve deeper into the correlation coefficient and its properties.

In the meantime, if you want to further your understanding of statistical techniques, check out my previous posts on Time Series Analysis.
