An unprecedented explosion in the Data Science industry over recent years has brought about changes that have catapulted it into one of the most sought-after career choices of this century, leaving everyone wondering what it’s all about.
Let us begin to explore this huge analytical minefield by attempting to grasp one of its most fundamental concepts: hypothesis testing.
Access to a study’s statistical or qualitative data is not enough without understanding what it means or what it tells us. Cue Data Science, an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Analysis and manipulation of the data reveal patterns, and these lead to inferences and interpretations that allow us to draw conclusions about the study at hand. An essential part of reaching a solid understanding of the study is to infer whether the findings or observed patterns are a matter of chance or the result of an intervention. This is where statistical testing comes in. This branch of statistics is known as Inferential Statistics, and hypothesis testing falls under its wing.
What is a hypothesis? Simply put, a hypothesis is an as-yet-unverified idea or assumption about a population, which we evaluate using sample data. How can this idea, which may turn out to be true or false, be evaluated? We put it to the test.
How do we perform a hypothesis test? In common practice, a hypothesis test evaluates two mutually exclusive statements to determine which is better supported by the data at hand. Note that a test never proves a hypothesis correct; it only tells us whether the data provide enough evidence to reject one statement in favor of the other.
Coming back to our understanding of a hypothesis, what are the types of hypotheses?
1. Null Hypothesis
2. Alternative Hypothesis
Moving on, let us define the null hypothesis and the alternative hypothesis.
Denoted by H0, the null hypothesis is the statement assumed to be true by default, understood as the status quo. Unless it is rejected, the null hypothesis does not change the course of the study.
Capable of altering our entire understanding or interpretation of the study, the alternative hypothesis, usually denoted by H1 or Ha, is the statement that contradicts the null hypothesis. Accepting it means a change in the course of action of the study.
Examples of a null hypothesis and an alternate hypothesis
Consider the null hypothesis: ‘The presence of Cadmium in the soil does not affect the growth rate of plants.’
The hypothesis is tested by comparing growth rates of plants growing in soils containing a varying quantity of Cadmium, from complete absence to a staggering increase.
Only if the data lead us to reject this statement would further research on the effect of different concentrations of Cadmium on plants and their growth be warranted.
Any hypothesis that contradicts the statement ‘The presence of Cadmium in the soil does not affect the growth rate of plants’ is an alternative hypothesis.
Another more straightforward example would be that of determining whether a coin is fair and balanced. If the null hypothesis says ‘the probability of the coin flip showing heads is equal to the probability of it showing tails,’ then the alternate hypothesis would say ‘the probability of the coin flip showing heads is not equal to the probability of it showing tails.’
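To make the coin example concrete, here is a minimal sketch of an exact two-sided binomial test using only the Python standard library. The observed count (60 heads in 100 flips) and the function name binomial_two_sided_p are made up for illustration; they are not taken from any real study.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value for H0: P(heads) = p_null.

    Adds up the probabilities, under H0, of every outcome that is
    no more likely than the outcome actually observed.
    """
    probs = [
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(flips + 1)
    ]
    observed = probs[heads]
    # The small tolerance guards against floating-point ties.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical data: 60 heads in 100 flips.
p_value = binomial_two_sided_p(heads=60, flips=100)
print(f"p-value = {p_value:.4f}")
print("Reject H0 at the 5% level" if p_value < 0.05 else "Do not reject H0 at the 5% level")
```

With these made-up numbers the p-value comes out slightly above 0.05, so a coin showing 60 heads in 100 flips would not quite be declared unfair at the 5% level.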
In other words, when we pit the null hypothesis against the alternative hypothesis, the null hypothesis says ‘this is the interpretation of the study’ and the alternative hypothesis says ‘this is not the interpretation of the study.’
This drives us to the understanding that if the sample data deviate significantly from what the null hypothesis predicts, the null hypothesis can be rejected.
The steps of hypothesis testing:
1. Framing the null and alternative hypotheses.
2. Formulating an analysis plan that describes how the sample data will be used to evaluate the null hypothesis against the alternative hypothesis.
3. Analyzing the sample data.
4. Interpreting the results. Relate the value of the test statistic to the hypotheses and conclude.
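The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values, the hypothesized mean of 50, and the known population standard deviation of 4 are all invented, and a two-tailed z-test is chosen purely because it is simple to compute by hand.

```python
from math import sqrt
from statistics import NormalDist, mean

# Step 1: frame the hypotheses (hypothetical values).
#   H0: the population mean equals 50
#   H1: the population mean differs from 50
mu_null = 50.0

# Step 2: analysis plan. Two-tailed z-test at a 5% significance level,
# assuming the population standard deviation is known to be 4.
sigma = 4.0
alpha = 0.05

# Step 3: analyze the sample data (made-up measurements).
sample = [52.1, 48.3, 53.7, 51.2, 49.8, 54.0, 50.9, 52.6, 51.4, 53.0]
n = len(sample)
z = (mean(sample) - mu_null) / (sigma / sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 4: interpret the results.
reject_null = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
print("Reject H0" if reject_null else "Do not reject H0")
```

In practice the population standard deviation is rarely known, in which case a t-test would usually replace the z-test in step 2.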
One-tailed Test: The one-tailed test, also called a directional test, places the critical region on one side of the distribution. If the test statistic falls into this region, the null hypothesis is rejected in favor of the alternate hypothesis.
Because the critical region is one-sided, the test checks whether the test statistic is greater than a specific value or less than a specific value, but not both.
Two-tailed Test: In a two-tailed test, the test statistic is checked against both ends of a range of values, meaning that the critical region is two-sided.
If the test statistic falls into either part of this region, the null hypothesis is rejected in favor of the alternate hypothesis.
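The difference between the two kinds of test is easiest to see in the critical values. The sketch below assumes a test statistic that follows the standard normal distribution and a 5% significance level; the helper in_critical_region is a hypothetical name introduced only for this example.

```python
from statistics import NormalDist

ALPHA = 0.05
std_normal = NormalDist()

# One-tailed: all of alpha sits in a single tail.
one_tailed_cutoff = std_normal.inv_cdf(1 - ALPHA)      # about 1.645
# Two-tailed: alpha is split evenly between the two tails.
two_tailed_cutoff = std_normal.inv_cdf(1 - ALPHA / 2)  # about 1.960

def in_critical_region(z: float, tail: str) -> bool:
    """Return True if the z statistic falls in the critical region."""
    if tail == "right":
        return z > one_tailed_cutoff
    if tail == "left":
        return z < -one_tailed_cutoff
    if tail == "two":
        return abs(z) > two_tailed_cutoff
    raise ValueError("tail must be 'right', 'left' or 'two'")

z = 1.8  # hypothetical test statistic
print(in_critical_region(z, "right"))  # True: 1.8 exceeds the one-tailed cutoff
print(in_critical_region(z, "two"))    # False: 1.8 is below the two-tailed cutoff
```

The same statistic can therefore lead to rejection in a one-tailed test and not in a two-tailed test, which is why the direction of the test should be fixed before looking at the data.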
Two types of errors can result from a hypothesis test.
Type-I Error: A Type-I error occurs when the sample results lead to the rejection of the null hypothesis even though it is true.
Type-II Error: In contrast to a Type-I error, a Type-II error occurs when the null hypothesis is not rejected even though it is false.
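Both kinds of error can be observed in a small simulation. The sketch below is illustrative only: it assumes normally distributed data with a known standard deviation of 1, samples of size 25, a two-tailed z-test at the 5% level, and an arbitrary true mean of 0.5 for the case where the null hypothesis is false.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-tailed critical value

def rejects_null(true_mean, rng, n=25, sigma=1.0, mu_null=0.0):
    """Draw one sample and report whether the z-test rejects H0: mean = mu_null."""
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    z = (mean(sample) - mu_null) / (sigma / sqrt(n))
    return abs(z) > Z_CRIT

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 4000

# H0 is true (the true mean is 0): every rejection is a Type-I error.
type_1_rate = sum(rejects_null(0.0, rng) for _ in range(trials)) / trials

# H0 is false (the true mean is 0.5): every failure to reject is a Type-II error.
type_2_rate = sum(not rejects_null(0.5, rng) for _ in range(trials)) / trials

print(f"Type-I error rate:  {type_1_rate:.3f}")
print(f"Type-II error rate: {type_2_rate:.3f}")
```

The Type-I error rate should land close to the chosen significance level, while the Type-II error rate depends on how far the true mean is from the hypothesized one and on the sample size.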
Other terminology vital to our understanding of hypothesis testing:
Parameter and Statistic:
A parameter is a fixed characteristic that describes the entire population under study, such as the population mean.
In real-world scenarios, it is a complicated and tedious process to analyze or even obtain data regarding large sets of individual elements of study; hence samples or smaller groups are singled out.
A statistic is a summary of this more manageable sample, such as the sample mean, and serves as an estimate of the parameter.
Sampling Distribution: The distribution of a particular statistic obtained by drawing many samples of the same size from the same population.
Standard Error: The standard error is the standard deviation of the sampling distribution. It tells us how much a sample statistic typically deviates from the population parameter it estimates.
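A short simulation ties these terms together. It is a sketch under assumptions: the population is taken to be normal with a mean of 100 and a standard deviation of 15 (both hypothetical), and the statistic of interest is the sample mean.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable

# Parameters: fixed characteristics of the (hypothetical) population.
POP_MEAN, POP_SD = 100.0, 15.0
SAMPLE_SIZE = 36
NUM_SAMPLES = 2000

# Sampling distribution: the sample mean computed over many samples.
sample_means = [
    mean([rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

# Standard error: the spread of the sampling distribution.
empirical_se = stdev(sample_means)
theoretical_se = POP_SD / sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

print(f"Empirical standard error:   {empirical_se:.3f}")
print(f"Theoretical standard error: {theoretical_se:.3f}")
```

The two values should be close to each other, and both shrink as the sample size grows.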
Decision rules: Rules that specify when the null hypothesis is rejected and when it is not. They form an essential component of the analysis plan.
P-value: The P-value measures the strength of the evidence against the null hypothesis. It is the probability, assuming the null hypothesis is true, of obtaining a test result equal to or more extreme than the one actually observed. The smaller the P-value, the stronger the evidence against the null hypothesis.
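For a test statistic that follows the standard normal distribution, the P-value can be computed directly from the definition. The function name p_value_from_z below is a hypothetical helper, and the normality of the statistic is an assumption of this sketch.

```python
from statistics import NormalDist

def p_value_from_z(z: float, tail: str = "two") -> float:
    """P-value of a z statistic, assuming it is standard normal under H0."""
    cdf = NormalDist().cdf
    if tail == "right":   # results at least as large as z
        return 1 - cdf(z)
    if tail == "left":    # results at least as small as z
        return cdf(z)
    if tail == "two":     # results at least as far from zero as z
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'right', 'left' or 'two'")

print(f"{p_value_from_z(1.96):.3f}")           # two-tailed
print(f"{p_value_from_z(1.96, 'right'):.3f}")  # one-tailed
```

A typical decision rule then compares the P-value with the significance level chosen in advance and rejects the null hypothesis when the P-value is smaller.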
Region of acceptance:
The region of acceptance is a range of values such that the null hypothesis is not rejected if the test statistic falls within it.
The set of values outside this region makes up the rejection region; if the test statistic falls within it, the null hypothesis is rejected.
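The region of acceptance can also be expressed on the scale of the sample mean itself. The sketch below assumes a two-tailed z-test with a known population standard deviation; the hypothesized mean of 50, the standard deviation of 4, the sample size of 10, and the observed mean of 51.7 are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def acceptance_region(mu_null, sigma, n, alpha=0.05):
    """Range of sample means for which H0: mean = mu_null is not rejected."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sigma / sqrt(n)
    return mu_null - margin, mu_null + margin

low, high = acceptance_region(mu_null=50.0, sigma=4.0, n=10)
observed_mean = 51.7  # hypothetical sample mean

print(f"Region of acceptance: [{low:.2f}, {high:.2f}]")
if low <= observed_mean <= high:
    print("Do not reject H0")
else:
    print("Reject H0")
```

Any sample mean outside the printed interval would fall in the rejection region.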
We hope that, having read this article, you have much better clarity regarding hypothesis testing, one of the most elusive but distinct concepts in the field of Data Science.
If you are interested in making a career in the Data Science domain, our 11-month in-person PG in Data Science course can help you immensely in becoming a successful Data Science professional.