Machine learning algorithms are used to make predictions on data that was not used to train the model, and the train test procedure estimates how well they will perform in that situation. It is fast and simple, but it is best suited to large data sets. It is not appropriate when the data set is small, and it needs additional configuration in some cases, such as classification on a data set that is not balanced.
The train test split technique can be used to evaluate machine learning algorithms on both classification and regression problems. The procedure takes the given dataset and splits it into two subsets: a training set, which is used to fit the model, and a test set, which is used to evaluate it.
The model is first fit on the training set, whose inputs and outputs are both known. It is then used to make predictions on the inputs of the test set, and those predictions are compared with the known outputs. The result estimates how the model will perform on future data for which the expected output values are not available.
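The procedure described above can be sketched by hand before turning to a library. This is a minimal illustration using NumPy only; the toy arrays, the seed, and the 30 per cent test share are assumptions chosen for the example, not values from this article.

```python
import numpy as np

# Toy data (hypothetical): 10 rows, 2 input columns, one known output per row.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle the row indices so the split is random; the seed is an arbitrary choice.
rng = np.random.default_rng(1)
indices = rng.permutation(len(X))

# Reserve 3 of the 10 shuffled rows (30 per cent) for testing.
n_test = 3
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Index inputs and outputs with the same row indices so they stay aligned.
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Every row ends up in exactly one of the two subsets, which is what allows the test set to stand in for unseen data.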
Train Test Split Procedure in Scikit-Learn
You can implement the train test split evaluation procedure in Python with the train_test_split() function from the scikit-learn machine learning library. The function takes a loaded dataset as input and returns it split into two subsets:
…
# split into train test sets
train, test = train_test_split(dataset, …)
It is good practice to first separate the original data set into input (X) and output (y) columns. Pass both arrays to the function and it splits them consistently into train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(X, y, …)
The size of the split is set with the “test_size” argument. It takes either a number of rows (integer) or a fraction of the dataset between 0 and 1 (float). The fraction is more common: with a value of 0.33, 33 per cent of the data set is assigned to the test set and the remaining 67 per cent goes to the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
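The difference between the float and integer forms of test_size can be seen side by side in a short sketch. The 100-row toy arrays and the values 0.25 and 30 are assumptions picked so the arithmetic is easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows with 2 input columns.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Float: test_size is read as a fraction of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(len(X_train), len(X_test))  # 75 25

# Integer: test_size is read as an absolute number of rows.
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X, y, test_size=30)
print(len(X_train_n), len(X_test_n))  # 70 30
```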
For example, consider this data set with 1000 examples:
# split a dataset into train and test sets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
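X, y = make_blobs(n_samples=1000)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# summarize the shapes of the subsets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)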
Running the example splits the data set and prints the shapes of the resulting subsets. As can be seen, 670 samples (67 per cent) are assigned to the training set and 330 samples (33 per cent) to the test set.
(670, 2) (330, 2) (670,) (330,)
Alternatively, the “train_size” argument sets the size of the training set, either as a number of rows (integer) or as a fraction of the original set between 0 and 1, such as 0.67 for 67 per cent.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)
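The two size arguments interact in a way that is worth checking once. The sketch below, on hypothetical 100-row toy arrays, shows that when only train_size is given the test set receives the remaining rows, and that when both are given and sum to less than 1 the leftover rows are simply not used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows with 2 input columns.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Only train_size given: the test set is the complement.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)
print(len(X_train), len(X_test))  # 75 25

# Both given, summing to 0.75: the remaining 25 rows go to neither subset.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, test_size=0.25)
print(len(X_tr), len(X_te))  # 50 25
```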
Train Test Split to Evaluate Machine Learning Models
This section shows how the train test split can be used to evaluate a machine learning model on a standard classification data set.
Here, a random forest algorithm is evaluated on the sonar data set to demonstrate the train test split. The sonar dataset consists of 208 rows of data with 60 numerical input variables. The target variable has two classes, making this a binary classification problem.
The task is to predict whether a sonar return indicates a rock or a simulated mine. The example below downloads the dataset and summarizes its shape.
# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
Running the example loads the dataset and splits it into input and output elements. As expected, there are 208 rows of data with 60 input variables.
(208, 60) (208,)
To evaluate a model using the train test split, the loaded data is first split into input and output elements.
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
Then split the data set, using 67 per cent to train the model and 33 per cent to evaluate it.
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
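The call above also passes random_state=1, which fixes the seed used to shuffle the rows before splitting. A small check on hypothetical toy arrays suggests why that matters: two calls with the same seed return identical subsets, so an evaluation can be repeated on exactly the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 rows with 2 input columns.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two independent calls that share the same seed.
first = train_test_split(X, y, test_size=0.25, random_state=1)
second = train_test_split(X, y, test_size=0.25, random_state=1)

# Compare the four returned arrays pairwise.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)  # True
```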
Next, define the model and fit it on the training data set.
# fit the model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
The fitted model is then used to make predictions on the test set, and the predictions are evaluated with the classification accuracy metric.
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
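Classification accuracy is the fraction of predictions that match the known labels. The sketch below computes it by hand and with accuracy_score on four made-up labels; the label values and the predictions are illustrative assumptions, not results from the sonar model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical known labels and predictions for four test rows.
y_true = np.array(["R", "M", "M", "R"])
y_pred = np.array(["R", "M", "R", "R"])

# By hand: the share of positions where prediction and label agree.
manual = np.mean(y_true == y_pred)

# The same quantity from scikit-learn.
acc = accuracy_score(y_true, y_pred)
print('Accuracy: %.3f' % acc)  # Accuracy: 0.750
```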
Running these steps together prints the shape of the dataset, the shapes of the train and test subsets, and the classification accuracy, which here is about 78.3 per cent.
(208, 60) (208,)
(139, 60) (69, 60) (139,) (69,)
Accuracy: 0.783
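The steps in this section can be assembled into one script. The sketch below does so under one loud assumption: instead of downloading the sonar file, it generates a synthetic stand-in of the same shape with make_classification (not used in this article) so that it runs offline. Because the data differ, the accuracy it prints will not match the 0.783 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 208 rows, 60 inputs, 2 classes, like the sonar data.
X, y = make_classification(n_samples=208, n_features=60, random_state=1)
print(X.shape, y.shape)

# Split into train and test sets, 67 per cent and 33 per cent.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Fit the model on the training set only.
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

# Predict on the held-out test set and score the predictions.
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
```

To reproduce the article's own numbers, replace the make_classification line with the read_csv loading code shown earlier in this section.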
The train test split method is a good evaluation procedure for machine learning algorithms when the data set is large. It is also useful when a model is costly to train or when an estimate of model performance is needed quickly. This article has shown how to perform the procedure with the scikit-learn machine learning library and how to use the train test split method to evaluate a machine learning algorithm for classification.