Train Test Split For Evaluating Machine Learning Algorithms: An Important Guide (2021)

Introduction to Train Test Split

Machine learning models are usually needed to make predictions on data that were not used to train them, and the train test split procedure estimates how well a model performs in that situation. While it is fast and simple, it is best suited to large data sets. When the data set is small, or when additional configuration is needed, such as when it is used for classification and the data set is not balanced, a plain train test split can give an unreliable estimate.

1. Train Test Split Evaluation

The train test split technique can be used for classification and regression problems to test machine learning algorithms. The procedure takes the given dataset and splits it into two subsets: 

  • Training dataset: used to train the algorithm and fit the machine learning model.
  • Test dataset: the fitted model makes predictions from its input elements, and the predictions are compared with the expected values.

The model is first fit on the available data with known inputs and outputs. It is then run to make predictions on the held-out subset, and those predictions are compared with the known outputs. The result estimates how the model will perform on future data sets where the expected output values are not available.
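The procedure above can be sketched by hand with NumPy. This is only an illustration of the idea, not how any particular library implements it; the toy arrays, the 30 per cent hold-out and the seed are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical toy data: 10 rows, 2 input columns, 1 output value per row
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle the row indices so the split is random but reproducible
rng = np.random.default_rng(seed=1)
indices = rng.permutation(len(X))

# Hold out 30 per cent of the shuffled rows as the test set
n_test = int(round(0.3 * len(X)))
train_idx, test_idx = indices[:-n_test], indices[-n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (7, 2) (3, 2) (7,) (3,)
```

Every row ends up in exactly one of the two subsets, so the model never sees the test rows during training.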

Train Test Split Procedure in Scikit-Learn

You can use the scikit-learn machine learning library in Python to implement the train test split evaluation procedure with the train_test_split() function. The function takes the loaded dataset as input and returns it split into two subsets:

# split into train test sets

train, test = train_test_split(dataset, …)

Ideally, split the original data set into input (X) and output (y) columns first, then pass both arrays to the function so that they are split consistently into train and test subsets.

# split into train test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, …) 

The size of the split is set with the “test_size” argument. It accepts either a number of rows (an integer) or a fraction of the data set (a float between 0 and 1). Fractions are more common: a value of 0.33 assigns 33 per cent of the data set to the test set and the remaining 67 per cent to the training set.

# split into train test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
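To make the difference between the two forms of “test_size” concrete, here is a small sketch. The toy arrays and the sizes 0.25 and 25 are assumptions picked so that both calls hold out the same number of rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with 2 input columns
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Float: a fraction of the rows goes to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape)
# (75, 2) (25, 2)

# Integer: an absolute number of rows goes to the test set
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X, y, test_size=25)
print(X_train_n.shape, X_test_n.shape)
# (75, 2) (25, 2)
```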

For example, consider this data set with 1000 examples: 

# split a dataset into train and test sets

from sklearn.datasets import make_blobs

from sklearn.model_selection import train_test_split

# create dataset

X, y = make_blobs(n_samples=1000)

# split into train test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) 

Running the example splits the data set and prints the shapes of the resulting arrays. As can be seen, 670 samples (0.67) go to the training set and 330 samples (0.33) go to the test set.

(670, 2) (330, 2) (670,) (330,)

The “train_size” argument can also be used to split the data set as a number of rows (integer) or a percentage of the original set ranging from 0 to 1, such as 0.67 for 67 per cent. 

# split into train test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67) 
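The introduction mentions that classification on a data set that is not balanced needs additional configuration. One option in scikit-learn, which this article does not otherwise cover and which is shown here only as a sketch, is the “stratify” argument of train_test_split(). The imbalanced toy data below is an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 rows of class 0 and 10 rows of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# stratify=y asks the function to keep the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1, stratify=y
)

# Count the rows of each class in the train and test subsets
print(np.bincount(y_train), np.bincount(y_test))
# [45  5] [45  5]
```

Without stratification, a random split of such data could leave very few rows of the minority class in one of the subsets.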

Train Test Split to Evaluate Machine Learning Models

This section shows an example of how the train test split can be used to evaluate machine learning models on a standard classification problem.

Here a random forest algorithm is applied to the sonar data set to demonstrate the train test split. The sonar dataset consists of 208 rows of data with 60 numerical input variables. The target variable has two classes, so this is a binary classification problem.

The task is to predict whether a sonar return indicates a rock or a simulated mine. The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset

from pandas import read_csv

# load dataset

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'

dataframe = read_csv(url, header=None)

# split into input and output elements

data = dataframe.values

X, y = data[:, :-1], data[:, -1]

print(X.shape, y.shape)

Running the example loads the dataset and splits it into input and output elements. As expected, there are 208 rows of data with 60 input variables.

(208, 60) (208,)

Now to evaluate a model using the train test split, the loaded data should first be split into input and output.

# split into inputs and outputs

X, y = data[:, :-1], data[:, -1]

print(X.shape, y.shape)

Next, split the data set so that 67 per cent is used to train the model and 33 per cent is used to evaluate it.

from sklearn.model_selection import train_test_split

# split into train test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) 
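The call above passes “random_state=1”, which fixes the seed of the random shuffle so that the same rows land in each subset on every run. The sketch below illustrates this on hypothetical toy arrays; the array sizes are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows with 2 input columns
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two calls with the same seed return identical subsets
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.33, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.33, random_state=1)

print(np.array_equal(X_test_a, X_test_b))
# True
```

A fixed seed makes results comparable between runs, which helps when comparing models on the same split.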

Next, define the model and fit it on the training data set.

from sklearn.ensemble import RandomForestClassifier

# fit the model

model = RandomForestClassifier(random_state=1)

model.fit(X_train, y_train)

The fit model is then used to make predictions on the test set, and the predictions are evaluated using the classification accuracy metric.

from sklearn.metrics import accuracy_score

# make predictions

yhat = model.predict(X_test)

# evaluate predictions

acc = accuracy_score(y_test, yhat)

print('Accuracy: %.3f' % acc)

Finally, the model is evaluated: its accuracy when making predictions on new data is about 78.3 per cent.

(208, 60) (208,)

(139, 60) (69, 60) (139,) (69,)

Accuracy: 0.783
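The steps above can be gathered into one script. To keep the sketch runnable without a download, it uses a synthetic stand-in generated with make_classification() that has the same shape as the sonar data (208 rows, 60 inputs, 2 classes). That substitution is an assumption made for illustration, so the accuracy it prints will differ from the 0.783 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: a synthetic stand-in with the same shape as the sonar data
# (208 rows, 60 numerical inputs, 2 classes); it is NOT the sonar dataset
X, y = make_classification(n_samples=208, n_features=60, n_classes=2, random_state=1)

# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# fit the model on the training set only
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

# make predictions on the held-out test set and evaluate them
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
```

To reproduce the article's result, replace the make_classification() call with the read_csv() loading code shown earlier.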

Conclusion

The train test split method is a good evaluation procedure for machine learning algorithms when dealing with large data sets. It is also appropriate when models are costly to train or when an estimate of model performance is needed quickly. This article also laid out how to perform the procedure with the scikit-learn machine learning library and how to use the train test split method to evaluate machine learning algorithms for classification.

