Language comes naturally to us: we learn the meanings of words and their context throughout childhood, and sentences make sense effortlessly. The same is not true for machines, which cannot process written language in its raw form. This is where ideas such as bag of words and TF-IDF come into play in machine learning. The text has to be broken down and converted into a numerical format that machines can process quickly. This is the basic idea behind what is called Natural Language Processing (NLP).
Modelling text poses a complicated problem: it is not straightforward to convert text into something that machine learning algorithms, which expect specific inputs and outputs, can understand. These algorithms cannot work with raw text because it is not in a format that can be fed to them directly.
Machines deal with numbers, and the text must first be converted to numbers, more precisely, vectors of numbers. The vectors obtained from converting textual data can reflect the properties of the text from a linguistic perspective. In Natural Language Processing, this is called feature encoding or feature extraction. The bag of words model is a simple and popular feature extraction method.
The bag of words model, abbreviated BoW, is a way of extracting features from text for use in machine learning models. Due to the simplicity and flexibility of the approach, it can be used in a number of ways to extract features from text documents. The bag of words representation describes how words occur within the analyzed document, and it rests on two things:
1. A known word vocabulary
2. A measure of how many known words are present
The reason it is called a bag of words is that the model does not consider the order or structure of words, or the information they carry; all of that is discarded. The model only deals with whether known words occur in the document, not where in the document they occur.
Here is a bag of words example:
Step 1: Data collection:
Consider each of these three lines of text as a separate document that needs to be vectorized:

1. The dog sat
2. The dog sat in the hat
3. The dog with the hat
Step 2: Determine the vocabulary:
The vocabulary is defined as the set of all the words found in the documents. These are the only words found in the documents above: the, dog, sat, in, hat, and with.
Step 3: Counting:
The vectorization process involves counting the number of times each word appears.
| Document | the | dog | sat | in | hat | with |
|---|---|---|---|---|---|---|
| The dog sat | 1 | 1 | 1 | 0 | 0 | 0 |
| The dog sat in the hat | 2 | 1 | 1 | 1 | 1 | 0 |
| The dog with the hat | 2 | 1 | 0 | 0 | 1 | 1 |
This generates a vector of length 6 for each document.
The dog sat: [1,1,1,0,0,0]
The dog sat in the hat: [2,1,1,1,1,0]
The dog with the hat: [2,1,0,0,1,1]
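The three steps above can be sketched in a few lines of Python. This is a minimal hand-rolled version for illustration (libraries such as scikit-learn provide a ready-made `CountVectorizer` for the same task); the `bow_vector` helper name is our own.

```python
from collections import Counter

# The three example documents.
docs = [
    "The dog sat",
    "The dog sat in the hat",
    "The dog with the hat",
]

# Fixed vocabulary in the same order as the table: the, dog, sat, in, hat, with.
vocab = ["the", "dog", "sat", "in", "hat", "with"]

def bow_vector(doc, vocab):
    """Count each vocabulary word's occurrences in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(doc, "->", bow_vector(doc, vocab))
# The dog sat -> [1, 1, 1, 0, 0, 0]
# The dog sat in the hat -> [2, 1, 1, 1, 1, 0]
# The dog with the hat -> [2, 1, 0, 0, 1, 1]
```

Note that lowercasing is already doing a little cleaning here: without it, "The" and "the" would count as two different words.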
As you can see, the BoW vector only contains information about which words occur and how many times, with no contextual information about where they occur.
As can be seen from the previous example, the vector representation of each document grows with the vocabulary. For very large documents, or collections of thousands of books, the vector length can stretch to thousands or millions of positions. Since any single document contains only a few of the known words, most positions are zero, producing what is called a sparse vector. Modelling sparse vectors is cumbersome with traditional algorithms, so a number of cleaning methods are used to reduce the size of the vocabulary: ignoring case, ignoring punctuation, fixing misspelt words, and ignoring stop words such as “a”, “of”, etc.
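A simple text-cleaning pass along these lines can be sketched as follows. The tiny stop-word list here is purely illustrative; real pipelines use much larger lists (for example from NLTK or spaCy), and misspelling correction is a separate, harder step that is omitted.

```python
import string

# Illustrative stop-word list only; real lists contain hundreds of words.
STOP_WORDS = {"a", "an", "of", "the", "in", "with"}

def normalize(doc):
    # Lowercase, strip punctuation, and drop stop words
    # so that fewer distinct tokens end up in the vocabulary.
    table = str.maketrans("", "", string.punctuation)
    tokens = doc.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS]

print(normalize("The dog sat in the hat!"))
# ['dog', 'sat', 'hat']
```

After cleaning, the example vocabulary shrinks from six words to three, and the vectors shrink with it.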
Scoring the words is simply attaching a numerical value to mark the occurrence of the words. In the example mentioned above, scoring was binary: the presence or absence of words. Other scoring methods include:
Counts: count the number of times each word appears in the document.
Frequencies: calculate how often each word appears relative to the total number of words in the document.
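The three scoring schemes can be compared side by side on one of the example documents. This is a sketch using Python's standard library only:

```python
from collections import Counter

vocab = ["the", "dog", "sat", "in", "hat", "with"]
doc = "The dog sat in the hat"

counts = Counter(doc.lower().split())
total = sum(counts.values())  # 6 tokens in this document

binary    = [1 if counts[w] else 0 for w in vocab]   # presence/absence
count_vec = [counts[w] for w in vocab]               # raw counts
freq      = [counts[w] / total for w in vocab]       # counts / total words

print(binary)     # [1, 1, 1, 1, 1, 0]
print(count_vec)  # [2, 1, 1, 1, 1, 0]
print(freq)       # [0.333..., 0.167, 0.167, 0.167, 0.167, 0.0]
```

TF-IDF, mentioned at the start of this article, extends the frequency scheme by down-weighting words that appear in many documents.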
The bag of words model, while simple, also comes with limitations. Its vocabulary needs to be designed carefully to manage size and sparsity, since sparse representations are harder to model computationally. The model also disregards context and word order, which can carry a great deal of meaning.
The bag of words model is one of the simplest ways of representing text for machine learning algorithms. It mainly focuses on scoring words by their counts against a vocabulary. It can, however, be made as simple or as complex as one wants.