We all love TED Talks, don't we? Recently I came across a dataset on Kaggle (https://www.kaggle.com/rounakbanik/ted-talks) that gave me an opportunity to put some NLP and text-mining techniques to the test. If you look at the data, you will realise that there are two files: one contains metadata about each TED Talk, such as the speaker's name, the title, the length of the talk, a list of similar talks and so on; the other contains the complete transcripts of the talks. This got me thinking:
Given the transcripts of many TED Talks, can I come up with a way to recommend talks based on their similarity, just as the official TED page does?
The data came in as a tabular flat file; the transcript for each talk was stored as a row, in a column named transcript. Here is what the file looked like:
import pandas as pd

transcripts = pd.read_csv("E:\\Kaggle\\ted-data\\transcripts.csv")
transcripts.head()
After examining the data, I figured out that I could easily extract the title of each talk from its url. My eventual goal was to use the text in the transcript column to create a measure of similarity, and then recommend the four most similar titles to a given talk. Separating the title from the url was a simple string split operation, as shown below:
transcripts['title'] = transcripts['url'].map(lambda x: x.split("/")[-1])
transcripts.head()
At this point, I was ready to begin piecing together the components that would help me build a talk recommender. In order to achieve this I had to:

1. Create a vector representation of each transcript.
2. Compute a measure of similarity between these vector representations.
3. For each talk, recommend the four most similar talks based on that measure.
Since our final goal is to recommend talks based on the similarity of their content, the first thing we have to do is create a representation of the transcripts that is amenable to comparison. One way of doing this is to create a Tf-Idf vector for each transcript. But what is this Tf-Idf business anyway? Let's discuss that first.
To represent text, we will think of each transcript as one "Document" and the set of all documents as a "Corpus". Then we will create a vector representing the count of words that occur in each document, something like this:

Document 1: "This is sentence one"
Document 2: "This is sentence two"
Document 3: "This is sentence three"

              This  is  sentence  one  two  three
Document 1:     1    1      1      1    0     0
Document 2:     1    1      1      0    1     0
Document 3:     1    1      1      0    0     1
As you can see, for each document we have created a vector of the count of times each word occurs. So the vector [1, 1, 1, 1, 0, 0] represents the counts of the words "This", "is", "sentence", "one", "two", "three" in document 1. This is known as a count matrix. There is one issue with such a representation of text, though: it doesn't take into account the importance of words in a document. For example, the word "one" occurs only once in document 1 but is missing in the other documents, so from the point of view of its importance, "one" is an important word for document 1, as it characterises it. But if we look at the count vector for document 1, we can see that "one" gets a weight of 1, as do words like "This", "is" etc. This issue of word importance can be handled using what is known as Tf-Idf.
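To make the idea concrete, here is a minimal sketch of how such a count matrix could be built with scikit-learn's CountVectorizer (the three toy sentences are my own illustration, not part of the TED data; note that CountVectorizer lowercases tokens and sorts the vocabulary alphabetically, so the column order differs from the table above):

from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents, used only to illustrate the count matrix
docs = ["This is sentence one",
        "This is sentence two",
        "This is sentence three"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

print(cv.get_feature_names_out())  # the vocabulary, in alphabetical order
print(counts.toarray())            # one row of word counts per document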
In order to understand how Tf-Idf helps in identifying the importance of words, let's do a thought experiment and ask ourselves: what determines whether a word is important?
A word is important in a document if it occurs a lot in that document, but rarely in other documents in the corpus. Term Frequency (Tf) measures how often a word appears in a given document, while Inverse Document Frequency (Idf) measures how rare the word is across the corpus. The product of these two quantities measures the importance of the word and is known as Tf-Idf. Creating a Tf-Idf representation is fairly straightforward if you are working with a machine learning framework such as scikit-learn:
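For reference, for a term t in a document d, with N documents in the corpus and df(t) the number of documents containing t, the textbook formulation is:

tfidf(t, d) = tf(t, d) * idf(t),  where  idf(t) = log(N / df(t))

(Worth noting: scikit-learn's TfidfVectorizer uses a smoothed variant by default, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and then L2-normalizes each document vector.)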
from sklearn.feature_extraction import text

# Build a Tf-Idf matrix from the transcripts, dropping common English stop words
Text = transcripts['transcript'].tolist()
tfidf = text.TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(Text)
# print(matrix.shape)  # one row per transcript, one column per vocabulary term
So once we have sorted out the issue of representing word vectors while taking into account the importance of words, we are all set to tackle the next issue: how do we find out which documents (in our case, TED Talk transcripts) are similar to a given document?
To find similar documents among different documents, we need to compute a measure of similarity. Usually, when dealing with Tf-Idf vectors, we use cosine similarity. Think of cosine similarity as measuring how close one Tf-Idf vector is to another. Now, if you remember from the previous discussion, we were able to represent each transcript as a vector, so cosine similarity becomes a means for us to find out how similar the transcript of one TED Talk is to another. So, essentially, I created a cosine similarity matrix from the Tf-Idf vectors to represent how similar each document was to every other, schematically something like:

              Doc 1   Doc 2   Doc 3
    Doc 1      1.0    0.xx    0.xx
    Doc 2     0.xx     1.0    0.xx
    Doc 3     0.xx    0.xx     1.0
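For the curious, the cosine similarity of two vectors is their dot product divided by the product of their norms; a minimal sketch of the computation (my own illustration, not from the original post):

import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||); 1.0 means the vectors point the same way
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

One nice property here: since TfidfVectorizer L2-normalizes its output by default, the cosine similarity of two Tf-Idf rows reduces to a simple dot product.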
Again, using sklearn, doing this was very straightforward:
### Get similarity scores using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

sim_unigram = cosine_similarity(matrix)
All I had to do now was, for each transcript, find the four most similar ones based on cosine similarity. Algorithmically, this amounts to finding, for each row of the cosine similarity matrix constructed above, the indices of the columns most similar to the document (transcript, in our case) corresponding to that row, excluding the document itself, which is trivially the most similar to itself. This was accomplished using a few lines of code:
def get_similar_articles(x):
    # argsort() sorts ascending, so the last index is the document itself
    # (similarity 1.0); the slice [-5:-1] picks the four next-most-similar talks
    return ",".join(transcripts['title'].loc[x.argsort()[-5:-1]])

transcripts['similar_articles_unigram'] = [get_similar_articles(x) for x in sim_unigram]
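To see why the [-5:-1] slice works, here is a toy example (the similarity values are made up for illustration):

import numpy as np

# Similarities of one document to five documents, where 1.0 is the document itself
sims = np.array([0.2, 0.9, 0.5, 1.0, 0.7])
print(sims.argsort())         # [0 2 4 1 3] -- indices in ascending order of similarity
print(sims.argsort()[-5:-1])  # [0 2 4 1]   -- the four others, excluding the document itself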
Let's check how we fared by examining the recommendations. Let's pick any TED Talk title from the list; say we pick:
transcripts['title'].str.replace("_"," ").str.upper().str.strip()[1]
'AL GORE ON AVERTING CLIMATE CRISIS'
Then, based on our analysis, the four most similar titles are:
transcripts['similar_articles_unigram'].str.replace("_"," ").str.upper().str.strip().str.split("\n")[1]
['RORY BREMNER S ONE MAN WORLD SUMMIT', ',ALICE BOWS LARKIN WE RE TOO LATE TO PREVENT CLIMATE CHANGE HERE S HOW WE ADAPT', ',TED HALSTEAD A CLIMATE SOLUTION WHERE ALL SIDES CAN WIN', ',AL GORE S NEW THINKING ON THE CLIMATE CRISIS']
You can clearly see that, by using Tf-Idf vectors to compare the transcripts of the talks, we were able to pick up talks on similar themes. You can find the complete code HERE.