Sentiments expressed in online news, comments, reviews, etc. play a vital role in today’s business. The public is more likely to buy a product when it has positive reviews on e-commerce and social media websites. For instance, people tend to buy a book that has a rating of 4 stars or higher on sites such as Flipkart and Amazon.
However, awareness is spreading about bought and paid-for reviews on the web, and today many customers have started to think, “Is this review a fake one?” Yes, it’s hard to believe but true! Services such as buyazonreviews.com and buyamazonreviews.com will give you a good review; for a price, of course.
There are also services and groups of reviewers or spammers who will write negative reviews, again for a price. Recently, Amazon filed a lawsuit against such websites for writing fake reviews on its site for financial gain.
Data analytics to the rescue again! Yes, today there are data analytics techniques that can help identify fake reviews. A few years ago three researchers, Arjun Mukherjee, Bing Liu, and Natalie Glance, proposed an algorithm in “Spotting Fake Reviewer Groups in Consumer Reviews”, presented at WWW 2012, April 16–20, 2012, in Lyon, France. The algorithm has the following three steps: a) frequent pattern mining, b) identifying a set of feature indicators, and c) ranking based on a relational model. The dataset consisted of reviews of manufactured products extracted from the Amazon website.
Let’s now understand what these steps are:
Step 1: Frequent pattern mining – finding groups of people (candidate groups) who jointly reviewed several products. In frequent itemset mining, the items are the reviewer ids, and each transaction is the set of reviewer ids for one product. The resulting frequent itemsets give us the candidate groups who have reviewed several products together.
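To make the items-and-transactions framing concrete, here is a minimal sketch in Python. It counts co-reviewing groups by brute-force enumeration rather than with an optimized miner such as Apriori or FP-growth, so it only suits small examples. The function name `candidate_groups`, its parameters, and the sample data are all made up for illustration and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def candidate_groups(product_reviewers, min_support=2,
                     min_group_size=2, max_group_size=3):
    """Return {frozenset of reviewer ids: support count}.

    product_reviewers maps a product id to the reviewer ids who reviewed it,
    so each product is one "transaction" and each reviewer id is one "item".
    The support of a group is the number of products all its members reviewed.
    Brute-force enumeration: an illustrative assumption, not the paper's miner.
    """
    counts = Counter()
    for reviewers in product_reviewers.values():
        ids = sorted(set(reviewers))  # de-duplicate reviewers within a product
        for size in range(min_group_size, max_group_size + 1):
            for group in combinations(ids, size):
                counts[frozenset(group)] += 1
    # Keep only groups that co-reviewed at least min_support products.
    return {g: c for g, c in counts.items() if c >= min_support}


# Hypothetical toy data: product id -> reviewer ids.
sample = {
    "p1": ["r1", "r2", "r3"],
    "p2": ["r1", "r2", "r3", "r4"],
    "p3": ["r1", "r2", "r5"],
    "p4": ["r4", "r5"],
}
groups = candidate_groups(sample, min_support=2)
```

In this toy data, reviewers r1 and r2 appear together on three products, so they form the best-supported candidate group. A candidate group is only a starting point: the later steps decide whether it actually behaves like a spam group.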
Step 2: Identifying a set of feature indicators – the following set of spamming behaviour indicators was used.
* Group Time Window
* Group Deviation
* Group Content Similarity
* Group Member Content Similarity
* Group Early Time Frame
* Group Size Ratio
* Group Size
* Group Support Count
* Individual Rating Deviation
* Individual Content Similarity
* Individual Early Time Frame
* Individual Member Coupling in a Group
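To give a feel for what such indicators measure, here is a rough Python sketch of two of them for a single product: a time-window score and a group-size ratio. The formulas are simplified assumptions, not the paper's exact definitions. The function names and the `tau_days` threshold of 28 days are illustrative choices.

```python
from datetime import date


def group_time_window(review_dates, tau_days=28):
    """Score in [0, 1] for one product (simplified, assumed formula).

    1.0 when all group members posted on the same day, decreasing linearly
    as the gap between the first and last review grows, and 0.0 once the
    gap exceeds tau_days. Assumes review_dates is non-empty.
    """
    span = (max(review_dates) - min(review_dates)).days
    if span > tau_days:
        return 0.0
    return 1.0 - span / tau_days


def group_size_ratio(group, product_reviewers):
    """Fraction of a product's reviewers who belong to the group.

    A value near 1.0 means the group wrote nearly all of the product's
    reviews. Assumes product_reviewers is non-empty.
    """
    reviewers = set(product_reviewers)
    return len(set(group) & reviewers) / len(reviewers)


# Hypothetical example: two group members reviewed one product a week apart.
tw = group_time_window([date(2012, 1, 1), date(2012, 1, 8)])
sr = group_size_ratio({"r1", "r2"}, ["r1", "r2", "r3", "r4"])
```

A group that posts within a short time window and accounts for most of a product's reviews scores high on both indicators, which is the kind of behaviour the list above is meant to capture.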
Step 3: Ranking Group Spam – GSRank (based on a relational model)
Input: Weight matrices
Output: Ranked list of candidate spam groups
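GSRank itself is an iterative relational model and is not reproduced here. As a much simpler stand-in that shows what a ranked list of candidate groups looks like, the Python sketch below averages each group's indicator scores and sorts by the result. This naive baseline, the function name `rank_groups`, and the sample scores are assumptions made for illustration only.

```python
def rank_groups(indicator_scores):
    """Rank groups by the plain average of their indicator scores.

    indicator_scores: {group name: {indicator name: score in [0, 1]}}.
    Returns a list of (group name, average score), highest first.
    NOTE: a naive averaging baseline, NOT the GSRank algorithm.
    """
    averaged = {
        group: sum(scores.values()) / len(scores)
        for group, scores in indicator_scores.items()
    }
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical indicator scores for two candidate groups.
ranked = rank_groups({
    "g1": {"time_window": 0.75, "size_ratio": 0.5},
    "g2": {"time_window": 0.25, "size_ratio": 0.25},
})
```

Here g1 ranks above g2 because its average score is higher. A relational model goes further than this by letting group, member, and product scores influence one another.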
The real problem is distinguishing true group spammers from groups who coincidentally happen to review a similar set of products. Another challenge is that manually labelling fake reviews for model building is tedious. In other words, it is not easy to identify an opinion spam review.
Companies these days use sentiment analysis quite regularly to evaluate the sentiments of their customers. So naturally, finding and discounting fake reviews or opinions as spam is critical to credible sentiment analysis. Therefore, any savvy analyst attempting sentiment analysis should be familiar with this technique. Here’s to Frequent Itemset Mining!