Published in: Proceedings of the international conference on Web search and web data mining Link: http://portal.acm.org/citation.cfm?id=1341560
Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. In the past two years, several startup companies also appeared which aggregate opinions from product reviews. It is thus high time to study spam in reviews. To the best of our knowledge, there is still no published study on this topic, although Web spam and email spam have been investigated extensively. We will see that opinion spam is quite different from Web spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. This paper analyzes such spam activities and presents some novel techniques to detect them.
As the authors claim, spam reviews are currently a unique area and differ from other types of Web spam, as discussed in Review: Review Spam Detection by the same authors. They classify review spam into three types:
- Type 1: Deliberately misleading reviews. Hard to detect.
- Type 2: Reviews on the brand only, not the product.
- Type 3: Non-reviews such as ads, questions and answers, …
Their test environment is Amazon.com, with 5.8 million reviews.
As discussed in the previous post, type 1 reviews are hard to detect. The authors propose a new way to study this problem: they first find which reviews are harmful, where "harmful" means reviews that differ from the other reviews on a product's page.
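One simple way to make "different from the other reviews" concrete is rating deviation. This is my own illustrative sketch, not the authors' exact feature definition: score each review by how far its rating sits from the mean of the remaining ratings on the same product page.

```python
# Sketch: flag an outlier review by how far its rating deviates
# from the mean of the other ratings for the same product.
# The scoring rule here is an illustrative assumption.
from statistics import mean

def deviation_scores(ratings):
    """For each rating, its absolute deviation from the mean of the others."""
    scores = []
    for i, r in enumerate(ratings):
        others = ratings[:i] + ratings[i + 1:]
        scores.append(abs(r - mean(others)))
    return scores

ratings = [5, 5, 4, 5, 1]            # one review contradicts the rest
scores = deviation_scores(ratings)
print(scores.index(max(scores)))     # → 4, the dissenting review
```

Note that a dissenting review is not necessarily spam; it is only a candidate worth examining, which is why this would be one feature among many rather than a detector on its own.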
In their model, 36 features covering reviews, reviewers, and products are proposed. These features are first used to detect type 2 and type 3 (duplicate) reviews; the detected duplicates are then used as a training set for detecting type 1 reviews.
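The labeling trick can be sketched like this: near-duplicate reviews serve as readily identifiable positive examples, found by text similarity. The shingle size and similarity threshold below are my assumptions for illustration, not the paper's exact settings.

```python
# Sketch: label near-duplicate reviews as positive (spam) training
# examples using Jaccard similarity over word bigram shingles.
# Shingle size k=2 and threshold 0.6 are illustrative assumptions.

def shingles(text, k=2):
    """Set of k-word shingles from a review's text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def label_duplicates(reviews, threshold=0.6):
    """Label each review 1 (likely duplicate spam) or 0."""
    sets = [shingles(r) for r in reviews]
    labels = [0] * len(reviews)
    for i in range(len(reviews)):
        for j in range(i + 1, len(reviews)):
            if jaccard(sets[i], sets[j]) >= threshold:
                labels[i] = labels[j] = 1
    return labels

reviews = [
    "great camera excellent battery life highly recommended",
    "great camera excellent battery life highly recommend it",
    "the lens is soft at the edges and autofocus hunts in low light",
]
print(label_duplicates(reviews))  # → [1, 1, 0]
```

With labels obtained this cheaply, a standard classifier (the paper's 36 features would be the inputs) can then be trained without any manual annotation, which is the appeal of the approach.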
Their results, based on AUC evaluation, are 98% for type 2 and 3 spam and 78% for type 1.
- They try to detect good and bad products based on product rating. This assumption may not be suitable, since some of the ratings are themselves spam. Although they mention this drawback, using a data mining method to discover which products are good and which are bad would be recommended.
- They do not mention the cost/benefit of their model.
- Other types of features, not accessible from the user-facing page, might improve the results.
- Lift curve
- AUC – Area under ROC curve
- R – http://www.r-project.org
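Since AUC is the metric the paper reports, here is a minimal pure-Python computation of it using the Mann-Whitney formulation (my own sketch, not the authors' code; in practice a tool like R, linked above, would be used):

```python
def auc(scores, labels):
    """AUC = probability that a random positive example outranks a
    random negative one (Mann-Whitney formulation); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy spam scores: higher score = more likely spam.
# Pairs: (0.9,0.6)=1, (0.9,0.1)=1, (0.4,0.6)=0, (0.4,0.1)=1 → 3/4
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # → 0.75
```

A perfect ranker scores 1.0 and a random one about 0.5, which puts the paper's 98% (types 2/3) and 78% (type 1) figures in context.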