Review: Review Spam Detection

April 29, 2008
Authors: Nitin Jindal and Bing Liu
Year: 2007
Published in: Proceedings of the 16th International World Wide Web Conference (WWW 2007)
Importance: High


It is now a common practice for e-commerce Web sites to enable
their customers to write reviews of products that they have
purchased. Such reviews provide valuable sources of information
on these products. They are used by potential customers to find
opinions of existing users before deciding to purchase a product.
They are also used by product manufacturers to identify problems
of their products and to find competitive intelligence information
about their competitors. Unfortunately, this importance of reviews
also gives good incentive for spam, which contains false positive
or malicious negative opinions. In this paper, we make an attempt
to study review spam and spam detection. To the best of our
knowledge, there is still no reported study on this problem.

My Review

This two-page paper discusses review spam, which is new to the world of web spam. The authors claim that review spam differs from web spam and email spam, so new methods are needed to detect it. Review spam is also hard to detect, even manually.

Why is it hard to detect?

  1. Spam reviews look similar to real reviews.
  2. There is not enough metadata to analyse.

In this paper the authors mainly try to detect duplicate reviews, for which they provide a model based on the shingle method. The other types of review spam are, as the authors say, hard to detect, and the outcome of their work is modest.
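
As a rough illustration of shingle-based duplicate detection, here is a minimal sketch. It is not the authors' implementation: the shingle size `w`, the similarity `threshold`, the use of Jaccard similarity, and all function names are assumptions made for this example.

```python
def shingles(text, w=2):
    """Return the set of w-word shingles (word n-grams) of a review."""
    words = text.lower().split()
    if len(words) < w:
        # Very short text: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_near_duplicates(reviews, w=2, threshold=0.9):
    """Return index pairs (i, j) of reviews whose similarity >= threshold.

    The threshold value is an assumption chosen for illustration.
    """
    sets = [shingles(r, w) for r in reviews]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

This compares every pair of reviews, which is quadratic in the number of reviews. It is fine for a small sample, but a large review collection would need some form of indexing or sketching first.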


Personally, I think that since review spam is still hard to detect manually, we should improve spam prevention methods such as CAPTCHA in order to block review spam (Sreview). So, for the time being, I have no ideas on how to detect review spam after it has been posted.

Good reference

Web Data Mining, a book by Bing Liu


Review: Identifying Link Farm Spam Pages

March 3, 2008
Authors: B. Wu and B. D. Davison
Year: 2005
Published in: Proceedings of the 14th International World Wide Web Conference
Importance: Medium


With the increasing importance of search in guiding today’s web traffic, more and more effort has been spent to create search engine spam. Since link analysis is one of the most important factors in current commercial search engines’ ranking systems, new kinds of spam aiming at links have appeared. Building link farms is one technique that can deteriorate link-based ranking algorithms. In this paper, we present algorithms for detecting these link farms automatically by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it. Links between identified pages are re-weighted, providing a modified web graph to use in ranking page importance. Experimental results show that we can identify most link farm spam pages and the final ranking results are improved for almost all tested queries.

My Review

Link farm is a term that has been used in many papers. In this paper the authors define it as, and I am quoting here: “A network of web sites which are densely connected with each other”.

Main identifying method:

  1. Find the domain names of the incoming links of page X
  2. Find the domain names of the outgoing links of page X
  3. Find the intersection of the incoming and outgoing link domains
  4. If the size of this intersection is greater than a threshold, mark the page as a bad page
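
The seed-set steps above can be sketched as follows. This is only an illustration based on the summary in this review, not the authors' code. The function names, the dictionary-based graph representation, and the use of the URL host as the "domain name" are assumptions.

```python
from urllib.parse import urlparse


def link_domain(url):
    """Host part of a URL, used here as a stand-in for its domain name."""
    return urlparse(url).netloc.lower()


def find_seed_set(in_links, out_links, threshold):
    """Mark pages whose incoming and outgoing link domains overlap heavily.

    in_links / out_links map a page URL to a list of URLs that link to it /
    that it links to. A page is marked bad when the number of domains common
    to both sides is greater than `threshold` (strict comparison, following
    the wording of the steps above).
    """
    bad = set()
    for page in set(in_links) | set(out_links):
        in_domains = {link_domain(u) for u in in_links.get(page, ())}
        out_domains = {link_domain(u) for u in out_links.get(page, ())}
        if len(in_domains & out_domains) > threshold:
            bad.add(page)
    return bad
```

Comparing domains rather than full URLs means that several pages on the same site count only once towards the overlap.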

Expanding method:

  1. For each page not marked as bad by the previous algorithm, find all of its outgoing links
  2. If the number of those outgoing links that point to bad pages is more than a threshold, mark the page itself as a bad page
  3. Repeat steps 1 and 2 until no new page is marked as bad
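
The expansion step can be sketched as a simple fixed-point loop. Again, this is a minimal sketch based on the summary above rather than the authors' implementation. The name `expand_bad_set`, the graph representation, and the choice to count distinct target pages are assumptions.

```python
def expand_bad_set(out_links, seed, threshold):
    """Grow a set of bad pages until no new page qualifies.

    out_links maps a page to the list of pages it links to. `seed` is the
    set of pages already marked as bad. A page joins the set when more than
    `threshold` of its distinct outgoing links point to bad pages.
    """
    bad = set(seed)
    changed = True
    while changed:
        changed = False
        for page, targets in out_links.items():
            if page in bad:
                continue
            bad_targets = sum(1 for t in set(targets) if t in bad)
            if bad_targets > threshold:
                bad.add(page)
                changed = True
    return bad
```

The loop terminates because pages are only ever added to the bad set, never removed, so each pass either marks at least one new page or stops.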

Important Terms

  • TKC effect

Reference Sheet

  • In a study of Yahoo! search, about 68% of all queries had at least one spam page within the top 200 response list from the search engine. About 9% of queries had at least one spam page within the top 10 list.

Useful references

  • R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1-6):387-401, 2000.