Review: Identifying Link Farm Spam Pages

Authors: B. Wu and B. D. Davison
Year: 2005
Published in: Proceedings of the 14th International World Wide Web Conference
Link: http://www.cse.lehigh.edu/~brian/pubs/2005/www/link-farm-spam.pdf
Importance: Medium

Abstract

With the increasing importance of search in guiding today’s web traffic, more and more effort has been spent to create search engine spam. Since link analysis is one of the most important factors in current commercial search engines’ ranking systems, new kinds of spam aiming at links have appeared. Building link farms is one technique that can deteriorate link-based ranking algorithms. In this paper, we present algorithms for detecting these link farms automatically by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it. Links between identified pages are re-weighted, providing a modified web graph to use in ranking page importance. Experimental results show that we can identify most link farm spam pages and the final ranking results are improved for almost all tested queries.

My Review

Link farm is a term that has been used in many papers. In this paper the authors define it as, and I am quoting here: “A network of web sites which are densely connected with each other”.

Main identifying method:

  1. Find the domain names of the incoming links of page X
  2. Find the domain names of the outgoing links of page X
  3. Find the intersection of the incoming and outgoing domain sets
  4. If the size of this intersection is greater than a threshold, mark page X as a bad page
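
To make the steps above concrete, here is a minimal sketch in Python. It is my own illustration, not the authors' code. The input format (two dicts mapping a page URL to the URLs linking in and out), the function names, and the use of the URL host as the "domain name" are all assumptions. The strict "greater than" comparison follows the wording of step 4.

```python
from urllib.parse import urlparse


def domain(url):
    """Return the host part of a URL, lowercased (a simplification of 'domain name')."""
    return urlparse(url).netloc.lower()


def find_seed_set(in_links, out_links, threshold):
    """Mark pages whose incoming and outgoing link domains overlap heavily.

    in_links / out_links: dict mapping a page URL to an iterable of URLs.
    This input format is hypothetical and chosen only for illustration.
    """
    bad = set()
    for page in set(in_links) | set(out_links):
        # Steps 1 and 2: domains of incoming and outgoing links of the page.
        in_domains = {domain(u) for u in in_links.get(page, ())}
        out_domains = {domain(u) for u in out_links.get(page, ())}
        # Step 3: domains that appear on both sides.
        common = in_domains & out_domains
        # Step 4: strict comparison, as worded in the steps above.
        if len(common) > threshold:
            bad.add(page)
    return bad
```

For example, a page with incoming links from a.com, b.com and c.com and outgoing links to a.com, b.com and d.com has two common domains, so it is marked as bad with a threshold of 1 but not with a threshold of 2.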

Expanding method:

  1. For each page not marked as bad by the previous algorithm, find all of its outgoing links
  2. If the number of those outgoing links that point to bad pages is greater than a threshold, mark the page itself as a bad page
  3. Repeat steps 1 and 2 until no new page is marked as bad
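
The expansion loop can be sketched the same way. Again this is only my reading of the steps above, not the authors' implementation. The function name, the dict-of-lists graph format, and counting each distinct bad target once are assumptions.

```python
def expand_bad_set(out_links, seed, threshold):
    """Grow a set of bad pages until no new page qualifies.

    out_links: dict mapping a page to an iterable of the pages it links to
    (hypothetical format). seed: pages already marked as bad.
    """
    bad = set(seed)
    changed = True
    while changed:  # Step 3: repeat until a full pass marks nothing new.
        changed = False
        for page, targets in out_links.items():
            if page in bad:
                continue  # Step 1: only look at pages not yet marked.
            # Step 2: count distinct outgoing links that point to bad pages.
            bad_targets = sum(1 for t in set(targets) if t in bad)
            if bad_targets > threshold:
                bad.add(page)
                changed = True
    return bad
```

Because a newly marked page can push other pages over the threshold, the loop keeps running until a full pass adds nothing. With a seed of {s1, s2} and a threshold of 1, a page p linking to both s1 and s2 is marked first, and then a page q linking to p and s1 is marked as well.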

Important Terms

  • TKC (Tightly-Knit Community) effect

Reference Sheet

  • In a study of Yahoo! search, about 68% of all queries had at least one spam page within the top 200 results returned by the search engine. About 9% of queries had at least one spam page within the top 10.

Useful references

  • R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1-6):387-401, 2000.

2 Responses to Review: Identifying Link Farm Spam Pages

  1. vidy says:

    Hi Pedram

    This paper may be of interest to you – “Review Spam Detection”

    Link http://www.www2007.org/htmlposters/poster930/

    By Nitin Jindal and Bing Liu
    Department of Computer Science
    University of Illinois at Chicago
    851 South Morgan Street
    Chicago, IL 60607-7053
    nitin.jindal@gmail.com, liub@cs.uic.edu

    Regards

    Vidy

  2. vidy says:

    Hi Pedram

    These papers may be of interest to you

    “Adversarial Information Retrieval: The Manipulation of Web Content”
    http://www.reviews.com/hottopic/hottopic_essay_06.cfm

    “Google Patent on Web Spam, Doorway Pages, and Manipulative Articles”
    http://www.seobythesea.com/?p=922

    Blog article about Google 🙂
    http://www.threadwatch.org/node/13925

    Good Bibliography
    http://socialmedia.scribblewiki.com/Annotated_Bibliography

    Spam Patent
    http://forbes.bitpipe.com/detail/PROD/1082029566_946.html&src=forbes.bitpipe.com

    Regards

    Vidy
