Review: Blog Track Open Task: Spam Blog Classification

February 19, 2008
Authors: Pranam Kolari, Tim Finin, Akshay Java, Anupam Joshi, Justin Martineau, and James Mayfield.
Year: 2006
Published in: TREC 2006 Blog Track Notebook
Importance: Very High


Spam blogs or Splogs are blogs with either auto-generated or plagiarized content created for the sole purpose of hosting ads, promoting affiliate sites and getting new pages indexed. Splogs now rival generic web spam and e-mail spam, presenting a major problem to analytics on the blogosphere from basic search and indexing, to opinion, community, influence and correlation detection. This open task submission details how splogs impact Opinion Identification, and proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007.

My Review

In this paper, the authors first make clear that spam blog (splog) detection differs from web spam classification for three reasons:

  1. Search engine coverage: search engine coverage is higher, especially for blogs.
  2. Quicker assessment: blogs are updated sooner than ordinary web pages.
  3. Genre of blog content: current web spam detection tools are not suited to the nature of blog content, which consists of personal opinions, …

The authors mention two models of blog spam detection:

  1. Local models
  2. Link-based models

They emphasize the first model, which detects spam from a single web page alone, without following links or consulting any other data source. The main advantage of this model is quick assessment of spam content.

Techniques used in the local model:

  1. Words
  2. Word N-Grams
  3. Tokenized Anchors
  4. Tokenized URLs
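
To make the four feature types above concrete, here is a minimal sketch of how they could be extracted from a single page. It is not the authors' implementation: the function names, the `w=`/`ng=`/`a=`/`u=` prefixes, and the simple alphanumeric tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: a simple lowercase alphanumeric tokenizer; the paper's own
# tokenization is not described in this review.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def word_ngrams(tokens, n=2):
    """Return contiguous word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tokenize_url(url):
    """Break a URL's host and path into tokens on non-alphanumeric characters."""
    parsed = urlparse(url)
    return tokenize(parsed.netloc + " " + parsed.path)


def local_features(text, anchors, urls, n=2):
    """Count words, word n-grams, anchor tokens and URL tokens for one page."""
    tokens = tokenize(text)
    feats = Counter()
    feats.update("w=" + t for t in tokens)
    feats.update("ng=" + g for g in word_ngrams(tokens, n))
    for anchor in anchors:
        feats.update("a=" + t for t in tokenize(anchor))
    for url in urls:
        feats.update("u=" + t for t in tokenize_url(url))
    return feats
```

The resulting counts could be fed to any standard text classifier. Because everything is computed from the page itself, no link graph or external data source is needed, which is what makes a local model quick to apply.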

Other techniques:

  1. Ping servers discard blogs which ping too frequently
  2. Comment spam model: Akismet
  3. URL/IP blacklist
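
As a rough illustration of the first and third techniques, the sketch below combines a ping rate limit with a host blacklist. The class name `PingFilter`, the thresholds, and the sliding-window policy are hypothetical; the paper (as summarized here) does not specify how ping servers decide that a blog pings too frequently. Comment spam filtering (Akismet) is not covered.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class PingFilter:
    """Reject pings from blacklisted hosts or from blogs that ping too often.

    All thresholds are illustrative assumptions, not values from the paper.
    """

    def __init__(self, max_pings=3, window_seconds=3600, blacklist=()):
        self.max_pings = max_pings
        self.window = window_seconds
        self.blacklist = set(blacklist)
        self.pings = defaultdict(deque)  # blog URL -> recent ping timestamps

    def accept(self, blog_url, timestamp):
        """Return True if the ping at `timestamp` (seconds) should be kept."""
        host = urlparse(blog_url).hostname or ""
        if host in self.blacklist:
            return False
        recent = self.pings[blog_url]
        # Forget pings that have fallen out of the sliding window.
        while recent and timestamp - recent[0] >= self.window:
            recent.popleft()
        recent.append(timestamp)
        return len(recent) <= self.max_pings
```

For example, with `max_pings=2` and `window_seconds=60`, a third ping from the same blog within one minute would be discarded, while a ping arriving after the window has passed would be accepted again.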

Splog categories:

  1. Non-blogs
  2. Keyword-stuffing
  3. Post-stitching
  4. Post-plagiarism
  5. Post-weaving
  6. Link-spam
  7. Other

Suggestions for authors:
How the bag of words is constructed matters for detecting spam content. For this task, we could add high-value advertising keywords to the vocabulary, since many splogs use these keywords to attract traffic. Seasonal keywords, e.g. greeting cards, happy valentine, …, which are popular during specific times of the year, could be added as well.
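
A minimal sketch of this suggestion, assuming two hand-made keyword lists: the entries in `AD_KEYWORDS` and `SEASONAL_KEYWORDS` below are hypothetical placeholders, and a real list would have to be built from advertising and seasonal search data. The function returns the fraction of a post's tokens that hit each list, which could be appended to the bag-of-words features.

```python
import re
from datetime import date

# Hypothetical keyword lists, for illustration only.
AD_KEYWORDS = {"loan", "mortgage", "casino", "insurance"}
SEASONAL_KEYWORDS = {
    2: {"valentine", "greeting"},   # February
    12: {"christmas", "gift"},      # December
}


def keyword_features(text, post_date):
    """Return the share of tokens that are advertising or in-season keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty posts
    seasonal = SEASONAL_KEYWORDS.get(post_date.month, set())
    ad_hits = sum(t in AD_KEYWORDS for t in tokens)
    seasonal_hits = sum(t in seasonal for t in tokens)
    return {
        "ad_ratio": ad_hits / total,
        "seasonal_ratio": seasonal_hits / total,
    }


features = keyword_features(
    "Happy Valentine greeting cards and casino loan", date(2008, 2, 10)
)
```

Seasonal terms only count in their own month here, so the same post dated in December would get a `seasonal_ratio` of zero. That is a design choice of this sketch, not something taken from the paper.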

Important Terms

  • Local models
  • Link-based models
  • Words
  • Word N-Grams
  • Tokenized Anchors
  • Tokenized URLs
  • Non-blogs
  • Keyword-stuffing
  • Post-stitching
  • Post-plagiarism
  • Post-weaving
  • Link-spam

Useful References
Java, A.; Kolari, P.; Finin, T.; Mayfield, J.; Joshi, A.; and Martineau, J. 2006. The UMBC/JHU blogvox system. In Proceedings of the Fifteenth Text Retrieval Conference.