Review: Blog Open Track Task: Spam Blog Classification

Authors: Pranav Kolari, Tim Finin, Akshay Java, Anupam Joshi, Justin Martineau, and James Mayfield.
Year: 2006
Published in: TREC 2006 Blog Track Notebook
Importance: Very High


Spam blogs or Splogs are blogs with either auto-generated or plagiarized content created for the sole purpose of hosting ads, promoting affiliate sites and getting new pages indexed. Splogs now rival generic web spam and e-mail spam, presenting
a major problem to analytics on the blogosphere from basic search and indexing, to opinion, community, influence and correlation detection. This open task submission details
how splogs impact Opinion Identification, and proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007.

My Review

In this paper, the authors first make it clear that spam blog (splog) detection differs from web spam classification for three reasons:

  1. Search engine coverage: search engines cover blogs more extensively.
  2. Quicker assessment: blogs are updated sooner and more often than ordinary web pages, so they have to be assessed quickly.
  3. Genre of blog content: current web spam detection tools are not well suited to blog content, which largely consists of personal opinions, …

The authors mention two models of blog spam detection:

  1. Local models
  2. Link-based models

They emphasize the first model, which detects spam from a single web page alone and does not consult links or other data sources. The main advantage of this model is that spam content can be assessed quickly.

Techniques used in the local model:

  1. Words
  2. Word N-Grams
  3. Tokenized Anchors
  4. Tokenized URLs
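
To make the four local feature types concrete, here is a minimal sketch of how they could be extracted from a single page using only the Python standard library. It is not the authors' implementation: the names `local_features`, `tokenize` and `word_ngrams`, the tokenization rule (lowercase, split on non-alphanumeric characters) and the default of bigrams are all my own assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects page text, anchor text, and href URLs from one page."""

    def __init__(self):
        super().__init__()
        self.text, self.anchors, self.urls = [], [], []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        self.text.append(data)
        if self._in_anchor:
            self.anchors.append(data)


def tokenize(s):
    # Assumed rule: lowercase, then split on any non-alphanumeric character.
    return re.findall(r"[a-z0-9]+", s.lower())


def word_ngrams(tokens, n=2):
    # Consecutive runs of n tokens, joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def local_features(html, n=2):
    """Return the four bags of features computed from the page alone."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    words = tokenize(" ".join(parser.text))
    return {
        "words": Counter(words),
        "ngrams": Counter(word_ngrams(words, n)),
        "anchors": Counter(tokenize(" ".join(parser.anchors))),
        "urls": Counter(t for u in parser.urls for t in tokenize(u)),
    }
```

Each bag could then be fed to any standard text classifier. Nothing here follows links or queries another data source, which is what makes the model local.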

Other techniques:

  1. Ping servers discard blogs which ping too frequently
  2. Comment spam model: Akismet
  3. URL/IP blacklist
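
The first item, discarding blogs that ping too frequently, can be sketched as a sliding-window rate check. This is only an illustration: the class name `PingFilter` and the thresholds (5 pings per hour by default) are my own assumptions, not values from the paper or from any real ping server.

```python
from collections import defaultdict, deque


class PingFilter:
    """Flags blogs that ping more than max_pings times within window seconds."""

    def __init__(self, max_pings=5, window=3600.0):
        # Both thresholds are hypothetical defaults chosen for illustration.
        self.max_pings = max_pings
        self.window = window
        self._pings = defaultdict(deque)

    def accept(self, blog_url, timestamp):
        """Record a ping; return False if the blog exceeded the allowed rate."""
        recent = self._pings[blog_url]
        # Drop pings that have fallen out of the sliding window.
        while recent and timestamp - recent[0] >= self.window:
            recent.popleft()
        recent.append(timestamp)
        return len(recent) <= self.max_pings
```

A ping server could call `accept` on every incoming ping and ignore updates from blogs for which it returns `False`.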

Splog categories:

  1. Non-blogs
  2. Keyword-stuffing
  3. Post-stitching
  4. Post-plagiarism
  5. Post-weaving
  6. Link-spam
  7. Other

Suggestions for authors:
How the bag of words is built matters for detecting spam content. For this task, we could add high-value advertising keywords to the vocabulary, since many splogs use these keywords to attract traffic. Keywords tied to particular times of the year, e.g. greeting cards, happy valentine, …, could also be added during the relevant season.
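
As a rough sketch of this suggestion, the snippet below counts hits against an advertising keyword list and against a seasonal list that is only active in a given month. The keyword sets, the month mapping and the function names are hypothetical placeholders of mine. A real list would have to be built from data.

```python
import re
from datetime import date

# Hypothetical keyword lists, for illustration only.
AD_KEYWORDS = {"mortgage", "casino", "insurance"}
SEASONAL_KEYWORDS = {  # month number -> phrases active in that month
    2: {"valentine", "greeting cards"},
    12: {"christmas gifts"},
}


def count_phrase(tokens, phrase):
    """Count occurrences of a (possibly multi-word) phrase in a token list."""
    words = phrase.split()
    n = len(words)
    return sum(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def keyword_features(text, posted):
    """Return advertising and seasonal keyword counts for one post."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    seasonal = SEASONAL_KEYWORDS.get(posted.month, set())
    return {
        "ad_hits": sum(count_phrase(tokens, p) for p in AD_KEYWORDS),
        "seasonal_hits": sum(count_phrase(tokens, p) for p in seasonal),
        "tokens": len(tokens),
    }
```

These counts could be appended to the bag-of-words features, so that the seasonal terms only contribute around the dates when splogs are likely to exploit them.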

Important Terms

  • Local models
  • Link-based models
  • Words
  • Word N-Grams
  • Tokenized Anchors
  • Tokenized URLs
  • Non-blogs
  • Keyword-stuffing
  • Post-stitching
  • Post-plagiarism
  • Post-weaving
  • Link-spam

Useful References
Java, A.; Kolari, P.; Finin, T.; Mayfield, J.; Joshi, A.; and Martineau, J. 2006. The UMBC/JHU blogvox system. In Proceedings of the Fifteenth Text Retrieval Conference.


4 Responses to Review: Blog Open Track Task: Spam Blog Classification

  1. vidy says:

    Hi Pedram
    This paper looks interesting, and it is very recent as well.
    Read all the references from this paper too; they may be interesting as well.

  2. vidy says:

    Hi Pedram

    Check out this thesis on Email Spam, recently published in 2006

    Email Mining Toolkit (EMT), named as Profiling Email Toolkit, is interesting and useful


  3. vidy says:

    I briefly went through this article and found that “opinion retrieval” seems like an interesting topic for research as well; it may be the next big thing after “information retrieval”


  4. vidy says:

    Hi Pedram
    This paper gives a detailed outline on the current Spamming Techniques

    Taxonomy of Web Spam

