Review: Characterizing the Splogosphere

March 2, 2008
Authors: Pranam Kolari, Akshay Java, and Tim Finin
Year: 2006
Published in: Proceedings of the 3rd Annual Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference
Link: http://ebiquity.umbc.edu/paper/html/id/299/Characterizing-the-Splogosphere
Importance: Very High

Abstract

Weblogs or blogs collectively constitute the Blogosphere, forming an influential and interesting subset on the Web. As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting ads or raising the PageRank of target sites. These splogs make up the splogosphere, and are now inundating blog search engines and update ping servers. In this work we characterize splogs by comparing them against authentic blogs. Our analysis is based on a dataset made publicly available by BlogPulse, and employs a machine learning model that detects splogs with an accuracy of 90%. To round off this analysis and to better understand splogs, we also present our study of a popular blog update ping server, and show how they are overwhelmed by pings sent by splogs. This overall study will facilitate finding effective new techniques to detect and weed out splogs from the blogosphere.

My Review

The authors first define the Splogosphere in contrast to the Blogosphere. They then present their method for detecting splogs (spam blogs). The method rests on two observations, both of which require analyzing the whole blog content:

  1. Distinguishing spam blogs from authentic blogs through a word list.
  2. Detecting spam blogs through their link structure and any link farms they belong to.
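
To make the first observation concrete, here is a minimal sketch of word-list scoring. It is not the authors' model (the paper uses a trained machine learning classifier, and its word list is not published); the words in `SPAMMY_WORDS`, the function names, and the 0.2 threshold are all assumptions chosen only for illustration.

```python
import re

# Hypothetical word list for illustration only; the paper's list is not published.
SPAMMY_WORDS = {"casino", "loans", "pills", "mortgage", "cheap"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def spam_word_ratio(text, spammy_words=SPAMMY_WORDS):
    """Return the fraction of tokens that appear in the word list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in spammy_words)
    return hits / len(tokens)


def looks_like_splog(text, threshold=0.2):
    """Flag a blog when the ratio reaches an (assumed) threshold."""
    return spam_word_ratio(text) >= threshold
```

Note that the score needs the full text of the blog as input, which is exactly the cost discussed in this review.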

Although the authors claim that this approach is 97% accurate at detecting spam blogs, one drawback is that all of a blog's posts and links must be analyzed to tell splogs from blogs, which is a time-consuming task given how rapidly blogs and posts are growing.

Another drawback is that the authors do not publish their splog word list, which is a weakness of their spam detection model.

The authors then turn to ping servers. They divide spam pings (spings) into two categories: pings that come from non-blogs and pings that come from splogs. In this section the authors distinguish splogs from blogs by studying their pinging behavior by hour and time of day. They also believe that many “.info” domains are sploggy, since web search engines rank tokens in the URL higher. In addition, splogs do not re-ping the same URL.
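
The ping-level signals described above (hour of day, “.info” domains, no re-pinging) could be collected along the following lines. This is only a sketch under assumptions: the input format (a list of `(url, datetime)` pairs), the function name `summarize_pings`, and the output fields are hypothetical and do not come from the paper.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse


def summarize_pings(pings):
    """Collect simple per-URL signals from (url, datetime) ping records.

    The record format and the chosen signals are assumptions for illustration.
    """
    per_url = {}
    for url, timestamp in pings:
        stats = per_url.setdefault(url, {"count": 0, "hours": Counter()})
        stats["count"] += 1
        stats["hours"][timestamp.hour] += 1

    summary = {}
    for url, stats in per_url.items():
        host = urlparse(url).hostname or ""
        summary[url] = {
            "ping_count": stats["count"],
            # The review notes that splogs do not re-ping the same URL.
            "pinged_once": stats["count"] == 1,
            # Many ".info" domains are reported as sploggy.
            "info_domain": host.endswith(".info"),
            # Number of ping events per hour of day (0-23).
            "hours": dict(stats["hours"]),
        }
    return summary
```

A usage example with made-up URLs:

```python
pings = [
    ("http://cheap-pills.info/post1", datetime(2006, 5, 1, 3, 15)),
    ("http://example.com/blog", datetime(2006, 5, 1, 9, 0)),
    ("http://example.com/blog", datetime(2006, 5, 1, 18, 30)),
]
signals = summarize_pings(pings)
```

As noted in this review, such signals only become meaningful after pings have been observed over a period of time.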

So three approaches are mentioned for detecting splogs:

Step                       | Advantage            | Disadvantage
1. At update ping servers  | Fast                 | Needs a period of time to study blog behavior
2. Before indexing content | Accurate and offline | Needs the whole content of the blog; the suggested word list is vague
3. After indexing content  | Offline              | Needs a period of time to study link farms

 Cite this article as

Critical Review on “Characterizing the Splogosphere” by P.Hayati, 2nd Mar, 2008. Available Online – https://pi3ch.wordpress.com/2008/03/02/review-characterizing-the-splogosphere/