Review: Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics

March 15, 2008
Authors: Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura, Belle L. Tseng
Year: 2007
Published in: The International World Wide Web Conference Committee (IW3C2)
Link: http://www2007.org/workshops/paper_112.pdf
Importance: High

Abstract

This paper focuses on spam blog (splog) detection. Blogs are
highly popular, new media social communication mechanisms.
The presence of splogs degrades blog search results as well as
wastes network resources. In our approach we exploit unique blog
temporal dynamics to detect splogs.
There are three key ideas in our splog detection framework. We
first represent the blog temporal dynamics using self-similarity
matrices defined on the histogram intersection similarity measure
of the time, content, and link attributes of posts. Second, we show
via a novel visualization that the blog temporal characteristics
reveal attribute correlation, depending on type of the blog (normal
blogs and splogs). Third, we propose the use of temporal
structural properties computed from self-similarity matrices across
different attributes. In a splog detector, these novel features are
combined with content based features. We extract a content based
feature vector from different parts of the blog – URLs, post
content, etc. The dimensionality of the feature vector is reduced
by Fisher linear discriminant analysis. We have tested an SVM
based splog detector using proposed features on real world
datasets, with excellent results (90% accuracy).

My Review

Interestingly, this paper and an earlier one are essentially the same work, published at two different venues under different titles. You can read my review of the first paper here:

Review: The Splog Detection Task and A Solution Based on Temporal and Link Properties

However, the authors try to address the drawbacks of the previous paper here. For example, they provide more figures and give reasons for the vague parts of the earlier one. The content of the paper is better organized and clearer. What interested me in this paper is the equations and mathematical formulation of the splog features.
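The self-similarity matrix mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: the word-count histograms, the min/max normalization of the histogram intersection, and the function names are all assumptions made for illustration, and only the content attribute is shown.

```python
from collections import Counter

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima divided by sum of bin-wise maxima.

    The min/max normalization is an assumption for this sketch.
    """
    keys = set(h1) | set(h2)
    num = sum(min(h1.get(k, 0), h2.get(k, 0)) for k in keys)
    den = sum(max(h1.get(k, 0), h2.get(k, 0)) for k in keys)
    return num / den if den else 0.0

def self_similarity_matrix(posts):
    """S[i][j] is the similarity between post i and post j (posts in time order)."""
    hists = [Counter(p.lower().split()) for p in posts]
    n = len(hists)
    return [[histogram_intersection(hists[i], hists[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical posts: two identical spam-like posts and one unrelated post.
posts = ["cheap pills buy now", "cheap pills buy now", "my trip to the lake"]
S = self_similarity_matrix(posts)
```

A blog whose posts repeat each other produces large off-diagonal values in `S`, which is the kind of regular structure the paper's temporal features are meant to capture.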

Review: The Splog Detection Task and A Solution Based on Temporal and Link Properties

March 8, 2008
Authors: Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng
Year: 2006
Published in: The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings
Link: http://trec.nist.gov/pubs/trec15/papers/nec.blog.final.pdf
Importance: High

Abstract

Spam blogs (splogs) have become a major problem in the increasingly popular blogosphere. Splogs are detrimental in that they corrupt the quality of information retrieved and they waste tremendous network and storage resources. We study several research issues in splog detection. First, in comparison to web spam and email spam, we identify some unique characteristics of splog. Second, we propose a new online task that captures the unique characteristics of splog, in addition to tasks based on the traditional IR evaluation framework. The new task introduces a novel time-sensitive detection evaluation to indicate how quickly a detector can identify splogs. Third, we propose a splog detection algorithm that combines traditional content features with temporal and link regularity features that are unique to blogs. Finally, we develop an annotation tool to generate ground truth on a sampled subset of the TREC-Blog dataset. We conducted experiments on both offline (traditional splog detection) and our proposed online splog detection task. Experiments based on the annotated ground truth set show excellent results on both offline and online splog detection tasks.

My Review

The authors suggest a new method for detecting splogs based on an online approach, in addition to the traditional offline approaches.

Methods splogs use to deceive search engines:

  1. relevancy – via keyword stuffing
  2. popularity – via link farm
  3. recency – via frequent posts

Splog characteristics:

  1. Machine-generated content
  2. No value addition
  3. Hidden agenda, usually economic goal

Items 1 and 2 are not well separated; their definitions overlap.

The authors claim that splogs differ from web spam because of:

  1. Blogs’ dynamic content
  2. Non-endorsement links

These two reasons are not sufficient to distinguish splogs from web spam. Mistakenly or not, the authors give two reasons that distinguish blogs from other websites instead. Moreover, these two characteristics are well known in some other kinds of websites, such as news websites, which are updated frequently and collect user opinions on each news article.

Later in the paper, the authors mention that splog content is sometimes copied from non-spam blogs, so current web spam detection cannot detect it. This claim is more or less true, but they should provide good reasons or statistical surveys to support it.

Personally, I think splogs can be detected in five ways:

  1. Copied content. We can find it by employing text comparison algorithms.
  2. Repetitive content. By employing current web spam detection algorithms.
  3. Link rings. By employing link farm detection algorithms.
  4. Sping. By employing spam ping detection algorithms.
  5. Blog design. Blog visitors can often tell whether a blog is spam by looking at its design. Currently there is no work on this issue, and I would like to work on it.
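As an illustration of the first item (copied content), one common text comparison technique is Jaccard similarity over word shingles. This is a minimal sketch of that general technique, not something taken from the paper; the shingle size `k=3` and the example sentences are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical posts: a source post and a near-copy with one word changed.
original = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(shingles(original), shingles(copied))
```

A score close to 1 suggests one post was copied from the other; where to set the cut-off would have to be tuned on labeled data.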

In the rest of the paper, the authors demonstrate their detection method. They use five features for detecting splogs:

  1. Tokenized URLs
  2. Blog and post titles
  3. Anchor text
  4. Blog homepage content
  5. Post content

There is nothing new in these features, despite what the authors claimed earlier, and they do not clearly state how their detection method differs, or at least I could not figure it out.


Review: Characterizing the Splogosphere

March 2, 2008
Authors: Pranam Kolari, Akshay Java, and Tim Finin
Year: 2006
Published in: Proceedings of the 3rd Annual Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference
Link: http://ebiquity.umbc.edu/paper/html/id/299/Characterizing-the-Splogosphere
Importance: Very High

Abstract

Weblogs or blogs collectively constitute the Blogosphere, forming an influential and interesting subset on the Web. As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting ads or raising the PageRank of target sites. These splogs make up the splogosphere, and are now inundating blog search engines and update ping servers. In this work we characterize splogs by comparing them against authentic blogs. Our analysis is based on a dataset made publicly available by BlogPulse, and employs a machine learning model that detects splogs with an accuracy of 90%. To round off this analysis and to better understand splogs, we also present our study of a popular blog update ping server, and show how they are overwhelmed by pings sent by splogs. This overall study will facilitate finding effective new techniques to detect and weed out splogs from the blogosphere.

My Review

The authors first define the splogosphere in contrast to the blogosphere. They then present their method for detecting splogs (spam blogs). The method consists of two observations, and the whole blog content must be analyzed first:

  1. Detecting spam blogs and authentic blogs through a word list.
  2. Detecting spam blogs through link structures and any existing link farms.

Although the authors claim that this approach is 97% accurate for detecting spam blogs, one drawback is that all blog posts and links must be analyzed in order to tell splogs from blogs, which is a time-consuming task given how rapidly blogs and posts grow.

Another drawback is that the authors do not publish their splog word list, which is a weakness of their spam detection model.

The authors continue with their work on ping servers. They divide spam pings (spings) into two categories: pings that come from non-blogs and pings that come from splogs. In this section the authors distinguish splogs from blogs by studying their pinging behavior by time of day. They also believe that many “.info” domains are sploggy, since web search engines rank URL tokens highly. Also, splogs do not re-ping the same URL.
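To make the time-of-day idea concrete, here is a hedged sketch of how a ping stream could be summarized by hour. It is not the authors' analysis: the function names, the use of normalized entropy as a flatness measure, and the example timestamps are all assumptions.

```python
import math
from datetime import datetime

def hourly_profile(timestamps):
    """Fraction of pings falling in each of the 24 hours of the day."""
    counts = [0] * 24
    for ts in timestamps:
        counts[ts.hour] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 24

def normalized_entropy(profile):
    """Entropy of the profile scaled to [0, 1]; 1 means perfectly flat."""
    h = -sum(p * math.log(p) for p in profile if p > 0)
    return h / math.log(len(profile))

# Hypothetical ping logs: one source pings every hour, the other only at 09:00.
around_the_clock = [datetime(2008, 3, 2, hour, 0) for hour in range(24)]
single_hour = [datetime(2008, 3, 2, 9, minute) for minute in range(10)]

flat_score = normalized_entropy(hourly_profile(around_the_clock))
peaked_score = normalized_entropy(hourly_profile(single_hour))
```

How such a score separates splogs from authentic blogs would have to be checked against labeled ping data; the sketch only shows how the hourly profile can be computed.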

So three approaches are mentioned for detecting splogs:

Step | Advantage | Disadvantage
1. At update ping servers | Fast | Needs a period of time to study blog behavior
2. Before indexing content | Accurate and offline | Needs the whole content of the blog; the suggested word list is vague
3. After indexing content | Offline | Needs a period of time to study link farms

 Cite this article as

Critical Review on “Characterizing the Splogosphere” by P.Hayati, 2nd Mar, 2008. Available Online – https://pi3ch.wordpress.com/2008/03/02/review-characterizing-the-splogosphere/


Review: Towards Spam Detection at Ping Servers

March 2, 2008
Authors: Pranam Kolari, Tim Finin, Akshay Java, Anupam Joshi
Year: 2007
Published in: Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007)
Link: http://ebiquity.umbc.edu/paper/html/id/342/Towards-Spam-Detection-at-Ping-Servers
Importance: Very High

Abstract

Spam blogs, or splogs, are blogs featuring plagiarized or auto-generated content. They create link farms to promote affiliates, and are motivated by the profitability of hosting ads. Splogs infiltrate the blogosphere at ping servers, systems that aggregate blog update pings. Over the past year, our work has focused on detecting and eliminating splogs. As techniques used by spammers have evolved, we have learned how splog signatures are tied to tools that create them, that they are beginning to be a problem across languages, and that they require a much quicker assessment. Though we continue to address these specific challenges, we discuss our larger goal in this work, of developing a scalable meta-ping filter that detects and eliminates update pings from splogs. This will considerably reduce computational requirements and manual efforts at downstream services (search engines) and involve the community in detecting spam blogs.

My Review

I would like to review this paper in a question and answer manner. The paper answers the following questions:

Why do spammers use blogs?

  1. Blogs are treated as more relevant by web search engines
  2. Ping servers quickly notify search engines of new blog content
  3. Blog hosting is available for free

How do search engines filter splog content?

With two methods: pre-indexing and post-indexing.

What is the proposed method for Splog detection in this paper?

The authors propose a new system called a meta-ping server, which operates at pre-indexing time. This system provides search engines with a blacklist of blogs.

What are the approaches for building the meta-ping server?

Four filtering methods will be used on the meta-ping server:

  1. URL based filtering
  2. Blacklist based filtering
  3. Blog home-page based filtering
  4. Feed based filtering
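The four filters above can be pictured as a chain applied to each incoming ping. The following is only a sketch under stated assumptions: the `Ping` type, the word list, the blacklist entry, the 0.3 threshold, and every rule inside the filters are hypothetical, and the paper's actual filters are learned models rather than keyword checks.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class Ping:
    url: str
    homepage_text: str = ""  # fetched only if earlier filters pass
    feed_text: str = ""

# Hypothetical data, for illustration only.
SPAM_WORDS = {"casino", "pills", "loan"}
BLACKLIST = {"spam.example.info"}

def spam_word_ratio(text):
    """Fraction of words in text that appear in SPAM_WORDS."""
    words = text.lower().split()
    return sum(w in SPAM_WORDS for w in words) / len(words) if words else 0.0

def url_filter(ping):
    return any(w in ping.url.lower() for w in SPAM_WORDS)

def blacklist_filter(ping):
    return (urlparse(ping.url).hostname or "") in BLACKLIST

def homepage_filter(ping):
    return spam_word_ratio(ping.homepage_text) > 0.3

def feed_filter(ping):
    return spam_word_ratio(ping.feed_text) > 0.3

# Filters run in the order listed in the review.
FILTERS = [("url", url_filter), ("blacklist", blacklist_filter),
           ("homepage", homepage_filter), ("feed", feed_filter)]

def classify(ping):
    """Return the name of the first filter that flags the ping, or None."""
    for name, is_spam in FILTERS:
        if is_spam(ping):
            return name
    return None
```

For example, `classify(Ping("http://cheap-pills.example.com/"))` is flagged by the URL filter without the home page or feed ever being inspected, which is the point of running filtering before indexing.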

Cite this article as
Critical Review on “Towards Spam Detection at Ping Servers” by P.Hayati, 2nd Mar, 2008. Available Online – https://pi3ch.wordpress.com/2008/03/02/review-towards-spam-detection-at-ping-servers/


Review: Blog Track Open Task: Spam Blog Classification

February 19, 2008
Authors: Pranam Kolari, Tim Finin, Akshay Java, Anupam Joshi and Justin Martineau, James Mayfield.
Year: 2006
Published in: TREC 2006 Blog Track Notebook
Link: http://ebiquity.umbc.edu/paper/html/id/318/Blog-Track-Open-Task-Spam-Blog-Classification
Importance: Very High

Abstract

Spam blogs or Splogs are blogs with either auto-generated or plagiarized content created for the sole purpose of hosting ads, promoting affiliate sites and getting new pages indexed. Splogs now rival generic web spam and e-mail spam, presenting a major problem to analytics on the blogosphere from basic search and indexing, to opinion, community, influence and correlation detection. This open task submission details how splogs impact Opinion Identification, and proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007.

My Review

In this paper, the authors first make it clear that spam blog (splog) classification differs from web spam classification for three reasons:

  1. Search engine coverage: more search engine coverage, especially for blogs
  2. Quicker assessment: blogs are updated sooner than web pages
  3. Genre of blog content: current web spam detection tools are not suited to the nature of blog content, which comes from personal opinions, …

The authors mention two models of blog spam detection:

  1. Local models
  2. Link-based models

They emphasize the first model, which detects spam through a single web page and does not consult other links or other data sources to detect spam content. The main advantage of this model is quick assessment of spam content.

Techniques used in the local model:

  1. Words
  2. Word N-Grams
  3. Tokenized Anchors
  4. Tokenized URLs
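Two of these feature types, word n-grams and tokenized URLs, can be sketched as follows. The tokenization rules (lowercasing, splitting on non-alphanumeric characters) and the function names are assumptions for illustration and may differ from what the authors actually used.

```python
import re
from collections import Counter

def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def word_ngrams(text, n=2):
    """Return the list of word n-grams in text, in order."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def local_features(url, text, n=2):
    """Bag of features built from a single page only (no link graph)."""
    feats = Counter("url:" + t for t in tokenize_url(url))
    feats.update("ngram:" + g for g in word_ngrams(text, n))
    return feats

# Hypothetical page.
features = local_features("http://cheap-loans.example.com/best_rates",
                          "Buy cheap loans now")
```

The resulting counts can be fed to any standard classifier; because they come from one page, they fit the quick assessment that the local model is meant to provide.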

Other techniques:

  1. Ping servers discard blogs which ping too frequently
  2. Comment spam model: Akismet
  3. URL/IP blacklist

Splog categories:

  1. Non-blogs
  2. Keyword-stuffing
  3. Post-stitching
  4. Post-plagiarism
  5. Post-weaving
  6. Link-spam
  7. Other

Suggestions for the authors:
How the bag-of-words is built is important for detecting spam content. For this task, we can include heavily advertised keywords in the list, since many splogs use these keywords to attract high traffic. We can also include seasonal keywords, e.g. greeting cards, happy valentine, …, which are used during specific times of the year.

Important Terms

  • Local models
  • Link-based models
  • Words
  • Word N-Grams
  • Tokenized Anchors
  • Tokenized URLs
  • Non-blogs
  • Keyword-stuffing
  • Post-stitching
  • Post-plagiarism
  • Post-weaving
  • Link-spam

Useful References
Java, A.; Kolari, P.; Finin, T.; Mayfield, J.; Joshi, A.; and Martineau, J. 2006. The UMBC/JHU blogvox system. In Proceedings of the Fifteenth Text Retrieval Conference.