Review: Review Spam Detection

April 29, 2008
Author: Nitin Jindal and Bing Liu
Year: 2007
Published in: The International World Wide Web Conference Committee
Importance: High


It is now a common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. Such reviews provide valuable sources of information on these products. They are used by potential customers to find opinions of existing users before deciding to purchase a product. They are also used by product manufacturers to identify problems of their products and to find competitive intelligence information about their competitors. Unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions. In this paper, we make an attempt to study review spam and spam detection. To the best of our knowledge, there is still no reported study on this problem.

My Review

This short, two-page paper talks about review spam, which is new to the world of web spam. According to the authors, review spam is different from web and email spam, so we need new methods for detecting it. Review spam is also hard to detect, even manually.

Why is it hard to detect?

  1. Spam reviews closely resemble genuine reviews.
  2. There is not enough metadata for analysing them.

Mainly, the authors try to detect duplicate reviews in this paper, and they provide a model based on the shingle method. The other types of review spam are, as the authors say, hard to detect, and the practical outcome of their work is small.
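The shingle idea is easy to sketch. Below is a minimal, self-contained illustration of w-shingling with Jaccard similarity (my own toy version, not the authors' exact implementation): each review is reduced to its set of w-word windows, and two reviews whose shingle sets overlap heavily are flagged as near-duplicates.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two near-duplicate reviews share most of their shingles.
r1 = "great phone with an excellent camera and battery life"
r2 = "great phone with an excellent camera and long battery life"
sim = jaccard(shingles(r1), shingles(r2))  # high overlap -> likely duplicate
```

A duplicate detector would compare each new review's shingle set against existing ones and flag pairs whose similarity exceeds a threshold.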


Personally, I think that since it is still hard to detect review spam even manually, we should improve spam-prevention methods such as CAPTCHA in order to disallow review spam (Sreview) in the first place. For the time being, I have no idea how to detect review spam after it has been posted.

Good reference

Web Data Mining by Bing Liu


Review: Developing a Framework for Assessing Information Quality on the World Wide Web

March 19, 2008
Authors: Shirlee-ann Knight and Janice Burn
Year: 2005
Published in: Informing Science Journal Volume 8
Importance: Medium 


The rapid growth of the Internet as an environment for information exchange and the lack of enforceable standards regarding the information it contains has led to numerous information quality problems. A major issue is the inability of Search Engine technology to wade through the vast expanse of questionable content and return “quality” results to a user’s query. This paper attempts to address some of the issues involved in determining what quality is, as it pertains to information retrieval on the Internet. The IQIP model is presented as an approach to managing the choice and implementation of quality related algorithms of an Internet crawling Search Engine.

My Review

In this paper the authors discuss the problem of information quality on the WWW from a search-engine perspective. They clearly define the problem and survey current solutions. Their proposed model (IQIP) consists of four parts:

  • Identify: the user, environment, and task
  • Quantify: prioritise information-quality dimensions
  • Implement: implement the chosen IQ dimensions in the web crawler
  • Perfect: improve the crawler through feedback

Their proposed model can also be used for attacking spam on the WWW. As my supervisor (Dr. Potdar) suggested, we could make use of this model in anti-spam methods. A simple example:

  • Identify: here we study spammers’ behavior, their subjects, …
  • Environment: study of Splogs, Sforums, spam pages, …
  • Task: spam detection based on spam characteristics that already exist in the literature.

More reviews coming soon…

Review: Learning Fast Classifiers for Image Spam

March 17, 2008
Authors: Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach
Year: 2007
Published in: CEAS 2007
Importance: Very High


Recently, spammers have proliferated “image spam”, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.

My Review

In the introduction the authors mention that the cornerstone of many anti-spam email systems is content filtering, often based on naive Bayes. Spammers outsmart content filters by:

  • Obscuring text
  • Obfuscating words with symbols
  • Including neutral text

Spammers now use a new method: advertising in image attachments instead of text, thereby neutralizing text-based filtering. So the authors suggest a new method, or better to say, new features, for detecting these spam images.
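For context, the naive Bayes filtering that image spam evades can be sketched in a few lines (an illustrative toy, not any product's actual filter): each label scores a message by its prior plus Laplace-smoothed per-word log-likelihoods.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (label, text). Returns label priors, per-label word counts, vocabulary."""
    priors, counts, vocab = Counter(), {}, set()
    for label, text in docs:
        priors[label] += 1
        words = text.lower().split()
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return priors, counts, vocab

def classify(text, priors, counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word|label), Laplace-smoothed."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for label in priors:
        n = sum(counts[label].values())
        score = math.log(priors[label] / total)
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("spam", "buy cheap pills now"),
        ("spam", "cheap pills free offer"),
        ("ham", "meeting notes for project"),
        ("ham", "project schedule and notes")]
model = train(docs)
label = classify("cheap pills offer", *model)  # classified as "spam"
```

When the advertisement text lives inside an image, this word-based scoring sees nothing useful, which is exactly the gap the paper's image features address.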

Features extracted from images for spam analysis:

  • File Format
  • File Size
  • Image Metadata: comments, number of images (frames), bits per pixel, progressive flag, color table entries, index value, transparent color, logical height and width, components, bands, …
  • Image Size
  • Average Color
  • Color Saturation
  • Edge Detection
  • Prevalent Color Coverage
  • Random Pixel Test

The authors provide a very interesting equation for selecting features that are both accurate and fast. The selected features are:

  • Average Color
  • Color Saturation
  • Edge Detection
  • File Format
  • File Size
  • Image Metadata
  • Image Size
  • Prevalent Color Coverage
  • Random Pixel Test
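The paper's exact selection equation is not reproduced here, but the underlying idea of trading predictive power against extraction speed can be sketched as a greedy ranking by accuracy gain per millisecond (my own simplification; the feature names, gains, and timings below are invented for illustration):

```python
def select_features(features, budget_ms):
    """Greedy selection: rank features by accuracy gain per millisecond of
    extraction time, then take features until the time budget is exhausted.
    (Illustrative only -- the paper's actual criterion may differ.)"""
    ranked = sorted(features, key=lambda f: f["gain"] / f["time_ms"], reverse=True)
    chosen, spent = [], 0.0
    for f in ranked:
        if spent + f["time_ms"] <= budget_ms:
            chosen.append(f["name"])
            spent += f["time_ms"]
    return chosen

features = [
    {"name": "file_size",      "gain": 0.05, "time_ms": 0.1},
    {"name": "average_color",  "gain": 0.20, "time_ms": 2.0},
    {"name": "edge_detection", "gain": 0.30, "time_ms": 15.0},
    {"name": "ocr_text",       "gain": 0.35, "time_ms": 300.0},
]
fast_set = select_features(features, budget_ms=20.0)  # drops the slow OCR feature
```

The design point is the same as the paper's: a slightly less accurate feature that is orders of magnitude faster can be the better choice at mail-server scale.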

Important Terms

  • Maximum Entropy
  • Naive Bayes
  • SpamArchive
  • ID3 Decision Tree
  • Just In Time decision tree

Review: Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics

March 15, 2008
Authors: Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura, Belle L. Tseng
Year: 2007
Published in: The International World Wide Web Conference Committee (IW3C2)
Importance: High


This paper focuses on spam blog (splog) detection. Blogs are highly popular, new media social communication mechanisms. The presence of splogs degrades blog search results as well as wastes network resources. In our approach we exploit unique blog temporal dynamics to detect splogs. There are three key ideas in our splog detection framework. We first represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts. Second, we show via a novel visualization that the blog temporal characteristics reveal attribute correlation, depending on type of the blog (normal blogs and splogs). Third, we propose the use of temporal structural properties computed from self-similarity matrices across different attributes. In a splog detector, these novel features are combined with content based features. We extract a content based feature vector from different parts of the blog – URLs, post content, etc. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM based splog detector using proposed features on real world datasets, with excellent results (90% accuracy).

My Review

Interestingly, these two papers are essentially the same work, published in two different conferences under different titles. You can read my review of the first paper here:

Review: The Splog Detection Task and A Solution Based on Temporal and Link Properties

But the authors try to cover the drawbacks of the previous paper here. For example, they provide more figures and more reasoning for the vague parts of the previous one. The content of the paper is better organized and clearer. What interested me most in this paper is the equations and the mathematical formulation of splog features.
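The core construction is compact enough to sketch. Assuming each post is represented as a normalized histogram, the self-similarity matrix built on the histogram intersection measure looks like this (a toy version of the paper's formulation, not their code):

```python
def hist_intersection(h1, h2):
    """Histogram intersection similarity of two normalized histograms:
    sum over bins of the smaller of the two bin values."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def self_similarity(hists):
    """S[i][j] = similarity between the histograms of posts i and j."""
    n = len(hists)
    return [[hist_intersection(hists[i], hists[j]) for j in range(n)]
            for i in range(n)]

# Toy word-histograms of three posts; the first two are similar, the third differs.
posts = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]
S = self_similarity(posts)
```

A machine-generated splog tends to produce a self-similarity matrix with strong regular structure (many near-identical posts), while a normal blog's matrix looks irregular; the paper turns that structure into features.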

Review: Is Britney Spears Spam?

March 12, 2008
Authors: Aaron Zinman, Judith Donath 
Year: 2007
Published in: Proceedings of the Fourth Conference on Email and Anti-Spam
Importance: High


We seek to redefine spam and the role of the spam filter in the context of Social Networking Services (SNS). SNS, such as MySpace and Facebook, are increasing in popularity. They enable and encourage users to communicate with previously unknown network members on an unprecedented scale. The problem we address with our work is that users of these sites risk being overwhelmed with unsolicited communications not just from e-mail spammers, but also from a large pool of well intending, yet subjectively uninteresting people. Those who wish to remain open to meeting new people must spend a large amount of time estimating deception and utility in unknown contacts. Our goal is to assist the user in making these determinations. This requires identifying clear cases of
undesirable spam and helping them to assess the more ambiguous ones. Our approach is to present an analysis of the salient features of the sender’s profile and network that contains otherwise hard to perceive cues about their likely intentions. As with traditional spam analysis, much of our work focuses on detecting deception: finding profiles that mimic ordinary users but which are actually commercial and usually undesirable entities. We address this within the larger context of making more legible the key cues presented by any unknown contact. We have developed a research prototype that categorizes senders into broader categories than spam/not spam using features unique to SNS. We discuss our initial experiment, and its results and implications.

My Review

The authors propose a detection method that helps social-network users determine which contacts are spam users. They use the name Britney Spears to illustrate a sample spam-like user.

They interestingly contrast the spam problem and spammer behavior on social-network websites with other web spam categories.

  1. A friend request from a spam user on a social-network website is content-less, so many content-based detection algorithms cannot be employed here.
  2. On social-network websites, simply filtering based on categories cannot help us detect spam users, since many spam-user profiles are deceptive; besides, how can we define one category as spam and another as not?

They categorize users on social-network websites along two main axes:

  1. Sociability
  2. Promotion

Combinations of these two axes give:

  1. Low sociability and low promotion: New user, low-effort spammer
  2. Low sociability and high promotion: spammer, Britney Spears is here 😉
  3. High sociability and low promotion: Many active users
  4. High sociability and high promotion: spammer, or a local band (real users)
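This 2x2 categorization can be sketched as a simple threshold rule (the cutoff value and labels below are my own illustration; the paper scores users with learned models, not a fixed threshold):

```python
def quadrant(sociability, promotion, threshold=0.5):
    """Map a user's sociability and promotion scores onto the four quadrants."""
    if sociability < threshold and promotion < threshold:
        return "new user / low-effort spammer"
    if sociability < threshold:
        return "spammer"        # low sociability, high promotion
    if promotion < threshold:
        return "active user"    # high sociability, low promotion
    return "promoter"           # high sociability, high promotion (e.g. local band)

label = quadrant(sociability=0.1, promotion=0.9)  # the "Britney Spears" case
```

The interesting design choice is that the output is a broader category than spam/not-spam, leaving the final judgment to the user.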

For scoring users along these two axes they used two groups of features:

  1. Profile-based features
  2. Network-based features

Profile-based features include:

  • number of friends
  • number of YouTube movies
  • number of details
  • number of comments
  • number of thanks
  • number of survey
  • number of ‘I’
  • number of ‘you’
  • missing picture
  • mp3 player present
  • static url to profile available
  • has a school section
  • has blurbs
  • the page is personalized through CSS
  • has a networking section
  • has a company section
  • has blog entries

Network-based features include:

  • percent of our comments that are from our top n
  • percent of our top n comments that are from us
  • percent of our comments’ images that are unique
  • percent of our comments’ hrefs that are unique
  • percent of our comments to our top n that have unique hrefs
  • percent of our comments to our top n that have unique images
  • average number of posters that use the same images in our comments to our top n
  • average number of posters that use the same images in our comments
  • average number of posters that use the same hrefs in our comments
  • average number of posters that use the same hrefs in our comments to our top n
  • total number of comments from anyone to our top n
  • total number of images in comments
  • total number of hrefs in comments
  • total number of images in our comments to our top n
  • total number of hrefs in our comments to our top n
  • percent of our comments that have images
  • percent of our comments that have hrefs
  • percent of our comments in our top n that have hrefs
  • percent of our comments in our top n that have images
  • number of independent images in our comments
  • number of independent hrefs in our comments
  • number of independent images in our comments to our top n
  • number of independent hrefs in our comments to our top n

Although they did not provide a practical implementation of their suggested detection method, their work is great and unique in the web-spam field.

Review: The Splog Detection Task and A Solution Based on Temporal and Link Properties

March 8, 2008
Authors: Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng
Year: 2006
Published in: The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings
Importance: High


Spam blogs (splogs) have become a major problem in the increasingly popular blogosphere. Splogs are detrimental in that they corrupt the quality of information retrieved and they waste tremendous network and storage resources. We study several research issues in splog detection. First, in comparison to web spam and email spam, we identify some unique characteristics of splog. Second, we propose a new online task that captures the unique characteristics of splog, in addition to tasks based on the traditional IR evaluation framework. The new task introduces a novel time-sensitive detection evaluation to indicate how quickly a detector can identify splogs. Third, we propose a splog detection algorithm that combines traditional content features with temporal and link regularity features that are unique to blogs. Finally, we develop an annotation tool to generate ground truth on a sampled subset of the TREC-Blog dataset. We conducted experiments on both offline (traditional splog detection) and our proposed online splog detection task. Experiments based on the annotated ground truth set show excellent results on both offline and online splog detection tasks.

My Review

The authors suggest a new method for detecting splogs based on an online approach, alongside other offline approaches.

Methods splogs use to deceive search engines:

  1. relevancy – via keyword stuffing
  2. popularity – via link farm
  3. recency – via frequent posts

Splog characteristics

  1. Machine-generated content
  2. No value addition
  3. Hidden agenda, usually economic goal

Characteristics 1 and 2 are not well separated; their definitions overlap.

The authors claim that splogs are different from web spam because of:

  1. Blogs’ dynamic content
  2. Non-endorsement links

These two reasons are not sufficient to distinguish splogs from web spam. Mistakenly or not, the authors give two reasons that distinguish blogs from other websites, not splogs from web spam. Moreover, both characteristics are well known in some kinds of websites, such as news sites that are updated frequently and collect user opinions on each article.

Later in the paper the authors mention that splog content is sometimes copied from non-spam blogs, so current web-spam detection cannot catch it. This claim is more or less true, but they should provide solid reasons or statistical surveys to support it.

Personally, I think splogs can be detected in five ways:

  1. Copied content. We can find it by employing text-comparison algorithms.
  2. Repetitive content. By employing current web-spam detection algorithms.
  3. Link rings. By employing link-farm detection algorithms.
  4. Sping. By employing spam-ping detection algorithms.
  5. Blog design. Visitors can often tell whether a blog is spam just by looking at its design. Currently there is no work on this issue; anyway, I would like to work on it.

In the rest of the paper the authors demonstrate their detection method. They used five features for detecting splogs:

  1. Tokenized URLs
  2. Blog and post titles
  3. Anchor text
  4. Blog homepage content
  5. Post content
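As a small illustration of the first feature, a URL can be broken into word tokens before being fed to a classifier (the regex below is my own guess; the paper does not specify its tokenizer):

```python
import re

def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens -- a simple stand-in for
    the 'tokenized URLs' feature (illustrative, not the paper's tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

tokens = tokenize_url("http://cheap-pills-online.example.com/buy/now")
```

Splog URLs are often stuffed with commercial keywords, so even this crude tokenization exposes useful classification signal.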

There is nothing new in these features beyond what the authors claimed before, and they do not clearly explain how their detection method differs; or at least I could not figure it out.

Review: Next Generation Solutions for Spam: A Predictive Approach

March 8, 2008
Authors: Proofpoint MLX
Year: 2008
Published in: Proofpoint MLX Whitepaper
Importance: High


Mounting an effective defense against spam requires detection techniques that can evolve as quickly as the attacks themselves. Without the ability to automatically adapt to detect new types of threats, an anti-spam solution will always be a step behind the spammers. Proofpoint MLX™ technology leverages machine learning techniques to provide a revolutionary spam detection system that analyzes millions of messages to automatically adjust its detection algorithms to identify even the newest spam attacks without manual tuning or administrator intervention.

My Review

In this commercial whitepaper the authors classify anti-spam filters into three generations.
1st generation – Basic filtering:

  • Signature Based: Compare messages to known spam
  • Challenge/response: Require the sender to respond
  • Text pattern matching: Search for spam keywords
  • RBLs, collaborative: Check messages against RBLs and other public anti-spam resources


  • Low false positives
  • Low effectiveness
  • Easily flooded by evolving techniques

2nd generation – Heuristics/Bayesian

  • Linear models
  • Simple word match
  • Heuristic rules: Apply rules of thumb to assign a spam score


  • High false positives
  • High administration
  • Effectiveness decays over time

3rd generation – Machine learning

  • Logistic regression
  • Support vector machines
  • Integrated reputation


  • Immune to evolving attacks
  • High effectiveness without decay
  • Low false positives
  • Low administration

The authors believe that developing anti-spam methodologies today requires intelligent approaches that adapt automatically to new spam.
The Proofpoint MLX detection method uses logistic regression to find dependencies between spam attributes, assigns a weight to each attribute, and then calculates the net effect of the weighted attributes.
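The weighting scheme described is essentially standard logistic regression, which can be sketched as follows (the attribute names and weights below are invented for illustration; Proofpoint's actual model and features are not public):

```python
import math

def spam_probability(attrs, weights, bias=0.0):
    """Logistic regression scoring: a weighted sum of attribute values
    pushed through the sigmoid to give a spam probability in (0, 1)."""
    z = bias + sum(weights[a] * v for a, v in attrs.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights: positive weights push toward spam,
# negative weights (like a good sender reputation) push toward ham.
weights = {"has_image_attachment": 1.5,
           "num_spam_keywords": 0.8,
           "sender_reputation": -2.0}

p = spam_probability({"has_image_attachment": 1,
                      "num_spam_keywords": 3,
                      "sender_reputation": 0.2}, weights)
```

In a real system the weights are re-learned continuously from labeled message streams, which is what the whitepaper means by adapting without manual tuning.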

I am not very interested in this paper, so I summarized many parts. Image-based spam is another issue detected by this new method. Spam systems that send spam images use the techniques below to bypass current methods:

  • Randomized Image Borders
  • Randomized Pixels Between Paragraphs
  • Randomized Colortable Entries to Obfuscate the Image
  • Animated GIF with Embedded Spam Image
  • Image Segmentation
  • OCR-resistant Images
  • Combining Image-based and Text-based Techniques

The suggested method uses the techniques below to filter image-based spam:

  • Fuzzy matching for obfuscated images
  • Dynamic spam image detection
  • Animated GIF spam detection
  • Dynamic botnet protection

The authors talk about a layered system (logistic regression) that filters email, but they never mention how long this task takes. Personally, I think it consumes a lot of CPU and memory.