Review: Improving Cloaking Detection Using Search Query Popularity and Monetizability

May 16, 2008
Authors: Kumar Chellapilla, David Maxwell Chickering
Year: 2006
Published in: 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb)
Importance: High


Cloaking is a search engine spamming technique used by some
Web sites to deliver one page to a search engine for indexing
while serving an entirely different page to users browsing the site.
In this paper, we show that the degree of cloaking among search
results depends on query properties such as popularity and
monetizability. We propose estimating query popularity and
monetizability by analyzing search engine query logs and online
advertising click-through logs, respectively. We also present a
new measure for detecting cloaked URLs that uses a normalized
term frequency ratio between multiple downloaded copies of Web
pages. Experiments are conducted using 10,000 search queries
and 3 million associated search result URLs. Experimental results
indicate that while only 73.1% of the cloaked popular search
URLs are spam, over 98.5% of the cloaked monetizable search
URLs are spam. Further, on average, the search results for top 2%
most cloaked queries are 10x more likely to be cloaking than
those for the bottom 98% of the queries.

My Review

The view presented in this paper is that the authors use a new method to detect cloaking. From the authors’ point of view, cloaking is a hiding technique used by spam pages to show a different page to search engine crawlers than to real users. In this way, spam page designers can trick the crawler into giving them higher positions in the search engine ranking. The authors claim that monetizability is a goal of web spam. They used the MSN search engine as their primary data set.

An outcome of their work: detection of cloaked monetizable search URLs is over 98.5% accurate, versus 73.1% for popular-query spam.
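Their normalized term-frequency measure can be sketched roughly as follows: download each result URL more than once as a crawler and as a browser, then compare the copies, normalizing by within-agent differences so that legitimately dynamic pages are not flagged. This is a minimal sketch under my own assumptions (whitespace tokenization, a +1 smoothing term), not the authors’ exact formula:

```python
from collections import Counter

def term_freq(text):
    """Term-frequency vector for a page's text (simple whitespace tokenizer)."""
    return Counter(text.lower().split())

def diff(tf_a, tf_b):
    """Total absolute term-count difference between two term-frequency vectors."""
    terms = set(tf_a) | set(tf_b)
    return sum(abs(tf_a[t] - tf_b[t]) for t in terms)

def cloaking_score(crawler_copies, browser_copies):
    """Normalized term-difference ratio: the crawler-vs-browser difference
    divided by the within-agent differences (which account for legitimate
    dynamic content). A score well above 1 suggests cloaking."""
    c1, c2 = (term_freq(t) for t in crawler_copies)
    b1, b2 = (term_freq(t) for t in browser_copies)
    cross = diff(c1, b1)
    within = diff(c1, c2) + diff(b1, b2) + 1  # +1 avoids division by zero
    return cross / within
```

A static page served identically to both agents scores 0, while a cloaked page that shows the crawler news text and the browser pill ads scores far above 1.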


Review: Spam Double-Funnel: Connecting Web Spammers with Advertisers

May 12, 2008
Authors: Yi-Min Wang, Ming Ma, Yuan Niu, Hao Chen
Year: 2007
Published in: Proceedings of the 16th international conference on World Wide Web
Importance: High


Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords, one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

My Review

In this paper the authors interestingly model the end-to-end spamming business as a five-layer double funnel. The layers are as follows:

  1. Doorway: spammers set up doorway pages to be crawled and ranked.
  2. Redirection domain: spammers set up redirection domains that funnel traffic onward.
  3. Aggregator: aggregators insulate the layers below from the spam pages.
  4. Syndicator: syndicators buy traffic from aggregators.
  5. Advertiser: advertisers pay syndicators to display their ads.
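The layers above form a redirect chain from doorway page to advertiser, and the paper’s methodology identifies spam by the third-party domains along that chain. A minimal sketch of walking such a chain from a pre-recorded redirect mapping; the domain names and the mapping itself are hypothetical:

```python
from urllib.parse import urlparse

def redirect_chain(start_url, redirects, max_hops=10):
    """Follow a (pre-recorded) redirect mapping from a doorway URL and
    return the chain of domains traversed. In the paper's model the chain
    runs doorway -> redirection domain -> aggregator -> syndicator -> advertiser."""
    chain, url = [], start_url
    for _ in range(max_hops):
        chain.append(urlparse(url).netloc)
        if url not in redirects:
            break  # reached the final (advertiser) layer
        url = redirects[url]
    return chain
```

Running it over a crawled mapping surfaces which redirection and aggregator domains recur across many doorways, which is how the paper identifies the prominent players on each layer.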

Five top spam keywords:

  1. Phentermine
  2. Viagra
  3. Cialis
  4. Tramadol
  5. Xanax

More coming soon…

Review: Fighting Spam on Social Web Sites

May 9, 2008
Authors: Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina
Year: 2007
Published in: IEEE Computer Society
Importance: High


In recent years, social Web sites have become important components of the
Web. With their success, however, has come a growing influx of spam. If left
unchecked, spam threatens to undermine resource sharing, interactivity, and
openness. This article surveys three categories of potential countermeasures —
those based on detection, demotion, and prevention. Although many of these
countermeasures have been proposed before for email and Web spam, the
authors find that their applicability to social Web sites differs. How should we
evaluate spam countermeasures for social Web sites, and what future challenges
might we face?

My Review

One of the most beautiful papers I have read recently is this one. The authors classify the current literature in the web spam field very well, then demonstrate how each anti-spam method works on their example of a social community website (social bookmarking).

More dynamic content means more avenues for spamming!

3 main anti-spam strategies:

  1. Detection-based: text classification, link analysis, user behavior analysis, …
  2. Prevention-based: CAPTCHA, Account fee, Proof of work, …
  3. Demotion-based: Spam-hardened queries, rank-based, …
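Among the prevention-based methods, proof of work is the easiest to illustrate: the poster must burn some CPU before each submission, which is cheap for one comment but expensive for a million. A hashcash-style sketch (the message format and difficulty level are my own assumptions, not from the paper):

```python
import hashlib

def mint(message, difficulty=3):
    """Hashcash-style proof of work: find a nonce whose SHA-256 digest has
    `difficulty` leading zero hex digits. Costly (on average) to produce,
    which makes bulk posting expensive for spammers."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{message}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(message, nonce, difficulty=3):
    """Verification is a single hash, so the site pays almost nothing."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry (thousands of hashes to mint, one hash to verify) is the whole point of the scheme.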

It is well understood that little work has been done on strategies 2 and 3.

2 ingredients for method evaluation:

  1. Spam model: captures whether content is spam or not
  2. Spam metric: provides a quantitative assessment of how spam affects a particular interface

2 Spam models:

  1. Synthetic spam model: make assumptions and define malicious behavior
  2. Trace-driven spam model: based on real data and positive/negative examples of spam content

Since the content of social community websites is updated very rapidly, the authors used the synthetic model to develop their detection model.

All in all, their work is worthwhile because of the classification, not because it proposes a new method.

Review: Opinion Spam and Analysis

May 5, 2008
Authors: Nitin Jindal and Bing Liu
Year: 2008
Published in: Proceedings of the international conference on Web search and web data mining
Importance: High


Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. In the past two years, several startup companies also appeared which aggregate opinions from product reviews. It is thus high time to study spam in reviews. To the best of our knowledge, there is still no published study on this topic, although Web spam and email spam have been investigated extensively. We will see that opinion spam is quite different from Web spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers, we show that opinion spam in reviews is widespread. This paper analyzes such spam activities and presents some novel techniques to detect them.

My Review

As the authors claim, spam reviews are currently a unique area and differ from other types of web spam, as discussed in Review: Review Spam Detection by the same authors. They classify review spam into 3 types:

  • Type 1: deliberately misleading reviews. Hard to detect.
  • Type 2: reviews on brands only, not the product itself.
  • Type 3: non-reviews such as ads, questions and answers, …

Their test environment contains 5.8 million reviews.

As discussed in the previous post, type 1 reviews are hard to detect. The authors propose a new way to study this problem. They first find which reviews are harmful; harmful means those reviews that differ from the other reviews on a product page.

In their model, 36 features for reviews, reviewers, and products are proposed. These features are first used to detect type 2 and type 3 (duplicate) reviews, and the detected duplicates are then used as a training set for detecting type 1 reviews.

Their result, based on AUC evaluation, is 98% for type 2 and 3 spam and 78% for type 1.
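The AUC metric they report is simply the probability that a randomly chosen spam review is scored higher by the classifier than a randomly chosen non-spam review (ties count half). A minimal sketch of that computation:

```python
def auc(scores, labels):
    """Area under the ROC curve via the pairwise-comparison definition:
    the fraction of (spam, non-spam) pairs where the spam example scores
    higher, counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker gets 1.0, a random one hovers around 0.5, so the reported 0.98 for types 2–3 versus 0.78 for type 1 quantifies how much harder the deliberately misleading reviews are.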


  1. They try to detect good and bad products based on product rating; this assumption may not be suitable, since some ratings are themselves spam. Although they mention this drawback, using a data mining method to discover which products are good and which are bad would be preferable.
  2. They do not mention the cost/benefit of their model.
  3. Other types of features, not accessible from the user front page, might improve the results.


Review: Is Britney Spears Spam?

March 12, 2008
Authors: Aaron Zinman, Judith Donath 
Year: 2007
Published in: Proceedings of the Fourth Conference on Email and Anti-Spam
Importance: High


We seek to redefine spam and the role of the spam filter in the context of Social Networking Services (SNS). SNS, such as MySpace and Facebook, are increasing in popularity. They enable and encourage users to communicate with previously unknown network members on an unprecedented scale. The problem we address with our work is that users of these sites risk being overwhelmed with unsolicited communications not just from e-mail spammers, but also from a large pool of well intending, yet subjectively uninteresting people. Those who wish to remain open to meeting new people must spend a large amount of time estimating deception and utility in unknown contacts. Our goal is to assist the user in making these determinations. This requires identifying clear cases of
undesirable spam and helping them to assess the more ambiguous ones. Our approach is to present an analysis of the salient features of the sender’s profile and network that contains otherwise hard to perceive cues about their likely intentions. As with traditional spam analysis, much of our work focuses on detecting deception: finding profiles that mimic ordinary users but which are actually commercial and usually undesirable entities. We address this within the larger context of making more legible the key cues presented by any unknown contact. We have developed a research prototype that categorizes senders into broader categories than spam/not spam using features unique to SNS. We discuss our initial experiment, and its results and implications.

My Review

The authors propose a detection method to help social network website users determine which users are spammers. They use the name Britney Spears to represent a sample spam user.

They interestingly define how the spam problem and spam behavior on social network websites differ from other web spam categories.

  1. A friend request from a spam user on a social network website is content-less, so many content-based detection algorithms cannot be employed here.
  2. On social network websites, simply filtering based on categories cannot help us detect spam users, since many spam user profiles are deceptive; besides, how can we define one category as spam and another as not?

They try to categorize users on social network websites based on two main categories:

  1. Sociability
  2. Promotion

And the combinations of these two categories:

  1. Low sociability, low promotion: new user or low-effort spammer
  2. Low sociability, high promotion: spammer; Britney Spears is here 😉
  3. High sociability, low promotion: many active users
  4. High sociability, high promotion: spammer or a local band (real users)
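The quadrant assignment above can be sketched as a simple rule; the 0–1 score scale and the threshold are my own illustrative assumptions, since the paper does not prescribe them:

```python
def classify(sociability, promotion, threshold=0.5):
    """Map a user's sociability and promotion scores (assumed 0..1)
    onto the four quadrants described in the paper."""
    hi_s = sociability >= threshold
    hi_p = promotion >= threshold
    if hi_s and hi_p:
        return "spammer or local band"       # high/high is ambiguous
    if hi_p:
        return "likely spammer"              # low sociability, pure promotion
    if hi_s:
        return "active user"
    return "new user or low-effort spammer"
```

Note that two of the four quadrants remain ambiguous (a local band promotes itself heavily but is a real user), which is exactly why the authors present cues to the user rather than a hard spam/not-spam verdict.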

For scoring users in these two categories they use two groups of features:

  1. Profile-based features
  2. Network-based features

Profile-based features include:

  • number of friends
  • number of youtube movies
  • number of details
  • number of comments
  • number of thanks
  • number of survey
  • number of ‘I’
  • number of ‘you’
  • missing picture
  • mp3 player present
  • static url to profile available
  • has a school section
  • has blurbs
  • the page is personalized through CSS
  • has a networking section
  • has a company section
  • has blog entries

Network-based features include:

  • percent of our comments that are from our top n
  • percent of our top n comments that are from us
  • percent of our comments’ images that are unique
  • percent of our comments’ hrefs that are unique
  • percent of our comments to our top n that have unique hrefs
  • percent of our comments to our top n that have unique images
  • average number of posters that use the same images in our comments to our top n
  • average number of posters that use the same images in our comments
  • average number of posters that use the same hrefs in our comments
  • average number of posters that use the same hrefs in our comments to our top n
  • total number of comments from anyone to our top n
  • total number of images in comments
  • total number of hrefs in comments
  • total number of images in our comments to our top n
  • total number of hrefs in our comments to our top n
  • percent of our comments that have images
  • percent of our comments that have hrefs
  • percent of our comments in our top n that have hrefs
  • percent of our comments in our top n that have images
  • number of independent images in our comments
  • number of independent hrefs in our comments
  • number of independent images in our comments to our top n
  • number of independent hrefs in our comments to our top n

Although they did not provide a practical implementation of their suggested detection method, their work is great and unique in the web spam field.

Review: Next Generation Solutions for Spam a Predictive Approach

March 8, 2008
Authors: Proofpoint MLX
Year: 2008
Published in: Proofpoint MLX Whitepaper
Importance: High


Mounting an effective defense against spam requires detection techniques that can evolve as quickly as the attacks themselves. Without the ability to automatically adapt to detect new types of threats, an anti-spam solution will always be a step behind the
spammers. Proofpoint MLX™ technology leverages machine learning techniques to provide a revolutionary spam detection system that analyzes millions of messages to automatically adjust its detection algorithms to identify even the newest spam attacks without manual tuning or administrator intervention.

My Review

In this commercial whitepaper the authors classify anti-spam filters into three generations.
1st generation – Basic filtering:

  • Signature-based: Compare messages to known spam
  • Challenge/response: Require the sender to respond
  • Text pattern matching: Search for spam keywords
  • RBLs, collaborative: Check messages against RBLs and other public anti-spam resources


  • Low false positives
  • Low effectiveness
  • Easily fooled by evolving techniques

2nd generation – Heuristics/Bayesian

  • Linear models
  • Simple word match
  • Heuristic rules: Apply rules of thumb to assign a spam score


  • High false positives
  • High administration
  • Effectiveness decays over time

3rd generation – Machine learning

  • Logistic regression
  • Support vector machines
  • Integrated reputation


  • Immune to evolving attacks
  • High effectiveness without decay
  • Low false positives
  • Low administration

The authors believe that developing anti-spam methodologies today requires intelligent approaches that adapt automatically to new spam. Proofpoint MLX’s detection method uses logistic regression to find dependencies between spam attributes, then assigns a weight to each attribute and calculates the net effect of the attribute weights.
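A minimal sketch of this kind of logistic-regression scoring; the attribute names and weights below are illustrative assumptions of mine, not Proofpoint’s actual model:

```python
import math

def spam_probability(attributes, weights, bias=0.0):
    """Logistic-regression scoring: each attribute's learned weight
    contributes to a net score, and the logistic (sigmoid) transform of
    that score is the spam probability."""
    score = bias + sum(weights.get(a, 0.0) * v for a, v in attributes.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights: positive pushes toward spam, negative away from it.
weights = {"contains_pill_keyword": 4.0, "sender_in_address_book": -3.0}
```

The training step (not shown) would fit these weights over millions of labeled messages; scoring a new message is then just this cheap weighted sum, which is what lets the system re-tune itself automatically as new attacks appear.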

I am not much interested in this paper, so I have only summarized many parts. Image-based spam is another issue detected by this new method. Spam systems that send spammy images use the techniques below to bypass current methods:

  • Randomized Image Borders
  • Randomized Pixels Between Paragraphs
  • Randomized Colortable Entries to Obfuscate the Image
  • Animated GIF with Embedded Spam Image
  • Image Segmentation
  • OCR-resistant Images
  • Combining Image-based and Text-based Techniques

The suggested method uses the techniques below to filter image-based spam:

  • Fuzzy matching for obfuscated images
  • Dynamic spam image detection
  • Animated GIF spam detection
  • Dynamic botnet protection

The authors talk about a layering system (logistic regression) that filters email, but they never mention the running time of this task. Personally, I think it takes a lot of CPU and memory.

Review: Google Patent on Web Spam

March 6, 2008
Author: Bill Slawski
Year: –
Published in: Available on author’s blog.
Importance: Medium

My Review

What I found in this blog article is that Google may cluster pages to find spam and manipulative documents. Clusters contain both interlinked pages and doorway pages. So, to determine whether or not a page is manipulative, Google considers both local and non-local pages, and grows the cluster as it finds more doorway pages. After building the cluster, a manipulative signal is computed for each inner and outer document. The signals are not clearly enumerated, but some of them include:

  1. The text of the document (repeated, long, …)
  2. Meta tags (repeated, long, …)
  3. Redirects (e.g. a script redirects the page to another page)
  4. Similarly colored text and background
  5. History of the document (new owner)
  6. Anchor text (more links than text)

To sum up, all of these signals are combined into an overall signal. Based on a threshold (as discussed in article: ), if a page is marked as manipulative the following actions may be taken:

  • Lowering the page’s rank
  • Removing the page entirely
  • Other treatments and penalties
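The thresholding step can be sketched as follows; the signal names, weights, and threshold values are my own illustrative assumptions, not figures from the patent:

```python
# Hypothetical per-signal weights; the patent does not publish actual values.
WEIGHTS = {
    "repeated_text": 2.0,
    "repeated_meta_tags": 1.5,
    "script_redirect": 3.0,
    "hidden_text": 3.0,        # similarly colored text and background
    "new_owner": 1.0,
    "anchor_heavy": 1.0,       # more links than text
}

def manipulation_score(signals):
    """Combine a document's detected manipulative signals into one overall score."""
    return sum(WEIGHTS[s] for s in signals if s in WEIGHTS)

def action_for(signals, demote_at=3.0, remove_at=6.0):
    """Choose a treatment by comparing the overall signal to thresholds."""
    score = manipulation_score(signals)
    if score >= remove_at:
        return "remove"
    if score >= demote_at:
        return "demote"
    return "none"
```

The key idea the article describes survives even in this toy form: no single signal condemns a page; it is the weighted aggregate across the cluster that crosses (or does not cross) the threshold.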