New homepage

January 29, 2010

My new homepage is accessible at:

From now on you can find my blog posts on my new homepage.


My new homepage

September 8, 2008

My new homepage is accessible at:

From now on you can find my blog posts on my new homepage.

Review: Improving Cloaking Detection Using Search Query Popularity and Monetizability

May 16, 2008
Authors: Kumar Chellapilla, David Maxwell Chickering
Year: 2006
Published in: 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb)
Importance: High


Cloaking is a search engine spamming technique used by some
Web sites to deliver one page to a search engine for indexing
while serving an entirely different page to users browsing the site.
In this paper, we show that the degree of cloaking among search
results depends on query properties such as popularity and
monetizability. We propose estimating query popularity and
monetizability by analyzing search engine query logs and online
advertising click-through logs, respectively. We also present a
new measure for detecting cloaked URLs that uses a normalized
term frequency ratio between multiple downloaded copies of Web
pages. Experiments are conducted using 10,000 search queries
and 3 million associated search result URLs. Experimental results
indicate that while only 73.1% of the cloaked popular search
URLs are spam, over 98.5% of the cloaked monetizable search
URLs are spam. Further, on average, the search results for top 2%
most cloaked queries are 10x more likely to be cloaking than
those for the bottom 98% of the queries.

My Review

This paper presents a new method for detecting cloaking. The authors describe cloaking as a hiding technique that spam pages use to show search engine crawlers a different page from the one real users see. In this way spam page designers can trick the crawler and get higher positions in the search engine ranking. The authors claim that monetizability is a goal of web spam. They used the MSN search engine as their primary data set.

According to their results, 98.5% of the cloaked monetizable search URLs are spam, compared with 73.1% of the cloaked popular search URLs.
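To make the idea of comparing term frequencies across multiple downloaded copies more concrete, here is a minimal sketch. It is not the paper's measure: the exact formula is not reproduced in this post, so the scoring function, the smoothing constant 0.05, and all function names are assumptions made for illustration. The intuition is that the difference between the crawler's copy and the browser's copy only matters if it is larger than the difference between two copies fetched the same way, which accounts for ordinary dynamic content.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercased word counts for one downloaded copy of a page."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def tf_difference(copy_a, copy_b):
    """Fraction of term occurrences the two copies do not share (0.0 to 1.0)."""
    counts_a, counts_b = term_counts(copy_a), term_counts(copy_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0
    diff = sum(abs(counts_a[t] - counts_b[t]) for t in set(counts_a) | set(counts_b))
    return diff / total


def cloaking_score(crawler_copy_1, crawler_copy_2, browser_copy):
    """Crawler-vs-browser difference, normalized by crawler-vs-crawler difference.

    Assumption: a page that changes on every download (news, ads) should not
    be flagged just because the browser copy differs from the crawler copy.
    """
    across = tf_difference(crawler_copy_1, browser_copy)
    within = tf_difference(crawler_copy_1, crawler_copy_2)
    return across / (within + 0.05)  # 0.05 is an arbitrary smoothing constant


if __name__ == "__main__":
    crawler = "history of paris travel guide"
    print(cloaking_score(crawler, crawler, "buy pills online now"))
    print(cloaking_score(crawler, crawler, crawler))
```

A high score suggests the crawler and the browser were served different pages; where to put the threshold is a separate question that this sketch does not answer.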

Review: Spam Double-Funnel: Connecting Web Spammers with Advertisers

May 12, 2008
Authors: Yi-Min Wang, Ming Ma, Yuan Niu, Hao Chen
Year: 2007
Published in: Proceedings of the 16th international conference on World Wide Web
Importance: High


Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords, one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

My Review

In this paper the authors model the end-to-end spamming business in an interesting way, as a five-layer double funnel. The layers are as follows:

  1. Doorway: spammers set up doorway pages.
  2. Redirection domain: spammers set up redirection domains.
  3. Aggregators: insulate the layers below from the spam pages.
  4. Syndicator: buys traffic from aggregators.
  5. Advertiser: pays syndicators to display their ads.
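The first two layers can be illustrated with a small sketch of how many doorway pages funnel into a few redirection domains. This is only an illustration under my own assumptions, not the authors' methodology: the input is a hypothetical list of (doorway URL, redirect target URL) pairs that some crawler has already collected, and the function names and the `min_doorways` threshold are invented for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_redirection_domain(observed_redirects):
    """Map each third-party redirection domain to the doorway pages feeding it."""
    funnel = defaultdict(set)
    for doorway_url, target_url in observed_redirects:
        doorway_host = urlparse(doorway_url).hostname
        target_host = urlparse(target_url).hostname
        # Only redirects that leave the doorway's own host are of interest.
        if target_host and target_host != doorway_host:
            funnel[target_host].add(doorway_url)
    return dict(funnel)


def prominent_domains(funnel, min_doorways=2):
    """Redirection domains fed by at least `min_doorways` doorways, largest first."""
    ranked = sorted(funnel.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(domain, len(pages)) for domain, pages in ranked if len(pages) >= min_doorways]


if __name__ == "__main__":
    observed = [
        ("http://blog-a.example.com/pills", "http://redirector.example.net/in?k=1"),
        ("http://blog-b.example.com/cheap", "http://redirector.example.net/in?k=2"),
        ("http://blog-c.example.org/page", "http://other.example.net/go"),
        ("http://site.example.org/a", "http://site.example.org/b"),  # same host, ignored
    ]
    print(prominent_domains(group_by_redirection_domain(observed)))
```

A redirection domain that collects traffic from many unrelated doorways is the kind of "prominent domain" the abstract talks about.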

Five top spam keywords:

  1. Phentermine
  2. Viagra
  3. Cialis
  4. Tramadol
  5. Xanax

More coming soon…

Review: A Taxonomy of JavaScript Redirection Spam

May 12, 2008
Authors: Kumar Chellapilla, Alexey Maykov
Year: 2007
Published in: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Importance: High


Redirection spam presents a web page with false content to a crawler for indexing, but automatically redirects the browser to a different web page. Redirection is usually immediate (on page load) but may also be triggered by a timer or a harmless user event such as a mouse move. JavaScript redirection is the most notorious of redirection techniques and is hard to detect as many of the prevalent crawlers are script-agnostic. In this paper, we study common JavaScript redirection spam techniques on the web. Our findings indicate that obfuscation techniques are very prevalent among JavaScript redirection spam pages. These obfuscation techniques limit the effectiveness of static analysis and static feature based systems. Based on our findings, we recommend a robust counter measure using a light weight JavaScript parser and engine.

My Review

This paper only demonstrates the problem of redirection spam, without offering any solutions. The authors divide redirection spam into three categories:

  1. HTTP status code
  2. META refresh
  3. JavaScript

They mention that detecting types 1 and 2 is very simple, but little work has been done on type 3.
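To show why the first two types are considered simple, here is a minimal sketch that needs nothing more than the response status code and a regular expression over the HTML. The function name and the regular expression are my own assumptions for illustration, and a real crawler would want a proper HTML parser rather than a regex.

```python
import re

# HTTP status codes that tell the client to go to another URL.
REDIRECT_STATUS_CODES = {301, 302, 303, 307, 308}

# Matches tags such as: <meta http-equiv="refresh" content="0; url=...">
META_REFRESH = re.compile(
    r"<meta[^>]*http-equiv\s*=\s*[\"']?refresh[\"']?[^>]*>", re.IGNORECASE
)


def redirect_type(status_code, html):
    """Return 'http-status', 'meta-refresh', or None for a fetched page."""
    if status_code in REDIRECT_STATUS_CODES:
        return "http-status"
    if META_REFRESH.search(html):
        return "meta-refresh"
    return None


if __name__ == "__main__":
    page = '<html><head><meta http-equiv="refresh" content="0; url=http://example.com/"></head></html>'
    print(redirect_type(302, ""))
    print(redirect_type(200, page))
    print(redirect_type(200, "<p>hello</p>"))
```

Both signals are visible without executing anything on the page, which is exactly what is not true for type 3.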

They used Blogspot as their data set and tried to measure the amount of type 3 redirection spam.

The main problem with type 3 redirection spam comes from the nature of JavaScript. By using a variety of techniques, spammers can hide the page they redirect to (e.g. one can encrypt the redirection script).
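A small sketch of why static analysis struggles here. The two JavaScript snippets below are invented examples held in Python strings, and the "detector" is a deliberately naive keyword search that I made up for illustration. Splitting a string literal is already enough to defeat it, and the folding step that repairs this one trick is far from the lightweight JavaScript parser and engine that the abstract recommends.

```python
import re

# Naive static feature: does the script mention `location` at all?
LOCATION_TOKEN = re.compile(r"\blocation\b")

# Two adjacent string literals joined with `+`, e.g. "loc" + "ation".
STRING_CONCAT = re.compile(r'"([^"]*)"\s*\+\s*"([^"]*)"')


def looks_like_redirect(script):
    """Keyword-based check on the raw script text."""
    return bool(LOCATION_TOKEN.search(script))


def fold_string_concat(script):
    """Merge "a" + "b" into "ab" until nothing changes (handles this one trick only)."""
    previous = None
    while previous != script:
        previous = script
        script = STRING_CONCAT.sub(r'"\1\2"', script)
    return script


PLAIN = 'window.location = "http://spam.example.com/";'
OBFUSCATED = 'var w = window; w["loc" + "ation"]["hr" + "ef"] = "http://spam.example.com/";'

if __name__ == "__main__":
    print(looks_like_redirect(PLAIN))                           # True
    print(looks_like_redirect(OBFUSCATED))                      # False
    print(looks_like_redirect(fold_string_concat(OBFUSCATED)))  # True
```

Every new obfuscation trick (escaped characters, `eval`, encrypted payloads) would need another hand-written rule, which is why actually executing the script is the more robust approach.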


As the paper suggests, there are other categories of redirection, such as server-side redirection, so future studies in this area would be interesting.


Review: Fighting Spam on Social Web Sites

May 9, 2008
Authors: Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina
Year: 2007
Published in: IEEE Computer Society
Importance: High


In recent years, social Web sites have become important components of the
Web. With their success, however, has come a growing influx of spam. If left
unchecked, spam threatens to undermine resource sharing, interactivity, and
openness. This article surveys three categories of potential countermeasures —
those based on detection, demotion, and prevention. Although many of these
countermeasures have been proposed before for email and Web spam, the
authors find that their applicability to social Web sites differs. How should we
evaluate spam countermeasures for social Web sites, and what future challenges
might we face?

My Review

This is one of the most beautiful papers I have read recently. The authors classify the current literature in the web spam field very well, then try to demonstrate how each anti-spam method works on their example of a social community website (social bookmarking).

More dynamic content means more avenues for spamming!

Three main anti-spam strategies:

  1. Detection-based: text classification, link analysis, user behavior analysis, …
  2. Prevention-based: CAPTCHA, account fees, proof of work, …
  3. Demotion-based: spam-hardened queries, rank-based, …

It is clear that little work has been done on strategies 2 and 3.
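As an illustration of the prevention-based idea, here is a minimal proof-of-work sketch in the hashcash style. This is my own example, not something taken from the paper: the challenge string, the choice of SHA-256, and the default difficulty are all assumptions. The point is only that posting becomes slightly expensive for the client while staying cheap for the site to check.

```python
import hashlib
from itertools import count


def _digest_value(challenge, nonce):
    """SHA-256 of 'challenge:nonce', interpreted as a 256-bit integer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def solve(challenge, difficulty_bits=12):
    """Client side: search for a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        if _digest_value(challenge, nonce) < target:
            return nonce


def verify(challenge, nonce, difficulty_bits=12):
    """Server side: a single hash is enough to check the client's work."""
    return _digest_value(challenge, nonce) < (1 << (256 - difficulty_bits))


if __name__ == "__main__":
    challenge = "post-bookmark-42"  # hypothetical per-request challenge
    nonce = solve(challenge)
    print(nonce, verify(challenge, nonce))
```

With 12 bits the client needs about 4,000 hashes on average, which a human will not notice but a spammer posting millions of bookmarks will.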

Two ingredients for evaluating a method:

  1. Spam model: captures whether content is spam or not.
  2. Spam metric: provides a quantitative assessment of how spam affects a particular interface.
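To show how the two ingredients fit together, here is a minimal sketch of one possible spam metric: the fraction of spam among the top k results that an interface shows. This is a hypothetical metric chosen for illustration, not necessarily the one used in the paper, and `is_spam` stands in for whatever spam model labels the content.

```python
def spam_at_k(ranked_items, is_spam, k=10):
    """Fraction of the first k displayed items that the spam model labels as spam."""
    top = ranked_items[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if is_spam(item)) / len(top)


if __name__ == "__main__":
    # Hypothetical result list for one tag query on a bookmarking site.
    results = ["doc-a", "spam-1", "doc-b", "spam-2", "doc-c"]
    known_spam = {"spam-1", "spam-2"}  # labels supplied by the spam model
    print(spam_at_k(results, known_spam.__contains__, k=4))  # 0.5
```

The same metric can then be computed before and after applying a countermeasure to see how much it helps for that interface.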

Two spam models:

  1. Synthetic spam model: makes assumptions and defines malicious behavior.
  2. Trace-driven spam model: based on real data and positive/negative examples of spam content.

Since the content of social community websites is updated very rapidly, the authors used the synthetic model to develop their detection model.

All in all, the value of their work lies in its classification rather than in proposing a new method.

Review: Opinion Spam and Analysis

May 5, 2008
Authors: Nitin Jindal and Bing Liu
Year: 2008
Published in: Proceedings of the international conference on Web search and web data mining
Importance: High


Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. In the past two years, several startup companies also appeared which aggregate opinions from product reviews. It is thus high time to study spam in reviews. To the best of our knowledge, there is still no published study on this topic, although Web spam and email spam have been investigated extensively. We will see that opinion spam is quite different from Web spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from, we show that opinion spam in reviews is widespread. This paper analyzes such spam activities and presents some novel techniques to detect them

My Review

As the authors claim, review spam is currently a unique area and differs from other types of web spam, as discussed in Review: Review Spam Detection by the same authors. They classify review spam into three types:

  • Type 1: Deliberately misleading reviews. Hard to detect.
  • Type 2: Reviews on brands only, not products.
  • Type 3: Non-reviews such as ads, questions and answers, …

Their test environment contains 5.8 million reviews.

As discussed in the previous post, type 1 reviews are hard to detect. The authors propose a new way to study this problem. They first find which reviews are harmful, where harmful means reviews that differ from the other reviews on a product page.
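One simple way to read "differs from the other reviews on a product page" is as a rating deviation. The sketch below is my own simplification, not the authors' definition: it compares each review's star rating with the mean rating of the remaining reviews of the same product, and the threshold of 2.0 stars is an arbitrary assumption.

```python
def rating_deviations(ratings):
    """For each review, its rating minus the mean rating of the other reviews."""
    n = len(ratings)
    if n < 2:
        return [0.0] * n
    total = sum(ratings)
    return [r - (total - r) / (n - 1) for r in ratings]


def outlier_reviews(ratings, threshold=2.0):
    """Indexes of reviews whose rating is at least `threshold` stars away from the rest."""
    return [i for i, d in enumerate(rating_deviations(ratings)) if abs(d) >= threshold]


if __name__ == "__main__":
    product_ratings = [5, 5, 4, 5, 1]  # hypothetical star ratings for one product
    print(rating_deviations(product_ratings))
    print(outlier_reviews(product_ratings))  # [4]
```

A deviating review is not necessarily spam, of course; it only marks the reviews that could do the most damage if they were.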

Their model proposes 36 features for reviews, reviewers, and products. These features are first used to detect type 2 and type 3 (duplicate) reviews, and the result is then used as a training set for detecting type 1 reviews.

Their result, based on AUC evaluation, is 98% for type 2 and type 3 spam and 78% for type 1.
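For readers who have not met AUC before, it can be read as the probability that a randomly chosen spam review gets a higher score than a randomly chosen non-spam review. The sketch below computes it directly from that definition; the labels and scores are made-up numbers, and the pairwise loop is only meant for small examples.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]          # 1 = spam review, 0 = normal review (hypothetical)
    scores = [0.9, 0.4, 0.5, 0.1]  # classifier scores (hypothetical)
    print(auc(labels, scores))     # 0.75
```

An AUC of 0.5 is what random guessing gives, so 98% is close to perfect separation, while 78% leaves plenty of room for improvement.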


  1. They try to distinguish good and bad products based on product ratings. This assumption may not be suitable, since some of the ratings are themselves spam. Although they mention this drawback, using a data mining method to discover which products are good and which are bad would be recommended.
  2. They do not mention the cost/benefit of their model.
  3. Other types of features that are not accessible from the user-facing pages might improve the results.

Important terms