May 16, 2008
Authors: Kumar Chellapilla, David Maxwell Chickering
Published in: 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb)
Cloaking is a search engine spamming technique used by some
Web sites to deliver one page to a search engine for indexing
while serving an entirely different page to users browsing the site.
In this paper, we show that the degree of cloaking among search
results depends on query properties such as popularity and
monetizability. We propose estimating query popularity and
monetizability by analyzing search engine query logs and online
advertising click-through logs, respectively. We also present a
new measure for detecting cloaked URLs that uses a normalized
term frequency ratio between multiple downloaded copies of Web
pages. Experiments are conducted using 10,000 search queries
and 3 million associated search result URLs. Experimental results
indicate that while only 73.1% of the cloaked popular search
URLs are spam, over 98.5% of the cloaked monetizable search
URLs are spam. Further, on average, the search results for top 2%
most cloaked queries are 10x more likely to be cloaking than
those for the bottom 98% of the queries.
The view presented in this paper is that the authors use a new method to detect cloaking. From the authors’ point of view, cloaking is a hiding technique used by spam pages to show search engine crawlers a different page than the one real users see. In this way, spam page designers can trick crawlers into giving their pages higher positions in search engine rankings. The authors claim that monetizability is a goal of Web spam. They used the MSN search engine as their primary data set.
The outcome of their work is that over 98.5% of cloaked monetizable search URLs are spam, compared with 73.1% of cloaked popular search URLs.
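To make the term-frequency idea from the abstract more concrete, here is a minimal sketch in Python. It is not the authors' exact measure: the tokenizer, the `tf_difference` formula, the `cloaking_score` ratio, and the `smoothing` constant are all assumptions made for illustration. The idea is to compare copies fetched as a crawler with copies fetched as a browser, and to normalize that difference by how much the page changes between fetches by the same client.

```python
from collections import Counter
from itertools import combinations
import re


def term_freqs(text):
    """Count lowercase alphanumeric terms in a page's text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def tf_difference(a, b):
    """Term-frequency difference between two texts, scaled to [0, 1]."""
    fa, fb = term_freqs(a), term_freqs(b)
    total = sum(fa.values()) + sum(fb.values())
    if total == 0:
        return 0.0
    return sum(abs(fa[t] - fb[t]) for t in set(fa) | set(fb)) / total


def _mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def cloaking_score(crawler_copies, browser_copies, smoothing=0.05):
    """Cross-client difference divided by same-client difference.

    `smoothing` is a hypothetical constant that avoids division by zero
    for fully static pages.
    """
    cross = _mean(
        tf_difference(c, b) for c in crawler_copies for b in browser_copies
    )
    within = _mean(
        [tf_difference(x, y) for x, y in combinations(crawler_copies, 2)]
        + [tf_difference(x, y) for x, y in combinations(browser_copies, 2)]
    )
    return cross / (within + smoothing)


if __name__ == "__main__":
    crawler = ["cheap pills buy pills now"] * 2
    browser = ["welcome to my garden blog"] * 2
    print(cloaking_score(crawler, browser))
```

A page that simply changes on every fetch gets similar cross-client and same-client differences, so its score stays low; a cloaked page differs mainly across clients, so its score is high.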
May 12, 2008
Authors: Yi-Min Wang, Ming Ma, Yuan Niu, Hao Chen
Published in: Proceedings of the 16th international conference on World Wide Web
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords, one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.
In this paper the authors interestingly model the end-to-end spamming business as a five-layer double funnel. The layers are as follows:
- Doorway: spammers set up doorway pages.
- Redirection domain: spammers set up redirection domains.
- Aggregators: insulate the layers below from the spam pages.
- Syndicator: buys traffic from aggregators.
- Advertiser: pays syndicators to display their ads.
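The abstract's observation that spam pages can be identified by the third-party domains they redirect to can be sketched as a simple grouping step. This is only an illustration in Python, not the paper's methodology: the function name, the input format (pairs of doorway URL and final URL), and the use of plain hostnames instead of registered domains are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host(url):
    """Return the lowercase hostname of a URL, or an empty string."""
    return urlsplit(url).hostname or ""


def group_by_redirect_target(observations):
    """Group doorway URLs by the third-party host they redirect to.

    `observations` is a list of (doorway_url, final_url) pairs, assumed
    to have been collected by following each page's redirects.
    Redirects that stay on the same host are ignored.
    """
    groups = defaultdict(set)
    for doorway_url, final_url in observations:
        source, target = host(doorway_url), host(final_url)
        if target and target != source:
            groups[target].add(doorway_url)
    return dict(groups)


if __name__ == "__main__":
    observed = [
        ("http://door1.example/page", "http://ads.example.net/landing"),
        ("http://door2.example/x", "http://ads.example.net/other"),
        ("http://honest.example/a", "http://honest.example/b"),
    ]
    for target, doorways in group_by_redirect_target(observed).items():
        print(target, len(doorways))
```

Hosts that collect traffic from many unrelated doorway pages are candidates for the redirection-domain layer.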
Five top spam keywords:
More coming soon…
May 5, 2008
Authors: Nitin Jindal and Bing Liu
Published in: Proceedings of the international conference on Web search and web data mining
Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. In the past two years, several startup companies also appeared which aggregate opinions from product reviews. It is thus high time to study spam in reviews. To the best of our knowledge, there is still no published study on this topic, although Web spam and email spam have been investigated extensively. We will see that opinion spam is quite different from Web spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. This paper analyzes such spam activities and presents some novel techniques to detect them.
As the authors claim, spam reviews are currently a unique area and differ from other types of Web spam, as discussed in Review: Review Spam Detection by the same authors. They classify review spam into 3 types:
- Type 1: deliberately misleading reviews. Hard to detect.
- Type 2: reviews on brands only, not on products.
- Type 3: non-reviews such as ads, questions and answers, …
Their test environment is Amazon.com with 5.8 million reviews.
As discussed in the previous post, Type 1 reviews are hard to detect. The authors propose a new way to study this problem. They first find which reviews are harmful, where harmful means reviews that differ from the other reviews on a product page.
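One simple way to picture a review that differs from the others on a product page is to compare its rating with the average of the remaining ratings. The Python sketch below is only an illustration of that idea; using the star rating alone and the function name `rating_deviation` are assumptions, not the authors' definition.

```python
from statistics import mean


def rating_deviation(ratings):
    """For each rating, return its absolute distance from the mean of
    all other ratings on the same product page."""
    deviations = []
    for i, rating in enumerate(ratings):
        others = ratings[:i] + ratings[i + 1:]
        deviations.append(abs(rating - mean(others)) if others else 0.0)
    return deviations


if __name__ == "__main__":
    page_ratings = [5, 5, 5, 1]
    print(rating_deviation(page_ratings))
```

In this example the single 1-star review deviates most from the rest, so it would be the first candidate to inspect.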
Their model proposes 36 features for reviews, reviewers, and products. These features are first used to detect type 2 and type 3 (duplicate) reviews, and the result is then used as a training set for detecting type 1 reviews.
Their result, based on AUC evaluation, is 98% for type 2 and 3 spam and 78% for type 1.
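For readers unfamiliar with AUC: it is the probability that a randomly chosen spam review receives a higher score than a randomly chosen non-spam review, with ties counted as half. A minimal Python sketch of that definition follows; the labels and scores in the example are made up for illustration and are not data from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for spam and 0 for non-spam; `scores` holds the
    classifier's spam scores in the same order.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the reported 98% and 78% in context.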
- They try to separate good and bad products based on product rating. This assumption may not be suitable, since some ratings are themselves spam. Although they mention this drawback, using a data mining method to discover which products are good and which are bad is recommended.
- They do not mention the cost/benefit of their model.
- Other types of features that are not accessible from the user-facing pages might improve the results.
March 8, 2008
Authors: Proofpoint MLX
Published in: Proofpoint MLX Whitepaper
Mounting an effective defense against spam requires detection techniques that can evolve as quickly as the attacks themselves. Without the ability to automatically adapt to detect new types of threats, an anti-spam solution will always be a step behind the spammers. Proofpoint MLX™ technology leverages machine learning techniques to provide a revolutionary spam detection system that analyzes millions of messages to automatically adjust its detection algorithms to identify even the newest spam attacks without manual tuning or administrator intervention.
In this commercial whitepaper, the authors classify anti-spam filters into three generations:
1st generation – Basic filtering:
- Signature Based: Compare messages to known spam
- Challenge/response: Require sender to respond
- Text pattern matching: Search for spam keywords
- RBLs, collaborative: Check messages against RBLs and other public anti-spam resources
- Low false positive
- Low effectiveness
- Easily fooled by evolving techniques
2nd generation – heuristics/Bayesian
- Linear models
- Simple word match
- Heuristic rules: Apply rules of thumb to assign a spam score
- High false positive
- High administration
- Effectiveness decay over time
3rd generation – Machine learning
- Logistic regression
- Support vector machine
- Integrated reputation
- Immune to evolving attacks
- High effectiveness without decay
- Low false positive
- Low administration
The authors believe that developing anti-spam methodologies today requires intelligent approaches that adapt automatically to new spam.
The Proofpoint MLX detection method uses logistic regression to find dependencies between spam attributes, then assigns a weight to each attribute and calculates the net effect of the weighted attributes.
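The general shape of a logistic regression score can be sketched in a few lines of Python. This is not Proofpoint's implementation: the attribute names, weights, and bias below are hypothetical values chosen for illustration, whereas a real system would learn them from labeled messages.

```python
import math

# Hypothetical attribute weights; positive values push toward spam,
# negative values push toward legitimate mail.
WEIGHTS = {
    "has_suspicious_url": 1.8,
    "all_caps_subject": 0.9,
    "known_sender": -2.5,
}
BIAS = -1.0  # hypothetical intercept


def spam_probability(attributes, weights=WEIGHTS, bias=BIAS):
    """Combine weighted attributes into one probability.

    The net effect is the bias plus the weighted sum of attribute
    values, passed through the logistic (sigmoid) function.
    """
    z = bias + sum(
        weights.get(name, 0.0) * value for name, value in attributes.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    message = {"has_suspicious_url": 1, "all_caps_subject": 1}
    print(round(spam_probability(message), 3))
```

Because every attribute contributes through a single weighted sum, one strong legitimate signal can offset several weak spam signals, and vice versa.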
I am not much interested in this paper, so I have only summarized many parts. Image-based spam detection is another issue addressed by this new method. Systems that send spam images use the techniques below to bypass current methods:
- Randomized Image Borders
- Randomized Pixels Between Paragraphs
- Randomized Colortable Entries to Obfuscate the Image
- Animated GIF with Embedded Spam Image
- Image Segmentation
- OCR-resistant Images
- Combining Image-based and Text-based Techniques
The suggested method uses the techniques below to filter image-based spam:
- Fuzzy matching for obfuscated images
- Dynamic spam image detection
- Animated GIF spam detection
- Dynamic botnet protection
The authors talk about a layered system (logistic regression) that filters email, but they never mention the running time of this task. Personally, I think it takes a lot of CPU and memory.
March 6, 2008
Author: Bill Slawski
Published in: Available on author’s blog.
What I found in this blog article is that Google may cluster pages to find spam and manipulative documents. Clusters contain both interlinked and doorway pages. So, to determine whether or not a page is manipulative, Google considers both local and non-local pages and grows the cluster when it finds more doorway pages. After the cluster is built, the manipulative signals of each inner and outer document are counted. These signals are not clearly specified, but some of them include:
- The text of document (repeated, long, …)
- Meta Tags (repeated, long, …)
- Redirects (a script that redirects the page to another page)
- Similarly colored text and background
- History of document (new owner)
- Anchor text (links more than text)
To sum up, all of these signals are counted, resulting in an overall signal. Based on a threshold (as discussed in article: ), if a page is marked as manipulative, the actions below will be taken:
- Lowering page rank
- Removing the page entirely
- Other treatments and measures
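The counting-and-threshold step described above can be sketched as follows. This Python example is only an illustration: the signal names mirror the list earlier in this post, while the equal weighting and the threshold of 3 are assumptions, since the article does not give concrete values.

```python
# Signal names follow the list in this post; weighting every signal
# equally is an assumption made for illustration.
SIGNALS = (
    "repeated_text",
    "repeated_meta_tags",
    "script_redirect",
    "hidden_text",
    "new_owner",
    "mostly_anchor_text",
)


def overall_signal(page_signals):
    """Count how many known manipulation signals a page triggers."""
    return sum(1 for name in SIGNALS if page_signals.get(name, False))


def is_manipulative(page_signals, threshold=3):
    """Mark a page as manipulative when the count reaches the
    (hypothetical) threshold."""
    return overall_signal(page_signals) >= threshold


if __name__ == "__main__":
    page = {"repeated_text": True, "script_redirect": True, "hidden_text": True}
    print(overall_signal(page), is_manipulative(page))
```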