My new homepage is accessible at:
From now on you can find my blog posts on my new homepage.
Published in: 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb) Link: http://airweb.cse.lehigh.edu/2006/chellapilla.pdf Importance: High
Cloaking is a search engine spamming technique used by some
Web sites to deliver one page to a search engine for indexing
while serving an entirely different page to users browsing the site.
In this paper, we show that the degree of cloaking among search
results depends on query properties such as popularity and
monetizability. We propose estimating query popularity and
monetizability by analyzing search engine query logs and online
advertising click-through logs, respectively. We also present a
new measure for detecting cloaked URLs that uses a normalized
term frequency ratio between multiple downloaded copies of Web
pages. Experiments are conducted using 10,000 search queries
and 3 million associated search result URLs. Experimental results
indicate that while only 73.1% of the cloaked popular search
URLs are spam, over 98.5% of the cloaked monetizable search
URLs are spam. Further, on average, the search results for top 2%
most cloaked queries are 10x more likely to be cloaking than
those for the bottom 98% of the queries.
In this paper the authors present a new method for detecting cloaking. From the authors' point of view, cloaking is a hiding technique that spam pages use to show search engine crawlers a different page from the one real users see. In this way spam page designers can trick the crawler and get higher positions in the search engine ranking. The authors claim that monetizability is a goal of web spam. They used the MSN search engine as their primary data set.
Their results show that over 98.5% of the cloaked monetizable search URLs are spam, compared with 73.1% of the cloaked popular search URLs.
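The idea of comparing several downloaded copies of a page can be sketched roughly as follows. This is only an illustration and not the paper's exact measure: the function names, the tokenizer, the smoothing value and the way the ratio is formed are all my own assumptions. The sketch takes two copies fetched as a crawler and two fetched as a browser, and compares the cross-agent difference with the difference between copies fetched by the same agent (which captures ordinary dynamic content).

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word-like tokens (a simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def term_difference(a, b):
    """Normalized term-frequency difference between two page copies, in [0, 1]."""
    ca, cb = term_counts(a), term_counts(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    diff = sum(abs(ca[t] - cb[t]) for t in ca.keys() | cb.keys())
    return diff / total


def cloaking_score(c1, c2, b1, b2, smoothing=0.1):
    """Hypothetical score: cross-agent difference over same-agent difference.

    c1, c2 are copies fetched as a crawler; b1, b2 are copies fetched as a
    browser. The smoothing term (an assumption) avoids division by zero for
    fully static pages.
    """
    cross = (term_difference(c1, b1) + term_difference(c2, b2)) / 2
    within = (term_difference(c1, c2) + term_difference(b1, b2)) / 2
    return cross / (within + smoothing)
```

A page that serves the same text to both agents scores 0, while a page whose crawler copy shares no terms with its browser copy gets a high score. Any threshold on this score would have to be tuned on labeled data.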
Published in: Proceedings of the 16th international conference on World Wide Web
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords, one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.
In this paper the authors interestingly model the end-to-end spamming business as a five-layer double funnel. The layers are as follows:
- Doorway: pages set up by spammers.
- Redirection domain: spammers set up a redirection domain.
- Aggregators: insulate the layers below from the spam pages.
- Syndicators: buy traffic from aggregators.
- Advertisers: pay syndicators to display their ads.
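Since the paper identifies spam pages by the third-party domains they redirect to, the first two layers can be illustrated with a small grouping sketch. This is my own simplification, not the authors' methodology: the function names and the example URLs are hypothetical, and a real analysis would compare registered domains rather than plain hostnames.

```python
from collections import defaultdict
from urllib.parse import urlparse


def host(url):
    """Return the lowercase hostname of a URL, or an empty string if absent."""
    return (urlparse(url).hostname or "").lower()


def group_by_redirection_domain(redirects):
    """Group doorway URLs by the third-party host they redirect to.

    `redirects` is an iterable of (doorway_url, target_url) pairs that were
    observed while visiting search results (hypothetical input format).
    """
    groups = defaultdict(set)
    for doorway, target in redirects:
        source_host, target_host = host(doorway), host(target)
        # Keep only redirects that leave the doorway's own host.
        if target_host and target_host != source_host:
            groups[target_host].add(doorway)
    return groups


observed = [
    ("http://blog-a.example.com/page1", "http://redir.example.net/in?k=1"),
    ("http://blog-b.example.com/page2", "http://redir.example.net/in?k=2"),
    ("http://shop.example.org/old", "http://shop.example.org/new"),
]
groups = group_by_redirection_domain(observed)
```

Many doorways funneling into one redirection host is the "funnel" shape the model describes; the same-host redirect in the last pair is ignored.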
Five top spam keywords:
More coming soon…
Year: 2007 Published in: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Link: http://airweb.cse.lehigh.edu/2007/papers/paper_115.pdf Importance: High
This paper only demonstrates the problem of redirection spam, without proposing any solutions. The authors divide redirection spam into 3 categories:
- HTTP status code
- META refresh
They mention that detecting types 1 and 2 is very simple, but little work has been done on type 3.
They used Blogspot as their data set and tried to measure the amount of type 3 redirection spam.
As the paper suggests, there are other categories of redirection, such as server-side redirection, so future study in this area would be interesting.
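To see why the HTTP status code and META refresh categories are considered easy to find, here is a minimal sketch of a checker for just those two. It is not taken from the paper: the function name and the input format (a status code, a header dictionary with a `Location` key, and the raw HTML) are my assumptions, and the regular expressions only cover common markup.

```python
import re

# Status codes that signal an HTTP-level redirect.
REDIRECT_STATUS = {301, 302, 303, 307, 308}

# A <meta> tag whose http-equiv attribute is "refresh".
META_REFRESH = re.compile(
    r"""<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>""",
    re.IGNORECASE,
)
# The url=... part inside the tag's content attribute.
URL_IN_CONTENT = re.compile(r"""url\s*=\s*["']?([^"'\s>;]+)""", re.IGNORECASE)


def classify_redirect(status, headers, html):
    """Return (kind, target) for the two simple redirect categories.

    Returns (None, None) when neither an HTTP-level redirect nor a META
    refresh with a target URL is found.
    """
    if status in REDIRECT_STATUS and "Location" in headers:
        return ("http_status", headers["Location"])
    tag = META_REFRESH.search(html)
    if tag:
        target = URL_IN_CONTENT.search(tag.group(0))
        if target:
            return ("meta_refresh", target.group(1))
    return (None, None)
```

Both checks only need the raw response, with no need to execute anything on the page, which is what makes these two categories cheap to detect.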
Published in: Proceedings of the international conference on Web search and web data mining Link: http://portal.acm.org/citation.cfm?id=1341560
Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. In the past two years, several startup companies also appeared which aggregate opinions from product reviews. It is thus high time to study spam in reviews. To the best of our knowledge, there is still no published study on this topic, although Web spam and email spam have been investigated extensively. We will see that opinion spam is quite different from Web spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. This paper analyzes such spam activities and presents some novel techniques to detect them
As the authors claim, review spam is currently a unique area, and it differs from other types of web spam, as discussed in Review: Review Spam Detection by the same authors. They classify review spam into 3 types:
- Type 1: Deliberately misleading reviews. Hard to detect.
- Type 2: Reviews of brands only, not of the product.
- Type 3: Non-reviews such as ads, questions and answers, …
Their test environment is Amazon.com with 5.8 million reviews.
As discussed in the previous post, type 1 reviews are hard to detect. The authors propose a new way to study this problem. They first find which reviews are harmful, where harmful means reviews that differ from the other reviews on a product page.
Their model proposes 36 features covering reviews, reviewers and products. These features are first used to detect type 2 and type 3 (duplicate) reviews, and the result is then used as a training set for detecting type 1 reviews.
Their results, evaluated by AUC, are 98% for type 2 and type 3 spam and 78% for type 1.
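For readers who have not met AUC before, it can be computed directly from its pairwise definition: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below is a plain illustration with made-up scores, not the authors' evaluation code.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for spam and 0 for non-spam. Ties between a positive
    and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores for four reviews (1 = spam, 0 = not spam).
example = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The double loop is quadratic in the number of examples, which is fine for a toy case; a rank-based formula would be the usual choice for millions of reviews.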
- They try to distinguish good and bad products based on product rating. This assumption may not be suitable, since some ratings are themselves spam. Although they mention this drawback, using a data mining method to discover which products are good and which are bad is recommended.
- They do not mention the cost/benefit of their model.
- Other types of features that are not accessible from the user-facing front page might improve the results.
- Lift curve
- AUC – Area under ROC curve
- R – http://www.r-project.org
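As a reminder of what a lift curve shows, here is a small sketch under my own assumptions (function name, step count and example data are hypothetical). It ranks examples by score and reports, for each top fraction of the ranking, the share of all positives captured so far.

```python
def lift_curve(scores, labels, steps=10):
    """Return (fraction_of_data, fraction_of_positives_captured) points.

    Examples are ranked by descending score; `labels` holds 1 for positive
    and 0 for negative examples.
    """
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    total_positives = sum(ranked)
    if total_positives == 0:
        raise ValueError("need at least one positive example")
    n = len(ranked)
    points = []
    for i in range(1, steps + 1):
        k = round(n * i / steps)  # number of top-ranked examples to inspect
        points.append((i / steps, sum(ranked[:k]) / total_positives))
    return points


# Made-up scores: both positives are ranked in the top half.
points = lift_curve([0.9, 0.8, 0.7, 0.2], [1, 1, 0, 0], steps=2)
```

A curve that rises steeply at the start means most positives sit near the top of the ranking, which is the behavior one wants from a spam detector.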