April 29, 2008
Author: Nitin Jindal and Bing Liu
Published in: The International World Wide Web Conference Committee
It is now a common practice for e-commerce Web sites to enable
their customers to write reviews of products that they have
purchased. Such reviews provide valuable sources of information
on these products. They are used by potential customers to find
opinions of existing users before deciding to purchase a product.
They are also used by product manufacturers to identify problems
of their products and to find competitive intelligence information
about their competitors. Unfortunately, this importance of reviews
also gives good incentive for spam, which contains false positive
or malicious negative opinions. In this paper, we make an attempt
to study review spam and spam detection. To the best of our
knowledge, there is still no reported study on this problem.
This short two-page paper talks about review spam, which is new to the world of web spam. According to the authors' claim, review spam is different from web and email spam, so we need new methods for detecting it. Review spam is also hard to detect, even manually.
Why hard to detect?
- Similarity to real reviews.
- Not enough meta-data for analysis.
In this paper the authors mainly try to detect duplicate reviews, and they provide a model based on the shingle method. The other types of review spam, as the authors say, are hard to detect, and the outcome of their work on those is small.
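The shingle idea they build on can be sketched like this (a minimal illustration, not their exact model; the shingle size, the threshold, and the sample reviews are my own choices):

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

r1 = "This phone is great and the battery lasts all day long"
r2 = "This phone is great and the battery lasts all day easily"
r3 = "Terrible product, broke after one week of light use"

# Near-duplicate reviews share most shingles; unrelated ones share few.
print(jaccard(shingles(r1), shingles(r2)))  # high -> near-duplicate
print(jaccard(shingles(r1), shingles(r3)))  # low -> unrelated
```

Two reviews whose shingle sets overlap above some threshold would be flagged as duplicates.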
Personally, I think that since it is still hard to detect review spam even manually, we should improve spam prevention methods such as CAPTCHA in order to disallow review spam (Sreview) in the first place. For the time being I have no idea how to detect review spam after it has been posted.
Web Data Mining book by Bing Liu
March 19, 2008
Authors: Shirlee-ann Knight and Janice Burn
Published in: Informing Science Journal Volume 8
The rapid growth of the Internet as an environment for information exchange and the lack of enforceable standards regarding the information it contains has led to numerous information quality problems. A major issue is the inability of Search Engine technology to wade through the vast expanse of questionable content and return “quality” results to a user’s query. This paper attempts to address some of the issues involved in determining what quality is, as it pertains to information retrieval on the Internet. The IQIP model is presented as an approach to managing the choice and implementation of quality related algorithms of an Internet crawling Search Engine.
In this paper the authors discuss the problem of information quality on the WWW from a search engine's perspective. They clearly define the problem and the current solutions. Their proposed model (IQIP) consists of four parts:
- Identify: user, environment and task
- Quantify: Prioritise information quality dimensions
- Implement: implement chosen IQ dimension into Web Crawler
- Perfect: improve crawler through feedback
Their proposed model can be used for attacking spam on the WWW. As my supervisor (Dr. Potdar) suggested, we can make use of this model in anti-spam methods. A simple example:
- Identify: here we study spammers: their subjects, their behaviour, …
- Environment: covers the study of splogs, sforums, spam pages, …
- Task: spam detection based on the spam characteristics that already exist in the literature.
More reviews coming soon…
March 17, 2008
Authors: Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach
Published in: CEAS 2007
Importance: Very High
Recently, spammers have proliferated “image spam”, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.
In the introduction the authors mention that the cornerstone of many anti-spam email systems is content filtering, often based on naive Bayes. Spammers outsmart content filters by:
- Obscuring text
- Obfuscating words with symbols
- Including neutral text
to confuse the filters. Now spammers use a new method: advertising in image attachments instead of text, which neutralizes text-based filtering. So the authors suggest a new method, or better to say new features, for detecting these spam images.
Features extracted from images for spam analysis:
- File Format
- File Size
- Image Metadata: comments, number of images (frames), bits per pixel, progressive flag, color table entries, index value, transparent color, logical height and width, components, bands, …
- Image Size
- Average Color
- Color Saturation
- Edge Detection
- Prevalent Color Coverage
- Random Pixel Test
The authors provide a very interesting equation for selecting features that are both accurate and fast. The selected features are:
- Average Color
- Color Saturation
- Edge Detection
- File Format
- File Size
- Image Metadata
- Image Size
- Prevalent Color Coverage
- Random Pixel Test
The classifiers they evaluate are:
- Maximum Entropy
- Naive Bayes
- ID3 Decision Tree
- Just In Time decision tree
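Some of these features are indeed very cheap to compute directly from raw pixels. A minimal sketch of two of them (my own illustration on a tiny synthetic image, not the authors' code; saturation here uses the simple (max − min) / max rule):

```python
# Synthetic "image": a flat list of (R, G, B) pixels, values in 0..255.
pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (128, 128, 128)]

def average_color(px):
    """Mean R, G, B over all pixels."""
    n = len(px)
    return tuple(sum(p[c] for p in px) / n for c in range(3))

def mean_saturation(px):
    """Mean per-pixel saturation using the (max - min) / max rule."""
    sats = []
    for r, g, b in px:
        mx = max(r, g, b)
        sats.append(0.0 if mx == 0 else (mx - min(r, g, b)) / mx)
    return sum(sats) / len(sats)

print(average_color(pixels))   # mean of each channel
print(mean_saturation(pixels)) # saturated reds/blue pull this up, gray pulls it down
```

Because these need only one pass over the pixels, they fit the paper's goal of classification as fast as possible.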
March 15, 2008
Authors: Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura, Belle L. Tseng
Published in: The International World Wide Web Conference Committee (IW3C2)
This paper focuses on spam blog (splog) detection. Blogs are
highly popular, new media social communication mechanisms.
The presence of splogs degrades blog search results as well as
wastes network resources. In our approach we exploit unique blog
temporal dynamics to detect splogs.
There are three key ideas in our splog detection framework. We
first represent the blog temporal dynamics using self-similarity
matrices defined on the histogram intersection similarity measure
of the time, content, and link attributes of posts. Second, we show
via a novel visualization that the blog temporal characteristics
reveal attribute correlation, depending on type of the blog (normal
blogs and splogs). Third, we propose the use of temporal
structural properties computed from self-similarity matrices across
different attributes. In a splog detector, these novel features are
combined with content based features. We extract a content based
feature vector from different parts of the blog – URLs, post
content, etc. The dimensionality of the feature vector is reduced
by Fisher linear discriminant analysis. We have tested an SVM
based splog detector using proposed features on real world
datasets, with excellent results (90% accuracy).
Interestingly, these two papers are the same work, published in two different conferences under different titles. You can read my review of the first paper here:
Review: The Splog Detection Task and A Solution Based on Temporal and Link Properties
Here, however, the authors try to cover the drawbacks of the previous paper. For example, they provide more figures and explanations for the vague parts of the previous one. The content of the paper is better organized and clearer. What interests me in this paper is their equations and mathematical formulation of the splog features.
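The core quantity behind their self-similarity matrices is the histogram intersection similarity between posts. A rough sketch of how such a matrix could be built (my own simplification; the paper applies this to the time, content, and link attributes of posts, and the vocabulary and values below are invented):

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: the sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def self_similarity_matrix(histograms):
    """S[i][j] = histogram intersection of post i and post j."""
    n = len(histograms)
    return [[histogram_intersection(histograms[i], histograms[j])
             for j in range(n)] for i in range(n)]

# Three posts described by normalized word histograms over a 4-term vocabulary.
posts = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.4, 0.1, 0.0],  # similar to post 0
    [0.0, 0.0, 0.5, 0.5],  # very different
]
for row in self_similarity_matrix(posts):
    print(row)
```

Splogs, being machine-generated, tend to produce much more regular (self-similar) matrices than normal blogs, which is what their temporal structural features try to capture.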
March 8, 2008
Authors: Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng
Published in: The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings
Spam blogs (splogs) have become a major problem in the increasingly popular blogosphere. Splogs are detrimental in that they corrupt the quality of information retrieved and they waste tremendous network and storage resources. We study several research issues in splog detection. First, in comparison to web spam and email spam, we identify some unique characteristics of splog. Second, we propose a new online task that captures the unique characteristics of splog, in addition to tasks based on the traditional IR evaluation framework. The new task introduces a novel time-sensitive detection evaluation to indicate how quickly a detector can identify splogs. Third, we propose a splog detection algorithm that combines traditional content features with temporal and link regularity features that are unique to blogs. Finally, we develop an annotation tool to generate ground truth on a sampled subset of the TREC-Blog dataset. We conducted experiments on both offline (traditional splog detection) and our proposed online splog detection task. Experiments based on the annotated ground truth set show excellent results on both offline and online splog detection tasks.
The authors suggest a new method for detecting splogs based on an online approach, beside the existing offline approaches.
Methods splogs use to deceive search engines:
- relevancy – via keyword stuffing
- popularity – via link farms
- recency – via frequent posts
Characteristics of splogs:
- Machine-generated content
- No value addition
- Hidden agenda, usually an economic goal
Items 1 and 2 are not well classified; their definitions overlap each other.
The authors claim that splogs are different from web spam because of:
- Blogs’ dynamic content
- Non-endorsement links
These two reasons are not sufficient for distinguishing splogs from web spam. Mistakenly or not, the authors actually gave two reasons for distinguishing blogs from other websites. Moreover, the two characteristics above are well known in some kinds of websites, such as news sites that are updated frequently and collect user opinions on each news article.
As the paper goes on, the authors mention that splog content is sometimes copied from non-spam blogs, so current web spam detection cannot catch it. More or less this claim is true, but they should provide good reasons or statistical surveys to support it.
Personally, I think splogs can be detected in five ways:
- Copied content. We can find it by employing text-comparison algorithms.
- Repetitive content. By employing current web spam detection algorithms.
- Link rings. By employing link-farm detection algorithms.
- Sping. By employing spam-ping detection algorithms.
- Design of the blog. Blog visitors can often tell whether or not a blog is spam just by looking at its design. Currently there is no work on this issue. Anyway, I would like to work on it.
In the rest of the paper the authors demonstrate their detection method. They use five features for detecting splogs:
- Tokenized URLs
- Blog and post titles
- Anchor text
- Blog homepage content
- Post content
There is nothing new in these features beyond what the authors claimed before, and they do not clearly state how their detection method differs. Or at least I did not figure it out.
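The five content sources above can all be folded into one bag-of-words vector. A hypothetical sketch (my own illustration, not the authors' pipeline; the field names and sample blog are made up):

```python
import re
from collections import Counter

def tokenize(text):
    """Split on non-alphanumerics, so URLs break into their parts too."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def blog_features(blog):
    """Combine tokens from each part of the blog into one term-count vector."""
    bag = Counter()
    for field in ("url", "title", "anchor_text", "homepage", "post"):
        bag.update(tokenize(blog.get(field, "")))
    return bag

blog = {
    "url": "http://cheap-pills-online.example.com/buy",
    "title": "Cheap pills",
    "post": "Buy cheap pills online now",
}
print(blog_features(blog).most_common(3))  # spammy terms dominate the vector
```

A vector like this would then feed the classifier, with something like Fisher linear discriminant analysis reducing its dimensionality as the paper's sibling work describes.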
March 8, 2008
Author: Proofpoint
Published in: Proofpoint MLX Whitepaper
Mounting an effective defense against spam requires detection techniques that can evolve as quickly as the attacks themselves. Without the ability to automatically adapt to detect new types of threats, an anti-spam solution will always be a step behind the
spammers. Proofpoint MLX™ technology leverages machine learning techniques to provide a revolutionary spam detection system that analyzes millions of messages to automatically adjust its detection algorithms to identify even the newest spam attacks without manual tuning or administrator intervention.
In this commercial whitepaper the authors classify anti-spam filters into three generations.
1st generation – Basic filtering:
- Signature based: compare messages to known spam
- Challenge/response: require the sender to respond
- Text pattern matching: search for spam keywords
- RBLs, collaborative: check messages against RBLs and other public anti-spam resources
Drawbacks:
- High false positives
- Low effectiveness
- Easily fooled by evolving techniques
2nd generation – Heuristics/Bayesian:
- Linear models
- Simple word match
- Heuristic rules: apply rules of thumb to assign a spam score
Drawbacks:
- High false positives
- High administration
- Effectiveness decays over time
3rd generation – Machine learning:
- Logistic regression
- Support vector machines
- Integrated reputation
Benefits:
- Immune to evolving attacks
- High effectiveness without decay
- Low false positives
- Low administration
The authors believe that developing anti-spam methodologies today requires intelligent approaches that adapt automatically to new spam.
Proofpoint MLX's detection method uses logistic regression to find dependencies between spam attributes, assigns a weight to each attribute, and calculates the net effect of the weighted attributes.
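That weighted net effect is just the logistic function applied to a weighted sum of attributes. A minimal sketch (the weights, attribute names, and messages are invented for illustration; MLX's real model is proprietary):

```python
import math

def spam_probability(attributes, weights, bias=0.0):
    """Logistic regression: squash the weighted attribute sum into [0, 1]."""
    z = bias + sum(weights[name] * value for name, value in attributes.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights: positive pushes toward spam, negative away.
weights = {"has_image_attachment": 2.0, "known_sender": -3.0, "spammy_words": 1.5}

msg = {"has_image_attachment": 1, "known_sender": 0, "spammy_words": 1}
ham = {"has_image_attachment": 0, "known_sender": 1, "spammy_words": 0}

print(spam_probability(msg, weights))  # close to 1 -> likely spam
print(spam_probability(ham, weights))  # close to 0 -> likely ham
```

The appeal of this model is that each attribute's contribution is a single multiply-add, so scoring a message is cheap once the weights are learned.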
I am not much interested in this paper, so I summarized many parts. Image-based spam is another issue detected by this new method. Spam systems that send spammy images use the techniques below to bypass current methods:
- Randomized Image Borders
- Randomized Pixels Between Paragraphs
- Randomized Colortable Entries to Obfuscate the Image
- Animated GIF with Embedded Spam Image
- Image Segmentation
- OCR-resistant Images
- Combining Image-based and Text-based Techniques
The suggested method uses the techniques below to filter image-based spam:
- Fuzzy matching for obfuscated images
- Dynamic spam image detection
- Animated GIF spam detection
- Dynamic botnet protection
The authors talk about a layered system (logistic regression) that filters email, but they never mention how long this task takes. Personally, I think it consumes a lot of CPU and memory.