March 15, 2008
Authors: Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura, Belle L. Tseng
Year: 2007
Published in: The International World Wide Web Conference Committee (IW3C2)
Link: http://www2007.org/workshops/paper_112.pdf
Importance: High
Abstract
This paper focuses on spam blog (splog) detection. Blogs are
highly popular, new media social communication mechanisms.
The presence of splogs degrades blog search results as well as
wastes network resources. In our approach we exploit unique blog
temporal dynamics to detect splogs.
There are three key ideas in our splog detection framework. We
first represent the blog temporal dynamics using self-similarity
matrices defined on the histogram intersection similarity measure
of the time, content, and link attributes of posts. Second, we show
via a novel visualization that the blog temporal characteristics
reveal attribute correlation, depending on type of the blog (normal
blogs and splogs). Third, we propose the use of temporal
structural properties computed from self-similarity matrices across
different attributes. In a splog detector, these novel features are
combined with content based features. We extract a content based
feature vector from different parts of the blog – URLs, post
content, etc. The dimensionality of the feature vector is reduced
by Fisher linear discriminant analysis. We have tested an SVM
based splog detector using proposed features on real world
datasets, with excellent results (90% accuracy).
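To make the first idea more concrete, here is a minimal sketch of a self-similarity matrix built with histogram intersection. It is my own illustration, not the authors' code: it covers only the content attribute, it assumes each post is represented by a normalized word histogram, and the function names are hypothetical. The paper also builds such matrices on time and link attributes, which are not shown here.

```python
from collections import Counter


def normalized_histogram(tokens):
    """Turn a list of tokens into a histogram whose values sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1.0 for identical normalized histograms, 0.0 for disjoint ones."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())


def self_similarity_matrix(posts):
    """Compare every post with every other post (assumption: whitespace tokenization)."""
    hists = [normalized_histogram(p.lower().split()) for p in posts]
    n = len(hists)
    return [[histogram_intersection(hists[i], hists[j]) for j in range(n)]
            for i in range(n)]


# Toy data: two identical spam-like posts and one unrelated post.
posts = [
    "cheap pills buy now",
    "cheap pills buy now",
    "went hiking with friends today",
]
S = self_similarity_matrix(posts)
```

In this toy example the two identical posts get a similarity of 1 and the unrelated post gets 0. A blog whose matrix is close to 1 almost everywhere repeats itself over time, which is the kind of regularity the temporal features are meant to capture.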
My Review
Interestingly, these two papers are essentially the same work, published at two different conferences under different titles. You can read my review of the first paper here:
Review: The Splog Detection Task and A Solution Based on Temporal and Link Properties
However, the authors try to address the drawbacks of the previous paper here. For example, they provide more figures and give reasons for the vague parts of the earlier one. The content of the paper is better organized and clearer. What interested me in this paper is the equations and the mathematical formulation of the splog features.
Spam | Tagged: Self-Similarity Detection, Splog |
Posted by Pedi
March 8, 2008
Authors: Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng
Year: 2006
Published in: The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings
Link: http://trec.nist.gov/pubs/trec15/papers/nec.blog.final.pdf
Importance: High
Abstract
Spam blogs (splogs) have become a major problem in the increasingly popular blogosphere. Splogs are detrimental in that they corrupt the quality of information retrieved and they waste tremendous network and storage resources. We study several research issues in splog detection. First, in comparison to web spam and email spam, we identify some unique characteristics of splog. Second, we propose a new online task that captures the unique characteristics of splog, in addition to tasks based on the traditional IR evaluation framework. The new task introduces a novel time-sensitive detection evaluation to indicate how quickly a detector can identify splogs. Third, we propose a splog detection algorithm that combines traditional content features with temporal and link regularity features that are unique to blogs. Finally, we develop an annotation tool to generate ground truth on a sampled subset of the TREC-Blog dataset. We conducted experiments on both offline (traditional splog detection) and our proposed online splog detection task. Experiments based on the annotated ground truth set show excellent results on both offline and online splog detection tasks.
My Review
The authors suggest a new method for detecting splogs based on an online approach, alongside the existing offline approaches.
Methods splogs use to deceive search engines:
- relevancy – via keyword stuffing
- popularity – via link farm
- recency – via frequent posts
Splog characteristics:
- Machine-generated content
- No value addition
- Hidden agenda, usually economic goal
The first two characteristics are not well separated; their definitions overlap.
The authors claim that splogs differ from web spam because of:
- Blogs’ dynamic content
- Non-endorsement links
These two reasons are not sufficient to distinguish splogs from web spam. Mistakenly or not, the authors have given two reasons that distinguish blogs from other websites, rather than splogs from web spam. Moreover, both characteristics are also well known in other kinds of websites, such as news websites, which are updated frequently and collect user opinions on each news article.
Later in the paper, the authors note that splog content is sometimes copied from non-spam blogs, so current web spam detection cannot detect it. This claim is more or less true, but they should provide good reasons or statistical surveys to support it.
Personally, I think splogs can be detected in five ways:
- Copied content. We can find it by employing text comparison algorithms.
- Repetitive content. By employing current web spam detection algorithms.
- Link rings. By employing link farm detection algorithms.
- Sping. By employing spam ping detection algorithms.
- Blog design. Visitors can often tell whether a blog is spam just by looking at its design. Currently there is no work on this issue, and I would like to work on it.
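As a rough sketch of the first item in the list above (copied content), one common text comparison approach is to compare word shingles with Jaccard similarity. This is only my illustration and is not taken from the paper; the shingle size `k`, the `threshold` value, and the function names are assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_copied(candidate, original, k=3, threshold=0.5):
    """Flag a post as copied when its shingle overlap reaches the (assumed) threshold."""
    return jaccard(shingles(candidate, k), shingles(original, k)) >= threshold


original = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
unrelated = "completely unrelated text about cooking pasta tonight"
```

Here `looks_copied(copied, original)` is true and `looks_copied(unrelated, original)` is false. A real detector would need an index over many source posts rather than pairwise comparison, and the threshold would have to be tuned on labeled data.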
In the rest of the paper, the authors present their detection method. They use five features for detecting splogs:
- Tokenized URLs
- Blog and post titles
- Anchor text
- Blog homepage content
- Post content
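To show what the first feature in the list above might look like in practice, here is a small sketch of URL tokenization. The paper does not spell out its exact tokenization rule in the part I reviewed, so splitting the host and path on non-alphanumeric characters is my assumption, and `tokenize_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse


def tokenize_url(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens.

    Assumption: tokens are separated by any non-alphanumeric character
    (dots, hyphens, underscores, slashes).
    """
    parsed = urlparse(url)
    raw = parsed.netloc + " " + parsed.path
    return [t for t in re.split(r"[^a-z0-9]+", raw.lower()) if t]


tokens = tokenize_url("http://cheap-pills-online.example.com/buy_now/index.html")
```

For this made-up URL the tokens are `cheap`, `pills`, `online`, `example`, `com`, `buy`, `now`, `index`, `html`. Tokens like these can then be counted in a bag-of-words vector in the same way as the title, anchor text, and post content features.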
There is nothing new in these features beyond what the authors described earlier, and they do not clearly explain how their detection method differs, or at least I could not figure it out.
Spam | Tagged: Spam detection method, Splog |
Posted by Pedi