Year: 2007 Published in: CEAS 2007
Link: http://www.ceas.cc/2007/papers/paper-06.pdf
Importance: Very High
Abstract
Recently, spammers have proliferated “image spam”, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.
My Review
In the introduction, the authors note that content filtering, often based on naive Bayes, is the cornerstone of many anti-spam email systems. Spammers outsmart content filters by:
- Obscuring text
- Obfuscating words with symbols
- Including neutral text
to confuse the filters. Spammers now use a new method: they put the advertisement in an image attachment instead of in the message text, which neutralizes text-based filtering. The authors therefore propose a new method, or more precisely a new set of features, for detecting these spam images.
Features extracted from images for spam analysis:
- File Format
- File Size
- Image Metadata: comments, number of images (frames), bits per pixel, progressive flag, color table entries, indexvalue, transparent color, logical height and width, components, bands, …
- Image Size
- Average Color
- Color Saturation
- Edge Detection
- Prevalent Color Coverage
- Random Pixel Test
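To make a few of these features concrete, here is a minimal Python sketch of how average color, color saturation, and prevalent color coverage could be computed. It assumes the image has already been decoded into a flat list of (r, g, b) tuples. The function names and the exact definitions (mean HSV saturation, share of the single most common color) are my own assumptions for illustration and are not taken from the paper.

```python
import colorsys
from collections import Counter

# Assumption: an image is a flat list of (r, g, b) pixels, 0-255 per channel.

def average_color(pixels):
    """Mean value of each RGB channel over all pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def mean_saturation(pixels):
    """Mean HSV saturation (0.0 to 1.0) over all pixels."""
    total = 0.0
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        total += s
    return total / len(pixels)

def prevalent_color_coverage(pixels):
    """Fraction of the image covered by its single most common color."""
    _, count = Counter(pixels).most_common(1)[0]
    return count / len(pixels)

# Toy image: three pure red pixels and one white pixel.
toy_image = [(255, 0, 0)] * 3 + [(255, 255, 255)]
avg = average_color(toy_image)
sat = mean_saturation(toy_image)
coverage = prevalent_color_coverage(toy_image)
```

All three functions make a single pass over the pixels, which fits the paper's emphasis on simple, fast image properties.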
The authors provide a very interesting equation for selecting features that are both accurate and fast. The selected features are:
- Average Color
- Color Saturation
- Edge Detection
- File Format
- File Size
- Image Metadata
- Image Size
- Prevalent Color Coverage
- Random Pixel Test
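The paper's actual selection equation is not reproduced in this review, so the following Python sketch only illustrates the general idea of trading predictive power against extraction speed. The gain-per-cost ratio, the time budget, the function names, and all numbers are assumptions I made up for the example, not the authors' formula or measurements.

```python
def rank_by_gain_per_cost(features):
    """Order feature names by predictive gain per millisecond of extraction.

    features maps a name to (predictive_gain, extraction_cost_ms).
    """
    return sorted(
        features,
        key=lambda name: features[name][0] / features[name][1],
        reverse=True,
    )

def select_within_budget(features, budget_ms):
    """Greedily keep the best-ratio features that fit in a time budget."""
    chosen, spent = [], 0.0
    for name in rank_by_gain_per_cost(features):
        cost = features[name][1]
        if spent + cost <= budget_ms:
            chosen.append(name)
            spent += cost
    return chosen

# Made-up numbers: (predictive gain, extraction cost in ms).
candidates = {
    "file_size": (0.30, 0.1),
    "edge_detection": (0.50, 20.0),
    "average_color": (0.40, 2.0),
}
ranking = rank_by_gain_per_cost(candidates)
selected = select_within_budget(candidates, budget_ms=5.0)
```

With these invented numbers the cheap features come first and the expensive edge detection is dropped, which is the kind of trade-off a speed-aware selector is meant to expose.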
Important Terms
- Maximum Entropy
- Naive Bayes
- SpamArchive
- ID3 Decision Tree
- Just In Time decision tree
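The abstract describes Just in Time extraction as creating features at classification time, only as the classifier needs them. A rough Python sketch of that idea with a tiny hand-written decision tree follows. The LazyFeatures class, the tree layout, the thresholds, and the extractor names are hypothetical and chosen only for illustration; they are not the authors' implementation.

```python
class LazyFeatures:
    """Computes a feature only when it is first requested, then caches it."""

    def __init__(self, image, extractors):
        self.image = image
        self.extractors = extractors  # name -> function(image) -> value
        self.cache = {}

    def __getitem__(self, name):
        if name not in self.cache:
            self.cache[name] = self.extractors[name](self.image)
        return self.cache[name]

def classify(node, features):
    """Walk the tree; each test pulls exactly one feature on demand."""
    while isinstance(node, dict):
        passed = features[node["feature"]] <= node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node  # a leaf label: "spam" or "ham"

# Hypothetical tree: test the cheap feature first, the costly one second.
tree = {
    "feature": "file_size", "threshold": 1000, "yes": "ham",
    "no": {"feature": "edges", "threshold": 0.5, "yes": "ham", "no": "spam"},
}
extractors = {
    "file_size": lambda img: img["size"],
    "edges": lambda img: img["edges"],  # stands in for costly edge detection
}

small = LazyFeatures({"size": 500, "edges": 0.9}, extractors)
large = LazyFeatures({"size": 5000, "edges": 0.9}, extractors)
small_label = classify(tree, small)
large_label = classify(tree, large)
```

For the small image the tree stops at the first test, so the expensive edge feature is never computed. That skipped work is where the speed-up would come from.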
March 18, 2008 at 9:30 am
Quality Assessment Metrics can be useful in spam detection as well. This is a very good paper:
http://inform.nu/Articles/Vol8/v8p159-172Knig.pdf
Vidy
March 21, 2008 at 6:55 am
Hi Pedram
When you write a review you should include the following
1. Summary of key contributions of the paper.
2. Drawbacks or Limitations
3. Possibilities of Improvement, how it can be enhanced
In this review you have only conducted Step 1 i.e. summarized the paper.
When you read a few more papers, you will realize that there is an even better approach to handle this.
Just from reading your summary and the paper's abstract, I think there is one way in which this algorithm can be improved.
Currently the authors are only analyzing the image and using edge detection to identify spam.
It would be a good idea to look at the possibility of comparing the text on the image with a spam text database that the text filters normally use and measuring the correlation. If it is beyond a specific threshold, we can say that it is spam.
So when you read, just think about which features the proposed solution lacks and give some directions on its possible future enhancements.
cheers
vidy