Review: Learning Fast Classifiers for Image Spam

March 17, 2008
Authors: Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach
Year: 2007
Published in: CEAS 2007
Link: http://www.ceas.cc/2007/papers/paper-06.pdf
Importance: Very High

Abstract

Recently, spammers have proliferated “image spam”, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.

My Review

In the introduction, the authors note that content filtering, often based on naive Bayes, is the cornerstone of many anti-spam email systems. Spammers try to outsmart content filters by:

  • Obscuring text
  • Obfuscating words with symbols
  • Including neutral text

Spammers now also use a newer method: placing the advertisement in an image attachment instead of the message text, which neutralizes text-based filtering. The authors therefore propose a new method or, better to say, a new set of features for detecting these spam images.

Features extracted from images for spam analysis:

  • File Format
  • File Size
  • Image Metadata: comments, number of images (frames), bits per pixel, progressive flag, color table entries, indexvalue, transparent color, logical height and width, components, bands, …
  • Image Size
  • Average Color
  • Color Saturation
  • Edge Detection
  • Prevalent Color Coverage
  • Random Pixel Test

The authors provide a very interesting equation for selecting features that are both accurate and fast. The selected features are:

  • Average Color
  • Color Saturation
  • Edge Detection
  • File Format
  • File Size
  • Image Metadata
  • Image Size
  • Prevalent Color Coverage
  • Random Pixel Test
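I have not reproduced the paper's equation here, so the following is only a sketch of the general idea of trading predictive power against speed. The score used (gain divided by extraction cost) is my own assumption and not the authors' formula, and the gain and timing numbers are made up for illustration.

```python
def rank_by_gain_per_cost(features):
    """Rank feature names by (predictive gain / extraction cost), best first.

    features maps a name to a (gain, cost_ms) pair. The ratio is a
    hypothetical stand-in for the paper's speed-sensitive criterion.
    """
    return sorted(
        features,
        key=lambda name: features[name][0] / features[name][1],
        reverse=True,
    )

# Hypothetical (gain, cost in milliseconds) values, not measurements.
candidates = {
    "file_size": (0.30, 0.01),
    "edge_detection": (0.45, 5.0),
    "average_color": (0.20, 1.0),
}
print(rank_by_gain_per_cost(candidates))
```

With a ratio like this, a slightly less informative but nearly free feature such as file size can outrank an informative but slow one such as edge detection.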

Important Terms

  • Maximum Entropy
  • Naive Bayes
  • SpamArchive
  • ID3 Decision Tree
  • Just In Time decision tree
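
The Just In Time idea from the abstract (create a feature only when the classifier asks for it) can be sketched as a lazily evaluated feature lookup driving a small decision tree. This is my own minimal illustration: the class name, the tree layout, the thresholds and the extractors are all hypothetical and not taken from the paper.

```python
class LazyFeatures:
    """Computes a feature the first time it is requested, then caches it."""

    def __init__(self, extractors, item):
        self._extractors = extractors  # name -> function(item) -> value
        self._item = item
        self._cache = {}
        self.computed = []             # records which features were extracted

    def __getitem__(self, name):
        if name not in self._cache:
            self._cache[name] = self._extractors[name](self._item)
            self.computed.append(name)
        return self._cache[name]

def classify(tree, features):
    """Walk the tree; only features on the visited path get extracted."""
    node = tree
    while isinstance(node, dict):
        branch = "yes" if features[node["feature"]] <= node["threshold"] else "no"
        node = node[branch]
    return node

# Hypothetical tree and thresholds, for illustration only.
tree = {
    "feature": "file_size", "threshold": 10000,
    "yes": "ham",
    "no": {"feature": "saturation", "threshold": 0.2, "yes": "spam", "no": "ham"},
}
extractors = {
    "file_size": lambda img: img["file_size"],
    "saturation": lambda img: img["saturation"],  # stands in for a slow extractor
}
feats = LazyFeatures(extractors, {"file_size": 5000, "saturation": 0.9})
print(classify(tree, feats), feats.computed)
```

In this example the tree stops at the first node, so the slower saturation extractor is never called.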

Review: A Learning Approach to Spam Detection based on Social Networks

March 7, 2008
Authors: Ho-Yu Lam, Dit-Yan Yeung
Year: 2007
Published in: CEAS 2007
Link: http://www.ceas.cc/2007/papers/paper-81.pdf
Importance: High

Abstract

The massive increase of spam is posing a very serious threat to email which has become an important means of communication. Not only does it annoy users, but it also consumes much of the bandwidth of the Internet. Most spam filters in existence are based on the content of email one way or the other. While these anti-spam tools have proven very useful, they do not prevent the bandwidth from being wasted and spammers are learning to bypass them via clever manipulation of the spam content. A very different approach to spam detection is based on the behavior of email senders. In this paper, we propose a learning approach to spam sender detection based on features extracted from social networks constructed from email exchange logs. Legitimacy scores are assigned to senders based on their likelihood of being a legitimate sender. Moreover, we also explore various spam filtering and resisting possibilities.

My Review

The term “social network” in this paper refers to a network built from email transaction logs. The transaction logs on an SMTP server, which contain the sender address, IP address, sender email client, …, are parsed offline to construct the email social network. I mention this because the term has a different meaning here from its usual sense.
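
As a rough sketch of how such a network could be built from a log, the code below turns (sender, recipient) pairs into a directed graph and computes count and degree features. The log format and function names are my own assumptions, and I assume "count" means number of emails and "degree" means number of distinct accounts; the paper's exact definitions may differ.

```python
def build_network(log):
    """log: iterable of (sender, recipient) pairs.
    Returns {sender: {recipient: number_of_emails}}."""
    graph = {}
    for sender, recipient in log:
        edges = graph.setdefault(sender, {})
        edges[recipient] = edges.get(recipient, 0) + 1
    return graph

def out_count(graph, node):
    """Total emails sent by node."""
    return sum(graph.get(node, {}).values())

def out_degree(graph, node):
    """Number of distinct accounts node sent email to."""
    return len(graph.get(node, {}))

def in_count(graph, node):
    """Total emails received by node."""
    return sum(edges.get(node, 0) for edges in graph.values())

def in_degree(graph, node):
    """Number of distinct accounts that sent email to node."""
    return sum(1 for edges in graph.values() if node in edges)

# Made-up log entries for illustration.
log = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a")]
g = build_network(log)
print(out_count(g, "a"), out_degree(g, "a"), in_count(g, "a"), in_degree(g, "a"))
```

A sender with a large out-degree and almost no incoming mail looks very different in such a graph from a normal user who both sends and receives.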

In this paper, the authors first describe the types of email spam:

  • Unsolicited commercial email (UCE) – emails without recipient’s prior consent.
  • Unsolicited bulk email (UBE) – emails which distribute viruses and spyware.

Email spam detection is based on two approaches:

  1. Spam text detection
  2. Whitelist and blacklist

Their suggested method builds on the whitelist and blacklist approach. They provide a learning method for creating better blacklists and whitelists.

The detection method is based on 7 features; each feature is counted, normalized and weighted. Each is then compared with valid feature data, by which I mean data from senders that were already classified as spam or non-spam. By comparing the similarity between these features, a sender can be classified as spam or non-spam.
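
One plausible way to read this comparison step is a nearest-neighbour score over the normalized feature vectors. The sketch below is my own assumption of how that could look, not the authors' exact algorithm: the Gaussian similarity weighting, the value of k, and the two-dimensional vectors (instead of the 7 features) are all chosen only to keep the example short.

```python
import math

def legitimacy_score(sender, labeled, k=3):
    """Similarity-weighted average label of the k nearest labeled senders.

    sender:  normalized feature vector (tuple of floats)
    labeled: list of (feature_vector, label), label 1.0 = legitimate, 0.0 = spam
    Returns a score in [0, 1]; higher means more likely legitimate.
    """
    nearest = sorted(labeled, key=lambda item: math.dist(sender, item[0]))[:k]
    # Assumed weighting: closer neighbours count more.
    weights = [math.exp(-math.dist(sender, vec) ** 2) for vec, _ in nearest]
    weighted = sum(w * label for w, (_, label) in zip(weights, nearest))
    return weighted / sum(weights)

# Made-up, already normalized feature vectors with known labels.
known = [
    ((0.9, 0.8), 1.0),
    ((0.8, 0.9), 1.0),
    ((0.1, 0.0), 0.0),
    ((0.0, 0.1), 0.0),
]
print(legitimacy_score((0.85, 0.85), known))
print(legitimacy_score((0.05, 0.05), known))
```

A threshold on this score would then decide whether the sender goes on the whitelist or the blacklist.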

One drawback of this method is that some websites which send mass emails to users (such as mycareer, ebay, …) may be classified as spam senders. So there should be additional policies for these legitimate senders.

Important Terms

  • In/out-count
  • In/Out-degree
  • Communication Reciprocity
  • Communication Interaction Average
  • Clustering Coefficient

Cite this article as
Critical review on “A Learning Approach to Spam Detection based on Social Networks” by P.Hayati. 8th Mar 2008. Available online: https://pi3ch.wordpress.com/2008/03/07/review-a-learning-approach-to-spam-detection-based-on-social-networks/