Review: Next Generation Solutions for Spam: A Predictive Approach

March 8, 2008
Authors: Proofpoint MLX
Year: 2008
Published in: Proofpoint MLX Whitepaper
Link: http://whitepapers.techrepublic.com.com/whitepaper.aspx?docid=291451
Importance: High

Abstract

Mounting an effective defense against spam requires detection techniques that can evolve as quickly as the attacks themselves. Without the ability to automatically adapt to detect new types of threats, an anti-spam solution will always be a step behind the spammers. Proofpoint MLX™ technology leverages machine learning techniques to provide a revolutionary spam detection system that analyzes millions of messages to automatically adjust its detection algorithms to identify even the newest spam attacks without manual tuning or administrator intervention.

My Review

In this commercial whitepaper, the authors classify anti-spam filters into three generations.

1st generation – Basic filtering

  • Signature-based: Compare messages to known spam
  • Challenge/response: Require the sender to respond
  • Text pattern matching: Search for spam keywords
  • RBLs, collaborative: Check messages against RBLs and other public anti-spam resources

Results:

  • Low false positives
  • Low effectiveness
  • Easily fooled by evolving techniques

2nd generation – Heuristics/Bayesian

  • Linear models
  • Simple word match
  • Heuristic rules: Apply rules of thumb to assign a spam score

Results:

  • High false positives
  • High administration
  • Effectiveness decays over time
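
To make the "heuristic rules" idea above concrete, here is a minimal sketch of a rule-of-thumb scorer. The patterns, point values, and threshold are my own hypothetical choices, not taken from the whitepaper; the point is only to show why such filters need constant manual tuning.

```python
import re

# Hypothetical rules of thumb as (pattern, points) pairs. Neither the
# patterns nor the points come from the whitepaper; they are illustrative.
RULES = [
    (re.compile(r"\bviagra\b", re.IGNORECASE), 3.0),
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.0),
    (re.compile(r"!{3,}"), 1.5),
    (re.compile(r"\bclick here\b", re.IGNORECASE), 2.0),
]
THRESHOLD = 4.0  # assumed cut-off that an administrator would tune by hand


def heuristic_score(message: str) -> float:
    """Sum the points of every rule whose pattern appears in the message."""
    return sum(points for pattern, points in RULES if pattern.search(message))


def is_spam(message: str) -> bool:
    """Flag the message when its total score reaches the threshold."""
    return heuristic_score(message) >= THRESHOLD
```

A message such as "Free viagra!!! Click here" trips every rule, but a trivially obfuscated "Get v1agra now" trips none of them, which is the kind of decay over time listed in the results above.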

3rd generation – Machine learning

  • Logistic regression
  • Support vector machines
  • Integrated reputation

Results:

  • Immune to evolving attacks
  • High effectiveness without decay
  • Low false positives
  • Low administration

The authors believe that developing anti-spam methodologies today requires intelligent approaches that adapt automatically to new spam.
The Proofpoint MLX detection method uses logistic regression to find dependencies between spam attributes, then assigns a weight to each attribute and calculates the net effect of the weighted attributes.
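
As a rough illustration of the weighting idea, the sketch below combines per-attribute weights with a logistic function to get a spam probability. The attribute names, weights, and bias are hypothetical values I picked by hand; the whitepaper does not publish MLX's attributes or weights, and in a real system they would be learned from labeled messages rather than hard-coded.

```python
import math

# Hypothetical attribute weights (not from the whitepaper). Positive weights
# push toward "spam", negative weights push toward "legitimate".
WEIGHTS = {
    "has_suspicious_url": 2.1,
    "mentions_pharmacy": 1.7,
    "html_only": 0.8,
    "sender_in_address_book": -3.0,
}
BIAS = -1.5  # assumed intercept: with no evidence, lean toward "not spam"


def spam_probability(attributes: dict) -> float:
    """Return the logistic of the weighted sum of the observed attributes."""
    z = BIAS + sum(
        WEIGHTS[name] * value
        for name, value in attributes.items()
        if name in WEIGHTS  # unknown attributes contribute nothing
    )
    return 1.0 / (1.0 + math.exp(-z))
```

Calling `spam_probability({"has_suspicious_url": 1, "mentions_pharmacy": 1})` gives a probability well above 0.5, while `spam_probability({"sender_in_address_book": 1})` gives one close to zero. The net effect is a single score in which every attribute contributes in proportion to its weight.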

I am not very interested in this paper, so I have summarized many parts. Image-based spam is another problem that this new method detects. Spammers who send image spam use the following techniques to bypass current filters:

  • Randomized Image Borders
  • Randomized Pixels Between Paragraphs
  • Randomized Colortable Entries to Obfuscate the Image
  • Animated GIF with Embedded Spam Image
  • Image Segmentation
  • OCR-resistant Images
  • Combining Image-based and Text-based Techniques

The suggested method uses the following techniques to filter image-based spam:

  • Fuzzy matching for obfuscated images
  • Dynamic spam image detection
  • Animated GIF spam detection
  • Dynamic botnet protection
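
The whitepaper does not explain how its fuzzy matching works, so the following is only my own sketch of one common way to match images despite randomized pixels: an "average hash" compared by Hamming distance. The function names, the tiny grayscale grid input, and the `max_distance` tolerance are all assumptions for illustration.

```python
def average_hash(pixels):
    """Hash a small grayscale grid (2D list of 0-255 values).

    Each pixel becomes 1 if it is brighter than the grid's mean, else 0.
    The grid is assumed to be already downscaled to a fixed small size.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def looks_like(known_hash, pixels, max_distance=2):
    """True if the image's hash is within max_distance bits of a known hash."""
    return hamming(known_hash, average_hash(pixels)) <= max_distance
```

Because a few randomized pixels change only a few bits of the hash, an obfuscated copy of a known spam image still lands within the tolerance, whereas an exact signature comparison would miss it.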

The authors describe a layered system (logistic regression) that filters email, but they never mention how long this task takes. Personally, I think it takes a lot of CPU and memory.