Review: Behavior-based Email Analysis with Application to Spam Detection

February 23, 2008
Author: Shlomo Hershkop
Year: 2006
Published in: Ph.D. thesis, Columbia University Graduate School of Arts and Sciences
Link: http://www1.cs.columbia.edu/~sh553/publications/final-thesis.pdf
Importance: Very High

Abstract

Email is the “killer network application”. Email is ubiquitous and pervasive.
In a relatively short time frame, the Internet has become irrevocably and deeply
entrenched in our modern society primarily due to the power of its communication
substrate linking people and organizations around the globe. Much work on email
technology has focused on making email easy to use, permitting a wide variety of
information and information types to be conveniently, reliably, and efficiently sent
throughout the Internet. However, the analysis of the vast storehouse of email
content accumulated or produced by individual users has received relatively little
attention other than for specific tasks such as spam and virus filtering. As one paper
in the literature puts it, “the state of the art is still a messy desktop” (Denning, 1982).
The Problem: Email clients provide only partial information – users have to
manage much on their own, making it hard to search or prioritize large amounts
of email. Our thesis is that advanced data mining can provide new opportunities
for applications to increase email productivity and extract new information from
email archives.
My Review

This Ph.D. thesis proposes a new way of classifying email. The method, called behavior-based classification, classifies messages according to the user's past email usage. In other words, the user's past behavior is taken into account when evaluating new incoming messages, which makes spam easier to detect: spam messages behave differently from the user's past email behavior.
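To make the idea of comparing new mail against past behavior concrete, here is a minimal sketch in Python. It is my own illustration, not the thesis's actual models: it assumes behavior is summarized as a normalized histogram of the hours at which a user's mail is sent, and it uses a plain L1 distance to compare a new batch against that profile. The function names and the sample hours are hypothetical.

```python
from collections import Counter


def hour_histogram(hours):
    """Return a 24-bin normalized histogram of message hours (0-23)."""
    counts = Counter(hours)
    total = len(hours)
    if total == 0:
        return [0.0] * 24
    return [counts.get(h, 0) / total for h in range(24)]


def l1_distance(p, q):
    """Sum of absolute bin differences; 0.0 means identical histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical history: hours at which this user's past mail was sent.
past_hours = [9, 9, 10, 11, 14, 15, 16, 9, 10, 14]
profile = hour_histogram(past_hours)

# Two hypothetical new batches: one during working hours, one at night.
typical = hour_histogram([9, 10, 14, 15])
unusual = hour_histogram([2, 3, 3, 4])

typical_score = l1_distance(profile, typical)
unusual_score = l1_distance(profile, unusual)
print(typical_score, unusual_score)
```

A batch that looks like the user's history gets a small distance, while a batch sent at hours the user never uses gets the maximum distance of 2. A real system would combine several such features and pick a threshold from data.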

The author catalogs many machine learning models for email, which are very interesting methods for classifying, or better to say data mining, email messages. Personally, I think these models can be developed and extended for other uses on the web. Some of the models below, applied here to email data mining, are already used on the web (e.g. N-Gram, TF-IDF, …). Behavior-based models are of great interest for spam detection in blogs, forums, …, which I am going to study in the future.

Machine Learning Models on Email
1. Naive Bayes
2. N-Gram
3. Limited N-Gram
4. Text Based Naïve Bayes
5. TF-IDF
6. Biased Text Tokens
7. Behavioral-based Models
7.1. Sending Usage Model
7.2. Similar User Model
7.3. User Clique Model
7.4. VIP Communication Model
7.5. Organizational Level Clique Model
7.6. URL Model
8. Attachment Models
8.1. Attachment Incident
8.2. Birth rate
8.3. Lifespan
8.4. Incident rate
8.5. Death rate
8.6. Prevalence
8.7. Threat
8.8. Spread
9. Histogram Distance Metrics
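
As an illustration of one content-based model from the list above, here is a small TF-IDF sketch in Python. It uses one common textbook variant (term frequency times log(N / document frequency)); the exact weighting used in the thesis may differ, and the function name and sample tokens are my own assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Uses tf = count / document length and idf = log(N / df),
    which is one common variant among several.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            t: (c / len(tokens)) * math.log(n_docs / df[t])
            for t, c in tf.items()
        })
    return weights


# Hypothetical, already tokenized messages.
docs = [
    ["cheap", "pills", "cheap"],
    ["meeting", "at", "noon"],
    ["cheap", "meeting"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first message, "pills" appears in only one document and so gets a higher weight than "cheap", which appears in two of the three. A classifier would then use these weights as features.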

More coming soon… (it’s a long thesis! 230 pages)

Important Keywords

Preemption

Legislation

Protocol reimplementation

Filtering (White List, Black List, Content-Based)

Future Investigation On

SPF – http://www.openspf.org/RFC_4408
http://support.easydns.com/tutorials/SPF/spfrec.php
