Review: Behavior-based Email Analysis with Application to Spam Detection

February 23, 2008
Author: Shlomo Hershkop
Year: 2006
Published in: Ph.D. thesis, Columbia University Graduate School of Arts and Sciences
Link: http://www1.cs.columbia.edu/~sh553/publications/final-thesis.pdf
Importance: Very High

Abstract

Email is the “killer network application”. Email is ubiquitous and pervasive.
In a relatively short time frame, the Internet has become irrevocably and deeply
entrenched in our modern society primarily due to the power of its communication
substrate linking people and organizations around the globe. Much work on email
technology has focused on making email easy to use, permitting a wide variety of
information and information types to be conveniently, reliably, and efficiently sent
throughout the Internet. However, the analysis of the vast storehouse of email
content accumulated or produced by individual users has received relatively little
attention other than for specific tasks such as spam and virus filtering. As one paper
in the literature puts it, “the state of the art is still a messy desktop” (Denning, 1982).
The Problem: Email clients provide only partial information – users have to
manage much on their own, making it hard to search or prioritize large amounts
of email. Our thesis is that advanced data mining can provide new opportunities
for applications to increase email productivity and extract new information from
email archives.

My Review

This Ph.D. thesis proposes a new way of classifying email. The method, called behavior-based classification, classifies messages based on the user's past email usage. In other words, the user's past behavior is taken into account when judging new incoming messages, so spam can be detected more easily: spam emails behave differently from the user's established email behavior.
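To make the idea concrete, here is a minimal sketch of a behavior comparison. It is not the thesis's actual model: it assumes that behavior is summarized as a normalized histogram of the hours at which an account sends mail, and that a new batch is suspicious when its histogram is far from the historical one. The function names, the sample data, and the 1.0 threshold are all illustrative assumptions.

```python
from collections import Counter


def hour_histogram(hours, bins=24):
    """Normalized histogram of sending hours (each hour is 0-23)."""
    if not hours:
        raise ValueError("need at least one observation")
    counts = Counter(hours)
    total = len(hours)
    return [counts.get(h, 0) / total for h in range(bins)]


def l1_distance(p, q):
    """L1 distance between two histograms (0.0 = identical, 2.0 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def looks_anomalous(past_hours, new_hours, threshold=1.0):
    """Flag new activity whose hourly profile is far from the past profile.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    distance = l1_distance(hour_histogram(past_hours), hour_histogram(new_hours))
    return distance > threshold


# Hypothetical data: an account that normally sends during office hours.
past = [9, 10, 10, 11, 14, 15, 16, 9, 10, 15]
typical = [9, 10, 11, 15]   # new mail sent at the usual times
burst = [3, 3, 3, 4]        # new mail sent in the middle of the night

typical_flag = looks_anomalous(past, typical)
burst_flag = looks_anomalous(past, burst)
```

With this data the night-time burst shares no hours with the past profile, so it is flagged, while the office-hours batch is not.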

The author catalogs many machine learning models for email, which are very interesting methods for classifying, or better said, data mining email messages. I personally think these models can be developed and extended for other uses on the web. Some of the models below, which the thesis suggests for email data mining, are already used on the web (e.g. N-Gram, TF-IDF, …). Behavioral-based models are of great interest for spam detection in blogs, forums, …, which I am going to study in the future.

Machine Learning Models on Email
1. Naive Bayes
2. N-Gram
3. Limited N-Gram
4. Text Based Naïve Bayes
5. TF-IDF
6. Biased Text Tokens
7. Behavioral-based Models
7.1. Sending Usage Model
7.2. Similar User Model
7.3. User Clique Model
7.4. VIP Communication Model
7.5. Organizational Level Clique Model
7.6. URL Model
8. Attachment Models
8.1. Attachment Incident
8.2. Birth rate
8.3. Lifespan
8.4. Incident rate
8.5. Death rate
8.6. Prevalence
8.7. Threat
8.8. Spread
9. Histogram Distance Metrics
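
As a reminder of how the first model in the list works, here is a minimal sketch of a text-based Naive Bayes classifier with add-one (Laplace) smoothing. It is a textbook version written for illustration, not the implementation used in the thesis; the whitespace tokenizer, the function names, and the four training messages are assumptions.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (text, label). Returns (word counts per label, doc counts, vocabulary)."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior under Naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log prior
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            # add-one smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
training = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
spam_guess = classify("buy cheap money", model)
ham_guess = classify("monday meeting", model)
```

On this toy data the first message is labeled spam and the second ham, because each contains only words seen in that class.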

More coming soon… (it’s a long thesis! 230 pages)

Important Keywords

Preemption

Legislation

Protocol reimplementation

Filtering (White List, Black List, Content-Based)

Future Investigation On

SPF – http://www.openspf.org/RFC_4408 – http://support.easydns.com/tutorials/SPF/spfrec.php
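
As a starting point for that investigation, here is a rough sketch of how an SPF record can be evaluated against a sending IP address. It is a deliberately simplified assumption of mine: it only understands the `ip4`, `ip6` and `all` mechanisms and does no DNS lookups, whereas a real evaluator following RFC 4408 must also resolve `a`, `mx`, `include` and other terms. The function name and the example record are hypothetical.

```python
import ipaddress

# SPF qualifiers and the result each one produces when its mechanism matches.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def check_spf(record, sender_ip):
    """Evaluate a (simplified) SPF record for one sender IP address."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an SPF version 1 record
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # default qualifier when none is written
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6"):
            if ip in ipaddress.ip_network(value, strict=False):
                return QUALIFIERS[qualifier]
        elif name == "all":
            return QUALIFIERS[qualifier]
        # Other mechanisms (a, mx, include, ...) need DNS and are skipped
        # here, so results for records that use them are not reliable.
    return "neutral"  # nothing matched


# Hypothetical record using documentation-only address ranges.
record = "v=spf1 ip4:192.0.2.0/24 -all"
allowed = check_spf(record, "192.0.2.10")
rejected = check_spf(record, "198.51.100.7")
```

Here a sender inside 192.0.2.0/24 gets `pass`, and any other sender falls through to `-all` and gets `fail`.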


Review: Web Spam Taxonomy

February 22, 2008
Authors: Zoltan Gyongyi, Hector Garcia-Molina.
Year: 2005
Published in:
Link: http://airweb.cse.lehigh.edu/2005/gyongyi.pdf
Importance: Very High

Abstract

Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures.

My review

In this paper the authors present a comprehensive collection of spamming techniques on the web. They describe each category of spamming very clearly, with examples that kept me reading without a break. They begin with the impact of spam: 1. lower quality of search engine results 2. higher cost of search queries
One drawback of this paper is that the authors consider only spam that fits the following definition:
“all types of actions intended to boost ranking (either relevance, or importance, or both), without improving the true value of a page, are considered spamming.”
However, I personally think there are other kinds of web spam that this definition does not cover, such as spam pages that appear when a user mistypes a URL. For example, enter gmal.com instead of gmail.com and you will see a spam page. Interestingly, gmal.com is an unregistered domain name, so it cannot be indexed by search engines. The paper has no classification for this kind of spam page, even though such pages do exist in practice.

The authors describe each category of spamming and the ranking algorithms it targets.
Two kinds of techniques are associated with web spam:
Boost technique: achieves high relevance and/or importance for a page, influencing search engine ranking
Hide technique: hides the boosting technique from human eyes
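
To see why a boost technique such as term spamming works against a TF-IDF style relevance score, consider this small sketch. It uses one common textbook TF-IDF variant for a single query term, which is my assumption and not necessarily the exact formula in the paper; the pages and function names are made up for illustration.

```python
import math


def tf_idf(term, page, corpus):
    """TF-IDF of one term for one page, where corpus is a list of token lists."""
    tf = page.count(term) / len(page)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


def score(term, page, other_pages):
    """Score a page against a query term within a small corpus."""
    return tf_idf(term, page, [page] + other_pages)


# Hypothetical pages, tokenized by whitespace.
others = [
    "recipes for pasta and pizza".split(),
    "history of the roman empire".split(),
]
honest = "we sell cheap flights to rome".split()
# Term spamming: the same page with the target term repeated many times.
spammed = honest + ["flights"] * 10

honest_score = score("flights", honest, others)
spammed_score = score("flights", spammed, others)
```

The spammed page scores higher for the query term even though its true value to a reader has not improved, which is exactly the behavior the paper's definition of spamming describes.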

Important Terms

Search Engine Optimizer – SEO

Boost technique

Hide technique

Term spamming

TFIDF

HITS

Cloaking