Review: Behavior-based Email Analysis with Application to Spam Detection

February 23, 2008
Author: Shlomo Hershkop
Year: 2006
Published in: Ph.D. thesis, Columbia University Graduate School of Arts and Sciences
Link: http://www1.cs.columbia.edu/~sh553/publications/final-thesis.pdf
Important: Very High

Abstract

Email is the “killer network application”. Email is ubiquitous and pervasive.
In a relatively short time frame, the Internet has become irrevocably and deeply
entrenched in our modern society primarily due to the power of its communication
substrate linking people and organizations around the globe. Much work on email
technology has focused on making email easy to use, permitting a wide variety of
information and information types to be conveniently, reliably, and efficiently sent
throughout the Internet. However, the analysis of the vast storehouse of email
content accumulated or produced by individual users has received relatively little
attention other than for specific tasks such as spam and virus filtering. As one paper
in the literature puts it, “the state of the art is still a messy desktop” (Denning, 1982).
The Problem: Email clients provide only partial information – users have to
manage much on their own, making it hard to search or prioritize large amounts
of email. Our thesis is that advanced data mining can provide new opportunities
for applications to increase email productivity and extract new information from
email archives.
My Review

This Ph.D. thesis presents a new way of classifying emails. The method, called behavior-based classification, classifies emails based on the user's past email usage. In other words, the user's past behavior with email messages is taken into account when new emails arrive, so that spam can be detected more easily: spam emails behave differently from the user's past email behavior.

The author catalogs many machine learning models for email, which are very interesting methods for classification or, better to say, data mining on email messages. Personally, I think these models can be developed and extended for other uses on the web. Some of the models below that the thesis applies to email data mining are already in use on the web (e.g. N-Gram, TF-IDF, …). Behavior-based models are of great interest for spam detection in blogs, forums, …, which I am going to study in the future.
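To make one of the content models mentioned above concrete, here is a minimal TF-IDF sketch. It is my own illustration, not code from the thesis: the whitespace tokenizer, the idf = log(N / df) weighting and the sample emails are all assumptions, and the thesis may use a different variant.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting: raw term frequency times log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical sample emails, only for illustration.
emails = [
    "cheap pills buy cheap pills now",
    "meeting notes for the project",
    "project deadline and meeting agenda",
]
weights = tf_idf(emails)
```

Terms that are frequent in one email but rare across the archive (here "cheap" and "pills") get the highest weights, while a term that appears in every document gets weight zero.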

Machine Learning Models on Email
1. Naive Bayes
2. N-Gram
3. Limited N-Gram
4. Text-Based Naïve Bayes
5. TF-IDF
6. Biased Text Tokens
7. Behavioral-based Models
7.1. Sending Usage Model
7.2. Similar User Model
7.3. User Clique Model
7.4. VIP Communication Model
7.5. Organizational Level Clique Model
7.6. URL Model
8. Attachment Models
8.1. Attachment Incident
8.2. Birth rate
8.3. Lifespan
8.4. Incident rate
8.5. Death rate
8.6. Prevalence
8.7. Threat
8.8. Spread
9. Histogram Distance Metrics
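
As a rough illustration of how a behavioral model and a histogram distance metric from the list above could fit together, the sketch below builds an hour-of-day sending histogram for a user and compares past and recent behavior. The hour-of-day feature, the L1 distance and the sample data are my assumptions; the thesis defines its own features and metrics.

```python
from collections import Counter

def hourly_histogram(send_hours):
    """Normalized 24-bin histogram of the hours (0-23) at which emails were sent."""
    if not send_hours:
        raise ValueError("need at least one send time")
    counts = Counter(send_hours)
    total = len(send_hours)
    return [counts[h] / total for h in range(24)]

def l1_distance(hist_a, hist_b):
    """Sum of absolute bin differences: 0 for identical histograms, 2 for disjoint ones."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

# Hypothetical data: the user normally sends during office hours,
# while the recent messages all went out in the middle of the night.
past_hours = [9, 9, 10, 11, 14, 15, 16, 10, 9, 14]
recent_hours = [2, 3, 3, 2, 4, 3, 2, 3]

profile = hourly_histogram(past_hours)
recent = hourly_histogram(recent_hours)
distance = l1_distance(profile, recent)
```

A large distance between the stored profile and recent activity would mark the recent messages as unusual for this account; where to put the threshold is a separate tuning question.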

More coming soon… (it’s a long thesis! 230 pages)

Important Keywords

Preemption

Legislation

Protocol reimplementation

Filtering (White List, Black List, Content-Based)

Future Investigation On

SPF – http://www.openspf.org/RFC_4408
http://support.easydns.com/tutorials/SPF/spfrec.php

Review: Web Spam Taxonomy

February 22, 2008
Authors: Zoltan Gyongyi, Hector Garcia-Molina.
Year: 2005
Published in: AIRWeb 2005 (First International Workshop on Adversarial Information Retrieval on the Web)
Link: http://airweb.cse.lehigh.edu/2005/gyongyi.pdf
Important: Very High

Abstract

Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures.

My Review

In this paper, the authors present a comprehensive collection of spamming techniques on the web. They describe each category of spamming very clearly, with examples that kept me reading the paper without stopping. They begin with the impact of spam: 1. lower quality of search engine results 2. higher cost for each search query
One drawback of this paper is that the authors only consider spam as defined below:
all types of actions intended to boost ranking (either relevance, or importance, or both), without improving the true value of a page, are considered spamming.
However, I personally think there are other kinds of spam on the web that are not classified as spam by the above definition, such as spam pages that are reached when a user enters a URL incorrectly. E.g. if you enter gmal.com instead of Gmail.com, you will see a spam page. Interestingly, gmal.com is an unregistered domain name, so it cannot be indexed by search engines. There is no classification for this kind of spam page, even though such pages exist in practice.
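
The mistyped-address case above is not covered by the paper, but one simple way to flag such domains is to measure how close they are to well-known names. The sketch below is only my own illustration: it uses the Levenshtein edit distance, and the list of known domains and the distance threshold are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def looks_like_typo(domain, known_domains, max_distance=1):
    """Return the known domains that `domain` is a near miss of (but not equal to)."""
    domain = domain.lower()
    return [known for known in known_domains
            if 0 < edit_distance(domain, known.lower()) <= max_distance]

# Hypothetical list of well-known domains.
KNOWN_DOMAINS = ["gmail.com", "yahoo.com"]
matches = looks_like_typo("gmal.com", KNOWN_DOMAINS)
```

With these inputs, gmal.com is one edit away from gmail.com, so it is reported as a likely typo of that domain.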

The authors describe each category of spamming and the ranking algorithms it targets.
Two kinds of techniques are associated with web spam:
Boost technique: achieves high relevance and/or importance for a page in order to influence search engine ranking
Hide technique: hides the boosting techniques from the eyes of human users

Important Terms

Search Engine Optimizer – SEO

Boost technique

Hide technique

Term spamming

TFIDF

HITS

Cloaking


Review: Blog Track Open Task: Spam Blog Classification

February 19, 2008
Authors: Pranam Kolari, Tim Finin, Akshay Java, Anupam Joshi, Justin Martineau and James Mayfield.
Year: 2006
Published in: TREC 2006 Blog Track Notebook
Link: http://ebiquity.umbc.edu/paper/html/id/318/Blog-Track-Open-Task-Spam-Blog-Classification
Important: Very High

Abstract

Spam blogs or Splogs are blogs with either auto-generated or plagiarized content created for the sole purpose of hosting ads, promoting affiliate sites and getting new pages indexed. Splogs now rival generic web spam and e-mail spam, presenting a major problem to analytics on the blogosphere from basic search and indexing, to opinion, community, influence and correlation detection. This open task submission details how splogs impact Opinion Identification, and proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007.

My Review

In this paper, the authors first make it clear that spam blog (splog) classification differs from web spam classification for three reasons:
Search Engine Coverage: search engine coverage is greater, especially for blogs
Quicker Assessment: blogs are updated more quickly than ordinary web pages
Genre of Blog Content: current web spam detection tools are not suited to the nature of blog content, which comes from personal opinions, …

The authors mention two models of blog spam detection:

  1. Local models
  2. Link-based models

They emphasize the first model, which detects spam using a single web page alone and does not look at other links or other data sources. The main advantage of this model is quick assessment of spam content.

Techniques used in the local model:

  1. Words
  2. Word N-Grams
  3. Tokenized Anchors
  4. Tokenized URLs
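
The four feature types above can be pulled out of a single page roughly as follows. This is a sketch under my own assumptions (alphanumeric tokenization, bigrams as the n-grams, a made-up example page); the paper does not prescribe this exact procedure.

```python
import re
from html.parser import HTMLParser

TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return TOKEN_RE.findall(text.lower())

class LocalFeatureParser(HTMLParser):
    """Collects page text, anchor text and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.anchor_parts = []
        self.urls = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._in_anchor:
            self.anchor_parts.append(data)

def local_features(html, n=2):
    """Words, word n-grams, tokenized anchors and tokenized URLs of one page."""
    parser = LocalFeatureParser()
    parser.feed(html)
    parser.close()
    words = tokenize(" ".join(parser.text_parts))
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {
        "words": words,
        "word_ngrams": ngrams,
        "anchors": tokenize(" ".join(parser.anchor_parts)),
        "urls": [tok for url in parser.urls for tok in tokenize(url)],
    }

# Hypothetical page, only for illustration.
page = '<p>Buy cheap watches</p><a href="http://cheap-watches.example.com/buy">best deals</a>'
features = local_features(page)
```

Everything here comes from the page itself, which is what makes the local model quick: no other pages have to be fetched before a classifier can use these features.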

Other techniques:

  1. Ping servers discard blogs which ping too frequently
  2. Comment spam model: Akismet
  3. URL/IP blacklist
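
The first technique in this list, discarding blogs that ping too frequently, can be sketched as a sliding-window counter. The limits and the class name below are hypothetical; real ping servers choose their own thresholds.

```python
from collections import defaultdict, deque

class PingRateFilter:
    """Rejects pings from a blog that exceeds max_pings within window_seconds."""

    def __init__(self, max_pings=3, window_seconds=3600):
        self.max_pings = max_pings
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # blog URL -> timestamps of accepted pings

    def accept(self, blog_url, timestamp):
        """Return True if the ping is accepted, False if the blog pings too often."""
        recent = self._recent[blog_url]
        # Forget accepted pings that have fallen out of the sliding window.
        while recent and timestamp - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_pings:
            return False  # rejected pings are not recorded
        recent.append(timestamp)
        return True
```

For example, with max_pings=2 and window_seconds=60, a third ping from the same blog inside the same minute is rejected, while pings from other blogs are unaffected.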

Splog categories:

  1. Non-blogs
  2. Keyword-stuffing
  3. Post-stitching
  4. Post-plagiarism
  5. Post-weaving
  6. Link-spam
  7. Other

Suggestions for authors:
How the bag of words is built matters for detecting spam content. For this task, we can add high-value advertising keywords to the list, since many splogs use these keywords to attract traffic. We can also add keywords tied to particular times of the year, e.g. greeting cards, happy valentine, …, during the corresponding period.
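
A minimal sketch of this suggestion, assuming a hand-made keyword list: the advertising and seasonal keywords below are invented for illustration and do not come from the paper.

```python
from datetime import date

# Hypothetical keyword lists, only for illustration.
AD_KEYWORDS = {"cheap", "discount", "loan", "casino"}
SEASONAL_KEYWORDS = {
    2: {"valentine", "greeting"},   # February
    12: {"christmas", "gift"},      # December
}

def keyword_vocabulary(base_vocabulary, on_date):
    """Base vocabulary plus ad keywords and the keywords tied to on_date's month."""
    seasonal = SEASONAL_KEYWORDS.get(on_date.month, set())
    return set(base_vocabulary) | AD_KEYWORDS | seasonal

def bag_of_words(text, vocabulary):
    """Count only the tokens of `text` that belong to the vocabulary."""
    counts = {}
    for token in text.lower().split():
        if token in vocabulary:
            counts[token] = counts.get(token, 0) + 1
    return counts

vocabulary = keyword_vocabulary({"blog"}, date(2008, 2, 14))
bag = bag_of_words("Happy Valentine cheap cheap cards", vocabulary)
```

The vocabulary changes with the date, so a post full of Valentine keywords counts for more in February than it would in, say, July.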

Important Terms

  • Local models
  • Link-based models
  • Words
  • Word N-Grams
  • Tokenized Anchors
  • Tokenized URLs
  • Non-blogs
  • Keyword-stuffing
  • Post-stitching
  • Post-plagiarism
  • Post-weaving
  • Link-spam

Useful References
Java, A.; Kolari, P.; Finin, T.; Mayfield, J.; Joshi, A.; and Martineau, J. 2006. The UMBC/JHU blogvox system. In Proceedings of the Fifteenth Text Retrieval Conference.


Just another Research Blog!

February 19, 2008
<?php
echo "Hello world!";
?>

As you can see from the title of the blog, my name is Pedram Hayati (AKA Pedi), and in this blog I am going to gather research material on Web Spam, which is the topic of my Master's thesis at Curtin University of Technology, Perth, WA.

The idea for this blog mainly comes from my supervisor, Dr. Vidy Potdar, who suggested keeping an online literature review for my thesis.

Your contributions, comments, feedback, … are highly welcome!