Review: Developing a Framework for Assessing Information Quality on the World Wide Web

March 19, 2008
Authors: Shirlee-ann Knight and Janice Burn
Year: 2005
Published in: Informing Science Journal Volume 8
Link: http://inform.nu/Articles/Vol8/v8p159-172Knig.pdf
Importance: Medium 

Abstract

The rapid growth of the Internet as an environment for information exchange and the lack of enforceable standards regarding the information it contains has lead to numerous information quality problems. A major issue is the inability of Search Engine technology to wade through the vast expanse of questionable content and return “quality” results to a user’s query. This paper attempts to address some of the issues involved in determining what quality is, as it pertains to information retrieval on the Internet. The IQIP model is presented as an approach to managing the choice and implementation of quality related algorithms of an Internet crawling Search Engine.

My Review

In this paper the authors discuss the problem of information quality on the WWW from the perspective of search engines. They clearly define the problem and describe current solutions. Their proposed model (IQIP) consists of four parts:

  • Identify: user, environment and task
  • Quantify: Prioritise information quality dimensions
  • Implement: implement chosen IQ dimension into Web Crawler
  • Perfect: improve crawler through feedback

Their proposed model can be used for attacking spam on the WWW. As my supervisor (Dr. Potdar) suggested, we can make use of this model in anti-spam methods. A simple example:

  • Identify: here we study spammers' behavior, their subjects, …
  • Environment: covers the study of Splogs, Sforums, spam pages, …
  • Task: spam detection based on spam characteristics already described in the literature.

More of this review coming soon…


Review: Learning Fast Classifiers for Image Spam

March 17, 2008
Authors: Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach
Year: 2007
Published in: CEAS 2007
Link: http://www.ceas.cc/2007/papers/paper-06.pdf
Importance: Very High

Abstract

Recently, spammers have proliferated “image spam”, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.

My Review

In the introduction, the authors mention that the cornerstone of many anti-spam email systems is content filtering, often based on naive Bayes. Spammers try to confuse content filters by:

  • Obscuring text
  • Obfuscating words with symbols
  • Including neutral text

Spammers now use a new method: advertising in image attachments instead of text, which neutralizes text-based filtering. So the authors suggest a new method or, better to say, a set of new features for detecting these spam images.

Features extracted from images for spam analysis:

  • File Format
  • File Size
  • Image Metadata: comments, number of images (frames), bits per pixel, progressive flag, color table entries, indexvalue, transparent color, logical height and width, components, bands, …
  • Image Size
  • Average Color
  • Color Saturation
  • Edge Detection
  • Prevalent Color Coverage
  • Random Pixel Test
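
To make a few of these features concrete, here is a minimal sketch of how Average Color, Color Saturation and Prevalent Color Coverage might be computed. It assumes the image has already been decoded into a list of 8-bit (R, G, B) tuples; the function names and the exact definitions are my own assumptions and may differ from what the paper uses.

```python
import colorsys
from collections import Counter


def average_color(pixels):
    """Mean (R, G, B) over a list of 8-bit RGB tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def mean_saturation(pixels):
    """Mean HSV saturation in [0, 1] over all pixels."""
    total = 0.0
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        total += s
    return total / len(pixels)


def prevalent_color_coverage(pixels):
    """Fraction of pixels that share the single most common color."""
    most_common_count = Counter(pixels).most_common(1)[0][1]
    return most_common_count / len(pixels)


def extract_features(pixels, width, height, file_size_bytes):
    """Bundle a few cheap features into a dict (hypothetical layout)."""
    return {
        "file_size": file_size_bytes,
        "image_size": (width, height),
        "average_color": average_color(pixels),
        "color_saturation": mean_saturation(pixels),
        "prevalent_color_coverage": prevalent_color_coverage(pixels),
    }


if __name__ == "__main__":
    # A tiny 2x2 "image": three red pixels and one white pixel.
    demo = [(255, 0, 0)] * 3 + [(255, 255, 255)]
    print(extract_features(demo, width=2, height=2, file_size_bytes=120))
```

All of these need only one pass over the pixels, which fits the paper's emphasis on simple properties that keep classification fast.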

The authors provide a very interesting equation for selecting features that are both accurate and fast. The selected features are:

  • Average Color
  • Color Saturation
  • Edge Detection
  • File Format
  • File Size
  • Image Metadata
  • Image Size
  • Prevalent Color Coverage
  • Random Pixel Test
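
This review does not reproduce the paper's selection equation, so the sketch below only illustrates the general idea of trading predictive power against speed. It assumes one simple scoring rule, information gain divided by extraction cost. That ratio, the function names, and the example costs are my own assumptions, not the authors' formula.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def rank_by_gain_per_cost(columns, labels, cost_ms):
    """Sort feature names by information gain per millisecond of cost.

    The gain/cost ratio is an illustrative assumption, not the paper's rule.
    """
    scores = {
        name: information_gain(values, labels) / cost_ms[name]
        for name, values in columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    labels = ["spam", "spam", "ham", "ham"]
    columns = {
        "file_format": ["gif", "gif", "jpg", "jpg"],      # cheap, informative
        "edge_detection": ["hi", "hi", "lo", "lo"],       # costly, informative
        "random_pixel_test": ["a", "b", "a", "b"],        # cheap, uninformative
    }
    cost_ms = {"file_format": 1.0, "edge_detection": 50.0, "random_pixel_test": 1.0}
    print(rank_by_gain_per_cost(columns, labels, cost_ms))
```

With a rule like this, a cheap feature such as the file format can outrank an equally informative but slow one such as edge detection.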

Important Terms

  • Maximum Entropy
  • Naive Bayes
  • SpamArchive
  • ID3 Decision Tree
  • Just In Time decision tree
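
The abstract describes Just in Time extraction as creating features at classification time as needed by the classifier. A minimal sketch of that idea follows: a decision tree walks from the root and asks for a feature only when it reaches a node that tests it, so features off the chosen path are never computed. The tree layout, thresholds, feature names and the `LazyFeatures` helper are all hypothetical, chosen only for illustration.

```python
class LazyFeatures:
    """Computes each feature on first access and caches the result."""

    def __init__(self, extractors, image):
        self._extractors = extractors  # feature name -> function(image)
        self._image = image
        self._cache = {}
        self.computed = []             # records which features were extracted

    def __getitem__(self, name):
        if name not in self._cache:
            self._cache[name] = self._extractors[name](self._image)
            self.computed.append(name)
        return self._cache[name]


def classify(node, features):
    """Walk a nested-dict decision tree; leaves are plain label strings."""
    while isinstance(node, dict):
        value = features[node["feature"]]  # extraction happens here, on demand
        node = node["low"] if value <= node["threshold"] else node["high"]
    return node


# Hypothetical extractors: here the "image" is a dict of precomputed numbers.
EXTRACTORS = {
    "file_size": lambda img: img["file_size"],
    "edge_density": lambda img: img["edge_density"],
}

# Hypothetical tree: test the cheap feature first, the costly one only if needed.
TREE = {
    "feature": "file_size",
    "threshold": 10_000,
    "low": "ham",
    "high": {
        "feature": "edge_density",
        "threshold": 0.2,
        "low": "ham",
        "high": "spam",
    },
}

if __name__ == "__main__":
    small = LazyFeatures(EXTRACTORS, {"file_size": 5_000, "edge_density": 0.9})
    print(classify(TREE, small), small.computed)
    large = LazyFeatures(EXTRACTORS, {"file_size": 50_000, "edge_density": 0.9})
    print(classify(TREE, large), large.computed)
```

For the small image the tree stops at the root, so the edge feature is never extracted. That skipped work is where the speed-up would come from.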