Viewgraphs with graphics can be found at

http://www-db.stanford.edu/pub/gio/personal/WIPE-NRC.ppt

The presentation below was delivered by Michel Bilello, Stanford University

Advanced Techniques for Automatic Web Filtering

James Z. Wang

PNC Tech. Career Dev. Professor

Penn State University

Joint Work: Jia Li, Assist. Prof., Penn State Statistics

Gio Wiederhold, Prof., Stanford Computer Science

http://wang.ist.psu.edu/

Outline

  • The problem
  • Related approaches
  • Filtering based on image content
  • Goals and methods
  • The WIPE system
  • Experimental results
  • Web site classification by image content
  • Conclusions and future work

The Size and Content of the Web

02/99: 16 million total Web servers

  • Estimated total number of pages on the Web: 800 million
  • 15 terabytes of text (comparable to the text of the Library of Congress)
  • Year 2001: 3 to 5 billion pages

Lawrence, Giles, Nature, 1999.

Pornography-free Web sites

  • E.g. Yahoo!Kids, disney.com
  • Useful in protecting those children too young to know how to use the Web browser
  • It is difficult to control access to other sites

Text-based Filtering

  • E.g. NetNanny, Cyber Patrol, CyberSitter
  • Methods:
      • Store more than 10,000 IPs
      • Blocking based on keywords
      • Block all image access
  • Problems:
      • Internet is dynamic
      • Keywords are not enough (e.g. text incorporated in images)
      • Images are needed for all net users
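The address-list and keyword methods listed above can be sketched in a few lines. This is only an illustration of the general idea: the addresses, keywords, and the function name `is_blocked` are hypothetical placeholders and are not taken from any of the products named.

```python
import re

# Hypothetical placeholders; a real product would ship far larger, curated lists.
BLOCKED_IPS = {"192.0.2.10", "198.51.100.7"}
BLOCKED_KEYWORDS = {"xxx", "adult"}

def is_blocked(ip: str, page_text: str) -> bool:
    """Block when the server address is listed or the page text contains a keyword."""
    if ip in BLOCKED_IPS:
        return True
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return not words.isdisjoint(BLOCKED_KEYWORDS)

print(is_blocked("192.0.2.10", "Welcome"))                   # listed address
print(is_blocked("203.0.113.5", "Adult content"))            # keyword in the text
print(is_blocked("203.0.113.5", "<img src='banner.gif'>"))   # text hidden in an image
```

The last call shows the weakness noted in the slide: a page whose wording lives inside an image gives a keyword filter nothing to match.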

Classification of Web Community

  • Flake, Lawrence, Giles, ACM KDD, 2000
  • Graph clustering based on max-flow/min-cut analysis of Web connectivity
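As a rough illustration of the max-flow/min-cut idea, the sketch below treats hyperlinks as undirected edges of unit capacity, runs a breadth-first augmenting-path max flow from a seed page to a distant page, and returns the pages left on the seed's side of the minimum cut. The single seed, the single sink, the unit capacities, and the function name `min_cut_community` are simplifying assumptions for this example, not details of the cited work.

```python
from collections import defaultdict, deque

def min_cut_community(edges, source, sink):
    """Return the nodes on the source side of a minimum source-sink cut."""
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1   # each link is treated as undirected
        cap[v][u] += 1   # with unit capacity (an assumption)

    def reachable_with_parents():
        # Breadth-first search over edges that still have residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_with_parents()
        if sink not in parent:
            # No augmenting path is left: the reachable set is the source side.
            return set(parent)
        # Walk back from the sink to recover the augmenting path.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= bottleneck
            cap[b][a] += bottleneck

# Two tightly linked groups of pages joined by a single link (c to x).
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]
print(sorted(min_cut_community(links, "a", "z")))
```

On this toy graph the cheapest cut is the single bridging link, so the community around the seed "a" is the first group of three pages.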

Goals and Methods

  • The problem comes from images, so we deal with images
  • Goals: use machine learning and image retrieval to classify Web images and Web sites
  • Requirements: high accuracy and high speed
  • Challenges: non-uniform image background, textual noise in foreground, wide range of image quality, wide range of camera positions, wide range of composition…

The WIPE System

  • Inspired by UC Berkeley’s FNP System
      • Detailed analysis of images
      • Skin filter and human figure grouper
      • Speed: 6 minutes of CPU time per image
      • Accuracy: 52% sensitivity and 96% specificity
  • Stanford WIPE System
      • Wavelet-based feature extraction + image classification + integrated region matching + machine learning
      • Speed: < 1 second of CPU time per image
      • Accuracy: 96% sensitivity and 91% specificity
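For readers unfamiliar with the two accuracy measures quoted above: sensitivity is the share of objectionable images that the filter flags, and specificity is the share of benign images that it lets through. The short sketch below computes both from confusion counts. The counts are hypothetical, chosen only so that the results match the 96% and 91% figures; they are not the experimental data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of objectionable images that were flagged."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of benign images that were passed."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts: 1,000 objectionable and 10,000 benign test images.
print(sensitivity(true_pos=960, false_neg=40))     # 0.96
print(specificity(true_neg=9100, false_pos=900))   # 0.91
```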

System Flow

Wavelet Principle
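The figure for this slide is not reproduced here. As a stand-in, the following sketch shows one level of a two-dimensional wavelet transform using the Haar wavelet, which is the simplest case. Choosing Haar is an assumption made for brevity; the slides do not say which wavelet WIPE uses, and the function names are illustrative.

```python
def haar_1d(values):
    """One Haar level: pairwise averages (low band), then pairwise differences (high band)."""
    avgs = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    diffs = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return avgs + diffs

def haar_2d(image):
    """Apply one Haar level to every row, then to every column."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

# A 4x4 grayscale patch with a vertical edge before the last column.
patch = [[10, 10, 10, 50]] * 4
coeffs = haar_2d(patch)
for row in coeffs:
    print(row)
```

The top-left quadrant of the result holds a half-resolution average of the patch. The other quadrants hold detail coefficients, which are nonzero only where the intensity changes, so they give a compact description of edges and texture.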

Type Classification

  • Graphs: manually generated images with smooth tones.
  • Photographs: images with continuous tones.

Photo Classification

Experimental Results

  • Tested on a set of over 10,000 photographic images
  • Speed: Less than one second of response time on a Pentium III PC
  • Accuracy

Comment on Accuracy

  • The algorithm can be adjusted to trade off specificity for higher sensitivity
  • In a real-world filtering application, both the sensitivity and the specificity are expected to be higher
  • Icons and graphs can be classified with almost 100% accuracy → higher specificity
  • Combining text and image classification → higher sensitivity and higher speed
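The trade-off in the first point can be seen by moving a decision threshold over classifier scores. In this minimal sketch, each image is assumed to receive a score between 0 and 1 and is flagged when the score reaches the threshold. The scores and the function name `rates` are invented for illustration and do not come from WIPE.

```python
def rates(objectionable_scores, benign_scores, threshold):
    """Return (sensitivity, specificity) when images scoring >= threshold are flagged."""
    flagged = sum(s >= threshold for s in objectionable_scores)
    passed = sum(s < threshold for s in benign_scores)
    return flagged / len(objectionable_scores), passed / len(benign_scores)

# Hypothetical classifier scores for five objectionable and five benign images.
objectionable = [0.9, 0.8, 0.7, 0.55, 0.4]
benign = [0.1, 0.2, 0.3, 0.45, 0.6]

for threshold in (0.65, 0.5, 0.35):
    sens, spec = rates(objectionable, benign, threshold)
    print(f"threshold={threshold}: sensitivity={sens}, specificity={spec}")
```

Lowering the threshold from 0.65 to 0.35 raises sensitivity from 0.6 to 1.0 while specificity falls from 1.0 to 0.6.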

False Classifications

Benign Images

False Classifications

Objectionable Images

Web Site Classification by Image Content

  • An objectionable site will have many such images
  • For a given objectionable Web site, let p denote the probability that an image on the site is objectionable
  • Equivalently, p is the fraction of objectionable images among all images provided by the site
  • We assume some distributions of p over all Web sites (e.g., Gaussian, shifted Gaussian)
  • Classification levels could be provided as a service to filtering software producers

Flow in Web site classification

Web site Classification

  • Based on statistical analysis (see paper), we can expect better than 97% accuracy on Web site classification if:
      • we download 20-35 images for each site
      • we classify a Web site as objectionable if 20-25% of the downloaded images are objectionable
  • Using text and IP addresses as additional criteria can improve the accuracy further
      • e.g., skip IPs for museums, dog shows, beach towns, sports events
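To see why a few dozen images per site can be enough, the sketch below uses a simple binomial model. It assumes that images are classified independently, that the per-image rates are the 96% sensitivity and 91% specificity quoted earlier, and that p is fixed for each site. The paper's analysis instead places a distribution on p, so this is a simplified illustration rather than the authors' calculation, and the example values of p are hypothetical.

```python
from math import comb

def flag_probability(p, sensitivity=0.96, specificity=0.91):
    """Chance that one downloaded image is flagged when a fraction p is objectionable."""
    return p * sensitivity + (1 - p) * (1 - specificity)

def site_flagged_probability(p, n_images, min_flagged):
    """Chance that at least min_flagged of n_images are flagged (binomial upper tail)."""
    q = flag_probability(p)
    return sum(comb(n_images, k) * q**k * (1 - q)**(n_images - k)
               for k in range(min_flagged, n_images + 1))

# 30 images per site; call the site objectionable when 7 or more (about 23%) are flagged.
n_images, min_flagged = 30, 7
caught = site_flagged_probability(0.5, n_images, min_flagged)        # hypothetical objectionable site
false_alarm = site_flagged_probability(0.0, n_images, min_flagged)   # site with no objectionable images
print(f"objectionable site flagged: {caught:.4f}")
print(f"benign site flagged:        {false_alarm:.4f}")
```

Under these assumptions the objectionable site is flagged almost always, while the benign site is flagged only rarely, because per-image errors average out over the sample.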

Conclusions and Future Work

  • Perfect filtering is never possible
  • Effective filtering based on image content is feasible with the current technology
  • Systems that combine content-based filtering with text-based criteria will have good accuracy and acceptable speed
  • Objectionable Web sites can be identified automatically; could this be offered as a service for the community?
  • The technology can still be improved through further research.

The National Academies

Copyright 2003 National Academy of Sciences. All rights reserved.
500 Fifth Street, N.W., Washington, DC 20001