Viewgraphs with graphics can be found at http://www-db.stanford.edu/pub/gio/personal/WIPE-NRC.ppt
Presentation below was delivered by Michel Bilello, Stanford University
Advanced Techniques for Automatic Web Filtering
James Z. Wang, PNC Tech. Career Dev. Professor, Penn State University
Joint work: Jia Li, Assist. Prof., Penn State Statistics; Gio Wiederhold, Prof., Stanford Computer Science
http://wang.ist.psu.edu/
Outline - The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Web site classification by image content
- Conclusions and future work
The Size and Content of the Web - 02/99: 16 million total Web servers
- Estimated total number of pages on the Web: 800 million
- 15 terabytes of text (comparable to the text of the Library of Congress)
- Year 2001: 3 to 5 billion pages
Lawrence, Giles, Nature, 1999.
Outline - The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Web site classification by image content
- Conclusions and future work
Pornography-free Web sites - E.g. Yahoo!Kids, disney.com
- Useful in protecting children who are too young to know how to use a Web browser
- It is difficult to control access to other sites
Text-based Filtering - E.g. NetNanny, Cyber Patrol, CyberSitter
- Methods:
- Store more than 10,000 IPs
- Blocking based on keywords (sketched after this list)
- Block all image access
- Problems:
- Internet is dynamic
- Keywords are not enough (e.g. text incorporated in images)
- Blocking all images is not viable: images are needed by all net users
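As a rough illustration of the IP-list and keyword methods listed above, here is a minimal sketch in Python; the addresses and keywords are invented placeholders, not entries from any of the products named.

```python
# Minimal sketch of text/IP-based blocking; lists are illustrative placeholders.
BLOCKED_IPS = {"203.0.113.7", "198.51.100.22"}   # stand-in for a stored list of >10,000 IPs
BLOCKED_KEYWORDS = {"keyword1", "keyword2"}      # stand-in for a keyword list

def allow_request(ip: str, page_text: str) -> bool:
    """Block a request by stored IP, or by keywords found in the page text."""
    if ip in BLOCKED_IPS:
        return False
    text = page_text.lower()
    return not any(kw in text for kw in BLOCKED_KEYWORDS)

# Text rendered inside images never appears in page_text, so keyword matching
# alone misses it -- the limitation noted in the slide above.
```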
Classification of Web Community - Flake, Lawrence, Giles, ACM KDD, 2000
- Graph clustering based on max-flow/min-cut analysis of Web connectedness (a toy sketch follows)
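As a toy illustration of the max-flow/min-cut idea (not the KDD 2000 paper's actual construction, which grows communities from seed pages with virtual source and sink nodes), here is a minimal sketch assuming networkx and an invented five-page link graph.

```python
# Toy min-cut community extraction; the graph and page names are invented.
import networkx as nx

G = nx.DiGraph()
intra = [("seed", "a"), ("a", "b"), ("b", "seed")]   # tightly linked pages
for u, v in intra:                                   # link them in both directions
    G.add_edge(u, v, capacity=1.0)
    G.add_edge(v, u, capacity=1.0)
G.add_edge("a", "outside", capacity=1.0)             # one weak link outward
G.add_edge("outside", "rest_of_web", capacity=1.0)

# The community is the source side of a minimum s-t cut between a seed page
# and a node standing in for the rest of the Web.
cut_value, (community, rest) = nx.minimum_cut(G, "seed", "rest_of_web")
print(cut_value, sorted(community))   # 1.0 ['a', 'b', 'seed']
```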
Outline - The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Web site classification by image content
- Conclusions and future work
Goals and Methods - The problem comes from images, so we deal with images
- Goals: use machine learning and image retrieval to classify Web images and Web sites
- Requirements: high accuracy and high speed
- Challenges: non-uniform image background, textual noise in foreground, wide range of image quality, wide range of camera positions, wide range of composition…
The WIPE System - Inspired by UC Berkeley's FNP system
- Detailed analysis of images
- Skin filter and human figure grouper
- Speed: 6 mins CPU time per image
- Accuracy: 52% sensitivity and 96% specificity
- Stanford WIPE System
- Wavelet-based feature extraction + image classification + integrated region matching + machine learning (a feature-extraction sketch follows this list)
- Speed: < 1 second CPU time per image
- Accuracy: 96% sensitivity and 91% specificity
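A minimal sketch of the wavelet feature-extraction step, assuming PyWavelets and a grayscale image array; the subband statistics, wavelet choice, and downstream classifier here are illustrative stand-ins rather than the exact features and training procedure of the WIPE paper.

```python
# Sketch of wavelet-based features for image classification (illustrative only).
import numpy as np
import pywt

def wavelet_features(gray_image, wavelet="db3", levels=3):
    """Return mean/std of each wavelet subband as a simple feature vector."""
    coeffs = pywt.wavedec2(np.asarray(gray_image, dtype=float), wavelet, level=levels)
    bands = [coeffs[0]]                      # coarse approximation
    for detail in coeffs[1:]:                # (horizontal, vertical, diagonal) per level
        bands.extend(detail)
    return np.array([stat for b in bands for stat in (b.mean(), b.std())])

# features = wavelet_features(img)  # then feed into a trained classifier (e.g., an SVM)
```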
System Flow
Wavelet Principle
Type Classification - Graphs: manually generated images with smooth tones
Type Classification - Photographs: images with continuous tones
Photo Classification
Experimental Results - Tested on a set of over 10,000 photographic images
- Speed: Less than one second of response time on a Pentium III PC
- Accuracy
Comment on Accuracy - The algorithm can be adjusted to trade off specificity for higher sensitivity (see the threshold sketch after this list)
- In a real-world filtering application, both sensitivity and specificity are expected to be higher
- Icons and graphs can be classified with almost 100% accuracy, giving higher specificity
- Combining text and image classification gives higher sensitivity and higher speed
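A minimal sketch of the sensitivity/specificity trade-off mentioned above, expressed as a decision threshold on per-image classifier scores; the scores, labels, and threshold are hypothetical, since the slides do not spell out WIPE's internal decision rule.

```python
# Sketch of trading specificity for sensitivity by moving a score threshold.
import numpy as np

def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = objectionable, 0 = benign; flag an image when score >= threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    flagged = scores >= threshold
    sensitivity = flagged[labels == 1].mean()     # objectionable images caught
    specificity = (~flagged[labels == 0]).mean()  # benign images passed through
    return sensitivity, specificity

# Lowering the threshold catches more objectionable images (higher sensitivity)
# at the cost of flagging more benign ones (lower specificity), and vice versa.
```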
False Classifications - Benign Images
False Classifications - Objectionable Images
Web site Classification by Image Content - An objectionable site will have many objectionable images
- For a given objectionable Web site, denote by p the probability that an image on the site is objectionable
- p is the fraction of objectionable images among all images provided by the site
- We assume some distributions of p over all Web sites (e.g., Gaussian, shifted Gaussian)
- Classification levels could be provided as a service to filtering software producers
Flow in Web site Classification
Web site Classification - Based on statistical analysis (see the paper; a simplified calculation is sketched after this list), we can expect higher than 97% accuracy on Web site classification if
- We download 20-35 images for each site
- We classify a Web site as objectionable if 20-25% of downloaded images are objectionable
- Using text and IP addresses as criteria, the accuracy can be further improved
- E.g., skip IPs for museums, dog shows, beach towns, sports events
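A simplified version of the site-level calculation above, assuming the per-image sensitivity (0.96) and specificity (0.91) quoted earlier, 25 downloaded images, and a 25% flagging threshold (values within the ranges on this slide); the paper's analysis additionally models the distribution of p across sites, which this plain binomial sketch ignores.

```python
# Sketch: probability that a site is flagged, given per-image error rates.
from math import comb

def prob_site_flagged(p, n=25, frac=0.25, sens=0.96, spec=0.91):
    """P(at least frac*n of n sampled images get flagged) when a fraction p
    of the site's images are truly objectionable."""
    q = p * sens + (1 - p) * (1 - spec)   # chance any one sampled image is flagged
    k = int(frac * n)
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k, n + 1))

print(prob_site_flagged(p=0.5))   # an objectionable site: chance it is (correctly) flagged
print(prob_site_flagged(p=0.0))   # a benign site: chance of a false alarm
```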
Outline - The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Web site classification by image content
- Conclusions and future work
Conclusions and Future Work - Perfect filtering is never possible
- Effective filtering based on image content is feasible with the current technology
- Systems that combine content-based filtering with text-based criteria will have good accuracy and acceptable speed
- Objectionable Web sites are automatically identifiable; could this be offered as a service to the community?
- The technology can still be improved through further research.
References