Viewgraphs with graphics can be found at http://www-db.stanford.edu/pub/gio/personal/WIPE-NRC.ppt
Presentation below was delivered by Michel Bilello, Stanford University
Advanced Techniques for Automatic Web Filtering
James Z. Wang, PNC Tech. Career Dev. Professor, Penn State University
Joint work: Jia Li, Assist. Prof., Penn State Statistics; Gio Wiederhold, Prof., Stanford Computer Science
http://wang.ist.psu.edu/
Outline - The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Web site classification by image content
- Conclusions and future work
The Size and Content of the Web - 02/99: 16 million total Web servers
- Estimated total number of pages on the Web: 800 million
- 15 terabytes of text (comparable to the text of the Library of Congress)
- Year 2001: 3 to 5 billion pages
Lawrence, Giles, Nature, 1999.
Outline - The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Web site classification by image content
- Conclusions and future work
Pornography-free Web sites - E.g. Yahoo!Kids, disney.com
- Useful in protecting children who are too young to know how to use a Web browser
- It is difficult to control access to other sites
Text-based Filtering - E.g. NetNanny, Cyber Patrol, CyberSitter
- Methods:
- Store more than 10,000 IPs
- Blocking based on keywords (sketched after this list)
- Block all image access
- Problems:
- Internet is dynamic
- Keywords are not enough (e.g. text incorporated in images)
- Blocking all images is not viable: images are needed by all net users
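As a rough illustration of the IP-list and keyword methods listed above, here is a minimal sketch in Python; the addresses and keywords are invented placeholders, not entries from any of the products named.

```python
# Minimal sketch of text/IP-based blocking; lists are illustrative placeholders.
BLOCKED_IPS = {"203.0.113.7", "198.51.100.22"}   # stand-in for a stored list of >10,000 IPs
BLOCKED_KEYWORDS = {"keyword1", "keyword2"}      # stand-in for a keyword list

def allow_request(ip: str, page_text: str) -> bool:
    """Block a request by stored IP, or by keywords found in the page text."""
    if ip in BLOCKED_IPS:
        return False
    text = page_text.lower()
    return not any(kw in text for kw in BLOCKED_KEYWORDS)

# Text rendered inside images never appears in page_text, so keyword matching
# alone misses it -- the limitation noted in the slide above.
```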
Classification of Web Community - Flake, Lawrence, Giles, ACM KDD, 2000
- Graph clustering based on max-flow/min-cut analysis of Web connectedness (a toy sketch follows)
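As a toy illustration of the max-flow/min-cut idea (not the KDD 2000 paper's actual construction, which grows communities from seed pages with virtual source and sink nodes), here is a minimal sketch assuming networkx and an invented five-page link graph.

```python
# Toy min-cut community extraction; the graph and page names are invented.
import networkx as nx

G = nx.DiGraph()
intra = [("seed", "a"), ("a", "b"), ("b", "seed")]   # tightly linked pages
for u, v in intra:                                   # link them in both directions
    G.add_edge(u, v, capacity=1.0)
    G.add_edge(v, u, capacity=1.0)
G.add_edge("a", "outside", capacity=1.0)             # one weak link outward
G.add_edge("outside", "rest_of_web", capacity=1.0)

# The community is the source side of a minimum s-t cut between a seed page
# and a node standing in for the rest of the Web.
cut_value, (community, rest) = nx.minimum_cut(G, "seed", "rest_of_web")
print(cut_value, sorted(community))   # 1.0 ['a', 'b', 'seed']
```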
Outline - The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Web site classification by image content
- Conclusions and future work
Goals and Methods - The problem comes from images, so we deal with images
- Goals: use machine learning and image retrieval to classify Web images and Web sites
- Requirements: high accuracy and high speed
- Challenges: non-uniform image background, textual noise in foreground, wide range of image quality, wide range of camera positions, wide range of composition…
The WIPE System - Inspired by UC Berkeley's FNP system
- Detailed analysis of images
- Skin filter and human figure grouper
- Speed: 6 mins CPU time per image
- Accuracy: 52% sensitivity and 96% specificity
- Stanford WIPE System
- Wavelet-based feature extraction + image classification + integrated region matching + machine learning (a feature-extraction sketch follows this list)
- Speed: < 1 second CPU time per image
- Accuracy: 96% sensitivity and 91% specificity
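A minimal sketch of the wavelet feature-extraction step, assuming PyWavelets and a grayscale image array; the subband statistics, wavelet choice, and downstream classifier here are illustrative stand-ins rather than the exact features and training procedure of the WIPE paper.

```python
# Sketch of wavelet-based features for image classification (illustrative only).
import numpy as np
import pywt

def wavelet_features(gray_image, wavelet="db3", levels=3):
    """Return mean/std of each wavelet subband as a simple feature vector."""
    coeffs = pywt.wavedec2(np.asarray(gray_image, dtype=float), wavelet, level=levels)
    bands = [coeffs[0]]                      # coarse approximation
    for detail in coeffs[1:]:                # (horizontal, vertical, diagonal) per level
        bands.extend(detail)
    return np.array([stat for b in bands for stat in (b.mean(), b.std())])

# features = wavelet_features(img)  # then feed into a trained classifier (e.g., an SVM)
```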
System Flow
Wavelet Principle
Type Classification - Graphs: manually generated images with smooth tones
Type Classification - Photographs: images with continuous tones
Photo Classification
Experimental Results - Tested on a set of over 10,000 photographic images
- Speed: Less than one second of response time on a Pentium III PC
- Accuracy
Comment on Accuracy - The algorithm can be adjusted to trade off specificity for higher sensitivity (see the threshold sketch after this list)
- In a real-world filtering application, both sensitivity and specificity are expected to be higher
- Icons and graphs can be classified with almost 100% accuracy, giving higher specificity
- Combining text and image classification gives higher sensitivity and higher speed
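A minimal sketch of the sensitivity/specificity trade-off mentioned above, expressed as a decision threshold on per-image classifier scores; the scores, labels, and threshold are hypothetical, since the slides do not spell out WIPE's internal decision rule.

```python
# Sketch of trading specificity for sensitivity by moving a score threshold.
import numpy as np

def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = objectionable, 0 = benign; flag an image when score >= threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    flagged = scores >= threshold
    sensitivity = flagged[labels == 1].mean()     # objectionable images caught
    specificity = (~flagged[labels == 0]).mean()  # benign images passed through
    return sensitivity, specificity

# Lowering the threshold catches more objectionable images (higher sensitivity)
# at the cost of flagging more benign ones (lower specificity), and vice versa.
```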
False Classifications - Benign Images
False Classifications - Objectionable Images
Web site Classification by Image Content - An objectionable site will have many objectionable images
- For a given objectionable Web site, denote by p the probability that an image on the site is objectionable
- p is the fraction of objectionable images among all images provided by the site
- We assume some distributions of p over all Web sites (e.g., Gaussian, shifted Gaussian)
- Classification levels could be provided as a service to filtering software producers
Flow in Web site Classification
Web site Classification - Based on statistical analysis (see the paper; a simplified calculation is sketched after this list), we can expect higher than 97% accuracy on Web site classification if
- We download 20-35 images for each site
- We classify a Web site as objectionable if 20-25% of downloaded images are objectionable
- Using text and IP addresses as criteria, the accuracy can be further improved
- E.g., skip IPs for museums, dog shows, beach towns, sports events
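A simplified version of the site-level calculation above, assuming the per-image sensitivity (0.96) and specificity (0.91) quoted earlier, 25 downloaded images, and a 25% flagging threshold (values within the ranges on this slide); the paper's analysis additionally models the distribution of p across sites, which this plain binomial sketch ignores.

```python
# Sketch: probability that a site is flagged, given per-image error rates.
from math import comb

def prob_site_flagged(p, n=25, frac=0.25, sens=0.96, spec=0.91):
    """P(at least frac*n of n sampled images get flagged) when a fraction p
    of the site's images are truly objectionable."""
    q = p * sens + (1 - p) * (1 - spec)   # chance any one sampled image is flagged
    k = int(frac * n)
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k, n + 1))

print(prob_site_flagged(p=0.5))   # an objectionable site: chance it is (correctly) flagged
print(prob_site_flagged(p=0.0))   # a benign site: chance of a false alarm
```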
Outline - The problem
- Related approaches
- Filtering based on image content
- Goals and methods
- The WIPE system
- Experimental results
- Web site classification by image content
- Conclusions and future work
Conclusions and Future Work - Perfect filtering is never possible
- Effective filtering based on image content is feasible with the current technology
- Systems that combine content-based filtering with text-based criteria will have good accuracy and acceptable speed
- Objectionable Web sites are automatically identifiable; could this be offered as a service to the community?
- The technology can still be improved through further research.
References