Effective Web-Scale Crawling Through Website Analysis

  • Ivan Gonzalez, Carnegie Mellon University, USA
  • Adam Marcus, Rensselaer Polytechnic Institute, USA
  • Daniel Meredith, IBM Almaden Research Center, USA
  • Linda Nguyen, IBM Almaden Research Center, USA

Track: Posters

Web crawlers are usually divided into two broad classes: full web crawlers, and focused crawlers that target specific sites or pages. This paper presents an overview of, and experimental results for, a self-focusing crawler. The system begins as a full web crawl, configured with a set of features that are of interest to the crawler's client. As it moves through the general web, the crawl systematically samples and analyzes web sites, biasing its effort toward sites that exhibit the specified features. The crawler employs lightweight heuristics and a unique architecture that allow it to accurately score previously unseen pages from a known site without requiring a record for every page on the World Wide Web.
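
The abstract does not spell out the heuristics or the architecture, so the following is only a minimal sketch of the general idea of site-level scoring, not the authors' implementation. It assumes that a "site" is identified by its hostname, that each sampled page has already been judged relevant or not by some feature test, and that an unseen page is scored by a smoothed fraction of relevant samples from its site. The names `SiteRecord` and `SiteScorer` and the smoothing prior are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class SiteRecord:
    """Per-site sampling statistics (hypothetical structure)."""
    sampled: int = 0   # pages fetched and analyzed from this site
    relevant: int = 0  # sampled pages that showed the features of interest


class SiteScorer:
    """Keeps one small record per site rather than one record per page."""

    def __init__(self, prior_relevant=1.0, prior_total=2.0):
        # Assumed smoothing prior: an unknown site scores 0.5 by default.
        self.sites = {}
        self.prior_relevant = prior_relevant
        self.prior_total = prior_total

    @staticmethod
    def site_of(url):
        # Assumption: the hostname stands in for the "site".
        return urlsplit(url).netloc.lower()

    def observe(self, url, is_relevant):
        """Record the outcome of analyzing one sampled page."""
        rec = self.sites.setdefault(self.site_of(url), SiteRecord())
        rec.sampled += 1
        rec.relevant += int(is_relevant)

    def score(self, url):
        """Estimate relevance of a page, seen or unseen, from its site's samples."""
        rec = self.sites.get(self.site_of(url), SiteRecord())
        return (rec.relevant + self.prior_relevant) / (rec.sampled + self.prior_total)


scorer = SiteScorer()
scorer.observe("http://a.example/page1", True)
scorer.observe("http://a.example/page2", True)
scorer.observe("http://b.example/page1", False)

# Unseen pages inherit their site's estimate, so a crawl frontier ordered by
# this score would favor a.example over b.example.
score_a = scorer.score("http://a.example/never-fetched")
score_b = scorer.score("http://b.example/never-fetched")
```

In this sketch, memory grows with the number of sites sampled rather than the number of pages, which is one way to read the abstract's point about not needing a record for every page.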
