GoGetIt!: Structure-Driven Crawler Generation by Example

  • Marcio Vidal, Universidade Federal do Amazonas, Brazil
  • Altigran Soares da Silva, Universidade Federal do Amazonas, Brazil
  • Edleno Silva de Moura, Universidade Federal do Amazonas, Brazil
  • Joao Marcos Bastos Cavalcanti, Universidade Federal do Amazonas, Brazil

Track: Posters

Many applications on the Web process collections of similar pages obtained from Web sites, with the ultimate goal of taking advantage of the valuable information these pages implicitly contain. For some of these applications, the criteria that determine whether a page belongs in a collection are related to the page contents. However, there are important situations in which the inner structure of the pages provides a better criterion to guide the crawling process than their contents. In this paper, we present GoGetIt!, a tool for generating structure-driven crawlers that requires minimal effort from users. The tool takes as input a sample page and an entry point to a Web site and generates a structure-driven crawler based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach the pages that are structurally similar to the sample page. In the experiments we have performed, structure-driven crawlers generated by GoGetIt! were able to collect all pages that match the samples given, including pages added to the sites after the crawlers were generated.
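The idea of a navigation pattern can be illustrated with a minimal Python sketch. It is not the GoGetIt! implementation: the abstract does not say how structural similarity is measured or how link patterns are represented, so the sketch assumes a Jaccard similarity over sets of tag paths and a list of regular expressions over URLs. The names `tag_paths`, `structural_similarity`, `crawl`, the 0.8 threshold, and the in-memory `SITE` are all hypothetical, and the pattern is written by hand here rather than generated from the sample page.

```python
import re
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML page."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard similarity of tag-path sets (an assumed measure, in [0, 1])."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def crawl(site, entry_point, navigation_pattern, sample_html, threshold=0.8):
    """Follow one link regex per level, then keep pages similar to the sample.

    `site` maps a URL to (html, outgoing_links); it stands in for real fetching.
    """
    frontier = [entry_point]
    for link_regex in navigation_pattern:
        next_frontier = []
        for url in frontier:
            _, links = site[url]
            next_frontier.extend(l for l in links if re.fullmatch(link_regex, l))
        frontier = list(dict.fromkeys(next_frontier))  # de-duplicate, keep order
    return [u for u in frontier
            if structural_similarity(site[u][0], sample_html) >= threshold]


# Hypothetical site: an entry page, an author page, and three candidate pages.
BOOK = "<html><body><h1>{}</h1><table><tr><td>x</td></tr></table></body></html>"
SITE = {
    "/": ("<html><body><ul><li>a</li></ul></body></html>",
          ["/authors/a", "/about"]),
    "/authors/a": ("<html><body><ul><li>b</li></ul></body></html>",
                   ["/book/1", "/book/2", "/book/3"]),
    "/book/1": (BOOK.format("One"), []),
    "/book/2": (BOOK.format("Two"), []),
    "/book/3": ("<html><body><p>Out of print</p></body></html>", []),
}
SAMPLE = BOOK.format("Sample")
PATTERN = [r"/authors/\w+", r"/book/\d+"]

collected = crawl(SITE, "/", PATTERN, SAMPLE)
```

In this sketch, `/book/3` matches the link pattern but is discarded because its structure differs from the sample. A page added later under `/authors/a` with the same layout as the sample would be collected without changing the pattern, which is the behavior the abstract reports for pages added after crawler generation.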

