Towards Practical Genre Classification of Web Documents

  • George Ferizis, CSIRO ICT Centre, Canberra, Australia
  • Peter Bailey, CSIRO ICT Centre, Canberra, Australia

Track: Posters

Classification of documents by genre is typically done using either linguistic analysis or term-frequency-based techniques. The former provides better classification accuracy than the latter, but at the cost of two orders of magnitude more computation time. While term frequency analysis requires far fewer computational resources than linguistic analysis, it returns poor classification accuracy when the genres are not sufficiently distinct. A method that removes or approximates the expensive portions of linguistic analysis is presented. The accuracy and computation time of this method are then compared with those of both linguistic analysis and term frequency analysis. The results in this paper show that this method can significantly reduce the computation time of both linguistic analysis and term frequency analysis, while retaining an accuracy that is higher than that of term frequency analysis.
