Yet Another n-gram Based Language Classification

Andrija Tomovic and Predrag Janicic

The 10th Congress of the Italian Association for Artificial Intelligence (AIIA 2007)
Roma, Italy, September 10-13, 2007


Abstract

Rapid classification of documents by language is of high importance in many multilingual settings (such as international institutions or Internet search engines). This has been a well-known problem for years, addressed by different techniques with excellent results. We address it with a simple n-gram-based technique, a variation on existing techniques of this family. Our n-gram-based classification is very robust and successful, even for 20-fold classification and even for short text strings. We give a detailed study for different string lengths and n-gram sizes, and we explore which classification parameters give the best performance. The method requires no vocabularies, only a few training documents. As the main corpus, we used an EU set of documents in 20 languages. Experimental comparisons show that our approach gives better results than four other popular approaches.
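
The abstract does not spell out the classification procedure, so the following is only a minimal sketch of the general family of character n-gram language classifiers, not the authors' method. It assumes relative n-gram frequencies as language profiles and cosine similarity as the comparison measure; the function names, the tiny training strings, and the default n = 3 are all hypothetical choices made for illustration.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in text."""
    # Lowercase and collapse whitespace so profiles are comparable.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def classify(text, profiles, n=3):
    """Return the language whose profile is most similar to the text."""
    target = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))


# Hypothetical toy training data; a real setup would use a few
# full training documents per language, as the abstract describes.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "italian": "la volpe veloce salta sopra il cane pigro "
               "e poi il cane dorme al sole della sera",
}
PROFILES = {lang: ngram_profile(doc) for lang, doc in TRAINING.items()}

if __name__ == "__main__":
    print(classify("the dog and the fox", PROFILES))
    print(classify("il cane e la volpe", PROFILES))
```

Only training text is needed to build the profiles, which matches the abstract's point that no vocabularies are required. The n-gram size and the similarity measure are the kind of parameters the study varies; the values used here are placeholders.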


  