Turning Yahoo into an Automatic Web-Page Classifier
Dunja Mladenic
The paper describes an approach to automatic Web-page
classification based on the Yahoo hierarchy.
Machine learning techniques developed for learning on text data
are used here on the hierarchical classification structure.
The large number of features is reduced by exploiting the hierarchical
structure and by applying feature subset selection based on a method known
from information retrieval. Documents are represented
as feature vectors that include n-grams rather than only the
single words (unigrams) commonly used when learning on text data.
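As a minimal illustration of such a representation, the following Python sketch counts word n-grams in a document. The function name `ngram_features`, the whitespace tokenization, and the limit `max_n=2` are assumptions made for the example and are not taken from the paper.

```python
from collections import Counter

def ngram_features(text, max_n=2):
    """Count all word n-grams of length 1..max_n in a text (hypothetical helper)."""
    tokens = text.lower().split()  # assumed tokenization: lowercase, split on whitespace
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

features = ngram_features("machine learning on text data")
print(features["machine learning"])  # a bigram feature next to the unigrams
```

With `max_n=1` the sketch falls back to the plain unigram representation.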
Based on the hierarchical structure, the problem is divided
into subproblems,
each representing one of the categories included in the Yahoo hierarchy.
The result of learning is a set of independent classifiers, each used
to predict the probability that a new example is a member of
the corresponding category.
Experimental evaluation on real-world data shows that the proposed approach
gives good results.
For more than half of the test examples, a correct category
is among the three categories with the highest predicted probability.
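The evaluation criterion above can be sketched as follows, assuming each test example comes with a dictionary of per-category probabilities produced by the independent classifiers. The function names, the category labels, and the probability values are hypothetical and serve only to show the computation; they are not results from the paper.

```python
def top_k_categories(probabilities, k=3):
    """Return the k categories with the highest predicted probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [category for category, _ in ranked[:k]]

def top_k_accuracy(predictions, true_categories, k=3):
    """Fraction of examples whose true category is among the top k predictions."""
    hits = sum(
        1
        for probs, truth in zip(predictions, true_categories)
        if truth in top_k_categories(probs, k)
    )
    return hits / len(true_categories)

# Made-up outputs of four independent category classifiers for two test examples.
predictions = [
    {"Science": 0.6, "Arts": 0.2, "Sports": 0.1, "Health": 0.05},
    {"Science": 0.1, "Arts": 0.2, "Sports": 0.3, "Health": 0.4},
]
true_categories = ["Arts", "Science"]

accuracy = top_k_accuracy(predictions, true_categories, k=3)
print(accuracy)
```

Because the classifiers are independent, the predicted probabilities need not sum to one; only their ranking matters for this criterion.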