Word sequences as features in text-learning
Dunja Mladenic, Marko Grobelnik
This paper proposes an efficient algorithm for the generation of new features
that enrich the known bag-of-words document representation.
New features are generated based on word sequences of different lengths.
Learning is performed using a Naive Bayesian classifier on feature vectors,
where only highly scored features are used.
The performance of the enriched document representation is evaluated
on the problem of automatic document categorization using the Yahoo text hierarchy.
Our experiments show that using word sequences of length up to 3 instead of
using only single words improves performance,
while longer sequences on average have no influence on performance.
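
To make the enriched representation concrete, the following is a minimal Python sketch of counting word sequences of length 1 to 3 in a document. It is a naive illustration only and not the efficient generation algorithm proposed in the paper. The function name word_sequence_features, the whitespace tokenization, and the lowercasing are assumptions made for this example, and feature scoring and selection are not shown.

```python
from collections import Counter


def word_sequence_features(text, max_len=3):
    """Count every contiguous word sequence of length 1..max_len in text.

    Hypothetical helper for illustration: tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_len + 1):
        # Slide a window of n words across the document.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


features = word_sequence_features("machine learning on text data")
```

With max_len=1 the function reduces to the plain bag-of-words counts, so the longer sequences are strictly additional features on top of the single words.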