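The utility performs a hierarchical K-Means clustering procedure on the input file ("-i") in the Bag-Of-Words format ".Bow". It produces three types of output files: a binary representation of the partition ("-op"), a textual description of the results ("-ot"), and the results in XML form ("-ox").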
The parameter "-docs" determines the number of documents to cluster (the value
"-1" means all documents). The parameter "-clusts" determines the final number
of clusters. The parameter "-rseed" determines the random-number-generator
seed, where the value 0 means a nondeterministic seed. The parameter "-mncdocs"
determines the document-count threshold for splitting: a cluster with fewer
documents than this value is not split further. The parameter "-ctrials"
determines the number of different runs (trials) of the K-Means algorithm in
the search for the best solution. The parameter "-ceps" determines the
convergence epsilon, which influences the stopping criterion of the K-Means
algorithm. The parameter "-cutww" determines the percentage of the sum of the
word weights covered by the best words of each centroid that appear in the
textual output file. The parameter "-mnwfq" determines the minimal document
frequency of the words used for the document representation. The parameter
"-propwgt" determines the mode of propagating the word weights when executing
the top-down 2-means clustering algorithm.
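The "-cutww" cutoff described above can be illustrated with a small Python sketch. It is only one plausible reading of the parameter, not the utility's actual code: the function name cut_words_by_weight_sum and the sample centroid weights are hypothetical, and the sketch assumes that words are listed from the heaviest down until their cumulative weight reaches the given share of the total.

```python
def cut_words_by_weight_sum(word_weights, cut_fraction=0.5):
    """Keep the heaviest words until their cumulative weight reaches
    cut_fraction of the total weight (assumed reading of -cutww)."""
    total = sum(word_weights.values())
    if total <= 0:
        return []
    # Rank words from the largest weight to the smallest.
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    kept, running = [], 0.0
    for word, weight in ranked:
        kept.append((word, weight))
        running += weight
        if running >= cut_fraction * total:
            break
    return kept


# Hypothetical centroid weights, for illustration only.
centroid = {"oil": 0.40, "price": 0.25, "barrel": 0.15, "opec": 0.12, "crude": 0.08}
print(cut_words_by_weight_sum(centroid, 0.5))
# prints [('oil', 0.4), ('price', 0.25)]
```

With the default value 0.5, only the words that together carry half of the centroid's weight would be listed, which keeps the textual output short for centroids with many low-weight words.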
usage: BowHKMeans.exe
-i:Input-File (default:'')
-op:Output-BowPartition-File (default:'KMeans.BowPart')
-ot:Output-Txt-File (default:'KMeans.Txt')
-ox:Output-Xml-File (default:'KMeans.Xml')
-docs:Documents (default:-1)
-clusts:Clusters (default:10)
-rseed:RNG-Seed (default:1)
-mncdocs:Minimal-Documents-Per-Cluster (default:100)
-ctrials:Clustering-Trials (default:1)
-ceps:Convergence-Epsilon (default:10)
-cutww:Cut-Word-Weight-Sum-Percentage (default:0.5)
-mnwfq:Minimal-Word-Frequency (default:5)
-propwgt:Propagate-Weights (default:'F')
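When the utility is called from a script, the "-name:value" option syntax of the usage listing can be assembled programmatically. The following Python sketch is an illustration only: the helper names build_command and run_clustering are hypothetical, the option values are arbitrary, and the executable is assumed to be reachable on the PATH.

```python
import subprocess


def build_command(exe="BowHKMeans.exe", **options):
    """Build an argument list in the -name:value form of the usage listing."""
    return [exe] + [f"-{name}:{value}" for name, value in options.items()]


def run_clustering(**options):
    """Run the utility; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**options), check=True)


# Only option names that appear in the usage listing are used here.
cmd = build_command(i="Reuters21578.Bow", clusts=20, ctrials=3)
print(" ".join(cmd))
# prints BowHKMeans.exe -i:Reuters21578.Bow -clusts:20 -ctrials:3
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.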
Example:
BowHKMeans.exe -i:Reuters21578.Bow -docs:1000 -mncdocs:100
The above example call clusters the first 1000 documents (-docs:) from
Reuters21578.Bow (-i:) in a hierarchical 2-means manner, with the constraint
that clusters with fewer than 100 documents (-mncdocs:) are not split further.
The files HKMeans.Txt (textual description of the results), HKMeans.Xml
(results in XML form) and HKMeans.BowPart (binary representation of the
partition) are created.
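The splitting rule in this example (clusters below the "-mncdocs" threshold are not split further) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the utility's implementation: it uses small dense vectors with Euclidean distance instead of sparse Bag-Of-Words vectors, it runs a single trial per split, and it does not model "-clusts", "-ctrials" or "-propwgt". The names two_means and split_hierarchically and the synthetic data are hypothetical.

```python
import random


def two_means(points, rng, eps=1e-6, max_iter=100):
    """Split points into two groups with plain 2-means (Euclidean distance)."""
    centroids = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[0 if d[0] <= d[1] else 1].append(p)
        if not groups[0] or not groups[1]:
            break
        new_centroids = [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups]
        # Total squared centroid movement, used here as the convergence test.
        shift = sum(sum((a - b) ** 2 for a, b in zip(c, n))
                    for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= eps:
            break
    return groups


def split_hierarchically(points, min_docs, rng):
    """Bisect top-down; clusters with fewer than min_docs points become leaves."""
    if len(points) < max(min_docs, 2):
        return [points]
    left, right = two_means(points, rng)
    if not left or not right:  # degenerate split, keep the cluster as a leaf
        return [points]
    return (split_hierarchically(left, min_docs, rng)
            + split_hierarchically(right, min_docs, rng))


# Synthetic 2-D "documents" around three centers, for illustration only.
rng = random.Random(1)
docs = [(rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(40)]
leaves = split_hierarchically(docs, min_docs=30, rng=rng)
print([len(leaf) for leaf in leaves])
```

Every leaf in the result holds fewer than min_docs points, which mirrors the constraint stated for "-mncdocs:100" in the example call.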