Machine Learning Task: Multiclass classification
This is the CNAE-9 data set. It contains 1080 documents of free-text business descriptions of Brazilian companies, categorized into a subset of 9 categories. The original texts were preprocessed to obtain the current data set: first, only letters were kept and prepositions were removed from the texts. Next, the words were transformed into their canonical form. Finally, each document was represented as a vector in which the weight of each word is its frequency in the document. The resulting data set is highly sparse, since most words do not occur in any given document.
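The preprocessing steps described above can be illustrated with a minimal sketch. It is not the original pipeline: the preposition list, the `canonical` helper (a crude plural strip standing in for real canonical-form reduction), and the two sample descriptions are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny preposition list; the list actually used
# to build the data set is not given in the description.
PREPOSITIONS = {"de", "da", "do", "em", "para", "com", "por"}


def canonical(word):
    """Stand-in for canonical-form reduction: strips a trailing 's'.

    The real data set used some form of lemmatization or stemming that
    the description does not specify.
    """
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def tokenize(text):
    # Keep only runs of letters (digits, punctuation and underscores are dropped).
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Remove prepositions, then reduce each remaining word to its canonical form.
    return [canonical(w) for w in words if w not in PREPOSITIONS]


def vectorize(docs):
    """Return (vocabulary, term-frequency vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[index[word]] = count  # weight = frequency in the document
        vectors.append(row)
    return vocab, vectors


# Hypothetical sample descriptions, not taken from the data set.
docs = [
    "Comércio de roupas e calçados",
    "Fabricação de calçados para crianças",
]
vocab, vectors = vectorize(docs)
print(vocab)
for row in vectors:
    print(row)
```

Even in this two-document example, each vector has zeros for the words that appear only in the other document. With a vocabulary drawn from 1080 short descriptions, the same effect leaves most entries at zero, which is why the data set is so sparse.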
Available at OpenML: https://openml.org/d/1468