
Methods of Unsupervised Semantic Analysis of Small and Medium-Sized Corpora

S. Dolgikh, O. Sliusarenko

Abstract



Analyzing and describing text corpora can present a number of technical challenges, especially for corpora built by automated content extraction, where annotations and other semantic information about the texts may not be readily available. In this work we describe and test an approach to analyzing the semantic content of corpora based on unsupervised feature extraction, dimensionality reduction and concept learning. Using model corpora of texts from English-language newsgroups, we demonstrate how characteristic semantic types can be identified with unsupervised machine learning and clustering methods. The results can be a useful addition to methods for analyzing the semantic context of text corpora where prior descriptions such as annotations are scarce or unavailable. The approach and methods demonstrated in this work are not limited to the English language and can be applied to corpora in any language for which appropriate vectorization and preprocessing methods are available.
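
As a rough illustration of the kind of pipeline the abstract describes, the sketch below vectorizes a handful of short texts with TF-IDF and groups them with k-means, using only the Python standard library. It is not the authors' implementation: the toy corpus, the stopword list, the smoothed IDF variant, the farthest-point initialization and all function names are assumptions made for this example, and the dimensionality reduction and concept learning steps are omitted for brevity.

```python
import math
import re
from collections import Counter

# Invented toy corpus (not from the paper): three space-themed and
# three hockey-themed snippets standing in for newsgroup posts.
DOCS = [
    "NASA shuttle launch reached orbit",
    "orbit of the satellite after launch",
    "shuttle crew repaired the satellite in orbit",
    "the goalie stopped every puck in the hockey game",
    "hockey playoffs the team scored in overtime",
    "the team traded a goalie before the playoffs",
]
STOPWORDS = {"the", "of", "in", "a"}  # assumed minimal stopword list


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized dense TF-IDF vectors)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed IDF (one common variant; an assumption of this sketch).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vocab, vectors = tfidf_vectors(DOCS)
labels = kmeans(vectors, k=2)
for label, doc in zip(labels, DOCS):
    print(label, doc)
```

On this toy input the two themes share no vocabulary, so the clusters separate cleanly. Real newsgroup texts are far noisier, which is where feature extraction and dimensionality reduction before clustering become relevant.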

Keywords


Natural Language Processing, semantic analysis, unsupervised learning, statistical machine learning, clustering.


