
Applying Balancing Techniques to Classify Biomedical Documents: An Empirical Study

Rubén Romero González, Eva Lorenzo Iglesias, Lourdes Borrajo Diz

Abstract


In the last decade, several text mining methods have been proposed to automate the process of searching and classifying information in online biomedical publications. However, results are not yet good enough, mainly because of the unbalanced nature of the document collections, in which only a very small number of papers are relevant to each user query. Because most data mining and machine learning algorithms have great difficulty dealing with unbalanced data, this problem is taking center stage. Classification techniques such as support vector machines (SVMs) perform excellently on balanced data, but may fail when applied to unbalanced datasets. One of the most common techniques for dealing with this problem consists of applying basic sampling methods, including under-sampling, over-sampling and re-sampling. This article discusses the issues associated with classifying unbalanced data, and analyzes the effects of these balancing strategies on four different SVM kernels (linear, sigmoid, exponential and polynomial) using the public TREC Genomics 2005 biomedical text corpus. The experimental results show that the normalized linear and sigmoid kernels, combined with the under-sampling balancing technique, outperform the other approaches tested. The empirical tests are conducted using a new software tool named BioClass, which is presented here.
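
To make the sampling strategies named above concrete, the following is a minimal sketch of random under-sampling and random over-sampling on a toy labelled collection. It is not the BioClass implementation: the function names `under_sample` and `over_sample`, the toy documents, and the choice of purely random selection are all assumptions made for illustration.

```python
import random
from collections import Counter


def _group_by_class(docs, labels):
    """Group documents by their class label."""
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)
    return by_class


def under_sample(docs, labels, seed=0):
    """Hypothetical helper: randomly discard documents from the larger
    classes until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(docs, labels)
    target = min(len(items) for items in by_class.values())
    balanced = [
        (doc, label)
        for label, items in by_class.items()
        for doc in rng.sample(items, target)  # sampling without replacement
    ]
    rng.shuffle(balanced)
    return balanced


def over_sample(docs, labels, seed=0):
    """Hypothetical helper: randomly duplicate documents from the smaller
    classes until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(docs, labels)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Draw the missing documents with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((doc, label) for doc in items + extra)
    rng.shuffle(balanced)
    return balanced


# Toy unbalanced collection: 2 relevant documents versus 8 irrelevant ones.
docs = [f"doc{i}" for i in range(10)]
labels = ["relevant"] * 2 + ["irrelevant"] * 8

under = under_sample(docs, labels)
over = over_sample(docs, labels)

print(Counter(label for _, label in under))
print(Counter(label for _, label in over))
```

Under-sampling leaves both classes at the size of the minority class, while over-sampling grows the minority class to the size of the majority class. The balanced output would then be passed to whichever classifier is under study.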

Keywords


Biomedical text mining, classification techniques, support vector machines, kernels, unbalanced data.


