
Statistics-Based Data Preprocessing Methods and Machine Learning Algorithms for Big Data Analysis

Azizur Rahman


Big data analytics is a fast-growing research domain that combines computational (i.e. computer-intensive) and inferential (i.e. statistics-oriented) thinking. Information is increasingly gathered into big data environments, such as distinct protein-coding data used to identify various critical diseases and their cures. Data pre-processing techniques are used to make the data clean, noise-free and consistent so that it can be modelled for various real-life purposes. This paper examines a range of statistics-based data pre-processing methods and machine learning algorithms to assess their performance in the big data analysis setting. Amino acid sequence data of tuberculosis-affected proteins from the National Center for Biotechnology Information (NCBI) database are used for the empirical results. Findings reveal that statistics-based pre-processing methods are effective in making big data usable for significant modelling and analysis with novel machine learning algorithms such as the hidden Markov chain model, Box-Cox and linear transformation, and that they also maintain the performance of those algorithms. Although significant differences are observed between the predictive outcomes and performances of the algorithms, the results further demonstrate that the hidden Markov chain model produced more accurate, exact and faster analysis with reliable estimates.
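As a rough illustration of the kind of statistics-based pre-processing named in the abstract, the following Python sketch counts residues in a protein sequence, applies a Box-Cox transformation, and then rescales the result linearly to [0, 1]. It is only a sketch under stated assumptions and not the paper's actual pipeline: the toy sequence, the function names, the fixed value lambda = 0.5 and the choice of min-max scaling as the linear transformation are all invented for illustration.

```python
import math
from collections import Counter


def amino_acid_counts(sequence):
    """Count how often each residue letter occurs in a protein sequence."""
    return Counter(sequence.upper())


def box_cox(values, lam):
    """Box-Cox power transform for strictly positive values and a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The lambda = 0 case is defined as the natural logarithm.
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def min_max_scale(values):
    """Linear transformation of the values onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy sequence invented for illustration only (not NCBI data).
toy_sequence = "MKVLAAGIVGLLLAQ"
counts = amino_acid_counts(toy_sequence)
residues = sorted(counts)
raw = [float(counts[r]) for r in residues]

# lambda = 0.5 is an arbitrary assumption; in practice it is usually estimated.
transformed = box_cox(raw, lam=0.5)
scaled = min_max_scale(transformed)

for residue, value in zip(residues, scaled):
    print(residue, round(value, 3))
```

In practice the Box-Cox parameter would normally be estimated from the data, for example by maximum likelihood, rather than fixed in advance as it is here.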


Big Data, Computational Algorithms, Hidden Markov Chain Model, Normalization, Statistical Thinking, Tuberculosis.


