English Sentiment Classification using An Ochiai Similarity Measure and The One-dimensional Vectors in a Parallel Network Environment

Vo Ngoc Phu, Vo Thi Ngoc Tran


Sentiment analysis plays an important role in everyday life, for example in political activities, commodity production, and commercial activities. In this work, we propose a novel model for sentiment analysis of large-scale data sets. We use an Ochiai coefficient (OC), a similarity measure from the clustering techniques of the data mining field, to cluster each document of our English testing data set, which contains 6,000,000 documents (3,000,000 positive and 3,000,000 negative), into either the positive or the negative polarity, based on our English training data set of 5,000,000 documents (2,500,000 positive and 2,500,000 negative). We do not use any English sentiment lexicons in this study, nor do we use any multi-dimensional vectors based on both vector space modeling (VSM) and sentiment lexicons; we use only one-dimensional vectors based on VSM. A one-dimensional vector is clustered into either the positive or the negative group when the OC similarity coefficients show that it is very close to that group, that is, when the vector is clearly very similar to it. A document of the testing data set is then clustered into one of the sentiment polarities (positive, negative, or neutral) based on its one-dimensional vectors. We first tested the proposed model in a sequential environment and then in a distributed network system. The model achieved an accuracy of 87.58% on the testing data set. The execution time of the model in the parallel network environment is shorter than that in the sequential system. This study uses many similarity coefficients from the data mining field, and its results can be used in many applications and further studies.
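The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each sentence yields one one-dimensional vector, that a vector is represented by its set of distinct tokens, that a vector takes the polarity of the training group holding its most similar sentence, and that a document is labeled by majority vote over its vectors. The function names `ochiai`, `classify_sentence`, and `classify_document` are hypothetical. The Ochiai coefficient itself is the standard one, |A ∩ B| / sqrt(|A| · |B|).

```python
import math


def ochiai(a_tokens, b_tokens):
    """Ochiai coefficient between two token collections, treated as sets."""
    a, b = set(a_tokens), set(b_tokens)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def classify_sentence(sentence, positive, negative):
    """Label one sentence by its best Ochiai match in each training group.

    Assumption: whitespace tokenization and a 'max similarity' rule.
    """
    tokens = sentence.lower().split()
    pos = max((ochiai(tokens, p.lower().split()) for p in positive), default=0.0)
    neg = max((ochiai(tokens, n.lower().split()) for n in negative), default=0.0)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def classify_document(sentences, positive, negative):
    """Label a document by majority vote over its sentence labels (assumption)."""
    labels = [classify_sentence(s, positive, negative) for s in sentences]
    pos, neg = labels.count("positive"), labels.count("negative")
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

In a distributed setting, the per-sentence comparisons are independent of each other, so they could be spread across map tasks, with the vote counted in a reduce step; that mapping is also an assumption here, not a detail given in this section.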


English sentiment classification; distributed system; parallel system; Ochiai similarity measure; Cloudera; Hadoop Map and Hadoop Reduce; clustering technology

