
Position Score Weighting Technique for Mining Web Content Outliers

W.R. Wan Zulkifeli, N. Mustapha, A. Mustapha

Abstract


Existing methods for mining web content outliers apply a stemming algorithm to preprocess the web documents but leave the domain dictionary entries as whole root words. A stemming algorithm reduces derived words to their stem, base or root form. However, the stem that remains after suffix removal is not always a real word, which makes it difficult to match words in the full word profile against the domain dictionary. This study therefore uses a stemmed domain dictionary and combines it with the Term Frequency with Position Score (TF.PS) weighting technique, which is derived from the TF.IDF weighting technique used in Information Retrieval (IR), in the dissimilarity measure phase, in order to assess how effectively these techniques determine outliers in web content. The dataset is taken from the 20 Newsgroups dataset. The stemmed domain dictionary with TF.PS weighting achieves up to 98.19% accuracy and a 90% F1-measure, which is higher than previous techniques.
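The matching problem described above can be illustrated with a minimal sketch. It is not the authors' implementation: the suffix list, the `toy_stem` function, the sample dictionary and the sample document words are all hypothetical, and a real system would more likely use an established stemmer. The sketch only shows why stemmed document words can fail to match an unstemmed dictionary, and why stemming the dictionary with the same algorithm restores the matches.

```python
# Hypothetical suffix list for a toy stemmer (an assumption, not a published algorithm).
SUFFIXES = ("ational", "ation", "ing", "ers", "er", "es", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matched_terms(doc_words, dictionary_terms):
    """Return the stemmed document words that also appear in the dictionary."""
    stemmed_doc = {toy_stem(w) for w in doc_words}
    return stemmed_doc & set(dictionary_terms)


# Hypothetical domain dictionary and document words.
domain_dictionary = {"computer", "encryption", "graphics"}
document_words = ["computing", "encryption", "graphics", "recipes"]

# Unstemmed dictionary: stems such as "comput" are not real words and find no match.
unstemmed_hits = matched_terms(document_words, domain_dictionary)

# Stemmed dictionary: both sides pass through the same stemmer, so the stems line up.
stemmed_dictionary = {toy_stem(term) for term in domain_dictionary}
stemmed_hits = matched_terms(document_words, stemmed_dictionary)

print(sorted(unstemmed_hits))
print(sorted(stemmed_hits))
```

In this toy example only one in-domain word matches the unstemmed dictionary, while all three match the stemmed one; the out-of-domain word "recipes" matches neither, which is the kind of signal a dissimilarity measure could use to flag outlier content.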

Keywords


information retrieval, outliers, web content, weighting technique
