Comparison of K-Nearest Neighbor and Naïve Bayes algorithms for hoax classification in Indonesian health news

Awang Hendrianto Pratomo; Faiz Rachmad; Frans Richard Kodong

doi:10.31763/businta.v8i2.796

Authors

Awang Hendrianto Pratomo, Universitas Pembangunan Nasional Veteran Yogyakarta
Faiz Rachmad, Universitas Pembangunan Nasional Veteran Yogyakarta
Frans Richard Kodong, Universitas Pembangunan Nasional Veteran Yogyakarta

DOI:

https://doi.org/10.31763/businta.v8i2.796

Keywords:

news, hoax news classification, comparison, K-Nearest Neighbor

Abstract

Classifying health-related news is essential for determining whether a report is a hoax or a fact. This paper compares the accuracy of two algorithms, K-Nearest Neighbor (KNN) and the Naïve Bayes classifier, for classifying hoaxes in health news. Text mining was applied to news headlines, with features extracted using the TF-IDF method and passed to the classifiers. The system was developed using a prototype model. The models were evaluated with confusion matrices and k-fold cross-validation. The KNN model with K=3 attained an average accuracy of 82.91%, precision of 85.3%, and recall of 79.38%. The Naïve Bayes model performed best, exceeding the K=3 KNN model with an average accuracy of 86.42%, precision of 88.10%, and recall of 84.05%. These findings suggest that Naïve Bayes outperforms KNN in classifying health news as hoax or fact, as measured with the confusion matrix. Although related studies have been conducted in the past, this study differs in its preprocessing methods, dataset size, and outcomes. The dataset consists of 1,219 news headlines labelled as hoaxes and 1,227 labelled as facts.
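
The pipeline outlined in the abstract (TF-IDF features from headlines, a KNN classifier with K=3, a Naïve Bayes classifier, and confusion-matrix metrics) can be illustrated with a small plain-Python sketch. It is not the authors' implementation. The toy English headlines, whitespace tokenization, smoothed IDF formula, cosine similarity, and multinomial Naïve Bayes variant with Laplace smoothing are all assumptions made for illustration, and k-fold cross-validation is omitted for brevity.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Learn smoothed IDF weights from tokenized training documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """Return an L2-normalized sparse TF-IDF vector (terms unseen in training are dropped)."""
    tf = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def knn_predict(x, vecs, labels, k=3):
    """Majority vote among the k training vectors most cosine-similar to x."""
    sims = [(sum(w * v.get(t, 0.0) for t, w in x.items()), y)
            for v, y in zip(vecs, labels)]
    top = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    return Counter(y for _, y in top).most_common(1)[0][0]

def nb_fit(vecs, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing (assumed variant)."""
    weight_sum = defaultdict(lambda: defaultdict(float))
    for v, y in zip(vecs, labels):
        for t, w in v.items():
            weight_sum[y][t] += w
    model = {}
    for y, n_y in Counter(labels).items():
        total = sum(weight_sum[y].values()) + alpha * len(vocab)
        log_lik = {t: math.log((weight_sum[y].get(t, 0.0) + alpha) / total)
                   for t in vocab}
        model[y] = (math.log(n_y / len(labels)), log_lik)
    return model

def nb_predict(x, model):
    """Pick the class with the highest log prior plus weighted log likelihood."""
    scores = {y: prior + sum(w * log_lik[t] for t, w in x.items())
              for y, (prior, log_lik) in model.items()}
    return max(scores, key=scores.get)

def evaluate(y_true, y_pred, positive="hoax"):
    """Accuracy, precision, and recall from the binary confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Invented toy headlines, NOT taken from the study's dataset.
train = [
    ("garlic water cures virus instantly", "hoax"),
    ("drinking hot water kills virus", "hoax"),
    ("miracle herb cures diabetes instantly", "hoax"),
    ("ministry reports new vaccine schedule", "fact"),
    ("hospital opens new vaccine clinic", "fact"),
    ("ministry publishes diabetes screening guidelines", "fact"),
]
test = [
    ("hot garlic water cures diabetes", "hoax"),
    ("ministry opens vaccine clinic", "fact"),
]

train_docs = [text.split() for text, _ in train]
train_labels = [label for _, label in train]
idf = fit_idf(train_docs)
train_vecs = [tfidf(doc, idf) for doc in train_docs]
nb_model = nb_fit(train_vecs, train_labels, vocab=list(idf))

test_vecs = [tfidf(text.split(), idf) for text, _ in test]
test_labels = [label for _, label in test]
knn_pred = [knn_predict(x, train_vecs, train_labels, k=3) for x in test_vecs]
nb_pred = [nb_predict(x, nb_model) for x in test_vecs]

print("KNN (K=3):  ", evaluate(test_labels, knn_pred))
print("Naive Bayes:", evaluate(test_labels, nb_pred))
```

In a k-fold setting, the same fit and evaluate steps would be repeated once per fold and the metrics averaged, which is how average accuracy, precision, and recall figures like those reported above are typically obtained.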

