Comparison of K-Nearest Neighbor and Naïve Bayes algorithms for hoax classification in Indonesian health news

Authors

  • Awang Hendrianto Pratomo, Universitas Pembangunan Nasional Veteran Yogyakarta
  • Faiz Rachmad, Universitas Pembangunan Nasional Veteran Yogyakarta
  • Frans Richard Kodong, Universitas Pembangunan Nasional Veteran Yogyakarta

DOI:

https://doi.org/10.31763/businta.v8i2.796

Keywords:

news, hoax news classification, comparison, K-Nearest Neighbor

Abstract

Classifying health-related news as hoax or fact is essential to determining whether it reports the truth. This paper compares the accuracy of the K-Nearest Neighbor (KNN) and Naïve Bayes classifiers for classifying hoaxes in health news. Text mining was applied to the news headlines, with features extracted using the TF-IDF method before classification. The system was developed using a prototype model. Model assessment used confusion matrices and k-fold cross-validation. The KNN model with K=3 attained an average accuracy of 82.91%, precision of 85.3%, and recall of 79.38% with no predictors included. The best performance was recorded for the Naïve Bayes model, which, relative to the K=3 KNN model, reached an average accuracy of 86.42%, precision of 88.10%, and recall of 84.05%. These findings suggest that, based on the confusion-matrix evaluation, Naïve Bayes outperforms KNN in classifying the hoax status of health news. Although related studies have been conducted in the past, this study differs in its preprocessing methods, dataset size, and outcomes. The dataset consists of 1219 news headlines labelled as hoaxes and 1227 labelled as facts.
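
The evaluation described in the abstract (a confusion matrix per fold, then accuracy, precision, and recall averaged over the folds) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels, and the choice of "hoax" as the positive class are assumptions made only for illustration, and the toy predictions are unrelated to the reported results.

```python
from typing import Dict, List, Sequence, Tuple

Fold = Tuple[Sequence[str], Sequence[str]]  # (true labels, predicted labels)


def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str],
                     positive: str = "hoax") -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating `positive` as the positive class.

    Assumption: "hoax" is the positive class; the abstract does not say.
    """
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def average_over_folds(folds: List[Fold]) -> Dict[str, float]:
    """Average each metric over the folds of a k-fold cross-validation."""
    per_fold = [scores(*confusion_counts(t, p)) for t, p in folds]
    return {name: sum(s[name] for s in per_fold) / len(per_fold)
            for name in per_fold[0]}


# Hypothetical two-fold toy data, not taken from the paper.
toy_folds: List[Fold] = [
    (["hoax", "hoax", "fact", "fact"], ["hoax", "fact", "fact", "fact"]),
    (["hoax", "fact", "hoax", "fact"], ["hoax", "hoax", "hoax", "fact"]),
]

if __name__ == "__main__":
    for name, value in average_over_folds(toy_folds).items():
        print(f"{name}: {value:.4f}")
```

Because the dataset is nearly balanced (1219 hoax and 1227 fact headlines), always predicting the majority class would give an accuracy of about 50%, which is the baseline against which the reported accuracies can be read.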

Published

2024-12-09

How to Cite

Pratomo, A. H., Rachmad, F., & Kodong, F. R. (2024). Comparison of K-Nearest Neighbor and Naïve Bayes algorithms for hoax classification in Indonesian health news. Bulletin of Social Informatics Theory and Application, 8(2), 345–355. https://doi.org/10.31763/businta.v8i2.796

Section

Articles