Implementation Naïve Bayes Algorithm for Student Classification Based on Graduation Status

I. Introduction
Higher education in Indonesia is overseen by the Directorate of Higher Education and accredited by the National Accreditation Board for Higher Education. Higher education institutions fall into two categories, public and private [1], and into four types: universities, institutes, academies and polytechnics. Higher education in Indonesia is provided largely by the private sector: of around 3,500 institutions, only around 150 are public (established and operated by the government) [2]. Islamic University of Indonesia is a private university with 9 faculties and 46 programs. The Department of Statistics is one of the programs under the Faculty of Mathematics and Natural Sciences.
One quality standard for higher education concerns the student body, that is, the ratio of students to lecturers. Every higher education institution is expected to produce high-quality alumni, and one criterion of quality is the graduation status of its students. Graduation status can affect the quality of alumni, especially for the Department of Statistics. It in turn affects the quality of the Department itself, because the quality of study programs in Indonesia is measured through the accreditation conducted by the National Accreditation Board for Higher Education (BAN PT).
Quality is measured by seven major standards, one of which concerns students. In particular, the assessment components of the student standard include the grade point average and the duration of study [3]. According to the assessment matrix of the accreditation instrument, the percentage of students who graduate on time is one element of the accreditation assessment [4]. This issue is therefore extremely important for the management of the department.
One way to address this problem is to identify the factors that influence graduation status. This paper uses five variables presumed to influence a student's graduation status: grade point average (GPA), concentration in high school, sex, participation in assistance, and city of residence.
The classification method used in this research is the Naïve Bayes algorithm, which has been reported to compare favorably with several common classifiers in terms of accuracy and computational efficiency [5].
The remaining parts of this paper are organized as follows. Section 2 reviews the background of the Naïve Bayes algorithm and classification accuracy. Section 3 presents the methodology and the results, with accuracy used to evaluate the classification. Section 4 presents the numerical experiment with the Naïve Bayes algorithm. Finally, Section 5 presents the discussion and conclusions.

II. Methodology
This section describes the techniques used in this study.

A. Data
This research was conducted in the Department of Statistics, Faculty of Mathematics and Natural Sciences, Islamic University of Indonesia. The data are registration data for students of the department and alumni data covering 1997-2012: 404 alumni from the classes of 1997-2011 and 116 active students from the class of 2012. The analysis uses graduation status as the class variable together with five predictor variables: gender, city of residence, concentration in high school, assistance, and grade point average (GPA). The categories of each variable are listed in Table 1.
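The split of the records by class year described above can be sketched as follows. This is only an illustration under stated assumptions: the field name `class_year`, the helper `split_by_class_year`, and the three toy records are hypothetical and are not taken from the study's data.

```python
# Minimal sketch: separate records into training data (classes up to 2011)
# and testing data (class of 2012). Field names and records are hypothetical.

def split_by_class_year(records, last_training_year=2011):
    """Return (training, testing) lists split on the class year."""
    training = [r for r in records if r["class_year"] <= last_training_year]
    testing = [r for r in records if r["class_year"] > last_training_year]
    return training, testing


# Toy records for illustration only.
records = [
    {"class_year": 1997, "gender": "F"},
    {"class_year": 2011, "gender": "M"},
    {"class_year": 2012, "gender": "F"},
]

training, testing = split_by_class_year(records)
print(len(training), len(testing))
```

Splitting on the class year rather than at random mirrors the description in the text, where alumni form the training data and the most recent class forms the testing data.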

B. Naïve Bayes Algorithm
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class [6]. Studies comparing classification algorithms have found a simple Bayesian classifier, known as the Naïve Bayes classifier, to be comparable in performance with decision tree and selected neural network classifiers. The Naïve Bayes classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes; this assumption is called class conditional independence. Naïve Bayes is based on Bayes' theorem, and when applied to large data sets it has been shown to have high accuracy and speed.
Bayes' theorem has the following general form:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \qquad (1)$$

We know how frequently some particular evidence is observed, given a known outcome. We can use this known fact to compute the reverse: the chance of that outcome happening, given the evidence. The description of the Naïve Bayes classifier follows [5]. For a continuous-valued attribute, the class-conditional probability is typically modelled by a Gaussian density,

$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right),$$

where $\mu_{C_i}$ and $\sigma_{C_i}$ are the mean and standard deviation, respectively, of the attribute values for the training samples of class $C_i$.
In order to classify an unknown data input $X$, $P(X \mid C_i)\,P(C_i)$ is evaluated for each class $C_i$.
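As a minimal sketch of this evaluation step for a single continuous attribute, the code below scores each class by a Gaussian density multiplied by the class prior and picks the largest score. The class means, standard deviations, priors, and the GPA value are hypothetical numbers chosen for illustration; they are not estimates from the study.

```python
import math


def gaussian_density(x, mean, std):
    """Normal density with the given mean and standard deviation."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * std)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))


def classify(x, class_params, priors):
    """Evaluate P(x | C_i) * P(C_i) for each class and return the best class."""
    scores = {
        label: gaussian_density(x, mean, std) * priors[label]
        for label, (mean, std) in class_params.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical (mean, standard deviation) of GPA per class, and class priors.
class_params = {"on_time": (3.4, 0.3), "not_on_time": (2.9, 0.4)}
priors = {"on_time": 0.4, "not_on_time": 0.6}

label, scores = classify(3.6, class_params, priors)
print(label)
```

The scores are proportional to the posterior probabilities, because the common denominator $P(X)$ is omitted; this does not change which class has the largest value.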

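C. Classification Accuracy
The classification accuracy $A_i$ of an individual program $i$ depends on the number of samples correctly classified (true positives plus true negatives) and is evaluated by the formula $A_i = \frac{t}{n} \times 100$, where $t$ is the number of samples correctly classified and $n$ is the total number of samples.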
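The accuracy measure can be sketched as the percentage of samples whose predicted class matches the actual class. The function name `classification_accuracy` and the five toy labels below are illustrative assumptions, not data from the study.

```python
def classification_accuracy(actual, predicted):
    """Percentage of samples whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


# Toy labels for illustration only: four of the five predictions are correct.
actual = ["on_time", "on_time", "not_on_time", "not_on_time", "on_time"]
predicted = ["on_time", "not_on_time", "not_on_time", "not_on_time", "on_time"]

accuracy = classification_accuracy(actual, predicted)
print(accuracy)
```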

A. Descriptive Statistics
This section presents descriptive statistics of the data. Alumni of the Department of Statistics are spread throughout Indonesia, as shown in Figure 1. Based on Figure 1, most alumni come from Central Java and Yogyakarta. In this paper, city of residence is therefore divided into two regions: Central Java and Yogyakarta, and outside Central Java and Yogyakarta.
The percentage of students of the Department of Statistics in each graduation status is shown in Figure 2. From Figure 2, it can be concluded that from 1995 until 2011 there are several years for which 100% of students did not complete their studies in four years or less. Based on Figure 3, female students outnumber male students in both the not-on-time and the on-time graduation status. This is likely because there are more female students than male students.

B. Naïve Bayes Classifier
In applying the Naïve Bayes classifier, the selected dataset contains two categories: graduation on time and graduation not on time. The 404 alumni records (about 78% of the 520 records) are used to build the training dataset for the classifier, and the other 116 records are used as the testing dataset. The training data are shown in Table 2. The prior probabilities for graduation status in Table 3 show that the larger probability in this case is for graduation not on time. The conditional probabilities for each variable are given in Table 4. Based on Table 4, the largest probabilities are male for the on-time status, cum laude GPA for the on-time status, city of residence outside Yogyakarta and Central Java for the not-on-time status, and science concentration for the on-time status. These probabilities determine the classification accuracy reported in the next section.
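The prior and conditional probabilities discussed above are relative frequencies in the training data. The sketch below shows one way to compute them by counting; it is not the R or Weka procedure used in the paper, and the attribute names and the five toy records are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_tables(records, class_key):
    """Estimate class priors and per-class conditional probabilities by counting."""
    class_counts = Counter(r[class_key] for r in records)
    total = len(records)
    # Prior: share of training records in each class.
    priors = {label: n / total for label, n in class_counts.items()}

    # Count attribute values separately for each (attribute, class) pair.
    value_counts = defaultdict(Counter)
    for r in records:
        for attr, value in r.items():
            if attr != class_key:
                value_counts[(attr, r[class_key])][value] += 1

    # Conditional: count of the value within the class / size of the class.
    conditionals = {
        (attr, label): {v: n / class_counts[label] for v, n in counts.items()}
        for (attr, label), counts in value_counts.items()
    }
    return priors, conditionals


# Toy training records for illustration only.
records = [
    {"status": "on_time", "gender": "F", "school": "science"},
    {"status": "on_time", "gender": "M", "school": "science"},
    {"status": "not_on_time", "gender": "F", "school": "social"},
    {"status": "not_on_time", "gender": "F", "school": "science"},
    {"status": "not_on_time", "gender": "M", "school": "social"},
]

priors, conditionals = estimate_tables(records, "status")
print(priors)
print(conditionals[("gender", "on_time")])
```

Values that never occur within a class receive no entry here; a practical implementation would usually apply a smoothing correction so that such values do not yield a zero probability.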

C. Accuracy of Classification
According to Table 5, of the 302 students with "On Time" graduation status, 250 are classified correctly and the other 52 are misclassified. Thus, the Naïve Bayes algorithm classifies the graduation status of students in the Statistics Department of UII with an accuracy of 81.18%.

IV. Conclusion
In this paper, the Naïve Bayes classifier has been discussed as a suitable classifier for this problem. The Naïve Bayes classification model achieves an accuracy of 81.18%.
Let $D$ be a set of labelled training samples and their associated class labels. As usual, each training sample is represented by an $n$-dimensional attribute vector, $X = (x_1, x_2, \ldots, x_n)$, depicting $n$ measurements made on the training sample from $n$ attributes, respectively, $A_1, A_2, \ldots, A_n$.
Suppose that there are $m$ classes, $C_1, C_2, \ldots, C_m$. Given a training sample $X$, the classifier will predict that $X$ belongs to the class having the highest posterior probability, conditioned on $X$. That is, the Naïve Bayes classifier predicts that training sample $X$ belongs to the class $C_i$ if and only if

$$P(C_i \mid X) > P(C_j \mid X) \quad \text{for } 1 \le j \le m,\ j \ne i.$$

The class $C_i$ for which $P(C_i \mid X)$ is maximized is called the maximum posteriori hypothesis. By Bayes' theorem (equation (1)), $P(C_i \mid X) = P(X \mid C_i)\,P(C_i) / P(X)$.
As $P(X)$ is equal for all classes, only $P(X \mid C_i)\,P(C_i)$ needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, $P(C_1) = P(C_2) = \cdots = P(C_m)$, and only $P(X \mid C_i)$ is maximized. Otherwise, the class prior probabilities may be estimated by $P(C_i) = |C_{i,D}| / |D|$, where $|C_{i,D}|$ is the number of training samples of class $C_i$ in $D$.
For data sets with many attributes, it would be extremely computationally expensive to compute $P(X \mid C_i)$. To reduce the computation in evaluating $P(X \mid C_i)$, the naïve assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the training sample. Thus,

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i).$$

Experimental Result
This section describes the results of classifying students of the Department of Statistics by graduation status, including the Naïve Bayes models and the classification results. First, a dataset of 404 alumni classified into two categories is used for evaluation.
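The class conditional independence product from the algorithm description above can be sketched for categorical attributes as follows. The computation is done with logarithms, a common choice for avoiding very small products; this is an implementation choice of the sketch, not something stated in the paper. The priors and probability tables are hypothetical values, not the contents of Tables 3 and 4.

```python
import math


def log_score(sample, prior, table):
    """log P(C_i) + sum over attributes of log P(x_k | C_i) for one class.

    Assumes every probability looked up is strictly positive.
    """
    score = math.log(prior)
    for attr, value in sample.items():
        score += math.log(table[attr][value])
    return score


def predict(sample, priors, tables):
    """Return the class with the largest log score for the sample."""
    scores = {label: log_score(sample, priors[label], tables[label]) for label in priors}
    return max(scores, key=scores.get)


# Hypothetical priors and conditional probability tables, for illustration only.
priors = {"on_time": 0.4, "not_on_time": 0.6}
tables = {
    "on_time": {
        "gender": {"F": 0.7, "M": 0.3},
        "school": {"science": 0.8, "social": 0.2},
    },
    "not_on_time": {
        "gender": {"F": 0.5, "M": 0.5},
        "school": {"science": 0.4, "social": 0.6},
    },
}

first = predict({"gender": "F", "school": "science"}, priors, tables)
second = predict({"gender": "M", "school": "social"}, priors, tables)
print(first, second)
```

Because the logarithm is increasing, the class that maximizes the log score is the same class that maximizes $P(X \mid C_i)\,P(C_i)$.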

Table 1. Variable Category
For a categorical attribute, $P(x_k \mid C_i) = s_{ik} / s_i$, where $s_{ik}$ is the number of training samples of class $C_i$ having the value $x_k$ and $s_i$ is the number of training samples belonging to $C_i$.

Table 2. Training Data
The data in Table 2 are processed using R software and Weka 3.8. In this section, the prior probability of each class of graduation status (Y) and the conditional probability of each independent variable (every variable except graduation status) given Y are taken from the output of the R program.
Ayundyah Kesumawati & Din Waikabu (Implementation Naïve Bayes Algorithm)

Table 3. Prior Probability for Graduation Status

Table 4. Conditional Probability for Graduation Status

Table 5. Accuracy of Classification