Congestion now occurs in almost all large cities in Indonesia, generally in areas with a high intensity of activity and land use. DKI Jakarta, the capital, is one of the most densely populated cities, and its high level of population activity is matched by heavy use of both public and private transportation. Traffic jams remain an unsolved problem there. West Palmerah Street is one of the roads with frequent traffic jams, and a simple study was conducted to confirm this. The study uses a descriptive method, beginning with the collection of current data through several surveys. The degree of saturation (DS) and vehicle speed were then calculated at three observation points. The DS is 0.89 at the pre-market point, 1.05 at the market point, and 0.89 at the post-market point. The travel speed is 32.05 km/hour at the pre-market point, 27.5975 km/hour at the market point, and 33.35 km/hour at the post-market point. The results confirm that there is a traffic delay in front of the market. The delay is caused by the large number of angkots (public minivans) that stop there and by the narrowing of the traffic lane in front of the market due to street vendors and to motorbikes stopping for buying and selling on the sidewalks. Operational measures are therefore needed to improve traffic flow on this road.


Keywords: Congestion in Jakarta, Classification, K-nearest neighbors, Naïve Bayes, Decision Tree.




Data mining is a method of discovering patterns in large amounts of data. It offers many techniques, one of which is classification. Classification is a data learning technique that predicts a value from a series of attributes (Wahyuningsih & Utari, 2018). It is widely used to predict class labels by building a model from a training set and its class labels, then applying that model to data whose attributes are known. Classification methods fall into five categories based on their underlying mathematical concepts: statistical-based, distance-based, decision-tree-based, neural-network-based, and rule-based. Many classification algorithms exist; this study uses the decision tree, KNN, and Naïve Bayes algorithms (Sartika & Sensuse, 2017). Of the three, the decision tree is one of the most commonly used methods, especially in data classification.

A case study on sentiment analysis of BPJS service users using the KNN, Naïve Bayes, and Decision Tree methods shows that the Decision Tree method has a high level of accuracy in data classification (Puspita & Widodo, 2021). A comparative case study of the K-Nearest Neighbor and Naïve Bayes data mining methods for the classification of congestion in Jakarta shows that KNN has higher accuracy than Naïve Bayes (Rahman et al., 2018). In these studies Naïve Bayes did not achieve the highest accuracy, so this study compares the three algorithms by accuracy to determine which method is best for classification.

Based on these problems, and specifically to compare the Decision Tree, KNN, and Naïve Bayes methods, this study, titled "Classification of Congestion in Jakarta Using the KNN, Naïve Bayes and Decision Tree Methods", was carried out using the RapidMiner software. It is a comparative analysis of traffic jam classification accuracy that seeks the highest accuracy value among the three methods. The purpose of this study is to compare the three methods and identify the one that classifies congestion with the highest accuracy.

One study applies the Naïve Bayes, KNN, and Decision Tree methods to sentiment analysis of traffic jams, motivated by traffic conditions in Jakarta that are so dense, with congestion increasing, that residents who commute to work need more comfortable transportation (Riadi & Kom, 2017). That research uses Twitter to collect random data for up to 127 dates. The Naive Bayes Classifier, KNN, and Decision Tree methods are applied after several preprocessing stages: emoticon conversion, cleaning, case folding, tokenization, and stemming (Romadloni et al., 2019). The decision tree method achieves the highest scores, with 100% accuracy, 100% precision, 100% sensitivity, and 100% specificity. The KNN method has 80% accuracy, 100% precision, 50% sensitivity, and 100% specificity, and the Naive Bayes method has 80% accuracy, 66.67% precision, 100% sensitivity, and 66.67% specificity.

Another study on the classification of traffic jams compares the K-Nearest Neighbor and Naïve Bayes data mining methods. Monitoring and processing of the surrounding environment, including water resources, is necessary to create traffic jams that comply with congestion standards (Rahman et al., 2018). The accuracy is 82.42% for K-Nearest Neighbor and 70.32% for Naïve Bayes, so it can be concluded that K-Nearest Neighbor is the better method for determining congestion.

The research on sentiment analysis of BPJS users using the KNN, Decision Tree, and Naïve Bayes methods addresses public use of BPJS services, which often raises pros and cons; for this reason, data mining sentiment analysis was carried out on tweets from Twitter users about BPJS. The 1,000 entries were filtered down to 903 after removing duplicate data. The KNN, Decision Tree, and Naïve Bayes methods were implemented to compare their accuracy (Puspita & Widodo, 2021). The study used RapidMiner version 9.9 and found an accuracy of 95.58% for KNN, 96.13% for the decision tree, and 89.14% for Naive Bayes, so it can be concluded that the decision tree is the best of the three methods.

Data mining is the process of extracting new information from existing data (Harahap, 2019). More specifically, it is an operation on data sources that finds relationships or patterns in large data sets to obtain new information (Cahyanti et al., 2020). This research uses data mining techniques that implement the K-nearest neighbors, Naïve Bayes, and Decision Tree methods and compares the highest accuracy achieved by each.

The K-Nearest Neighbor algorithm is a classification method for a dataset based on previously classified training data (Siregar et al., 2019). It classifies an object according to the training data that lie at the shortest distance from it (Romadloni et al., 2019). The working principle of KNN is to find the shortest distances between the data to be tested and its nearest neighbors in the training data. The best value of k depends on the data: a higher k usually reduces the effect of errors or noise on the classification, but blurs the boundaries between classes (Sukmana et al., 2020). This research carries out a computational process to obtain accuracy results using the KNN method. The distance is calculated with the Euclidean formula:

$$d(x_1, x_2) = \sqrt{\sum_{i=1}^{p} \left(x_{1i} - x_{2i}\right)^2} \tag{1}$$

where $d$ is the distance, $x_1$ is the sample data, $x_2$ is the test data, $p$ is the data dimension, and $i$ indexes the data variables.
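
The nearest-neighbor voting described above can be illustrated with a minimal sketch. It assumes numeric feature vectors and a simple majority vote; the function names `euclidean` and `knn_predict` and the toy data are hypothetical and are not taken from the RapidMiner process used in this study.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training points by distance to the query and keep the k closest
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric features per sample, labels 0 and 1
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))
print(knn_predict(train_X, train_y, (4.0, 4.0), k=3))
```

Choosing an odd k for a two-class problem avoids tied votes, which is one practical reason the value of k is tuned per dataset.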

The Naive Bayes Classifier is a data mining method for data classification that works through probabilistic calculations. Naive Bayes is one of the algorithms included in the classification technique (Zulfauzi & Alamsyah, 2020). Its basic concept is the Bayes theorem (Bayes rule), a theorem from statistics used to calculate probabilities. The Naive Bayes Classifier calculates the probability of each class from each group of attributes and determines the most probable class (Lestari et al., 2021). In other words, it looks for the highest probability value in order to assign the test data to the correct category. The Naive Bayes formula is:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \tag{2}$$

where $X$ is data with an unknown class; $H$ is the hypothesis that data $X$ belongs to a specific class; $P(H \mid X)$ is the probability of hypothesis $H$ given condition $X$ (posterior probability); $P(H)$ is the probability of hypothesis $H$ (prior probability); $P(X \mid H)$ is the probability of $X$ given the conditions in hypothesis $H$; and $P(X)$ is the probability of $X$.
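
As an illustration of how the Bayes rule leads to a class decision, the sketch below scores each class by its prior multiplied by the per-attribute likelihoods and picks the highest score. It assumes categorical attributes with two possible values each and uses Laplace smoothing; the function names and the toy data are hypothetical. Because P(X) is the same for every class, it is left out of the comparison.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][attribute_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def predict_naive_bayes(model, row):
    """Return the class H that maximizes P(H) * product of P(x_i | H)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(H)
        for i, value in enumerate(row):
            # Laplace smoothing; the "+ 2" assumes each attribute has two possible values
            score *= (feature_counts[label][i][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: (rain, rush_hour) -> traffic condition
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"),
        ("no", "no"), ("yes", "yes"), ("no", "no")]
labels = ["jam", "jam", "jam", "clear", "jam", "clear"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("yes", "yes")))
print(predict_naive_bayes(model, ("no", "no")))
```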

The data classification process can use several methods, one of which is the decision tree. The decision tree is one of the algorithms commonly used for decision making (Pamuji & Ramadhan, 2021) and is well suited to classification or prediction (Muningsih, 2022). The decision tree model has the form of a tree consisting of a root node, internal nodes, and terminal nodes. Classification proceeds from the root node, through the internal nodes, until a terminal node is reached. Entropy is used to determine which attribute the decision tree should split on: the higher the entropy of a sample, the less pure the sample is. The formula for calculating sample entropy is:

$$\text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{3}$$

where $S$ is the sample and $p_1, p_2, p_3, \ldots, p_n$ are the proportions of class 1, class 2, ..., class $n$ in the output.
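
The role of entropy in choosing a split can be sketched as follows. The example computes the entropy of a list of class labels and the information gain of a candidate split, which is the usual entropy-based criterion; the function names and the toy labels are hypothetical, and the exact criterion configured in the software used for this study is not stated in the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - weighted

# Hypothetical labels: a perfectly mixed sample and a split that separates it cleanly
parent = [1, 1, 0, 0]
print(entropy(parent))
print(information_gain(parent, [[1, 1], [0, 0]]))
```

A pure sample has entropy 0, and an evenly mixed two-class sample has entropy 1, so the attribute whose split yields the largest gain is preferred.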




The stages used in this study are presented in Figure 1 (Research Stages).


The first stage of this research is mining data from Twitter using the Orange software and the Twitter website. The second stage is a literature study to gather information relevant to the preparation of the final project, drawing on journals, books, references, and other reliable sources. Discussions and consultations with supervisors and experts in this field were also held during the preparation of this diploma thesis. Data processing in RapidMiner includes several steps: loading the dataset, pre-processing, splitting the data into training and testing data, model fitting (classification), model application (prediction), and producing the results. The results of the data processing are then discussed and used to draw the conclusions of the research.




In this study, a congestion dataset in CSV format was used for the classification process and to compare the accuracy of the three methods used, namely Naive Bayes, Decision Tree, and KNN. The data obtained are shown in Table 1.


Table 1

Traffic jam dataset on Twitter


Pre-processing and Labeling

The data obtained in this study need to be processed first. Given the textual nature of the collected data, a labeling process was carried out. The attribute identified in this study is pitability, an attribute that indicates whether bottlenecks can be overcome. Labels can be color-coded to facilitate the research process. Several pre-processing methods are used. Data validation reviews the type of data obtained and identifies and eliminates unused, inconsistent, and missing data, so that raw data becomes data that is ready to be processed and analyzed through data cleaning and data filtering (Teak, 2021). Inconsistent data are made consistent by replacing all missing values. This study also uses data integration and transformation methods to increase the accuracy of the three methods used. The Reduce Data Size and Discretize methods are applied, and duplicate data are removed using the delete duplicate operator. The initial 1,000 data entries become clean data through data validation, data integration and transformation, and data size reduction and discretization, so that the data can be analyzed to obtain new information.
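
The cleaning steps described above (dropping missing values and removing duplicates) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the RapidMiner operators used in the study: the function name `clean_records`, the (text, label) record layout, and the sample tweets are all hypothetical.

```python
def clean_records(records):
    """Drop records with missing values, then drop duplicates.

    Duplicates are detected ignoring case and surrounding whitespace;
    the first occurrence is kept and the original order is preserved.
    """
    seen = set()
    cleaned = []
    for text, label in records:
        if text is None or label is None or not text.strip():
            continue  # skip rows with a missing text or a missing label
        key = (text.strip().lower(), label)  # case-folded key for duplicate detection
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((text.strip(), label))
    return cleaned

# Hypothetical labeled tweets (1 = congested, 0 = not congested)
raw = [
    ("Heavy traffic jam in Jakarta", 1),
    ("heavy traffic jam in jakarta", 1),   # duplicate after case folding
    ("Road is clear this morning", 0),
    (None, 1),                             # missing text
    ("   ", 0),                            # empty text
    ("Toll road is packed", None),         # missing label
]

for record in clean_records(raw):
    print(record)
```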


Keyword Determination in Orange: Jakarta Traffic jams



Data Preprocessing Process


Word Cloud



NLTK Process in Google Colab


Uploading the *.csv File Using Pandas



Stopword process


Case Folding Process



Accuracy Measurement with Confusion Matrix

A confusion matrix is a tool for evaluating a classifier based on the results of the classification that has been carried out; the accuracy of those results reflects the performance of the classifier. The confusion matrix compares the classification results produced by the system (model) with the actual classes (Fikri et al., 2020).

The confusion matrix describes the performance of the classification model on a set of test data whose true values are known. Confusion Matrix is used to calculate accuracy.


Confusion Matrix

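Performance is measured from the confusion matrix using the TP, FP, FN, and TN values. A True Positive (TP) is positive data correctly predicted as positive. A True Negative (TN) is negative data correctly predicted as negative. A False Positive (FP) is negative data predicted as positive, and a False Negative (FN) is positive data predicted as negative.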
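
The four counts can be obtained by comparing actual and predicted labels pair by pair. The sketch below assumes binary labels with 1 as the positive class; the function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, and TN for a binary classification result."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # positive data predicted as positive
        elif p == positive:
            fp += 1  # negative data predicted as positive
        elif a == positive:
            fn += 1  # positive data predicted as negative
        else:
            tn += 1  # negative data predicted as negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical test labels and model predictions
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```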

Accuracy is calculated using the equation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

Naive Bayes Algorithm Accuracy Results


Confusion Matrix Naïve Bayes

The accuracy is 63.60%, with a class precision of 64.70% for pred. zero (pred. negative) and 57.98% for pred. one (pred. positive). The accuracy is obtained using Equation 4, where the true positives are 788, the true negatives 138, the false negatives 430, and the false positives 100. The accuracy can be verified as follows:

$$\text{Accuracy} = \frac{788 + 138}{788 + 138 + 100 + 430} = \frac{926}{1456} \approx 0.6360 = 63.60\%$$

The Performance Vector describes the table of analysis results obtained in this research. The True Positive value is 788, which is positive data correctly predicted as positive. The False Positive value is 100, which is negative data predicted as positive. The False Negative value is 430, which is positive data predicted as negative. The True Negative value is 138, which is negative data correctly predicted as negative.


Decision Tree Algorithm Accuracy Results

Confusion Matrix Decision Tree

The accuracy is 80.84%, with a class precision of 79.71% for pred. zero (pred. negative) and 83.53% for pred. one (pred. positive). The accuracy is obtained using Equation 4, where the true positives are 817, the true negatives 360, the false negatives 208, and the false positives 71.
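
As an arithmetic check on the reported figure, the counts above can be substituted into the usual accuracy ratio (correct predictions over all predictions):

```latex
\text{Accuracy} = \frac{817 + 360}{817 + 360 + 71 + 208} = \frac{1177}{1456} \approx 0.8084 = 80.84\%
```

The result agrees with the reported accuracy of 80.84%.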

The Performance Vector describes the table of analysis results obtained in this research. The True Positive value is 817, which is positive data correctly predicted as positive. The False Positive value is 71, which is negative data predicted as positive. The False Negative value is 208, which is positive data predicted as negative. The True Negative value is 360, which is negative data correctly predicted as negative.



K-Nearest Neighbors Algorithm Accuracy Results


Confusion Matrix KNN

The accuracy is 86.88%, with a class precision of 85.74% for pred. zero (pred. negative) and 89.19% for pred. one (pred. positive). The accuracy is obtained using Equation 4, where the true positives are 836, the true negatives 429, the false negatives 139, and the false positives 52.
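
As an arithmetic check on the reported figure, the counts above can be substituted into the usual accuracy ratio (correct predictions over all predictions):

```latex
\text{Accuracy} = \frac{836 + 429}{836 + 429 + 52 + 139} = \frac{1265}{1456} \approx 0.8688 = 86.88\%
```

The result agrees with the reported accuracy of 86.88%.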


The Performance Vector describes the table of analysis results obtained in this research. The True Positive (TP) value is 836, which is positive data correctly predicted as positive. The False Positive value is 52, which is negative data predicted as positive. The False Negative value is 139, which is positive data predicted as negative. The True Negative value is 429, which is negative data correctly predicted as negative.

The data classification process uses several operators: Read CSV, Split Data, the classification models (KNN, Naïve Bayes, and Decision Tree), Apply Model, and Performance. Each operator has its own function. Read CSV imports the CSV data that has been obtained; at this stage pre-processing is also carried out to inspect the imported data set for inconsistent data or missing values. Split Data takes an example set as input and delivers subsets of it through its output ports. Apply Model applies the trained classification model to the data. Performance displays the accuracy of each classification method.
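
The sequence of operators can be mirrored in a few lines of plain Python to show how the pieces connect. This is only a sketch under stated assumptions: the CSV content, the 70/30 split ratio, the random seed, and all function names are hypothetical, and a majority-class baseline stands in for the actual KNN, Naïve Bayes, and Decision Tree models.

```python
import csv
import io
import random
from collections import Counter

# Hypothetical CSV content standing in for the exported dataset
CSV_TEXT = """text,label
heavy traffic jam near the market,1
road is clear this morning,0
stuck for an hour on the toll road,1
smooth ride to the office today,0
gridlock at the intersection again,1
traffic is flowing normally,0
long queue of cars downtown,1
bumper to bumper since dawn,1
empty streets during the holiday,0
cars barely moving on the main road,1
"""

def read_csv(text):
    """Read (text, label) rows from CSV content."""
    return [(row["text"], int(row["label"])) for row in csv.DictReader(io.StringIO(text))]

def split_data(rows, train_ratio=0.7, seed=42):
    """Shuffle reproducibly and split into training and testing subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def fit_majority(train):
    """Stand-in model: remember the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def apply_model(model, test):
    """Predict the remembered label for every test row."""
    return [model for _ in test]

def performance(test, predictions):
    """Accuracy: share of test rows whose prediction matches the actual label."""
    correct = sum(1 for (_, actual), pred in zip(test, predictions) if actual == pred)
    return correct / len(test)

rows = read_csv(CSV_TEXT)
train, test = split_data(rows)
model = fit_majority(train)
predictions = apply_model(model, test)
print(len(train), len(test), performance(test, predictions))
```

Keeping each step as a separate function makes it straightforward to swap the stand-in model for another classifier while reusing the same split and the same performance measure, which is what a fair comparison of methods requires.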


Accuracy Results

Comparison of Accuracy Results

Comparative analysis of congestion classification accuracy using the results from K-nearest neighbors, Naïve Bayes, and Decision Tree shows that K-nearest neighbors produces the highest accuracy, namely 86.88%, for the congestion data used in this study, while Naïve Bayes reaches 63.60% and Decision Tree reaches 80.84%.
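
The comparison can be rechecked from the confusion-matrix counts reported for each method. The sketch below recomputes accuracy and ranks the methods. It also shows that, with the TP/FN/FP/TN labels exactly as reported, the class precision for pred. zero matches TP / (TP + FN) and the class precision for pred. one matches TN / (TN + FP); this is an observation about the reported numbers, not a statement of how the software lays out its table. The function names are hypothetical.

```python
# Confusion-matrix counts as reported in the text, using the paper's own labels
reported = {
    "Naive Bayes":   {"TP": 788, "TN": 138, "FN": 430, "FP": 100},
    "Decision Tree": {"TP": 817, "TN": 360, "FN": 208, "FP": 71},
    "KNN":           {"TP": 836, "TN": 429, "FN": 139, "FP": 52},
}

def accuracy(c):
    """Share of all predictions that are correct, in percent."""
    return 100 * (c["TP"] + c["TN"]) / (c["TP"] + c["TN"] + c["FP"] + c["FN"])

def row_precisions(c):
    """Ratios that reproduce the reported class precisions (see the note above)."""
    pred_zero = 100 * c["TP"] / (c["TP"] + c["FN"])
    pred_one = 100 * c["TN"] / (c["TN"] + c["FP"])
    return pred_zero, pred_one

results = {name: round(accuracy(c), 2) for name, c in reported.items()}
best = max(results, key=results.get)

for name, c in reported.items():
    p0, p1 = row_precisions(c)
    print(f"{name}: accuracy={results[name]:.2f}%  pred.zero={p0:.2f}%  pred.one={p1:.2f}%")
print("Highest accuracy:", best)
```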


The purpose of this study is to compare the accuracy of the methods used, namely K-nearest neighbors, Naïve Bayes, and Decision Tree. Judging from class recall and class precision, the method with the highest accuracy is K-nearest neighbors, at 86.88%. The Decision Tree and KNN classification methods performed well in this study, with accuracy above 80%, but further research could use other methods to seek higher accuracy.






