Abstract— An education system must develop the competence needed to improve students' skills and academic performance. Weakness in core subjects usually has an underlying cause, and this work seeks to identify the factors that play a significant role in students' academic performance. Modern science offers useful tools and techniques for understanding the education system better. The presented work uses data science to understand, and to develop a mechanism for identifying, the weaknesses or hurdles in a student's academic career, so that they can be communicated and used to improve the student's skills and the way those skills are applied. The data were obtained from two secondary schools in Portugal for two core subjects (Mathematics and Portuguese language). The recorded features (school performance, available facilities, and social and demographic status) were modeled with two data science techniques (decision tree and random forest). The results show that good predictive accuracy can be achieved, provided that previous period grades are available. A more efficient tool can therefore be developed to better learn student capacity, together with a recommendation system that can enhance students' abilities.

Keywords— classification; classifier; decision trees; random forest; education system

I. INTRODUCTION

Education is the key to success and has proved to be one of the most important factors in the development of human civilization. A lifelong learning system is required to develop the necessary competence. Student academic performance is usually assessed by grades, a representation of how well the student has prepared for and performed in class and how well the student has mastered the material presented. However, other elements may also affect student performance.
In addition to the basic features, we look for other attributes that might contribute to student performance.

Modern science has made useful advances in developing valuable tools. Everything generates data: our cars, smartphones, hospitals, and even education. The Internet of Things (IoT) is one of the major producers of big data; its devices, appliances and products generate massive amounts of data every second. The basic idea behind IoT is to establish machine-to-machine communication. Everything nowadays leaves a trail of digital exhaust. The collected data may be structured or unstructured, and the data we produce can be a close reflection of our behavior. Data science provides techniques that can be combined into a tool to aid the education system. Education statistics can be used to evaluate academic performance and then to actively reallocate resources according to student capacity.

This paper deals with a classification problem, for which the analyst requires well-formatted, structured data. We appreciate the work of P. Cortez and A. Silva, "Using Data Mining to Predict Secondary School Student Performance" [1], for contributing the data in cleaned and structured form.

This research intends to recognize and analyze the relationship between social, academic and other attributes and student performance, specifically for two subjects: Mathematics and Portuguese language. The records come from two secondary schools in Portugal. They were obtained in two ways: first, from school reports of grades and absences; second, from questionnaires filled in by parents, students and teachers, which include features such as social, gender and academic information about the students. The focus is to predict each student's final grade and to measure the efficiency of the prediction. The two subjects were modeled under two objectives:

1) binary classification (i.e., Pass and Fail)
2) five-level classification (i.e., the data classified into five levels, from Fail to Excellent)

The two datasets (Mathematics and Portuguese language) were passed to the algorithms under both objectives, and the observations were recorded for three different input criteria. For both approaches, three input setups (e.g., with and without the school period grades) and two data science algorithms (decision tree and random forest) are tested.

Muhammad Owais Baig and Muzammil Ahmed Ansari
Dept. of Computer Science
Iqra University
Karachi-75500, Pakistan
Email: [email protected], [email protected]

Among the many available classification techniques, we selected the decision tree and random forest for the present work, because of their simplicity and ease of operation in addition to their effectiveness and computational efficiency. In future, data scientists can apply other algorithms to analyze model performance. Later sections discuss how these simple techniques can yield strong prediction accuracy.

II. MATERIAL AND METHODS

This study considers data collected during the 2005-2006 school year from two public schools in the Alentejo region of Portugal [1]. At the time, record keeping was very poor in the region's public schools, which kept only the students' previous period grades and attendance records. Two methods were therefore used to collect student information: previous performance and attendance were taken from school reports, while the other features were gathered from questionnaires filled in by volunteer students.

The human brain is always occupied with circumstantial activities, and students in particular are considered to be under tremendous pressure, so their performance can be affected by different circumstances. As students ourselves, we can easily relate to this.
Hence, in addition to previous grades, more features were included to add dimensions and give a clearer perspective.

The questionnaires covered many social, academic and financial factors, as well as the resources and facilities available to a student. The final dataset used in this work is structured with thirty-three attributes (e.g., family income, parents' education, leisure time activities, social and emotional attachments). The records cover 649 students of Portuguese language and 395 students of Mathematics. The last four columns denote the variables taken from the school reports [1]. The attributes for parents' education are categorized into five levels (0 for none, 1 for primary, 2 for middle, 3 for secondary and 4 for higher education); similarly, parents' jobs are categorized into five groups (teaching, health services, civil services, at home or other). Alcohol consumption is rated from 1 (very low) to 5 (very high). Features such as school support, family support, internet access, address, parents' status, romantic relationship and interest in higher education are binary, with either yes or no. As in other parts of Europe, schools in Portugal use a 20-point grading scale ranging from 0 to 20, where a grade below 9 is considered a fail. The material belongs to the education sector: the research is of interest to the students enrolled in these two subjects and to researchers who are ready to dive in and look for the unobvious.

III. MODEL AND TECHNIQUES

A. Modeling and Algorithms

Classification problems are a form of supervised learning, in which labeled data are used to build the prediction model. The available dataset was therefore labeled further. For binary classification, the model learns to identify the passing criterion. For multi-level classification, five labels ("Fail", "Sufficient", "Satisfactory", "Good" and "Excellent") are used to differentiate performance.

One of the challenges is choosing among the many supervised algorithms. Two widely used classification methods were chosen and compared statistically to analyze their accuracy on this specific dataset.

This work observes the performance of a very common method, the decision tree (which asks a series of questions, each answer leading to a follow-up question until the model reaches a conclusion) [7], and compares it with random forest (which builds a forest of many trees; adding more trees does not cause the classifier to overfit) [34]. In a decision tree, the tree grows by binary splits on a specified condition. For the random forest computations, Breiman's random forest algorithm is used.

As the name implies, a decision tree grows branches for the possible outcomes. The tree is driven by condition statements, and the process continues until a condition is met. A random forest is a classifier that consists of many decision trees, each built over randomly selected variables, for a specified number of trees.

To calculate the accuracy of the classifiers and to understand the prediction performance of the models, the dataset was split into two portions: 80% of the data was used as the training set, while the remaining 20% was held out as the test set for validation.

B. Computations

All computational and graphical analysis was carried out in R, the statistical computing environment. R is open source, and it and its wide variety of libraries are available for all major platforms (Windows, Linux and macOS). Two libraries must be loaded before any computation: "tree" for the decision tree and "randomForest" for the random forest algorithm.

IV. RESULT

A. Preprocessing

Before implementing the model and tuning the features to the dataset, it is good practice to understand the structure of the attributes. The str function in R displays the structure of any object. str does not return anything, for efficiency reasons; its obvious side effect is output to the terminal. Random forest and other algorithms treat categorical variables differently from numerical variables. Decision tree and random forest require categorical data to be declared as such, which also makes it quick to evaluate outliers and invalid or missing values. The factor function was therefore used to convert the attributes to a discrete categorical type, so that the model treats them as factors rather than as integers or characters.

B. Input Criteria

Three input criteria were adopted to assess how the models perform and to compare the two algorithms:

1. Criterion A: compute the prediction without G3 (the final grade)
2. Criterion B: compute the prediction without G3 and G2 (the final and previous grades)
3. Criterion C: compute the prediction without any period grades, i.e., without G3, G2 and G1

For all three criteria, both binary and multi-level classification were performed with the decision tree and random forest algorithms, separately for the Mathematics and Portuguese language courses.

C. Implementing Formula

The dataset was split into training and testing sets. The training set was used to fit the formula, while the testing set was used to analyze the performance of the model.

The tree package was used with the default values of the tree formula and all attributes of the dataset. For random forest, mtry was set to 6, the recommended value being the square root of the total number of variables; a lower mtry means less correlation between trees. In addition, ntree was set to 500 trees, with the other parameters of the randomForest formula left at their defaults.

D. Predictive performance

Table 1 presents the comparative analysis for binary classification under all three input criteria, with the best result marked in bold. The table covers the two datasets, i.e., Portuguese and Mathematics students. Each dataset was passed to two algorithms, DT (decision tree) and RF (random forest), and the results were collected under the three input conditions (i.e., exclusion of G1, G2 and G3).

A prediction accuracy of up to 92.40% was achieved during testing. The comparison between decision tree and random forest is shown for the different input criteria. The difference between the two models (DT and RF) is small, but the random forest algorithm outperformed the decision tree for binary classification in both Mathematics and Portuguese language. It is also evident that the performance of both models degraded as the previous evaluations were removed. Moving down the input criteria, the previous period grades (i.e., G1 and G2) are found to be vital for the model, as prediction accuracy decreases gradually without these attributes.

Table 2 presents the prediction accuracy for five-level classification. The results drop significantly from binary to multi-level classification. In addition, the decision tree performs better than the random forest for multi-level classification. For input A, DT and RF achieve the same accuracy on the Mathematics course, while DT performs better on the Portuguese language course. The effectiveness of both models drops for input criteria B and C in five-level classification. For the first two inputs, DT produced better accuracy than RF. The results for criteria B and C are nevertheless not very good.

The significance of the previous period grades (G1 and G2) is more prominent in multi-level classification, as accuracy drops drastically from 77.21% to 24.04%. Fig. 1 shows that the strongest predictors are G2, with a MeanDecreaseGini of 115, and G1, with 63. MeanDecreaseGini is a measure of the importance of a variable in the model. Apart from the period grades G1 and G2, the student's attendance (absences) and health (health) are found to be important.
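The MeanDecreaseGini values cited above come from the randomForest package, which scores a variable by the total decrease in Gini node impurity from splits on that variable, averaged over all trees. As a minimal sketch of the underlying quantity (written in Python for illustration, not the R code used in this work), the snippet below computes the Gini impurity of a node and the impurity decrease of a single split. The labels and the split are hypothetical toy data, and the function names gini_impurity and gini_decrease are illustrative. The sketch works with class proportions, so its values are not on the same scale as those reported in Fig. 1.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical toy node: 4 passing and 4 failing students.
parent = ["pass"] * 4 + ["fail"] * 4
# Hypothetical split on a previous-grade threshold; each side has one misplaced student.
left = ["fail", "fail", "fail", "pass"]
right = ["pass", "pass", "pass", "fail"]

print(gini_impurity(parent))               # 0.5
print(gini_decrease(parent, left, right))  # 0.125
```

A variable that is used in many splits with large decreases, as G2 and G1 are here, accumulates a high importance score, while a variable that rarely separates the classes stays near zero.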
Notably, the mother's job has more significance than the father's occupation, and school support (schoolsup) shows an inverse relation to final grades.

V. FUTURE WORK

Data science is a rapidly evolving field. A tremendous amount of data is available nowadays that can be used to understand human behavior toward an object or element. The two datasets contain at least 382 students in common, who can be identified by the attributes they share across both courses. In future, data scientists can combine the two datasets and analyze the failure rate among these 382 students. It would be worth noting whether these students perform differently in the Mathematics and Portuguese language courses; there might be an interest factor favoring literature or accounts, which can be verified in a future study.

In future, data scientists and school administrations can record additional features to evaluate student performance on a wider canvas. The features collected undoubtedly offer a resourceful insight into student background. Still, we feel that some further features could bring us closer to a student's mental circumstances. Examples include course interest, extra tutorials, financial difficulties, job status (full time, part time, not employed), assignment completion and class participation. Attribute selection also requires good domain knowledge, and school teachers and management play a vital role in this regard.

VI. CONCLUSION

Students are the future of any nation, and the education system is the backbone of a nation's economic growth. It is therefore essential to assure the quality of the education sector and to provide direction to students. This can be achieved by many means, including counseling, extra study sessions and access to resources.
But the allocation of resources is only possible if we understand student behavior and attitude toward a particular course. The idea was to develop a tool or mechanism, using a data science model, to predict student performance from past evaluations and other valuable attributes.

To understand and visualize the difficulties a student goes through in his or her academic career, a model was developed and tested on a relatively small dataset. It attained an accuracy of 92.24% for binary classification, while 80.62% is achievable for multi-level classification. We consider this a good result, and the model can be tuned further to provide more accurate results. The confusion matrix of the binary classification for random forest and that of the multi-level classification for decision tree are given below.

Model accuracy is highly influenced by students' previous grades; without previous evaluations, competitive accuracy is not attainable. This inconsistency of the model does not help the cause, so more academic features are required to understand the abilities and dislikes of students over the course of an academic year.

This is the age of information, and data science is an application of modern technology. Everything is measurable and quantifiable; future efforts should therefore be made to keep the records in a database rather than collecting the information manually through questionnaires. Even so, with the available resources and feature sets, data science and its techniques proved capable of contributing to the improvement of the education system.

ACKNOWLEDGMENT

We wish to thank P. Cortez and A. Silva, University of Minho, for their preceding work, the data collection and initial cleaning, and the structured records they presented. Their prior efforts on the subject of data science and the education system also provide direction for students of technology like us.