  • 03.16.21

Data Mining: A Primer for IR

  • by Nora Galambos, Senior Data Scientist, Stony Brook University

What Is Data Mining and Why Would I Want to Use It?

Data mining combines statistics, artificial intelligence, and machine learning to discover relationships and insights hidden within large volumes of data. As data storage capacity and computer processing power have increased, it has become possible to run data mining algorithms quickly and use the resulting models to make data-informed decisions.

Student records and transactional sources, such as learning management systems (LMS), provide a wealth of untapped information that can be used to model student outcomes such as retention, graduation rates, and grade point averages. Besides being able to incorporate large amounts of data from many sources into one predictive model, data mining has a number of advantages over more traditional statistical methods. Those advantages are discussed later in this article.

Educational Data Mining (EDM) and Learning Analytics

As you know, data mining has been widely used in retail, marketing, finance, customer service, product innovation, and, of course, social media optimization. Not surprisingly, over the past several years educational data mining (EDM) and learning analytics have emerged as areas of research. EDM uses data-intensive approaches (e.g., data mining and machine learning) along with statistics to study how students learn and to develop theories and methods that inform teaching and customize the delivery of course materials. Learning analytics also applies data mining, machine learning, and statistics, and adds methods from sociology and psychology to study data collected from educational services, teaching, and student learning in order to assess student progress and predict outcomes. It draws on student data from completed assignments, exams, and online course discussion forums, as well as general student characteristics and student activities. These data can also be used to assess curricula and programs, thereby contributing to campus assessment activities. Interest in EDM and learning analytics has spawned societies, such as the International Educational Data Mining Society, the International Society of the Learning Sciences, and the Society for Learning Analytics Research, and journals, such as the Journal of Educational Data Mining.

Clustering Students to Offer Specialized Resources and Tutoring

There has been much research into improving online learning environments because the data can be easily collected. Cluster models can be used to group students based on their learning difficulties and their use of learning tools. The data, gathered from the LMS, may include participation in class-related online forums and/or class discussions, practice tests, and interactions with the online learning materials. These data can be used to identify students who may be at risk of failing the class and to recommend resources to students with similar needs. Students in different clusters can be given specialized tutoring or course advice.
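To make the clustering idea concrete, here is a minimal sketch of k-means, one common clustering method, written in plain Python. The student IDs, the two features (forum posts and practice tests taken), the values, and the starting centroids are all hypothetical and chosen only for illustration. A real analysis would use many more features, scale them first, and most likely rely on a statistics package rather than hand-written code.

```python
import math

# Hypothetical LMS features per student: (forum posts, practice tests taken).
students = {
    "s1": (1, 0),
    "s2": (2, 1),
    "s3": (0, 1),
    "s4": (9, 8),
    "s5": (10, 7),
    "s6": (8, 9),
}


def nearest(point, centroids):
    """Return the index of the centroid closest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def mean_point(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(data, centroids, max_iter=20):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    labels = {}
    for _ in range(max_iter):
        labels = {name: nearest(p, centroids) for name, p in data.items()}
        updated = []
        for k, old in enumerate(centroids):
            members = [data[n] for n, lab in labels.items() if lab == k]
            # Keep the old centroid if no student was assigned to it.
            updated.append(mean_point(members) if members else old)
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return labels, centroids


# Fixed starting centroids keep this sketch deterministic.
labels, centers = kmeans(students, [(0.0, 0.0), (10.0, 10.0)])
low_engagement = sorted(n for n, lab in labels.items() if lab == 0)
print("Low-engagement cluster:", low_engagement)
```

Students who land in the low-engagement cluster could then be offered the same set of resources or tutoring, which is the use described above.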

Finding Specific Learning Activities Associated with Better Grades

Relationships can be mined to develop association rules that can be used to improve teaching. For example, association rules can model the co-occurrence of student errors. Student LMS activity, such as participation in class discussion forums, can also be associated with grades in a class. Another use of relationship mining is to model the relationships between student performance and course sequence or pedagogical methods. Many of these methods can then be used to customize course delivery to individual students by uncovering the optimal sequence of topics for particular students, finding learning activities that are associated with better grades, or identifying elements of an online learning environment that are related to improved learning and student satisfaction.
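As an illustration of how an association rule is evaluated, the sketch below computes the three standard measures (support, confidence, and lift) for a single rule linking forum activity to grades. The records and item names such as `forum_active` are invented for this example, and the code assumes the antecedent and consequent each occur at least once.

```python
# Each set is one hypothetical student's record for a course.
records = [
    {"forum_active", "practice_tests", "grade_B_or_better"},
    {"forum_active", "grade_B_or_better"},
    {"forum_active", "practice_tests", "grade_B_or_better"},
    {"practice_tests"},
    {"forum_active"},
    {"practice_tests", "grade_B_or_better"},
    set(),
    {"forum_active", "grade_B_or_better"},
]


def support(itemset, rows):
    """Share of rows that contain every item in the itemset."""
    return sum(itemset <= row for row in rows) / len(rows)


def rule_metrics(antecedent, consequent, rows):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    both = support(antecedent | consequent, rows)
    confidence = both / support(antecedent, rows)
    lift = confidence / support(consequent, rows)
    return both, confidence, lift


sup, conf, lift = rule_metrics({"forum_active"}, {"grade_B_or_better"}, records)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

A lift above 1 means the grade outcome is more common among forum-active students than in the class overall. It indicates an association, not that forum activity causes better grades.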

Improving Learning in Blended Courses

One major university has developed sophisticated learning systems for online and blended learning courses (i.e., courses that combine online learning with classroom teaching) by applying learning science methods to develop ways for students to practice with the course materials. The practice is then followed up with feedback. In blended courses, the LMS collects the work students do online, giving professors feedback on topics that students are finding difficult, which is then used to plan upcoming classroom activities. The courses provide students with targeted feedback combined with self-assessment tools. The feedback is immediate and is provided after each question in a learning exercise. This model can evaluate when a student is ready to move on to the next topic or is falling behind, what grade the student is likely to earn without extra help, and whether the student needs to be referred to a tutor.

Why Use Data Mining Instead of More Traditional Analytic Methods?

There are a number of advantages to data mining as opposed to more traditional statistical analysis. Data mining differs from some types of statistical methods (e.g., chi-square tests, analysis of variance, or t-tests) in that the goal is to develop a predictive model rather than to find factors significantly associated with an outcome. There are also fewer assumptions to satisfy than with traditional statistical methods (e.g., requirements for normal distributions or homogeneity of variance). For linear regression, multicollinearity needs to be evaluated, and when a large number of predictors are involved, it can be time consuming to properly assess all of the potential multicollinearity issues.

Another important consideration is missing data, which become more problematic with large datasets that may include dozens or even hundreds of predictors. When performing linear or logistic regression, observations with missing data are listwise deleted. That means that any observation (or row of data) with one or more missing values on any predictor entered into the analysis will be eliminated. With large datasets, a rather large percentage of observations may be listwise deleted. Imputation could be used, but doing so properly can be a time-consuming enterprise that involves studying the missing data mechanisms and the distributions of the variables. In decision tree analysis, by contrast, missing data are not listwise deleted. Instead, the decision tree algorithm combines the missing values with categories in the data based on significant statistical associations or other appropriate considerations. The decision tree output clearly labels where the missing data are placed in the model.

Lastly, some data mining software will generate code that can be used to score new data. For example, if you have a model to predict one-year retention, you can run the score code on data from a new term to generate the new retention predictions.
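The cost of listwise deletion can be shown with a small sketch. The rows, field names, and cutoff below are hypothetical. The `expected_complete_share` helper assumes every predictor is missing independently at the same rate, which is a simplification, and `bin_with_missing` only approximates tree-style handling by keeping "missing" as its own category. Real decision tree software decides where to place missing values using its own criteria.

```python
# Hypothetical student rows; None marks a missing value.
rows = [
    {"sat_math": 620, "hs_gpa": 3.4, "lms_logins": 12},
    {"sat_math": None, "hs_gpa": 3.1, "lms_logins": 7},
    {"sat_math": 580, "hs_gpa": None, "lms_logins": 3},
    {"sat_math": 700, "hs_gpa": 3.9, "lms_logins": None},
    {"sat_math": 540, "hs_gpa": 2.8, "lms_logins": 5},
    {"sat_math": None, "hs_gpa": 3.6, "lms_logins": 9},
]


def complete_cases(data):
    """Listwise deletion: keep only rows with no missing values."""
    return [r for r in data if all(v is not None for v in r.values())]


def expected_complete_share(miss_rate, n_predictors):
    """Expected share of complete rows if each predictor is missing
    independently at the same rate (a simplifying assumption)."""
    return (1 - miss_rate) ** n_predictors


def bin_with_missing(value, cutoff):
    """Simplified tree-style alternative: 'missing' is its own category."""
    if value is None:
        return "missing"
    return "high" if value >= cutoff else "low"


kept = complete_cases(rows)
share_50 = expected_complete_share(0.05, 50)
sat_bins = [bin_with_missing(r["sat_math"], 600) for r in rows]

print(f"{len(kept)} of {len(rows)} rows survive listwise deletion")
print(f"Expected complete share, 50 predictors at 5% missing: {share_50:.3f}")
print(sat_bins)
```

Under the independence assumption, 50 predictors that are each only 5 percent missing would leave fewer than one row in ten complete, whereas the binning approach keeps every row in the analysis.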

How Can I Learn Data Mining?

To learn to do your own data mining, it is important to have statistical expertise a bit above the intermediate level and some programming and/or data management skills. Before data mining begins, a large volume of data must be assembled. For example, if the plan is to predict the first-term GPA for new freshmen who are in their first weeks of college, you may want to assemble admissions data, including test scores and other information pertaining to their pre-college profiles; their early LMS interactions, which will entail some programming to transform the data into an analyzable form; course information; early campus activities/interactions; and any other available early data that may be associated with students’ academic performance or satisfaction with their college experiences. Ideally, this work would be done by IT personnel. If no data management assistance is available, it is important to be comfortable combining large volumes of data from different sources. When using data mining software, it is necessary to configure options that tell the program which statistics to use to evaluate the variable associations, compare models, and evaluate their strengths. It is also important to understand those statistical methods, as well as model diagnostics such as gain, lift, receiver operating characteristic (ROC) curves, and many others. For those not comfortable with statistics, the recommended first step is to find online courses to advance statistical knowledge, particularly of model diagnostics. For those ready to learn data mining, SAS, SPSS, and other providers of data mining software offer online courses that can get you started.
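Two of the diagnostics named above, lift and the area under the ROC curve (AUC), can be computed by hand on a small example. The scores and outcomes below are made up for illustration, with 1 meaning the student was not retained. This is a sketch of the definitions, not the output of any particular data mining package.

```python
# Hypothetical (predicted risk score, actual outcome) pairs; 1 = not retained.
scored = [
    (0.91, 1), (0.85, 1), (0.78, 0), (0.66, 1), (0.52, 0),
    (0.47, 1), (0.35, 0), (0.30, 0), (0.22, 0), (0.10, 0),
]


def lift_at(pairs, fraction):
    """Event rate in the top-scored fraction divided by the overall event rate."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    base_rate = sum(y for _, y in ranked) / len(ranked)
    return top_rate / base_rate


def auc(pairs):
    """Area under the ROC curve via pairwise comparison: the share of
    (event, non-event) pairs in which the event has the higher score.
    Ties count as half a win."""
    events = [s for s, y in pairs if y == 1]
    non_events = [s for s, y in pairs if y == 0]
    wins = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return wins / (len(events) * len(non_events))


top_20_lift = lift_at(scored, 0.2)
model_auc = auc(scored)
print(f"Lift in top 20%: {top_20_lift:.2f}")
print(f"AUC: {model_auc:.3f}")
```

A lift of 2.5 in the top 20 percent means the highest-scored students are two and a half times as likely to leave as a randomly chosen student. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.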

