The Next Big Thing for IR: The Data Lake

Dear Sonia: My IR director wants us to create a data management system that can house a data lake. What is a data lake and how can we accomplish this?

For institutional research and chief data officers (CDOs), one of the biggest challenges can be the development of a data management system to house a data lake: the nirvana of a clean, comprehensive institutional data warehouse. First-generation CDOs are often hampered by legacy systems that previously contained transformed tables and fields with few controls on data definitions. Institutional misunderstanding of “democratizing” data can lead to additional problems, such as too many staff entering data. Frequently, systems purchased outside of the student information system are not well integrated with reporting structures. Before cloud storage, storing data was often costly, so some IR offices may have opted to use shortcuts to preserve their snapshots rather than build a true warehouse environment. Each of these openings can move garbage into the data environment, leaving an institution with more of a data swamp than a data lake.

Building a clean data lake is necessary for a robust reporting environment, but it is not an easy task. Even so, it is essential for keeping up in a fast-moving data environment and competing for a decreasing pool of students.

So, how does an IR office lead the charge to clean up a data swamp? Data governance must come into play, and there are basically two methods from which to choose:

Method 1

This method uses a top-down approach in which institutional research controls the data. It requires a heavy hand from upper administration (especially the CEO) and an understanding among staff that all roads for data codes, and all authority for making changes, go through the IR office. Although this can speed up the path to building a lake and may work in smaller environments, it is often not feasible at large institutions with dispersed reporting and culturally ingrained “rights” to data.

Method 2

The second method for cleaning up a data swamp is much more involved. It requires a great deal of attention to best practices in change management and obtaining buy-in from various stakeholders throughout the institution. In these environments, it may be necessary to change an entire culture of potentially thousands of people, which is not an easy task. Here are some things to consider:

Create a task force. In order to begin getting buy-in, it is important to build a committee of people who are well versed in the current data systems and engage them in initial discussions of how to restrict access. Administrators and faculty involved with data should be included in the task force.
Develop a data governance policy. The policy will provide a mechanism to establish a well-defined and communicated data governance structure with clearly established roles and responsibilities. It should also establish a universally understood central repository for data standards and access controls.
Create subcommittees to distribute the work and have them report to the main task force. Possible examples include:

Data Quality/Data Definition Committee: Focused on tightening definitions, examining the lineage of the data designated to go into the warehouse, and creating more validation processes, potentially using ACID transactions so that data is committed only when the entire transaction is validated.
Data Warehouse Committee: Focused on constructing the data lake environment and responsible for moving all sources of data (e.g., SIS, LMS, sponsored research data, and other existing silos), using and developing APIs and DML commands to manipulate or update data where necessary, and creating batch processes for the loads.
Data Wrangling Committee: Focused on identifying where garbage data is entering the environment.

Examine position descriptions and ensure that those assigned to enter data into the system are specialized and that data entry is a high priority.
Develop dashboards and reports for a test environment. Make sure user testing is employed, as this will assist in ensuring that any remaining issues are uncovered.
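
For readers who want to see what the "entire transaction is validated" idea from the Data Quality/Data Definition Committee item might look like in practice, here is a minimal sketch in Python using the standard library's sqlite3 module. The enrollment table, the term codes, and the credit-hour limit are all hypothetical and chosen only for illustration; a real warehouse would use its own schema, rules, and database engine. The point is the all-or-nothing behavior: if any record in a batch fails validation, nothing from that batch is stored.

```python
import sqlite3

# Hypothetical set of approved term codes, standing in for an
# institution's agreed-upon data definitions.
VALID_TERMS = {"2024FA", "2025SP"}


def make_warehouse():
    """Create an in-memory database with a hypothetical enrollment table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE enrollment (student_id TEXT, term TEXT, credits INTEGER)"
    )
    conn.commit()
    return conn


def validate(record):
    """Raise ValueError if a record breaks one of the (illustrative) rules."""
    student_id, term, credits = record
    if not student_id:
        raise ValueError("missing student_id")
    if term not in VALID_TERMS:
        raise ValueError(f"unknown term code: {term!r}")
    if not 0 < credits <= 24:
        raise ValueError(f"implausible credit load: {credits}")


def load_batch(conn, records):
    """Insert every record in the batch, or none of them.

    Returns True if the batch was committed, False if it was rejected.
    """
    try:
        # Used as a context manager, the connection commits if the block
        # finishes normally and rolls back if an exception is raised.
        with conn:
            for record in records:
                validate(record)
                conn.execute(
                    "INSERT INTO enrollment (student_id, term, credits) "
                    "VALUES (?, ?, ?)",
                    record,
                )
        return True
    except (ValueError, sqlite3.Error) as exc:
        print(f"batch rejected: {exc}")
        return False


def row_count(conn):
    """Number of rows currently stored in the enrollment table."""
    return conn.execute("SELECT COUNT(*) FROM enrollment").fetchone()[0]


if __name__ == "__main__":
    conn = make_warehouse()
    load_batch(conn, [("S001", "2024FA", 15), ("S002", "2025SP", 12)])
    # The second batch contains a nonstandard term code, so the whole
    # batch is rolled back, including the valid record that precedes it.
    load_batch(conn, [("S003", "2024FA", 9), ("S004", "FALL24", 12)])
    print(row_count(conn))
```

Rejecting the whole batch rather than only the bad row is a deliberate choice in this sketch: it forces the source of the bad code to be found and fixed, which is the kind of work the Data Wrangling Committee would take on.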

As you can see, the second method is a much more inclusive way to develop a data management system. However, depending upon the size of the organization, this may not be necessary.

Whichever method your institution decides to use, it is imperative that those who are cleaning the swamp have backing from executive administration, including the president or CEO.