Why Good Intentions Aren’t Enough to Use Data Ethically

Significant ethical scandals are thankfully rare in institutional research and effectiveness (IR/IE). Every few years brings a case of an institution that has deliberately falsified data, usually data submitted to external surveys run by private actors; the more serious concern of falsifying data submitted to government agencies or accreditors appears, at least, to be just short of unheard of. There is the occasional headline-grabbing scandal, as when then-Mount St. Mary’s University president Simon Newman intended to improve his institution’s retention rate (and thus its U.S. News ranking) by administering unreliable surveys asking potentially illegal questions and using the results to weed out struggling students before they would be reported as part of the institution’s retention rate cohort, an approach he described by saying, “Sometimes you have to drown the bunnies, put a Glock [handgun] to their heads.”

But for the most part, data scandals are infrequent and most often involve IR/IE offices only peripherally. We certainly need to be reminded from time to time that simply “following orders” absolves us of neither our ethical responsibility nor our legal liability. Periodic review of those responsibilities, especially in areas such as privacy law and human subjects research ethics, is something all offices should include in their ongoing professional development activities. Occasionally those carrying out the IR/IE functions (whatever their position in the institutional administration) play the hero, as when the team developing Newman’s survey apparently ran out the IPEDS reporting clock to prevent the survey from being used. It is certainly worth honoring those of our profession who courageously uphold its ethical standards.

That such scandals are rare does not mean that IR/IE professionals do not face challenges in data ethics. Sometimes these are easily recognized, such as when we suppress data cells that would reveal individual information. But some of the deepest data ethics challenges are those we may not be aware of. We tend to think that data is an objective reflection of reality and, thus, that if we accurately represent the data and protect the privacy rights of our data subjects, then we have done our duty. The reality of data systems is, alas, much messier. Because of that, good intentions aren’t enough to ensure that we use data ethically.

The objective view of data is that a data system stores information that is an objective reflection of the world as it exists. There are two ways in which we might mean this to be true. The most common sense of it, what we might call “strongly objective,” is that an accurate data system provides the definitively true view of the world. The fields represent all of the things that are in the world (at least all of those of interest); the set of possible values represents all possible values of those things in the world; and the values each case holds accurately categorize the case. Any other data is false (or at least dismissed as “anecdotal”), and any case in which reality contradicts the data system is a case of an inaccurate data system, one that is fixable with technical means.

The problem with this view is that it assumes that there is only one way to accurately represent the world. IR/IE professionals should immediately recognize this as deeply problematic. We routinely experience data systems that reflect the myriad ways we can represent the world in our data. At my own institution, we have recognized at least three different schemas for storing race and ethnicity data: the standard IPEDS framework, the U.S. Census framework that stores race and ethnicity data for non-resident aliens, and the state framework that uses binary fields for each racial category to provide a complete picture of multiracial students. Each represents a different set of purposes and constructs, and each stores data differently according to those purposes. None is inherently incorrect, and often they are incommensurable. We can say at most that the data is objective in a weak sense—that the data system can reliably assign values to cases—but not in the strong sense that it is a purely observational representation of reality. We can say that some schemas are better than others on technical or observational grounds. But we cannot say that there is one inherently right schema that can be determined by observation and technical characteristics of the data system to be the only objective one. The architecture of a data system is a process of translating reality into one of many possible representations with no fully objective basis for choosing one representation over another.
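To make the point concrete, here is a minimal Python sketch of how the same students might be stored under three schemas loosely modeled on the ones described above. It is an illustration under stated assumptions, not any agency's actual specification: the field names, category labels, function names, and simplified assignment rules are all hypothetical, and real reporting rules carry more detail than is shown here.

```python
from dataclasses import dataclass

# Hypothetical, simplified list of race categories used only for this sketch.
RACES = (
    "American Indian or Alaska Native",
    "Asian",
    "Black or African American",
    "Native Hawaiian or Other Pacific Islander",
    "White",
)


@dataclass(frozen=True)
class Student:
    """What the student told us, before any schema is imposed."""
    races: frozenset      # zero or more entries from RACES
    hispanic: bool        # self-reported Hispanic/Latino ethnicity
    nonresident: bool     # True for non-resident aliens


def ipeds_style(s: Student) -> str:
    """One mutually exclusive category per student (simplified assumption)."""
    if s.nonresident:
        return "Nonresident alien"      # visa status overrides race/ethnicity
    if s.hispanic:
        return "Hispanic/Latino"        # ethnicity overrides race
    if len(s.races) > 1:
        return "Two or more races"
    if len(s.races) == 1:
        return next(iter(s.races))
    return "Race and ethnicity unknown"


def census_style(s: Student) -> dict:
    """Ethnicity and race kept as separate answers, regardless of visa status."""
    return {"hispanic_origin": s.hispanic, "races": sorted(s.races)}


def state_style(s: Student) -> dict:
    """One binary field per racial category, so multiracial detail survives."""
    return {race: race in s.races for race in RACES}


# Two hypothetical students with identical self-reported race.
visiting = Student(frozenset({"Asian", "White"}), hispanic=False, nonresident=True)
resident = Student(frozenset({"Asian", "White"}), hispanic=False, nonresident=False)

for student in (visiting, resident):
    print(ipeds_style(student), census_style(student), state_style(student))
```

In this sketch the two students are indistinguishable under the binary-field schema and the Census-style schema, yet fall into different categories under the IPEDS-style schema. Each function assigns values reliably, which is the weak sense of objectivity; none of them can be recovered from the others without the original responses, which is why no single one can claim to be the objective representation.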

The traditional view of data systems simply considers this bias and dismisses those data systems accordingly. This can be done when one considers individual data systems in isolation. One could, for instance, criticize a data system that stores native language as a single value as being biased toward 19th- and early 20th-century nativist understandings of nationalism, ethnicity, and family. One could propose a system that stores multiple languages spoken in the subject’s home as lacking such a bias. But taken together, there is no objective means of choosing a schema that is separable from the politics of inclusion that is inherent in the concept of “native.” We cannot store native language data without deciding what the relationship is between language and social membership. We can choose between being exclusive or inclusive, but either is a political choice. That most institutions will opt for the latter as an expression of an institutional commitment to diversity may make the schema more just, but it does not make it more objective. This means that all data, not just biased data, comes to us with social and ethical considerations built into it. System design is as much a social question as it is a technical one. And so data ethics is an inherent part of working with data rather than an occasional challenge.

As IR/IE professionals, we do have resources to help us make these decisions. Data ethics is the chief, though not the sole, concern of the AIR Statement of Ethical Principles. The Statement was adopted in September 2019, replacing a code that many saw as unnecessarily legalistic and proscriptive with principles that can guide IR/IE professionals in making practical decisions about the ethical challenges we encounter. The Statement calls on us to “act with integrity,” and includes 11 principles that articulate what that means. The principles recognize our responsibility as data stewards, our duties to represent data effectively, and our obligations to respect the legal and moral rights of data subjects. Most importantly, the Statement calls on us to be cognizant of the consequences of our work for the people we serve: “The analytic algorithms and applications we build and/or implement, as well as the policy decisions incorporating information we analyze and disseminate, impact people and situations.”

This concern for consequences often doesn’t get the attention it deserves. Our attitudes toward consequences too often occupy the spectrum between “you can’t argue with the numbers” and Tom Lehrer’s parodic version of German-American rocket scientist Wernher von Braun: “‘Once the rockets go up, who cares where they come down? That’s not my department,’ says Wernher von Braun.” A consequences-be-damned attitude toward our data and models can only be ethically justifiable if data are objective in the strong sense described above. If data are only weakly objective, embedded with social considerations at their creation, then we cannot deny our responsibility for the consequences of our data and models. Our only option is to own it.

Owning our data choices centers our ethical responsibility as IR/IE professionals. If data are only weakly objective, we cannot fall back on well-defined rules that will always ensure our integrity. Things can go wrong despite our best intentions. We must acknowledge that we are obligated to respond for, to give an account of, our decisions. We can only do that if we are intentional about the ethics that we build into our data. One might be tempted to be omni-responsible by being omni-principled, demanding that our data consistently embody some core set of ethical values. But this is, in fact, the opposite of responsibility. To adhere to a principle slavishly is to deflect onto the principle the responsibility that ought to be ours. That, of course, inevitably fails: we chose the principles. Principles are most helpful as guides to decision-making established before a difficult situation arises, when we can calmly consider all of the nuances without the distraction of our immediate interests. That is the great virtue of the AIR Statement of Ethical Principles. We have decided ahead of time that privacy, confidentiality, accuracy, and transparency are almost always better than the alternatives. Principles help make decisions, but they are never a substitute for responsible decision-making.

Being intentional, then, means not that we always adhere to a definite set of principles but that we choose very deliberately how we will develop, select, and model data with the ethical questions of the intended application in mind. Ultimately our data and models support interventions that our institutions see as good into circumstances that our institutions see as problems. We can best bear our responsibility by tuning our data and models to intentionally further the ethical resolution of those problems. Characterizing students as at-risk based in part on ethnicity may recognize histories of unjust primary- and secondary-education segregation in an institution’s service area, making non-resident alien status a meaningful distinction from U.S.-based students. Creating a climate of inclusion may make international students’ ethnicity more important than their visa status. The ethical action is not to always choose one or the other, but to choose the race/ethnicity field for which we are prepared to be accountable.

The unethical action is to attempt to avoid the choice altogether. Doing so may be soothing to troubled souls, but avoiding responsibility is only a screen that hides the inherent ethical nature of our choices. We are unavoidably responsible for the consequences of our choices about data as IR/IE professionals. We should acknowledge that responsibility in everything we do.

Jeffrey Alan Johnson is Director of Institutional Research and Effectiveness at Wartburg College in Waverly, Iowa and a leading scholar in the area of information ethics in higher education administration. His recent book, "Toward Information Justice: Technology, Politics, and Policy for Data in Higher Education Administration" (New York: Springer, 2018), breaks new ground in the study of ethics in public organizations’ use of information technologies. He previously served as President of the Rocky Mountain Association for Institutional Research. Dr. Johnson holds a Ph.D. in political science from the University of Wisconsin-Madison.