Recently there has been a trend to expand and improve open data initiatives, for the simple reason that big-data problem-solving can help everyone. On paper, that prospect sounds quite nice, and, without a doubt, big data is insanely cool. There are myriad interesting trends and helpful conclusions that can be drawn from open datasets. Indeed, much of the work I did over the summer consisted of working with publicly available crime data. However, the actual logistics of open-data initiatives are cumbersome at best and a complete nightmare at worst.
Though there is little room in this article to properly rant about the problems associated with datasets being published in largely useless formats—of which PDFs are among the worst offenders—suffice it to say that acquiring publicly available data can be far more complicated than the phrase “publicly available” may lead you to believe. This is certainly a problem, but not the one on which this piece will focus. Rather, the topic of discussion here is how bias can insidiously creep into our datasets.
One of the big draws of big-data approaches to problem-solving is the belief that the conclusions reached are purely mechanical—that is, free of human bias. After all, if the data supports a conclusion, then that conclusion is likely to be correct, regardless of what the human analysts may believe. This is indeed the case sometimes; however, it can just as easily swing the other direction. That is to say, datasets that are contaminated by human bias can be and are constructed, all the while masquerading as if they were not.
While that last sentence may evoke images of a shadowy cabal of data-altering supervillains, the truth is far more mundane. The problem is the nature of dataset construction and, more specifically, the tracking of metadata as it relates to interoperability. Metadata is information about information—generally, how it was collected; interoperability, on the other hand, is the ease with which multiple datasets can be integrated. Even though it sounds like a horribly unexciting topic of discussion, metadata is absolutely necessary for any complex dataset, because various anomalies may be present in or absent from the data depending on the collection techniques used. Some data can only be properly understood alongside its contextual metadata, and while interoperability is important—more data being generally a good thing—it is sometimes at odds with good metadata-maintenance practices. When two datasets are merged, it may be necessary to drop information that does not fit the format of one of them. Similarly, one dataset may lack variables that are present in the other; in that case, values must be filled in or else left blank. And this ignores the massive hurdles associated with trying to convert data from one format to another—truly the stuff of any data scientist’s nightmares.
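To make the interoperability trade-off concrete, here is a minimal sketch using invented records: two cities publish incident data, but only one of them recorded weather conditions. Merging them under a shared schema forces exactly the choice described above—drop the extra field, or carry blanks whose meaning only the metadata can explain. The field names and values are hypothetical.

```python
# Two hypothetical sources with mismatched fields: city B never
# recorded weather, city A did.
city_a = [{"date": "2024-01-05", "type": "theft", "weather": "rain"}]
city_b = [{"date": "2024-01-07", "type": "assault"}]  # no weather field

# A shared schema: the union of all fields, blank where absent.
fields = ["date", "type", "weather"]

def merge(*sources):
    # Normalize every record to the shared schema; missing fields
    # become None rather than being silently dropped.
    return [{f: rec.get(f) for f in fields} for src in sources for rec in src]

combined = merge(city_a, city_b)
# combined[1]["weather"] is None -- an absence only the metadata can
# explain. Without it, "not recorded" and "nothing to report" blur together.
```

The design choice here—keeping the union of fields and leaving gaps—preserves more data than dropping the weather column, but it shifts the burden onto whoever analyzes the blanks later.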
To get an idea of why we need to consider the way data was collected and how it may be used, let us consider a case study. Suppose you are running a company that intends to diversify its workforce and, to make sure the process is fair, you decide to create an algorithm to sort through the applications. As a baseline, you train this algorithm on the résumés of the people the company has already hired, in the hope that it can learn the skills and abilities an applicant will need. In theory, this sounds like a great strategy. An algorithm won’t be prone to human bias or discrimination and should therefore give you a diverse selection of qualified applicants from the pool of applications. Job well done! Go take a well-deserved nap.
Wait, wait. It’s actually not quite that easy. Why? Because, unless your company already had a diverse group of employees, you may have inadvertently created an algorithm that does not avoid bias, but instead systematically—and unconsciously—perpetuates it.
How could this happen? Suppose the company is predominantly staffed by white males. When the algorithm is trained on the existing corpus of successful and unsuccessful hires, it will notice that successful hires have been largely white and male, and so, justified by the data, it does what it was built to do: It begins to reject applications from women and members of minority groups. This is far from an unrealistic hypothetical, and in general, companies that use hiring algorithms—and most large companies do—need to be particularly cautious with their datasets and parameters, lest they create and bolster algorithmically ordained bias.
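The mechanism can be sketched in a few lines. This is a toy model with invented data, not any real hiring system: the scoring rule simply rewards resemblance to past hires, which is enough to smuggle in a demographic feature the model has no way of knowing is irrelevant.

```python
from collections import Counter

# Hypothetical training data: past successful hires, mostly one group.
past_hires = [
    {"gender": "M", "skill": 7},
    {"gender": "M", "skill": 6},
    {"gender": "M", "skill": 8},
    {"gender": "F", "skill": 8},
]

# The "model" learns how common each demographic value is among past
# hires -- a spurious pattern it treats like any other signal.
gender_freq = Counter(h["gender"] for h in past_hires)

def score(applicant):
    # Skill counts, but so does resemblance to the existing workforce.
    return applicant["skill"] + 5 * gender_freq[applicant["gender"]] / len(past_hires)

alice = {"gender": "F", "skill": 9}   # more qualified applicant
bob = {"gender": "M", "skill": 7}     # resembles past hires
# Despite higher skill, alice scores below bob: score(alice) = 10.25,
# score(bob) = 10.75. The bias looks "justified by the data."
```

Nothing in the code mentions discrimination; the bias arrives entirely through the training set, which is exactly what makes it hard to spot.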
The issue lies in taking a partial dataset—the employees of a given company—and believing it to be representative of a larger population that it may not actually capture—the pool of all qualified applicants. In the hiring example it is possible to see the problem coming, but in practice it is often harder to foresee. Suppose a public transit system hires a group of data analysts to go through ridership numbers at certain stops and determine which subway routes are most successful and which should be given greater priority for renovations. The analysts will return an answer based on the information they have at hand. Now suppose the data were collected by the computers in the electronic turnstiles, which register a ride when someone pays with an electronic subway card. Can you see any ways in which this might produce biased results?
As it happens, this dataset is unlikely to be representative of all riders, and it may even result in discrimination. While the turnstiles seem to collect data in a straightforward and robust fashion, they miss anyone who paid for their ride in cash. When we consider that those who pay in cash may be unable to pay reliably by electronic means, this suddenly becomes a dataset that ignores riders in unstable financial situations. When we further consider that in most American cities, class, income and location can be indicators of race, the error becomes much more problematic. The worst part is that, although they are likely to be blamed, the data analysts did nothing wrong. They were simply unaware that the dataset wasn’t representative.
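The undercount is easy to see with numbers. This is a deterministic sketch with invented figures: two stops with identical true ridership, differing only in the share of riders who pay cash and therefore never appear in the turnstile records.

```python
# Hypothetical figures: both stops truly serve 1,000 rides, but the
# outlying stop has far more cash-paying riders the turnstiles miss.
true_rides = 1000
cash_share = {"downtown": 0.05, "outlying": 0.40}

# What the electronic turnstile data actually records at each stop.
recorded = {stop: round(true_rides * (1 - share))
            for stop, share in cash_share.items()}
# recorded == {"downtown": 950, "outlying": 600}: equal demand, yet an
# analysis of this data would rank the outlying stop as far less "successful."
```

Nothing in the recorded counts flags the gap; only metadata about how rides were registered would tell the analysts that 400 rides at the outlying stop simply never entered the dataset.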
The lack of robust metadata can lead to these admittedly extreme examples precisely because of the implicit assumption that data are unbiased. Getting unbiased data can be a significant challenge, for at every step of the process, design choices are made that imperceptibly alter the utility of a given dataset. One must decide which questions are being asked, what exactly is being collected, what tools are available for collection, how to deal with missing values, what to do with particularly anomalous results, whether to record any or all of the conditions surrounding the data collection, and myriad other choices.
All of this is to say that there is no such thing as “raw data.” The act of collecting data in the first place involves some design choices that are inevitably going to affect what results the data can show. Many data scientists speak of “cleaning” or “cooking” data, and that is because a majority of their time is spent trying to get data into a usable format before actually doing any analysis.
None of this is to say that big data is bad. Big data, and particularly open data, are a powerful set of tools that can reap groundbreaking and transformative rewards. As individuals who may produce, interact with or make decisions based on data, however, we must be aware of how the data were collected and what metadata are available. A healthy skepticism goes a long way, for a conclusion drawn from incomplete data can be wrong without anyone ever knowing it.
The Miscellany News is not responsible for the views presented within the Opinions section. The weekly staff editorial is the only article which reflects the opinions of the Editorial Board.