When Not to Trust a Number

November 21, 2017 1:36 pm PST | Open Data

The amount of data generated each day around the world has reached staggering proportions. While estimates vary, it’s generally accepted that more data has been collected in the last several years than in all of human history before them.

An increase in our capacity to collect and analyze big data can have enormously positive effects. It can lead to the discovery of new treatments for diseases or enhance predictions of weather patterns to mitigate droughts and other natural disasters. But the availability of more data doesn’t automatically translate into better outcomes.

First, access to large datasets is limited to a relatively small number of actors, whose interests may or may not align with those of the public at large. Second, human interpretation of the data only gets more complicated with larger and more diverse datasets. It has always been true that data can be used to tell stories in a variety of ways. But with more data available, the number of stories that can be derived from it keeps growing.

As information consumers, we need to be more vigilant than ever in questioning sources, correlations, and the assumptions behind the arguments that the data is being used to support. The following are a few basic ways to scrutinize a number and decide whether the data you’re reading is trustworthy.

Scrutinize Outliers

Most datasets contain outliers — values that are statistically distant from the other points in a dataset. Sometimes these outliers are valid and can be markers for an emerging trend.

In many instances, however, they’re the product of a poor sampling methodology or mistakes in data collection. Researchers, social scientists, data journalists, and others who present this information should carefully consider whether to include or exclude outliers. If these data points are the result of imprecise measurement or faulty collection, they should be excluded from analysis.
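
One common way to surface candidate outliers for review is the interquartile range (IQR) rule. The sketch below is only an illustration, not a method the article prescribes: the function name `flag_outliers`, the multiplier `k=1.5`, and the sample readings are all assumptions made for the example. It flags values rather than deleting them, since deciding whether a flagged point is an error or a real signal still takes human judgment.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements; 98 is far from the rest of the data.
readings = [12, 14, 13, 15, 14, 13, 98, 12, 15, 14]
print(flag_outliers(readings))  # [98]
```

Whether 98 is a typo, a faulty sensor, or the first sign of a trend is a question the code cannot answer. It only tells us where to look.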

Unfortunately, making this determination isn’t always clear-cut. And there seems to be an increasing tendency among the media and others to exploit this ambiguity to craft stories that are more interesting or newsworthy. As readers, we should always start from a point of healthy skepticism, especially when it seems like outliers in the data are being used to sell a point of view or a predetermined conclusion.

Question Overuse of Qualitative Analysis

There is a balance required between quantitative and qualitative analysis. Numbers on their own, without sufficient context about where they came from and how they relate to one another, offer limited value.

At the same time, readers should take notice when the data being presented is crowded out by qualitative descriptions, or when an author’s interest in telling a story seemingly leads to a de-emphasis of the numbers.

Here’s a quick example to illustrate this point. A box contains 50 pens, 30 of which are red and 20 of which are blue. The most straightforward way to communicate this information is to show it in a table with the appropriate data labels. It would be less clear, and potentially misleading, to only offer the qualitative assessment that there are more red pens than blue ones, when the precise quantities of each type are known.
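
As a rough sketch of what that table could look like, the snippet below prints the pen counts with labels and each color's share of the box. The counts come from the example above; the function name `pen_table` and the column layout are assumptions made for illustration.

```python
def pen_table(counts):
    """Build a plain-text table of counts and shares with a total row."""
    total = sum(counts.values())
    lines = [f"{'Color':<8}{'Count':>6}{'Share':>8}"]
    for color, n in counts.items():
        lines.append(f"{color:<8}{n:>6}{n / total:>8.0%}")
    lines.append(f"{'Total':<8}{total:>6}{1.0:>8.0%}")
    return "\n".join(lines)


pens = {"Red": 30, "Blue": 20}  # counts from the example in the text
print(pen_table(pens))
```

A reader of this table can see both the direction of the difference and its size, which the phrase "more red pens than blue ones" leaves out.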

Be Wary of Too Much Precision

When it comes to data analysis, it might seem counterintuitive to call precision into question. But it’s valid to be skeptical about calculations that don’t match the unit they’re describing.

For example, if a report describes how many people gave a particular answer to a survey question, and the numbers in the results include multiple decimal places, we as readers should automatically wonder if those conducting the survey have any proficiency in data collection at all. In this example, since a person is a discrete unit of one, there’s no reason to calculate results out to multiple decimal places.
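
Here is a minimal sketch of reporting that matches the unit. It assumes a hypothetical survey in which 12 of 37 respondents gave a particular answer; the function name `report_share` and the choice to round the percentage to whole numbers by default are assumptions for illustration, not a rule from any standard.

```python
def report_share(count, respondents, decimals=0):
    """Report a headcount as a whole number plus a rounded percentage."""
    if not isinstance(count, int) or not isinstance(respondents, int):
        # People are counted in whole units, so a fractional count is an error.
        raise ValueError("counts of people must be whole numbers")
    share = 100 * count / respondents
    return f"{count} of {respondents} respondents ({share:.{decimals}f}%)"


print(100 * 12 / 37)         # raw result carries many meaningless digits
print(report_share(12, 37))  # 12 of 37 respondents (32%)
```

With only 37 respondents, each person moves the result by almost three percentage points, so the extra digits in the raw result describe nothing that was actually measured.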

This kind of reporting error happens all the time. And it’s often overlooked because we’re too busy to notice, or because we assume that fastidious attention to detail is a sign of domain authority. In fact, it could be the opposite.

These are just a few of the many instances when it’s justified to stop and ask questions about the statistics that we find scattered across research reports, news articles, and government websites. As more and more information is collected and reported, our role as the gatekeepers of data integrity becomes even more important.

Sharpen your data analysis skills with a free online course through the Socrata Data Academy.

