Today, everyone is talking about using data to inform or even drive decisions, and we are collecting more data than ever. Amidst the insatiable appetite for data, we start to hear some researchers and decision-makers say "bad data is better than no data." I disagree. I believe bad data is the root cause of ill-advised decisions and underperforming programs and policies. In a time when more problems are brought to our attention every day and competition for resources is becoming ever more fierce, we simply cannot afford another bad decision.
So how do we create data that is actually informative? To answer that question, in the remainder of this article, I use survey research as a case study to discuss three common blind spots in sampling, survey design, and data analysis. I then offer five tips on how you can effectively and quickly improve your data practices.
Three Common Blind Spots of Survey Research:
1. Overlooking Data Bias Rooted in Self-Selection
The last time you or your team sent out a survey, what was the response rate?
When the response rate is low, self-selection creates data bias. Just think about this: of all who received the survey, who has the strongest incentive to actually fill it out? The answer is those falling into the extremes, either those who are completely blown away by how amazing the services are or those who are completely unsatisfied. Of course, some who fall in between will also provide feedback, but it's less likely. So in the end, the "best" and "worst" feedback is overrepresented, making a program seem either much better or worse than it actually is.
Is that a problem?
It depends on two things: the purpose of your survey and whom you intend to collect information on.
If your goal is to collect testimonials or troubleshoot the services, the data gives you a pretty solid starting point. However, if your goal is to assess the impact of your services and if the cost of filling out the survey could prevent some subgroups from participating, the data is inherently misleading. When assessing impact, we ideally want to know how our services have improved the situation of marginalized groups. Those with the least amount of time, energy, access to technology, and survey literacy tend to be the least likely to provide feedback because they simply don't have the resources to do so.
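The self-selection dynamic described above can be sketched in a quick simulation. Everything here is made up for illustration: 1,000 hypothetical participants, an assumed U-shaped response pattern in which the delighted and the frustrated are far more likely to respond than everyone in between.

```python
import random

random.seed(0)

# Made-up population: 1,000 participants each hold a "true" satisfaction
# score from 1 (completely unsatisfied) to 10 (blown away).
true_scores = [random.randint(1, 10) for _ in range(1000)]

def response_prob(score):
    # Assumed self-selection: people at the extremes are far more
    # likely to fill out the survey than everyone in between.
    return 0.6 if score <= 2 or score >= 9 else 0.1

# Each person responds (or not) according to their response probability.
responses = [s for s in true_scores if random.random() < response_prob(s)]

def extreme_share(scores):
    # Share of scores that fall at the extremes (1-2 or 9-10).
    return sum(1 for s in scores if s <= 2 or s >= 9) / len(scores)

print(f"Response rate: {len(responses) / len(true_scores):.0%}")
print(f"Extreme scores in the full population: {extreme_share(true_scores):.0%}")
print(f"Extreme scores among actual responses: {extreme_share(responses):.0%}")
```

Even with fabricated numbers, the pattern is the point: the responses contain a much larger share of extreme scores than the population they came from, so any average or headline computed from them is skewed.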
2. Overlooking the Internal Variation Behind Data
Every survey designer can testify to this: it takes a tremendous amount of preparation to create a good survey. And a big chunk of that time and effort goes into operationalizing the concepts one tries to measure. For those unfamiliar with the term, operationalization means you take an abstract concept and turn it into something you can measure.
And that's when it gets tricky.
Abstract concepts such as empowerment and well-being mean different things to different people or even different things to the same people in different situations. When you ask someone, "To what extent did this experience make you feel more empowered," it's unreasonable to assume all survey respondents understand "empowered" in the same way. It's also unreasonable to assume two respondents both giving an 8 out of 10 experienced the same level of change, but that's beyond the scope of the discussion here.
If there is no consensus about what a survey question is asking, how can you use the data to make any conclusions about anyone or anything? (To read more about this topic, check out another article I wrote: What Being Bilingual Has Taught Me about Designing Surveys.)
3. Inferring Causality from Correlation, Unconsciously
When it comes to writing and reading data reports, remember one thing:
Correlation + Time = Causality
In a nutshell, A can explain B only if there is a strong correlation between them and A happens before B. (Correlation plus temporal order are necessary conditions for causality, not a guarantee of it.)
As common sense as it sounds, it is very tempting to make conclusions about causality when all you really have is a correlation. Sometimes, even researchers with the greatest integrity do it without knowing it. The problem lies in how correlation relationships are presented.
Take this statement as an example: "Did you know employees who participate in an onboarding program are X% more likely to work in the same company after 5 years?"
The evidence behind the statement is this: there is a strong correlation between the number of years one works at a company and participation in an onboarding program. Written that way, no inference is made. However, in reality, we often unconsciously pick the variable we think is the cause and put it before what we think is the result. By doing so, we artificially add the element of time to a correlation. That is why, even though the following two sentences are technically describing the same correlation, they read very differently:
Employees who participate in an onboarding program are X% more likely to work in the same company after 5 years.
Employees who work in the same company after 5 years are X% more likely to participate in an onboarding program.
I mean, does the second sentence even make sense? Why would someone who has already worked in a company for so long participate in an onboarding program? And that is exactly my point: correlation statements imply causality when they're written in certain ways.
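The symmetry is easy to see if you compute both directions from the same data. The counts below are entirely made up; the point is that one table of numbers supports both sentences equally well, because correlation has no built-in direction.

```python
# Hypothetical counts from an imagined HR dataset (all numbers invented):
#                   retained 5+ yrs   left earlier
# onboarding              300              200
# no onboarding           150              350
onboard_retained, onboard_left = 300, 200
no_onboard_retained, no_onboard_left = 150, 350

total = onboard_retained + onboard_left + no_onboard_retained + no_onboard_left

# Reading 1: "Employees who participate in onboarding are more likely
# to still be there after 5 years."
p_retained_given_onboard = onboard_retained / (onboard_retained + onboard_left)
p_retained_overall = (onboard_retained + no_onboard_retained) / total

# Reading 2: "Employees still there after 5 years are more likely
# to have participated in onboarding."
p_onboard_given_retained = onboard_retained / (onboard_retained + no_onboard_retained)
p_onboard_overall = (onboard_retained + onboard_left) / total

print(f"Retained if onboarded: {p_retained_given_onboard:.0%} "
      f"(vs {p_retained_overall:.0%} overall)")
print(f"Onboarded if retained: {p_onboard_given_retained:.0%} "
      f"(vs {p_onboard_overall:.0%} overall)")
```

Both conditional rates are elevated relative to their baselines, and both come from the exact same four numbers. The causal story lives entirely in which sentence you choose to write, not in the data.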
What Can You Do to Quickly Address These Blind Spots?
1. Collaborate with Stakeholders.
It's imperative that researchers work with decision-makers to understand how data will be used. What insights are they interested in? Is there a particular group that they want to know more about? In addition, work with the communities being served. Ask community members to help you identify marginalized groups that are not yet on your sampling radar. Make sure you dedicate sufficient outreach effort to reaching these communities. Also, try to invite community members to be part of the survey design effort. They can help you create better survey questions that make more sense to them.
2. Provide Compensation to Increase Research Participation.
To create incentives for marginalized communities to participate in your research design and data collection, compensate them for their time or acknowledge their contribution in a way that makes sense to them. Underprivileged populations might not be able to afford to participate in your research, even if they want to. If your organization is interested in rethinking the power dynamics between researchers and research participants, you can refer to the guidebook Why Am I Always Being Researched? published by Chicago Beyond.
3. Build Data Literacy Across the Board.
I'm not talking about asking everyone in your organization to take a statistics class. I'm talking about making the potential bias in analyzing, reporting, and using data more explicit. You are likely to keep running into correlation statements that imply causality, but you now know to always ask whether a timeline is built into the data or was added unconsciously by data interpreters.
4. Community Advisory Boards Are Valuable, But...
Having served on a few community advisory boards myself, I know they are valuable: they give decision-makers reliable, ongoing access to a few community members who know the community well. These advisors serve as guides into a community. However, be aware of self-selection issues here too. Volunteering one's time and expertise is a privilege, and folks who have that privilege might not fully understand the experiences of the most marginalized. That being said, a community advisory board is still valuable, especially if you have limited knowledge of a community.
5. Utilize Mixed Methods.
There is a reason why mixed-method researchers are so sought after these days. A well-trained mixed-method researcher who understands the pros and cons of various research methods can help you create tailored data practices with maximal effectiveness and minimal bias. If you have been relying predominantly on quantitative methods, it's a good idea to bring in a qualitative researcher. You will start to see your data from a brand new perspective.