When Qualitative Evidence Is Too Thin

Why weak interviews, weak observation, and low analytic yield belong to the same research-design problem

Opening: the transcript is long, but the evidence is still thin

One of the most common misconceptions in qualitative research is that a large amount of material automatically counts as strong evidence. It does not. A study may have many interview pages, long fieldnotes, or hours of recorded interaction and still produce thin qualitative evidence. Thin evidence appears when interviews stay at the surface, observation captures too little of the relevant context, or the collected material does not generate enough analytic depth to support the claims being made. Sutton and Austin (2015) stress that qualitative research depends not simply on collecting words, but on collecting material that can support meaningful interpretation. Moser and Korstjens (2018) make a similar point when they treat sampling, data collection, and analysis as tightly linked rather than as isolated stages.

Weak interviews, weak observation, and low analytic yield belong together because they are not separate technical defects. They are three expressions of the same design failure: the study gathers qualitative material, but not enough of the right kind of material to answer the research question well. A shallow interview produces little interpretive depth. Weak observation misses the social, spatial, or behavioral texture that gives meaning to what people say. Low analytic yield means that even after coding and interpretation, the material does not support insight proportionate to the ambition of the question. Malterud, Siersma, and Guassora (2016) capture part of this logic with the idea of information power: what matters is not only how much data one has, but how much relevant and useful information the sample and the material actually contain for the study’s aim.

The shared failure logic across the cluster

The dominant design context here is mainly qualitative, although the lesson also matters for mixed methods work. The core logic is D > M > RQ: data (D) first, methodology (M) second, research question (RQ) third. Data come first because the most visible problem appears in the evidence itself: the interviews, the observations, the documents, the fieldnotes, or the interactional material are too thin to carry the interpretive load placed on them. Methodology comes second because weak evidence is often produced by a weak data-generation strategy, poor interview design, insufficient observation, limited access, rushed fieldwork, or underdeveloped sampling logic. Research question comes third because the question may still be worthwhile, but the study has not generated evidence capable of answering it.

This cluster is not mainly about statistics, sample size in the quantitative sense, or whether there are ten interviews versus twenty. Hennink, Kaiser, and Marconi (2017) distinguish between code saturation and meaning saturation, showing that researchers may quickly reach recurring topical codes while still lacking the deeper interpretive understanding needed for robust qualitative conclusions. That distinction is exactly why thin evidence can be present even when the dataset looks “large enough” on paper.

Three related mistakes, clearly distinguished

The first mistake is weak interviews. This happens when interviews generate brief, generic, overly descriptive, or socially polished responses that do not illuminate processes, meanings, dilemmas, contradictions, or context. The problem is not simply that participants speak little. It is that the interview design, questioning style, rapport, or conceptual focus fails to invite material that can sustain interpretation. Sutton and Austin (2015) note that qualitative data collection requires skill in eliciting experiences, not merely asking a list of questions.

The second mistake is weak observation. Observation is weak when the researcher sees too little, stays too far from the relevant setting, records only obvious events, or fails to capture interaction, routine, silence, timing, spatial arrangements, and embodied behavior. In many studies, observation is added as a token gesture rather than designed as a serious source of evidence. Moser and Korstjens (2018) emphasize that data collection in qualitative research must be appropriate to the nature of the phenomenon, which means that when social practice, setting, or interaction matters, observation cannot be reduced to a few casual visits.

The third mistake is low analytic yield. This occurs when the material, even after coding and review, produces only obvious themes, descriptive repetition, or weak conceptual leverage. A researcher may end up with many pages of text but little insight into mechanism, meaning, pattern, or contrast. Low analytic yield is often the downstream effect of the first two mistakes, but it can also be worsened by weak sampling and poorly focused questioning. Hennink et al. (2017) help clarify why repetition of surface topics does not guarantee deeper interpretive sufficiency.

Where the cluster breaks the RQ–RH–D–M chain

At the RQ level, the study may begin with a reasonable qualitative question: how do people experience a policy, interpret a health condition, navigate a professional role, or use a culturally meaningful landscape? The problem is not necessarily that the question is wrong. The problem is that the design often underestimates how much depth, specificity, and contextual contact are needed to answer such a question responsibly. A broad interpretive question cannot be answered with thin talk and minimal exposure to the setting.

At the M level, the failure becomes more concrete. Interviews may be too structured, too short, or too generic. Observation may be intermittent, peripheral, or poorly documented. Sampling may be driven by convenience rather than relevance, which reduces information power. Malterud et al. (2016) argue that sample adequacy in qualitative research depends on aim, sample specificity, quality of dialogue, use of theory, and analytic strategy. That means thin evidence is not simply about “not enough participants.” It is about insufficiently informative material relative to the question.

At the D level, the weakness becomes fully visible. The researcher now possesses data, but the data are too shallow, too generic, too repetitive at the surface, or too disconnected from context to sustain a strong answer. This is why D is ranked first in this cluster. The study’s most immediate problem is not that the question is uninteresting or the methodology label is wrong, but that the evidence itself is too thin for the inferential work the study wants it to do. Sutton and Austin (2015) explicitly frame qualitative data collection, analysis, and management as interdependent; thin data at the collection stage become thin findings later.

There is no central research hypothesis (RH) issue here, because this cluster belongs mainly to qualitative work, where formal hypotheses often do not drive the design. The key chain is therefore RQ → M → D, with the most visible failure appearing in D.

How this cluster harms findings and conclusions

Thin qualitative evidence distorts studies in a subtle but serious way. It often produces findings that sound plausible but remain under-supported. The themes may look neat, but they are generic. The quotations may be readable, but they do not carry much interpretive weight. The conclusions may sound insightful, but they are based on material that has not reached enough depth to justify strong claims about meaning, process, or lived experience. Hennink et al. (2017) show why this happens: researchers may reach topical repetition relatively early while still lacking enough depth for interpretive confidence.

In Sociology, a study may ask how precarious workers navigate insecurity, but use brief interviews that capture only standard complaints and general attitudes. The result may identify familiar themes such as uncertainty, stress, and instability without revealing how workers actually interpret trade-offs, cope with risk, or negotiate identity.

In Health/Wellbeing research, a study may ask how patients live with a chronic condition, but rely on short interviews focused mainly on symptoms and service satisfaction. That material may describe burden but not fully illuminate adaptation, meaning, time, and relational context.

In Archaeoastronomy, a project may ask how a community relates ritually to sky-oriented places, but observation may be too limited and interviews too general to connect spatial orientation, practice, memory, and interpretation. Across these cases, the findings risk remaining descriptive fragments rather than well-supported qualitative explanations.

How to avoid the cluster before collecting data

The strongest prevention is to design for depth, not just for access. Before data collection starts, the researcher should ask what kind of material would make a credible answer possible. If the question concerns meaning, contradiction, coping, ritual, or process, then the interview guide must invite narration, example, tension, sequence, and reflection, not just opinion statements. Sutton and Austin (2015) underline that qualitative interviewing requires attention to how questions are asked and how rich responses are elicited.

A second preventive step is to treat observation as a serious evidence source when the phenomenon is practical, spatial, embodied, or interactional. Weak observation often comes from underestimating how much context matters. Moser and Korstjens (2018) emphasize that data collection strategies should match the phenomenon under investigation. If practice matters, the design must create conditions for seeing practice.

A third preventive step is to think in terms of information power rather than mechanical sample size. Malterud et al. (2016) argue that richer, more specific, and more relevant data can justify smaller samples, while weaker dialogue and broader aims demand more material. This corrects the rule of thumb that qualitative rigor is mainly about hitting a certain number of interviews.
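One way to make this appraisal explicit at the planning stage is to write it down dimension by dimension. The Python sketch below is only an illustration, not a procedure proposed by Malterud et al. (2016): the Appraisal record, the function name, and the example study are all hypothetical, and the favourable/unfavourable judgement on each dimension remains a qualitative call by the researcher, not a computed score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Appraisal:
    dimension: str    # one of the five dimensions listed by Malterud et al. (2016)
    favourable: bool  # True if the researcher judges it to raise information power
    note: str         # the reasoning behind the judgement, in plain words


def dimensions_needing_more_material(appraisals: List[Appraisal]) -> List[str]:
    """Return the dimensions the researcher judged to lower information power."""
    return [a.dimension for a in appraisals if not a.favourable]


# Hypothetical appraisal of an imagined interview study (illustrative only).
appraisals = [
    Appraisal("aim", False, "broad question about lived experience"),
    Appraisal("sample specificity", True, "participants chosen for direct experience"),
    Appraisal("use of theory", True, "an established framework informs the guide"),
    Appraisal("quality of dialogue", False, "short, tightly structured interviews"),
    Appraisal("analytic strategy", True, "in-depth analysis of a small number of cases"),
]

weak_points = dimensions_needing_more_material(appraisals)
print(weak_points)
```

The list it returns is a prompt for redesign, not a verdict: each flagged dimension points to a place where the study may need richer dialogue, a narrower aim, or more material.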

A fourth preventive step is to monitor not just whether new interviews add new codes, but whether they add new meaning. Hennink et al. (2017) show that code saturation and meaning saturation are not the same thing. That distinction can help researchers decide whether they are actually building depth or merely collecting more of the same surface material.
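The difference between the two kinds of monitoring can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used by Hennink et al. (2017): the function names, the example codes, the idea of logging "dimensions" per code, and the threshold of two dimensions are all hypothetical choices made for the example.

```python
from typing import Dict, List, Set


def new_codes_per_interview(coded: List[Set[str]]) -> List[int]:
    """Count the codes that appear for the first time in each interview, in order."""
    seen: Set[str] = set()
    counts: List[int] = []
    for codes in coded:
        fresh = codes - seen       # codes not met in any earlier interview
        counts.append(len(fresh))
        seen |= fresh
    return counts


def codes_lacking_depth(dimensions: Dict[str, Set[str]], min_dimensions: int) -> List[str]:
    """Return codes with fewer documented dimensions than min_dimensions."""
    return sorted(c for c, d in dimensions.items() if len(d) < min_dimensions)


# Hypothetical codes assigned to four interviews, in fieldwork order.
coded_interviews = [
    {"uncertainty", "stress"},
    {"stress", "instability"},
    {"uncertainty"},
    {"stress", "instability"},
]

# Hypothetical analyst log of the distinct dimensions documented for each code.
documented = {
    "uncertainty": {"income", "scheduling", "identity"},
    "stress": {"general"},
    "instability": {"housing"},
}

code_curve = new_codes_per_interview(coded_interviews)
shallow = codes_lacking_depth(documented, min_dimensions=2)
print(code_curve)  # [2, 1, 0, 0]
print(shallow)     # ['instability', 'stress']
```

In this invented example no new codes appear after the second interview, which looks like saturation, yet two of the three codes are still understood along a single dimension. That is the situation in which more interviews of the same kind would add volume without adding meaning.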

What can still be repaired after data collection

After data collection, some repair is possible, but only within limits. If interviews are somewhat thin but not empty, the researcher may be able to narrow the claim, reduce interpretive ambition, and present the findings as more descriptive than explanatory. If observation is weaker than planned, it may still help with contextual framing even if it cannot support stronger claims about practice or interaction. In some cases, a follow-up round of interviews or additional observation may genuinely strengthen the study, but only if the project is still open and the researcher is transparent about the revision.

What usually cannot be repaired is a deep lack of evidentiary richness once the fieldwork has closed. A short, generic, low-trust interview cannot later be turned into a rich account through coding alone. Minimal observation cannot be retroactively expanded into contextual immersion. Thin data can sometimes support a narrower paper, but they rarely support the original ambitious question. That is why prevention matters more than rescue in this part of the series.

Short takeaway checklist

Before collecting data, ask:

  • Will my interviews generate stories, tensions, examples, and processes, or just opinions?
  • If practice and setting matter, have I designed serious observation rather than symbolic observation?
  • Am I looking for code repetition only, or for enough depth to support interpretation?
  • Does my sample have enough information power for the aim of the study?
  • If my fieldwork ended tomorrow, would my material support description only, or real qualitative insight?

Good qualitative research is not defined by how many pages of transcript it produces. It is defined by whether the evidence is rich enough, focused enough, and contextual enough to answer the question well.

References

Hennink, M. M., Kaiser, B. N., & Marconi, V. C. (2017). Code saturation versus meaning saturation: How many interviews are enough? Qualitative Health Research, 27(4), 591–608. https://doi.org/10.1177/1049732316665344

Malterud, K., Siersma, V. D., & Guassora, A. D. (2016). Sample size in qualitative interview studies: Guided by information power. Qualitative Health Research, 26(13), 1753–1760. https://doi.org/10.1177/1049732315617444

Moser, A., & Korstjens, I. (2018). Series: Practical guidance to qualitative research. Part 3: Sampling, data collection and analysis. European Journal of General Practice, 24(1), 9–18. https://doi.org/10.1080/13814788.2017.1375091

Sutton, J., & Austin, Z. (2015). Qualitative research: Data collection, analysis, and management. The Canadian Journal of Hospital Pharmacy, 68(3), 226–231. https://doi.org/10.4212/cjhp.v68i3.1456