Data extraction for systematic reviews

From PDF to structured evidence table, with full provenance.

The problem

Data extraction is the slow, expensive center of a systematic review. A reviewer reads a paper, locates the relevant numbers and design details, and transcribes them into a spreadsheet. This takes 45 to 90 minutes per paper (Buscemi et al., 2006), and best practice requires a second reviewer to do the same work independently, followed by someone to reconcile their differences.

The process is manual, and manual means error-prone. Methodological reviews have found extraction error rates between 8% and 63%, with the highest rates for the values behind standardized mean differences: means, standard deviations, and confidence intervals (Mathes et al., 2017; Gotzsche et al., 2007). The most dangerous error is also the most common: entering a standard error where a standard deviation belongs. The numbers look plausible. The meta-analysis runs. The pooled estimate is wrong.
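Why the standard error mix-up is so damaging comes down to one formula: for a sample mean, SE = SD / sqrt(n), so an SE entered as an SD is too small by a factor of sqrt(n). The sketch below shows the conversion and one possible screening heuristic. The heuristic, the function names, and the example numbers are all assumptions made for illustration, not a validated method: a value is flagged when multiplying it by sqrt(n) brings it closer to the median SD of the other studies.

```python
import math
from statistics import median


def se_to_sd(se: float, n: int) -> float:
    """Convert a standard error of the mean to a standard deviation."""
    return se * math.sqrt(n)


def flag_possible_se(studies):
    """Return ids of studies whose reported 'sd' looks more like an SE.

    Heuristic (an assumption of this sketch): compare each value with the
    median SD across studies on a log scale, once as reported and once
    after treating it as an SE. Flag it if the SE reading fits better.
    """
    typical = median(s["sd"] for s in studies)
    flagged = []
    for s in studies:
        distance_as_sd = abs(math.log(s["sd"] / typical))
        distance_as_se = abs(math.log(se_to_sd(s["sd"], s["n"]) / typical))
        if distance_as_se < distance_as_sd:
            flagged.append(s["id"])
    return flagged


# Hypothetical extraction table for one outcome measured on the same scale.
studies = [
    {"id": "A", "n": 100, "sd": 10.2},
    {"id": "B", "n": 64, "sd": 9.5},
    {"id": "C", "n": 81, "sd": 1.1},  # 1.1 * sqrt(81) = 9.9, in line with the rest
    {"id": "D", "n": 49, "sd": 11.0},
]

print(flag_possible_se(studies))
```

A check like this only works when the studies report the outcome on a comparable scale, and a flag is a prompt to reread the paper, not a correction.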

45–90 min: per paper, per reviewer
8–63%: error rate in manual extraction
83%: of teams still use spreadsheets

A recent survey found 83% of systematic reviewers extract into spreadsheets (Buechter et al., 2023). But the problem is not the spreadsheet. Whether you use Excel, Covidence, or a custom REDCap form, the extracted number has no memory of where it came from. No link from a cell to the sentence in the PDF. No record of who changed what, or why. When a reviewer catches a discrepancy six months later, there is nothing to trace. The final dataset is trusted because the people who built it are trusted. The data itself carries no proof of how it was made.

The shift

AI can now read a PDF and extract structured data from it. But researchers are right to be cautious. Awareness and adoption of automation tools remain low (Scott et al., 2021), and the early tools gave them good reason: hallucinated values, no source links, no way to check the work.

The emerging standards reflect what researchers already knew was needed. PRISMA-trAIce requires that every AI-extracted data point be grounded in source text. The RAISE framework, endorsed by Cochrane and Campbell, demands immutable audit trails and human accountability at every decision point.

These are not new ideas. They are the same principles that made double extraction the gold standard, now formalized for a world where one of the extractors is a machine.

You shouldn't think about the AI. You should think about the codebook.

The question is not whether AI is accurate enough. The question is whether you can verify its work and improve its instructions. That shifts the craft from data entry to instrument design.

The craft

The codebook is the instrument. It encodes your inclusion criteria, your operational definitions, and your edge-case decisions in instructions clear enough that the remaining task is reading comprehension, not judgment.

This is established methodology. MacQueen et al. (1998) and DeCuir-Gunby et al. (2011) describe the iterative loop that makes codebooks rigorous: draft the codebook, pilot it with multiple coders, assess inter-rater agreement, resolve disagreements, refine the definitions, and repeat.
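The "assess inter-rater agreement" step in that loop is often quantified with Cohen's kappa, which corrects raw agreement for the agreement two coders would reach by chance. The cited sources do not prescribe a specific statistic, so treat kappa here as one common choice. The sketch below is a minimal implementation for two coders and one categorical codebook field; the function name and the example codes are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items in order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty item list.")
    n = len(coder_a)

    # Observed agreement: share of items with identical codes.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: product of each coder's marginal code frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)

    if expected == 1.0:
        # Both coders used one identical code throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pilot round: six papers coded for the "study design" field.
coder_a = ["RCT", "RCT", "cohort", "RCT", "cohort", "RCT"]
coder_b = ["RCT", "cohort", "cohort", "RCT", "cohort", "RCT"]

print(round(cohens_kappa(coder_a, coder_b), 3))
```

In this example the coders agree on five of six papers, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate. A low kappa on a field is a signal to revisit that field's definition in the next round.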

The codebook is never right on the first cut. It improves through contact with real documents and real disagreements. Each round of refinement encodes another edge case, closes another ambiguity, and makes the next extraction more precise.

This iterative loop is the core of rigorous extraction. The tool should serve it, not replace it.

How datamint.ing supports this

The platform implements the codebook-centered workflow, with AI doing the reading and the researcher doing the thinking.

Multiple independent readers

Each document is extracted by independent AI readers using your codebook. You compare their answers the same way you would compare two human coders.

Evidence on every cell

Click any value to see the key quotes, reasoning, and assumptions that produced it. Every answer links back to the source text.
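To make the idea of a value that carries its own evidence concrete, here is a minimal sketch of a record that stores a quote and reasoning alongside each extracted value, plus a check that the quote really appears in the source text. The class, its fields, and the matching rule are assumptions for illustration and do not describe datamint.ing's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedCell:
    """One extracted value together with the evidence behind it (hypothetical)."""
    field: str
    value: str
    quote: str      # verbatim passage the value was read from
    reasoning: str  # why the reader mapped the quote to this value


def _normalize(text: str) -> str:
    """Collapse whitespace and case so PDF line breaks do not break matching."""
    return " ".join(text.split()).lower()


def quote_is_grounded(cell: ExtractedCell, source_text: str) -> bool:
    """True if the cell's quote occurs in the source text after normalization."""
    quote = _normalize(cell.quote)
    return bool(quote) and quote in _normalize(source_text)


# Hypothetical source passage and two extracted cells.
source_text = """A total of 120 participants were randomized
to intervention (n = 60) or control (n = 60)."""

grounded = ExtractedCell(
    field="sample_size",
    value="120",
    quote="A total of 120 participants were randomized",
    reasoning="Total randomized sample stated in the methods.",
)
ungrounded = ExtractedCell(
    field="sample_size",
    value="150",
    quote="150 participants completed the trial",
    reasoning="Quote does not appear in the paper.",
)

print(quote_is_grounded(grounded, source_text))
print(quote_is_grounded(ungrounded, source_text))
```

Exact substring matching is deliberately strict: it rejects paraphrased quotes, which is the point when the goal is a verifiable link back to the source.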

Disagreement detection

When readers disagree, the system pinpoints the ambiguity in your codebook and asks what you intended. Disagreements drive codebook refinement.
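One simple way to see where a codebook is ambiguous is to tally, per field, how often two readers gave different answers for the same document. The sketch below does that with plain dictionaries; the data layout, the function name, and the example values are assumptions for illustration, not the platform's implementation.

```python
from collections import Counter


def disagreement_rates(reader_a, reader_b):
    """Rank codebook fields by how often two readers disagree.

    Each reader is a dict of {doc_id: {field: value}}. Only documents and
    fields present for both readers are compared. Returns a list of
    (field, rate) pairs, highest disagreement first.
    """
    disagreements = Counter()
    totals = Counter()
    for doc_id in reader_a.keys() & reader_b.keys():
        answers_a, answers_b = reader_a[doc_id], reader_b[doc_id]
        for field in answers_a.keys() & answers_b.keys():
            totals[field] += 1
            if answers_a[field] != answers_b[field]:
                disagreements[field] += 1
    rates = {field: disagreements[field] / totals[field] for field in totals}
    # Sort by rate descending, then by field name for a stable order.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical extractions of two papers by two independent readers.
reader_a = {
    "paper_1": {"n": 120, "design": "RCT", "sd": 10.2},
    "paper_2": {"n": 64, "design": "cohort", "sd": 9.5},
}
reader_b = {
    "paper_1": {"n": 120, "design": "RCT", "sd": 1.1},
    "paper_2": {"n": 64, "design": "case-control", "sd": 0.9},
}

for field, rate in disagreement_rates(reader_a, reader_b):
    print(f"{field}: {rate:.0%}")
```

Here the "sd" field disagrees on every paper, which would suggest that its definition needs to say explicitly whether to record a standard deviation or a standard error.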

Iterate in minutes

Refine your codebook, re-extract, and inspect again. Each round takes minutes, not weeks. The codebook evolves with your understanding.

Define a codebook. Mint your data. Inspect every cell.

References

  • Buscemi, N., Hartling, L., Vandermeer, B., Tjosvold, L., & Klassen, T. P. (2006). Single data extraction generated more errors than double data extraction in systematic reviews. Journal of Clinical Epidemiology, 59(7), 697–703.
  • Buechter, R. B., et al. (2023). Systematic reviewers used various approaches to data extraction and expressed several research needs: A survey. Journal of Clinical Epidemiology, 159, 214–224.
  • Gotzsche, P. C., Hrobjartsson, A., Maric, K., & Tendal, B. (2007). Data extraction errors in meta-analyses that use standardized mean differences. JAMA, 298(4), 430–437.
  • Mathes, T., Klassen, P., & Pieper, D. (2017). Frequency of data extraction errors and methods to increase data extraction quality: A methodological review. BMC Medical Research Methodology, 17(1), 152.
  • Scott, A. M., Forbes, C., Clark, J., Carter, M., Glasziou, P., & Munn, Z. (2021). Systematic review automation tools improve efficiency but lack of knowledge impedes their adoption: A survey. Journal of Clinical Epidemiology, 138, 80–94.