Garbage In, Garbage Out: Data Quality in the Age of AI
LLMs do not understand your data. They pattern-match on whatever representation you give them. If that representation is messy, with inconsistent labels, missing values, duplicates, or formatting errors, the model will confidently generate plausible output based on artifacts in the data rather than real signal.
That's how these systems work.
The fundamental issue is that LLMs optimize for plausibility, not truth. If the inputs are garbage, the outputs will still look coherent. They will simply be coherent in garbage-land.
What clean data actually means
The usual data quality rules still apply. Standardize categorical values so “NY” and “New York” are not treated as separate entities. Validate data types so timestamps are not in the future and ages are not negative. Remove duplicates, especially in text-heavy datasets where repetition can artificially amplify certain patterns.
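A minimal sketch of those checks with pandas. The column names, the state mapping, and the sample values are illustrative assumptions, not a prescription for your schema.

```python
import pandas as pd

# Illustrative raw data: column names and values are assumptions for this sketch.
df = pd.DataFrame({
    "state": ["NY", "New York", "CA", "CA"],
    "age": [34, -2, 51, 51],
    "signup": ["2024-01-05", "2024-01-06", "2030-01-01", "2030-01-01"],
})

# Standardize categorical values so variants map to one canonical label.
state_map = {"New York": "NY", "NY": "NY", "CA": "CA"}
df["state"] = df["state"].map(state_map)

# Validate types and ranges: no negative ages, no timestamps in the future.
df["signup"] = pd.to_datetime(df["signup"])
df = df[(df["age"] >= 0) & (df["signup"] <= pd.Timestamp.now())]

# Remove exact duplicates so repeated rows do not amplify spurious patterns.
df = df.drop_duplicates()
```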
What changes with LLMs is the importance of context. These models perform better when you explicitly provide schema information, field descriptions, and examples. Dropping in a raw CSV and expecting useful results is just wishful thinking. Structured data needs to be transformed into a representation that makes relationships legible, whether that is well-formed JSON, a textual summary, or a constrained schema the model can reason about.
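One way that can look in practice: pair the records with a short schema of field descriptions and serialize both as well-formed JSON before they go into the prompt. The field names and descriptions below are hypothetical.

```python
import json

# Hypothetical schema: field names and descriptions are assumptions for illustration.
schema = {
    "order_id": "Unique identifier for the order",
    "price": "Order total in USD",
    "ordered_at": "ISO-8601 timestamp of purchase",
}

records = [
    {"order_id": "A-1001", "price": 49.99, "ordered_at": "2024-01-09T14:03:00Z"},
    {"order_id": "A-1002", "price": 12.50, "ordered_at": "2024-01-10T09:21:00Z"},
]

# Give the model the schema and the data as well-formed JSON, not a raw dump.
prompt = (
    "Each record follows this schema (field: description):\n"
    + json.dumps(schema, indent=2)
    + "\n\nRecords:\n"
    + json.dumps(records, indent=2)
)
```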
Clean data is not just correct data. It is interpretable data.
Stop asking for “insights”
Handing a dataset to a model and asking for “insights” almost always produces the same outcome. You get generic observations, confident hallucinations, or fixation on irrelevant details. The model does not know what matters in your domain, and it has no internal notion of importance, or of what you might not have seen.
A better approach is to ask specific, testable questions. Not “what insights can you find?” but “what anomalies exist in the price column, and what might explain the drop on January 10th?” Require the model to show its work. Ask it to list observations and cite the specific rows or records that support each one. Instruct it to explicitly mark anything it cannot support with evidence as unsupported.
This shifts the model from storytelling to assisted analysis and gives you outputs you can actually verify.
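A sketch of what an evidence-demanding prompt can look like. The question, dataset, and field name (`order_id`) are illustrative, and the exact rules should reflect your own domain.

```python
# A specific, testable question instead of a request for "insights".
question = (
    "What anomalies exist in the price column, and what might explain "
    "the drop on January 10th?"
)

# Rules that force the model to show its work and flag unsupported claims.
prompt = f"""{question}

Rules:
- List each observation as a separate bullet.
- After each observation, cite the order_id values of the rows that support it.
- If you cannot point to specific rows as evidence, mark the observation UNSUPPORTED.
- Do not speculate beyond what the cited rows show."""
```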
Humans still do a better job running the place
Humans define the problem space. They decide what success looks like, which labels are valid, and which errors are acceptable. They design evaluation sets that reflect real-world constraints, not just statistical convenience.
Consider fraud detection. A model may flag transactions as suspicious based on learned patterns, but it does not understand operational reality. Humans design the rules that combine model scores with business logic. They investigate edge cases, correct labels, and surface failure modes that the model cannot see because those failures are semantically, not statistically, wrong.
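As a rough illustration of that rule layer, here is a sketch of human-defined business logic sitting on top of a model score. The thresholds, fields, and rules are invented for the example, not a real fraud policy.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    amount: float
    country: str
    account_age_days: int
    model_score: float  # model's estimated probability of fraud, 0.0-1.0

# Human-defined rules combining the model score with business logic.
# Thresholds and conditions here are illustrative assumptions.
def should_review(txn: Transaction) -> bool:
    if txn.model_score >= 0.9:
        return True   # high model confidence alone triggers review
    if txn.model_score >= 0.6 and txn.account_age_days < 30:
        return True   # medium score on a brand-new account is suspicious
    if txn.amount > 10_000 and txn.country not in {"US", "CA"}:
        return True   # large cross-border amounts always get a human look
    return False
```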
This is not oversight as a formality. It is judgment applied where LLMs, which are probabilistic systems at their core, are blind.
Building something reliable with AI as a lunatic assistant
Reliability comes from process, not clever prompting alone. Profile your data before it reaches the model and catch basic issues early. Maintain small validation sets that you trust and revisit them regularly. Write prompts that demand evidence rather than settling for fluent, confident-sounding prose.
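A minimal profiling pass along those lines might look like this. The checks are illustrative and deliberately incomplete; the point is that they run before any prompt is written.

```python
import pandas as pd

# A minimal profiling pass to run before any data reaches the model.
# The checks here are illustrative, not exhaustive.
def profile(df: pd.DataFrame) -> dict:
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "null_fraction": df.isna().mean().round(3).to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Illustrative data standing in for whatever you are about to hand the model.
df = pd.DataFrame({"price": [49.99, None, 12.50, 12.50], "qty": [1, 2, 3, 3]})
print(profile(df))  # surfaces the null price and the duplicate row before any prompting
```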
AI can dramatically accelerate analysis, but it amplifies whatever discipline, or lack thereof, you bring to it. Clean inputs, domain knowledge, and verification do not become optional as you scale to production. They become the thing you, human person, do.
Used carefully, these systems are sharp tools. Used carelessly, they are engines for producing authoritative-sounding garbage.