Data Documentation Matters More Than You Think

Published on September 29, 2025 by Fabian Stadler

Image by Michal Jarmoluk from Pixabay

2026-02-23: This is an updated and migrated version of an article I formerly published on medium.com.

Governance is inching closer to the heart of engineering. The new pressure is not just about access control; it’s about documentation, semantics, and AI-readiness. In practice, that often turns into a push to fill in table and column comments.

This raises a deceptively simple question: does documentation actually matter? My answer to this question is both yes and no. It depends on how you use it, what you expect from it, and whether you’re willing to let AI help you maintain it.

The Case for Comments: Human and Machine Context

Comments are the smallest unit of semantic intent we can attach to a schema. They explain names, reveal business meaning, clarify units and scales, and hint at lineage or sensitivity. For humans, that’s documentation. For AI agents, it’s a breadcrumb trail to the “right” data with fewer hallucinations and fewer false joins.

In Snowflake, for example, you can write comments in DDL statements or afterward:

-- Inline during create
CREATE OR REPLACE TABLE analytics.customer_orders (
   order_id       NUMBER(38,0) COMMENT 'Surrogate key; unique per order record',
   customer_id    NUMBER(38,0) COMMENT 'FK to analytics.customers.customer_id; represents billed customer',
   order_ts       TIMESTAMP_TZ COMMENT 'Order commit time in UTC',
   order_total_usd NUMBER(10,2) COMMENT 'Total order amount in USD including tax',
   channel        VARCHAR COMMENT 'Sales channel: web|retail|partner'
) COMMENT = 'Orders fact at customer grain; 1 row per order';

-- After the fact
COMMENT ON TABLE analytics.customer_orders IS 'Orders fact at customer grain; 1 row per order';
COMMENT ON COLUMN analytics.customer_orders.order_ts IS 'Order commit time in UTC';

-- Discover comments later
DESCRIBE TABLE analytics.customer_orders;
SELECT column_name, data_type, comment
FROM analytics.information_schema.columns
WHERE table_schema = 'ANALYTICS' AND table_name = 'CUSTOMER_ORDERS';

These comments aren’t a semantic layer, but they’re the raw material. They give BI tools, catalogs, and AI copilots a first pass at meaning. The result is a faster “time to correct query” for humans and machines alike.

When Comments Become Theater

But there’s a catch: mandated documentation often devolves into theater. Teams paste “Order ID” into order_id, or “Customer ID” into customer_id. You meet the letter of the policy but add zero meaning—no units, no grain, no owner, no warnings.

That kind of metadata ages like milk. It creates the illusion of order while routing no real signal to consumers or agents. Worse, it teaches teams that metadata is a checkbox, not an asset—which kills the culture you’ll need for true AI readiness.

Governance That Makes Comments Useful

Good comments are possible, but they need a shared contract. Governance must define the minimum viable comment (MVC): what fields must be present and how to write them succinctly. Then you need guardrails—checks in CI/CD and periodic monitoring—to keep metadata alive as the data evolves.

A simple, enforceable template for comments works wonders. Treat each comment as a four-to-six-line definition. Here’s how that looks in practice:

COMMENT ON COLUMN analytics.customer_orders.order_total_usd IS
'Business: Total billed amount per order including tax.
Grain: One row per order_id, at order commit time.
Units: USD, 2 decimal precision.
Owner: Data Commerce Team (data-commerce@company.com).
Gotchas: Can be null for canceled orders; refunds appear as negative totals.';
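A template like this can be checked mechanically. Here is a minimal Python sketch of such a check; the function name and the exact field list are assumptions drawn from this article, not a standard:

```python
import re

# Required MVC fields, as proposed in this article (an assumption, not a
# Snowflake feature): each comment must carry these labeled lines.
REQUIRED_FIELDS = ("Business", "Grain", "Units", "Owner", "Gotchas")

def missing_mvc_fields(comment: str) -> list[str]:
    """Return the MVC fields absent from a column comment.

    A field counts as present if some line starts with 'Field:' (optionally
    indented), matching the template shown above.
    """
    return [
        field for field in REQUIRED_FIELDS
        if not re.search(rf"^\s*{field}\s*:", comment, re.MULTILINE)
    ]
```

A fully templated comment yields an empty list; a bare `'Order ID'` yields all five fields, which makes this a natural building block for a CI check.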

You can also monitor coverage and quality:

-- Find missing or trivial comments (e.g., comments equal to column name)
SELECT table_schema, table_name, column_name
FROM analytics.information_schema.columns
WHERE (comment IS NULL OR TRIM(comment) = ''
      OR LOWER(REGEXP_REPLACE(comment, '[^a-zA-Z0-9]', ''))
         = LOWER(REGEXP_REPLACE(column_name, '[^a-zA-Z0-9]', '')))
  AND table_schema NOT IN ('INFORMATION_SCHEMA');

Make it a build gate: fail if the share of blank or trivial comments in a production schema exceeds a threshold, or if a new column lacks the MVC fields. Small friction now, huge clarity later.

Comments Aren’t Enough for AI Agents

Even perfect prose won’t answer every question an AI agent needs to route a query or pick a table. Agents need operational signals:

  - Recency
  - Completeness
  - Expected row counts
  - Null rates
  - Join keys
  - Data sensitivity

A column comment won’t tell you that a table is stale, that nulls spiked yesterday, or that an enum drifted.

This is where quality metrics, tags, and telemetry complement comments. Track freshness and timeliness for each table, expose expected update SLAs, and store join cardinality hints or candidate keys. In Snowflake, object tags can augment comments with policy-critical metadata (e.g., pii=true, retention=365d) that downstream systems can enforce automatically.

-- Example: complement comments with tags
-- (assumes the table also carries an email column holding customer contact data)
CREATE TAG IF NOT EXISTS sensitivity COMMENT = 'Data classification';
ALTER TABLE analytics.customer_orders MODIFY COLUMN email
   SET TAG sensitivity = 'pii';

When AI agents choose between two “customer” tables, comments provide semantics, while freshness metrics and tags provide fitness-for-use. You need both.
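To make that concrete, here is a hypothetical sketch of how an agent might blend the two signals when ranking candidate tables. The table names, the 60/40 weighting, and the SLA-based freshness decay are all assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class TableCandidate:
    name: str
    semantic_score: float      # how well comments match the question, in [0, 1]
    hours_since_update: float  # recency signal from freshness telemetry

def fitness(t: TableCandidate, sla_hours: float = 24.0) -> float:
    """Blend semantics with fitness-for-use: freshness decays linearly
    and hits zero at twice the update SLA (an assumed policy)."""
    freshness = max(0.0, 1.0 - t.hours_since_update / (2 * sla_hours))
    return 0.6 * t.semantic_score + 0.4 * freshness

# Two plausible "customer" tables: similar semantics, very different freshness.
candidates = [
    TableCandidate("analytics.customers", semantic_score=0.9, hours_since_update=2.0),
    TableCandidate("staging.customers_raw", semantic_score=0.8, hours_since_update=90.0),
]
best = max(candidates, key=fitness)
```

Here the stale staging table loses despite a comparable semantic score, which is exactly the behavior you want from an agent choosing between near-duplicates.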

Why Invest Anyway: Teams Learn Faster

Even if you later automate metadata generation, writing good comments now forces clarity. Teams articulate grain, reconcile units, and surface implicit business rules. That reduces defects, shortens onboarding, and prevents “semantic debt” from quietly compounding.

Comments also serve as affordances for code review. A PR that adds a status column with a clear domain and owner will trigger useful discussion. Quality emerges from the negotiation around meaning as much as from the data itself.

The Near Future: Agents That Explore the Warehouse

AI is rapidly getting better at inferring structure from data—sampling, profiling, clustering values, and mapping lineage from query logs. Agents can:

  - Propose comments
  - Validate enums
  - Detect join keys
  - Account for temporary defects
  - Suggest sensitivity tags

In many cases, they will produce richer, more current metadata than humans can maintain by hand.

That doesn’t make human intent obsolete; it moves us up a level. Our job becomes steering the agent: defining governance policies, validating proposals, and wiring feedback loops into CI/CD. Think of it as metadata pair programming—agents explore, we curate.

The Pragmatic Middle Way

So, does data documentation matter? It does—when it’s concise, structured, and backed by governance and monitoring. It doesn’t—when it’s forced, shallow, and left to rot.

Bootstrap with human-authored MVC comments, augment with tags and quality metrics, and progressively automate with AI. That way, the effort you spend on metadata and the value you get back can stay in equilibrium.

A Practical Starter Plan

  1. Define your MVC: 5 required fields for each comment (meaning, grain, units, owner, gotchas). Keep each to a sentence.
  2. Instrument freshness: Publish recency, null rates, and row deltas per table; surface them in your catalog.
  3. Enforce gates: CI checks for blank/trivial comments and missing tags on sensitive columns.
  4. Use AI as a drafter: Let an agent propose comments and tags from profiling; require human approval.
  5. Review drift: Set a quarterly cadence to revalidate the most-used tables; automate reminders with usage data.
  6. Close the loop: Capture corrections from analysts and feed them back to retrain your metadata agent.
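Step 4 above can be sketched concretely. The function below drafts an MVC-style comment from profiling stats and deliberately leaves TODO markers for a human reviewer; the function name, stats keys, and wording are all hypothetical:

```python
def draft_comment(column: str, stats: dict) -> str:
    """Draft an MVC-style comment proposal from profiling output.

    Everything here is a proposal: a human must review and complete the
    TODO fields before the comment is written back to the warehouse.
    """
    lines = [
        f"Business: TODO - describe the meaning of {column}.",
        f"Grain: observed {stats['distinct']} distinct values over {stats['rows']} rows.",
        f"Units: inferred type {stats['dtype']}.",
        "Owner: TODO - assign owning team.",
        f"Gotchas: null rate {stats['null_rate']:.1%}.",
    ]
    return "\n".join(lines)
```

Wiring this into a pull-request workflow (agent proposes, reviewer edits, CI gate verifies the MVC fields) closes the loop described in steps 4 through 6.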

Finally, accept impermanence. Metadata is living knowledge, not a PDF. What’s “true” this quarter may change next quarter as products, schemas, and regulations evolve. Our job isn’t to freeze meaning—it’s to keep meaning moving at the speed of the business.

If we start now with that mindset, we’ll get a better outcome than forcing engineers to fill in soon-to-be-stale text boxes. We’ll create a virtuous cycle where comments, tags, and metrics feed both humans and agents—making a data environment genuinely AI-ready, rather than merely annotated.

If you have any questions or feedback, feel free to write me a mail or reach out to me on any of my social media channels.