Enterprise AI is only as good as the data you feed it, and most of that data is a mess.
Roughly 80–90% of enterprise data is unstructured, and a huge share of the knowledge AI needs lives in documents. In legal alone, an estimated 550 billion to 1.1 trillion pages sit in formats machines can't reliably structure, verify, or use at scale: contracts, pleadings, productions, exhibits, transcripts, and decades of scanned records. And the pile isn't a one-time backlog. It keeps growing, because PDFs remain the default way documents are created, shared, and stored by enterprises across industries.
The tools and services most companies use today read those documents as flattened text. They go page by page, pull out words, and in the process drop the thing that gives a complex document its meaning: its structure. Layout is part of it: read a two-column page in the wrong order and the text comes out scrambled. But the real value lies in the relationships, which span the entire document. A citation and the document-wide context it requires. A defined term and the definition that governs how it's read across the document. A value in a multi-page table and the column header that gives it meaning, hundreds of pages away. Those relationships are what make an answer trustworthy: accurate, because the model can see how the pieces connect, and verifiable, because every answer can be traced back to its exact place in the source.
Throw that structure away and the AI built on top inherits the loss. Reaching for a better model doesn't fix it. A capable model can often produce a plausible answer from a messy document, but it does so by inferring structure on the fly and discarding it, a little differently each time, with nothing you can check. A person reviewing every output can live with that; a system running on its own, at volume, can't. The fix isn't a smarter model re-deriving structure on every call; it's structuring the document once, explicitly, into something every model can reuse and anyone can inspect. The more autonomous AI becomes, the more essential that data infrastructure layer is.
Why this is becoming urgent now
There are two kinds of enterprise AI, and they have very different tolerances for bad input.
The first is user-facing: a person asks a question, gets a draft or an answer, and reviews it before it goes anywhere. A flawed input here is an annoyance. Someone usually catches it. Most of the AI in production today works this way.
The second is system-level: documents flowing automatically into products, databases, workflows, and systems of record, processed continuously, at volume, with no person reading each result. This is where much of the durable enterprise value sits, and it's where adoption stalls, because a system you can't supervise by hand can only run on inputs it can trust. You can't automate a workflow without reliable data.
The bar is rising, too. The first wave asked AI to do familiar work faster; the next asks it to produce outcomes a business can act on without a person checking each one. Reliable outcomes start with reliable inputs.
The structured data layer
That trustworthy input is the layer we've been building at LexSelect.
Our parsing engine takes a complex document and converts it into a document-wide structured node tree shaped by a legal document ontology, preserving the hierarchy, relationships, and meaning the document actually carries, not just the characters on each page. Downstream AI and workflows get something they can reason over and verify, instead of a flat wall of text they have to guess at.
That layer is now available as an API. Partners send us their documents and get the structured node tree back, to build on directly inside their own products and workflows. The engine that began inside our own software is now infrastructure other companies build on. That's the difference between having a capability and becoming a data infrastructure layer.
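To make the idea of a document-wide node tree concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the `Node` class, its field names, the `find_definition` and `uses_of` helpers, and the sample agreement are assumptions, not LexSelect's actual schema or API. It shows the two properties described above: a defined term resolved across the whole document, and every hit traced back to its exact place in the source.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """One element of a hypothetical document tree (illustrative schema only)."""
    node_id: str
    kind: str                      # e.g. "document", "section", "definition", "clause"
    text: str = ""
    page: Optional[int] = None     # source page, kept for traceability
    term: Optional[str] = None     # set only on "definition" nodes
    children: list[Node] = field(default_factory=list)

    def walk(self, path: tuple[str, ...] = ()) -> Iterator[tuple[tuple[str, ...], Node]]:
        """Yield (ids from the root down to this node, node) in document order."""
        here = path + (self.node_id,)
        yield here, self
        for child in self.children:
            yield from child.walk(here)


def find_definition(root: Node, term: str) -> Optional[Node]:
    """Return the node that defines `term`, wherever it sits in the document."""
    for _, node in root.walk():
        if node.kind == "definition" and node.term == term:
            return node
    return None


def uses_of(root: Node, term: str) -> list[tuple[str, Optional[int]]]:
    """Return (tree path, page) for every non-definition node that uses `term`."""
    return [
        (" > ".join(path), node.page)
        for path, node in root.walk()
        if node.kind != "definition" and term in node.text
    ]


# A made-up agreement: the definition is on page 1, its use on page 212.
doc = Node("doc", "document", children=[
    Node("s1", "section", "1. Definitions", page=1, children=[
        Node("s1.1", "definition", '"Effective Date" means 1 March 2024.',
             page=1, term="Effective Date"),
    ]),
    Node("s9", "section", "9. Term", page=212, children=[
        Node("s9.2", "clause", "This Agreement begins on the Effective Date.",
             page=212),
    ]),
])

definition = find_definition(doc, "Effective Date")
print(definition.page)                 # page where the term is defined
print(uses_of(doc, "Effective Date"))  # each use, with its path and page
```

A page-by-page text extract would treat the clause on page 212 and the definition on page 1 as unrelated strings. With an explicit tree, the link is built once and any downstream model or reviewer can follow it without re-deriving it.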
Legal first, on purpose
We started with legal deliberately, because it's the hardest version of this problem. A single matter can carry exhibits, defined terms, citations, nested tables, multiple versions, and relationships that run across hundreds of documents. If you can preserve structure there, the same architecture carries over to less complex use cases. Legal isn't a niche we're hoping to climb out of. It's the proving ground for a much broader Document AI opportunity. The architecture that earns trust here is the one the rest of the enterprise needs.
Who's already pulling on it
The clearest sign that the thesis is right is the range of companies showing up for the same engine. In the first weeks since launch, we have partners onboarding or in active discussions across four very different categories at once:
- Legal content and data providers, structuring the proprietary content they own and license, from deep archives to the new material they keep producing.
- Embedded workflow platforms, enriching document-heavy products with structured metadata.
- LLM-native legal products, which need clean, verifiable input to make their models more accurate and their outputs trustworthy.
- High-volume enterprise operations, running backend document AI workflows at scale.
Four different problems, four different kinds of company, one bottleneck. When the same piece of infrastructure gets pulled in four directions at once, it's usually because it's a layer the market has been missing.
The same engine is already proving itself in production on the enterprise workflow side. Our first enterprise workflow deployment, with Medina McKelvey, has moved from pilot to production, automating mediation intake workflows, and the relationship is now expanding into a second deployment.
How we got here
If you've followed LexSelect, none of this is a hard turn. When we introduced our parsing engine last year, we wrote that the company was "evolving from a tool into infrastructure." At the time, that was the idea sitting under the product, while what we were actually building was tools for lawyers as end users.
What that work taught us was that the tool was not the hardest part. The hardest part was turning messy documents into structured, reliable inputs. That same problem, multiplied by billions of pages, is what's slowing AI across the whole enterprise. So we followed it to the root and built the layer. The opportunity turned out to be much bigger than we first described.
Where this goes
As AI moves from helping people to running inside the systems businesses depend on, the test it has to pass changes: less about the time it saves, more about whether the outcome can be trusted and can generate value. Every one of those systems needs the same thing to clear that bar: clean, structured, verifiable inputs. That's the layer between the world's documents and the models and applications built on top of them: foundation-model-adjacent infrastructure. We're starting with legal because it's the hardest place to earn that trust. We don't intend to stop there.
Where you might fit
- If you're building AI on unstructured documents, the API gives you structured, verifiable input at scale. Let's talk.
- If your team runs high-volume document workflows, we do enterprise automation on top of the same engine. Let's chat about building a solution for your enterprise.
- If you think about where the enterprise AI stack is still missing pieces, my argument is that structured, verifiable document input is the most underbuilt layer in it. I'm always eager to compare notes.
Try LexSelect on your messiest document, or reach me on LinkedIn or at morgan@lexselect.io.
