Rob Buller, Founding Partner, Cyberhill Partners.
Globally, unstructured data represents 80% to 90% of the world’s digital information. By 2025, that volume is expected to reach 175 zettabytes.
Unstructured data is everywhere—medical images, surveillance video, documents, emails, training content, even social media. It’s the real operating system of any organization.
Most companies today are building their AI strategy around structured data because unstructured data is operationally hard: it is messy, fragmented and buried in silos. Leaders under pressure to “show results with AI” take the easier path and run pilots on structured data. In fact, according to Huble research, 45% of companies working with AI say unstructured data is a major obstacle to success.
Consider Zillow’s iBuying program, Zillow Offers. Its pricing worked well on structured real estate data but struggled because it couldn’t reliably interpret the unstructured, qualitative signals, such as home-specific features and local nuances, that significantly affect real estate valuation.
In short, without including unstructured data, AI pilots will struggle to show true ROI and rarely mature into a foundation the business can build on. Fixing that starts by helping AI understand context.
How To Turn Chaos Into Context
AI doesn’t “understand” the world the way we do. It understands things and relationships. Ontologies label the things and map the relationships so machines can follow our logic.
A contract becomes data when AI can identify its entities and how they connect—who paid whom, jurisdiction, dates, obligations and the signing authority. That’s when unstructured content turns into business knowledge.
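As a minimal sketch of that idea, the contract's entities and relationships can be written down as subject-predicate-object facts. Everything here is hypothetical and for illustration only: the party names, the contract ID, the relationship labels and the `query` helper are assumptions, not part of any particular ontology standard or product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One fact: a thing, a relationship, and the thing it points to."""
    subject: str
    predicate: str
    obj: str


# Hypothetical facts extracted from a single contract.
contract_facts = [
    Triple("Acme Corp", "PARTY_TO", "Contract-001"),
    Triple("Globex LLC", "PARTY_TO", "Contract-001"),
    Triple("Acme Corp", "PAYS", "Globex LLC"),
    Triple("Contract-001", "GOVERNED_BY", "Delaware"),
    Triple("Contract-001", "SIGNED_ON", "2024-03-01"),
    Triple("Contract-001", "SIGNED_BY", "Jane Doe"),
]


def query(facts, subject=None, predicate=None):
    """Return every fact matching the given subject and/or predicate."""
    return [
        f for f in facts
        if (subject is None or f.subject == subject)
        and (predicate is None or f.predicate == predicate)
    ]


# "Who paid whom?" becomes a lookup instead of a reading exercise.
for fact in query(contract_facts, predicate="PAYS"):
    print(fact.subject, "pays", fact.obj)

# Everything the contract itself asserts: jurisdiction, date, signer.
for fact in query(contract_facts, subject="Contract-001"):
    print(fact.predicate, "->", fact.obj)
```

Once the document is expressed this way, the same questions can be asked of thousands of contracts at once.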
A simple three-step process can help organizations turn unstructured data into actionable insights to make AI more reliable and effective:
1. Ingest and store. Bring unstructured data into a flexible environment like a data lake (Snowflake, Databricks or others). This gives you room to work with the data in different ways and prepare it for downstream use.
2. Vectorize and segment. Split the content into segments, use an embedding model to translate each segment into a searchable, machine-readable vector, and store those vectors in a vector database such as Pinecone. Just as important, create metadata during this process. That metadata will later drive the semantics and ontologies you need to build.
3. Build context/semantics. Use the metadata to create a semantic layer of knowledge graphs and ontologies, stored in a graph database (for example, Neo4j). This final step defines the relationships and meaning that make the data truly useful.
That third step is the one most often skipped, but it is critical. It gives organizations outputs they can explain, verify and trust. A knowledge graph makes outputs traceable: each answer can be followed back to the entities, relationships and source documents behind it. Without traceability, you will never truly understand your AI.
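The three steps can be sketched end to end in a few lines. This is a toy illustration under loud assumptions, not a reference implementation: a dictionary stands in for the data lake, word counts stand in for a real embedding model, a list stands in for the vector database, and a set of edges stands in for the graph database. The file names, entity list and helper functions are all hypothetical.

```python
import math
import re
from collections import Counter

# Step 1: ingest and store. A dict stands in for a data lake.
raw_store = {
    "contract_001.txt": "Acme Corp pays Globex LLC. Governed by Delaware law.",
    "email_017.txt": "Globex LLC asked about the renewal date for the contract.",
}


def embed(text):
    """Toy embedding: word counts (a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 2: vectorize and segment, creating metadata along the way.
KNOWN_ENTITIES = ["Acme Corp", "Globex LLC", "Delaware"]  # hypothetical list
index = []  # stands in for a vector database
for doc_id, text in raw_store.items():
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for i, sentence in enumerate(sentences):
        index.append({
            "chunk_id": f"{doc_id}#{i}",
            "text": sentence,
            "vector": embed(sentence),
            "metadata": {
                "source": doc_id,
                "entities": [e for e in KNOWN_ENTITIES if e in sentence],
            },
        })

# Step 3: build context. Graph edges are derived from the metadata.
edges = set()  # stands in for a graph database
for chunk in index:
    edges.add((chunk["chunk_id"], "PART_OF", chunk["metadata"]["source"]))
    for entity in chunk["metadata"]["entities"]:
        edges.add((chunk["chunk_id"], "MENTIONS", entity))


def search_with_trace(question):
    """Return the best-matching segment plus the edges that explain it."""
    q = embed(question)
    best = max(index, key=lambda c: cosine(q, c["vector"]))
    trace = sorted(e for e in edges if e[0] == best["chunk_id"])
    return best["text"], trace


answer, trace = search_with_trace("Who pays Globex LLC?")
print(answer)
for edge in trace:
    print(edge)
```

The point of the sketch is the return value of `search_with_trace`: the answer arrives together with the source document and the entities it mentions, which is what makes it possible to explain and verify.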
Industry Examples
Every industry sector is sitting on mountains of unstructured data. These four could benefit most from employing these techniques.
• Healthcare: More than 90% of medical data is unstructured—MRIs, pathology slides, physician notes. AI paired with ontologies can reveal insights that traditional records miss.
• Finance: Call center transcripts, contracts and compliance filings often hold the first signals of fraud.
• Cybersecurity and National Security: Surveillance video, IoT sensors and communications metadata generate streams of unstructured input that only become useful once organized and contextualized.
• Legal and services firms: Contracts, proposals, state laws, and government and service regulations all live in documents.
Ignoring these sources has a cost. Gartner estimated in 2020 that poor data quality costs organizations an average of $12.9 million a year. In other words, doing nothing already comes with a price tag. Data cleansing happens as part of converting unstructured data into vectorized data for AI, so one effort serves both purposes and, in and of itself, can save companies $10 million annually.
Across every sector, the story is the same: Organizations are data-rich, but context-poor.
Strategic Implications
This isn’t just an IT challenge; it’s a strategic one. Companies that organize their unstructured data can make faster, more accurate decisions, turning static knowledge into a living, connected system.
Imagine a global enterprise where training content, contracts and customer feedback are all connected through a shared semantic framework. This can enable not only faster search but also a system that retains knowledge and improves over time.
When organizations give structure and meaning to their unstructured data, they unlock speed, efficiency and intelligence that structured datasets alone can’t provide. Turning unstructured data into reliable, traceable insight isn’t just a technical win—it’s the foundation of responsible AI.
Looking Ahead
Gartner predicts 80% of enterprises will deploy generative AI by 2026—up from less than 5% in 2023.
But the next wave of AI success won’t come from who builds the flashiest models. It will come from those who build on the smartest data. Organizations that prioritize organizing and contextualizing unstructured data are better positioned to leverage AI effectively and generate reliable insights at scale.
Too often, companies are looking up the AI funnel, not down. Almost all of them start at the top, with the large language model (LLM): They pick an LLM and start running it over structured data. In reality, the process begins at the opposite end of the funnel, with the data.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
