The Rise of Unstructured Data ETL

16 Mar 2024

When Clive Humby coined the phrase "data is the new oil" in 2006, he meant that data, like oil, isn't useful in its raw state. It needs to be refined, processed, and turned into something useful; its value lies in its potential.

Fast forward to today, and the quote rings truer than ever: we're drowning in data, roughly 80% of which is unstructured and largely untapped.

  • 80-90% of the world’s data is unstructured in formats like HTML, PDF, CSV, or email
  • 90% of it was produced over the last two years alone
  • The rise of LLM and RAG applications increases the demand for data

However, preparing unstructured data is a major bottleneck. A survey shows that data scientists spend nearly 80% of their time preparing data for analysis. As a result, a lot of the data that companies produce goes unused.

Manual data pipelines are error-prone, hard to scale, and require constant maintenance. In the past, enterprises relied on a complex daisy chain of software, data systems, and human intervention to extract, transform, and integrate data.

This is where unstructured data ETL comes in - a new paradigm for automating the end-to-end processing of unstructured data at scale.

The Untapped Potential of Unstructured Data

Enterprises have struggled to fully leverage their data, since most of it is trapped in hard-to-analyze formats. Existing tools only scratch the surface of what's possible with that data. Until recently, working with it was a very manual process: developers stitched together different tools (web scrapers, document processors) and wrote custom code to extract, transform, and load unstructured data from sources like websites or PDFs.

LLMs are really good at understanding and structuring data, and there is now a modern AI-first data stack emerging with tools that help with efficiently preparing data for usage in LLM and RAG applications.

Introducing Unstructured Data ETL

Unstructured data ETL is a new approach to automating the end-to-end processing of unstructured data at scale. It combines the power of AI with traditional data engineering to extract, transform, and load data from diverse sources and formats.

[Figure: Market Map]

The key components of unstructured data ETL are:

  1. Extract: Orchestrating AI agents to autonomously navigate and extract data from sources like websites, PDFs, and emails.
  2. Transform: Automatically cleaning and mapping the extracted data into the desired structured format.
  3. Load: Loading the transformed data into databases or delivering it via API for consumption by downstream applications.
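The three steps above can be sketched as a small pipeline. This is a minimal illustration, not how any particular product works: the names `Product`, `stub_extractor`, `transform`, `load`, and `run_pipeline` are hypothetical, and the extractor is a regex stub standing in for an LLM or agent call so the example runs offline.

```python
import re
import sqlite3
from dataclasses import dataclass
from typing import Callable

@dataclass
class Product:
    """Target schema for the structured output (hypothetical)."""
    name: str
    price_usd: float

# Hypothetical extractor signature: raw text in, loosely typed fields out.
Extractor = Callable[[str], dict]

def stub_extractor(raw: str) -> dict:
    """Extract: stand-in for an LLM or agent call (assumption).
    Uses simple regexes so the sketch runs without network access."""
    name = re.search(r"<h1>(.*?)</h1>", raw, re.S)
    price = re.search(r"\$\s*([\d,]+(?:\.\d+)?)", raw)
    return {
        "name": name.group(1) if name else "",
        "price": price.group(1) if price else "",
    }

def transform(fields: dict) -> Product:
    """Transform: clean the extracted fields and map them to the schema.
    Raises ValueError if the price is missing or not numeric."""
    name = " ".join(fields["name"].split())           # collapse whitespace
    price = float(fields["price"].replace(",", ""))   # "1,299.00" -> 1299.0
    return Product(name=name, price_usd=price)

def load(conn: sqlite3.Connection, product: Product) -> None:
    """Load: write the structured record to a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (name TEXT, price_usd REAL)"
    )
    conn.execute(
        "INSERT INTO products VALUES (?, ?)",
        (product.name, product.price_usd),
    )
    conn.commit()

def run_pipeline(raw: str, extractor: Extractor,
                 conn: sqlite3.Connection) -> Product:
    """Run extract -> transform -> load for a single document."""
    product = transform(extractor(raw))
    load(conn, product)
    return product

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    html = "<h1>  Espresso   Machine </h1><span>Price: $1,299.00</span>"
    print(run_pipeline(html, stub_extractor, conn))
```

Because the extractor is passed in as a function, the regex stub could be swapped for a model-backed one without touching the transform and load steps.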

The big efficiency gain over traditional ETL is the ability to handle the complexity and variability of unstructured data sources. If a website or PDF changes its layout, rule-based systems that depend on that layout break. We can now use AI to adapt to such changes, making workflows more resilient and far less maintenance-heavy.
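To make the brittleness concrete, here is a small sketch of a rule tied to one page layout, plus a fallback hook for when the rule stops matching. Everything here is illustrative: the two HTML snippets are made up, and `fake_model_fallback` is a hypothetical stand-in for a model call (it is just a looser regex), not a real AI extractor.

```python
import re
from typing import Callable, Optional

# Two made-up versions of the same page: same fact, different markup.
LAYOUT_V1 = '<div class="price">$19.99</div>'
LAYOUT_V2 = '<span data-testid="amount">19.99 USD</span>'

def rule_based_price(html: str) -> Optional[float]:
    """Rule tied to the exact V1 markup; returns None when it no longer matches."""
    m = re.search(r'<div class="price">\$([\d.]+)</div>', html)
    return float(m.group(1)) if m else None

def fake_model_fallback(html: str) -> Optional[float]:
    """Hypothetical stand-in for an LLM call (assumption): strips tags and
    looks for a decimal amount anywhere in the remaining text."""
    text = re.sub(r"<[^>]+>", " ", html)
    m = re.search(r"(\d+\.\d{2})", text)
    return float(m.group(1)) if m else None

def extract_price(html: str,
                  fallback: Callable[[str], Optional[float]]) -> Optional[float]:
    """Try the cheap rule first; hand off to the fallback if it fails."""
    price = rule_based_price(html)
    if price is None:
        price = fallback(html)
    return price

if __name__ == "__main__":
    print(rule_based_price(LAYOUT_V1))                     # rule matches V1
    print(rule_based_price(LAYOUT_V2))                     # rule misses V2
    print(extract_price(LAYOUT_V2, fake_model_fallback))   # fallback recovers it
```

The design point is that the pipeline asks for a field ("the price") rather than a position in the markup, so a layout change becomes a fallback path instead of a broken job.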

Real-World Applications of Unstructured Data ETL

Automated unstructured data ETL enables a wide range of use cases, both traditional and AI-powered:

Traditional use cases:

  • Web scraping for price monitoring, lead generation, and market research
  • Extracting data from PDFs for financial analysis and legal compliance
  • Processing emails and support tickets for sentiment analysis and trend detection

AI data preparation use cases:

  • Powering RAG architectures for question-answering over large unstructured datasets
  • Generating training data for fine-tuning LLMs on domain-specific tasks
  • Enriching chatbot conversations with data pulled from external sources in real-time

The AI data preparation market is expected to experience significant growth in the coming years, and unstructured data ETL will play a crucial role.

The ability to efficiently process unstructured data at scale opens up new possibilities for enterprises to extract insights and automate processes that were previously too complex or labor-intensive to tackle.

Conclusion

Unstructured data ETL is the missing piece in the modern data stack. Data pipelines that took weeks to build, test, and deploy can now be automated end-to-end in a fraction of the time with tools like unstructured.io or Kadoa. For the first time, we have turnkey solutions for handling unstructured data in any format and from any source.

Enterprises that apply this new paradigm will be able to fully leverage their data assets, make better decisions faster, and operate more efficiently.

Data is definitely (still) the new oil.