What Is Unstructured Data?

A look at the intricacies of unstructured data and methods for processing, analyzing and querying it.
May 22nd, 2023 9:35am by

This is the first of a three-part series.

Our world is becoming more digital by the day, and the volume of data it produces keeps growing rapidly. The rise of AI technology has only accelerated this process. However, not all data is created equal. By commonly cited industry estimates, roughly 80% of newly generated data is unstructured, and this proportion is expected to increase as industries advance and technology develops. Most importantly, unstructured data is not only abundant but also a rich source of information that can support better-informed business decisions.

So, what exactly is unstructured data, and how does it differ from structured and semi-structured data? How can we effectively process, analyze, and search through unstructured data? In this blog, we will explore the intricacies of unstructured data and discuss methods for processing, analyzing and querying it.

Structured Data vs. Unstructured Data vs. Semi-Structured Data

Let’s start by learning about different data types — structured, semi-structured and unstructured.

Structured Data

Structured data follows a specific format, making it easy to store and analyze using traditional data management tools like SQL. Examples of structured data include customer information, transaction records and inventory lists.
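As a minimal sketch of what "easy to store and analyze" looks like in practice, the example below loads a few transaction records into an in-memory SQLite database and aggregates them with SQL. The table name, column names and rows are invented for illustration.

```python
import sqlite3

# Hypothetical schema and rows, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("alice", 42.50), ("bob", 17.25), ("alice", 8.00)],
)


def total_spent(connection, customer):
    """Sum the amounts of all transactions recorded for one customer."""
    row = connection.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions WHERE customer = ?",
        (customer,),
    ).fetchone()
    return row[0]


print(total_spent(conn, "alice"))
```

Because every row has the same columns and types, the database can answer the question with a single declarative query and no custom parsing.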

Semi-Structured Data

Semi-structured or partially structured data is a mixture of structured and unstructured data. It has some level of organization, such as metadata or tags, but does not conform to a rigid, fixed schema. Semi-structured data is commonly found in XML files, JSON documents and similar self-describing formats. This type of data is usually stored in a NoSQL database, such as a wide-column store or an object/document database, because it does not map cleanly onto the fixed tables of a traditional relational database.
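To make the idea concrete, here is a small sketch using a handful of made-up JSON records. Each record is tagged with field names, but the set of fields differs from one record to the next, which is what makes a single fixed table an awkward fit. All record contents are hypothetical.

```python
import json

# Hypothetical product records: every record has "name", other fields vary.
raw = """
[
  {"name": "kettle", "price": 25.0, "tags": ["kitchen", "electric"]},
  {"name": "novel", "author": "Jane Doe"},
  {"name": "lamp", "price": 40.0, "specs": {"watts": 9}}
]
"""

records = json.loads(raw)


def field_names(items):
    """Return the sorted union of keys that appear in any record."""
    return sorted({key for item in items for key in item})


def names_with(items, field):
    """Return the names of the records that actually contain `field`."""
    return [item["name"] for item in items if field in item]


print(field_names(records))
print(names_with(records, "price"))
```

The keys act as lightweight structure that code can rely on, while optional and nested fields remain free to vary per record.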

Unstructured Data

Unstructured data refers to data that does not have a specific format or structure. This data type is often created by humans in forms such as text, images, videos, emails and social media posts. However, unstructured data can also include less common examples like protein structures, executable file hashes and human-readable code, among others — the possibilities are endless.

Below are some specific examples of unstructured data, both machine-generated and human-generated.

  • Sensor data: Data collected from various sensors, including temperature, humidity, GPS and motion sensors.
  • Machine log data: Data generated by machines, devices or applications, including system logs, application logs and event logs.
  • Internet of Things (IoT) data: Data collected from smart devices, including smart thermostats, home assistants and wearable devices.
  • Computer vision data: Data generated by computer vision technologies such as image recognition, object detection and video analysis.
  • Natural Language Processing (NLP) data: Data generated by NLP technologies, such as speech recognition, language translation and sentiment analysis.
  • Web and application data: Data generated by web servers, web applications and mobile applications, including user behavior data, error logs and application performance data.
  • Emails: Email messages typically contain unstructured text, images and attachments.
  • Text messages: Text messages can be informal, unstructured and contain abbreviations or emojis.
  • Social media posts: Social media posts can vary in structure and content, including text, images, videos and hashtags.
  • Audio recordings: Human-generated audio recordings, including phone calls, voicemails, audio files and audio notes.
  • Handwritten notes: Handwritten notes can be unstructured and may contain drawings, diagrams and other visual elements.
  • Meeting notes: Meeting notes can contain unstructured text, diagrams and action items.
  • Transcripts: Transcripts of speeches, interviews and meetings can contain unstructured text with varying degrees of accuracy.
  • User-generated content: User-generated content on websites and forums can be unstructured data, including free-form text, images and video files.

Analyzing Unstructured Data Is Challenging

Working with unstructured data can be challenging because it lacks a standardized format. Querying and analyzing it is considerably more complicated than querying and analyzing structured or semi-structured data.

Finding or filtering specific items in a database is simple when dealing with structured or semi-structured data. For instance, retrieving the first book by a particular author from MongoDB takes only a single filter on the author field, issued through a driver such as pymongo.
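A minimal sketch of such a lookup follows. It assumes a hypothetical `library` database with a `books` collection whose documents carry an `author` field, and a MongoDB server at the default local address; none of these names come from a real deployment. The only pymongo calls used are `MongoClient` and `Collection.find_one`.

```python
def get_books_collection(uri="mongodb://localhost:27017"):
    """Connect to the hypothetical `library.books` collection.

    pymongo is imported here, not at module level, so the query helper
    below can be read and exercised without a MongoDB server.
    """
    from pymongo import MongoClient

    client = MongoClient(uri)
    return client["library"]["books"]


def first_book_by(books, author):
    """Return the first document whose `author` field equals `author`, or None."""
    return books.find_one({"author": author})
```

With a server running, `first_book_by(get_books_collection(), "Some Author")` returns the first matching document in the collection's natural order. If "first" should mean the earliest published book, `find_one` also accepts a `sort` argument, for example `sort=[("published", 1)]`, where `published` is again an assumed field name.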


This query methodology is similar to traditional relational databases, which filter and retrieve data through SQL statements. The basic idea is the same: databases built for structured or semi-structured data perform filtering and querying using mathematical (such as <=, string distance) or logical (EQUALS, NOT) operators across numerical values and strings. For traditional relational databases, this is called relational algebra. That’s why they always return exact matches for a given set of filters.

However, traditional relational databases and data management tools cannot handle the complexities of unstructured data analysis. For instance, if a user wants to find similar shoes based on a collection of shoe pictures taken from different angles, a relational database would be unable to comprehend the nuances of shoe style, size, color, etc., based solely on the raw pixel values of those images. This poses a significant challenge for industries and companies that use unstructured data: How can we transform, store and search unstructured data as readily as we do structured and semi-structured data?

How to Search and Analyze Unstructured Data

To address the challenge of analyzing and searching unstructured data, specialized software and techniques such as machine learning or, more specifically, deep learning are used. Machine learning is a branch of artificial intelligence that allows computers to learn patterns from data without being explicitly programmed. Many deep learning models can convert a single piece of unstructured data into a list of floating-point values, more commonly known as an embedding or embedding vector, which can then be searched and analyzed for insights.

Figure: How machine learning models process unstructured data

For example, the widely used ResNet-50 convolutional neural network can represent the image below as a vector of length 2048. The first three and last three elements of this vector are: [0.1392, 0.3572, 0.1988, …, 0.2888, 0.6611, 0.2909].

Photo by Patrice Bouchard

Embeddings generated by a properly trained neural network possess mathematical properties that make them easy to search and analyze. For example, embedding vectors for semantically similar objects are close to each other in terms of distance. As a result, by using vector arithmetic, unstructured data can be understood, searched and analyzed.
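The "close to each other in terms of distance" property can be sketched with a few hand-written vectors. The three-dimensional values below are invented purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions. Cosine similarity is used here as one common closeness measure, not necessarily the one any particular system uses.

```python
import math

# Toy 3-dimensional "embeddings", hand-written for illustration only.
EMBEDDINGS = {
    "sneaker": [0.9, 0.1, 0.0],
    "running shoe": [0.8, 0.2, 0.1],
    "sandal": [0.6, 0.5, 0.1],
    "teapot": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean 'similar direction'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, k=2):
    """Return the k item names closest to `query`, excluding `query` itself."""
    scores = [
        (cosine_similarity(embeddings[query], vector), name)
        for name, vector in embeddings.items()
        if name != query
    ]
    scores.sort(reverse=True)  # highest similarity first
    return [name for _, name in scores[:k]]


print(most_similar("sneaker", EMBEDDINGS))
```

Unlike the exact-match filters used for structured data, this kind of search ranks every item by how near it is to the query, so the other shoes score far above the teapot even though no field matches exactly.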

Figure: Embedding arithmetic

Why Should You Work With Unstructured Data?

Even though handling unstructured data can be challenging, it is still valuable for developers and businesses. By commonly cited estimates, unstructured data makes up roughly 80% of both existing and newly generated data, especially in the age of AI. It contains a wealth of information that can provide valuable insights into customer behaviors, market trends and other essential business metrics for more accurate decision-making. Thanks to technological advancements, such as natural language processing and deep learning, managing unstructured data is becoming easier over time.

Furthermore, working with unstructured data can help you discover hidden patterns and relationships that would be challenging to detect through traditional methods. Handling unstructured data well can also drive innovation and product development. We’ve already seen breakthrough applications, services and products built on large language models (LLMs), such as those behind OpenAI’s ChatGPT, to extract value from unstructured data. There will be even more in the future.

Summary

In this post, we covered the definition of unstructured data and looked at examples of it. We also explored the difficulties of handling and analyzing unstructured data, along with techniques for doing so, to support informed business decisions.

In my upcoming posts, I will delve deeper into vector databases, a simple yet effective solution to store, index and search unstructured data using the power of embeddings generated by machine learning models. I will also introduce Milvus, a highly scalable and effective open source vector database, and elaborate on how Milvus can supercharge your AI-powered applications. Stay tuned for more information.
