Ensuring Good Data in AI Applications: 4 Key Strategies

By Adam Clarke

12th Mar 2024

Data should be useful

It might be tempting to think that one should just hoard as much of everything as one can, just in case! In reality, this can incur a lot of unnecessary effort and storage cost with no guarantee of a payoff later. Data should be collected either as a necessary side-effect of known processes or with a specific goal in mind.

Data should be well-formatted

The foundation of any robust AI or ML system lies in the format of the data it uses. Properly formatted data ensures ease of analysis, processing, and compatibility with various tools and algorithms. Consistency in data representation is crucial – it allows for seamless integration of data from multiple sources and simplifies the data processing pipeline. Imagine a scenario where each data source uses different formats for dates or addresses. The effort to standardise these before any meaningful analysis or machine learning can occur is not just time-consuming but also prone to errors. A consistent format across the dataset reduces these complexities and paves the way for more efficient data handling.
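As a minimal sketch of the date example above, the snippet below converts dates from several sources into a single ISO 8601 representation. The source names (`crm`, `billing`, `web`) and their formats are hypothetical, chosen purely for illustration; a real pipeline would record whichever format each source actually uses.

```python
from datetime import date, datetime

# Hypothetical formats, one per upstream source (assumed for illustration).
SOURCE_DATE_FORMATS = {
    "crm": "%d/%m/%Y",      # day first, e.g. 12/03/2024
    "billing": "%m-%d-%Y",  # month first, e.g. 03-12-2024
    "web": "%Y-%m-%d",      # already ISO 8601, e.g. 2024-03-12
}


def normalise_date(raw: str, source: str) -> date:
    """Parse a date string using the format declared for its source."""
    fmt = SOURCE_DATE_FORMATS[source]  # raises KeyError for an unknown source
    return datetime.strptime(raw.strip(), fmt).date()


records = [
    ("12/03/2024", "crm"),
    ("03-12-2024", "billing"),
    ("2024-03-12", "web"),
]

# All three strings describe the same day once the source format is known.
normalised = [normalise_date(raw, source).isoformat() for raw, source in records]
```

Note that "12/03/2024" and "03-12-2024" are ambiguous on their own; it is only the declared per-source format that makes them safe to interpret, which is why agreeing on a single format at collection time is cheaper than untangling it later.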

The "Schema-less" Trap

In recent years, the allure of 'schema-less' data systems has grown, particularly in environments that emphasise flexibility and rapid development. However, this approach can be a false economy. The trap here is based on the belief that by avoiding strict data schemas, one can save time and effort. In reality, this approach merely shifts the burden of dealing with data complexities to a later stage where it is more costly and time-consuming to resolve. Simply ensuring that data is consistent and reliable from the moment you begin to collect it makes it far easier to use down the line. Without that, you may face a mess of issues and costly migrations before any value can be gained from it.

In systems where data feeds multiple processes, the effort required to handle unstructured data multiplies with each consumer, increasing engineering effort and cost. In contrast, defining a data schema early in the project lifecycle can significantly reduce these challenges. It streamlines data handling, ensures consistency, and simplifies the integration of new data sources. More importantly, it mitigates the risks of data misinterpretation and errors, ensuring that the data used is reliable and fit for purpose.
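To make the idea of defining a schema early more concrete, here is a rough sketch using only the Python standard library. The `CustomerRecord` type, its fields, and the `parse_customer` helper are hypothetical names invented for this example; in practice a dedicated validation library or database constraints might do the same job.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CustomerRecord:
    """Hypothetical schema: every consumer can rely on these fields and types."""
    customer_id: int
    email: str
    signup_date: date


REQUIRED_FIELDS = {"customer_id", "email", "signup_date"}


def parse_customer(raw: dict) -> CustomerRecord:
    """Validate a raw record at the point of collection.

    Raises ValueError if a field is missing or cannot be converted.
    """
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    email = str(raw["email"]).strip().lower()
    if "@" not in email:  # deliberately crude check, for illustration only
        raise ValueError(f"invalid email: {raw['email']!r}")

    return CustomerRecord(
        customer_id=int(raw["customer_id"]),                     # ValueError if not numeric
        email=email,
        signup_date=date.fromisoformat(str(raw["signup_date"])),  # ValueError if not ISO 8601
    )


record = parse_customer(
    {"customer_id": "42", "email": " Ada@Example.com ", "signup_date": "2024-03-12"}
)
```

Because bad records are rejected once, at the boundary, each downstream process can work with `CustomerRecord` directly instead of repeating its own defensive checks.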

That being said, there are some cases where it's hard to avoid. If you're dealing with a "Data Lake" scenario, potentially involving lots of different datasets, then it may be a risk that you end up having to take. In these cases, ensure that good metadata (data about data) makes it clear what came from where. Each system that integrates with the lake should either bring data into a known, validated schema before processing it, or have robust validation to ensure that changes in the upstream data don't break downstream processes.
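One possible shape for this, sketched under assumptions: each batch is checked against the columns a consumer expects, and is tagged with provenance metadata before it is accepted. The `Provenance` fields, the `EXPECTED_COLUMNS` mapping, and the sensor data are all hypothetical and stand in for whatever your own systems record.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Hypothetical metadata recording where a batch came from."""
    source_system: str
    schema_version: str
    ingested_at: datetime


# Columns and types this consumer expects (assumed for illustration).
EXPECTED_COLUMNS = {"sensor_id": str, "reading": float}


def validate_batch(rows: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the batch is acceptable."""
    problems = []
    for i, row in enumerate(rows):
        if row.keys() != EXPECTED_COLUMNS.keys():
            problems.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        for name, expected_type in EXPECTED_COLUMNS.items():
            if not isinstance(row[name], expected_type):
                problems.append(
                    f"row {i}: {name} is {type(row[name]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


def ingest(rows: list[dict], source_system: str, schema_version: str):
    """Reject a batch that drifted from the expected schema; otherwise tag it."""
    problems = validate_batch(rows)
    if problems:
        raise ValueError("; ".join(problems))
    meta = Provenance(source_system, schema_version, datetime.now(timezone.utc))
    return meta, rows


meta, accepted = ingest([{"sensor_id": "a1", "reading": 20.5}], "plant_a", "v1")
```

If an upstream system starts sending `reading` as a string, or adds or drops a column, the batch fails loudly at ingestion rather than silently corrupting whatever runs downstream.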

Data integrity should be maintained

Maintaining data integrity is essential for ensuring that the data used in AI and ML systems is accurate, consistent, and reliable. Data integrity refers to the overall completeness, accuracy, and consistency of data throughout its lifecycle. This does relate to the prior point regarding formatting, but takes things a step further to ensure that data is sufficiently protected.