Cleaning Up Your Data Lake


The term “data lake” came from a blog post composed by James Dixon, CTO of Pentaho. Dixon wrote the post in 2010 when trying to distinguish this new type of data store from databases and data marts.

 “…the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Tremendous value can be found in an organization’s data lake. This is especially true with the decreasing cost of storage during the last few years. The ability to analyze original data for a new insight, or to combine it with new data obtained from an acquisition or merging of departments, can bring value to those holding the data. In this way, the data is an actual commodity. If it is not processed and stored properly from the beginning, however, it is more of a deteriorating commodity than a gem or a metal.

Faulty storage can be as simple as files stored in the data lake without extensions and in a multitude of formats. This may seem like an innocent mistake, but fast forward a few fiscal quarters to when someone wants to use that data. Opening a folder of files only to find mixed formats will significantly delay a project and frustrate everyone involved.
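To make the problem concrete, here is a minimal sketch of how a folder of extension-less files could be inventoried by inspecting the first bytes of each file. The function names guess_format and inventory are hypothetical and exist only for this illustration. The magic bytes for Parquet, ORC, Avro, and gzip are standard, but the JSON and CSV checks are rough heuristics, so treat the output as a hint rather than a guarantee.

```python
from pathlib import Path

# Leading "magic" bytes used by a few common data lake file formats.
MAGIC_BYTES = [
    (b"PAR1", "parquet"),
    (b"ORC", "orc"),
    (b"Obj\x01", "avro"),
    (b"\x1f\x8b", "gzip"),
]


def guess_format(head: bytes) -> str:
    """Guess a file format from the first bytes of a file (hypothetical helper)."""
    for magic, name in MAGIC_BYTES:
        if head.startswith(magic):
            return name
    # Heuristic only: JSON documents usually open with an object or an array.
    if head.lstrip()[:1] in (b"{", b"["):
        return "json"
    # Heuristic only: a comma in the first line suggests delimited text.
    first_line = head.splitlines()[0] if head else b""
    if b"," in first_line:
        return "csv"
    return "unknown"


def inventory(folder: str) -> dict:
    """Map each file name in a folder to its guessed format (hypothetical helper)."""
    result = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            with path.open("rb") as handle:
                result[path.name] = guess_format(handle.read(64))
    return result
```

Running inventory over a landing folder before a project starts gives a quick picture of how mixed the contents really are, without opening each file by hand.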

When architected properly, data lakes have several advantages for retaining, supporting, and adapting to data changes. If there are errors in the architecture, those same advantages turn into detriments, threatening to transform your data lake into a data swamp.


Let’s take a moment to review three characteristics commonly regarded as the architectural foundations of a data lake:

Data Lakes Retain All Data
This is especially true when migrating from a license-based, on-premises data warehouse to a data lake built on cloud blob storage. After the migration, there are plenty of ways to add “streams” to the data lake, including streaming data solutions such as Kinesis Streams and Apache Flume.

Data Lakes Support All Data Types
The very nature of a data lake is that it is somewhere for data to flow into without validation. Hence, it supports all data types. This includes files from a multitude of operating systems, holding numerous special characters and carriage returns, with or without headers, to name just a few examples of the anything and everything that is accepted.

Data Lakes Easily Adapt to Changes
If you or your program suddenly decide to change data formats, let’s say from CSV to ORC, the data lake would be the last component to throw an error message. The same cannot easily be said of most other systems or software.

Where is the Governance?

When creating a data lake, it’s important to first consider governance. Tough questions surrounding security and how to tier the data from a raw-to-processed standpoint will make all the difference between a usable data lake and a data swamp.

These decisions are more easily made when architecting and constructing the system, rather than retrofitting the governance later.

Does your team know the value of metadata?

Simply put, metadata is data about data. The most common form of metadata is technical metadata. An easy way to think about metadata is to imagine the table of contents or index for a book. Although the table of contents doesn’t add to the actual knowledge base or story of the book itself, it will allow you to quickly find the section of the book you are seeking. Without metadata, however, you are forced to randomly select a page.

In the same way that a table of contents allows you to quickly jump to the correct page of a book, the metadata store in a data lake allows you to quickly find the data you need. With the metadata store, you already have the Data Definition Language (DDL) when you connect to the data lake, along with an instant catalog of what is contained in it.
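As a rough illustration of what technical metadata looks like, the sketch below infers column names and types from a small CSV sample and renders them as a DDL statement. It is not the way any particular metadata store works. The helper names infer_type and build_ddl, the three type names, and the sample table are all assumptions made for this example.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of three illustrative SQL types that fits every value."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "BIGINT"
    if values and all_parse(float):
        return "DOUBLE"
    return "STRING"


def build_ddl(table, csv_text, sample_rows=100):
    """Build a CREATE TABLE statement from a CSV header plus sample rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Read at most sample_rows rows; a real catalog would sample far more carefully.
    samples = [row for _, row in zip(range(sample_rows), reader)]
    columns = []
    for index, name in enumerate(header):
        # Skip empty cells and short rows so they do not force a STRING type.
        values = [row[index] for row in samples if index < len(row) and row[index] != ""]
        columns.append(f"  {name} {infer_type(values)}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n)"


# Hypothetical sample data, used only to show the shape of the result.
print(build_ddl("orders", "id,amount,city\n1,9.5,Atlanta\n2,3,Boston\n"))
```

Once definitions like this are kept alongside the files, anyone who can query a database can discover what the lake contains without opening the raw files.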

The Big Data team at Relus has engaged with many customers dealing with data lake issues. The team implements a solution that doesn’t drain the lake, but maintains a status quo for projects going forward – no matter what the incoming file format. Best of all, you aren’t required to be a data scientist to access this new purified data lake. If you know how to query a database or make a connection in your favorite visualization software, you are the person our team kept in mind when we designed this game-changing solution.

If you would like assistance with cleaning and maintaining your data lake, contact our AWS Big Data Competent team today.
