What is a Data Lake, and how is it beneficial?
Have you ever heard of a Data Lake? You're probably familiar with a Data Warehouse; a Data Lake, by contrast, is a storage location that holds large amounts of structured and unstructured data. Structured data is highly organised data stored in a predetermined format. In contrast, unstructured data is a collection of different types of data stored in its native format.
A Data Lake differs from the perhaps more familiar Data Warehouse in that it holds its data in a flat structure rather than organising it into a hierarchy of files and folders.
Being a centralised storage location for structured and unstructured data, Data Lakes allow businesses to scale. Users can access data whenever and however they need it, and data scientists can easily apply analytics to gain valuable insights. What does this all mean? Compared to the Data Warehouse, a Data Lake provides flexibility.
Let's consider some examples to understand the value of a Data Lake. Tweets, images, voice and streaming data are unstructured: they are in different formats, come from various sources and vary in size. Regardless of how slowly or quickly the information is ingested, and no matter its structure, a Data Lake is designed to store and manage it effectively.
Here are some formats for how information is stored in a Data Lake:
- Structured data, such as rows and columns from relational database tables (data organised in a predefined relationship).
- Semistructured data, like delimited flat text files and schema-embedded files. Examples are photo and video files that carry meta tags for location and date: the tags are structured, but the content of the file itself is not, so the data is considered semistructured.
- Unstructured data. The most abundant data available today is data that is not in a structured database format. Unstructured data can be human-generated or machine-generated and may appear in text form or in image, video or voice form. This type of data may come from social media, the Internet of Things, and other sources.
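The three formats above can be sketched with a toy example. Here a temporary local folder stands in for the lake's object store (a hypothetical stand-in; real Data Lakes typically sit on cloud object storage such as S3 or ADLS), and one file of each kind lands side by side in the same flat location:

```python
import json
import tempfile
from pathlib import Path

# A local folder standing in for a Data Lake's object store (hypothetical;
# in production this would be cloud object storage).
lake = Path(tempfile.mkdtemp()) / "lake"
lake.mkdir(parents=True)

# Structured: rows and columns, e.g. a CSV extract from a relational table.
(lake / "orders.csv").write_text("order_id,amount\n1,9.99\n2,24.50\n")

# Semistructured: JSON metadata tags (location, date) for a media file.
(lake / "photo_meta.json").write_text(
    json.dumps({"file": "IMG_001.jpg", "location": "Dublin", "date": "2022-05-01"})
)

# Unstructured: raw bytes stored in their native format (here, fake JPEG bytes).
(lake / "IMG_001.jpg").write_bytes(b"\xff\xd8\xff\xe0 ...raw image bytes...")

for f in sorted(lake.iterdir()):
    print(f.name)
```

The point of the sketch is that nothing forces a schema at write time: all three objects are simply stored as-is, and structure is applied later, when the data is read and analysed.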
The Value of a Data Lake
A Data Lake is a storage repository that is accessed by a variety of tools and programs. It is open-standard, meaning that the data is accessible and readily usable by businesses, breaking down existing silos and enabling businesses to get the most out of their data. In addition, they can use various analytics services to process their data and scale storage and processing capacity over time. Here are just a few examples where Data Lakes have added value to a business:
Improved customer interactions
A Data Lake can help companies better understand their customers - from churn rates to intent and value - by combining data from their CRM platform, social media analytics, and other sources.
Improve R&D innovation choices
A Data Lake can be used to design cutting-edge products and services in a more streamlined way, allowing product design teams to validate and test their hypotheses efficiently. The rewards of this practice can include more effective products for the customer and lower bug-fixing costs.
Increase operational efficiencies
Companies in the Internet of Things (IoT) space, which collect vital manufacturing process data and real-time data from connected devices, are a good illustration of how a Data Lake can increase operational efficiencies. A Data Lake makes it easy to store and run analytics on machine-generated IoT data and can help discover ways to reduce operational costs and increase quality.
Spark & Workhuman
Spark customer, Workhuman, is a cloud-based provider of human capital management solutions. The company has several products and integrates with many third-party applications and platforms. Available to internal and external users, Workhuman utilises a data platform to monitor and understand how these products are deployed and used. The data platform is sourced from Oracle, but it suffered from problems with data updating and error recovery. To address these problems, Spark optimised Workhuman's Data Lake.
In the current state, data is updated and migrated once a day. Because this daily update does not differentiate new data from old, it takes far too long for Workhuman to receive and analyse accurate data, and if an error occurs, the process must restart from the beginning. Spark's goal is to update the migration process so that only new data migrates. This reduces the time to access near real-time data, the load on the system and the cost of subsequent migrations.
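One common way to migrate only new data is a watermark-based incremental load, where each run copies just the rows updated since the last successful run. The sketch below illustrates the idea under stated assumptions: a plain Python list stands in for the Oracle source, the rows and their `updated_at` column are hypothetical, and the real pipeline's details are not described in this article:

```python
from datetime import datetime

# Hypothetical source rows, each stamped with a last-updated time.
# A list stands in for the relational source here.
source_rows = [
    {"id": 1, "value": "a", "updated_at": datetime(2022, 1, 1)},
    {"id": 2, "value": "b", "updated_at": datetime(2022, 1, 2)},
    {"id": 3, "value": "c", "updated_at": datetime(2022, 1, 3)},
]

def incremental_migrate(rows, lake, watermark):
    """Copy only rows newer than the previous run's watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    lake.extend(new_rows)
    # Advance the watermark so a failed run can resume from here
    # instead of restarting the whole migration from the beginning.
    return max((r["updated_at"] for r in new_rows), default=watermark)

lake = []
watermark = datetime(2022, 1, 1)  # recorded by the previous run
watermark = incremental_migrate(source_rows, lake, watermark)
print(len(lake))  # only the two newer rows are migrated
```

Because the watermark only advances after rows are copied, re-running the function after a failure picks up where the last successful run left off, which is the property that reduces both migration time and recovery cost.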
Additionally, Spark outlined and executed the strategy to add additional data sources to the existing Oracle ones, building a Data Lake that is accessible and usable for all Workhuman's data. Given the technical complexity, the result is a Data Lake that is easy to use, understand, modify and scale with the business needs.
Spark Senior Data Engineer, Joao Miranda, speaks fondly about the relationship between Spark and Workhuman:
"I'd say that they (Workhuman) are very understanding and adaptable, so it's easy to have a conversation with them and come to a consensus on any obstacle that might appear in the project. They are also very technically knowledgeable, so most of the technical solutions we come up with are often collaborative instead of just Spark coming up with something and getting it approved. Everyone we have direct contact with has been consistently helpful and made day-to-day work very easy."