Foundation of Data Lakes: A Partnership with Workhuman


What is a Data Lake, and how is it beneficial?


A data lake is a centralised storage solution that allows businesses to store all their structured and unstructured data at any scale. As a result, business users can access the data whenever needed, and data scientists can apply analytics to gain valuable insights. A data lake is more flexible than a data warehouse. 



It is ideal for storing large volumes of unstructured data such as tweets, images, voice, and streaming data. That said, it can store any data, whatever its source, size, or structure, and however slowly or quickly it arrives.


Some of the formats for information stored in data lakes are:

  • Structured data, such as rows and columns from relational database tables.
  • Semistructured data, like delimited flat text files and schema-embedded files.
  • Unstructured data – including social media content and data from the Internet of Things (IoT) – as well as documents, images, voice and video.

The Value of a Data Lake

Data lakes can make it easier to access a larger volume of data, from a wider variety of sources, in less time. This helps businesses make better decisions because collaboration and data analysis can be done in many different ways. Examples of where data lakes have added value include:


Improve customer interactions

 A Data Lake can combine customer data from a CRM platform, social media analytics, a marketing platform, or any other dataset to empower the business to understand its most profitable customer cohort and the cause of customer churn.


Improve R&D innovation choices 

 A data lake can also help your R&D teams test their hypotheses, refine assumptions, and assess results. This will lead to better product design and a greater understanding of customer needs.


Increase operational efficiencies 

 The Internet of Things (IoT) is expanding the ways companies collect data on manufacturing processes, enabling real-time data collection from connected devices. A data lake makes it easy to store and run analytics on machine-generated IoT data that could help you discover ways to reduce operational costs and increase quality.



Spark & Workhuman


At Spark, we have worked with Workhuman to optimise their data lake. We did not build this project from scratch; we are updating and improving the existing implementation.

Workhuman’s current implementation migrates all data once a day. Spark’s goal is to update the migration processes so that only new data needs to be migrated. Because migration runs only once a day, Workhuman has to wait long periods to get an accurate, updated version of the data. We aim to reduce that wait significantly and give Workhuman near real-time access to updated data, while substantially reducing both the load on their system and the cost of the migration process.
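To illustrate the idea of migrating only new data, here is a minimal sketch of one common approach, a "high-water mark" on a last-updated timestamp. It is not Workhuman's or Spark's actual implementation: the function name, the `updated_at` field, and the sample rows are all hypothetical and chosen only for illustration.

```python
from datetime import datetime


def select_new_rows(rows, watermark):
    """Return the rows changed after `watermark` and the advanced watermark.

    `rows` is a list of dicts with a hypothetical `updated_at` field.
    """
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark


# Hypothetical source table (stand-in for rows read from a database).
source = [
    {"id": 1, "updated_at": datetime(2023, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2023, 1, 1, 12, 0)},
    {"id": 3, "updated_at": datetime(2023, 1, 2, 8, 30)},
]

# Timestamp of the last successful migration run (assumed to be stored somewhere).
last_run = datetime(2023, 1, 1, 10, 0)

batch, last_run = select_new_rows(source, last_run)
print([r["id"] for r in batch])  # [2, 3]
```

Because each run moves only the rows changed since the previous one, runs can be scheduled far more often than once a day without re-copying data that has not changed.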

Additionally, part of our plan is to add data sources beyond Oracle, making it a true data lake for all the data that Workhuman uses. The project involves a large amount of technical complexity, but the result will be a data lake that is easy to use, understand, and modify.

One of our Senior Data Engineers, João, spoke fondly about the relationship between Spark and Workhuman. “I’d say that they are very understanding and adaptable, so it’s easy to have a conversation with them and come to a consensus on any obstacle that might appear in the project. They are also highly technically knowledgeable, so most of the technical solutions we come up with are often collaborative instead of just Spark coming up with something and getting it approved. Everyone we have direct contact with has been consistently helpful and made day-to-day work very easy.”


Interested in learning more? Book a discovery workshop and let us help your business achieve success.