The Happy Data Platform: A Personified Perspective (Part I)


Introduction

“True happiness comes only by making others happy.” – David O. McKay

Taking a cue from the quote above, a data platform can be truly happy only if it makes others happy. “Others” in this context are the actors and teams with whom the data platform interacts. Below are the key actors that typically interact with the data platform:

  • Data Engineers
  • Data Consumers
    • Data Analysts
    • Data Scientists/Machine Learning Engineers
    • External Data Consumers like partners & data buyers
  • DataOps Engineers
  • Data Stewards & Admins (for Data Governance)

This blog identifies the common expectations that Data Engineers and Data Consumers have for a data platform, and it demonstrates how to meet these expectations. DataOps and Data Governance are also extremely important aspects of a comprehensive, end-to-end data platform, so we will cover the perspectives of DataOps Engineers and Data Stewards & Admins in Part II of this blog series.


Great (User) Expectations

Happy Data Engineers

Data Engineers typically expect the following from a data platform:

  • If something has already been built, I should not have to waste my time recreating it.
  • I should have the means to discover and reuse existing data platform components (e.g., extractors, transformers, loaders, connectors) and data assets (e.g., already ingested data sets).
  • I should have access to a framework that lets me stitch together modular components, reusing the ones that already exist (see the sketch after this list).
  • I don’t expect every data pipeline scenario to be covered by the existing components alone. I know there may be a need to extend existing components or create new ones, and I want the framework to let me do both and stitch the results into new pipelines.
  • To do that effectively, I should know exactly how components must be built so that they stitch into pipelines cleanly.
  • I want to build components in a way that my work can be used not only by me but also by the larger data engineering community within the organization.
  • I want first-class CI/CD integration, including easy access to the resources and services needed to get started, as well as the ability to move data pipelines across environments.
  • I would like the ability to version my pipelines in a smooth, integrated manner.
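
To make these expectations concrete, below is a minimal sketch of what such a component framework might look like, assuming Python as the implementation language. All names here (Extractor, Transformer, Loader, Pipeline) are hypothetical, invented for illustration rather than taken from any specific product:

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, List

# Hypothetical base contracts: any component honoring these interfaces
# can be discovered, reused, and stitched into new pipelines.
class Extractor(ABC):
    @abstractmethod
    def extract(self) -> Iterable[Any]: ...

class Transformer(ABC):
    @abstractmethod
    def transform(self, records: Iterable[Any]) -> Iterable[Any]: ...

class Loader(ABC):
    @abstractmethod
    def load(self, records: Iterable[Any]) -> None: ...

class Pipeline:
    """Stitches existing components into an end-to-end pipeline."""

    def __init__(self, extractor: Extractor,
                 transformers: List[Transformer], loader: Loader) -> None:
        self.extractor = extractor
        self.transformers = transformers
        self.loader = loader

    def run(self) -> None:
        records = self.extractor.extract()
        for transformer in self.transformers:
            records = transformer.transform(records)
        self.loader.load(records)

# Extending the framework means honoring the contract, nothing more,
# so a new component is immediately reusable by other engineers.
class UpperCaseTransformer(Transformer):
    def transform(self, records: Iterable[Any]) -> Iterable[Any]:
        return (str(r).upper() for r in records)
```

A new extractor or loader is added the same way, which is what makes a component reusable beyond its original author.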

Happy Data Consumers

Data consumers typically expect the following from a data platform:

  • I should know exactly which Golden Records/Versions of processed data already exist.
  • I should be able to verify that the data is trustworthy and fit for purpose (e.g., by checking its lineage to confirm that I am looking at the right data for my requirement).
  • I should know the exact process required to access the data sets.
  • Based on the exact needs of the use case, I should be able to leverage different access patterns, such as streaming, bulk export/copy, queries, and APIs.
  • If I need a new data set, I should be able to get it serviced quickly.
  • I should be able to share datasets and collaborate with other users.
  • I should be able to add custom metadata like tags and comments (a sketch of these consumer-facing operations follows this list).
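
As an illustration, here is a hedged sketch of those operations against a toy, in-memory catalog. Everything in it (Dataset, Catalog, the field and method names) is hypothetical; a real platform would back this with a proper data catalog service:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical catalog entry: a golden record with version, lineage,
# and user-supplied metadata such as tags.
@dataclass
class Dataset:
    name: str
    version: str
    lineage: List[str] = field(default_factory=list)  # upstream sources
    tags: Dict[str, str] = field(default_factory=dict)

class Catalog:
    def __init__(self) -> None:
        self._datasets: Dict[str, Dataset] = {}

    def register(self, ds: Dataset) -> None:
        self._datasets[ds.name] = ds

    def find(self, keyword: str) -> List[Dataset]:
        """Discover which golden records/versions already exist."""
        return [d for d in self._datasets.values() if keyword in d.name]

    def add_tag(self, name: str, key: str, value: str) -> None:
        """Attach custom metadata such as tags or comments."""
        self._datasets[name].tags[key] = value

catalog = Catalog()
catalog.register(Dataset("sales_golden", "v3", lineage=["crm_raw", "erp_raw"]))
catalog.add_tag("sales_golden", "quality", "certified")
print(catalog.find("sales")[0].lineage)  # check lineage: ['crm_raw', 'erp_raw']
```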


Building a Happy Data Platform 

“Efforts and courage are not enough without purpose and direction.” – John F. Kennedy

Approach 1: Build for Platform Feature

Below is a traditional, technology-driven approach:

  1. Ingest all the data from multiple systems.
  2. Build tightly coupled pipelines for each use case, from ingestion through data processing and storage (a sketch of such a pipeline follows this list).
  3. Some approaches even aim to generate every possible component for extracting, processing, loading, and exposing data, including both batch and stream processing.
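
To see why coupling hurts, consider this hedged sketch of a use-case-specific pipeline (all table names and data are invented). Extraction, business logic, and loading are hard-wired together:

```python
import sqlite3

# Anti-pattern sketch: one monolithic function per use case.
# Source, transformation logic, and destination are all hard-wired,
# so none of these steps can be reused by another pipeline.
def monthly_sales_report_pipeline() -> None:
    src = sqlite3.connect(":memory:")  # stands in for a real CRM source
    dst = sqlite3.connect(":memory:")  # stands in for a real warehouse

    # Seed the fake source so the sketch is self-contained.
    src.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    src.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("EMEA", 120.0), ("APAC", 80.0), ("EMEA", 50.0)])

    # Extraction, business logic, and loading fused into one block.
    totals: dict = {}
    for region, amount in src.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    dst.execute("CREATE TABLE monthly_sales (region TEXT, total REAL)")
    dst.executemany("INSERT INTO monthly_sales VALUES (?, ?)",
                    list(totals.items()))
    dst.commit()

monthly_sales_report_pipeline()
```

Swapping the source, reusing the aggregation, or adding a second destination all require editing this one function, which is exactly the coupling the issues below describe.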

However, there are a few issues with this approach:

  • It doesn’t take long for users to accumulate a lot of data in the lake without knowing what to do with it. The data lake turns into a data swamp, and it becomes increasingly difficult to derive value from the data.
  • Tightly coupled pipelines offer limited opportunity for reuse.
  • Return on investment and time to value can become significant challenges.

Approach 2: Build for Purpose

Below is an approach driven by business needs:

  • Create a framework that enables extensibility, modularity, and flexibility through configurations, templates, etc. (see the configuration-driven sketch after this list).
  • Explore and discover already existing data and data platform component assets that can be reused.
  • Implement specific, prioritized, business-driven use cases by leveraging the framework, creating reusable data platform component assets along the way.
  • Get the needed data for the specific use case.
  • Create platform components based on the framework so they can be reused.
  • Build Data Apps like Data Validator, Schema Mapper, etc.
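
Building on the earlier framework sketch, below is a hedged illustration of configuration-driven assembly, again with invented names: a registry maps component names to implementations, so a use case team can declare a pipeline as configuration while the core platform team owns the components:

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical component registry: the Core Platform team registers
# reusable steps; Use Case teams assemble pipelines from configuration.
REGISTRY: Dict[str, Callable[..., Callable]] = {}

def register(name: str) -> Callable:
    def decorator(factory: Callable) -> Callable:
        REGISTRY[name] = factory
        return factory
    return decorator

@register("inline_extractor")
def inline_extractor(rows: List[Any]) -> Callable[[], Iterable[Any]]:
    return lambda: iter(rows)  # stands in for a real source connector

@register("filter_transformer")
def filter_transformer(min_value: int) -> Callable:
    return lambda records: (r for r in records if r >= min_value)

@register("print_loader")
def print_loader() -> Callable:
    return lambda records: print(list(records))  # stands in for a sink

# The use case is declared as configuration, not bespoke code.
config = [
    {"use": "inline_extractor", "args": {"rows": [3, 7, 1, 9]}},
    {"use": "filter_transformer", "args": {"min_value": 5}},
    {"use": "print_loader", "args": {}},
]

def run(pipeline: List[Dict[str, Any]]) -> None:
    extract, *rest = [REGISTRY[step["use"]](**step["args"]) for step in pipeline]
    *transforms, load = rest
    records = extract()
    for transform in transforms:
        records = transform(records)
    load(records)

run(config)  # prints [7, 9]
```

Because the pipeline is data rather than code, the same definition can be templated, validated, versioned, and promoted across environments.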

While the DataOps mindset is a complete topic unto itself, it is worth mentioning at a high level that it is important to bring a DevOps and Agile approach to a data project. DataOps encompasses all aspects, including infrastructure management, services setup and management, environment setup, access management for data and components, quality, security and compliance, deployments, version control, and monitoring.
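
As one small, hedged illustration of the environment-management and deployment aspects, the same pipeline definition can be parameterized per environment; the environment names and settings below are invented for the sketch:

```python
from dataclasses import dataclass

# Hypothetical per-environment settings so an identical pipeline
# definition can be promoted from dev to production unchanged.
@dataclass(frozen=True)
class Environment:
    name: str
    warehouse_url: str
    schedule: str  # cron-style schedule

ENVIRONMENTS = {
    "dev": Environment("dev", "jdbc:postgresql://dev-db/warehouse", "@hourly"),
    "prod": Environment("prod", "jdbc:postgresql://prod-db/warehouse", "@daily"),
}

def deploy(pipeline_name: str, env_name: str) -> None:
    env = ENVIRONMENTS[env_name]
    # A real deployment would invoke the platform's CI/CD tooling here.
    print(f"Deploying {pipeline_name} to {env.name} "
          f"(target={env.warehouse_url}, schedule={env.schedule})")

deploy("sales_pipeline", "dev")
```

In practice this parameterization would live in the platform’s CI/CD tooling rather than in application code.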

Paying attention to the high-level team setup also enables you to clearly separate team concerns:

  • The Core Platform team works on architecting, designing, and creating technical components for the data platform (e.g., data extractors, loaders, processors and transformers, CI/CD, Infrastructure as Code, etc.).
  • The Use Case Implementation team stitches together a pipeline using the components created by the Core Platform team; configures/extends it as needed; and writes the domain/business logic specific to the use case.


Accelerate Your Own Data Journey

The objective of a data platform is to eventually enable purposeful, actionable insights that can lead to business outcomes. Additionally, if the data platform puts the right emphasis on the journey and process (i.e., how it can make the job easier for its key actors while delivering the prioritized projects), then it will deliver an ecosystem that is fit for purpose, minimize waste, and enable a “reuse” mindset.

At GlobalLogic, we are continuously improving our Data Platform Accelerator, which is based on a similar approach. This digital accelerator enables enterprises to quickly stand up a solution that can gather, transform, and enrich data from across their organization. We are excited to work with our clients to accelerate their data journeys, and we would be happy to discuss your own needs.

Author

Vivek Sinha

Vice President, Technology
