Building Data as a Service

Director, Engineering at ThousandEyes

Problem

At PubMatic, the Big Data & Analytics team is a horizontal team that builds data processing pipelines for different products. Each product team brings its requirements to the Analytics organization, which prioritizes them based on a feature's ROI and market advantage and then develops them. This long process left multiple teams dependent on the Analytics team for reporting and analytics solutions. Teams that decided to build their own data processing pipelines faced other challenges: the steep learning curve of data processing technologies like Spark and MapReduce, the difficulty of writing resource-efficient code, and the need to deliver quality releases.

Actions taken

To address these challenges, we set out to build a self-service data platform (DaaS) for teams across PubMatic. We started with an API service that let users submit data processing jobs (Hadoop/Spark) to the underlying cluster, and later added support for Hive and Presto. We also built a simple ACL model that restricted users to their own YARN queues, so that we could isolate resources between teams. The service monitored the status of each submitted job instance and sent users progress updates as webhook events.
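
The shape of such a service can be sketched in a few lines. This is a minimal in-memory illustration in Python, not the actual PubMatic API: the names `JobService`, `submit`, `report_status`, and the event payload fields are all hypothetical, and the `notify` callable stands in for an HTTP POST to a user's webhook URL.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set
import itertools

# Hypothetical names throughout; this only illustrates the ACL check
# and the webhook-style status notifications described above.


@dataclass
class Job:
    job_id: int
    user: str
    queue: str
    engine: str  # e.g. "spark", "hadoop", "hive", "presto"
    status: str = "SUBMITTED"


class JobService:
    def __init__(self, acl: Dict[str, Set[str]], notify: Callable[[dict], None]):
        self._acl = acl        # user -> YARN queues that user may submit to
        self._notify = notify  # stand-in for an HTTP POST to a webhook URL
        self._jobs: Dict[int, Job] = {}
        self._ids = itertools.count(1)

    def submit(self, user: str, queue: str, engine: str) -> Job:
        # ACL check: users may only submit to their own queues.
        if queue not in self._acl.get(user, set()):
            raise PermissionError(f"{user} may not submit to queue {queue}")
        job = Job(next(self._ids), user, queue, engine)
        self._jobs[job.job_id] = job
        self._emit(job)
        return job

    def report_status(self, job_id: int, status: str) -> None:
        # Called by whatever component polls the cluster for job state.
        job = self._jobs[job_id]
        job.status = status
        self._emit(job)

    def _emit(self, job: Job) -> None:
        self._notify({"job_id": job.job_id, "queue": job.queue, "status": job.status})


# Usage: collect events in a list instead of posting them over HTTP.
events: List[dict] = []
service = JobService(acl={"ads-team": {"ads"}}, notify=events.append)
job = service.submit("ads-team", "ads", "spark")
service.report_status(job.job_id, "RUNNING")
service.report_status(job.job_id, "SUCCEEDED")
```

Keeping the queue check inside `submit` means a team cannot consume another team's cluster resources even when it scripts its own submissions.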

"This API opened up the data processing cluster resources to other development teams that were required to build their own data processing jobs/pipelines. It also enabled programmatic submission of jobs to the cluster, allowing us to create a routing model that could route heavy queries to Hive/Spark SQL processing, instead of overloading our fast data stores. Webhook events for job statuses were very useful in enabling post processing pipelines for the generated data. It also improved resilience of the pipeline with re-tries or fallback approaches on job failures."

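The routing and resilience ideas in the quote can be illustrated with a small sketch. Everything here is an assumption made for illustration: the row-count estimate, the 10 million row threshold, the backend names, and the helper names `route` and `run_with_fallback` do not come from the original system.

```python
from typing import Callable, List, Sequence

# Assumption: each query carries an estimate of the rows it will scan.
# The threshold is arbitrary and purely illustrative.
HEAVY_ROW_THRESHOLD = 10_000_000


def route(estimated_rows: int) -> str:
    """Send heavy queries to batch SQL, everything else to the fast store."""
    return "spark_sql" if estimated_rows > HEAVY_ROW_THRESHOLD else "fast_store"


def run_with_fallback(attempts: Sequence[Callable[[], str]], retries_each: int = 2) -> str:
    """Try each callable in order, retrying it before moving to the next one."""
    last_error = None
    for attempt in attempts:
        for _ in range(retries_each):
            try:
                return attempt()
            except RuntimeError as err:  # treated here as "job failed"
                last_error = err
    raise RuntimeError("all attempts failed") from last_error


# Usage: a primary job that always fails, followed by a working fallback.
calls: List[str] = []


def flaky_primary() -> str:
    calls.append("primary")
    raise RuntimeError("job failed")


def fallback() -> str:
    calls.append("fallback")
    return "ok"


result = run_with_fallback([flaky_primary, fallback])
```

In a webhook-driven setup, a failure event for a job would be the trigger for the next retry or fallback attempt, rather than a local exception as in this sketch.
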
Furthermore, we built a generic, purely configuration-driven framework for data processing that allowed any developer to quickly develop data processing jobs.

The first step was to build a common business function library. To do this, we created a data dictionary of all potential data sources and stored in a database the details of the available raw columns as well as derived columns (i.e., columns that are a function of two or more raw columns). The function that derives each column was pluggable logic, written as simple Scala classes that implement a common interface. This allowed us to keep all business logic out of the individual data processing jobs.
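
A minimal sketch of this plug-in pattern follows, written in Python for brevity (the original classes were Scala). The interface name `DerivedColumn`, the example column `ecpm`, and the in-memory dictionary standing in for the database are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Mapping


class DerivedColumn(ABC):
    """Common interface that every derived column implements."""

    name: str
    inputs: List[str]  # raw columns this derivation depends on

    @abstractmethod
    def compute(self, row: Mapping[str, float]) -> float:
        ...


class Ecpm(DerivedColumn):
    # Hypothetical example column: revenue per thousand impressions.
    name = "ecpm"
    inputs = ["revenue", "impressions"]

    def compute(self, row: Mapping[str, float]) -> float:
        if not row["impressions"]:
            return 0.0
        return 1000.0 * row["revenue"] / row["impressions"]


# Stand-in for the data dictionary; the article stores this in a database.
DATA_DICTIONARY: Dict[str, DerivedColumn] = {c.name: c for c in [Ecpm()]}


def resolve(column: str, row: Mapping[str, float]) -> float:
    """Return a raw column as-is, or compute it if the dictionary marks it derived."""
    derived = DATA_DICTIONARY.get(column)
    return derived.compute(row) if derived else row[column]


sample_row = {"revenue": 5.0, "impressions": 2000}
ecpm_value = resolve("ecpm", sample_row)
```

Because jobs only ever call `resolve`, adding a new business metric means adding one class and one dictionary entry, with no change to any job.
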
With the framework, a new job could be defined just by writing a configuration file that specified raw sources, input and output columns, filter conditions, partial aggregates, output format (CSV/Parquet), and so on. The job logic itself remained completely agnostic of the data sources being processed and of how output was written back to HDFS. During execution, the derived-column logic was dynamically plugged into the job based on the column configuration in the data dictionary.
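
To make the idea concrete, here is a small sketch of a configuration-driven job. It is an illustration under stated assumptions, not the framework's real schema: the config keys, the `run_job` function, and the sample rows are hypothetical, plain Python lists stand in for Spark datasets, and writing the output in a chosen format is omitted.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Hypothetical config; key names are illustrative only.
JOB_CONFIG = {
    "source": "impression_logs",
    "group_by": ["publisher"],
    "filter": {"column": "country", "equals": "US"},
    "aggregates": {"revenue": "sum", "impressions": "sum"},
}

AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "max": max,
    "min": min,
}


def run_job(config: dict, rows: Iterable[dict]) -> List[dict]:
    """Filter, group, and aggregate rows as the config describes."""
    flt = config["filter"]
    groups = defaultdict(list)
    for row in rows:
        if row[flt["column"]] == flt["equals"]:
            key = tuple(row[c] for c in config["group_by"])
            groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(config["group_by"], key))
        for column, func in config["aggregates"].items():
            out[column] = AGGREGATORS[func]([m[column] for m in members])
        result.append(out)
    return result


rows = [
    {"publisher": "a", "country": "US", "revenue": 1.5, "impressions": 100},
    {"publisher": "a", "country": "US", "revenue": 2.5, "impressions": 300},
    {"publisher": "b", "country": "IN", "revenue": 9.0, "impressions": 900},
]
summary = run_job(JOB_CONFIG, rows)
```

A new report then amounts to a new config dictionary; `run_job` itself never changes.
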
This framework provided many advantages. Most importantly, it improved the efficiency of our development and QA cycle by more than 50%, helping us release new jobs into production faster than before. It also allowed developers from other teams to write jobs by leveraging the common business library of columns and defining their own job configurations. Unit test code coverage rose to more than 90%, improving code quality. Moreover, the QA automation team was able to automate testing for more than 95% of the jobs and developed a highly configurable testing framework. Finally, this approach allowed us to completely separate company-specific business logic from the Spark framework. Although Spark is still a core part of this framework, moving to a new technology in the future is possible without significant effort.

Lessons learned

As a central data engineering org serving multiple departments in a growing organization, we had our struggles: addressing all of the requirements, constantly dealing with competing priorities, delivering new features at high velocity, and maintaining platform stability. We realized that it was important to identify bottlenecks in both the process and the execution. It was also important to build the right tools for the platform to improve the efficiency of our highly data-driven organization.

The large development and backend platform efforts were hard to quantify because they had no immediate customer impact. We estimated the development effort and time (and eventually the dollar value) saved through less effort needed to deliver future projects, higher quality from QA automation, less time spent handling production issues, and fewer bottlenecks for teams building their own data processing pipelines.

We underestimated the time needed to tune Spark and make it efficient for our use cases. We learned that it is important to set aside time and effort as a buffer for the technology adoption phase during product delivery planning.

Kunal Umrigar
