Building an Efficient Data Science Team While Still Being Agile
28 July, 2020
When a company as a whole imposes a uniform process based on the software development lifecycle (SDLC), it hinders the workflow of data science projects. An efficient data science team operates on its own methodology, one more conducive to the exploratory nature of data science.
I identified three key problems stemming from efforts to impose the SDLC on data science teams:
- Timeboxing. Software development can be easily timeboxed: with clearly defined deliverables, you can work on one piece of software for two weeks as planned. Data science, by contrast, is all about unknowns and resists being time-bound.
- Reviews. Reviews in data science differ fundamentally from code reviews. If there is a problem with the software, a code review should detect it. Evaluation in data science is broader: it covers not only the code but also the data, the results, and the analysis of the process. Code review is just a small part of the overall evaluation.
- Incorporating SDLC best practices.
While the SDLC is not a perfect match for data science projects, that doesn’t mean data science can’t benefit from many good SDLC practices.
Each of those problems requires a specific approach:
A machine learning project has multiple components: an engineering component (data exploration and interpretation), an experimental component (obtaining the training data and building the model), and an operational component (deployment into production). The engineering and operational components are highly deterministic and fit a Jira board well. Jira alone is not suitable for the experimental component, but combining Jira with a Kanban board is very practical: it is still Agile, but instead of timeboxing, it boxes by project. For example, when I was exploring some data and wanted to train a model, I would track my progress on the Kanban board because of its time flexibility. It uses the same terminology and is part of the Jira ecosystem, but it naturally follows the way data science is done.
Instead of code reviews, data science relies on analytical reviews. If someone is working on a model, their goal is not just to produce a piece of code but to present their work to the team. An analytical review outlines the data that was used, the features that were developed, the model parameters, and the details of model evaluation. In data science, every problem can be classified as supervised, semi-supervised, or unsupervised, and a template for each type, along with its evaluation metrics, has to be created.
For example, for a classification problem I would look at the confusion matrix, while for a regression problem I would use error metrics such as RMSE. Analytical reviews differ based on the type of problem. Code reviews are included as part of the implementation of an algorithm.
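To make the distinction concrete, here is a minimal sketch of type-specific evaluation metrics. It uses only the standard library; the helper names (`confusion_matrix`, `rmse`) are mine for illustration, not from the review templates described above.

```python
# Illustrative sketch: different problem types call for different metrics
# in an analytical review. Assumes plain Python lists of labels/values.
from collections import Counter
import math

def confusion_matrix(y_true, y_pred):
    """Count (actual, predicted) label pairs for a classification problem."""
    return Counter(zip(y_true, y_pred))

def rmse(y_true, y_pred):
    """Root-mean-square error for a regression problem."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Classification: the review examines the full confusion matrix,
# not a single summary number.
cm = confusion_matrix(["cat", "cat", "dog"], ["cat", "dog", "dog"])

# Regression: the review examines error metrics instead.
err = rmse([1.0, 2.0, 3.0], [1.0, 2.5, 2.5])
```

In practice a team would use a library such as scikit-learn for these metrics; the point is that the review template fixes which metrics are discussed for each problem type.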
To motivate my team, I came up with the Champion Challenge problem (a Kaggle-like challenge) that reflects the nature of data science and its difference from software development. For any given data science problem, you can try multiple approaches to solving it. Unlike software development, where two people would work on two different parts of the problem, the experimental nature of data science means two people can be paired up on the same problem, each aiming to come up with a different solution. Analytical reviews are used to assess which solution is better. This challenge encourages creativity and a competitive spirit.
Many good practices from the SDLC can be integrated into data science. My favorite is pair programming, where two people work together to solve a problem. Translated to data science, it undergoes some modifications: (a) you pair up a senior and a junior data scientist, and their relationship should resemble mentoring; (b) a geographically distributed data team allows for continuation and complementary work, given the cascading nature of the multiple steps a data science project consists of.
- Timeboxing data science projects and experiments is not a smart idea, as it doesn’t follow the natural cycle of data science work.
- Do analytical instead of code reviews.
- Some data scientists come from software development and are more cognizant of and willing to introduce best practices. Nevertheless, be open-minded and ready to learn from SDLC.