Building an Efficient Data Science Team While Still Being Agile
28 July, 2020
When the company as a whole enforces a uniform process based on the software development lifecycle (SDLC), it hinders the workflow of data science projects. An efficient data science team operates on its own methodology, one more conducive to the exploratory nature of data science.
I identified three key problems stemming from efforts to apply the SDLC to data science teams:
- Timeboxing. Software development can be easily timeboxed: with clearly defined deliverables, you can work on one piece of software for two weeks as planned. Data science, however, is all about unknowns and resists being time-bound.
- Reviews. Reviews in data science differ fundamentally from code reviews. If there is a problem with the software, a code review should detect it. The evaluation process for data science is entirely different, since it covers not only the code but also the data, the results, and the analysis of the process; code review is just a small part of it.
- Incorporating SDLC best practices. While the SDLC is not a perfect match for data science projects, that doesn't mean data science can't benefit from many good SDLC practices.
Each of those problems requires a specific approach:
Machine learning projects have multiple components: an engineering component (data exploration and interpretation), an experimental component (obtaining the training data and building the model), and an operational component (deployment into production). The engineering and operational components are highly deterministic and fit a standard Jira board. Jira's sprint-based workflow is not suitable for the experimental component, but combining Jira with a Kanban board is very practical: it is still Agile, but instead of timeboxing, it boxes work by project. For example, when I was exploring some data and wanted to train a model, I would track my progress on the Kanban board because of its time flexibility. It uses the same terminology and is part of the Jira ecosystem, but it naturally follows the way data science is done.
Instead of code reviews, data science relies on analytical reviews. If someone is working on a model, their goal is not just to produce a piece of code but to present their work to the team. An analytical review outlines the data that was used, the features that were developed, the model parameters, and the details of model evaluation. In data science, every problem can be classified as supervised, semi-supervised, or unsupervised, and a template with evaluation metrics has to be created for each type.
For example, for a classification problem I would look at a confusion matrix, but for a regression problem I would use error metrics. Analytical reviews differ based on the type of problem. Code reviews are still included as part of the implementation of an algorithm.
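To make the distinction concrete, here is a minimal sketch (with hypothetical toy data) of how the evaluation step of an analytical review differs by problem type: a confusion matrix for classification versus an error metric such as RMSE for regression.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (actual, predicted) label pairs for a classification problem."""
    counts = {(a, p): 0 for a in labels for p in labels}
    for a, p in zip(y_true, y_pred):
        counts[(a, p)] += 1
    return counts

def rmse(y_true, y_pred):
    """Root-mean-squared error for a regression problem."""
    n = len(y_true)
    return (sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / n) ** 0.5

# Classification: inspect the confusion matrix cell by cell.
cm = confusion_matrix([1, 0, 1, 1], [1, 0, 0, 1], labels=[0, 1])
print(cm[(1, 1)])  # count of true positives

# Regression: summarize prediction error with a single number.
print(rmse([2.0, 4.0], [2.5, 3.5]))
```

The same code structure supports a per-problem-type review template: the template dictates which metric the presenter reports, while the review itself stays a team discussion rather than a code diff.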
To motivate my team, I came up with the Champion Challenge Problem (Kaggle-like), which reflects the experimental nature of data science and its difference from software development. For any given problem in data science, you can try multiple approaches to solve it. Unlike software development, where two people would work on two different parts of the problem, in data science two people are paired up to work on the same problem while aiming to come up with different solutions. Analytical reviews are used to assess which one is better. This challenge encourages creativity and a competitive spirit.
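The champion/challenge setup described above can be sketched in a few lines: two candidates solve the same problem with different approaches, and the review picks the winner on a shared held-out set. The candidate functions and toy data here are hypothetical stand-ins for real models.

```python
def evaluate(predict, examples):
    """Fraction of held-out examples the candidate predicts correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

def champion_challenger(candidates, examples):
    """Return (name, score) of the best-scoring candidate approach."""
    scores = {name: evaluate(fn, examples) for name, fn in candidates.items()}
    return max(scores.items(), key=lambda kv: kv[1])

held_out = [(0, 0), (1, 1), (2, 0), (3, 1)]  # toy (input, label) pairs
candidates = {
    "rule_based": lambda x: x % 2,   # one person's approach: predict parity
    "always_zero": lambda x: 0,      # the other's approach: a baseline
}
winner, score = champion_challenger(candidates, held_out)
print(winner, score)
```

The key design point is that both approaches are scored on the same held-out examples with the same metric, so the analytical review can declare a winner objectively rather than by preference.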
Many good practices from the SDLC can be integrated into data science. My favorite is pair programming, where two people work together to solve a problem. Translated to data science, it undergoes some modifications: (a) pair up a senior and a junior data scientist so that their relationship resembles mentoring; (b) with a geographically distributed data team, one person's work can be continued and complemented by another's, since data science projects consist of multiple cascading steps.
- Timeboxing data science projects and experiments is not a smart idea, as it doesn't follow the natural cycle of data science projects.
- Do analytical instead of code reviews.
- Some data scientists come from software development and are more cognizant of best practices and more willing to introduce them. Either way, be open-minded and ready to learn from the SDLC.
Arun Krishnaswamy, Director of Data Science at Workday, describes how to build a data science team emphasizing the difference between software development lifecycle and data science methodology.
Director at Workday