Building an Efficient Data Science Team While Still Being Agile

Arun Krishnaswamy

Director at Workday

Problem

"When the company as a whole follows a uniform process based on the software development lifecycle (SDLC), it interferes with and hinders the workflow process of implementing data science projects. An efficient data science team operates on its own methodology more conducive to the exploratory nature of data science."

"An efficient data science team operates on its own methodology more conducive to the exploratory nature of data science."

I identified three key problems stemming from efforts to enforce the SDLC on data science teams.

Timeboxing.
"Software development can be easily timeboxed -- with clearly defined deliverables you can work on one piece of software for two weeks as planned, but data science is all about unknowns and resists being time-bound."
Reviews.
"Reviews in data science differ fundamentally from code reviews. If there is a problem with the software, a code review should detect it. The evaluation process for data science is entirely different, since it covers not only the data but also the results and the analysis of the process. Code review is just a small part of the overall evaluation process."
Incorporating SDLC best practices.
"While SDLC is not a perfect match for data science projects, that doesn’t mean data science can’t benefit from many good SDLC practices."

Actions taken

Each of those problems requires a specific approach:

Timeboxing
"Machine learning data science has multiple components: the engineering component (data exploration and interpretation), the experimental component (obtaining the training data and building the training model), and the operational component (deployment into production). The engineering and operational components are highly deterministic and can use the standard Jira board. That board is not suitable for the experimental component, but combining Jira with a Kanban board is very practical there: it is still Agile, but instead of boxing work by time, it boxes work by project."

"For example, I was exploring some data and I wanted to train the model. To track my progress I would use Kanban because of its time flexibility. It would still use the same terminology and is a part of the Jira ecosystem, but it naturally follows the way in which data science is done."

Code reviews
"Instead of code reviews, data science relies on analytical reviews. If someone is working on a model, their goal is not to produce a piece of code but to give a presentation in front of the team. An analytical review outlines the data that have been used, the features that have been developed, the model parameters, and the details of model evaluation. In data science, every problem can be classified as supervised, semi-supervised, or unsupervised, and a template with evaluation metrics has to be created for each type."

"For example, if I have a classification problem, I would be looking at confusion metrics, but if I have a regulation problem I would use accuracy metrics. Analytical reviews differ based on the type of problem. The code reviews are included as a part of the implementation of an algorithm."

"To motivate my team, I came up with the Champion Challenge Problem (aka Kaggle like ) that reflects the nature -- and difference to software development -- of data science. For any given problem in data science, you could try multiple approaches to solve it. Unlike software development where two people would be working on two different parts of the problem, due to the experimental nature of data science, two people would be paired up to work on the same problem but aiming to come up with different solutions. Analytical reviews are used to assess which one is better. This particular challenge encourages creativity and a competitive spirit."

Best practices
"Many good practices from SDLC can be integrated into data science. My favorite is pair programming, in which two people work together to solve a problem. Translated to data science, it undergoes some modifications: a. you pair up a senior and a junior data scientist, and their relationship should resemble mentoring; b. a geographically distributed data team allows for continuation and complementation, since data science projects consist of multiple steps that cascade from one to the next."

Lessons learned

Timeboxing data science projects and experiments is not a smart idea as it doesn’t follow the natural cycle of data science projects.
Do analytical reviews instead of relying on code reviews alone.
Some data scientists come from software development and are more aware of its best practices and more willing to introduce them. Whatever your background, be open-minded and ready to learn from SDLC.
