Which Agile for Data Science projects?

Photo by Yosh Ginsu on Unsplash

Nowadays, Agile is considered one of the most effective approaches to Project Management, and it is very popular in Digital Transformation programs.

Agile project management methodologies emerged in the second half of the twentieth century. In that period, it was recognized that the project management frameworks typically applied to civil or mechanical engineering failed to respond to the business needs of software development projects.

The rise of Agile methodologies was mainly a response to the need to validate the output faster and to avoid building products that users would not like at a point when no budget for change was left.

Nowadays, Agile is also applied to the management of broader Digital Programs, and in general, to any case where the final outcome cannot be defined without continuous adaptations.

One of the most popular Agile methodologies is Scrum, which assumes that the development of complex products is better done when small, diversified, and self-organizing teams are given objectives rather than specific assignments.

According to the Scrum methodology, the team is free to determine the best way of meeting those objectives, and the workstream is organized in time-boxed, iterative, and incremental development cycles called Sprints, whose goal is to deliver working versions of the product.

Why is Agile relevant for Data Science?

In the Data-Driven business era, more and more companies are naturally engaged in the definition and management of projects that are data-centric.

Very often these kinds of projects involve cross-functional teams, where people from Finance, Marketing, Operations, and HR work with technical profiles such as Data Engineers, IT Architects, System Administrators, and Data Scientists in order to deliver data products.

Some of these cross-functional data projects have the main objective of engineering how company users, customers, and applications access and exploit the available data. Others aim to build transformative business solutions, leveraging the business insights distilled using Machine Learning or the prescriptions derived from Optimization techniques. In the following, I will refer to the first kind of project as Data Engineering Projects and to the second kind as Data Science Projects.

Examples of Data Engineering Projects are those aimed at building company websites, business dashboards, e-commerce sites, payment systems, and data platforms. These projects have different degrees of complexity, but they share one trait: the knowledge required to build the final product is usually already available inside the cross-functional team. Hence, the most suitable management methodology for these kinds of projects needs to excel in organizing the available knowledge and transforming it into an actionable and effective plan. For these cases, Agile, and Scrum in particular, is considered to be the most effective approach.

On the other hand, examples of Data Science Projects include the creation of alert systems to predict machine failures, recommender systems, energy generation optimizers, marketing automation systems, and customer value maximizers. These kinds of projects have an additional degree of complexity with respect to Data Engineering Projects: the knowledge required to build the product is not only distributed across team members but also hidden in the available data. In this case, the project management methodology not only has to be good at organizing the knowledge available in the team and understanding the user needs, but also at properly extracting the knowledge hidden in the data and transforming it into an actionable and effective plan. In these cases, Scrum is not always the best Agile version to choose, unless you adapt it.

Why Scrum can fail in cross-functional Data Science projects

Implementing Scrum in the management of cross-functional Data Science projects faces some challenges.

Most of them come from the fact that the knowledge needed to build the product is not available at the beginning of the project, and only data scientists can extract it from the available data, following a continuous, non-linear, and uncertain process.

The main consequence of this situation is that the knowledge extraction process led by data scientists produces insights at irregular points in time, which are very often not aligned with the Sprint schedule.

The impossibility of aligning knowledge discoveries with Sprint deliveries is the main reason why Scrum can prove ineffective. It is also why many Data Scientists prefer other Agile methodologies, like Kanban, where time-boxing is not required.

This mismatch is felt most strongly at the beginning of the project, when the number of insights collected from the data is highest. As the knowledge is progressively transferred into insights and models, the need to realign its extraction with the Sprint schedule decreases, and once that transfer is complete, the Data Science project becomes a Data Engineering project.

How to keep using Scrum with Divide-and-Conquer

Photo by ᴊᴀᴄʜʏᴍ ᴍɪᴄʜᴀʟ on Unsplash

Scrum is a very well-known methodology today, and many companies adopt it because of three main strengths:

  1. Scrum is particularly effective in reducing the time between business validation of the project and its delivery, thanks to continuous integration and iterations.
  2. Scrum helps to deliver higher product quality and customer/user satisfaction, thanks to testing automation, sprint retrospectives, and engagement of the user throughout the project.
  3. Scrum reduces project risk and helps control the project outcome, thanks to scheduled sprint meetings and quick adaptation to changes in users’ needs.

If companies want to keep benefiting from the advantages of Scrum in cross-functional Data Science projects, they need to follow the Divide-and-Conquer strategy.

Divide-and-Conquer Scrum for Data Science simply consists of separating the whole project into two interconnected streams (the Mainstream and the Data Science stream) and managing them separately:

  • The Mainstream involves the whole cross-functional team and respects the usual ceremonies of Scrum. The only exception is that the Data Science team acts through a representative (for example, the Data Science Team Lead) during Sprint Planning, Standups, and Reviews, in order to reduce the burden of Scrum ceremonies and focus on the knowledge discovery process.
  • The Data Science stream leverages the Data Scientists’ capabilities and the knowledge they can extract from data. The way this stream can be managed varies according to the difficulty of the knowledge extraction process and to how clear the business goal is.

Diving into the Data Science stream

In my professional experience, the Data Science stream can be managed according to different methodologies. I have successfully used Scrum, Kanban, Waterfall, or a mix of them.

A number of tricks help to succeed in Divide-and-Conquer Scrum for Data Science:

  • Facilitate the interconnection between the two streams. For example, share the deadlines and milestones of the Data Science stream with the whole team, organize a daily wrap-up at the end of the day, plan dedicated deep dives with the rest of the team, and hold the Sprint retrospective all together.
  • Box the right amount of time to consolidate the knowledge in the codebase. I usually label the days inside the Mainstream’s sprints as green or red. Although knowledge extraction is a continuous process, only the findings collected during the green days will be part of the delivery of the current sprint, while the others will be incorporated in the next sprint’s delivery. During the red days, the Data Science team concentrates on refactoring the models and code to be delivered at the end of the Mainstream sprint.
  • Follow an iterative approach rather than an incremental one. To be iterative means that you attack multiple facets of the problem at the same time, and you advance toward a solution from multiple starting points. To be incremental means that you attack and solve one slice of the problem at a time. I like the first approach because it generates more knowledge at the early stages, helps the Data Science team gain momentum, and helps the two streams synchronize earlier and spot dead ends, hence reducing project risk.
  • In planning the Mainstream, make the first sprints longer than the last ones. This helps to get concrete deliveries from data scientists in the first part of the project and to better exploit the first feedback from users and customers.
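The green/red labeling and the longer-first-sprints rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a prescribed tool: the names Sprint, plan_sprints, and delivery_sprint are hypothetical, days are plain calendar days, and the red days are assumed to be the last days of each sprint.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Sprint:
    number: int
    start: date
    length_days: int
    red_days: int  # assumption: red days are the trailing days of the sprint

    @property
    def end(self) -> date:
        # Last day of the sprint, inclusive.
        return self.start + timedelta(days=self.length_days - 1)

    def is_green(self, day: date) -> bool:
        # A day is green if it falls before the trailing red days.
        if not (self.start <= day <= self.end):
            raise ValueError("day is outside this sprint")
        return day <= self.end - timedelta(days=self.red_days)


def plan_sprints(start: date, lengths: list[int], red_days: int) -> list[Sprint]:
    # Build back-to-back sprints; pass longer lengths first, shorter ones last.
    sprints = []
    for number, length in enumerate(lengths, start=1):
        sprints.append(Sprint(number, start, length, red_days))
        start += timedelta(days=length)
    return sprints


def delivery_sprint(sprints: list[Sprint], finding_day: date) -> int:
    # Findings made on green days ship with the current sprint;
    # findings made on red days are carried over to the next one.
    for sprint in sprints:
        if sprint.start <= finding_day <= sprint.end:
            if sprint.is_green(finding_day):
                return sprint.number
            return sprint.number + 1
    raise ValueError("finding_day is outside the planned sprints")


# Hypothetical plan: sprints of 21, 14, 14, and 7 days, each ending with 3 red days.
sprints = plan_sprints(date(2024, 1, 1), [21, 14, 14, 7], red_days=3)
print(delivery_sprint(sprints, date(2024, 1, 10)))  # green day in sprint 1
print(delivery_sprint(sprints, date(2024, 1, 20)))  # red day in sprint 1
```

With this hypothetical plan, a finding made on January 10 is delivered with sprint 1, while one made on January 20 (a red day) moves to sprint 2. A finding made on a red day of the last sprint returns a sprint number beyond the plan, which is a signal that it will not be part of the planned deliveries.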

Summary

In this article, I explained why Scrum needs to be adapted in order to be applied successfully to cross-functional Data Science projects. The reason is the asynchrony between regular sprint deadlines and the uncertainties of the knowledge discovery process. To keep the advantages of Scrum in cross-functional Data Science projects, I suggest using the Divide-and-Conquer strategy and following a few tricks.

I hope you enjoyed my article. Please write to me with any comments; I’ll be happy to hear your points and share thoughts with you.
