Why Normalizing Velocities Doesn't Matter

Rational behavior requires theory. Reactive behavior requires only reflex action.
— W. Edwards Deming

The topic of normalizing velocities is inevitable when working across multiple agile teams. It can become a religious debate, so to avoid conflict and misunderstanding it is important to remember that normalizing velocities doesn't really matter.

As a quick refresher, team velocity is used by an individual team to assess its own capacity. The scalar value produced in a sprint is a single data point and provides little value in isolation. The value comes from comparison with the team's own earlier sprints, through which the team comes to understand its moving-average capacity and its variability.
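The idea of a moving-average capacity and its variability can be sketched in a few lines of Python; the sprint velocities below are hypothetical:

```python
from statistics import mean, stdev

# Hypothetical velocities from one team's last six sprints (story points)
velocities = [30, 35, 28, 33, 31, 32]

WINDOW = 3  # rule of thumb: assess only the most recent three sprints

recent = velocities[-WINDOW:]
capacity = mean(recent)       # moving-average capacity
variability = stdev(recent)   # spread across those recent sprints

print(f"3-sprint average: {capacity:.1f} points")
print(f"variability (std dev): {variability:.1f} points")
```

The short window matters: the older sprints in the list never enter the calculation, which mirrors the advice later in this piece to keep the history short.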

Some healthy examples of assessing velocity:

  • Assessment of team capacity
    • "Our 3-sprint average is 32 story points, let's see how far down in the backlog that gets us."
  • Waste in flow or process indicator
    • "Every time the engineering director makes us take on unplanned work we lose 5 story points in context switching."
  • Confirmation of process experimentation
    • "We added an extra 30 minutes of refinement to our 2-week sprint and have seen an increase in velocity of 10%."

Evaluating velocity trends within a team is reasonable and encouraged. However, this comparison does not scale across teams, into the distant past, or up to the portfolio, which is why normalizing velocities in these ways does not matter, regardless of how tempting (and easy) it might be.

Comparing Teams

Interestingly, it is often the teams themselves that drive this behavior, and it is leadership who must explicitly discourage it. Teams that compete like this may behave in ways that work against desired outcomes, such as increased quality and better predictability. Imagine these three teams in your organization:

  • Team A (3 sprint average): 32 story points
  • Team B (3 sprint average): 9 story points
  • Team C (3 sprint average): 27 story points

Doesn't this make you want to ask, "What is the problem with Team B?" It is easy to pass judgment on Team B; however, the data may tell another story. For example, perhaps Team B is much smaller than A and C, or it may be ramping up on a new and exploratory piece of work, or it may simply estimate lower. Regardless, the pull to compete with other teams is natural and should be discouraged to keep the focus on delivering high-quality software with predictability. The danger is that the team "trying to keep up" will take shortcuts to artificially increase velocity and feel better about its numbers. This is not uncommon, so be cognizant and try to get to the root cause if this behavior is observed. It manifests in many ways, but commonly:

  • Story splitting late or at the end of a sprint - the 13-pointer that gets split into an 8 and a 5 to get credit for the work done
  • Breaking off test automation or other DoD inclusive tasks as separate stories
  • "Point Creep" - a story that would have been a 3 in the past is now an 8

The desire to increase velocity can be hard to uproot. I was working with a team that was very attached to increasing its velocity relative to other teams. After I explained the core concepts several times, they still would not give it up. As a last resort, I instructed the team to double their estimates; that seemed to settle the issue.

Portfolio Forecasting

Operating at scale requires early estimation of large blocks of work without necessarily consulting the teams that may take it on. This is necessary to avoid wasting team time on features that may not provide good ROI or align with strategy. Quickly the question turns to velocity normalization, based on some incorrect assumptions:

  • Incorrect: Early-estimators must know the target team and its individual velocity
  • Incorrect: Teams must conform their scale to meet that of the early-estimators
  • Incorrect: All teams must normalize their velocities across teams to make an accurate forecast

Achieving these conditions would be challenging even in an ideal environment, and it is unnecessary to pursue this level of precision. The truth is, if the early estimators are consistent and largely accurate in their relative assessment across features, normalization by teams does not matter. The reason is that individual teams must re-estimate the work according to their own scale, and doing so will not significantly change the duration. It is a matter of changing units, e.g. Fahrenheit to Celsius, which does not change the underlying value. For this reason, I recommend that the SWAG not be in story points but in some other measure, such as t-shirt size (or dog breeds), to help detach from the natural tendency to fit one scale into another. After some data has been gathered, that SWAG can be a good indicator of duration, given some assumptions. But even if the early estimators are measuring in story points, teams can still independently estimate in their own scales.
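A minimal sketch of the SWAG-to-duration idea, assuming a hypothetical mapping built from an organization's historical delivery data (the only value taken from this article is that a "Small" historically takes about 5 sprints; the rest are invented for illustration):

```python
# Hypothetical lookup: t-shirt SWAG -> typical sprints to deliver,
# derived from historical data rather than story points.
SWAG_TO_SPRINTS = {"XS": 2, "S": 5, "M": 8, "L": 13, "XL": 21}

def forecast_sprints(swag: str) -> int:
    """Rough portfolio-level duration forecast for an early-estimated feature."""
    return SWAG_TO_SPRINTS[swag]

print(forecast_sprints("S"))  # a "Small" historically takes about 5 sprints
```

The point of the non-numeric scale is exactly what the paragraph above argues: it keeps early estimators from trying to force their units onto the delivery teams.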

As is true across agile practice, estimation is a skill, so it improves the more the team practices. When starting out, the following conditions may not hold. That is ok; it is important to start somewhere so the learning can begin.

Work towards:

  1. Teams that may take on the work are reasonably fungible
  2. Estimates are accurate, developed by knowledgeable experts, and units are not tied to time
  3. The estimating team has good historical data that indicates duration, as a SWAG

Let's assume a SWAG for a feature is given as "Small" on a scale of XS, S, M, L, XL, and that historically the organization delivers a "Small" in 5 sprints. Let's also assume this work might be taken on by one of two teams, both with domain understanding of the content, perhaps one more mature than the other. The numbers in this kind of exercise are manufactured, but when you try it in practice you will find the result holds, given the assumptions.
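The arithmetic behind such a comparison can be sketched as follows; the estimates and velocities here are hypothetical, chosen only to show two very different point scales landing on nearly the same duration:

```python
# Hypothetical data: each team re-estimates the "Small" feature in its own
# scale, then divides by its own 3-sprint average velocity.
teams = {
    "Team A": {"estimate_points": 160, "velocity": 32},  # larger point scale
    "Team B": {"estimate_points": 42, "velocity": 9},    # smaller point scale
}

for name, t in teams.items():
    duration = t["estimate_points"] / t["velocity"]
    print(f"{name}: {duration:.1f} sprints")
```

Both forecasts land near the historical 5 sprints for a "Small" even though the teams' point scales differ by a factor of roughly four, which is the unit-conversion argument in practice.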

The trick here is to remember that these are estimates: do not choose Team B because it came in 0.3 sprints 'cheaper' than Team A; that would be a misuse of the data.

Teams must be allowed to re-estimate according to their schemes and not feel compelled to make it work in the SWAG scheme.

The Distant Past (More Than 3 Sprints)

It is tempting to look far back in a team's history to establish a long-running average, or to compare the team against its performance deep in the past. The rule of thumb is to assess the last three sprints and keep the count low, as recent sprints are the best indicator of current performance. Looking back 10 sprints is like looking at the weather 3 months ago to decide what to wear tomorrow. High-performing teams learn, and the work shifts quickly. So keep the history short and, in general, do not use team performance more than 3 sprints in the past as a decision point.

Takeaways

  • If teams are comparing themselves to one another and cannot stop on their own, it is up to leadership to actively discourage it.
  • If management is comparing team velocities, one team to another, stop it, stop it now.
  • If using normalized velocities to aid in forecasting at the portfolio level, use the approach above to abstract away individual team outliers.
  • Teams should assess their velocities over recent sprints to discover their natural capacity and variability for informed decision making.