(Alternate Subtitle: How the heck do you do estimation and forecasting in Kanban?)
One of the biggest changes for many teams adopting agile is in the way they slice, track, and measure their work. They learn about “User Stories”, which are sized using “Story Points”. The team adds up the Story Points completed in a given “Sprint”, which gives them their “Velocity”.
In fact, there are some very good elements to this. The story format can be quite helpful for focusing teams on what it is they are trying to deliver. Further, the idea of breaking the work down into small, incremental chunks is fantastic. And story points have a great attribute – they are relative. You aren’t trying to measure an exact amount, but rather to make a relative estimate (“This story is twice as hard as this other one, so we’ll call this a 4 and that one a 2”).
But the one drawback comes when you begin interacting with other teams, or looking at different methodologies. Specifically:
- How do you deal with teams with different sprint lengths?
- How do you know teams are estimating the same way?
- Can you ever really trust velocity numbers across teams?
- How do you forecast estimates across teams?
- What happens when you remove the timebox of the sprint completely?
- How do you “credit” stories which aren’t completed during the Sprint?
In fact, there are several solutions to the above problems. For example, you can hold an estimation session where members from each team estimate a common set of items, and the teams use that as their baseline. You can do other mathematical tricks to make it work, too. So it is possible.
But what if you didn’t have to do that? What if there was a way to get a more quantifiable estimate without resorting to detailed hour estimates, without removing the relative estimation and whose accuracy and confidence levels could be proven?
To understand, let’s look at a scenario I see played out quite often. We have a user story which a team has committed to – say, changing a user entry screen. The team “completes” the user story. In fact, all of the teams complete their user stories. And yet the application still isn’t shipping, there are lots of delays, and it isn’t clear why.
Well, at least at the surface it isn’t clear why. If you step back, you discover that after the team is finished, it goes into User Acceptance Testing, and then packaging, and then deployment. In other words, the User Story isn’t actually done – even though the team said it was. This may not seem like a huge deal, but let’s look at the consequences:
- The team isn’t honoring their commitment to their process
- Forecasting based on the velocity is useless, since there is additional work happening behind the scenes not being accounted for
- The team grows frustrated because they know the velocity number isn’t reality
Note that none of the above is the fault of the methodology itself. The team isn’t adhering to it, and that can cause very large problems. So, if this is our reality, and it isn’t possible to shrink the UAT/Packaging/Deploy work to fit within the sprint, what should the team do? And since these are variable-length stories, how can we accurately forecast and estimate?
Luckily we actually have a very simple method which combines relative estimation, Cycle Time and Classes of Service. Let’s tackle the first two first. Going back to our story above, the team pulls a story to change the screen. In Scrum, they would discuss the story with the customer, break it down into tasks, and estimate it using Story Points. Instead of doing story points, let’s imagine they used T-Shirt sizes – Extra Small, Small, Medium, Large, Extra Large. So far it isn’t that much different – I generally advised teams that anything larger than an 8 shouldn’t go in a sprint.
But here’s where it gets different. Rather than adding up the story points completed during the sprint, they measure the cycle time of the stories. The Cycle Time is the amount of time it takes for a story to go from when work first begins on it to when it ships, including any loopbacks or rework. Now, let’s imagine we charted the T-Shirt Size and Cycle Time for each of our user stories. We might get a chart that looks like this:
| T-Shirt Size | Cycle Time |
|--------------|------------|
| S            | 4 days     |
| S            | 5 days     |
| M            | 13 days    |
| S            | 3 days     |
| M            | 16 days    |
| L            | 26 days    |
| M            | 15 days    |
| L            | 31 days    |
| S            | 5 days     |
| L            | 23 days    |
You can see a couple of things. First, we can quickly calculate the Average Cycle Time for a given size story:
- Small – ~4 days
- Medium – ~15 days
- Large – ~27 days
But, more importantly, we know the confidence in those estimates. For Small stories, we know that the minimum is 3 days, and the max is 5 days, with a standard deviation of only about 1 day. But for Large stories, the minimum is 23 days, and the maximum is 31 days, giving us a standard deviation of just over 4 days. So the risk is higher that we won’t meet the Average Cycle Time.
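If you want to see how those figures fall out of the raw data, here is a minimal sketch in Python. The cycle times are copied from the table above; the data structure and printout are purely illustrative, not tied to any particular tool:

```python
# A minimal sketch: summarizing the sample cycle times by T-Shirt size.
# The numbers are copied from the table above; the structure is illustrative.
from statistics import mean, stdev

cycle_times = {
    "S": [4, 5, 3, 5],
    "M": [13, 16, 15],
    "L": [26, 31, 23],
}

for size, days in cycle_times.items():
    print(
        f"{size}: avg {mean(days):.1f} days, "
        f"min {min(days)}, max {max(days)}, "
        f"std dev {stdev(days):.1f} days"
    )
```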
So now that we have our Average Cycle Time, we can use it for forecasting. If a certain feature consists of 5 medium stories, 3 large stories, and 10 small stories, we can estimate approximately how long it will take: (5*15) + (3*27) + (10*4) = 196 days. But in reality, it could take as long as (5*16) + (3*31) + (10*5) = 223 days if every story took the maximum amount of time.
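If it helps to see that arithmetic spelled out, here is a rough sketch of the same forecast. The average and maximum figures, and the story counts for the hypothetical feature, are the ones used in the example above:

```python
# A rough sketch of the forecast above. It simply sums cycle times,
# exactly as the worked example does; the figures come from the sample data.
avg_days = {"S": 4, "M": 15, "L": 27}
max_days = {"S": 5, "M": 16, "L": 31}
feature = {"S": 10, "M": 5, "L": 3}  # story counts for the hypothetical feature

expected = sum(count * avg_days[size] for size, count in feature.items())
worst_case = sum(count * max_days[size] for size, count in feature.items())

print(f"Expected: {expected} days, worst case: {worst_case} days")
# Expected: 196 days, worst case: 223 days
```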
What’s nice about this way of looking at forecasting and estimating is that the calculations are based on the actual time it takes the teams to do the work. And it is easy to calculate across teams, since you can take into account variances and differences as part of the calculations.
If you remember earlier, I mentioned there was one more element to Average Cycle Time for forecasting – Classes of Service. Jeff Anderson has a good blog post on the concept, but in a nutshell, classes of service are a recognition that all work is not the same, and should likely not be treated the same.
What this means for us is that we can extend our Average Cycle Time estimation to include Classes of Service. Let’s say that we’ve defined three classes of service: Normal Work, Bug Fixes, and Critical Fixes. By simply marking those on our board, we can track the Average Cycle Time for each category of work. In fact, I use this with teams to help them get a better handle on work that requires the assistance of one part of the organization versus another.
Let’s imagine that we have a separate department which houses the Database Administrators. When we make a change that requires a database change, it has to be handed off to them. That’s a type of work with a different workflow and different policy requirements – making it a good candidate for a separate class of service. Now our table might look like:
|        | Normal Work | Database Work |
|--------|-------------|---------------|
| Small  | 4 days      | 5 days        |
| Medium | 15 days     | 18 days       |
| Large  | 27 days     | 34 days       |
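If you wanted to track this yourself, one simple approach is to record each finished story’s cycle time keyed by both its class of service and its T-shirt size. A rough sketch, with made-up class names and sample entries, might look like this:

```python
# A rough sketch of tracking cycle time by class of service and size.
# The class names and sample entries are made up for illustration.
from collections import defaultdict
from statistics import mean

# (class_of_service, t_shirt_size) -> observed cycle times in days
observations = defaultdict(list)

def record(class_of_service: str, size: str, days: int) -> None:
    observations[(class_of_service, size)].append(days)

record("Normal", "M", 15)
record("Normal", "M", 16)
record("Database", "M", 18)
record("Database", "L", 34)

for (cos, size), days in sorted(observations.items()):
    print(f"{cos} / {size}: avg {mean(days):.1f} days over {len(days)} stories")
```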
Should you actually break your work and forecasting down to this level? Not necessarily. After all, if you have flexible dates, or can cut scope to account for work changes, then going through all of the work to understand the averages, deviations and probabilities may not be worth it. But if you have a high need for better estimates, then using Average Cycle Time may give you what you are looking for.
One final note – even if you are using Scrum, I still highly recommend tracking the Average Cycle Time of the stories. You don’t have to throw out your Story Points, but you might find that you look better wearing a T-Shirt instead.
Interesting… very interesting. I’ll have to re-read through this again.
Very nice. I think that the statistical information gained from the extra dimension of measurement is very important for the organization that wishes to adopt agile across the board. More here: http://seanmehan.globat.com/blog/2011/06/28/story-point-estimation-vs-cycle-time/
Measuring cycle time and charting it versus story size can yield surprising results. In one team last year, I did this, and we found that cycle time was almost independent of story size up to 8 story points. Above 8 story points, cycle time went up significantly.
We analyzed this and found the following reason: The stories spent more than 50% of their cycle time in queues. Those queues had nothing to do with story size.
So, if you get surprising results like this, watch out for queues and for WIP before you draw conclusions that depend on story size.
Cory,
I read this above: “(5*15) + (3*27) + (10*4) = 196 days”. You simply add up the cycle times of different user stories under development. However, this does not reflect the situation in a real team, because the stories will be developed in parallel. With WIP=2 you will get 98 days, and with WIP=4 you will get 49 days to develop the same stories.
You have to say under which WIP you measured the cycle time values. Then sum them up and divide by that same WIP to get an idea of how long the batch of stories will take.
Cheers,
Matthias
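To make the WIP adjustment Matthias describes concrete, here is a rough sketch using the totals from the example above; the WIP values shown are purely illustrative:

```python
# A rough sketch of the WIP adjustment described above: if the cycle times
# were measured while the team worked on `wip` stories in parallel, the
# calendar duration of the batch is roughly the summed cycle time / WIP.
total_cycle_time = (5 * 15) + (3 * 27) + (10 * 4)  # 196 days from the example

for wip in (1, 2, 4):
    print(f"WIP={wip}: roughly {total_cycle_time / wip:.0f} calendar days")
```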
You are actually measuring lead time, not cycle time? Cycle time is just the inverse of the throughput. According to Little’s law, wip/throughput equals lead time, not cycle time.
Cory, I’ve been looking for an adequate answer to the question, “How long will project X take” or “How long will this bucket of features take to complete”?
The best I’ve seen is to use throughput, like the number of stories completed per unit of time, say per month. That doesn’t feel accurate enough for me, as it doesn’t take the t-shirt size of the story into account. One could calculate that measure by t-shirt size, but that seems unnecessarily complex to apply to a pile of sized stories.
So for now, I stick with story points for forecasting a pile of work, and cycle time for individual items.
Great post! It’s funny when you are struggling with something, you try to do it in different way and google leads you to a blog post of someone that has already gone through all that pain and shares similar ideas.
Thanks for sharing!
Agree with Matthias, we have actually seen how an “S” size user story can have a cycle time of 40 days!!! It’s always the same problem: the development team estimates, but they don’t take into account acceptance, testing, deployment, system testing, etc.
How can you calculate a standard deviation from only two or three numbers? You overused this concept.