Tuesday, 29 September 2020

A probability distribution for IT project planning and the question that has answer 42

Reading The Black Swan by Nassim Nicholas Taleb (page 159), I came across the idea of looking at the conditional expectation of a random variable \(X\), given that \(X>k\) for some \(k\). If your life expectancy is 80 years, but you are already 87 years old, your conditional life expectancy is now, say, 92 years. In math notation, \(E(X|X>87)=92\). Look here for real data. The point is that the more extreme the condition, the smaller the additional expectation. So for example, \(E(X|X>119) \le 120\).

Now think about an IT project. A similar project took 18 weeks to complete. On the current one, 40 weeks have already passed. We are interested in \(E(X|X>40)\). Anyone who has worked in IT knows that the new estimate for this project is unlikely to be 41 weeks. To get any further, we need to make assumptions about the probability distribution of \(X\). If \(X\) has a Poisson distribution with mean 18, it turns out that \(E(X|X>40)\approx 41.7\) (or 40.7 if the condition is read as \(X\ge 40\)), so we are obviously looking for a wilder distribution. The wildest distribution I know is the Pareto distribution (also called a power law).
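The Poisson figure is easy to check by summing the probability mass function directly. Here is a minimal sketch in Python (standard library only, not the Mathematica used for this post). The helper names `poisson_pmf` and `conditional_mean` are made up for illustration, and the cut-off at 200 is an assumption that is harmless because the tail beyond it is negligible. Note that for an integer-valued \(X\), the answer depends on whether the condition is strict.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability P(X = k), computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def conditional_mean(mean: float, first: int, last: int = 200) -> float:
    """E(X | X >= first), truncating the sum at `last` (negligible tail beyond)."""
    ks = range(first, last + 1)
    weights = [poisson_pmf(k, mean) for k in ks]
    return sum(k * w for k, w in zip(ks, weights)) / sum(weights)

strict = conditional_mean(18.0, first=41)      # X > 40 means X >= 41 for integer X
non_strict = conditional_mean(18.0, first=40)  # X >= 40
print(f"E(X | X > 40)  = {strict:.2f}")
print(f"E(X | X >= 40) = {non_strict:.2f}")
```

Either way, the conditional expectation sits less than one week above the first admissible value, which is exactly the thin-tailed behaviour that does not fit IT projects.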

The Pareto distribution takes two parameters, the minimum value and a shape parameter \(\alpha\). Let's say the shortest time a project can take is 1 (week or any unit of time). Say, from experience, that 1 estimated week in a few historical projects actually turned out to be 1, 3, 7, 2, 1, 2, 1, 4 weeks. These numbers are not outrageous - I expect you to sigh out of boredom: This is just the nature of IT projects. Ok, then we can use the maximum likelihood method to fit the shape parameter of the Pareto distribution. That's a one-liner in Mathematica. It comes out at \(\alpha=1.375\).
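For the curious, the fit can also be done by hand. With the minimum fixed at \(x_m=1\), the maximum likelihood estimate of the shape parameter has a standard closed form (the symbols \(x_m\), \(x_i\) and \(n\) are my notation here, for the minimum, the observations and their count). The observations equal to 1 contribute \(\ln 1 = 0\), so only five terms survive:

$$\hat\alpha=\frac{n}{\sum_{i=1}^{n}\ln(x_i/x_m)}=\frac{8}{\ln 3+\ln 7+\ln 2+\ln 2+\ln 4}=\frac{8}{\ln 336}\approx 1.375$$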

Now we can generate some random numbers out of this and see what we can expect in future projects! I get 1, 1, 1, 1, 2, 2, 3, 3, 7, 70. That's not exactly matching my intuition. We get something close to the expected value most of the time, with an occasional wild outlier. I'd rather see something a bit more in between the expected value and wild outliers like 70. Before moving on, let's summarize in Mathematica what we did so far:
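The same steps can also be sketched in Python with nothing but the standard library. This is my own stand-in, not the Mathematica listing: the helper `pareto_sample` and the seed are assumptions for illustration, and the draws will not reproduce the exact numbers quoted above.

```python
import math
import random

weeks = [1, 3, 7, 2, 1, 2, 1, 4]   # actual weeks per estimated week, from the text
x_min = 1.0                        # assumed shortest possible duration

# Maximum likelihood estimate of the Pareto shape parameter for a known minimum.
alpha = len(weeks) / sum(math.log(w / x_min) for w in weeks)

def pareto_sample(alpha: float, x_min: float, rng: random.Random) -> float:
    """Draw one Pareto variate by inverting F(x) = 1 - (x_min / x)**alpha."""
    u = 1.0 - rng.random()         # uniform on (0, 1], avoids division by zero
    return x_min / u ** (1.0 / alpha)

rng = random.Random(2020)          # arbitrary seed, only for repeatability
draws = sorted(round(pareto_sample(alpha, x_min, rng)) for _ in range(10))
print(f"alpha = {alpha:.3f}")
print(draws)
```

Rerunning with different seeds shows the same pattern: mostly small values, plus the occasional very large one.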


I want a distribution somewhat like the Poisson distribution below, but that is fatter in the right-end tail, something that falls off less quickly.

The Gamma distribution comes to mind. Its density function has a polynomial factor like \(x^2\) that dominates when \(x\) is small and an exponential factor \(e^{-\beta x}\) that dominates for large \(x\), ensuring rapid fall-off of probability in the right tail. To attenuate the fall-off, let's take the square root of \(x\) in the exponential. We get our candidate distribution for project planning

$$f_0(x):=x^2 e^{-\sqrt{x}},\qquad x\ge 0$$

Well, the total probability has to be 1 so we must normalize it. I suppose you can make the substitution \(t=\sqrt{x}\) and grind it with integration by parts. But Mathematica can do it symbolically and it comes out as 240. So we actually have
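In fact the substitution route is short enough to write out. With \(t=\sqrt{x}\), so that \(x=t^2\) and \(dx=2t\,dt\), the integral turns into a Gamma function and no integration by parts is needed:

$$\int_0^\infty x^2 e^{-\sqrt{x}}\,dx=\int_0^\infty t^4 e^{-t}\,2t\,dt=2\int_0^\infty t^5 e^{-t}\,dt=2\,\Gamma(6)=2\cdot 5!=240$$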

$$f(x):=\frac{1}{240}\,x^2 e^{-\sqrt{x}},\qquad x\ge 0$$

Let's look at a plot of it!


Ok, that is somewhat promising. What is the expected value? Later on, we can add location and shape parameters, but let's just consider this "canonical" version first. Again trusting Mathematica on the details, let's just enter the definition of expected value \(\int_0^\infty x f(x)\,dx\) and press enter. It comes out as 42.
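If you would rather not trust a symbolic engine, a crude numerical integration gives the same picture. This is a sketch in Python using a plain midpoint rule. The helper names `density` and `integrate`, the upper limit of 10000 and the step count are my own choices, picked so that the truncated tail is negligible.

```python
import math

def density(x: float) -> float:
    """The candidate density f(x) = x^2 * exp(-sqrt(x)) / 240 for x >= 0."""
    return x * x * math.exp(-math.sqrt(x)) / 240.0

def integrate(g, upper: float = 10000.0, steps: int = 200000) -> float:
    """Midpoint-rule approximation of the integral of g over [0, upper]."""
    h = upper / steps
    return h * sum(g((i + 0.5) * h) for i in range(steps))

total = integrate(density)                  # should be very close to 1
mean = integrate(lambda x: x * density(x))  # should be very close to 42
print(f"total probability = {total:.4f}")
print(f"expected value    = {mean:.3f}")
```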


That's right - the answer to the Ultimate Question of Life, the Universe, and Everything! Wow. After the initial shock, let's try to generate some random variates from this to see if we get results that match our experience from IT projects. For that, we can feed uniform \([0,\,1]\) random numbers into the inverse of the cumulative distribution function, a well-known technique called inverse transform sampling. I get 10, 10, 11, 22, 23, 28, 43, 60, 61, 170. The most probable outcome is at the peak of the density function, which sits at \(x=16\) (set the derivative of \(x^2 e^{-\sqrt{x}}\) to zero and you get \(\sqrt{x}=4\)). But the expected value is 42, since occasional tail outcomes raise the average with nothing compensating from below. This is maybe not a great distribution to model IT projects, but it has the nice property that you can legitimately hope for the 16-week outcome but should expect 42 weeks until completion...
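Here is how the sampling could look in Python, as a sketch rather than a port of my Mathematica code. It relies on the observation that \(T=\sqrt{X}\) has density \(t^5e^{-t}/120\), whose cumulative distribution function is a finite sum. Inverting it by bisection is my own shortcut, and the names `cdf` and `quantile` as well as the seed are made up for illustration, so the draws will differ from the ones listed above.

```python
import math
import random

def cdf(x: float) -> float:
    """CDF of f(x) = x^2 exp(-sqrt(x)) / 240, via the substitution t = sqrt(x)."""
    t = math.sqrt(x)
    # With t = sqrt(x) the density becomes t^5 e^{-t} / 120, whose CDF is
    # 1 - e^{-t} * sum_{k=0}^{5} t^k / k!.
    partial = sum(t ** k / math.factorial(k) for k in range(6))
    return 1.0 - math.exp(-t) * partial

def quantile(u: float, tol: float = 1e-9) -> float:
    """Invert the CDF numerically by bisection."""
    lo, hi = 0.0, 1.0
    while cdf(hi) < u:           # grow the bracket until it contains the quantile
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(42)          # arbitrary seed, only for repeatability
samples = [quantile(rng.random()) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
print(sorted(round(s) for s in samples[:10]))
print(f"sample mean = {sample_mean:.1f}")
```

With a few thousand draws the sample mean should land near 42, while individual draws scatter widely around it.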

This blog post is already too long. I will leave you with the last piece of Mathematica code.


What distribution do YOU use to model your IT projects? Comment below!