Bar chart pitfalls: monthly buckets

It's April 7th and Sam from Sales is presenting the last three months' worth of data for one of the company's key metrics. “So you can see a definite upward trend. If we keep this up, we should be on target to see around 35 pronks this month, and nearly 40 next month!”

Of course someone has to ask why April comes in under 10, and Sam is quick with the answer: it's only April 7th today, so that bar is showing incomplete data.

But then Merry from Marketing asks if Sam could possibly show the data on a day-by-day basis. Nothing easier:

Oops. Where did that trend go?

If we go to more effort than Sam did, we'll find that the data actually starts exactly three months ago: on the 7th of January. January's bar, just like April's, is showing incomplete data. The rest of the “trend” comes from February having only 28 days (in 2015) while March has 31. The data in fact reflect an average of exactly one pronk per day, with a small amount of noise added. (You can verify this with View Source.) In April we can expect only 30 pronks, and 31 in May.
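To make the arithmetic concrete, here is a rough sketch in Python that counts days per calendar month, pretending there is exactly one pronk per day and no noise. The date range (7 January to 7 April 2015, both ends included) is my reading of the story, and the names `monthly_totals` and `MONTH_NAMES` are just made up for the illustration; this is not the code behind the charts.

```python
from collections import Counter
from datetime import date, timedelta

# Assumption: the data covers 7 January to 7 April 2015, inclusive.
start, end = date(2015, 1, 7), date(2015, 4, 7)

MONTH_NAMES = {1: "January", 2: "February", 3: "March", 4: "April"}

# Assumption: exactly one pronk per day, with the noise left out.
monthly_totals = Counter()
day = start
while day <= end:
    monthly_totals[day.month] += 1
    day += timedelta(days=1)

for month, total in sorted(monthly_totals.items()):
    print(MONTH_NAMES[month], total)
# January 25
# February 28
# March 31
# April 7
```

A perfectly flat rate already produces bars of 25, 28 and 31, which looks like growth if you don't know where it comes from.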

Of course Sam should adjust the chart to include the first days of January. But we can do better than this: if we show the daily average recorded over each month, instead of the total, then we eliminate the influence of the different month lengths. And if Sam for some reason needs to stick to that odd choice for the date range, well, that works too.

We could even go further, and let the width of each bar indicate the size of the sampling bucket. In that case the total area of a bar shows the size of the sample. I don't show an example because this more complex kind of chart is not supported by the charting library I'm using (nvd3) and nobody knows how to read it anyway.
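The daily-average idea can be sketched in a few lines of Python. The totals below are assumed, noise-free numbers (one pronk per day from 7 January to 7 April 2015, inclusive), and `days_covered` is a hypothetical helper written for this illustration, not something from nvd3 or from the page's source.

```python
from calendar import monthrange
from datetime import date


def days_covered(year, month, start, end):
    """Count the days of the given month that fall inside [start, end]."""
    first = max(date(year, month, 1), start)
    last = min(date(year, month, monthrange(year, month)[1]), end)
    return max((last - first).days + 1, 0)


# Assumption: same odd date range as in the story, both ends included.
start, end = date(2015, 1, 7), date(2015, 4, 7)

# Assumption: noise-free monthly totals at one pronk per day.
totals = {1: 25, 2: 28, 3: 31, 4: 7}

# Divide each total by the number of days actually sampled in that month.
daily_average = {
    month: total / days_covered(2015, month, start, end)
    for month, total in totals.items()
}

for month, average in sorted(daily_average.items()):
    print(month, average)
```

Because each total is divided by the days actually sampled rather than by the full month length, the partial January and April buckets come out at the same rate as the complete months.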

Lessons:

Beware incomplete data at both ends of a bar chart sampling over a time range.
Months make bad sampling buckets, because they aren't all the same length.
Dividing the value reported by the size of the sample bucket makes it possible to compare buckets of different sizes.