Roger, aka the Head of the Geekforce, has had a bit of time on his hands, with two of our major clients serving the hotel and catering industries. I spoke to him about the new Covid-19 graphs he has put up on the Gamma Science website.
One of the things we do for our clients is business analytics and data science: specifically, designing bespoke systems to analyse and interpret complex datasets and present them to decision makers, often in visual and graphical formats.
S: So what inspired you to put the graphs on the website?
R: I'm an engineer, and I wanted to understand what was happening! I got fed up not being able to see a consistent picture from the published graphs. I had the tools, so I did it!
S: So why were the graphs in the media not helpful?
R: The mass media tends to present the "Graph of the day", compiled from different data sources, so you are not getting a consistent picture over time. And most of them initially used a linear scale, which just shows the line going up and up. A logarithmic scale gives a better comparison for exponentially increasing data. It lets you meaningfully interpret what is going on in the early stages, while still showing how the data changes relative to the straight exponential-increase line as the numbers grow.
S: Exponentially increasing data?
R: When the numbers just keep on doubling over a regular interval. If you look at the scale, it starts at 1-100 in the first section, the next section is 100-1,000, the next 1,000-10,000, then up to 100,000 and so on. The straight line shows the numbers going on doubling, over the reference period, for comparison. It's a common tool in data science, especially when you are dealing with rapid growth, and really silly numbers, like here! These steep curves are very difficult to judge by eye. The logarithmic scale squashes the lines at the top of the scale, and putting the straight exponential growth line in gives you a reference.
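To make the "straight line" point concrete, here is a minimal sketch in plain Python. The starting value of 100 and the three-day doubling period are assumptions chosen for illustration, not figures from Roger's graphs. It shows that a steadily doubling series takes ever bigger steps on a linear scale, but equal steps once you take the logarithm.

```python
import math


def doubling_series(start, doubling_days, n_days):
    """Values that double every `doubling_days` days, starting from `start`."""
    return [start * 2 ** (day / doubling_days) for day in range(n_days)]


# Hypothetical reference line: 100 cases, doubling every 3 days.
reference = doubling_series(start=100, doubling_days=3, n_days=10)

# On a linear scale the day-to-day steps keep getting bigger...
linear_steps = [b - a for a, b in zip(reference, reference[1:])]

# ...but on a log10 scale every step is the same size, which is why
# steady doubling plots as a straight line on a logarithmic axis.
log_values = [math.log10(v) for v in reference]
log_steps = [b - a for a, b in zip(log_values, log_values[1:])]

print([round(s, 1) for s in linear_steps])
print([round(s, 4) for s in log_steps])
```

A real curve that bends below this line is doubling more slowly than the reference period.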
S: So when we see the curve bending over, is that what the Government mean by "flattening the curve"?
R: Not exactly. It's complicated!
S: We've been at this for five minutes, and that's your first "It's Complicated!". Are you getting better at this, or am I? So what does it mean, exactly?
R: Well, most of the graphs where they talked about flattening the curve were linear, and the hope was it would flatten early on, while numbers of new cases were relatively low. And they are a plot of active cases, e.g. the number of people in hospital. But the figures I can find for recoveries are not reliable, so we can't plot that graph with any confidence. Anyway, we've gone past that now, but if you see the angle flattening, and the graph crossing over that straight reference line, it means that at least the rate of doubling has slowed, which you can see it has, as the lockdown begins to take effect, on both the new cases and the total cases. And if you look at New Zealand, where the line is virtually parallel to the base line, it means that the numbers are growing at the same rate, not doubling and redoubling.
S: Yes, I was going to ask you about that. The rest of the graphs are for Europe, so why include New Zealand?
R: Apart from the fact you are from there, and you asked me to, you mean? Well, New Zealand had advance warning, so they had a lot of advantages. They closed the borders and locked down people's movement early. It's as close as we've got to a best case scenario.
S: So why are there some dips in the lines?
R: The difference data is quite choppy. There may be lags in reporting, and gaps in the data, especially around the long weekends. And different countries report numbers calculated in slightly different ways, which doesn't exactly help. But, imperfect as it is, it's the best data I could find. I'm not an epidemiologist, just a data scientist trying to make sense of something using the tools I've got. Which is why we presented it without comment on the website.
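As a rough illustration of why difference data gets choppy, the sketch below takes day-to-day differences of a cumulative total. The numbers are made up for the example, and the function name `daily_new_cases` is hypothetical rather than anything from Roger's program. One day with no report produces a dip, and the catch-up the next day produces a spike.

```python
def daily_new_cases(cumulative):
    """Day-to-day differences of a cumulative total."""
    return [today - yesterday
            for yesterday, today in zip(cumulative, cumulative[1:])]


# Made-up cumulative totals, for illustration only. The fourth value repeats
# the third, as if no report came in over a long weekend, and the fifth
# value catches up on the missing day.
cumulative = [100, 150, 210, 210, 330, 400]

print(daily_new_cases(cumulative))  # [50, 60, 0, 120, 70]
```

The underlying total rises smoothly enough, but the differences swing from 0 to 120 purely because of when the numbers were reported.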
S: So you say you had the data tools, what were they?
R: We are mostly a Python house; most of our data science is done in Python. I used Jupyter and Jupyter Notebook, which are useful data analytics tools. I wanted to see the graphs, and see whether the data backed up what was being reported in the news. I tweaked what I was doing and tried different things. I wanted to get it out reliably each day, so starting with Jupyter, I wrote a Python program and described each graph I wanted, added a bit of branding and put them on the website. Then I got interested in different aspects, so I added more graphs. I started with two, and now I've got six. I used Plotly as a presentation tool. It was a pretty trivial little project, from a simple online dataset, but I thought presenting it this way might help some people to better understand what was going on.
S: So which dataset did you use?
R: I used Johns Hopkins CSSE, which is really easy to download using standard tools. There's a lot more on there than what we are doing, but these were the ones I found interesting.
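As a sketch of what "standard tools" can look like, the example below reads the wide, one-column-per-date layout of the Johns Hopkins CSSE time series files using only Python's standard library. The URL in the comment is an assumption about where the confirmed-cases file lives, the helper name `country_totals` is hypothetical, and the sample rows are made-up numbers in the same column layout rather than real figures.

```python
import csv
import io

# Assumed location of the confirmed-cases file in the JHU CSSE GitHub repository:
# https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv


def country_totals(csv_text, country):
    """Sum every row for `country`, returning {date_label: cumulative_cases}."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # The first four columns are Province/State, Country/Region, Lat, Long;
    # every column after that is one date.
    date_labels = header[4:]
    totals = [0] * len(date_labels)
    for row in reader:
        if row[1] == country:
            for i, value in enumerate(row[4:]):
                totals[i] += int(value)
    return dict(zip(date_labels, totals))


# Made-up sample in the same layout, for illustration only.
SAMPLE = """Province/State,Country/Region,Lat,Long,3/1/20,3/2/20,3/3/20
,New Zealand,-41.0,174.0,1,1,3
Cook Islands,New Zealand,-21.0,-160.0,0,0,1
,Italy,42.0,12.0,10,20,40
"""

print(country_totals(SAMPLE, "New Zealand"))
```

Some countries are split across several province rows, which is why the sketch adds rows together instead of picking just one.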
S: Thank you Roger! I hope other people find it interesting as well. (Mutters darkly about slide tables and log tables, aka implements of perplexing torture at school, and wanders off!)