Graphing the pandemic with open data

Visualize

© Lead Image © lucadp, 123RF

Author(s): Christopher Dock

A lot of COVID-19 data is available through online REST APIs. With a little ingenuity and some open source tools, you can extract and analyze the data yourself.

Travel is broadening. You experience different cultures, see issues from different angles, and meet fun and unique people. I love living abroad, but I have to admit that I have less access to the news from home. Unfortunately, news outlets today spend more time on opinion than facts, and sometimes I just want the unvarnished truth. During the pandemic era, I am especially anxious to learn about the challenges faced by my family back home.

The good news is that a lot of open data on COVID-19 is available on the Internet via REST API calls. This data might be too dry for some, but if you want to get your own impressions of the COVID-19 crisis, without the sometimes intrusive "analysis" of newscasters and commentators, this free Internet data is a valuable resource. This article describes how to access and display freely available COVID-19 data using open source tools. And, if you've already had your fill of COVID-19 information, the techniques I'll describe in this article will also help you with other kinds of government and academic data available through REST APIs.

CovidAPI

The CovidAPI project [1] provides COVID-19 data based on the well-respected Johns Hopkins University dataset [2]. The original Johns Hopkins data is available in CSV form; Argentine software developer Rodrigo Pomba converted it to a JSON time series format. According to the documentation [3], the goal of the project is to make the data "queryable in a manner in which it could be easily consumed to build public dashboards."

The CovidAPI data is organized by country using the list of ISO country codes [4]. Use the curl command in a terminal window to send a URL that will access the data for a specific country and date:

curl https://covidapi.info/api/v1/country/USA/2020-06-15

Calling this API, in this case with curl, returns the following JSON object:

{
  "count": 1,
  "result": {
    "2020-06-15": {
      "confirmed": 2114026,
      "deaths": 116127,
      "recovered": 576334
    }
  }
}

This command is an easy way to get the daily report for a specific country and date, but if you want to visualize and analyze the data yourself, you might prefer to request the values for all dates. If you leave off the date, you'll get the data for all available dates:

curl https://covidapi.info/api/v1/country/USA

This command returns one giant JSON message containing the records for every day in the dataset. However, I ran into problems parsing out the individual days because the date keys contain dashes. As an alternative, I wrote a small Bash script that fetches the count of daily records and then iterates through the list of days, retrieving the COVID-19 information for each day separately (Listing 1). Most of the steps are self-explanatory if you are familiar with Bash scripts, but see the comment lines for additional information.
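For readers who would rather parse the full response in one pass, the dashed keys need not be an obstacle: jq can iterate over them with to_entries or address a single one with bracket syntax. The sketch below is not part of the author's script. It assumes the all-dates response nests one object per date under result, like the single-day reply shown earlier, and json_to_rows is a hypothetical helper name. The sample values are the USA figures quoted in this article.

```shell
#!/bin/bash
# Sample payload; the layout of the all-dates response is an assumption
# modeled on the single-day reply.
SAMPLE='{"count":2,"result":{"2020-02-21":{"confirmed":15,"deaths":0,"recovered":5},"2020-06-15":{"confirmed":2114026,"deaths":116127,"recovered":576334}}}'

# json_to_rows (hypothetical helper): read JSON on stdin and print one
# "date  confirmed  deaths  recovered" line per date key.
json_to_rows()
{
  jq -r '.result | to_entries[] | "\(.key)  \(.value.confirmed)  \(.value.deaths)  \(.value.recovered)"'
}

echo "$SAMPLE" | json_to_rows

# A single dashed key can be addressed with bracket syntax.
echo "$SAMPLE" | jq '.result["2020-06-15"].deaths'
```

The rows come out in the same column order that Listing 1 writes to its data file, so they could be appended to it directly.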

Listing 1

covid19.sh

#!/bin/bash

# Print the number of daily records available for a country ($1 = ISO code).
get_count()
{
  GATHERCOUNTRY=$1

  # get the count of days since the start
  CNT=$(curl "https://covidapi.info/api/v1/country/$GATHERCOUNTRY" 2>/dev/null | jq '.count')
  echo "$CNT"
}

# Collect positives, current hospitalizations, and deaths for a US state
# ($1 = two-letter state code, $2 = number of days to go back).
gather_state()
{
  GATHERSTATE=$1
  cnt=$2

  echo "gather state $GATHERSTATE $cnt days"
  DATAFILE=covid19_${GATHERSTATE}.Data

  # from beginning until yesterday (0 if no count was passed in)
  IDX=${cnt:-0}

  # write the column header when the file is first created
  if [ ! -f "$DATAFILE" ]
  then
    echo "date  positive  hospitalized  deaths  " > "$DATAFILE"
  fi

  while [ "$IDX" -gt 0 ]
  do
    # the API expects YYYYMMDD; the data file stores YYYY-MM-DD
    DATE=$(date --date="12:00 today -$IDX days" +%Y%m%d)
    FILEDATE=$(date --date="12:00 today -$IDX days" +%Y-%m-%d)

    CMD="curl https://api.covidtracking.com/v1/states/${GATHERSTATE}/${DATE}.json"

    # we only do this if this date hasn't been retrieved
    if ! grep -q "$FILEDATE" "$DATAFILE"
    then
      SINGLE=$($CMD 2>/dev/null)
      error=$(echo "$SINGLE" | jq ".error")
      if [ "$error" = "true" ]
      then
        # no record for this date: store zeros
        positive=0
        hospitalized=0
        deaths=0
      else
        positive=$(echo "$SINGLE" | jq ".positive")
        deaths=$(echo "$SINGLE" | jq ".death")
        hospitalized=$(echo "$SINGLE" | jq ".hospitalizedCurrently")

        # fields the state did not report come back as null
        if [ "$positive" = "null" ]; then positive=0; fi
        if [ "$deaths" = "null" ]; then deaths=0; fi
        if [ "$hospitalized" = "null" ]; then hospitalized=0; fi
        echo "$DATE $IDX"
      fi
      echo "$FILEDATE  $positive  $hospitalized  $deaths  " >> "$DATAFILE"
    fi

    IDX=$((IDX - 1))
  done
}

# Collect confirmed cases, deaths, and recoveries for a country
# ($1 = ISO country code, $2 = number of days to go back).
gather_data()
{
  GATHERCOUNTRY=$1
  cnt=$2

  echo "gather $GATHERCOUNTRY"
  DATAFILE=covid19_${GATHERCOUNTRY}.data

  # write the column header when the file is first created
  if [ ! -f "$DATAFILE" ]
  then
    echo initializing
    echo "date  confirm  deaths  recover  " > "$DATAFILE"
  fi

  # from beginning until yesterday (0 if no count was passed in)
  IDX=${cnt:-0}

  while [ "$IDX" -gt 0 ]
  do
    DATE=$(date --date="12:00 today -$IDX days" +%Y-%m-%d)

    CMD="curl https://covidapi.info/api/v1/country/${GATHERCOUNTRY}/${DATE}"

    # we only do this if this date hasn't been retrieved
    if ! grep -q "$DATE" "$DATAFILE"
    then
      SINGLE=$($CMD 2>/dev/null)
      ERR=$(echo "$SINGLE" | grep -c "404 Not Found")

      # only if date found
      if [ "$ERR" -eq 0 ]
      then
        # pretty-print with jq, then cut the number out of the matching line
        deaths=$(echo "$SINGLE"  | jq '.' | grep deaths  | sed 's/.*: //; s/,//')
        confirm=$(echo "$SINGLE" | jq '.' | grep confirm | sed 's/.*: //; s/,//')
        recover=$(echo "$SINGLE" | jq '.' | grep recover | sed 's/.*: //; s/,//')

        echo "$DATE $IDX"
        echo "$DATE  $confirm  $deaths  $recover  " >> "$DATAFILE"
      fi
    fi

    IDX=$((IDX - 1))
  done
}

CNT=$(get_count USA)
echo "$CNT days"
gather_data USA "$CNT"

# just use the two-letter state code (e.g., ny for New York)
gather_state mn "$CNT"
gather_state ca "$CNT"
gather_state ia "$CNT"
gather_state mo "$CNT"
gather_state mt "$CNT"

CNT=$(get_count DEU)
gather_data DEU "$CNT"

CNT=$(get_count ESP)
gather_data ESP "$CNT"

CNT=$(get_count GBR)
gather_data GBR "$CNT"

gnuplot graphs.gp

One part of the script that might not be obvious is how I calculate the date:

DATE=`date --date="12:00 today -$IDX days" +%Y-%m-%d`

The date command subtracts a given number of days from the current date and formats the output as a YYYY-MM-DD string.
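To see the date arithmetic in isolation, the expression can be wrapped in a small function. This is a minimal sketch that assumes GNU date, as Listing 1 does; days_ago is a hypothetical helper name, not part of the author's script. Anchoring the calculation at 12:00 is presumably a guard against daylight saving shifts, which could otherwise push a result computed near midnight onto the wrong calendar day.

```shell
#!/bin/bash
# days_ago (hypothetical helper): print the date N days before today
# in YYYY-MM-DD form. Requires GNU date for the --date option.
days_ago()
{
  date --date="12:00 today -$1 days" +%Y-%m-%d
}

days_ago 1     # yesterday
days_ago 7     # one week ago
days_ago 30
```

Swapping the format string for +%Y%m%d yields the undashed form that the state-level API expects.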

Of course, it would be inefficient to download hundreds of days' worth of data each time if I just want yesterday's figures. For this reason, the script checks whether a date is already present in the data file before making the REST API call for it. The first time you run the script, you get all the data; on subsequent runs, you only get the new data.
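The check itself boils down to a grep against the data file. The following sketch isolates that idea with a temporary file and no network access; already_fetched is a hypothetical helper name, and anchoring the pattern at the start of the line is a small tightening that the original script does not apply.

```shell
#!/bin/bash
# already_fetched (hypothetical helper): succeed if DATAFILE exists and
# already holds a row that starts with DATE.
already_fetched()
{
  local date=$1 datafile=$2
  [ -f "$datafile" ] && grep -q "^$date" "$datafile"
}

# Demonstration with a temporary file instead of a real data file.
DEMOFILE=$(mktemp)
echo "2020-06-15  2114026  116127  576334" > "$DEMOFILE"

for DATE in 2020-06-15 2020-06-16
do
  if already_fetched "$DATE" "$DEMOFILE"
  then
    echo "$DATE cached, skipping"
  else
    echo "$DATE missing, would call the API"
  fi
done

rm -f "$DEMOFILE"
```

Because each run only appends the dates it has not seen, the data file doubles as a simple cache.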

Data by State

Retrieving COVID-19 figures for a whole country is useful for comparing one country against another, but it is less than helpful if you want to know what is really happening locally. The Covid Tracking Project [5] provides COVID-19 data by US state. (Similar projects track pandemic data for other countries – consult your local health resources.)

Just like at the national level, it is possible to retrieve all COVID-19 information by US state for a given date with REST API calls. For instance, to obtain data on the state of Minnesota for August 21, 2020:

curl https://api.covidtracking.com/v1/states/mn/20200821.json | jq "."

The state data, unlike the national data, contains an amazing number of statistics submitted by the health authorities. The sheer number of values provided can perhaps only truly be appreciated by an epidemiologist or a statistician.

You can see how many people were hospitalized on a given day or the total number of hospitalizations up until that day (Listing 2). Also included are the incremental changes in positive as well as negative test results. Using this information, I could have shown how quickly COVID-19 is spreading by graphing positiveCasesViral against totalTestsViral or by graphing hospitalizedCurrently over time.

Listing 2

Hospitalizations and Deaths by State

{
  "date": 20200415,
  "state": "MN",
  "positive": 2321,
  "negative": 41245,
  "pending": null,
  "hospitalizedCurrently": 197,
  "hospitalizedCumulative": 445,
  "inIcuCurrently": 93,
  "inIcuCumulative": 175,
  "onVentilatorCurrently": null,
  "onVentilatorCumulative": null,
  "recovered": 853,
  "dataQualityGrade": "A",
  "lastUpdateEt": "4/14/2020 17:00",
  "dateModified": "2020-04-14T17:00:00Z",
  "checkTimeEt": "04/14 13:00",
  "death": 87,
  "hospitalized": 445,
  "dateChecked": "2020-04-14T17:00:00Z",
  "totalTestsViral": 43566,
  "positiveTestsViral": null,
  "negativeTestsViral": null,
  "positiveCasesViral": null,
  "fips": "27",
  "positiveIncrease": 156,
  "negativeIncrease": 1540,
  "total": 43566,
  "totalTestResults": 43566,
  "totalTestResultsIncrease": 1696,
  "posNeg": 43566,
  "deathIncrease": 8,
  "hospitalizedIncrease": 40,
  "hash": "9521e0ce1f2b1ef5aaf1a81bec48961d85170d78",
  "commercialScore": 0,
  "negativeRegularScore": 0,
  "negativeScore": 0,
  "positiveScore": 0,
  "score": 0,
  "grade": ""
}

I settled on gathering positives, numbers of people hospitalized, and deaths at a state level. I didn't try to verify that all state totals added up at the national level, as I suspect there can be delays in the reporting chain from the local to the national level.

I can imagine the massive effort required to come up with a common structure and to get all the participants to gather all of these types of data. Despite these efforts, the data returned sometimes contained fields that were blank, had zeros, or simply had the value null.
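One way to cope with the null values is to let jq substitute a default while extracting the field. The sketch below uses jq's alternative operator (//), which returns the right-hand value when the left-hand side is null or false. The record is a trimmed copy of Listing 2, and field_or_zero is a hypothetical helper name rather than something from the author's script.

```shell
#!/bin/bash
# Trimmed copy of the Listing 2 record; only a few fields are kept.
RECORD='{"date":20200415,"state":"MN","positive":2321,"hospitalizedCurrently":197,"death":87,"onVentilatorCurrently":null}'

# field_or_zero (hypothetical helper): read a JSON record on stdin and
# print the named field, or 0 if the field is null or absent.
field_or_zero()
{
  jq -r --arg f "$1" '.[$f] // 0'
}

echo "$RECORD" | field_or_zero positive                # prints 2321
echo "$RECORD" | field_or_zero onVentilatorCurrently   # prints 0
echo "$RECORD" | field_or_zero noSuchField             # prints 0
```

Whether a missing value should really be plotted as zero is a judgment call; a zero in the middle of a cumulative series shows up as a visible dip in the graph.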

Comparing Countries

My COVID-19 gathering script collects the information for four different countries (Great Britain, USA, Spain, and Germany), as well as statistics for a few US states. The script stores the data in one text file per country or state; taken together, the country data looks similar to Table 1.

Table 1

Sample of Downloaded Data

Country        Date     Confirmed  Deaths  Recovered
USA            2/21/20  15         0       5
Germany        2/21/20  16         0       14
Great Britain  2/21/20  9          0       8
Spain          2/21/20  2          0       2

Plotting the Data

Tabular data is very dense and conveys a lot of information. However, it is also rather dry, and when the volume of data is too great, it can be difficult to spot trends. I remembered the graphing tool gnuplot [6], which I have used in the past to give data a friendlier look (Figure 1).

Figure 1: US COVID-19 statistics.

Gnuplot is a cross-platform 2D and 3D graphing tool. You can use gnuplot to create line graphs, bar charts, histograms, or even candlestick charts.

Gnuplot is a command-line program that can read its commands from a pipe and write the finished graph to standard output. You can redirect that output to a file in the shell, or you can name the output file inside the script itself with set output, as Listing 3 does.
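As a rough illustration of the pipe approach, a shell function can print the gnuplot commands and hand them to gnuplot on standard input. This is a sketch under stated assumptions: plot_script is a hypothetical helper name, the data file is assumed to have dates in column 1 in the format written by Listing 1, and usa.png is an arbitrary output name.

```shell
#!/bin/bash
# plot_script (hypothetical helper): print gnuplot commands that draw
# column 2 of DATAFILE against the dates in column 1 and write a PNG.
plot_script()
{
  local datafile=$1 outfile=$2
  cat <<EOF
set term png size 1000, 800
set output '$outfile'
set xdata time
set timefmt "%Y-%m-%d"
set format x "%Y-%m-%d"
plot "$datafile" using 1:2 title 'confirmed'
EOF
}

# Pipe the commands into gnuplot only when it is installed and the data
# file exists; otherwise just show what would be sent.
if command -v gnuplot >/dev/null && [ -f covid19_USA.data ]
then
  plot_script covid19_USA.data usa.png | gnuplot
else
  plot_script covid19_USA.data usa.png
fi
```

Leaving out the set output line makes gnuplot write the image to standard output instead, so the shell can redirect it to a file of your choice.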

Gnuplot is available in the package repositories of many popular Linux distributions:

sudo apt-get install gnuplot

Despite the name, gnuplot is not affiliated with the GNU project, and, although it is free to use and redistribute, it has an unusual license. Under this license, the complete modified source code may not be redistributed: "Modifications are to be distributed as patches to the released version." It is still possible to release your own modified binaries of gnuplot, subject to a few conditions that are covered in the copyright statement [7].

Gnuplot makes it possible to save your graphed output in quite a few different formats. You can save the output in all the common graphic file formats (PNG, GIF, JPEG, and SVG), as well as in less common output types such as PostScript, PDF, or LaTeX.

When you start gnuplot as a command interpreter, it creates a graphical window where your graphed data will be displayed. Thus you can interactively test out some plotting options (Figure 2).

Figure 2: International confirmed cases.

One of the additional advantages of running gnuplot as an interpreter is that, once you are satisfied with the results, you can save the current settings and plot commands to a file. Conversely, you can load such a command file back into the interpreter.

load "plotcommands.ext"
save "plotcommands.ext"

The actual script for generating graphs from the collected data is quite short (Listing 3). This script actually demonstrates how powerful gnuplot is. The main steps for drawing any graph are:

  • defining the units on the X and Y axes
  • labeling the axes
  • plotting the data

These steps are all depicted in Listing 3.

Listing 3

graphs.gp

# Reset all plotting variables to their default values.
reset
clear

# set the terminal type (i.e., output format)
# also set the width and height
set term png size 1000, 800

# set x & y axis description
set xlabel "Time axis" font ",20"
set ylabel "Confirmed" font ",20"

# set up x and y axis values
set xdata time
set timefmt "%Y-%m-%d"
set xrange [ "2020-01-22":* ]
set format x "%Y-%m-%d"
set yrange [ 0:* ]

# graph of USA statistics
set output 'country.png'
set title "Covid 19 \nUS Cumulative cases" font ",30"
plot "covid19_USA.data" using 1:2 title 'US confirmed' with boxes, \
     "" using 1:3 title 'US deaths' with boxes, \
     "" using 1:4 title 'US recovered' with boxes


# graph of Minnesota statistics
set output 'statemn.png'
set title "Covid 19 \nMN Cumulative cases" font ",30"
plot "covid19_mn.Data" using 1:2 title 'positive' , \
     "" using 1:3 title 'hospitalized' , \
     "" using 1:4 title 'deaths'


# set legend below the graph
set key below font ",15"

# compare a few states against each other
set output 'statecompare.png'
set title "Covid 19 \nPositive Tests" font ",30"
plot "covid19_mn.Data" using 1:2 title 'Minnesota' , \
     "covid19_ca.Data" using 1:2 title 'California' , \
     "covid19_ia.Data" using 1:2 title 'Iowa' , \
     "covid19_mo.Data" using 1:2 title 'Missouri' , \
     "covid19_mt.Data" using 1:2 title 'Montana'


# compare USA against other countries
set output 'confirm.png'
set title "Covid 19 \nInternational confirmed cases" font ",30"
plot "covid19_USA.data" using 1:2 title 'USA' , \
     "covid19_DEU.data" using 1:2 title 'DEU' , \
     "covid19_ESP.data" using 1:2 title 'ESP' , \
     "covid19_GBR.data" using 1:2 title 'GBR'

The plot statement in Listing 3 is a bit confusing until you recognize that each set of data can come from a different file, and that using 1:2 means column 1 from the data file will be on the X axis and column 2 will be on the Y axis. An empty file name ("") tells gnuplot to reuse the file named in the previous entry of the same plot command.
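Because every entry in a plot statement follows the same pattern, the statement can also be generated instead of typed. The sketch below builds one from a list of country codes; plot_line is a hypothetical helper name, and the covid19_<CODE>.data file naming is the one used by Listing 1.

```shell
#!/bin/bash
# plot_line (hypothetical helper): print a single gnuplot plot command
# that overlays column COL of covid19_<CODE>.data for each code given.
plot_line()
{
  local col=$1 sep="plot " code
  shift
  for code in "$@"
  do
    printf "%s\"covid19_%s.data\" using 1:%s title '%s'" "$sep" "$code" "$col" "$code"
    sep=", "
  done
  printf '\n'
}

# Column 2 holds the confirmed cases in the country data files.
plot_line 2 USA DEU ESP GBR
```

Changing the first argument to 3 or 4 would compare deaths or recoveries across the same countries.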

The comparative graph of the individual state infections (Figure 3) is much more helpful than viewing all the US figures in tabular form.

Figure 3: Comparing positive tests by state.

Conclusion

The scripts described in this article are available at the Linux Magazine website [8]. I could have gone even further and collected information from the Twitter accounts of state governors and health departments [9], but I don't think important health information can be summarized in 280 characters. Besides, I am not much of a Twitter follower.

The Author

Christopher Dock is a senior consultant at T-Systems on site services GmbH. When he is not working on integration projects, he likes to experiment with Raspberry Pi solutions and other electronics projects. You can read more about his work at http://blog.paranoidprofessor.com. If you email him at christopher.dock@t-systems.com, he will gladly answer any questions.