Graphing the pandemic with open data

Visualize

© Lead Image © lucadp, 123RF

Article from Issue 240/2020
Author(s):

A lot of COVID-19 data is available through online REST APIs. With a little ingenuity and some open source tools, you can extract and analyze the data yourself.

Travel is broadening. You experience different cultures, see issues from different angles, and meet fun and unique people. I love living abroad, but I have to admit that I have less access to the news from home. Unfortunately, news outlets today spend more time on opinion than facts, and sometimes I just want the unvarnished truth. During the pandemic era, I am especially anxious to learn about the challenges faced by my family back home.

The good news is that a lot of open data on COVID-19 is available on the Internet via REST API calls. This data might be too dry for some, but if you want to get your own impressions of the COVID-19 crisis, without the sometimes intrusive "analysis" of newscasters and commentators, this free Internet data is a valuable resource. This article describes how to access and display freely available COVID-19 data using open source tools. And, if you've already had your fill of COVID-19 information, the techniques I'll describe in this article will also help you with other kinds of government and academic data available through REST APIs.

CovidAPI

The CovidAPI project [1] provides COVID-19 data based on the well-respected Johns Hopkins University dataset [2]. The original Johns Hopkins data is available in CSV form. Argentine software developer Rodrigo Pomba converted the data to JSON time series format. According to the documentation [3], the goal of the project is to make the data "queryable in a manner in which it could be easily consumed to build public dashboards."

The CovidAPI data is organized by country using the list of ISO country codes [4]. Use the curl command in a terminal window to send a URL that will access the data for a specific country and date:

curl https://covidapi.info/api/v1/country/USA/2020-06-15

Calling this API, in this case with curl, returns the following JSON object:

{
  "count": 1,
  "result": {
    "2020-06-15": {
      "confirmed": 2114026,
      "deaths": 116127,
      "recovered": 576334
    }
  }
}

This command is an easy way to get the daily report for a specific country and date, but if you want to visualize and analyze the data yourself, you might prefer to request the values for all dates. If you leave off the date, you'll get the data for all available dates:

curl https://covidapi.info/api/v1/country/USA

This command returns one giant JSON message containing the records for every day in the dataset. However, I ran into problems parsing out the individual days due to the dashes that are part of each date. As an alternative approach, I chose to write a small Bash script that fetches the number of daily records and then iterates through the list of days to retrieve the COVID-19 information for each day (Listing 1). Most of the steps are self-explanatory if you are familiar with Bash scripts, but see the comment lines for additional information.
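For readers who would still like to parse the single large response, the following is a minimal sketch of one way it might be done with jq. It assumes that the all-dates response has the same shape as the single-date response shown earlier (one key per date under "result"); the function name json_to_rows is hypothetical and not part of the article's script. A key that contains dashes can also be addressed directly in jq with bracket syntax, as in .result["2020-06-15"].

```shell
#!/bin/bash
# Assumption: the all-dates response looks like the single-date response,
# with one key per date under "result". The sample reuses the record above.
SAMPLE='{"count":1,"result":{"2020-06-15":{"confirmed":2114026,"deaths":116127,"recovered":576334}}}'

# to_entries turns {"date": {...}} into [{"key": "date", "value": {...}}],
# so the dashes in the date keys never have to appear in a jq path.
json_to_rows() {
  jq -r '.result | to_entries[] | "\(.key)  \(.value.confirmed)  \(.value.deaths)  \(.value.recovered)"'
}

echo "$SAMPLE" | json_to_rows
```

Each date becomes one line in the same column order that Listing 1 writes to its data file.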

Listing 1

covid19.sh

#!/bin/bash

get_count()
{
  GATHERCOUNTRY=$1

  # get the count of days since the start of the dataset
  CNT=$(curl "https://covidapi.info/api/v1/country/$GATHERCOUNTRY" 2>/dev/null | jq '.count')
  echo "$CNT"
}

gather_state()
{
  GATHERSTATE=$1
  cnt=$2

  echo "gather state $GATHERSTATE $cnt days"
  DATAFILE=covid19_${GATHERSTATE}.Data

  # from beginning until yesterday
  IDX=$cnt

  # write the header line if the data file does not exist yet
  if [ ! -f "$DATAFILE" ]
  then
    echo "date  positive  hospitalized  deaths  " > "$DATAFILE"
  fi

  while [ "$IDX" -gt 0 ]
  do
    # the state API wants YYYYMMDD; the data file uses YYYY-MM-DD
    DATE=$(date --date="12:00 today -$IDX days" +%Y%m%d)
    FILEDATE=$(date --date="12:00 today -$IDX days" +%Y-%m-%d)

    URL="https://api.covidtracking.com/v1/states/${GATHERSTATE}/${DATE}.json"

    # only call the API if this date is not in the data file yet
    if ! grep -q "$FILEDATE" "$DATAFILE"
    then
      SINGLE=$(curl "$URL" 2>/dev/null)
      error=$(echo "$SINGLE" | jq ".error")
      if [ "$error" = "true" ]
      then
        # no data for this date; record zeros
        positive=0
        hospitalized=0
        deaths=0
      else
        positive=$(echo "$SINGLE" | jq ".positive")
        deaths=$(echo "$SINGLE" | jq ".death")
        hospitalized=$(echo "$SINGLE" | jq ".hospitalizedCurrently")

        # fields that were not reported come back as null
        if [ "$positive" = "null" ]; then positive=0; fi
        if [ "$deaths" = "null" ]; then deaths=0; fi
        if [ "$hospitalized" = "null" ]; then hospitalized=0; fi
        echo "$DATE $IDX"
      fi
      echo "$FILEDATE  $positive  $hospitalized  $deaths  " >> "$DATAFILE"
    fi

    IDX=$((IDX - 1))
  done
}

gather_data()
{
  GATHERCOUNTRY=$1
  cnt=$2

  echo "gather $GATHERCOUNTRY"
  DATAFILE=covid19_${GATHERCOUNTRY}.data

  # write the header line if the data file does not exist yet
  if [ ! -f "$DATAFILE" ]
  then
    echo initializing
    echo "date  confirm  deaths  recover  " > "$DATAFILE"
  fi

  # from beginning until yesterday
  IDX=$cnt

  while [ "$IDX" -gt 0 ]
  do
    DATE=$(date --date="12:00 today -$IDX days" +%Y-%m-%d)

    URL="https://covidapi.info/api/v1/country/${GATHERCOUNTRY}/${DATE}"

    # we only do this if this date hasn't been retrieved
    if ! grep -q "$DATE" "$DATAFILE"
    then
      SINGLE=$(curl "$URL" 2>/dev/null)
      ERR=$(echo "$SINGLE" | grep -c "404 Not Found")

      # only if date found
      if [ "$ERR" -eq 0 ]
      then
        # pretty-print the JSON, pick the matching line, and strip the key and comma
        deaths=$(echo "$SINGLE"  | jq '.' | grep deaths  | sed 's/.*: //' | sed 's/,//')
        confirm=$(echo "$SINGLE" | jq '.' | grep confirm | sed 's/.*: //' | sed 's/,//')
        recover=$(echo "$SINGLE" | jq '.' | grep recover | sed 's/.*: //' | sed 's/,//')

        echo "$DATE $IDX"
        echo "$DATE  $confirm  $deaths  $recover  " >> "$DATAFILE"
      fi
    fi

    IDX=$((IDX - 1))
  done
}

CNT=$(get_count USA)
echo "$CNT days"
gather_data USA "$CNT"

# just use the two-letter state code (e.g., ny for New York)
gather_state mn "$CNT"
gather_state ca "$CNT"
gather_state ia "$CNT"
gather_state mo "$CNT"
gather_state mt "$CNT"

CNT=$(get_count DEU)
gather_data DEU "$CNT"

CNT=$(get_count ESP)
gather_data ESP "$CNT"

CNT=$(get_count GBR)
gather_data GBR "$CNT"

gnuplot graphs.gp

One part of the script that might not be obvious is how I calculate the date:

DATE=$(date --date="12:00 today -$IDX days" +%Y-%m-%d)

The date command starts from noon today, subtracts $IDX days, and formats the result as a YYYY-MM-DD string.
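The date calculation can be tried on its own with a short sketch like the one below. It assumes GNU date (the --date option is not available in every date implementation), and the helper name days_before_today is hypothetical. Starting from 12:00 rather than the current time is presumably a guard against daylight saving time changes, which can otherwise shift a relative date across midnight.

```shell
#!/bin/bash
# Hypothetical helper (not in the article's script): print the date that lies
# n days before today, using the same GNU date expression as Listing 1.
days_before_today() {
  local n=$1
  date --date="12:00 today -${n} days" +%Y-%m-%d
}

# Walk from three days ago up to yesterday, oldest first, the same way the
# loops in Listing 1 count IDX down toward 1.
for idx in 3 2 1; do
  days_before_today "$idx"
done
```

The output depends on the day the script runs, so no sample output is shown here.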

Of course, it would be inefficient to download hundreds of days' worth of data each time if I just want yesterday's data. For this reason, the script checks whether the data for a given date has already been stored before it makes the REST API call. The first time you run the script, you get all the data, and on subsequent runs, you only get the new data.
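The check itself is a grep against the data file. The following is a minimal sketch of that idea in isolation; the function name already_fetched and the temporary file are illustrative assumptions, and the sample row reuses the figures from the JSON response shown earlier.

```shell
#!/bin/bash
# Hypothetical helper name; the article's script inlines this logic.
already_fetched() {
  local day=$1 datafile=$2
  # -q: no output, exit status only; -s: stay quiet if the file is missing
  grep -qs "^${day} " "$datafile"
}

# Build a small data file in the same layout that Listing 1 writes.
DATAFILE=$(mktemp)
echo "date  confirm  deaths  recover  " > "$DATAFILE"
echo "2020-06-15  2114026  116127  576334  " >> "$DATAFILE"

for day in 2020-06-15 2020-06-16; do
  if already_fetched "$day" "$DATAFILE"; then
    echo "$day: cached, skipping"
  else
    echo "$day: missing, would call the API"
  fi
done
rm -f "$DATAFILE"
```

Anchoring the pattern at the start of the line is a small design choice: it keeps a date from matching digits that happen to appear in one of the value columns.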

Data by State

Retrieving COVID-19 figures for a whole country is useful for comparing one country against another, but it is less helpful if you want to know what is really happening locally. The COVID Tracking Project [5] provides COVID-19 data by US state. (Similar projects track pandemic data for other countries – consult your local health resources.)

Just like at the national level, it is possible to retrieve all COVID-19 information by US state for a given date with REST API calls. For instance, to obtain data on the state of Minnesota for August 21, 2020:

curl https://api.covidtracking.com/v1/states/mn/20200821.json | jq "."

The state data, unlike the national data, contains an amazing number of statistics submitted by the health authorities. The sheer number of values provided can perhaps only truly be appreciated by an epidemiologist or a statistician.

You can see how many people were hospitalized on a given day or the total number of hospitalizations up until that day (Listing 2). Also included are the incremental changes in positive as well as negative test results. Using this information, I could have graphed how quickly COVID-19 is spreading by graphing positiveCasesViral vs. totalTestsViral or by graphing hospitalizedCurrently over time.

Listing 2

Hospitalizations and Deaths by State

{
  "date": 20200415,
  "state": "MN",
  "positive": 2321,
  "negative": 41245,
  "pending": null,
  "hospitalizedCurrently": 197,
  "hospitalizedCumulative": 445,
  "inIcuCurrently": 93,
  "inIcuCumulative": 175,
  "onVentilatorCurrently": null,
  "onVentilatorCumulative": null,
  "recovered": 853,
  "dataQualityGrade": "A",
  "lastUpdateEt": "4/14/2020 17:00",
  "dateModified": "2020-04-14T17:00:00Z",
  "checkTimeEt": "04/14 13:00",
  "death": 87,
  "hospitalized": 445,
  "dateChecked": "2020-04-14T17:00:00Z",
  "totalTestsViral": 43566,
  "positiveTestsViral": null,
  "negativeTestsViral": null,
  "positiveCasesViral": null,
  "fips": "27",
  "positiveIncrease": 156,
  "negativeIncrease": 1540,
  "total": 43566,
  "totalTestResults": 43566,
  "totalTestResultsIncrease": 1696,
  "posNeg": 43566,
  "deathIncrease": 8,
  "hospitalizedIncrease": 40,
  "hash": "9521e0ce1f2b1ef5aaf1a81bec48961d85170d78",
  "commercialScore": 0,
  "negativeRegularScore": 0,
  "negativeScore": 0,
  "positiveScore": 0,
  "score": 0,
  "grade": ""
}

I settled on gathering positives, numbers of people hospitalized, and deaths at a state level. I didn't try to verify that all state totals added up at the national level, as I suspect there can be delays in the reporting chain from the local to the national level.

I can imagine the massive effort it took to agree on a common structure and to get all the participants to gather all of these types of data. Despite these efforts, the data returned sometimes contained fields that were blank, zero, or simply null.
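Null fields can also be replaced with a default inside jq itself. The following is a minimal sketch, not the approach taken in Listing 1; the function name state_row is hypothetical, and the sample is a subset of the fields from Listing 2, with onVentilatorCurrently included only because it is null there.

```shell
#!/bin/bash
# Subset of the fields from Listing 2, including one that is null.
SAMPLE='{"positive":2321,"hospitalizedCurrently":197,"death":87,"onVentilatorCurrently":null}'

# jq's alternative operator (//) yields the right-hand side when the left-hand
# side is null or false, so each missing value becomes 0 in a single jq call.
state_row() {
  jq -r '"\(.positive // 0)  \(.hospitalizedCurrently // 0)  \(.death // 0)  \(.onVentilatorCurrently // 0)"'
}

echo "$SAMPLE" | state_row
```

For this sample, the null ventilator count is printed as 0 while the other three values pass through unchanged.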

Comparing Countries

My COVID-19 gathering script collects the information for four different countries (Great Britain, USA, Spain, and Germany), as well as statistics for a few US states. This data is stored in plain text files, but the information that I am gathering essentially looks like Table 1.

Table 1

Sample of Downloaded Data

Country   Date      Confirmed   Deaths   Recovered
USA       2/21/20   15          0        5
Germany   2/21/20   16          0        14
England   2/21/20   9           0        8
Spain     2/21/20   2           0        2
