Using Pandas or wget and awk to Parse COVID-19 Data or: How I Learned to Stop Worrying and Love the Virus

There are various places one can obtain COVID-19 data. This post uses data taken from The New York Times and Johns Hopkins University.

The New York Times data can be found here and the Hopkins data here.

Install pandas:

$ pip install pandas
from pandas import read_csv

df = read_csv("us-states.csv")

play around some:


>>> df.describe()

# get more into it:

>>> df[["state", "cases"]]

# keep going...

>>> number_of_cases_by_state = df[["state", "cases"]]

# maybe this?

>>> number_of_cases_by_state.query("state == 'Washington'").groupby("state").max()

# aha!
>>> number_of_cases_by_state.query("state == 'Washington'").groupby("state").max().to_json()
'{"cases":{"Washington":5588}}'

# pull in urllib3 now...
$ pip install urllib3
>>> import urllib3

>>> http = urllib3.PoolManager()

>>> r = http.request("GET", "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")

>>> r.status

200

>>> from io import BytesIO
>>> df = read_csv(BytesIO(r.data))

>>> df.cases.max()

### as of 4/1 @ 9AM PST

83889

wget, awk, and some sed...

It appears the Times doesn't update the CSV with the current day's stats until the next day. That is somewhat annoying. Anyway, I wrote a small bash script to query the CSV by date (rudimentary):

#!/usr/bin/env bash

if [ -z "$1" ]; then
        printf "Please supply a date range!\n(e.g., 2 days ago)\n"
        exit 1
fi

frame=$(date --date="$1" +%F)

awk -v time="$frame" -F, '$1 == time { deaths[$2] += $5 }
END      { for (name in deaths) print name, deaths[name] }' us-states.csv

awk is quite the powerful little tool. Combine it with sed and you can get results rather painlessly. -v assigns a variable for use inside the awk program (it is not an environment variable), and -F, tells awk the field separator is a comma; otherwise awk splits fields on whitespace.
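Both flags can be sketched in one line; the sample record below mimics the us-states.csv layout (date,state,fips,cases,deaths), with made-up but plausible numbers:

```shell
# select a record by the date passed in with -v, split fields on commas with -F,
echo "2020-04-01,Washington,53,5588,231" | \
awk -v time="2020-04-01" -F, '$1 == time { print $2, $5 }'
# prints: Washington 231
```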

I wonder if there is a better way to get sed to "know" how many lines are within a given file. In any case, I just did this:

sed `wc -l < us-states.csv`q

The q here stands for quit: sed prints each line until it reaches the given line number, then quits, so sed Nq prints the first N lines. Stream editors are cool like that. The code above is rather ugly, so please forgive me. The reason for the < is that wc -l < file makes wc read from standard input; with no file argument, wc has no file name to report, so the output is just the line count.
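Both tricks can be sketched on a throwaway three-line file:

```shell
printf 'a\nb\nc\n' > /tmp/demo.txt

wc -l /tmp/demo.txt      # prints: 3 /tmp/demo.txt
wc -l < /tmp/demo.txt    # prints: 3 (stdin has no name to report)

sed 2q /tmp/demo.txt     # prints lines 1 and 2, then quits
```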

Moving on, let's tie in wget, awk, and cron. Then we can really get going:

#!/usr/bin/env bash
wget -q -O data.csv "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"

sed `wc -l < data.csv`q data.csv | awk -v date="$(date --date="yesterday" +%F)" -F, \
'$1 == date { printf("%s\n\t cases: %d\n\t deaths: %d\n", $2, $4, $5) }' > /dev/tty1

  Texas
    cases: 8115
    deaths: 160
  Utah
    cases: 1675
    deaths: 13
...

*/30 * * * * bash /home/you/fetch-nytimes-covid-19-csv.sh

The -q flag means "do this quietly, please," and -O saves the output to a file named data.csv.

But I Don't Want to Ride the Elevator!

The Johns Hopkins Whiting School of Engineering actually offers a more "robust" CSV, which can be found here. The great thing about this data is that it's updated on a regular schedule: the data files are refreshed once a day at 23:59 UTC. That means we can run our cron job at a time that actually makes sense, rather than every 30 minutes.

First, make sure your server is set to UTC time.

On Debian this can be accomplished a number of ways. I'll use the dpkg-reconfigure command.

# make sure it's installed and if not...
$ apt-get install debconf
$ dpkg-reconfigure tzdata

Pick "None of the above" (older releases list it as "Etc") and UTC will be listed.
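With the server clock on UTC, the cron entry can fire once a day, a few minutes after the 23:59 UTC refresh, instead of every half hour. The script path below is just a placeholder:

```shell
# m h dom mon dow command -- run daily at 00:05 UTC, after the data refresh lands
5 0 * * * bash /home/you/fetch-jhu-covid-19-csv.sh
```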

Here's a close to final draft:

#!/usr/bin/env bash
wget -q -O data.csv "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/$(date --date="yesterday" +%m-%d-%Y).csv"

sed `wc -l < data.csv`q data.csv | LC_ALL=en_US.UTF-8 awk -v updated='2020-04-07 23:00:00' -F, \
'$5 > updated { deaths[$3] += $9; confirmed[$3] += $8 }
END { for (name in deaths)
printf("%s\n Cases: %'"'"'d\n Deaths: %'"'"'d\n", name, confirmed[name], deaths[name]) }' > output

The $(date --date="yesterday") will become superfluous once we've set up our updated cron job.

LC_ALL=en_US.UTF-8 allows us to place commas between the digits for better readability on output.
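To see the flag do its thing in isolation (bash's printf honors the same ' grouping flag awk uses; the grouped form requires the en_US.UTF-8 locale to actually be installed):

```shell
# the C locale defines no thousands separator, so %'d groups nothing
LC_ALL=C printf "%'d\n" 1234567            # prints: 1234567
# a locale that defines one groups the digits (if en_US.UTF-8 is installed)
LC_ALL=en_US.UTF-8 printf "%'d\n" 1234567
```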

If you would like to know more:

[LC_ALL] is typically used in scripts that run particular programs. For example, configure scripts generated by GNU autoconf use LC_ALL to make sure that the configuration tests don’t operate in locale dependent ways. [1]

Ditching sed, we're getting closer to an actually usable program:

#!/usr/bin/env bash
wget -q -O data.csv \
"https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/$(date --date="yesterday" +%m-%d-%Y).csv"

# two input files: $1, the JHU daily report; $2, a CSV carrying the
# state name in column 5 and its population in column 16
LC_ALL=en_US.UTF-8 awk -v updated="$(date --date="yesterday" +%Y-%m-%d) 23:00:00" -F, \
'FNR==NR && $5 > updated { deaths[$3] += $9; confirmed[$3] += $8; next }
{
        for (state in deaths)
        if (state == $5 && deaths[state] > 0)
                printf("%s\n Population: %'"'"'d\n Deaths: %'"'"'d\n Confirmed: %'"'"'d\n %% Dead:\n\t%f\n %% Confirmed:\n\t%f\n",
                state, $16, deaths[state],
                confirmed[state],
                deaths[state] / $16 * 100,
                confirmed[state] / $16 * 100)
}' "$1" "$2" > /dev/stdout

[1] https://www.gnu.org/software/gettext/manual/html_node/Locale-Environment-Variables.html


page last modified: 2020-04-29