March 3, 2020

open-covid-19/data

Crowd-sourced COVID-19 data

repo name open-covid-19/data
repo link https://github.com/open-covid-19/data
homepage
language Jupyter Notebook
size (curr.) 4238 kB
stars (curr.) 58
created 2020-03-14
license

Open COVID-19 Dataset

This repo contains free datasets of historical data related to COVID-19. The current datasets are:

  • World:

    • Date: ISO 8601 date (YYYY-MM-DD) of the datapoint
    • CountryCode: 2-letter ISO 3166-1 code of the country
    • CountryName: American English name of the country
    • Confirmed: total number of cases confirmed after positive test
    • Deaths: total number of deaths from a positive COVID-19 case
    • Latitude: floating point value representing the geographic coordinate
    • Longitude: floating point value representing the geographic coordinate
  • China:

    • Date: ISO 8601 date (YYYY-MM-DD) of the datapoint
    • Region: American English name of the province
    • CountryCode: 2-letter ISO 3166-1 code of the country
    • CountryName: American English name of the country
    • Confirmed: total number of cases confirmed after positive test
    • Deaths: total number of deaths from a positive COVID-19 case
    • Latitude: floating point value representing the geographic coordinate
    • Longitude: floating point value representing the geographic coordinate
  • USA:

    • Date: ISO 8601 date (YYYY-MM-DD) of the datapoint
    • Region: 2-letter state code (e.g. CA, FL, NY)
    • CountryCode: 2-letter ISO 3166-1 code of the country
    • CountryName: American English name of the country
    • Confirmed: total number of cases confirmed after positive test
    • Deaths: total number of deaths from a positive COVID-19 case
    • Tested: total number of COVID-19 tests performed
    • Latitude: floating point value representing the geographic coordinate
    • Longitude: floating point value representing the geographic coordinate
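
As a rough illustration of working with the World schema listed above, the sketch below parses a small in-memory CSV with pandas and derives daily new cases from the cumulative Confirmed column. The sample rows, the `load_world` helper and the `NewConfirmed` column are hypothetical and not part of the repo; the values are invented for illustration. One detail worth noting: by default pandas reads the string "NA" as a missing value, which would silently drop Namibia's country code, so the sketch disables the default missing-value markers.

```python
import io

import pandas as pd

# Hypothetical sample rows following the World schema; values are invented.
# "XA" is a placeholder code, "NA" is the real ISO 3166-1 code for Namibia.
SAMPLE_CSV = """Date,CountryCode,CountryName,Confirmed,Deaths,Latitude,Longitude
2020-03-01,XA,Example A,100,2,10.0,20.0
2020-03-02,XA,Example A,150,5,10.0,20.0
2020-03-03,XA,Example A,230,9,10.0,20.0
2020-03-01,NA,Namibia,0,0,-22.96,18.49
2020-03-02,NA,Namibia,1,0,-22.96,18.49
"""


def load_world(source):
    """Load a World-schema CSV and add a per-country daily increase column."""
    df = pd.read_csv(
        source,
        parse_dates=["Date"],
        # Treat only empty cells as missing so the code "NA" stays a string.
        keep_default_na=False,
        na_values=[""],
    )
    df = df.sort_values(["CountryCode", "Date"]).reset_index(drop=True)
    # Confirmed is cumulative, so the day-over-day difference is the new cases.
    # The first row of each country has no previous day and stays NaN.
    df["NewConfirmed"] = df.groupby("CountryCode")["Confirmed"].diff()
    return df


world = load_world(io.StringIO(SAMPLE_CSV))
print(world[["Date", "CountryCode", "Confirmed", "NewConfirmed"]])
```

The same function accepts a file path or URL in place of the `StringIO` object, since `pandas.read_csv` handles all three.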

Analyze the data

You can find Jupyter Notebooks in the analysis folder with examples of how to load and analyze the data. If you want to run your analysis without installing anything on your computer, you can use Google Colab by going to this URL: https://colab.research.google.com/github/open-covid-19/data/

Why another dataset?

This dataset is heavily inspired by the dataset maintained by Johns Hopkins University. Unfortunately, that dataset is currently experiencing maintenance issues, and many applications depend on this critical data being available in a timely manner. Further, the original sources of the data in that dataset are still unclear.

Source of data

The world data comes from the daily reports at the ECDC portal. The XLS file is downloaded and parsed using scrapy and pandas.

Data for Chinese regions comes from the daily WHO situation reports, which are automatically parsed from their PDF source using scrapy and ghostscript.

The data is automatically crawled and parsed using the scripts found in the input folder. This is done daily, and as part of the processing some additional columns are added, like country-level coordinates.
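
The step of adding country-level coordinates could look roughly like the sketch below. This is not the code from the input folder; it is a minimal illustration that assumes a separate lookup table keyed by CountryCode. The `add_coordinates` helper and both sample tables are hypothetical, with invented values.

```python
import pandas as pd


def add_coordinates(reports, coords):
    """Attach Latitude/Longitude to each report row by CountryCode.

    A left join keeps every report row even when no coordinates are known;
    those rows end up with NaN in the coordinate columns.
    """
    return reports.merge(
        coords,
        on="CountryCode",
        how="left",
        # Fail loudly if the lookup table lists a country more than once.
        validate="many_to_one",
    )


# Hypothetical inputs with placeholder codes and invented values.
reports = pd.DataFrame(
    {
        "Date": ["2020-03-01", "2020-03-02", "2020-03-01"],
        "CountryCode": ["XA", "XA", "XB"],
        "Confirmed": [100, 150, 40],
    }
)
coords = pd.DataFrame(
    {"CountryCode": ["XA"], "Latitude": [10.0], "Longitude": [20.0]}
)

print(add_coordinates(reports, coords))
```

Using a left join rather than an inner join means a country missing from the lookup table shows up as a gap in the output instead of disappearing from it.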

Update the data

To update the contents of the output folder, run the following:

# Install dependencies
pip install -r requirements.txt
# Update world data
sh input/update_world_data.sh
# Update China data
sh input/update_china_data.sh
# Update USA data
sh input/update_usa_data.sh

Note that this will only fetch the latest report from the WHO and ECDC sources. If a report is skipped or amended, manual intervention will be required.
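
One way to notice a skipped report is to look for gaps in the Date column of an output file. The sketch below is not part of the repo's scripts; `missing_dates` is a hypothetical helper that assumes one report per calendar day and lists the days with no datapoint between the first and last observed dates.

```python
import pandas as pd


def missing_dates(dates):
    """Return the ISO 8601 dates absent between the first and last date seen."""
    observed = pd.DatetimeIndex(pd.to_datetime(dates)).normalize().unique()
    if observed.empty:
        return []
    # Every calendar day from the earliest to the latest observed date.
    expected = pd.date_range(observed.min(), observed.max(), freq="D")
    return [day.strftime("%Y-%m-%d") for day in expected.difference(observed)]


# Example with a gap on March 3rd.
print(missing_dates(["2020-03-01", "2020-03-02", "2020-03-04"]))
```

This only detects skipped days; an amended report would still need to be compared against the source by hand.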
