covid-19-net/covid-19-community
Community effort to build a Neo4j Knowledge Graph (KG) that links heterogeneous data about COVID-19
repo name | covid-19-net/covid-19-community |
repo link | https://github.com/covid-19-net/covid-19-community |
homepage | |
language | Jupyter Notebook |
size (curr.) | 565 kB |
stars (curr.) | 19 |
created | 2020-03-22 |
license | MIT License |
Covid-19-Community
This project is a community effort to build a Neo4j Knowledge Graph (KG) that links heterogenous data about COVID-19 to help fight this outbreak! It serves as a sandbox and incubator project and the best ideas will be incorporated into the Covid-19-Net KG.
Join “GraphHackers, Let’s Unite to Help Save the World — Graphs4Good 2020".
What kind of data can you contribute? Here are some of our ideas.
How can you contribute?
-
File an issue to discuss your idea so we can coordinate efforts
-
Help with specific issues
-
Suggest publically accessible data sets
-
Suggest graph queries to gain new insights from the KG
-
Add Jupyter Notebooks with data analyses
-
Add data and map visualizations
-
Help improve the data model
-
Report bugs or issues
How to use this project?
This project uses Jupyter Notebooks to download and curate the latest data files, create a Neo4j graph database, and run Cypher queries on the graph database. The results of the queries can then be used in the Jupyter Notebooks for further analysis and visualizations.
(Currently, we don’t have graph visualization working in Jupyter Lab. We are looking for community members to help.)
You can run the Jupyter Notebooks in this repo in your web browser:
Once Jupyter Lab launches, navigate to the notebooks folder and run the following notebooks:
Notebook | Description |
---|---|
1a-Strains | Downloads the latest SARS-CoV-2 strain data and creates node and relationship files in the data directory |
1b-… | Future notebooks that add new node and relationship files |
2-CreateGraph | Creates a Neo4j Knowledge Graph by batch-uploading the node and relationship files |
3-ExampleQueries | Runs Cypher queries on the Knowledge Graph |
A prototype Subgraph that represents relationships for Virus Strains
This subgraph maps the relationships between the Pathogen (SARS-CoV-2) that causes the COVID-19 disease Outbreak, the strains of the virus, the host (human or animal), and the locations where they were found.
Data Creation and Organization
We have separated data download and curation from the graph database creation.
1. Data Download and Curation
Jupyter Notebooks are used to download the latest raw data files, curate and harmonize the data, and then save Nodes and Relationships as .csv files in the /data directory.
The Nodes, Relationships, and their Properties are named according to these conventions. The headers of the Node and Relationship .csv files must be formated according to the Neo4j formatting rules for batch upload.
We use the Node and Relationship names for the data files, for example, the relationships
(:Outbreak)-[:EXPLORE_IN]->(:Dashboard)
(:City)-[:EXPLORE_IN]-(:Dashboard)
are stored in three Node files: Outbreak.csv, Dashboard.csv, City.csv and two Relationship files: Outbreak-EXPLORE_IN-Dashboard.csv, City-EXPLORE_IN-Dashboard.csv.
The graph database is created from the following files:
Directory | Description |
---|---|
cached_data | Raw data files downloaded from resources that do not provide download URLs. These files are manually downloaded and updated as needed. |
reference_data | Node and Relationship .csv files that are manually created and updated |
data | Node and Relationship .csv files created automatically by running the Jupyter Notebooks. These files are overwritten. Do not edit these files. |
2. Batch-up of Node and Relationship files
The 2-CreateGraph.ipynb notebook batch-uploades the .csv files into an empty Neo4j database.
How to run this project locally
1. Fork this project
A fork is a copy of a repository in your GitHub account. Forking a repository allows you to freely experiment with changes without affecting the original project.
In the top-right corner of this GitHub page, click Fork
.
Then, download all materials to your laptop by cloning your copy of the repository, where your-user-name
is your GitHub user name. To clone the repository from a Terminal window or the Anaconda prompt (Windows), run:
git clone https://github.com/your-user-name/covid-19-community.git
cd covid-19-community
2. Create a conda environment
The file environment.yml
specifies the Python version and all packages required by the tutorial.
conda env create -f environment.yml
Activate the conda environment
conda activate covid-19-community
3. Install Neo4j Desktop
Then, launch the Neo4j Browser, create an empty database, and set the password to “neo4jbinder”
4. Set Environment Variable
Set a NEO4J_HOME environment variable with the path to the database installation.
(Example path from Mac OS: /Users/username/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-993db298-6374-4f0a-9a9a-d0783480877a/installation-3.5.14)
5. Launch Jupyter Lab Run the Jupyter Notebooks in order to download the latest data, create a new graph database, and then query then query the graph database.
jupyter lab
6. Browse KG in Neo4j Browser
After you create the graph database by running the Jupyter Notebooks, start the database in Neo4j Browser to interactively explore the KG.