jsonlines/guide
Tutorial on streaming JSON data analysis on the command line
repo name | jsonlines/guide |
repo link | https://github.com/jsonlines/guide |
homepage | |
language | |
size (curr.) | 115 kB |
stars (curr.) | 93 |
created | 2017-01-29 |
license | |
JSON Lines Guide
Tutorial on streaming JSON data analysis on the command line.
Table of Contents
Introduction
JSON Lines (also known as Newline Delimited JSON) is a really simple way to store JSON that makes it very friendly for data processing and analysis. To store data in JSON Lines format you simply write one JSON stringified object per line.
Usually you would store multiple JSON objects in a single, pretty printed JSON structure in a file like this:
[
{
"url": "https://tigerweb.geo.census.gov/arcgis/services/TIGERweb/tigerWMS_Census2010/MapServer/WmsServer",
"date": "2017-01-27T03:41:03.083Z"
},
{
"url": "https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/PUMA_TAD_TAZ_UGA_ZCTA/MapServer",
"date": "2017-01-27T03:41:03.121Z"
}
]
The JSON Lines version of the same information above would be this:
{"url":"https://tigerweb.geo.census.gov/arcgis/services/TIGERweb/tigerWMS_Census2010/MapServer/WmsServer","date":"2017-01-27T03:41:03.083Z"}
{"url":"https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/PUMA_TAD_TAZ_UGA_ZCTA/MapServer","date":"2017-01-27T03:41:03.121Z"}
This is a little harder to read but the advantage is that it’s much easier to process on the command line using streams, especially if instead of two objects you had a million!
The main advantage of JSON Lines is that you can process each row of the file one at a time without having to read the entire file into memory, which is very important for larger datasets.
This guide assumes you are on a UNIX machine, but all of the json*
tools work on Windows as well.
Installing JSON Lines Tools
The jsonlines
GitHub organization houses a collection of handy modules for processing JSON Lines data on the command line.
You should also check out the ndjson-cli
suite from Mike Bostock which offers even more tools, available at https://github.com/mbostock/ndjson-cli, the concepts in this guide are the same.
To install the tools do the following:
- Install a recent Node.js version (LTS Recommended) on your computer
- Using
npm
, install the JSON Lines CLI Tools
$ npm install -g jsonfilter jsonmap jsonstats
If you receive permissions errors, you may need to use sudo
or change your permissions.
jsonfilter
The jsonfilter
command takes in a stream of JSON Lines and based on a filter expression exports a subset of data out as JSON Lines.
For example purposes this repository contains the metadata for 100 Data.gov datasets in JSON Lines format.
You can download it like this:
curl https://raw.githubusercontent.com/jsonlines/guide/master/datagov100.json > data.json
Then you can use jsonfilter
to find the names of all the datasets:
$ cat data.json | jsonfilter name
"va-national-formulary"
"safer-company-snapshot-safer-company-snapshot"
"fatality-analysis-reporting-system-fars-ftp-raw-data"
"tiger-line-shapefile-2013-state-alabama-current-county-subdivision-state-based"
"tiger-line-shapefile-2013-state-virginia-current-county-subdivision-state-based"
"national-motor-vehicle-crash-causation-survey-nmvccs-nmvccs-xml-case-viewer"
...
You should see output similar to above, lots of strings. This is also valid JSON Lines output. Lines can be any JSON type: Strings, Numbers, Objects, Arrays.
You can also do filters like this:
$ cat data.json | jsonfilter organization.name
"va-gov"
"dot-gov"
"dot-gov"
"census-gov"
"census-gov"
"dot-gov"
"census-gov"
"census-gov"
...
You can use sort
and uniq
to find out which organization is most active.
$ cat data.json | jsonfilter organization.name | sort | uniq -c
58 "census-gov"
14 "dot-gov"
2 "gsa-gov"
1 "nsf-gov"
8 "opm-gov"
1 "ssa-gov"
11 "usgs-gov"
5 "va-gov"
You can view more examples in the jsonfilter README
jsonmap
This command lets you morph data from one form into another. As opposed to jsonfilter
which takes a JSON selector expression, jsonmap
takes a short JavaScript expression as input that gets applied to each row of incoming JSON, and returns a new Object.
For example if we wanted to grab just a couple of fields from each Data.gov metadata item we could do this:
$ cat data.json | jsonmap "{name: this.name, organization: this.organization.name}"
{"name":"va-national-formulary","organization":"va-gov"}
{"name":"safer-company-snapshot-safer-company-snapshot","organization":"dot-gov"}
{"name":"fatality-analysis-reporting-system-fars-ftp-raw-data","organization":"dot-gov"}
...
The variable this
in the expression has the data for the row of JSON. You can also use Template Strings:
$ cat data.json | jsonmap '`Notes: ${this.notes.slice(0, 60)}...`'
"Notes: The VA National Formulary is a listing of products (drugs an..."
"Notes: The Company Snapshot is a concise electronic record of compa..."
"Notes: The program collects data for analysis of traffic safety cra..."
...
If your expression starts with something other than the above two things (objects or template literals) it will simply be used as the function body that gets executed on each row. You can either modify this
which will get returned at the end of the function:
$ cat data.json | jsonmap "if (this.maintainer) this.maintainer = this.maintainer.toUpperCase()"
{"license_title":"Creative Commons CCZero","maintainer":"DON LEES","relationships_...
{"license_title":"Other License Specified","maintainer":"JAMIE VASSER","relationsh...
{"license_title":"U.S. Government Work","maintainer":"LIXIN ZHAO","relationships_a...
...
Or return your own custom data:
$ cat data.json | jsonmap "if (this.license_id === 'cc-zero') { return 'Open' } else { return 'Closed'}"
"Open"
"Closed"
"Closed"
...
For more examples you can check out the jsonmap README.
jsonstats
Sometimes you want to get numerical statistics from your data. For example, the metadata includes some page view count metrics:
$ cat data.json | jsonfilter tracking_summary.total | head -n5
11754
2241
1925
1139
3247
You can pipe a stream of numbers to jsonstats
and when you finish piping, some summary statistics will be printed out:
$ cat data.json | jsonfilter tracking_summary.total | jsonstats | json
{
"max": 11754,
"min": 3,
"n": 100,
"_geometric_mean": 2.0157057404897112e+252,
"_reciprocal_sum": 1.2014730007931738,
"mean": 741.4800000000001,
"ss": 165161200.95999998,
"sum": 74148,
"_seen_this": 1,
"_mode": 11754,
"_mode_valid": true,
"variance": 1651612.0095999998,
"standard_deviation": 1285.1505785704646,
"geometric_mean": 333.4604034827265,
"harmonic_mean": 83.23116702080132,
"mode": 11754,
"_max_seen": 1,
"_last": 189
}
ndjson-reduce
If you want to take some JSON Lines output and combine the lines into a normal single JSON array of objects that you can use with JSON.parse
in your programs, you can use the ndjson-reduce
command.
The ndjson-reduce
command gets installed if you run npm install ndjson-cli -g
, along with some other great tools from the NDJSON CLI package by Mike Bostock, which offers similar functionality to the other tools in this guide.
To generate a single JS array with all of the unique organization names:
$ cat data.json | jsonfilter organization.name | sort | uniq | ndjson-reduce
["census-gov","dot-gov","gsa-gov","nsf-gov","opm-gov","ssa-gov","usgs-gov","va-gov"]
The only catch with ndjson-reduce
is it isn’t streaming, meaning it assumes your data can fit in memory easily, and be small enough for node to call JSON.stringify
on, the limit of which is usually around 500MB or so of JSON.
If you are looking for a streaming alternative to ndjson-reduce
then check out https://www.npmjs.com/package/json-write-stream
.
useful pipelines
Here are some more tools that work really well in combination with JSON Lines tools:
head
Built in to Unix. Lets you see the beginning N lines of a stream:
$ cat data.json | head -n1
{"license_title":"Creative Commons CCZero","maintainer":"Don Lees","relationships_...
json
Available as npm install json -g
from npm, it pretty prints a JSON object!
$ cat data.json | head -n1 | json
{
"license_title": "Creative Commons CCZero",
"maintainer": "Don Lees",
"relationships_as_object": [],
"private": false,
...
wc
Built in to Unix for doing word counts, but can also count lines, very useful for working with JSON Lines to count how many lines are in a file or filter
$ cat data.json | wc -l
100
grep
Built in to Unix, useful if you just want to filter lines based on a regular expression
$ cat data.json | grep CCZero | wc -l
8