axa-group/Parsr

Transforms PDF, Documents and Images into Enriched Structured Data


repo name	axa-group/Parsr
repo link	https://github.com/axa-group/Parsr
language	TypeScript
size (curr.)	12793 kB
stars (curr.)	2324
created	2019-08-05
license	Apache License 2.0

Turn your documents into data!


Parsr is a minimal-footprint toolchain for cleaning, parsing and extracting data from documents (image, PDF, DOCX, EML). It generates readily available, organized and usable data for data scientists and developers.

It provides users with clean, structured and label-enriched information for ready-to-use applications such as data entry, document analysis automation and archival.

Currently, Parsr can perform:

Document Hierarchy Regeneration - Words, Lines, Paragraphs
Headings Detection
Table Detection and Reconstruction
Lists Detection
Table of Contents Detection
Text Order Detection
Named Entity Recognition (Dates, Percentages, etc.)
Key-Value Pair Detection (for the extraction of specific form-based entries)
Page Number Detection
Header-Footer Detection
Link Detection
Whitespace Removal
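
To make the Words, Lines, Paragraphs hierarchy more concrete, here is a minimal Python sketch that walks a nested structure and collects its words in order. The node layout (a `type` field plus a `content` field holding either a string or a list of child nodes) is an assumption made for illustration, not a confirmed description of Parsr's JSON output, and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterator

Node = Dict[str, Any]


def iter_words(node: Node) -> Iterator[str]:
    """Yield word strings depth-first from an assumed nested node layout."""
    content = node.get("content")
    if isinstance(content, str):
        # Leaf node: in this sketch, a word carries its text directly.
        yield content
    elif isinstance(content, list):
        # Container node (e.g. paragraph or line): recurse into the children.
        for child in content:
            yield from iter_words(child)


def element_text(node: Node) -> str:
    """Join all words under a node with single spaces."""
    return " ".join(iter_words(node))


# Hypothetical sample: one paragraph made of two lines.
sample = {
    "type": "paragraph",
    "content": [
        {"type": "line", "content": [
            {"type": "word", "content": "Turn"},
            {"type": "word", "content": "your"},
        ]},
        {"type": "line", "content": [
            {"type": "word", "content": "documents"},
        ]},
    ],
}

paragraph_text = element_text(sample)
```

The same traversal would apply to any element type that nests children in this way; only the leaf test would need adjusting if the real schema differs.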

Parsr takes as input an image (.JPG, .PNG, .TIFF, …), an email (.EML), a Word document (.DOCX) or a PDF file, and generates the following output formats:

JSON
Markdown
Text
CSV (for tables), or Pandas Dataframes (see here)
PDF
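
As an illustration of consuming a table exported as CSV without pandas, the following sketch uses Python's standard `csv` module. The sample text and the `;` delimiter are assumptions for the example; inspect an actual export and adjust `delimiter` if needed. `read_table` is a hypothetical helper name.

```python
import csv
import io
from typing import Dict, List


def read_table(csv_text: str, delimiter: str = ";") -> List[Dict[str, str]]:
    """Parse one CSV table into a list of row dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Hypothetical table content, not actual Parsr output.
sample_csv = "Item;Quantity\nApples;3\nPears;5\n"
rows = read_table(sample_csv)
```

All values come back as strings, so numeric columns would still need an explicit conversion.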


Getting Started

Installation

– The advanced installation guide is available here –

The quickest way to install and run the Parsr API is through the docker image:

```
docker pull axarev/parsr
```

If you also wish to install the GUI for sending documents and visualising results:

```
docker pull axarev/parsr-ui-localhost
```

Note: Parsr can also be installed bare-metal (without Docker containers); the procedure is documented in the installation guide.

Usage

– The advanced usage guide is available here –

To run the API, issue:

```
docker run -p 3001:3001 axarev/parsr
```

This launches the API on http://localhost:3001.
Consult the documentation for details on using the API.
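
As a rough idea of what calling the running API from Python might involve, the sketch below builds the URLs a client would use against http://localhost:3001. The `/api/v1/` prefix and the route names (`document` for submission, `queue` for job status, and one route per output format) are assumptions to be checked against the API documentation, and all helper names are hypothetical.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:3001"  # port published by the docker run command

# Assumed route prefix; verify against the API documentation.
API_PREFIX = "/api/v1/"


def submit_url(base_url: str = BASE_URL) -> str:
    """URL to which a document and its configuration would be POSTed."""
    return urljoin(base_url, API_PREFIX + "document")


def status_url(job_id: str, base_url: str = BASE_URL) -> str:
    """URL for polling the processing status of a submitted document."""
    return urljoin(base_url, f"{API_PREFIX}queue/{job_id}")


def result_url(job_id: str, fmt: str = "json", base_url: str = BASE_URL) -> str:
    """URL for one output format (e.g. "json", "markdown", "text")."""
    return urljoin(base_url, f"{API_PREFIX}{fmt}/{job_id}")


def fetch_result(job_id: str, fmt: str = "json", timeout: float = 10.0) -> bytes:
    """Download a finished result; requires the API container to be running."""
    with urlopen(result_url(job_id, fmt), timeout=timeout) as response:
        return response.read()
```

Only `fetch_result` touches the network; the URL builders can be checked offline, which makes it easy to adapt them if the real routes differ.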

To use the Jupyter Notebook and the Python interface to the Parsr API, follow the instructions here.

To use the GUI tool (the API must already be running), issue:
```
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
```
Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

The API-based usage and the command-line usage are documented in the advanced usage guide.