

hansmi/dossier


Extract information from PDF documents


Dossier is a Go library for extracting textual information from PDF documents.

PDF is currently the only supported format; documents are parsed using MuPDF. Other formats can be supported by implementing custom parsers or by amending the library.

Sketches provide a declarative approach to locating information as an alternative to imperative/procedural access.

Sketches

A sketch is defined using protocol buffers, and the sketch protobuf definition documents the available configuration options. Sketches are usually written in the textproto format.

The command line utility includes a web-based viewer. Screenshot of the viewer with an example sketch for invoices:

[Screenshot: graphical viewer showing an example invoice analysis]

Invocation:

$ dossiercli web ./invoice.pdf ./sketch.textproto
2023/12/31 00:00:00 HTTP server listening on http://[::1]:8080

Installation

go get github.com/hansmi/dossier

Command line utility:

go install github.com/hansmi/dossier/cmd/dossiercli@latest