Peskas - data ingestion

Peskas is an advanced intelligence platform for artisanal fisheries. To provide reports and actionable insights it uses data from multiple sources.

Disaggregated fisheries data is confidential and held securely in the Peskas data warehouse. This repository contains:

- Code to securely ingest data into the Peskas data warehouse, which is composed primarily of a series of Google Cloud Storage buckets (cloud-based object storage).
- Scripts to deploy the ingestion code into serverless containers. The containers are built, and the code executed, using Google Cloud Run.

Ingestion parameters

All the details needed to ingest data (ingestion parameters) are specified in params.yaml. This file contains details about the datasets that should be ingested and the storage service where they're saved.

Parameters can be dynamically evaluated using the inline R convention (`r foo()`), and they can be specified separately for the development and production environments. For example, with the following specification, the parameter "var" takes the value "abc" in development and "ABC" in production.

var: 
  dev: "abc"
  prod: "ABC"
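
As an illustration of how such a specification could be resolved once params.yaml has been read into a nested R list, here is a minimal sketch in base R. The helper name resolve_param is hypothetical and is not taken from this repository. It also assumes that inline R values are written as a string of the form `r expression`, following the R Markdown inline convention. The actual ingestion code may resolve parameters differently.

```r
# Hypothetical helper (not part of this repository): resolve a parameter
# value, or a nested list of parameters, for the given environment.
resolve_param <- function(value, env = Sys.getenv("ENV", "dev")) {
  # A list holding both `dev` and `prod` entries is treated as
  # environment-specific, so only the entry for `env` is kept.
  if (is.list(value) && all(c("dev", "prod") %in% names(value))) {
    value <- value[[env]]
  }
  # Any other list is resolved element by element.
  if (is.list(value)) {
    return(lapply(value, resolve_param, env = env))
  }
  # Assumption: a single string written as `r <expression>` is evaluated.
  if (is.character(value) && length(value) == 1) {
    m <- regmatches(value, regexec("^`r (.+)`$", value))[[1]]
    if (length(m) == 2) {
      return(eval(parse(text = m[2])))
    }
  }
  value
}

# Example with made-up parameters.
params <- list(
  var = list(dev = "abc", prod = "ABC"),
  token = "`r toupper('secret')`"
)
resolved <- resolve_param(params, env = "prod")
```

Here resolved$var holds the production value and resolved$token holds the result of evaluating the inline expression. Evaluating strings as code is only reasonable because params.yaml is a trusted file kept under version control.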

Dataset fields:

Multiple datasets are allowed. Each of them should have the following fields:

- interface: The interface used to retrieve data. Supported values are:
  - api: Retrieves data using a RESTful HTTP GET request.
- data_format: The format of the retrieved data. Supported values are:
  - json
  - csv
- name: Name of the dataset. This name will be used in the storage service. The file extension is inferred from data_format.
- Other fields depend on the interface used. For api, url, path, and other GET request details are required.
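
The dataset fields described above can be checked before any data is requested. The following is a minimal sketch in base R, not the repository's implementation: the function dataset_file_name and the example values are hypothetical, and it assumes that the stored file name is simply the name followed by the data_format as extension.

```r
# Hypothetical helper (not part of this repository): validate one dataset
# entry and derive the file name used in the storage service.
dataset_file_name <- function(dataset) {
  required <- c("interface", "data_format", "name")
  absent <- setdiff(required, names(dataset))
  if (length(absent) > 0) {
    stop("Missing dataset fields: ", paste(absent, collapse = ", "))
  }
  # Only the values documented above are accepted.
  if (!dataset$interface %in% c("api")) {
    stop("Unsupported interface: ", dataset$interface)
  }
  if (!dataset$data_format %in% c("json", "csv")) {
    stop("Unsupported data_format: ", dataset$data_format)
  }
  # Assumption: the extension is the data_format itself.
  paste0(dataset$name, ".", dataset$data_format)
}

# Example with a made-up dataset entry.
example_dataset <- list(
  interface = "api",
  data_format = "csv",
  name = "landings",
  url = "https://example.org"
)
file_name <- dataset_file_name(example_dataset)
```

Failing early on a malformed entry keeps a typo in params.yaml from surfacing later as an opaque HTTP or storage error.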

Storage fields:

Only one storage service is allowed. It should have the following fields:

- provider: The provider of the storage service. Supported values are:
  - google: Saves data to Google Cloud Storage.
- Other fields depend on the provider. For google, bucket and auth_file are required.
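
To show how the storage fields might be used, the sketch below uploads a local file with the googleCloudStorageR package. This README does not say which client library the ingestion code relies on, so the choice of package is an assumption, and the function name upload_to_storage is hypothetical. The function is only defined here. Calling it requires the package to be installed and a valid authentication file.

```r
# Hypothetical helper (not part of this repository): upload a local file
# using the storage fields from params.yaml.
upload_to_storage <- function(file, name, storage) {
  if (!identical(storage$provider, "google")) {
    stop("Unsupported storage provider: ", storage$provider)
  }
  absent <- setdiff(c("bucket", "auth_file"), names(storage))
  if (length(absent) > 0) {
    stop("Missing storage fields: ", paste(absent, collapse = ", "))
  }
  # Assumption: googleCloudStorageR is the client library in use.
  googleCloudStorageR::gcs_auth(json_file = storage$auth_file)
  googleCloudStorageR::gcs_upload(
    file = file,
    bucket = storage$bucket,
    name = name
  )
}
```

The validation runs before any call to the package, so configuration mistakes are reported even on a machine without Google Cloud credentials.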

Environment variables

The script requires the following environment variables to be configured:

Required:

ENV: Specifies whether the code should be built in a development (ENV=dev) or production (ENV=prod) environment.

Optional:

The following environment variables are only required if they are specified through the params.yaml file.

KOBO_HUMANITARIAN_TOKEN: Token to connect to the Kobo API. Required if retrieving data from the Kobo Humanitarian server.
GCS_AUTH_FILE: Path to the Google Cloud Services .json authentication file.
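
A start-up check along the following lines can catch a missing or mistyped variable before any ingestion begins. It is a sketch in base R. The function check_env is hypothetical, and the vector of referenced variables is assumed to be supplied by the caller rather than extracted from params.yaml.

```r
# Hypothetical helper (not part of this repository): verify that ENV is
# valid and that every referenced optional variable is set.
check_env <- function(referenced = character()) {
  env <- Sys.getenv("ENV")
  if (!env %in% c("dev", "prod")) {
    stop("ENV must be 'dev' or 'prod', got '", env, "'")
  }
  # Look up the variables one at a time so an empty vector stays empty.
  is_set <- vapply(referenced, function(v) nzchar(Sys.getenv(v)), logical(1))
  unset <- referenced[!is_set]
  if (length(unset) > 0) {
    stop("Unset environment variables: ", paste(unset, collapse = ", "))
  }
  invisible(env)
}

# Example: ENV is normally set outside R; it is set here for illustration.
Sys.setenv(ENV = "dev")
check_env()
```

Passing c("KOBO_HUMANITARIAN_TOKEN", "GCS_AUTH_FILE") as the argument would additionally require both optional variables to be set.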

Deployment

- Enable the Cloud Build, Cloud Run, Container Registry, and Resource Manager APIs.
- Ensure the Cloud Storage API is enabled.
- Enable the Secret Manager API.
- Create a service account for the data ingestion.
- Grant the service account access to storage.
- Grant the Cloud Run service account access to the secrets.
- Make sure the required storage buckets exist.

Support & Contributing

For general questions about the Peskas Platform, contact Alex Tilley. For questions about Peskas' code and technical infrastructure, contact Fernando Cagua.

Authors

Fernando Cagua

Website: http://www.cagua.co/
Github: https://github.com/efcagua/

Alex Tilley

Email: a.tilley@cgiar.org

License

The code is available under the MIT license.
