Backends for data storage

Ralph supports various backends that can be read from or written to (learning events or arbitrary data). Implemented backends are listed below along with their configuration parameters. If your favourite data storage method is missing, feel free to submit your implementation or get in touch!

Key concepts

Each backend has its own parameter requirements. These parameters can be set as command line options or environment variables; the latter is the recommended solution for sensitive data such as service credentials. For example, the os_username (OpenStack user name) parameter of the OpenStack Swift backend can be set as a command line option using swift as the option prefix (and replacing underscores in its name with dashes):

ralph list --backend swift --swift-os-username johndoe # [...] more options

Alternatively, this parameter can be set as an environment variable (in upper case, prefixed by the program name, e.g. RALPH_):

export RALPH_BACKENDS__DATA__SWIFT__OS_USERNAME="johndoe"
ralph list --backend swift # [...] more options

The general patterns for backend parameters are:

  • --{{ backend_name }}-{{ parameter | underscore_to_dash }} for command options, and,
  • RALPH_BACKENDS__DATA__{{ backend_name | uppercase }}__{{ parameter | uppercase }} for environment variables.
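
Applying these patterns to another Swift parameter, default_container (listed in the Swift section below), gives the following equivalent forms; the container name here is only a placeholder:

ralph list --backend swift --swift-default-container archives # command line option
export RALPH_BACKENDS__DATA__SWIFT__DEFAULT_CONTAINER="archives" # environment variable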

Elasticsearch

The Elasticsearch backend is mostly used for indexing purposes (as a data lake), but it can also be used to fetch indexed data from it.

Elasticsearch data backend default configuration.

Attributes:

  • ALLOW_YELLOW_STATUS (bool): Whether to consider Elasticsearch yellow health status to be ok.
  • CLIENT_OPTIONS (dict): A dictionary of valid options for the Elasticsearch class initialization.
  • DEFAULT_INDEX (str): The default index to use for querying Elasticsearch.
  • HOSTS (str or tuple): The comma-separated list of Elasticsearch nodes to connect to.
  • LOCALE_ENCODING (str): The encoding used for reading/writing documents.
  • POINT_IN_TIME_KEEP_ALIVE (str): The duration for which Elasticsearch should keep a point in time alive.
  • READ_CHUNK_SIZE (int): The default chunk size for reading batches of documents.
  • REFRESH_AFTER_WRITE (str or bool): Whether the Elasticsearch index should be refreshed after the write operation.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing batches of documents.
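
As a minimal sketch, the backend could be configured through environment variables, assuming it is registered under the es backend name and that an Elasticsearch node is reachable on localhost (both are assumptions to adapt to your setup):

export RALPH_BACKENDS__DATA__ES__HOSTS="http://localhost:9200"
export RALPH_BACKENDS__DATA__ES__DEFAULT_INDEX="statements"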

MongoDB

The MongoDB backend is mostly used for indexing purposes (as a data lake), but it can also be used to fetch collections of documents from it.

MongoDB data backend default configuration.

Attributes:

  • CONNECTION_URI (str): The MongoDB connection URI.
  • DEFAULT_DATABASE (str): The MongoDB database to connect to.
  • DEFAULT_COLLECTION (str): The MongoDB database collection to get objects from.
  • CLIENT_OPTIONS (MongoClientOptions): A dictionary of MongoDB client options.
  • LOCALE_ENCODING (str): The locale encoding to use when none is provided.
  • READ_CHUNK_SIZE (int): The default chunk size for reading batches of documents.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing batches of documents.
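
A similar sketch for MongoDB, assuming the backend is registered under the mongo name and that a local MongoDB instance is available (database and collection names are placeholders):

export RALPH_BACKENDS__DATA__MONGO__CONNECTION_URI="mongodb://localhost:27017/"
export RALPH_BACKENDS__DATA__MONGO__DEFAULT_DATABASE="statements"
export RALPH_BACKENDS__DATA__MONGO__DEFAULT_COLLECTION="events"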

ClickHouse

The ClickHouse backend can be used as a data lake and to fetch collections of documents from it.

ClickHouse data backend default configuration.

Attributes:

  • HOST (str): ClickHouse server host to connect to.
  • PORT (int): ClickHouse server port to connect to.
  • DATABASE (str): ClickHouse database to connect to.
  • EVENT_TABLE_NAME (str): Table where events live.
  • USERNAME (str): ClickHouse username to connect as (optional).
  • PASSWORD (str): Password for the given ClickHouse username (optional).
  • CLIENT_OPTIONS (ClickHouseClientOptions): A dictionary of valid options for the ClickHouse client connection.
  • LOCALE_ENCODING (str): The locale encoding to use when none is provided.
  • READ_CHUNK_SIZE (int): The default chunk size for reading.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing.

The ClickHouse client options supported in Ralph correspond to the options of the underlying ClickHouse Python client; refer to its documentation for the full list.
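
A minimal configuration sketch, assuming the backend is registered under the clickhouse name (host, database and table names are placeholders):

export RALPH_BACKENDS__DATA__CLICKHOUSE__HOST="clickhouse.example.com"
export RALPH_BACKENDS__DATA__CLICKHOUSE__PORT="8123"
export RALPH_BACKENDS__DATA__CLICKHOUSE__DATABASE="xapi"
export RALPH_BACKENDS__DATA__CLICKHOUSE__EVENT_TABLE_NAME="events"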

OVH - Logs Data Platform (LDP)

LDP is a nice service built by OVH on top of Graylog to follow, analyse and store your logs. Learning events (aka tracking logs) can be stored in GELF format using this backend.

Read-only backend

For now, the LDP backend is read-only, as we consider that it is mostly used to collect primary logs and not as a Ralph target. Feel free to get in touch to prove us wrong, or better: submit your proposal for the write method implementation.

To access OVH’s LDP API, you need to register Ralph as an authorized application and generate an application key, an application secret and a consumer key.

While filling the registration form available at: eu.api.ovh.com/createToken/, be sure to give an appropriate validity time span to your token and allow only GET requests on the /dbaas/logs/* path.

OVH LDP (Log Data Platform) data backend default configuration.

Attributes:

  • APPLICATION_KEY (str): The OVH API application key (AK).
  • APPLICATION_SECRET (str): The OVH API application secret (AS).
  • CONSUMER_KEY (str): The OVH API consumer key (CK).
  • DEFAULT_STREAM_ID (str): The default stream identifier to query.
  • ENDPOINT (str): The OVH API endpoint.
  • READ_CHUNK_SIZE (str): The default chunk size for reading archives.
  • REQUEST_TIMEOUT (int): HTTP request timeout in seconds.
  • SERVICE_NAME (str): The default LDP account name.

For more information about OVH’s API client parameters, please refer to the project’s documentation: github.com/ovh/python-ovh.
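
As a minimal sketch, the generated credentials could then be provided through environment variables, assuming the backend is registered under the ldp name (all values below are placeholders):

export RALPH_BACKENDS__DATA__LDP__ENDPOINT="ovh-eu"
export RALPH_BACKENDS__DATA__LDP__APPLICATION_KEY="<your_application_key>"
export RALPH_BACKENDS__DATA__LDP__APPLICATION_SECRET="<your_application_secret>"
export RALPH_BACKENDS__DATA__LDP__CONSUMER_KEY="<your_consumer_key>"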

OpenStack Swift

Swift is the OpenStack object storage service. This storage backend is fully supported (read and write operations) to stream and store log archives.

Parameters correspond to a standard authentication using the OpenStack Keystone service, plus the configuration needed to work with the target container.

Swift data backend default configuration.

Attributes:

  • AUTH_URL (str): The authentication URL.
  • USERNAME (str): The name of the OpenStack Swift user.
  • PASSWORD (str): The password of the OpenStack Swift user.
  • IDENTITY_API_VERSION (str): The Keystone API version to authenticate to.
  • TENANT_ID (str): The identifier of the tenant of the container.
  • TENANT_NAME (str): The name of the tenant of the container.
  • PROJECT_DOMAIN_NAME (str): The project domain name.
  • REGION_NAME (str): The region where the container is.
  • OBJECT_STORAGE_URL (str): The default storage URL.
  • USER_DOMAIN_NAME (str): The user domain name.
  • DEFAULT_CONTAINER (str): The default target container.
  • LOCALE_ENCODING (str): The encoding used for reading/writing documents.
  • READ_CHUNK_SIZE (str): The default chunk size for reading objects.
  • WRITE_CHUNK_SIZE (str): The default chunk size for writing objects.
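
As a minimal sketch combining some of the parameters above (all values are placeholders; the parameter names are taken from the table, so adapt them to your Ralph version):

export RALPH_BACKENDS__DATA__SWIFT__AUTH_URL="https://keystone.example.com/v3/"
export RALPH_BACKENDS__DATA__SWIFT__USERNAME="johndoe"
export RALPH_BACKENDS__DATA__SWIFT__PASSWORD="xxx"
export RALPH_BACKENDS__DATA__SWIFT__DEFAULT_CONTAINER="archives"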

Amazon S3

S3 is the Amazon Simple Storage Service. This storage backend is fully supported (read and write operations) to stream and store log archives.

Parameters correspond to a standard authentication with the AWS CLI, plus the configuration needed to work with the target bucket.

S3 data backend default configuration.

Attributes:

  • ACCESS_KEY_ID (str): The access key ID for the S3 account.
  • SECRET_ACCESS_KEY (str): The secret key for the S3 account.
  • SESSION_TOKEN (str): The session token for the S3 account.
  • ENDPOINT_URL (str): The S3 endpoint URL.
  • DEFAULT_REGION (str): The default region used when instantiating the client.
  • DEFAULT_BUCKET_NAME (str): The default bucket name targeted.
  • LOCALE_ENCODING (str): The encoding used for writing dictionaries to objects.
  • READ_CHUNK_SIZE (str): The default chunk size for reading objects.
  • WRITE_CHUNK_SIZE (str): The default chunk size for writing objects.
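
As a minimal sketch, assuming the backend is registered under the s3 name (credentials, region and bucket name are placeholders):

export RALPH_BACKENDS__DATA__S3__ACCESS_KEY_ID="<your_access_key_id>"
export RALPH_BACKENDS__DATA__S3__SECRET_ACCESS_KEY="<your_secret_access_key>"
export RALPH_BACKENDS__DATA__S3__DEFAULT_REGION="eu-west-1"
export RALPH_BACKENDS__DATA__S3__DEFAULT_BUCKET_NAME="ralph-archives"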

File system

The file system backend is a dummy template that can be used to develop your own backend. It is a “dummy” backend as it is not intended for practical use (UNIX ls and cat would be more practical).

The only required parameter is the path we want to list or stream content from.

FileSystem data backend default configuration.

Attributes:

  • DEFAULT_DIRECTORY_PATH (str or Path): The default target directory path where to perform list, read and write operations.
  • DEFAULT_QUERY_STRING (str): The default query string to match files for the read operation.
  • LOCALE_ENCODING (str): The encoding used for writing dictionaries to files.
  • READ_CHUNK_SIZE (int): The default chunk size for reading files.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing files.
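
As a minimal sketch, assuming the backend is registered under the fs name (the directory path and query string are placeholders):

export RALPH_BACKENDS__DATA__FS__DEFAULT_DIRECTORY_PATH="/var/lib/ralph/archives"
export RALPH_BACKENDS__DATA__FS__DEFAULT_QUERY_STRING="*"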

Learning Record Store (LRS)

The LRS backend is used to store and retrieve xAPI statements from various systems that follow the xAPI specification (such as our own Ralph LRS, which can be run from this package). LRS systems are mostly used in e-learning infrastructures.

LRS data backend default configuration.

Attributes:

  • BASE_URL (AnyHttpUrl): LRS server URL.
  • USERNAME (str): Basic auth username for LRS authentication.
  • PASSWORD (str): Basic auth password for LRS authentication.
  • HEADERS (dict): Headers defined for the LRS server connection.
  • LOCALE_ENCODING (str): The encoding used for reading statements.
  • READ_CHUNK_SIZE (int): The default chunk size for reading statements.
  • STATUS_ENDPOINT (str): Endpoint used to check server status.
  • STATEMENTS_ENDPOINT (str): Default endpoint for the LRS statements resource.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing statements.
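
As a minimal sketch, assuming the backend is registered under the lrs name and an LRS listens on localhost (URL, credentials and endpoint are placeholders):

export RALPH_BACKENDS__DATA__LRS__BASE_URL="http://localhost:8100"
export RALPH_BACKENDS__DATA__LRS__USERNAME="ralph"
export RALPH_BACKENDS__DATA__LRS__PASSWORD="secret"
export RALPH_BACKENDS__DATA__LRS__STATEMENTS_ENDPOINT="/xAPI/statements"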

WebSocket

The WebSocket backend is read-only and can be used to get real-time events.

If you use OVH’s Logs Data Platform (LDP), you can retrieve a WebSocket URI to test your data stream by following instructions from the official documentation.

Websocket data backend default configuration.

Attributes:

  • CLIENT_OPTIONS (dict): A dictionary of valid options for the websocket client connection. See WSClientOptions.
  • URI (str): The URI to connect to.
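
As a minimal sketch, assuming the backend is registered under the ws name (the URI below is only a placeholder in the LDP style):

export RALPH_BACKENDS__DATA__WS__URI="wss://gra1.logs.ovh.com/tail/?tk=<your_token>"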

Client options for websockets.connection.

For more details, see the websockets.connection documentation.

Attributes:

  • close_timeout (float): Timeout for closing the connection in seconds.
  • compression (str): Per-message compression (deflate) is activated by default. Setting it to None disables compression.
  • max_size (int): Maximum size of incoming messages in bytes. Setting it to None disables the limit.
  • max_queue (int): Maximum number of incoming messages in the receive buffer. Setting it to None disables the limit.
  • open_timeout (float): Timeout for opening the connection in seconds. Setting it to None disables the timeout.
  • origin (str): Value of the Origin header, for servers that require it.
  • ping_interval (float): Delay between keepalive pings in seconds. Setting it to None disables keepalive pings.
  • ping_timeout (float): Timeout for keepalive pings in seconds. Setting it to None disables timeouts.
  • read_limit (int): High-water mark of the read buffer in bytes.
  • user_agent_header (str): Value of the User-Agent request header. It defaults to "Python/x.y.z websockets/X.Y". Setting it to None removes the header.
  • write_limit (int): High-water mark of the write buffer in bytes.
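
If nested settings follow the same double-underscore delimiter convention as top-level parameters (an assumption to verify against your Ralph version), individual client options might be set like this (placeholder values):

export RALPH_BACKENDS__DATA__WS__CLIENT_OPTIONS__PING_INTERVAL="20"
export RALPH_BACKENDS__DATA__WS__CLIENT_OPTIONS__MAX_SIZE="1048576"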