Backends for data storage

Ralph supports various backends that can be read from or written to (learning events or arbitrary data). Implemented backends are listed below along with their configuration parameters. If your favourite data storage method is missing, feel free to submit your implementation or get in touch!

Key concepts

Each backend has its own parameter requirements. These parameters can be set as command line options or environment variables; the latter is the recommended solution for sensitive data such as service credentials. For example, the os_username (OpenStack user name) parameter of the OpenStack Swift backend can be set as a command line option using swift as the option prefix (and replacing underscores in its name with dashes):

ralph list --backend swift --swift-os-username johndoe # [...] more options

Alternatively, this parameter can be set as an environment variable (in upper case, prefixed by the program name, e.g. RALPH_):

export RALPH_BACKENDS__DATA__SWIFT__OS_USERNAME="johndoe"
ralph list --backend swift # [...] more options

The general patterns for backend parameters are:

  • --{{ backend_name }}-{{ parameter | underscore_to_dash }} for command options, and,
  • RALPH_BACKENDS__DATA__{{ backend_name | uppercase }}__{{ parameter | uppercase }} for environment variables.
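
Applying these patterns to another Swift parameter, default_container (listed in the Swift section below), gives the following equivalent forms; the container name here is only a placeholder:

ralph list --backend swift --swift-default-container archives # command line option
export RALPH_BACKENDS__DATA__SWIFT__DEFAULT_CONTAINER="archives" # environment variable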

Elasticsearch

The Elasticsearch backend is mostly used for indexing purposes (as a data lake), but it can also be used to fetch indexed data from it.

Elasticsearch data backend default configuration.

Attributes:

  • ALLOW_YELLOW_STATUS (bool): Whether to consider Elasticsearch yellow health status to be ok.
  • CLIENT_OPTIONS (dict): A dictionary of valid options for the Elasticsearch class initialization.
  • DEFAULT_INDEX (str): The default index to use for querying Elasticsearch.
  • HOSTS (str or tuple): The comma-separated list of Elasticsearch nodes to connect to.
  • LOCALE_ENCODING (str): The encoding used for reading/writing documents.
  • POINT_IN_TIME_KEEP_ALIVE (str): The duration for which Elasticsearch should keep a point in time alive.
  • READ_CHUNK_SIZE (int): The default chunk size for reading batches of documents.
  • REFRESH_AFTER_WRITE (str or bool): Whether the Elasticsearch index should be refreshed after the write operation.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing batches of documents.
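
As a minimal sketch, the backend could be configured through environment variables, assuming it is registered under the es backend name and that an Elasticsearch node is reachable on localhost (both are assumptions to adapt to your setup):

export RALPH_BACKENDS__DATA__ES__HOSTS="http://localhost:9200"
export RALPH_BACKENDS__DATA__ES__DEFAULT_INDEX="statements"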

MongoDB

The MongoDB backend is mostly used for indexing purposes (as a data lake), but it can also be used to fetch collections of documents from it.

MongoDB data backend default configuration.

Attributes:

  • CONNECTION_URI (str): The MongoDB connection URI.
  • DEFAULT_DATABASE (str): The MongoDB database to connect to.
  • DEFAULT_COLLECTION (str): The MongoDB database collection to get objects from.
  • CLIENT_OPTIONS (MongoClientOptions): A dictionary of MongoDB client options.
  • LOCALE_ENCODING (str): The locale encoding to use when none is provided.
  • READ_CHUNK_SIZE (int): The default chunk size for reading batches of documents.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing batches of documents.
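
A similar sketch for MongoDB, assuming the backend is registered under the mongo name and that a local MongoDB instance is available (database and collection names are placeholders):

export RALPH_BACKENDS__DATA__MONGO__CONNECTION_URI="mongodb://localhost:27017/"
export RALPH_BACKENDS__DATA__MONGO__DEFAULT_DATABASE="statements"
export RALPH_BACKENDS__DATA__MONGO__DEFAULT_COLLECTION="events"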

ClickHouse

The ClickHouse backend can be used as a data lake and to fetch collections of documents from it.

ClickHouse data backend default configuration.

Attributes:

  • HOST (str): ClickHouse server host to connect to.
  • PORT (int): ClickHouse server port to connect to.
  • DATABASE (str): ClickHouse database to connect to.
  • EVENT_TABLE_NAME (str): Table where events live.
  • USERNAME (str): ClickHouse username to connect as (optional).
  • PASSWORD (str): Password for the given ClickHouse username (optional).
  • CLIENT_OPTIONS (ClickHouseClientOptions): A dictionary of valid options for the ClickHouse client connection.
  • LOCALE_ENCODING (str): The locale encoding to use when none is provided.
  • READ_CHUNK_SIZE (int): The default chunk size for reading.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing.

The ClickHouse client options supported in Ralph correspond to the options of the underlying ClickHouse Python client; refer to its documentation for the full list.
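
A minimal configuration sketch, assuming the backend is registered under the clickhouse name (host, database and table names are placeholders):

export RALPH_BACKENDS__DATA__CLICKHOUSE__HOST="clickhouse.example.com"
export RALPH_BACKENDS__DATA__CLICKHOUSE__PORT="8123"
export RALPH_BACKENDS__DATA__CLICKHOUSE__DATABASE="xapi"
export RALPH_BACKENDS__DATA__CLICKHOUSE__EVENT_TABLE_NAME="events"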

OVH - Logs Data Platform (LDP)

LDP is a nice service built by OVH on top of Graylog to follow, analyse and store your logs. Learning events (aka tracking logs) can be stored in GELF format using this backend.

Read-only backend

For now, the LDP backend is read-only, as we consider that it is mostly used to collect primary logs and not as a Ralph target. Feel free to get in touch to prove us wrong, or better: submit your proposal for the write method implementation.

To access OVH’s LDP API, you need to register Ralph as an authorized application and generate an application key, an application secret and a consumer key.

While filling the registration form available at: eu.api.ovh.com/createToken/, be sure to give an appropriate validity time span to your token and allow only GET requests on the /dbaas/logs/* path.

OVH LDP (Log Data Platform) data backend default configuration.

Attributes:

  • APPLICATION_KEY (str): The OVH API application key (AK).
  • APPLICATION_SECRET (str): The OVH API application secret (AS).
  • CONSUMER_KEY (str): The OVH API consumer key (CK).
  • DEFAULT_STREAM_ID (str): The default stream identifier to query.
  • ENDPOINT (str): The OVH API endpoint.
  • READ_CHUNK_SIZE (str): The default chunk size for reading archives.
  • REQUEST_TIMEOUT (int): HTTP request timeout in seconds.
  • SERVICE_NAME (str): The default LDP account name.

For more information about OVH’s API client parameters, please refer to the project’s documentation: github.com/ovh/python-ovh.
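
As a minimal sketch, the generated credentials could then be provided through environment variables, assuming the backend is registered under the ldp name (all values below are placeholders):

export RALPH_BACKENDS__DATA__LDP__ENDPOINT="ovh-eu"
export RALPH_BACKENDS__DATA__LDP__APPLICATION_KEY="<your_application_key>"
export RALPH_BACKENDS__DATA__LDP__APPLICATION_SECRET="<your_application_secret>"
export RALPH_BACKENDS__DATA__LDP__CONSUMER_KEY="<your_consumer_key>"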

OpenStack Swift

Swift is the OpenStack object storage service. This storage backend is fully supported (read and write operations) to stream and store log archives.

Parameters correspond to a standard authentication using the OpenStack Keystone service, plus the configuration needed to work with the target container.

Swift data backend default configuration.

Attributes:

  • AUTH_URL (str): The authentication URL.
  • USERNAME (str): The name of the OpenStack Swift user.
  • PASSWORD (str): The password of the OpenStack Swift user.
  • IDENTITY_API_VERSION (str): The Keystone API version to authenticate to.
  • TENANT_ID (str): The identifier of the tenant of the container.
  • TENANT_NAME (str): The name of the tenant of the container.
  • PROJECT_DOMAIN_NAME (str): The project domain name.
  • REGION_NAME (str): The region where the container is.
  • OBJECT_STORAGE_URL (str): The default storage URL.
  • USER_DOMAIN_NAME (str): The user domain name.
  • DEFAULT_CONTAINER (str): The default target container.
  • LOCALE_ENCODING (str): The encoding used for reading/writing documents.
  • READ_CHUNK_SIZE (str): The default chunk size for reading objects.
  • WRITE_CHUNK_SIZE (str): The default chunk size for writing objects.
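
As a minimal sketch combining some of the parameters above (all values are placeholders; the parameter names are taken from the table, so adapt them to your Ralph version):

export RALPH_BACKENDS__DATA__SWIFT__AUTH_URL="https://keystone.example.com/v3/"
export RALPH_BACKENDS__DATA__SWIFT__USERNAME="johndoe"
export RALPH_BACKENDS__DATA__SWIFT__PASSWORD="xxx"
export RALPH_BACKENDS__DATA__SWIFT__DEFAULT_CONTAINER="archives"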

Amazon S3

S3 is the Amazon Simple Storage Service. This storage backend is fully supported (read and write operations) to stream and store log archives.

Parameters correspond to a standard authentication with the AWS CLI, plus the configuration needed to work with the target bucket.

S3 data backend default configuration.

Attributes:

  • ACCESS_KEY_ID (str): The access key ID for the S3 account.
  • SECRET_ACCESS_KEY (str): The secret key for the S3 account.
  • SESSION_TOKEN (str): The session token for the S3 account.
  • ENDPOINT_URL (str): The S3 endpoint URL.
  • DEFAULT_REGION (str): The default region used when instantiating the client.
  • DEFAULT_BUCKET_NAME (str): The default bucket name targeted.
  • LOCALE_ENCODING (str): The encoding used for writing dictionaries to objects.
  • READ_CHUNK_SIZE (str): The default chunk size for reading objects.
  • WRITE_CHUNK_SIZE (str): The default chunk size for writing objects.
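
As a minimal sketch, assuming the backend is registered under the s3 name (credentials, region and bucket name are placeholders):

export RALPH_BACKENDS__DATA__S3__ACCESS_KEY_ID="<your_access_key_id>"
export RALPH_BACKENDS__DATA__S3__SECRET_ACCESS_KEY="<your_secret_access_key>"
export RALPH_BACKENDS__DATA__S3__DEFAULT_REGION="eu-west-1"
export RALPH_BACKENDS__DATA__S3__DEFAULT_BUCKET_NAME="ralph-archives"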

File system

The file system backend is a dummy template that can be used to develop your own backend. It is a “dummy” backend as it is not intended for practical use (UNIX ls and cat would be more practical).

The only required parameter is the path we want to list or stream content from.

FileSystem data backend default configuration.

Attributes:

  • DEFAULT_DIRECTORY_PATH (str or Path): The default target directory path where to perform list, read and write operations.
  • DEFAULT_QUERY_STRING (str): The default query string to match files for the read operation.
  • LOCALE_ENCODING (str): The encoding used for writing dictionaries to files.
  • READ_CHUNK_SIZE (int): The default chunk size for reading files.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing files.
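
As a minimal sketch, assuming the backend is registered under the fs name (the directory path and query string are placeholders):

export RALPH_BACKENDS__DATA__FS__DEFAULT_DIRECTORY_PATH="/var/lib/ralph/archives"
export RALPH_BACKENDS__DATA__FS__DEFAULT_QUERY_STRING="*"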

Learning Record Store (LRS)

The LRS backend is used to store and retrieve xAPI statements from various systems that follow the xAPI specification (such as our own Ralph LRS, which can be run from this package). LRS systems are mostly used in e-learning infrastructures.

LRS data backend default configuration.

Attributes:

  • BASE_URL (AnyHttpUrl): LRS server URL.
  • USERNAME (str): Basic auth username for LRS authentication.
  • PASSWORD (str): Basic auth password for LRS authentication.
  • HEADERS (dict): Headers defined for the LRS server connection.
  • LOCALE_ENCODING (str): The encoding used for reading statements.
  • READ_CHUNK_SIZE (int): The default chunk size for reading statements.
  • STATUS_ENDPOINT (str): Endpoint used to check server status.
  • STATEMENTS_ENDPOINT (str): Default endpoint for the LRS statements resource.
  • WRITE_CHUNK_SIZE (int): The default chunk size for writing statements.
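
As a minimal sketch, assuming the backend is registered under the lrs name and an LRS listens on localhost (URL, credentials and endpoint are placeholders):

export RALPH_BACKENDS__DATA__LRS__BASE_URL="http://localhost:8100"
export RALPH_BACKENDS__DATA__LRS__USERNAME="ralph"
export RALPH_BACKENDS__DATA__LRS__PASSWORD="secret"
export RALPH_BACKENDS__DATA__LRS__STATEMENTS_ENDPOINT="/xAPI/statements"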

WebSocket

The WebSocket backend is read-only and can be used to get real-time events.

If you use OVH’s Logs Data Platform (LDP), you can retrieve a WebSocket URI to test your data stream by following instructions from the official documentation.

Websocket data backend default configuration.

Attributes:

  • CLIENT_OPTIONS (dict): A dictionary of valid options for the websocket client connection. See WSClientOptions.
  • URI (str): The URI to connect to.
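
As a minimal sketch, assuming the backend is registered under the ws name (the URI below is only a placeholder in the LDP style):

export RALPH_BACKENDS__DATA__WS__URI="wss://gra1.logs.ovh.com/tail/?tk=<your_token>"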

Client options for websockets.connection.

For more details, see the websockets.connection documentation.

Attributes:

  • close_timeout (float): Timeout for closing the connection in seconds.
  • compression (str): Per-message compression (deflate) is activated by default. Setting it to None disables compression.
  • max_size (int): Maximum size of incoming messages in bytes. Setting it to None disables the limit.
  • max_queue (int): Maximum number of incoming messages in the receive buffer. Setting it to None disables the limit.
  • open_timeout (float): Timeout for opening the connection in seconds. Setting it to None disables the timeout.
  • origin (str): Value of the Origin header, for servers that require it.
  • ping_interval (float): Delay between keepalive pings in seconds. Setting it to None disables keepalive pings.
  • ping_timeout (float): Timeout for keepalive pings in seconds. Setting it to None disables timeouts.
  • read_limit (int): High-water mark of the read buffer in bytes.
  • user_agent_header (str): Value of the User-Agent request header. It defaults to "Python/x.y.z websockets/X.Y". Setting it to None removes the header.
  • write_limit (int): High-water mark of the write buffer in bytes.
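
If nested settings follow the same double-underscore delimiter convention as top-level parameters (an assumption to verify against your Ralph version), individual client options might be set like this (placeholder values):

export RALPH_BACKENDS__DATA__WS__CLIENT_OPTIONS__PING_INTERVAL="20"
export RALPH_BACKENDS__DATA__WS__CLIENT_OPTIONS__MAX_SIZE="1048576"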