Data Lake

Diagram

Description

The Data Lake service is built around the following basic ideas:

All the nodes already forward their syslog messages to their rsyslog islet servers.
Those rsyslog servers, in turn, write those messages to files on the Gluster FS.

Let’s seize this opportunity to also forward them to a data lake service, which provides the following features:

Easy log searching via different means:
- Web front end (Kibana)
- Command line (lstail, to be installed)
Logs can be queried by several criteria: node name, syslog type, severity, and also wildcarded words or even regexes. Results can also be filtered by log time.
Alerting based on text queries in syslog messages (has to be set up by the admins)
Metrics based on syslogs
Ability to store large amounts of data
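
As an illustration of the query criteria listed above, here is a minimal Python sketch that builds an Elasticsearch query DSL body combining a node name, a severity, a wildcard or regex on the message text, and a time filter. The field names (`hostname`, `severity`, `message`, `@timestamp`) and the function name `build_log_query` are assumptions made for the example; the real names depend on how the syslog messages are parsed at ingestion. The sketch only builds the request body and does not contact a cluster.

```python
import json


def build_log_query(host=None, severity=None, wildcard=None, regex=None,
                    since="now-1h"):
    """Build an Elasticsearch query DSL body for searching syslog documents.

    All field names below are hypothetical and must be adapted to the
    actual index mapping.
    """
    # Always restrict the search to a time window (date math, e.g. "now-1h").
    filters = [{"range": {"@timestamp": {"gte": since}}}]
    if host is not None:
        # Exact match on the emitting node.
        filters.append({"term": {"hostname": host}})
    if severity is not None:
        # Exact match on the syslog severity (e.g. "err", "warning").
        filters.append({"term": {"severity": severity}})
    if wildcard is not None:
        # Wildcard pattern on the message text, e.g. "*timeout*".
        filters.append({"wildcard": {"message": {"value": wildcard}}})
    if regex is not None:
        # Regular expression on the message text.
        filters.append({"regexp": {"message": {"value": regex}}})
    return {
        "query": {"bool": {"filter": filters}},
        "sort": [{"@timestamp": "desc"}],  # most recent messages first
    }


if __name__ == "__main__":
    body = build_log_query(host="node042", severity="err", wildcard="*timeout*")
    print(json.dumps(body, indent=2))
```

The resulting body could be sent to the `_search` endpoint of the relevant index, or pasted into the Kibana Dev Tools console. The same kind of text query could also serve as the condition of an alert, once the admins have set alerting up.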

Technical aspects

The chosen technology for this architecture is the OSS version of Elasticsearch. Elasticsearch nodes run in VMs hosted by dedicated worker servers. If necessary, Elasticsearch can be set up to run on bare metal. Benchmarks have to be performed in order to assess the performance overhead induced by the use of VMs. Performance may vary greatly depending on the structure of the data ingested by the Elasticsearch servers as well as the type of queries they receive.

It is preferable that those workers use SSD disks in order to maximize disk throughput and overall Elasticsearch performance.

As the amount of data grows, the Elasticsearch cluster performance might decrease, be it for ingestion or for queries. In this case, more nodes can be added in order to spread the load across the cluster.

The general rules of thumb for Elasticsearch sizing are described here: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/size-your-shards.html
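
To give an idea of how such rules of thumb translate into numbers, here is a rough Python sketch. The default values (a target of about 30 GB per shard, inside the commonly quoted 10 GB to 50 GB range, and at most 20 shards per GB of JVM heap) are assumptions to be checked against the guide linked above, and the function names are hypothetical. This is a back-of-the-envelope estimate, not a substitute for benchmarks.

```python
import math


def estimate_primary_shards(total_data_gb, target_shard_gb=30.0):
    """Estimate how many primary shards are needed for a given data volume.

    target_shard_gb is an assumed target shard size, to be tuned.
    """
    if total_data_gb <= 0:
        raise ValueError("total_data_gb must be positive")
    return max(1, math.ceil(total_data_gb / target_shard_gb))


def max_shards_for_heap(heap_gb, shards_per_heap_gb=20):
    """Upper bound on the number of shards a node should hold.

    shards_per_heap_gb is an assumed limit per GB of JVM heap.
    """
    return int(heap_gb * shards_per_heap_gb)


if __name__ == "__main__":
    # Hypothetical figures: 900 GB of logs, nodes with 16 GB of heap.
    print("primary shards:", estimate_primary_shards(900))
    print("max shards per node:", max_shards_for_heap(16))
```

If the estimated shard count, replicas included, exceeds what the existing nodes should hold, that is one indication that nodes need to be added.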

It is up to the administrators to adapt the cluster tuning to their use case(s).

Todo

The ETL service will be described in this blueprint