Supported cluster architectures
===============================

The Ocean-Stack structure is two-fold: an administration island and the managed islands.

* The administration island storage and network must be sized to meet the needs of the islands it will manage: all the management data will be stored and accessed from there, and most of the cluster's services will be hosted there.
* The administration island nodes must be deployed and configured in triplets, for resiliency and quorum algorithm reasons.
* It is strongly recommended that managed islands be connected to BGP-enabled switches that also support L3 interfaces (Cisco Nexus 9k models do).
* From a network point of view, an island is a separate entity: no broadcast or multicast traffic is supported between islands.
* Any OS is supported on managed nodes as long as it can be booted or deployed with standard tools (PXE, iPXE, HTTP, DHCP, ...).

This documentation provides an overview of how Ocean-Stack is architectured and of the processes required to install and operate the cluster. It also contains a handbook covering debugging, automation, and any other product-specific documentation that should be distributed with Ocean-Stack.

You should head to this documentation when you are:

* Planning to install a cluster using Ocean-Stack
* Installing the administration island of Ocean-Stack
* Operating the administration island of Ocean-Stack

Management architecture
=======================

Network architecture
--------------------

The reference Ocean-Stack deployment scheme consists of a group of management servers that provide services for the entire cluster (designated as the *top* or *worker* management servers) and several server groups that provide services for a particular part (or islet) of the cluster (designated as *proximity* management servers). Such islets may contain compute nodes, IP or LNET routers, or anything else that forms the HPC cluster itself.

The network architecture is derived from this deployment scheme and follows these rules:

* Services provided by management servers may be bandwidth-hungry, so network capacity planning must be taken into account. The network where those services are provided is designated as the *administration* network.
* Management servers and related equipment must be reachable through a trusted network (designated as the *equipment* network) isolated from user traffic.
* The *administration* network may evolve over the cluster's lifetime, so performance scalability and administration flexibility must be taken into account.
* Each and every network brick must be redundant. Known single points of failure should be listed and kept in mind during the installation process (a short sketch below illustrates one way to build such a list).

In our reference deployment, the *equipment* network is kept as simple as possible: a single flat network consisting of stacked switches. All the management servers and equipment must be connected to this network.

The *administration* network consists of:

* The core layer, where *top* and *worker* management servers are connected.
* The aggregation layer, where *proximity* management servers are connected.
* The (optional) access layer, where end nodes (such as compute nodes) are connected. This last layer may be provided by the compute-node drawers, and it can be merged with the aggregation layer if the number of end nodes is limited.
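As noted in the rules above, known single points of failure should be listed before the installation. Below is a minimal sketch of how such a list could be assembled; the inventory layout, device names and thresholds are illustrative assumptions, not part of Ocean-Stack.

.. code-block:: python

   # Purely illustrative inventory: layer name -> {switch name: independent uplinks
   # towards the layer above}.  Core switches have no layer above, hence ``None``.
   INVENTORY = {
       "core":        {"core_sw1": None, "core_sw2": None},
       "aggregation": {"agg_sw10": 4, "agg_sw11": 4, "agg_sw20": 4, "agg_sw21": 1},
   }

   def single_points_of_failure(inventory, min_switches=2, min_uplinks=2):
       """Return a warning for every network brick that is not redundant."""
       warnings = []
       for layer, switches in inventory.items():
           if len(switches) < min_switches:
               warnings.append(f"layer '{layer}' relies on a single switch")
           for name, uplinks in switches.items():
               if uplinks is not None and uplinks < min_uplinks:
                   warnings.append(f"'{name}' ({layer}) has only {uplinks} uplink(s)")
       return warnings

   for warning in single_points_of_failure(INVENTORY):
       print("potential SPOF:", warning)

With this sample inventory, the sketch flags ``agg_sw21``, which has a single uplink towards the core layer, as a potential single point of failure.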
To give a full picture, here is a concrete example. Imagine the following:

* 2 compute-node islets with 6 racks; each rack has an aggregated bandwidth of 40Gb/s, which amounts to 240Gb/s of aggregated bandwidth in total.
* Each compute-node islet has 2 dedicated *proximity* management servers, connected with single 40Gb/s links.
* 3 *top* and 3 *worker* management servers with 40Gb/s links.

With such a cluster, the network architecture would be the following. ``isletXX`` servers designate *proximity* management servers, and islets are numbered 10 and 20.

.. image:: deepblue_net.png

Here, the core layer is composed of two 40G/100G switches. The aggregation layer is composed of two 40G switches per islet, connected to the core layer with four 40Gb/s links. The access layer is abstracted, as it could be provided by the compute-node manufacturer; each rack is connected to the aggregation layer with four 10Gb/s links. Each *proximity* management server is connected to a different aggregation switch (for clarity, the diagram above does not show this). *Top* and *worker* management nodes are distributed across both core switches.

With such a deployment, a service (like a shared file system) using all the *top* and *worker* management servers can use up to 240Gb/s of network bandwidth. This resource is shared between all the compute-node islets, each of which can use up to 160Gb/s of bandwidth. On the other side, *proximity* management servers have 80Gb/s of per-islet dedicated bandwidth. The sketch at the end of this section works through this arithmetic.

Each and every islet (and its related *proximity* management servers) forms a VLAN group where IP routing is performed by the aggregation layer. The *top* and *worker* management servers are also grouped together. IP routing between the groups is either performed by the core and aggregation layers (2-hop routing) or dynamically routed by using all the links between the core and aggregation layers as a separate L2 network. Such VLANs can be represented as follows:

.. nwdiag::

   nwdiag{
     group {
       islet10
       islet11
     }
     group {
       islet20
       islet21
     }
     group {
       eth_switch10
       eth_switch11
       eth_switch20
       eth_switch21
     }
     group {
       eth_switch1
       eth_switch2
     }
     network top_vlans{
       label = "Management VLANs"
       top1
       top2
       top3
       worker1
       worker2
       worker3
       eth_switch1
       eth_switch2
     }
     network ethernet_fabric{
       label = "Transit VLAN"
       eth_switch1
       eth_switch2
       eth_switch10
       eth_switch11
       eth_switch20
       eth_switch21
     }
     network islet10_vlan{
       label = "Islet 10 VLANs"
       islet10
       islet11
       eth_switch10
       eth_switch11
     }
     network islet20_vlan{
       label = "Islet 20 VLANs"
       islet20
       islet21
       eth_switch20
       eth_switch21
     }
   }
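The capacity figures quoted above follow directly from the link counts of the example. Here is a minimal sketch of the arithmetic; the variable names are purely illustrative and not part of Ocean-Stack.

.. code-block:: python

   # Capacity arithmetic for the example deployment (figures taken from the text above).
   LINK_SPEED_GBPS = 40          # management servers and uplinks use 40Gb/s links
   TOP_AND_WORKER_SERVERS = 6    # 3 *top* + 3 *worker* management servers
   UPLINKS_PER_ISLET = 4         # aggregation -> core links per islet
   PROXIMITY_PER_ISLET = 2       # *proximity* management servers per islet

   # Bandwidth a service spanning all top/worker servers can deliver, shared by all islets.
   shared_services_bw = TOP_AND_WORKER_SERVERS * LINK_SPEED_GBPS   # 240 Gb/s

   # Bandwidth a single islet can draw from the core layer.
   per_islet_core_bw = UPLINKS_PER_ISLET * LINK_SPEED_GBPS         # 160 Gb/s

   # Bandwidth dedicated to an islet by its proximity management servers.
   per_islet_proximity_bw = PROXIMITY_PER_ISLET * LINK_SPEED_GBPS  # 80 Gb/s

   print(shared_services_bw, per_islet_core_bw, per_islet_proximity_bw)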
Services architectures
----------------------

.. todo:: Which, how and where services are hosted

The Ocean-Stack project is based on a set of low-level and mid-level services that together provide the high-level services needed to administer and manage the cluster resources, and to give end users access to those resources.

Low-level services bring a simple, highly-available, hyperconverged infrastructure enabling the provisioning of virtual machines on the set of available management nodes. This infrastructure is leveraged to host the mid-level and high-level services. Low-level services include:

* Glusterfs
* Fleet
* PCOCC

Mid-level services bring the network flexibility required to scale the management stack with the amount of resources to manage in the cluster. Mid-level services include:

* Named
* HAProxy
* Keepalived

High-level services provide all the traditional services commonly run on production clusters to fulfill their management and usage requirements. High-level services include:

* HTTP
* Puppet
* LDAP
* Rsyslog
* Conman
* Prometheus
* Slurm
* BMGR
* External network IP routing
* iSCSId
* ...

Services design details can be found in the following sections.

.. raw:: latex

   \clearpage

.. toctree::
   :maxdepth: 2

   blueprints/dns
   blueprints/dhcp
   blueprints/tftp
   blueprints/http
   blueprints/puppet
   blueprints/ldap
   blueprints/rsyslog
   blueprints/conman
   blueprints/ntp
   blueprints/gluster
   blueprints/gwip
   blueprints/monitoring
   blueprints/slurm
   blueprints/diskless
   blueprints/wrap_up