- The storage for HA databases is problematic. There are several ways to organize the storage; I list the major
  characteristics here. (INNODB is generally faster, but takes about 20% more disk space. Initially it is
  significantly faster and takes 5x the disk space, but this normalizes over time...)

    Method         Database          Performance    Clnt/Cache    MySQL     Gluster    HA
    HostMount      MyISAM/INNODB     8 MB/s         fast          250%      -          Nope. But otherwise the least problems to run.
    Gluster        MyISAM (no logs)  1 MB/s         unusable      150%      600-800%   Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
    Gluster/Block  MyISAM (no logs)  5 MB/s         slow, but OK  200%      ~50%       No problems on reboot, but requires manual work to detach the volume if a node crashes.
    Galera         INNODB            3.5 MB/s       fast          3 x 200%  -          Should be perfect, but I am not sure about the automatic recovery...
    MySQL Slaves   INNODB            6-8 MB/s exp.  fast                               Available data is HA, but caching is not. We can easily turn a slave into the master.
    DRBD           MyISAM (no logs)  4-6 MB/s exp.  ?                                  I expect it to be a faster option, but it does not fit the complete concept.

  Gluster is way too slow for anything. If a node crashes, MyISAM tables may be left in a corrupted state, and the recovery will take
ages to complete.

  Gluster/Block is faster, but HA suffers. The volume stays attached to the pod that was running on the crashed node. It is not detached
automatically until the failed pod (stuck in the 'Unknown' state) is killed with
    oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
Then, after some delay, it is re-attached to the new running pod. Technically, we could run a kind of monitoring service which detects
such pods and restarts them (a sketch is given at the end of this overview). Still, this solution is limited to MyISAM with binary
logging disabled. Unlike the plain Gluster solution, the clients may use the system while caching is in progress, but it is quite slow.
The main trouble is MyISAM corruption; the recovery is slow.

  Galera is slower than Gluster/Block, but is fully available. The clients also have more servers to query data from. The cluster start-up
is a bit tricky and I am not sure that everything will work smoothly right now; some tuning may be necessary. Furthermore, it seems that
if the cluster crashes, we can recover from one of the nodes, but all the data on the other members will be destroyed and they will pull
the complete dataset. This synchronization is faster than caching (~ 140 MB/s), but it will still take about 10 hours to synchronize
5 TB of KATRIN data.

So, there is really no fully HA-capable solution at the moment. The most reasonable option seems to be to compromise on the HA of caching.
  - MySQL with slaves. The asynchronous replication should be significantly faster than Galera. The passthrough to the source databases
    keeps working (i.e. status displays), current data is available, and we can easily switch the master if necessary (a promotion sketch
    is given below).

The other reasonable options have some problems at the moment and can't be used.
  - Galera. A fine solution, but it would need some degree of initial maintenance to work stably. Furthermore, the caching is quite slow,
    and the resync is a big issue.
  - Gluster/Block would be a good solution if the volume detachment is fixed. As it stands, we don't have HA without manual intervention.
    Furthermore, the MyISAM recovery is quite slow.
  - HostMount will use our 3-node storage optimally. But if something crashes, it takes about a week to recache the data.
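
A minimal sketch of such a watchdog. The 'app=mysql' label selector and the 60 s interval are assumptions; only the delete
command itself comes from the notes above:

    #!/bin/sh
    # Watchdog sketch: force-delete pods stuck in the 'Unknown' state so that
    # OpenShift detaches the block volume and the replacement pod can start.
    # The 'app=mysql' label is a placeholder for the actual pod selector.
    while true; do
        oc -n adei get pods -l app=mysql --no-headers | \
            awk '$3 == "Unknown" { print $1 }' | \
            while read pod; do
                oc -n adei delete --force --grace-period=0 pod "$pod"
            done
        sleep 60
    done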
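
Since MySQL with slaves is the preferred compromise, here is a sketch of the manual master switch using stock MySQL
statements (the host name is a placeholder; the clients must be repointed to the new master afterwards):

    -- On the slave that is to become the new master:
    STOP SLAVE;                   -- stop applying events from the old master
    RESET SLAVE ALL;              -- drop the old replication configuration
    SET GLOBAL read_only = OFF;   -- accept writes again

    -- On any remaining slaves, repoint replication (placeholder host):
    STOP SLAVE;
    CHANGE MASTER TO MASTER_HOST = 'new-master.adei.example';
    START SLAVE;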

Gluster/Block
=============
  The idea is pretty simple. A standard Gluster file system is used to store 'block' files (just normal files). These files are used as
  block devices with a single-pod access policy. The GFApi interface is used to access the data on Gluster (avoiding context switches),
  and the devices are exposed over iSCSI to the clients.

  There are a couple of problems with the configuration and at run-time.
  - The default Gluster containers will complain about rpcbind. We are using host networking in this case, and the required port (111)
    conflicts between the container and the host system. We are, however, able to just use the host rpcbind. Consequently, rpcbind should
    be removed from the Gluster container and the requirement removed from the gluster-blockd systemd service. It is still worth checking
    that the port is accessible from the container (but it should be). We additionally need 'iscsi-initiator-utils' in the container.

  - Only a single pod should have access to the block device. Consequently, when the volume is attached to a client, other pods can't use
    it any more. The problem starts if the node running the pod dies. This is not handled well by OpenShift at the moment. The volume
    remains attached to the pod in the 'Unknown' state until that pod is manually killed. Only then, after another delay, is it detached
    and made available to the replacement pod (which is stuck in the ContainerCreating phase until then). Pods in the 'Unknown' state are
    not easy to kill:
      oc delete --force --grace-period=0 pod/mysql-1-m4wcq

  - Heketi is buggy.
    * If something goes wrong, it starts creating multitudes of Gluster volumes and finally crashes with a broken database. It is possible
      to remove the volumes and recover the database from a backup, but this is time-consuming and unreliable for an HA solution.
    * In particular, this happens if we try to allocate more disk space than is available. OpenShift configures the size of the Gluster
      file system used to back the block devices; it is 100 GB by default. If we specify 500Gi in the PVC, it will try to create 15 such
      devices (another maximum configured by OpenShift) before crashing.
    * Overall, I'd rather use only manual provisioning.

  - Even without Heketi it is still problematic (maybe it is better with the official RH container running GlusterFS 3.7, but I have not
    checked)... We can try again with GlusterFS 4.1. There are probably multiple problems, but:
    * GlusterFS may fail on one of the nodes (while still showing it as up and running). If any of the block services has problems
      communicating with the local gluster daemon, most requests to the gluster daemon will time out (info/list will still work, but
      slowly).

Galera
======
  - To bring a new cluster up, there are several steps:
    * All members need to initialize standard standalone databases.
    * One node should perform the cluster initialization, and the other nodes join after it is completed.
    * The joining nodes will delete their mysql folders and re-synchronize from the first node.
    * Then the cluster will be up and all nodes will be in the so-called primary state.

  - The procedure is similar for crash recovery:
    * If a node leaves the cluster, it may just come back and be re-synchronized from the other cluster members if there is a quorum.
      For this reason, it is necessary to keep at least 3 nodes running.
    * If all nodes crashed, then again one node should restart the cluster and the others join later. For older versions, it is necessary
      to run mysqld with '--wsrep-new-cluster'. The new versions try to automate this and will recover automatically if
      'safe_to_bootstrap' is set to 1 in 'grastate.dat' in the mysql data folder. It is supposed to be set by Galera based on some
      heuristic, but in fact I always had to set it manually. IMPORTANT: it should be set on only one of the nodes.
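
      For illustration, 'grastate.dat' looks roughly as follows (the uuid and seqno below are placeholders); the flag
      should be flipped on the bootstrap node only:

        # GALERA saved state
        version: 2.1
        uuid:    00000000-0000-0000-0000-000000000000
        seqno:   -1
        safe_to_bootstrap: 1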
  - Synchronization only works for INNODB tables. Furthermore, binary logging should be turned on (yes, it is possible to turn it off
    and there are no complaints, but then only the table names are synchronized; no data is pushed between the nodes). A configuration
    sketch is given at the end of this section.

  - OpenShift uses a 'StatefulSet' to perform such an initialization. In particular, it starts the first node and waits until it is
    running before starting the next one.
    * Now the nodes need to talk to each other. A 'headless' service is used for that. Unlike a standard service, the DNS does not load
      balance the service pods, but returns the IPs of all service members if the appropriate DNS request (SRV) is sent. In the Service
      spec we specify:
        clusterIP: None
      For the clients we still need a load-balancing service, so we need to add a second service to serve their needs.
    * To decide if it should perform the cluster initialization, a node tries to resolve the members of the service. If it is alone, it
      initializes the cluster; otherwise it tries to join the other members already registered in the service. The problem is that, by
      default, OpenShift only adds a member to the service once it is ready (readiness check). Consequently, all nodes would try to
      initialize. There are two methods to prevent this: one works up to OpenShift 3.7 and the other from 3.8 on, but there is no harm
      in using both for now.
      The new one is to set in the Service spec:
        publishNotReadyAddresses: true
      The old one is to specify in the Service metadata.annotations:
        service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
    * Still, we should keep re-checking for peers for a while, until the other pods have had a chance to start.
    * Furthermore, there are some differences to the 'dc' definition. We need to specify 'serviceName' in the StatefulSet spec:
        serviceName: adei-ss
      There are a few other minor differences. For instance, the 'selector' has a more flexible notation and should include 'matchLabels'
      before specifying the pod selector, etc. (see the YAML sketch at the end of this section).

  - To check the current status of the cluster:
      SHOW STATUS LIKE 'wsrep_cluster_size';
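
  A sketch of the corresponding MySQL/Galera configuration; the cluster name, the peer list, and the provider path are
  placeholders that vary per installation:

    [mysqld]
    default_storage_engine   = InnoDB       # Galera only replicates INNODB tables
    binlog_format            = ROW
    log_bin                  = mysql-bin    # keep binary logging on, per the note above
    innodb_autoinc_lock_mode = 2
    wsrep_on                 = ON
    wsrep_provider           = /usr/lib64/galera/libgalera_smm.so  # path differs per distribution
    wsrep_cluster_name       = adei                                # placeholder
    wsrep_cluster_address    = gcomm://node-0,node-1,node-2        # placeholder peer list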
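
  A minimal YAML sketch of the pieces described above. The 'adei-ss' service name comes from the text; the 'app: galera'
  label, the client service name, and the image are invented for illustration:

    # Headless service for peer discovery: DNS returns the IPs of all members
    # (SRV) instead of load balancing; not-ready pods are published too.
    apiVersion: v1
    kind: Service
    metadata:
      name: adei-ss                      # matches 'serviceName' below
      annotations:
        service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"   # up to 3.7
    spec:
      clusterIP: None                    # this is what makes the service headless
      publishNotReadyAddresses: true     # 3.8 and up
      selector:
        app: galera                      # placeholder pod label
      ports:
      - name: mysql
        port: 3306
    ---
    # Second, normal service for the clients: this one does load balance.
    apiVersion: v1
    kind: Service
    metadata:
      name: adei-galera                  # placeholder name
    spec:
      selector:
        app: galera
      ports:
      - name: mysql
        port: 3306
    ---
    # StatefulSet skeleton: pods are started one by one; unlike 'dc', the
    # selector needs 'matchLabels' and 'serviceName' must be specified.
    apiVersion: apps/v1                  # older 3.x clusters may need apps/v1beta1
    kind: StatefulSet
    metadata:
      name: galera
    spec:
      serviceName: adei-ss
      replicas: 3
      selector:
        matchLabels:
          app: galera
      template:
        metadata:
          labels:
            app: galera
        spec:
          containers:
          - name: mysql
            image: mariadb:10.2          # placeholder image
            ports:
            - containerPort: 3306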