- The storage for HA databases is problematic. There are several ways to organize the storage; I list the major
  characteristics here. (INNODB is generally faster, but takes about 20% more disk space. Initially it is
  significantly faster and takes 5x the disk space, but this normalizes over time...)

    Method         Database          Performance    Clnt/Cache    MySQL     Gluster    HA
    HostMount      MyISAM/INNODB     8 MB/s         fast          250%      -          Nope. But otherwise the least problems to run.
    Gluster        MyISAM (no logs)  1 MB/s         unusable      150%      600-800%   Perfect. But too slow (up to completely unusable if bin-logs are on). Slow MyISAM recovery!
    Gluster/Block  MyISAM (no logs)  5 MB/s         slow, but OK  200%      ~50%       No problems on reboot, but requires manual work to detach the volume if a node crashes.
    Galera         INNODB            3.5 MB/s       fast          3 x 200%  -          Should be perfect, but I am not sure about the automatic recovery...
    MySQL Slaves   INNODB            6-8 MB/s exp.  fast                               Available data is HA, but caching is not. We can easily turn a slave into the master.
    DRBD           MyISAM (no logs)  4-6 MB/s exp.  ?                                  I expect it to be a faster option, but it does not fit the complete concept.

  Gluster is way too slow for anything. If a node crashes, MyISAM tables may be left in a corrupted state, and the recovery will take
ages to complete.

  Gluster/Block is faster, but HA suffers. The volume stays attached to the pod that was running on the crashed node. It is not detached
automatically until the failed pod (stuck in the 'Unknown' state) is killed with
    oc -n adei delete --force --grace-period=0 pod mysql-1-m4wcq
Then, after some delay, it is re-attached to the new running pod. Technically, we could run a kind of monitoring service which detects
such pods and restarts them (a sketch is given at the end of this overview). Still, this solution is limited to MyISAM with binary
logging disabled. Unlike the plain Gluster solution, the clients may use the system while caching is in progress, but it is quite slow.
The main trouble is MyISAM corruption; the recovery is slow.

  Galera is slower than Gluster/Block, but is fully available. The clients also have more servers to query data from. The cluster start-up
is a bit tricky and I am not sure that everything will work smoothly right now; some tuning may be necessary. Furthermore, it seems that
if the cluster crashes, we can recover from one of the nodes, but all the data on the other members will be destroyed and they will pull
the complete dataset. This synchronization is faster than caching (~ 140 MB/s), but it will still take about 10 hours to synchronize
5 TB of KATRIN data.

So, there is really no fully HA-capable solution at the moment. The most reasonable option seems to be to compromise on the HA of caching.
  - MySQL with slaves. The asynchronous replication should be significantly faster than Galera. The passthrough to the source databases
    keeps working (i.e. status displays), current data is available, and we can easily switch the master if necessary (a promotion sketch
    is given below).

The other reasonable options have some problems at the moment and can't be used.
  - Galera. A fine solution, but it would need some degree of initial maintenance to work stably. Furthermore, the caching is quite slow,
    and the resync is a big issue.
  - Gluster/Block would be a good solution if the volume detachment is fixed. As it stands, we don't have HA without manual intervention.
    Furthermore, the MyISAM recovery is quite slow.
  - HostMount will use our 3-node storage optimally. But if something crashes, it takes about a week to recache the data.
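
A minimal sketch of such a watchdog. The 'app=mysql' label selector and the 60 s interval are assumptions; only the delete
command itself comes from the notes above:

    #!/bin/sh
    # Watchdog sketch: force-delete pods stuck in the 'Unknown' state so that
    # OpenShift detaches the block volume and the replacement pod can start.
    # The 'app=mysql' label is a placeholder for the actual pod selector.
    while true; do
        oc -n adei get pods -l app=mysql --no-headers | \
            awk '$3 == "Unknown" { print $1 }' | \
            while read pod; do
                oc -n adei delete --force --grace-period=0 pod "$pod"
            done
        sleep 60
    done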
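
Since MySQL with slaves is the preferred compromise, here is a sketch of the manual master switch using stock MySQL
statements (the host name is a placeholder; the clients must be repointed to the new master afterwards):

    -- On the slave that is to become the new master:
    STOP SLAVE;                   -- stop applying events from the old master
    RESET SLAVE ALL;              -- drop the old replication configuration
    SET GLOBAL read_only = OFF;   -- accept writes again

    -- On any remaining slaves, repoint replication (placeholder host):
    STOP SLAVE;
    CHANGE MASTER TO MASTER_HOST = 'new-master.adei.example';
    START SLAVE;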

Gluster/Block
=============
  The idea is pretty simple. A standard Gluster file system is used to store 'block' files (just normal files). These files are used as
  block devices with a single-pod access policy. The GFApi interface is used to access the data on Gluster (avoiding context switches),
  and the devices are exposed over iSCSI to the clients.

  There are a couple of problems with the configuration and at run-time.
  - The default Gluster containers will complain about rpcbind. We are using host networking in this case, and the required port (111)
    conflicts between the container and the host system. We are, however, able to just use the host rpcbind. Consequently, rpcbind should
    be removed from the Gluster container and the requirement removed from the gluster-blockd systemd service. It is still worth checking
    that the port is accessible from the container (but it should be). We additionally need 'iscsi-initiator-utils' in the container.

  - Only a single pod should have access to the block device. Consequently, when the volume is attached to a client, other pods can't use
    it any more. The problem starts if the node running the pod dies. This is not handled well by OpenShift at the moment. The volume
    remains attached to the pod in the 'Unknown' state until that pod is manually killed. Only then, after another delay, is it detached
    and made available to the replacement pod (which is stuck in the ContainerCreating phase until then). Pods in the 'Unknown' state are
    not easy to kill:
      oc delete --force --grace-period=0 pod/mysql-1-m4wcq

  - Heketi is buggy.
    * If something goes wrong, it starts creating multitudes of Gluster volumes and finally crashes with a broken database. It is possible
      to remove the volumes and recover the database from a backup, but this is time-consuming and unreliable for an HA solution.
    * In particular, this happens if we try to allocate more disk space than is available. OpenShift configures the size of the Gluster
      file system used to back the block devices; it is 100 GB by default. If we specify 500Gi in the PVC, it will try to create 15 such
      devices (another maximum configured by OpenShift) before crashing.
    * Overall, I'd rather use only manual provisioning.

  - Even without Heketi it is still problematic (maybe it is better with the official RH container running GlusterFS 3.7, but I have not
    checked)... We can try again with GlusterFS 4.1. There are probably multiple problems, but:
    * GlusterFS may fail on one of the nodes (while still showing it as up and running). If any of the block services has problems
      communicating with the local gluster daemon, most requests to the gluster daemon will time out (info/list will still work, but
      slowly).

Galera
======
  - To bring a new cluster up, there are several steps:
    * All members need to initialize standard standalone databases.
    * One node should perform the cluster initialization, and the other nodes join after it is completed.
    * The joining nodes will delete their mysql folders and re-synchronize from the first node.
    * Then the cluster will be up and all nodes will be in the so-called primary state.

  - The procedure is similar for crash recovery:
    * If a node leaves the cluster, it may just come back and be re-synchronized from the other cluster members if there is a quorum.
      For this reason, it is necessary to keep at least 3 nodes running.
    * If all nodes crashed, then again one node should restart the cluster and the others join later. For older versions, it is necessary
      to run mysqld with '--wsrep-new-cluster'. The new versions try to automate this and will recover automatically if
      'safe_to_bootstrap' is set to 1 in 'grastate.dat' in the mysql data folder. It is supposed to be set by Galera based on some
      heuristic, but in fact I always had to set it manually. IMPORTANT: it should be set on only one of the nodes.
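
      For illustration, 'grastate.dat' looks roughly as follows (the uuid and seqno below are placeholders); the flag
      should be flipped on the bootstrap node only:

        # GALERA saved state
        version: 2.1
        uuid:    00000000-0000-0000-0000-000000000000
        seqno:   -1
        safe_to_bootstrap: 1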
  - Synchronization only works for INNODB tables. Furthermore, binary logging should be turned on (yes, it is possible to turn it off
    and there are no complaints, but then only the table names are synchronized; no data is pushed between the nodes). A configuration
    sketch is given at the end of this section.

  - OpenShift uses a 'StatefulSet' to perform such an initialization. In particular, it starts the first node and waits until it is
    running before starting the next one.
    * Now the nodes need to talk to each other. A 'headless' service is used for that. Unlike a standard service, the DNS does not load
      balance the service pods, but returns the IPs of all service members if the appropriate DNS request (SRV) is sent. In the Service
      spec we specify:
        clusterIP: None
      For the clients we still need a load-balancing service, so we need to add a second service to serve their needs.
    * To decide if it should perform the cluster initialization, a node tries to resolve the members of the service. If it is alone, it
      initializes the cluster; otherwise it tries to join the other members already registered in the service. The problem is that, by
      default, OpenShift only adds a member to the service once it is ready (readiness check). Consequently, all nodes would try to
      initialize. There are two methods to prevent this: one works up to OpenShift 3.7 and the other from 3.8 on, but there is no harm
      in using both for now.
      The new one is to set in the Service spec:
        publishNotReadyAddresses: true
      The old one is to specify in the Service metadata.annotations:
        service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
    * Still, we should keep re-checking for peers for a while, until the other pods have had a chance to start.
    * Furthermore, there are some differences to the 'dc' definition. We need to specify 'serviceName' in the StatefulSet spec:
        serviceName: adei-ss
      There are a few other minor differences. For instance, the 'selector' has a more flexible notation and should include 'matchLabels'
      before specifying the pod selector, etc. (see the YAML sketch at the end of this section).

  - To check the current status of the cluster:
      SHOW STATUS LIKE 'wsrep_cluster_size';
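
  A sketch of the corresponding MySQL/Galera configuration; the cluster name, the peer list, and the provider path are
  placeholders that vary per installation:

    [mysqld]
    default_storage_engine   = InnoDB       # Galera only replicates INNODB tables
    binlog_format            = ROW
    log_bin                  = mysql-bin    # keep binary logging on, per the note above
    innodb_autoinc_lock_mode = 2
    wsrep_on                 = ON
    wsrep_provider           = /usr/lib64/galera/libgalera_smm.so  # path differs per distribution
    wsrep_cluster_name       = adei                                # placeholder
    wsrep_cluster_address    = gcomm://node-0,node-1,node-2        # placeholder peer list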
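
  A minimal YAML sketch of the pieces described above. The 'adei-ss' service name comes from the text; the 'app: galera'
  label, the client service name, and the image are invented for illustration:

    # Headless service for peer discovery: DNS returns the IPs of all members
    # (SRV) instead of load balancing; not-ready pods are published too.
    apiVersion: v1
    kind: Service
    metadata:
      name: adei-ss                      # matches 'serviceName' below
      annotations:
        service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"   # up to 3.7
    spec:
      clusterIP: None                    # this is what makes the service headless
      publishNotReadyAddresses: true     # 3.8 and up
      selector:
        app: galera                      # placeholder pod label
      ports:
      - name: mysql
        port: 3306
    ---
    # Second, normal service for the clients: this one does load balance.
    apiVersion: v1
    kind: Service
    metadata:
      name: adei-galera                  # placeholder name
    spec:
      selector:
        app: galera
      ports:
      - name: mysql
        port: 3306
    ---
    # StatefulSet skeleton: pods are started one by one; unlike 'dc', the
    # selector needs 'matchLabels' and 'serviceName' must be specified.
    apiVersion: apps/v1                  # older 3.x clusters may need apps/v1beta1
    kind: StatefulSet
    metadata:
      name: galera
    spec:
      serviceName: adei-ss
      replicas: 3
      selector:
        matchLabels:
          app: galera
      template:
        metadata:
          labels:
            app: galera
        spec:
          containers:
          - name: mysql
            image: mariadb:10.2          # placeholder image
            ports:
            - containerPort: 3306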