author    Suren A. Chilingaryan <csa@suren.me>  2025-12-09 16:14:26 +0000
committer Suren A. Chilingaryan <csa@suren.me>  2025-12-09 16:14:26 +0000
commit    77aa9c433f9255d713394e3b25987fa2b4a03a1a (patch)
tree      ddc5d87bf838bd589f36b43b53955ad8207796a2 /logs/2025.11.03.storage-log.txt
parent    d35216ee0cbf9f1a84a6d4151daf870b1ff00395 (diff)
Finalize storage failure on ipekatrin1: scripts & logs (HEAD -> master)
Diffstat (limited to 'logs/2025.11.03.storage-log.txt')
-rw-r--r--  logs/2025.11.03.storage-log.txt  140
1 file changed, 140 insertions, 0 deletions
diff --git a/logs/2025.11.03.storage-log.txt b/logs/2025.11.03.storage-log.txt
new file mode 100644
index 0000000..a95dc57
--- /dev/null
+++ b/logs/2025.11.03.storage-log.txt
@@ -0,0 +1,140 @@
+Status
+======
+ - Raid controller failed on ipekatrin1
+ - The system was not running stably after the replacement (disks disconnect after 20-30 minutes of operation)
+ - ipekatrin1 was temporarily converted into a master-only node (app scheduling disabled, glusterfs stopped)
+ - Heketi and gluster-blockd were disabled and will not be available any further. Existing heketi volumes are preserved.
+ - The new disks (from ipekatrinbackupserv1) were assembled into the RAID, added to gluster, and a manual (file walk-through) healing
+   is running (see the sketch at the end of this list). Expected to take about 2-3 weeks (at a rate of about 2TB per day). No LVM configured, direct mount.
+ - The application node will be recovered once we replace the system SSDs with larger ones (as there is currently no space for images/containers)
+   and I don't want to put it on the new RAID.
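+ - A minimal sketch of the walk-through heal mentioned above (assuming the affected volume is mounted via the FUSE client at /mnt/<vol>;
+   stat-ing every file through the client mount triggers self-heal of out-of-sync files; <vol> is a placeholder):
+     find /mnt/<vol> -exec stat {} \; > /dev/null 2>&1
+     gluster volume heal <vol> info          # entries still pending heal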
+
+Recovery Logs
+=============
+ 2025.09.28
+ - ipekatrin1:
+    * RAID controller doesn't see 10 disks and behaves erratically.
+    * Turned off the server and ordered a replacement.
+  - Storage:
+    * Restarted the degraded GlusterFS nodes and made them work on the remaining 2 nodes (1 replica + metadata for most of our storage needs); a status-check sketch follows below.
+    * Turned out the 'database' volume was created in RAID-0 mode and was used as the backend for the KDB database. So, the data is gone.
+    * Recovered the KDB database from backups and moved it to a glusterfs/openshift volume. Nothing is left on the 'database' volume; it can be turned off.
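+    * Status-check sketch for the degraded setup (standard gluster CLI, run on one of the remaining storage nodes; the volume name is a placeholder):
+        gluster peer status                    # ipekatrin1 should show as disconnected
+        gluster volume status
+        gluster volume heal <vol> info         # entries pending heal per volume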
+
+ 2025.10.23
+ - ipekatrin1:
+    * Replaced the RAID controller. Made an attempt to rebuild, but the disks get disconnected after about 30-40 minutes (recovered after a power-off, not a reboot).
+    * Checked for power issues: cabling bypassing the PSU and monitoring voltages (the 12V rail should not go below 11.9V). No change, voltages seemed fine.
+    * Checked for cabling issues by disconnecting first one cable and then the other (a supported mode, a single cable connects all disks). No change.
+    * Tried to improve cooling, setting fan speeds to maximum (kept) and even temporarily installing an external cooler. The radiators were cool, and the reported temperatures were also checked. No change, still goes down within 30-40 minutes.
+    * Suspect backplane problems. The radiators were quite hot before adjusting the cooling. There seem to be known stability problems due to bad signal management in the firmware if overheated; firmware updates are suggested to stabilize it.
+    * No support from SuperMicro. Queried Tootlec about the possibility of getting a firmware update and/or ordering a backplane [Order RG_014523_001_Chilingaryan from 16.12.2016, quote (Angebot) 14.10, Contract: 28.11].
+      Hardware: Chassis CSE-846BE2C-R1K28B, Backplane BPN-SAS3-846EL2, 2x MCX353A-FCB ConnectX-3 VPI
+    * KATRINBackupServ1 (3 years older) has a backplane with enough bays to mount the disks. We still need to be able to fit the RAID card and a Mellanox ConnectX-3 board/boards with 2 ports (can live with 1).
+  - ipekatrin2: Noticed and cleared a RAID alarm attributed to the battery subsystem.
+    * No apparent problems at the moment. Temperatures are all in order. The battery reports healthy. The system works as usual.
+    * Set up temperature monitoring of the RAID card, currently 76-77C.
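+      One way to read the controller (ROC) temperature, assuming the LSI storcli utility is installed and the controller index is 0 (MegaCli works similarly):
+        storcli64 /c0 show all | grep -i temperature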
+
+ 2025.10.27
+ - ipekatrin1:
+    * Disconnected all disks from the server and started preparing it as an application node.
+ - Software:
+    * I have temporarily suspended all ADEI cronJobs to avoid resource contention on ipekatrin2 (as a restart would be dangerous now) [clean (logs, etc.) / maintain (re-caching, etc.) / update (detecting new databases)].
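+      A sketch of suspending/resuming a CronJob via its spec.suspend flag (namespace and job name are placeholders):
+        oc -n <adei-namespace> patch cronjob <name> -p '{"spec":{"suspend":true}}'
+        oc -n <adei-namespace> patch cronjob <name> -p '{"spec":{"suspend":false}}'    # resume later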
+ - Research:
+ * DaemonSet/GlusterFS selects nodes based on the following nodeSelector
+ $ oc -n glusterfs get ds glusterfs-storage -o yaml | grep -B 5 -A 5 nodeSelector
+ nodeSelector:
+ glusterfs: storage-host
+      All nodes have corresponding labels in their metadata:
+ $ oc get node/ipekatrin1.ipe.kit.edu --show-labels -o yaml | grep -A 20 labels:
+ labels:
+ ...
+ glusterfs: storage-host
+ ...
+    * That is now removed from ipekatrin1 and should be restored if we bring the storage back:
+ oc label --dry-run node/ipekatrin1.ipe.kit.edu glusterfs-
+    * We further need to remove 192.168.12.1 from 'endpoints/gfs' (per namespace) to avoid possible problems (a sweep sketch is given at the end of this research section).
+    * On ipekatrin1, the glusterfs mounts in /etc/fstab should be changed from 'localhost' to some other server (or commented out altogether),
+      probably just 192.168.12.2 as it is the only host containing the data and going via an intermediary makes no sense:
+ 192.168.12.2,192.168.12.3:<vol> /mnt/vol glusterfs defaults,_netdev 0 0
+    * All RAID volumes should also be temporarily commented out in /etc/fstab and systemd:
+ systemctl list-units --type=mount | grep gluster
+    * Further configuration changes are required to run the node without glusterfs while causing no damage to the rest of the system.
+      GlusterFS might be referenced via: /etc/hosts, /etc/fstab, /etc/systemd/system/*.mount, /etc/auto.*, scripts/cron,
+      endpoints (per namespace), inline gluster volumes in PVs (global),
+      gluster-block endpoints / tcmu gateway list, the sc (heketi storageclass) and controllers (ds, deploy, sts); just in case, check the heketi cm/secrets.
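+      A quick, non-authoritative sweep over the locations listed above (host side and OpenShift side), just as a starting point:
+        grep -rli gluster /etc/hosts /etc/fstab /etc/auto.* /etc/systemd/system /etc/cron* 2>/dev/null
+        oc get endpoints --all-namespaces | grep gfs        # per-namespace 'gfs' endpoints (drop 192.168.12.1 there)
+        oc get pv -o yaml | grep -i -B2 -A2 gluster         # inline gluster volumes in PVs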
+ - Plan:
+ * Prepare application node [double-check before implementing]
+ + Adjust node label
+ + Edit 'gfs' endpoints in all namespaces.
+ + Check glusterblock/heketi, strange pv's.
+      + Check Ands monitoring & maintenance scripts
+      + Adjust /etc/fstab and check systemd-based mounts. Shall we do something with /etc/hosts?
+ + /etc/nfs-ganesha on ipekatrin1 & ipekatrin2
+      + Check/change cron & monitoring scripts
+      + Check for backup scripts; the backup is probably written to the RAID.
+ + Grep in OpenShift configs (and /etc globally) just in case
+      + Google for other possible culprits.
+ + Boot ipekatrin1 and check that all is fine
+ * cronJobs
+      > Set affinity to ipekatrin1 (a patch sketch is given at the end of this plan).
+ > Restart cronJobs (maybe reduce intervals)
+ * copy cluster backups out
+ * ToDo
+      > Ideally, eliminate cronJobs altogether for the rest of the KaaS1 lifetime and replace them with a continuously running cron daemon inside a container.
+      > Rebuild ipekatrinbackupserv1 as a new gluster node (using the disks) and try connecting it to the cluster.
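+      Sketch for the "Set affinity to ipekatrin1" cronJob step above (namespace and job name are placeholders; assumes the node's kubernetes.io/hostname label matches the node name):
+        oc -n <namespace> patch cronjob <name> -p \
+          '{"spec":{"jobTemplate":{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"ipekatrin1.ipe.kit.edu"}}}}}}}'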
+
+ 2025.10.28-31
+ - Hardware
+    * Re-assembled the ipekatrin1 disks in the ipekatrinbackupserv1 backplane using a new LSI 9361-8i RAID controller. The original LSI 9271-8i was removed.
+    * Put the old (SAS2) disks from ipekatrinbackupserv1 into ipekatrin1. Imported the RAID configs; the RAID started and seems to work stably with the SAS2 setup.
+ - Software
+ * Removed glusterfs & fat_storage labels from ipekatrin1.ipe.kit.edu node
+ oc label node/ipekatrin1.ipe.kit.edu glusterfs-
+ oc label node/ipekatrin1.ipe.kit.edu fat_storage-
+    * Identified all endpoints used in PVs; no PV hardcodes IPs directly (and it seems unsupported anyway). A listing sketch is given at the end of this block.
+      Edited endpoints: gfs glusterfs-dynamic-etcd glusterfs-dynamic-metrics-cassandra-1 glusterfs-dynamic-mongodb glusterfs-dynamic-registry-claim glusterfs-dynamic-sharelatex-docker
+    * Verified that no glusterblock devices are used by pods or outside (no iscsi devices). Checked that the heketi storageClass can be safely disabled without affecting existing volumes.
+      Terminated the heketi/glusterblock services and removed the storageclasses.
+    * Checked ands-distributed scripts & crons. Nothing refers to gluster. The monitoring checks the RAID status, but this is probably not critical as it would just report an error (which is true).
+    * Set the nfs-ganesha cluster nodes to andstorage2 only on ipekatrin1/2 (no active server on ipekatrin3). The service is inactive at the moment.
+      Anyway, double-check that it is disabled on ipekatrin1 on the first boot.
+    * Found an active 'block' volume in glusterfs. Checked that it is empty and not used by any active 'pv'. Stopped and deleted it.
+    * The backup is done on /mnt/provision, which should work in the new configuration. So, no changes are needed.
+ * Mount points adjusted.
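+    * Listing sketch for the PV/endpoint check above (the jsonpath fields follow the standard glusterfs PV source; PVs without a gluster source print an empty endpoints column):
+        oc get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.glusterfs.endpoints}{"\n"}{end}'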
+ - First Boot:
+    * Disabled nfs-ganesha on ipekatrin1 on the first boot.
+ * Verified that glusterfs is not started and gluster mounts are healthy
+    * etcd is running and seems healthy:
+ ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
+ curl -v --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt -s https://192.168.13.1:2379/v2/stats/self
+    * origin-master-api and origin-master-controllers are running.
+    * origin-node and docker failed: /var/lib/docker is on the RAID (mounted at /var/lib/docker, but used via an LVM thin pool).
+    * Created /var/lib/docker-local for now and configured docker to use overlay2 in /etc/sysconfig/docker-storage:
+ DOCKER_STORAGE_OPTIONS="--storage-driver=overlay2 --graph=/var/lib/docker-local"
+ * Adjusted selinux contexts
+ semanage fcontext -a -e /var/lib/docker /var/lib/docker-local
+ restorecon -R -v /var/lib/docker-local
+ * Infrastructure pods are running on ipekatrin1
+    * Checked that the status and monitoring scripts are working [seems reasonable to me]
+      > RAID is not optimal and low data space is reported (/mnt/ands is not mounted)
+      > Docker is not reporting available Data/Metadata space (as we are on a local folder)
+    * Check that /var/lib/docker-local space usage is monitored
+ > Via data space usage
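+      A minimal way to watch the space usage mentioned above:
+        df -h /var/lib/docker-local
+        du -sh /var/lib/docker-local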
+ - Problems
+    * We have '*-host' PVs bound to /mnt/hostdisk which are used by adei/mysql (nodes 2&3) and as the katrin temporary data folder. Currently we keep node1 as a master, but disable scheduling:
+ oc adm cordon ipekatrin1.ipe.kit.edu
+ - Backup
+ * Backups from 'provision' volume are taken to 'kaas-manager' VM
+ - Monitor
+ * Usage in /var/lib/docker-local [ space usage ]
+ - ToDo
+    * Try building the storage RAID in ipekatrinbackupserv1 (an SFF-8643 to SFF-8087 cable is needed, RAID-to-backplane). Turn it on, check that the data is accessible, and turn it off.
+    * We shall order a larger SSD for docker (LVM) and KATRIN temporary files (/mnt/hostraid). Once done, uncordon the node:
+ oc adm uncordon ipekatrin1.ipe.kit.edu
+    * We might try building a smaller RAID from the stable disk bays and move the ADEI replica here (discuss!), or a larger one from the SAS2 drives if that proves more stable.
+ * We might be able to use Intel RES2SV240 or LSISAS2x28 expander board to reduce SAS3 to SAS2 speeds...
+
+ 2025.11.01-03
+ - Document attempts to recover storage raid
+ - GlusterFS changes and replication
+