Diffstat (limited to 'log.txt')
| -rw-r--r-- | log.txt | 98 |
1 file changed, 33 insertions, 65 deletions
@@ -1,81 +1,49 @@
-System
--------
-2025.09.28
- - ipekatrin1:
-   * Raid controller doesn't see 10 disks and behaves erratically.
-   * Turned off the server and ordered a replacement.
- - Storage:
-   * Restarted the degraded GlusterFS nodes and made them work on the remaining 2 nodes (1 replica + metadata for most of our storage needs).
-   * Turned out the 'database' volume was created in Raid-0 mode and used as the backend for the KDB database. So, data is gone.
-   * Recovered the KDB database from backups and moved it to a glusterfs/openshift volume. Nothing is left on the 'database' volume. Can be turned off.
-2025.10.27
- - ipekatrin1:
-   * Disconnected all disks from the server and started preparing it as an application node
- - Software:
-   * I have temporarily suspended all ADEI cronJobs to avoid resource contention on ipekatrin2 (as a restart would be dangerous now) [clean (logs, etc.) / maintain (re-caching, etc.) / update (detecting new databases)]
- - Research:
-   * DaemonSet/GlusterFS selects nodes based on the following nodeSelector
-      $ oc -n glusterfs get ds glusterfs-storage -o yaml | grep -B 5 -A 5 nodeSelector
-        nodeSelector:
-          glusterfs: storage-host
-     All nodes have the corresponding labels in their metadata:
-      $ oc get node/ipekatrin1.ipe.kit.edu --show-labels -o yaml | grep -A 20 labels:
-        labels:
-          ...
-          glusterfs: storage-host
-          ...
-   * That label is removed from ipekatrin1 now and should be restored if we bring the storage back
-      oc label --dry-run node/ipekatrin1.ipe.kit.edu glusterfs-
-   * We further need to remove 192.168.12.1 from 'endpoints/gfs' (per namespace) to avoid possible problems (see the sketch after this section).
-   * On ipekatrin1, the glusterfs mounts in /etc/fstab should be changed from 'localhost' to some other server (or commented out altogether), e.g.
-      192.168.12.2,192.168.12.3:<vol> /mnt/vol glusterfs defaults,_netdev 0 0
-   * All raid volumes should also be temporarily commented out in /etc/fstab
-   * Further configuration changes are required to run the node without glusterfs while causing no damage to the rest of the system.
-     GlusterFS might be referenced via: /etc/hosts, /etc/fstab, /etc/systemd/system/*.mount, /etc/auto.*, scripts/cron,
-     endpoints (per namespace), inline gluster volumes in PVs (global),
-     gluster-block endpoints / tcmu gateway list, sc (heketi storageclass) and controllers (ds, deploy, sts); just in case check heketi cm/secrets
- - Plan:
-   * Prepare the application node [double-check before implementing]
-     > Adjust /etc/fstab and check systemd-based mounts. Shall we do something with hosts?
-     > Check/change cron & monitoring scripts
-     > Adjust the node label and edit the 'gfs' endpoints in all namespaces.
-     > Check glusterblock/heketi, strange pvs.
-     > Google the other possible culprits listed above.
-     > Boot ipekatrin1 and check that all is fine
-   * cronJobs
-     > Set affinity to ipekatrin1.
-     > Restart cronJobs (maybe reduce intervals)
-   * ToDo
-     > Ideally, eliminate cronJobs altogether for the rest of KaaS1's lifetime and replace them with a continuously running cron daemon inside the container
-     > Rebuild ipekatrinbackupserv1 as a new gluster node (using the disks) and try connecting it to the cluster
-
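A minimal sketch of the label/endpoints cleanup described above. It assumes the per-namespace endpoints object is really named 'gfs' (as stated in the log) and that jq is available; only the label command appears in the log, the jq pipeline is illustrative:

    # Remove the storage-host label so the glusterfs-storage DaemonSet stops scheduling on this node
    oc label node/ipekatrin1.ipe.kit.edu glusterfs-

    # Drop 192.168.12.1 from the 'gfs' endpoints in every namespace that defines them
    for ns in $(oc get namespaces -o name | cut -d/ -f2); do
      oc -n "$ns" get endpoints gfs -o json 2>/dev/null \
        | jq '.subsets[].addresses |= map(select(.ip != "192.168.12.1"))' \
        | oc -n "$ns" replace -f - || true
    done
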
 Hardware
 --------
 2024
  - ipekatrin1: Replaced disk in section 9. LSI software reports all is OK, but the hardware LED indicates an error (red). Probably the indicator is broken.
 2025.09 (early month)
- - ipekatrin1: Replaced 3 disks (don't remember the slots). Two of them had already been replaced once.
+ - ipekatrin2: Replaced 3 disks (don't remember the slots). Two of them had already been replaced once.
  - Ordered spare disks
-2025.10.23
- - ipekatrin1:
-   * Replaced the RAID controller. Made an attempt to rebuild, but disks get disconnected after about 30-40 minutes (recovered after a shutoff, not a reboot)
-   * Checked power issues: cabling bypassing the PSU and monitoring voltages (the 12V system should not go below 11.9V). No change, voltages seemed fine.
-   * Checked cabling issues by disconnecting first one cable and then the other (supported mode, a single cable connects all disks). No change
-   * Tried to improve cooling, setting fan speeds to maximum (kept) and even temporarily installing an external cooler. Radiators were cool, also checked the reported temperatures. No change, still goes down in 30-40 minutes.
-   * Suspect backplane problems. The radiators were quite hot before adjusting the cooling. There seem to be known stability problems due to bad signal management in the firmware if overheated. Firmware updates are suggested to stabilize it.
-   * No support by SuperMicro. Queried Tootlec about the possibility of getting a firmware update and/or ordering a backplane [Order RG_014523_001_Chilingaryan from 16.12.2016, Angebot 14.10, Contract: 28.11]
-     Hardware: Chassis CSE-846BE2C-R1K28B, Backplane BPN-SAS3-846EL2, 2x MCX353A-FCB ConnectX-3 VPI
-   * KATRINBackupServ1 (3 years older) has a backplane with enough bays to mount the disks. We still need to be able to put in the Raid card and Mellanox ConnectX-3 board/boards with 2 ports (can live with 1).
  - ipekatrin2: Noticed and cleared a RAID alarm attributed to the battery subsystem.
    * No apparent problems at the moment. Temperatures are all in order. Battery reports healthy. System works as usual.
-   * Set up temperature monitoring of the RAID card, currently 76-77C
-
+
+2025.09.28 - 2025.11.03
+ - ipekatrin1: Raid controller failed. The system was not running stably after the replacement (disks disconnect after 20-30 minutes of operation)
+ - ipekatrin1: Temporarily converted into a master-only node (apps scheduling disabled, glusterfs stopped)
+ - ipekatrin1: New disks (from ipekatrinbackupserv1) were assembled in the RAID, assembled in gluster, and a manual (file walk-through) healing
+   is running. Expected to take about 2-3 weeks (at a rate of about 2TB per day). No LVM configured, direct mount.
+ - The application node will be recovered once we replace the system SSDs with larger ones (as there is currently no space for images/containers)
+   and I don't want to put it on the new RAID.
+ - The original disks from ipekatrin1 are assembled in ipekatrinbackupserv1. The disconnect problem persists as some disks stop answering
+   SENSE queries and the backplane restarts a whole bunch of 10 disks. Anyway, all disks are accessible in JBOD mode and can be copied (see the sketch after this section).
+   * The XFS filesystem is severely damaged and needs repairs. I tried accessing some files via the xfs debugger and it worked. So, the directory structure
+     and file content are, at least partially, intact and a repair should be possible.
+   * If recovery becomes necessary: buy 24 new disks, copy one-by-one, assemble in the RAID, recover the FS.
+
+2025.12.08
+ - Copied the ipekatrin1 system SSDs to new 4TB drives and reinstalled them in the server (only 2TB is used due to MBR limitations)
+
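A possible shape of the copy-and-repair path above (device names are placeholders, not from the log; GNU ddrescue is one option for the one-by-one copy):

    # Clone each JBOD disk onto a fresh disk before touching the filesystem
    ddrescue -f /dev/sdX /dev/sdY sdX.map

    # Once the RAID is reassembled from the copies: dry-run first, reporting
    # what xfs_repair would change without writing anything
    xfs_repair -n /dev/sdR1

    # Read-only inspection of directories/inodes via the debugger, as tried in the log
    xfs_db -r /dev/sdR1

    # Actual repair only after the dry run looks sane
    xfs_repair /dev/sdR1
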
 Software
 --------
 2023.06.13
  - Instructed MySQL slave to ignore 1062 errors as well (I had skipped a few manually, but the errors appeared non-stop)
  - Also the ADEI-KATRIN pod got stuck. The pod was running, but apache was stuck and not replying. This caused the POD state to be reported as 'not-ready', but for some reason it was still 'live' and the pod was not restarted.
+
+2025.09.28
+ - Restarted the degraded GlusterFS nodes and made them work on the remaining 2 nodes (1 replica + metadata for most of our storage needs).
+ - Turned out the 'database' volume was created in Raid-0 mode and used as the backend for the KDB database. So, data is gone.
+ - Recovered the KDB database from backups and moved it to a glusterfs/openshift volume. Nothing is left on the 'database' volume. Can be turned off.
+
+2025.09.28 - 2025.11.03
+ - GlusterFS endpoints temporarily changed to use only ipekatrin2 (see details in the dedicated logs)
+ - Heketi and gluster-blockd were disabled and will not be available any more. Existing heketi volumes are preserved.
+
+2025.12.09
+ - Re-enabled scheduling on ipekatrin1 (see the sketch after this section).
+ - Manually ran 'adei-clean' on katrin & darwin, but keeping the 'cron' scripts stopped for now.
+ - Restored configs: fstab restored, */gfs endpoints. Heketi/gluster-block stay disabled. No other system changes.
+ - ToDo: Re-enable the 'cron' scripts if we decide to keep the system running in parallel with KaaS2.
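
A minimal sketch of the 2025.12.09 re-enabling steps (assuming an OpenShift 3.x-era cluster, as suggested by the heketi/gluster-block tooling above; the volume name is a placeholder):

    # Allow pod scheduling on ipekatrin1 again
    oc adm manage-node ipekatrin1.ipe.kit.edu --schedulable=true

    # After restoring /etc/fstab, remount the gluster volumes and verify they are seen
    mount -a
    df -h | grep gluster

    # Check that self-heal has caught up before relying on the restored node
    gluster volume heal <vol> info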
