author Suren A. Chilingaryan <csa@suren.me> 2019-10-06 05:00:55 +0200
committer Suren A. Chilingaryan <csa@suren.me> 2019-10-06 05:00:55 +0200
commit ba144fab071258a97cf3c42a0defeb0aae41a353
tree 2e738d4e4774d754b56d79021cc8781b3c0835a5 /docs/troubleshooting.txt
parent efe4b9bbe3c9cb950378de9697eed2030ac49ca2
Document latest problems with docker images and resource reclamation, add docker performance checks to the monitoring scripts, and add helpers to filter the logs
Diffstat (limited to 'docs/troubleshooting.txt')
-rw-r--r-- docs/troubleshooting.txt 103
1 files changed, 79 insertions, 24 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index ea987b5..2290901 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -134,9 +134,53 @@ etcd (and general operability)
pods (failed pods, rogue namespaces, etc...)
====
- - The 'pods' scheduling may fail on one (or more) of the nodes after long waiting with 'oc logs' reporting
- timeout. The 'oc describe' reports 'failed to create pod sandbox'. This can be caused by failure to clean-up
- after terminated pod properly. It causes rogue network interfaces to remain in OpenVSwitch fabric.
+ - OpenShift has numerous problems cleaning up resources after pods terminate. The problems are more likely to occur on
+ heavily loaded systems (high CPU, I/O, or interrupt load).
+ * This may be indicated in the logs by various errors reporting an inability to stop containers/processes or to free network
+ and storage resources. A few examples (the list is not complete):
+ dockerd-current: time="2019-09-30T18:46:12.298297013Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 00a456097fcf8d70a0461f05813e5a1f547446dd10b3b43ebc1f0bb09e841d1b: rpc error: code = 2 desc = no such process"
+ origin-node: W0930 18:46:11.286634 2497 util.go:87] Warning: Unmount skipped because path does not exist: /var/lib/origin/openshift.local.volumes/pods/6aecbed1-e3b2-11e9-bbd6-0cc47adef0e6/volumes/kubernetes.io~glusterfs/adei-tmp
+ Error syncing pod 1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6 ("adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)"), skipping: failed to "CreatePodSandbox" for "adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)" with CreatePodSandboxError: "CreatePodSandbox for pod \"adei-smartgrid-maintain-1569790800-pcmdp_adei(1ed138cd-e2fc-11e9-bbd6-0cc47adef0e6)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"adei-smartgrid-maintain-1569790800-pcmdp_adei\" network: CNI request failed with status 400: 'failed to Statfs \"/proc/28826/ns/net\": no such file or directory\n'"
+ * A more severe form is when PLEG (Pod Lifecycle Event Generator) errors are reported:
+ origin-node: I0925 07:52:00.422291 93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.448988393s ago; threshold is 3m0s]
+ This indicates severe delays in communication with the docker daemon (can be checked with 'docker info') and may result in the node
+ being temporarily marked NotReady, causing pod eviction. As pod eviction puts extensive load on the other nodes (which may also be
+ affected by the same problem), the initial single-node issue may render the whole cluster unusable.
+ * With mass evictions, things can get even worse and cause faults in etcd communication. This is reported like:
+ etcd: lost the TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 reader)
+ * Apart from overloaded nodes (max cpu%, io, interrupts), PLEG issues can be caused by
+ 1. An excessive number of resident docker images on the node (see below)
+ 2. Spurious interfaces on OpenVSwitch, which PLEG issues can themselves cause and which amplify them further (see below)
+ x. Nuanced issues between kubelet, docker, logging, networking and so on, with remediation of the issue sometimes being brutal (restarting all nodes etc, depending on the case).
+ https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-496818225
+ * The problem is not bound to CronJobs, but having regularly scheduled jobs makes its presence significantly more visible.
+ Furthermore, CronJobs, especially those scheduling fat containers like ADEI, add significantly to the I/O load on the system
+ and may trigger the more severe form.
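
The PLEG warnings quoted above can be pulled out of a node log with a small filter. This is only a minimal sketch: the function name 'pleg_lag' is hypothetical, and the usage line assumes the node log reaches the journal under the unit name 'origin-node' (inferred from the log prefix in the examples, so adjust it if the service is named differently).

```shell
# Hypothetical helper: read log lines on stdin and print the
# "last seen active" lag of every PLEG health warning found.
pleg_lag() {
    sed -n 's/.*PLEG is not healthy: pleg was last seen active \([^ ]*\) ago.*/\1/p'
}

# Assumed usage on a node (unit name is an assumption):
#   journalctl -u origin-node --since "1 hour ago" | pleg_lag
```

No output means that no PLEG warning was logged in the inspected period.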
+
+ - After a while, 'pods' scheduling may get more and more sluggish, either in general or only for pods assigned to a specific node.
+ * Docker images accumulate on the nodes over time. Beyond a threshold, they start adding latency to the
+ operation of the docker daemon, slow down pod scheduling (on the affected nodes), and may cause other severe side effects.
+ The problems start appearing at around 500-1000 images accumulated on a specific node. With 2000-3000, the node gets
+ almost unusable (3-5 minutes to start a pod). So, eventually the unused images should be cleaned with
+ oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
+ or alternatively per-node:
+ docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
+ * Some images could be orphaned by the OpenShift infrastructure (there has not been a large number of orphaned images on KaaS yet).
+ OpenShift supports 'hard' pruning to handle such images.
+ https://docs.openshift.com/container-platform/3.7/admin_guide/pruning_resources.html
+ * Even afterwards, a significant number of images may stay resident. There are two inter-related problems:
+ 1. Docker infrastructure relies on intermediate images. Consequently, very long Dockerfiles will create a LOT of images.
+ 2. OpenShift keeps a history of 'rc' which may reference several versions of old docker images. These will not be cleaned by the
+ described approach. Furthermore, stopped containers lost by the OpenShift infrastructure (see above) also prevent clean-up of
+ the images.
+ Currently, a dozen KDB pods produce about 200-300 images. In some cases, optimization of Dockerfiles and, afterwards, a thorough
+ cleanup of old images may become a necessity. The intermediate images can be found with 'docker images -a' (all images with
+ <none> as repository and tag), but there is no easy way to find the pod populating them. One option, although not a very convenient one, is the following
+ project (press F5 on startup): https://github.com/TomasTomecek/sen
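
The image counts quoted above can be turned into a rough per-node check. This is a sketch only: the function name 'image_load_level' and the exact cut-offs (500 and 2000, taken from the lower ends of the ranges mentioned in the text) are assumptions and not part of the monitoring scripts.

```shell
# Hypothetical helper: classify an image count against the rough
# thresholds from the text (assumed cut-offs: 500 and 2000).
image_load_level() {
    count=$1
    if [ "$count" -ge 2000 ]; then
        echo severe
    elif [ "$count" -ge 500 ]; then
        echo warning
    else
        echo ok
    fi
}

# Assumed usage on a node (counts intermediate images as well):
#   image_load_level "$(docker images -a -q | sort -u | wc -l)"
```

The count includes intermediate images on purpose, since they are the ones that accumulate with long Dockerfiles.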
+
+ - In a more severe form, 'pods' scheduling may fail altogether on one (or more) of the nodes. After a long wait,
+ 'oc logs' will report a timeout. 'oc describe' reports 'failed to create pod sandbox'. This can be caused by a failure
+ to clean up properly after a terminated pod. It causes rogue network interfaces to remain in the OpenVSwitch fabric.
* This can be determined by errors reported using 'ovs-vsctl show' or present in the log '/var/log/openvswitch/ovs-vswitchd.log'
which may quickly grow beyond 100MB.
could not open network device vethb9de241f (No such device)
@@ -149,7 +193,7 @@ pods (failed pods, rogue namespaces, etc...)
* The issue is discussed here:
https://bugzilla.redhat.com/show_bug.cgi?id=1518684
https://bugzilla.redhat.com/show_bug.cgi?id=1518912
-
+
- After crashes / upgrades some pods may end up in 'Error' state. This quite often happens to
* kube-service-catalog/controller-manager
* openshift-template-service-broker/api-server
@@ -180,26 +224,24 @@ pods (failed pods, rogue namespaces, etc...)
* OpenShift upgrade, the namespaces are gone (but there could be a bunch of new problems).
* ... i don't know if install, etc. May cause the trouble...
- - There is also rogue pods (mainly due to some problems with unmounting lost storage), etc. If 'oc delete' does not
- work for a long time. It worth
- * Determining the host running failed pod with 'oc get pods -o wide'
- * Going to the pod and killing processes and stopping the container using docker command
- * Looking in the '/var/lib/origin/openshift.local.volumes/pods' for the remnants of the container
- - This can be done with 'find . -name heketi*' or something like...
- - There could be problematic mounts which can be freed with lazy umount
- - The folders for removed pods may (and should) be removed.
-
- - Looking into the '/var/log/messages', it is sometimes possible to spot various erros like
- * Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
- The volumes can be removed in '/var/lib/origin/openshift.local.volumes/pods' on the corresponding node
- * PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
- - We can find and remove the corresponding container (the short id is just first letters of the long id)
- docker ps -a | grep aa28e9c76
- docker rm <id>
- - We further can just destroy all containers which are not running (it will actually try to remove all,
- but just error message will be printed for running ones)
- docker ps -aq --no-trunc | xargs docker rm
+ - There are also rogue pods (mainly due to problems with unmounting lost storage) which remain in the "Deleting" state, etc.
+ There are two possible situations:
+ * The containers are actually already terminated, but OpenShift is not aware of it for some reason.
+ * The containers are actually still running, but OpenShift is not able to terminate them for some reason.
+ It is relatively easy to find out which is the case by:
+ * Finding the host running the failed pod with 'oc get pods -o wide'
+ * Checking if associated containers are still running on the host with 'docker ps'
+ The first case is relatively easy to handle: one can simply enforce pod removal with
+ oc delete --grace-period=0 --force
+ In the second case we need
+ * To actually stop the containers before proceeding (enforcing will just leave them running forever). This can
+ be done directly using 'docker' commands.
+ * It may also be worth trying to clean up the associated resources. Check the 'maintenance' documentation for details.
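
Telling the two cases apart on the node can be sketched as a filter over the container names. The function name 'pod_containers' is hypothetical, and the sketch assumes the usual kubelet naming of docker containers, 'k8s_<container>_<pod>_<namespace>_<uid>_<attempt>'; verify this against the 'docker ps' output on the node before relying on it.

```shell
# Hypothetical helper: read docker container names on stdin and print
# those which belong to the given pod, assuming names of the form
# k8s_<container>_<pod>_<namespace>_<uid>_<attempt>.
pod_containers() {
    pod=$1
    awk -F_ -v pod="$pod" '$1 == "k8s" && $3 == pod'
}

# Assumed usage on the node hosting the pod:
#   docker ps --format '{{.Names}}' | pod_containers <pod-name>
```

Empty output corresponds to the first case (nothing is running any more). Any printed name is a container that still has to be stopped before the pod removal is enforced.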
+ - Permission problems will arise if a non-KaaS namespace (using a high-range supplemental group for GlusterFS mounts) is converted
+ to KaaS (gid ranges within 1000 - 10,000 at the moment). The allowed gids should be configured in the namespace specification
+ and the pods should be allowed to access the files. Possible errors:
+ unable to create pods: pods "mongodb-2-" is forbidden: no providers available to validate pod request
@@ -219,6 +261,14 @@ Storage
Particularly there is a big problem for ansible-ran virtual machines. The system disk is stored
under '/root/VirtualBox VMs' and is not cleaned/destroyed unlike second hard drive on 'vagrant
destroy'. So, it should be cleaned manually.
+
+ - Too many parallel mounts (above 500 per node) may cause systemd slow-down/crashes. It is indicated by
+ the following messages in the log:
+ E0926 09:29:50.744454 93115 mount_linux.go:172] Mount failed: exit status 1
+ Output: Failed to start transient scope unit: Connection timed out
+ * The solution is unclear. There are some suggestions to use 'setsid' in place of 'systemd-run' for mounting,
+ but it is not clear how to do this. Discussion: https://github.com/kubernetes/kubernetes/issues/79194
+ * Can we do some rate-limiting?
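
To see how close a node is to the mount limit mentioned above, and which filesystem type contributes most, the mount table can be summarized. This is a sketch under the assumption that the input has the '/proc/mounts' layout (device, mount point, type, options, ...); the function name 'mounts_by_type' is hypothetical.

```shell
# Hypothetical helper: read a mount table in /proc/mounts format on
# stdin and print "<count> <fstype>" lines, most frequent type first.
mounts_by_type() {
    awk '{ n[$3]++ } END { for (t in n) print n[t], t }' | sort -rn
}

# Assumed usage on a node:
#   mounts_by_type < /proc/mounts
#   wc -l < /proc/mounts      # total, to compare with the ~500 figure
```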
- Problems with pvc's can be evaluated by running
oc -n openshift-ansible-service-broker describe pvc etcd
@@ -271,7 +321,12 @@ MySQL
The remedy is to restart slave MySQL with 'slave_parallel_workers=0', give it a time to go, and then
restart back in the standard multithreading mode.
-
+Administration
+==============
+ - Some management tasks may require logging in on the ipekatrin* nodes. Thereafter, password-less execution of
+ 'oc' may fail on master nodes, complaining about an invalid authentication token. To fix it, check
+ /root/.kube/config and remove the references to logged-in users, keeping only 'system:admin/kaas-kit-edu:8443'. Also check
+ the listed contexts and current-context.
Performance
===========