OPENSHIFT API-SERVER TLS handshake error
DEBUG
Cause: Login problems appeared. All api-resources provided by openshift-api are partially unavailable. Because the problem shows up on only 1 of 3 pods, the cluster is partially functional. I log in using the token of one of the openshift-apiserver pods: /run/secrets/kubernetes.io/serviceaccount/token
Gather Information
oc get proxy.config cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Proxy
spec:
  httpProxy: http://10.88.233.244:3128
  httpsProxy: http://10.88.233.244:3128
  noProxy: .cluster.local,.svc,127.0.0.1,172.30.0.0/16,api-int.oaz-dev.azure.sudlice.cz,etcd-0.oaz-dev.azure.sudlice.cz,etcd-1.oaz-dev.azure.sudlice.cz,etcd-2.oaz-dev.azure.sudlice.cz,localhost,10.88.233.192/28,10.88.233.32/27,.oaz-dev.azure.sudlice.cz,10.128.0.0/14
oc get clusteroperator
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.6.0-0.nightly-2020-07-24-111750   True        False         True       5d19h
image-registry        4.6.0-0.nightly-2020-07-24-111750   True        False         True       6d20h
monitoring            4.6.0-0.nightly-2020-07-24-111750   False       True          True       35h
openshift-apiserver   4.6.0-0.nightly-2020-07-24-111750   False       False         False      18h
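To spot the affected operators quickly, the table above can be filtered. This is a minimal sketch, assuming the default column order (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE); the function name unhealthy_operators is made up for this note.

```shell
# Print operators that are not Available or that are Degraded.
# Reads `oc get clusteroperator` output on stdin.
# Assumes the default columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
unhealthy_operators() {
  awk '$1 != "NAME" && ($3 != "True" || $5 != "False") {
         printf "%s available=%s degraded=%s\n", $1, $3, $5
       }'
}
```

Possible usage: oc get clusteroperator | unhealthy_operators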
# involved events
oc get events --all-namespaces -o json|jq -r '.items[]|{obj: .involvedObject.name,namespace: .involvedObject.namespace,message: .message,last: .lastTimestamp}'
oc get events -n openshift-apiserver-operator
# api resources
oc api-resources
error: unable to retrieve the complete list of server APIs:
authorization.openshift.io/v1: the server is currently unable to handle the request,
oauth.openshift.io/v1: the server is currently unable to handle the request,
packages.operators.coreos.com/v1: the server is currently unable to handle the request,
route.openshift.io/v1: the server is currently unable to handle the request,
security.openshift.io/v1: the server is currently unable to handle the request
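The failing API groups can be pulled out of that error text for later comparison. A small sketch, assuming the message wording shown above stays the same; unavailable_api_groups is a hypothetical helper name.

```shell
# Extract the group/version names that the server could not serve.
# Reads the error output of `oc api-resources` on stdin and relies on the
# exact phrase "the server is currently unable to handle the request".
unavailable_api_groups() {
  sed -n 's/^[[:space:]]*\([^[:space:]:]*\): the server is currently unable to handle the request.*$/\1/p'
}
```

Possible usage: oc api-resources 2>&1 | unavailable_api_groups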
# log aggregation
stern -n openshift-apiserver apiserver
apiserver-hv5p5 openshift-apiserver I0524 22:17:43.670943 1 log.go:172] http: TLS handshake error from 10.131.0.1:45118: EOF
apiserver-hv5p5 openshift-apiserver I0524 22:17:47.656841 1 log.go:172] http: TLS handshake error from 10.131.0.1:45152: EOF
apiserver-hv5p5 openshift-apiserver I0524 22:17:57.658147 1 log.go:172] http: TLS handshake error from 10.131.0.1:45240:
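To see whether the handshake errors come from a single source address, the log lines can be counted per client IP. A minimal sketch, assuming the log format shown above; handshake_errors_by_source is a name invented here.

```shell
# Count "TLS handshake error" lines per client IP.
# Reads apiserver log lines on stdin and prints "<ip> <count>" per source.
handshake_errors_by_source() {
  sed -n 's/.*TLS handshake error from \([0-9.]*\):[0-9]*:.*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}
```

Possible usage: stern -n openshift-apiserver apiserver | handshake_errors_by_source (stern follows the logs, so interrupt it or limit it to get the summary).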
# kube-apiserver
stern -n openshift-kube-apiserver kube-apiserver|grep -Eo "E[[:digit:]]{4}.*"
E0805 05:01:25.907233 17 controller.go:114] loading OpenAPI spec for "v1.build.openshift.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'net/http: TLS handshake timeout', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
E0805 05:01:45.937641 17 controller.go:114] loading OpenAPI spec for "v1.image.openshift.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'net/http: TLS handshake timeout', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
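The kube-apiserver errors repeat for every aggregated API, so it helps to reduce them to the list of affected specs. A sketch under the assumption that the message format is the one shown above; failing_openapi_groups is a hypothetical name.

```shell
# List the aggregated APIs whose OpenAPI spec could not be loaded.
# Reads kube-apiserver log lines on stdin; prints each API name once.
failing_openapi_groups() {
  sed -n 's/.*loading OpenAPI spec for "\([^"]*\)" failed.*/\1/p' | sort -u
}
```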
Debug Description
1. tcpdump on api-server pod
oc debug node/<nodename>
# get container pid
chroot /host crictl ps |grep openshift-apiserver
chroot /host crictl inspect 5c831210d2594 |grep '"pid":'
# nsenter runs a program in the namespaces of another process (here: the container's network namespace)
pid=2205119 # pid reported by crictl inspect above
nsenter -n -t $pid -- ip a
nsenter -n -t $pid -- tcpdump -nn -i eth0 "tcp port 8443" -w /host/tmp/tcpdump.pcap
# copy the capture to the local machine
oc get pods # find the debug pod, e.g. oaz-dev-tnhr6-master-2-debug
oc cp oaz-dev-tnhr6-master-2-debug:/host/tmp/tcpdump.pcap tcpdump.pcap
# then open it in Wireshark
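The pid lookup in this step can be scripted instead of copied by hand. A minimal sketch, assuming the JSON printed by crictl inspect contains a line of the form "pid": <number> and that the first such line is the container's pid; container_pid is a hypothetical helper.

```shell
# Print the first numeric "pid" value found in JSON read from stdin.
# Assumption: the first "pid": <number> entry is the container process.
container_pid() {
  sed -n 's/.*"pid":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Possible usage from the debug pod: pid=$(chroot /host crictl inspect <container-id> | container_pid), followed by the nsenter commands with -t $pid.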
Two errors occur in the logs, each with its own meaning in the network stack:
EOF
This means that while the server and the client were performing the TLS handshake, the server saw the connection being closed, i.e. EOF.
i/o timeout
This means that while the server was waiting to read from the client during the TLS handshake, the client didn’t send anything before closing the connection.
2. netstat
oc rsh -n openshift-apiserver apiserver-hv5p5
yum install net-tools
netstat -nputw
netstat -nputwc
- requests arrive from 10.131.0.1 and are in the ESTABLISHED state
- some of them show up in the errors; the problem appears to be timeout-related
3. use ss to list all TCP/IPv4 connections
# from a master node; append so that earlier snapshots are kept
while true;do sleep 2;ss -nt4pe -o state established >>sslog;done
# find the source ports from the error log in sslog
The ss program uses a sock_diag(7) netlink socket to retrieve information about sockets. However, the sock_diag interface does not support a “monitor”/watching/listening mode as rtnetlink(7) does; you can only run queries over a sock_diag socket, which is why the loop above polls.
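Looking up a source port from the handshake errors in the collected ss output can be done with a small filter. This is a sketch only: connections_on_port is a made-up name, and it assumes that addresses appear as ip:port fields in the file.

```shell
# Print the lines of an ss snapshot file in which any IPv4 ip:port field
# ends with the given port.
# usage: connections_on_port <port> <file>
connections_on_port() {
  awk -v suffix=":$1" '
    {
      for (i = 1; i <= NF; i++) {
        n = length($i) - length(suffix)
        if (n > 0 && substr($i, n + 1) == suffix && $i ~ /^[0-9.]+:[0-9]+$/) {
          print
          next
        }
      }
    }' "$2"
}
```

Possible usage: connections_on_port 45118 sslog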
4. force restart api servers
# openshift-apiserver and SDN pods
for i in $(oc get pods -n openshift-apiserver -o name); do oc delete -n openshift-apiserver "$i"; done
for i in $(oc get pods -n openshift-sdn --selector app=sdn -o name); do oc delete -n openshift-sdn "$i"; done
# openshift-kube-apiserver runs as a static pod
# from the master node, restart the containers {kube-apiserver,kube-apiserver-check-endpoints}
crictl ps |grep kube-apiserver
crictl stop <container-id>
crictl start <container-id>
5. delete podnetworkconnectivitycheck
I found errors such as routing to non-existing pods; deleting the checks forces them to be updated.
oc scale --replicas 0 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator
oc scale --replicas 0 -n openshift-apiserver-operator deployments/openshift-apiserver-operator
oc delete -n openshift-kube-apiserver podnetworkconnectivitycheck --all
oc delete -n openshift-apiserver podnetworkconnectivitycheck --all
oc scale --replicas 1 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator
oc scale --replicas 1 -n openshift-apiserver-operator deployments/openshift-apiserver-operator
6. curl API from different locations
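Older curl releases do not support CIDR notation in NO_PROXY; the documentation describes the variable as “A comma-separated list of host names that shouldn’t go through any proxy is set in … NO_PROXY”. CIDR entries from the cluster proxy configuration may therefore be ignored when testing with curl.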
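To see which noProxy entries a client without CIDR support would not honor, the list can be split and filtered. A minimal sketch; no_proxy_cidrs is a hypothetical helper that only recognizes IPv4 CIDR notation.

```shell
# Print the IPv4 CIDR entries contained in a comma-separated NO_PROXY value.
# usage: no_proxy_cidrs "$NO_PROXY"
no_proxy_cidrs() {
  # `|| true` keeps the exit status 0 when the list has no CIDR entries
  printf '%s\n' "$1" | tr ',' '\n' | grep -E '^[0-9]+(\.[0-9]+){3}/[0-9]+$' || true
}
```

For the proxy configuration above this would list 172.30.0.0/16, 10.88.233.192/28, 10.88.233.32/27 and 10.128.0.0/14, which includes the ranges that the pod endpoints tested below belong to.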
# openshift apiserver endpoints
oc get endpoints -n openshift-apiserver
NAME ENDPOINTS AGE
api 10.128.0.40:8443,10.129.0.28:8443,10.130.0.55:8443 11d
#kube apiservers
oc get pods -n openshift-kube-apiserver -o wide|sed -n '1p;/kube-apiserver/p'|awk '{print $1" "$6" "$7}'
NAME IP NODE
kube-apiserver-oaz-dev-tnhr6-master-0 10.88.233.196 oaz-dev-tnhr6-master-0
kube-apiserver-oaz-dev-tnhr6-master-1 10.88.233.198 oaz-dev-tnhr6-master-1
kube-apiserver-oaz-dev-tnhr6-master-2 10.88.233.200 oaz-dev-tnhr6-master-2
oc rsh -n openshift-kube-apiserver kube-apiserver-oaz-dev-tnhr6-master-0
for i in {10.128.0.40:8443,10.129.0.28:8443,10.130.0.55:8443}; do echo -e "https://$i/apis";curl -k https://$i/apis --header "Authorization: Bearer $TOKEN" --connect-timeout 10 ;echo;done
https://10.128.0.40:8443/apis
curl: (28) Operation timed out after 10001 milliseconds with 0 out of 0 bytes received
https://10.129.0.28:8443/apis
{
"kind": "APIGroupList",
"groups": []
}
https://10.130.0.55:8443/apis
{
"kind": "APIGroupList",
"groups": []
}
oc rsh -n openshift-kube-apiserver kube-apiserver-oaz-dev-tnhr6-master-1
for i in {10.128.0.40:8443,10.129.0.28:8443,10.130.0.55:8443}; do echo -e "https://$i/apis";curl -k https://$i/apis --header "Authorization: Bearer $TOKEN" --connect-timeout 10 ;echo;done
https://10.128.0.40:8443/apis
{
"kind": "APIGroupList",
"groups": []
}
https://10.129.0.28:8443/apis
{
"kind": "APIGroupList",
"groups": []
}
https://10.130.0.55:8443/apis
curl: (28) Operation timed out after 10001 milliseconds with 0 out of 0 bytes received
oc rsh -n openshift-kube-apiserver kube-apiserver-oaz-dev-tnhr6-master-2
for i in {10.128.0.40:8443,10.129.0.28:8443,10.130.0.55:8443}; do echo -e "https://$i/apis";curl -k https://$i/apis --header "Authorization: Bearer $TOKEN" --connect-timeout 10 ;echo;done
https://10.128.0.40:8443/apis
{
"kind": "APIGroupList",
"groups": []
}
https://10.129.0.28:8443/apis
{
"kind": "APIGroupList",
"groups": []
}
https://10.130.0.55:8443/apis
curl: (28) Operation timed out after 10001 milliseconds with 0 out of 0 bytes received
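The per-master curl loops above can be condensed into a reachability summary. This is a sketch, not the commands that were actually run: probe_endpoints and the PROBE_CMD override are invented here (the override exists only so the function can be exercised without a cluster), and --max-time 10 is an added assumption on top of the original --connect-timeout 10.

```shell
# Print "<endpoint> OK" or "<endpoint> FAIL" for each ip:port argument.
# Only connectivity and the TLS handshake are checked: curl exits 0 even
# when the API answers with an HTTP error status.
# TOKEN is expected in the environment, as in the loops above.
probe_endpoints() {
  for ep in "$@"; do
    if ${PROBE_CMD:-curl} -sk -o /dev/null --connect-timeout 10 --max-time 10 \
        --header "Authorization: Bearer ${TOKEN:-}" "https://$ep/apis"; then
      echo "$ep OK"
    else
      echo "$ep FAIL"
    fi
  done
}
```

Possible usage inside each kube-apiserver pod: probe_endpoints 10.128.0.40:8443 10.129.0.28:8443 10.130.0.55:8443. In the runs recorded above, master-0 could not reach 10.128.0.40, while master-1 and master-2 could not reach 10.130.0.55.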
oc scale --replicas 1 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator
oc scale --replicas 1 -n openshift-apiserver-operator deployments/openshift-apiserver-operator
oc delete -n openshift-kube-apiserver podnetworkconnectivitycheck --all