Authentication operator degraded

DEBUG

2020-07-06

Logging in through the console does not work. The Authentication operator is in the Degraded state.

Authentication and Authentication Operator

# cluster operator
oc get co 
# operator logs 
set ns openshift-authentication-operator;oc logs -n "$ns" (oc get pods -n "$ns" -o name)
# warnings and errors only
set ns openshift-authentication-operator;oc logs -n "$ns" (oc get pods -n "$ns" -o name)|grep -E "[WE][[:digit:]]{4}"

  OAuthClientsDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io openshift-browser-client) 
  to RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)

  OAuthClientsDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io openshift-browser-client)
  failed handling the route: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)
  lookup oauth-openshift.apps.toshi44.sudlice.org on 172.30.0.10:53: read udp 10.130.0.215:56716->172.30.0.10:53: i/o timeout
  failed with: the server is currently unable to handle the request (post oauthclients.oauth.openshift.io)
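
The severity filter above keys on the klog header: a severity letter followed by the month and day. Below is a minimal Python sketch of a stricter filter, assuming the klog line layout seen in these logs (for example `I0707 11:13:24.539873`). The names `klog_severity` and `warnings_and_errors` are illustrative, not part of any tool.

```python
import re

# klog header: severity letter, MMDD, a space, then HH:MM:SS.microseconds
KLOG_HEADER = re.compile(r"\b([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)")


def klog_severity(line):
    """Return the klog severity letter found in a log line, or None."""
    match = KLOG_HEADER.search(line)
    return match.group(1) if match else None


def warnings_and_errors(lines):
    """Keep only lines whose klog severity is W (warning), E (error) or F (fatal)."""
    return [line for line in lines if klog_severity(line) in ("W", "E", "F")]
```

Requiring the timestamp after the date makes accidental matches on pod-name hashes or message text less likely than with the bare four-digit pattern.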

# curl against the oauth endpoint times out on every n-th attempt
curl -k https://oauth-openshift.apps.oaz-dev.azure.sudlice.cz/oauth/token/display
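
To put a number on "every n-th attempt", the request can be repeated and the failures counted. This is a rough sketch and not a proper load test. `http_probe` and `failure_rate` are hypothetical helper names. Certificate verification is disabled to mirror `curl -k`, and a 5xx answer is counted as a failure, which is an assumption about what should count as "unavailable".

```python
import ssl
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the endpoint answers with a non-5xx status within `timeout` seconds."""
    context = ssl.create_default_context()
    context.check_hostname = False       # same effect as curl -k
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context):
            return True
    except urllib.error.HTTPError as err:
        return err.code < 500            # 4xx still proves the server is reachable
    except (urllib.error.URLError, OSError):
        return False                     # timeout, DNS failure, TLS failure, refused


def failure_rate(probe, attempts):
    """Call `probe()` `attempts` times and return (failures, attempts)."""
    failures = sum(1 for _ in range(attempts) if not probe())
    return failures, attempts
```

Calling `failure_rate(lambda: http_probe(url), 30)` with the URL from the curl command above returns how many of 30 attempts failed.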

oc get route -A|grep oauth
  Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)

stern -n openshift-authentication oauth|grep -E "E[[:digit:]]{4}"
# try deleting all oauth-openshift pods
for i in (oc get pods -o name --selector app=oauth-openshift -n openshift-authentication);
  oc delete $i -n openshift-authentication;
  end

ETCD

etcd itself seems to be quite happy:

stern -n openshift-etcd-operator etcd|grep -E "[WE][[:digit:]]{4}"

but openshift-etcd-operator (D: etcd-operator) keeps complaining:
unhealthy members: toshi44-l9tcd-master-1,toshi44-l9tcd-master-0,toshi44-l9tcd-master-2
The kube API has high latency, which may well be related to the unhealthy etcd.

etcdctl member list -w table
  24dd89393f91e72e, started, toshi44-l9tcd-master-1, https://10.4.0.4:2380, https://10.4.0.4:2379
  3672e44507206aee, started, toshi44-l9tcd-master-0, https://10.4.0.6:2380, https://10.4.0.6:2379
  d9eaeebdf47b1d9a, started, toshi44-l9tcd-master-2, https://10.4.0.7:2380, https://10.4.0.7:2379

etcdctl endpoint health --cluster
  https://10.4.0.6:2379 is healthy: successfully committed proposal: took = 10.984237ms
  https://10.4.0.7:2379 is healthy: successfully committed proposal: took = 17.462818ms
  https://10.4.0.4:2379 is healthy: successfully committed proposal: took = 22.100276ms

# get events
oc get events --all-namespaces -o json|jq -r '.items[]|{obj: .involvedObject.name,namespace: .involvedObject.namespace,message: .message,last: .lastTimestamp}'|jq -r 'select (.namespace |contains("etcd"))'

I have not been able to figure out what etcd is unhappy about.

get EVENTS

oc get events --all-namespaces -o json \
|jq -r '.items[]|{obj: .involvedObject.name,namespace: .involvedObject.namespace,message: .message,last: .lastTimestamp}'\
|jq -r 'select (.namespace |contains("sudlice")|not)'\
|jq -r 'select (.namespace | contains("authentication"))'

{
  "obj": "authentication-operator",
  "namespace": "openshift-authentication-operator",
  "message": "Status for clusteroperator/authentication changed: Degraded message changed from \"\" to \"WellKnownEndpointDegraded: failed to GET well-known https://10.4.0.4:6443/.well-known/oauth-authorization-server: net/http: TLS handshake timeout\"",
  "last": "2020-07-07T08:46:38Z"
}
{
  "obj": "authentication-operator",
  "namespace": "openshift-authentication-operator",
  "message": "Status for clusteroperator/authentication changed: Degraded message changed from \"WellKnownEndpointDegraded: failed to GET well-known https://10.4.0.4:6443/.well-known/oauth-authorization-server: net/http: TLS handshake timeout\" to \"\"",
  "last": "2020-07-07T08:46:41Z"
}
{
  "obj": "authentication-operator",
  "namespace": "openshift-authentication-operator",
  "message": "Status for clusteroperator/authentication changed: Degraded message changed from \"\" to \"RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.toshi44.sudlice.org on 172.30.0.10:53: read udp 10.130.0.215:55146->172.30.0.10:53: i/o timeout\"",
  "last": "2020-07-07T06:00:35Z"
}
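
The events above do not come out in chronological order, which makes the flapping harder to follow. Below is a minimal Python sketch of the same filter that also sorts by `lastTimestamp`, assuming the JSON shape produced by `oc get events --all-namespaces -o json`. The function name `summarize_events` is illustrative.

```python
import json


def summarize_events(events_json, namespace_part):
    """Filter `oc get events -A -o json` output by namespace substring.

    Returns the same four fields as the jq pipeline, sorted by lastTimestamp.
    """
    items = json.loads(events_json).get("items", [])
    rows = []
    for item in items:
        involved = item.get("involvedObject", {})
        namespace = involved.get("namespace") or ""  # cluster-scoped objects have none
        if namespace_part not in namespace:
            continue
        rows.append({
            "obj": involved.get("name"),
            "namespace": namespace,
            "message": item.get("message"),
            "last": item.get("lastTimestamp"),
        })
    # UTC timestamps in this format sort correctly as plain strings;
    # events without a lastTimestamp are placed first.
    rows.sort(key=lambda row: row["last"] or "")
    return rows
```

The output of the `oc` command can be saved to a file, read into a string, and passed in as `events_json` with `"authentication"` as `namespace_part`.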

Openshift API server and API resources

Fails on roughly one run in three:

oc api-resources

  error: unable to retrieve the complete list of server apis: project.openshift.io/v1: the server is currently unable to handle the request,  
  route.openshift.io/v1: the server is currently unable to handle the request,  
  security.openshift.io/v1: the server is currently unable to handle the request,  
  template.openshift.io/v1: the server is currently unable to handle the request,  
  user.openshift.io/v1: the server is currently unable to handle the request

stern -n openshift-apiserver apiserver
 # common error
  http: TLS handshake error from 10.129.0.1:4963

DNS

The problem was partially resolved by restarting DNS, but that can hardly be called a complete solution. I also doubt that the problem is in DNS itself (although the logs do show some resolution problems).

set ns openshift-dns; for i in (oc get pods -n "$ns" -o name ); oc delete -n "$ns" $i; end

PackageServer

The package server has problems and causes occasional API unavailability.
This concerns operator-lifecycle-manager-packageserver.

packageserver-5df56b8c8-xz257 packageserver I0707 11:13:24.539873       1 log.go:172] http: TLS handshake error from 10.129.0.1:50886: remote error: tls: bad certificate

Let's try recycling the pods:

set ns openshift-operator-lifecycle-manager; set sel "app=packageserver"; for i in (oc get pods -n "$ns" --selector $sel -o name ); oc delete -n "$ns" $i; end

ASSESSMENT

It seems to me that the whole problem comes down to partial unavailability of the OpenShift API server, where one in three requests times out. The question is what causes it.

stern -n openshift-apiserver apiserver
apiserver-6756b4f77b-fq2sp openshift-apiserver I0707 12:35:04.255724       1 log.go:172] http: TLS handshake error from 10.129.0.1:48138: EOF
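
Both the packageserver and the openshift-apiserver logs show handshake errors coming from 10.129.0.1. Tallying them by source address and reason may help show whether one client is responsible. This is a minimal Python sketch that assumes the log line layout quoted in these notes; `tally_handshake_errors` is a hypothetical name.

```python
import re
from collections import Counter

# e.g. "http: TLS handshake error from 10.129.0.1:48138: EOF"
HANDSHAKE = re.compile(r"TLS handshake error from (\S+?):(\d+): (.+)$")


def tally_handshake_errors(lines):
    """Count TLS handshake errors per (source host, reason); the source port is ignored."""
    counts = Counter()
    for line in lines:
        match = HANDSHAKE.search(line)
        if match:
            host, _port, reason = match.groups()
            counts[(host, reason.strip())] += 1
    return counts
```

Feeding it the saved output of the `stern` commands gives one counter entry per source and reason, such as `("10.129.0.1", "EOF")`.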