Aggregated logging

Building the logging stack

*file: 01-intro-loggingArch.md *

OCP Logging in general

The cluster logging components are based upon Elasticsearch, Fluentd and Kibana.

  • logStore: This is where the logs will be stored. The current implementation is Elasticsearch.
  • collection: This is the component that collects logs from the node, formats them, and stores them in the logStore. The current implementation is Fluentd.
  • visualization: This is the UI component used to view logs, graphs, charts, and so forth. The current implementation is Kibana.
  • curation: This is the component that trims logs by age. The current implementation is Curator.
  • event routing: This is the component that forwards events to cluster logging. The current implementation is Event Router. The Event Router communicates with the OpenShift Container Platform and prints OpenShift Container Platform events to the log of the pod where the event occurs.

The collector, Fluentd, is deployed to each node in the OpenShift Container Platform cluster. It collects all node and container logs and writes them to Elasticsearch (ES); Kibana is used to visualize them.
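Once the stack described below is installed, these components map directly to workloads in the openshift-logging namespace; a quick way to see them (assuming a default install):

 # Elasticsearch and Kibana run as deployments, Fluentd as a daemonset, Curator as a cronjob
oc get deploy,ds,cronjob,pods -n openshift-logging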

*file: 01-provision-resources.md *

Provision resources for logging stack on OCP

The logging stack will be installed as an operator. For Elasticsearch we will use a dedicated node with an appropriate taint, so that only pods carrying the matching toleration (part of the logging stack) can be scheduled on it.

Create custom node for logging purposes

The relevant part of the MachineSet template (spec.template.spec):

    spec:
      metadata:
        labels:
          node-role.kubernetes.io/logging: ""
      taints:
      - effect: NoSchedule
        key: node-role
        value: logging

[ openshift/agregateLogging/components_of_logging/yaml/MachineSet-logging.yaml ]

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: toshi44-l9tcd
    machine.openshift.io/cluster-api-machine-role: infra
    machine.openshift.io/cluster-api-machine-type: infra
  name: toshi44-l9tcd-logging-westeurope3
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: toshi44-l9tcd
      machine.openshift.io/cluster-api-machineset: toshi44-l9tcd-logging-westeurope3
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: toshi44-l9tcd
        machine.openshift.io/cluster-api-machine-role: infra
        machine.openshift.io/cluster-api-machine-type: infra
        machine.openshift.io/cluster-api-machineset: toshi44-l9tcd-logging-westeurope3
        node.purpose: logging
    spec:
      metadata:
        creationTimestamp: null
        labels:
          node.purpose: logging
      taints:
      - effect: NoSchedule
        key: node-role
        value: logging
      providerSpec:
        value:
          apiVersion: azureproviderconfig.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/toshi44-l9tcd-rg/providers/Microsoft.Compute/images/toshi44-l9tcd
            sku: ""
            version: ""
          internalLoadBalancer: ""
          kind: AzureMachineProviderSpec
          location: westeurope
          managedIdentity: toshi44-l9tcd-identity
          metadata:
            creationTimestamp: null
          natRule: null
          networkResourceGroup: toshi_vnet_rg
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: ""
          resourceGroup: toshi44-l9tcd-rg
          sshPrivateKey: ""
          sshPublicKey: ""
          subnet: toshi-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D4S_v3
          vnet: toshi_vnet
          zone: "3"

Taints and labels can also be applied later on:

 # taint
kubectl taint nodes toshi44-l9tcd-logging-westeurope3-nb8lf node-role=logging:NoSchedule
 # label
oc label nodes toshi44-l9tcd-logging-westeurope3-nb8lf node-role.kubernetes.io/logging=logging
 # get all nodes taints
oc get nodes -o json|jq -r '.items[].spec.taints'

In case we need to remove the taint, use the trailing-minus convention:

kubectl taint nodes toshi44-l9tcd-logging-westeurope3-nb8lf node-role=logging:NoSchedule-
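To verify that the taint and label ended up on the node (node name as used above):

oc get node toshi44-l9tcd-logging-westeurope3-nb8lf -o jsonpath='{.spec.taints}{"\n"}'
oc get node toshi44-l9tcd-logging-westeurope3-nb8lf --show-labels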

Install ClusterLogging Operator and Elasticsearch Operator

Quite a long task, described in more detail in the Red Hat documentation.

Create a Namespace for the Elasticsearch Operator

[ openshift/agregateLogging/components_of_logging/yaml/eo-namespace.yaml ]

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-operators-redhat 
  annotations:
    openshift.io/node-selector: ""
  labels:
    openshift.io/cluster-logging: "true"
    openshift.io/cluster-monitoring: "true"

Create a Namespace for the Cluster Logging Operator

[ openshift/agregateLogging/components_of_logging/yaml/clo-namespace.yaml ]

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-logging
  annotations:
    openshift.io/node-selector: ""
  labels:
    openshift.io/cluster-logging: "true"
    openshift.io/cluster-monitoring: "true"

Create an Operator Group for Elasticsearch operator

[ openshift/agregateLogging/components_of_logging/yaml/eo-operatorgroup.yaml ]

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-operators-redhat
  namespace: openshift-operators-redhat 
spec: {}

Create a Subscription for Elasticsearch operator

[ openshift/agregateLogging/components_of_logging/yaml/eo-subscription.yaml ]

#oc get packagemanifest elasticsearch-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: "elasticsearch-operator"
  namespace: "openshift-operators-redhat" 
spec:
  # channel is the value of .status.channels[].name from the packagemanifest
  channel: "4.4" 
  installPlanApproval: "Automatic"
  source: "redhat-operators"
  sourceNamespace: "openshift-marketplace"
  name: "elasticsearch-operator"

Verify the Operator installation; there should be an Elasticsearch Operator in each namespace:

oc get csv --all-namespaces

Create an Operator Group for ClusterLogging operator

[ openshift/agregateLogging/components_of_logging/yaml/clo-operatorgroup.yaml ]

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: cluster-logging
  namespace: openshift-logging 
spec:
  targetNamespaces:
  - openshift-logging 

Create a Subscription for ClusterLogging operator

[ openshift/agregateLogging/components_of_logging/yaml/clo-subscription.yaml ]

# channel=oc get packagemanifest cluster-logging -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: "4.4" 
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace
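The Cluster Logging Operator manifests are applied the same way:

oc apply -f openshift/agregateLogging/components_of_logging/yaml/clo-namespace.yaml
oc apply -f openshift/agregateLogging/components_of_logging/yaml/clo-operatorgroup.yaml
oc apply -f openshift/agregateLogging/components_of_logging/yaml/clo-subscription.yaml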

oc get csv -n openshift-logging
 # an operator pod should also appear in the openshift-logging namespace
oc get deploy -n openshift-logging
 # add the toleration described above to the container/pod definition in the deployment
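If the operator pod itself should tolerate the logging taint, the toleration can be patched into the deployment's pod spec. A sketch; note that OLM manages this deployment and may reconcile the change, so treat it as a manual tweak:

oc patch deploy cluster-logging-operator -n openshift-logging --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"node-role","operator":"Equal","value":"logging","effect":"NoSchedule"}]}}}}'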

Create the ClusterLogging instance (custom resource)

The ClusterLogging custom resource (CR) defines a complete cluster logging deployment that includes all the components of the logging stack to collect, store and visualize logs.
The deployment must use tolerations that match our node taint so the pods can be scheduled (they can be defined in the CR). Elasticsearch will run with a single node and the ZeroRedundancy policy.

      tolerations:
      - key: "node-role"
        operator: "Equal"
        value: "logging"
        effect: "NoSchedule"
        # or
      - key: "node-role"
        operator: "Exists"
        effect: "NoSchedule"

For storage we will use the default managed-premium storageClass, but later I would like to migrate to an azureFile storage class. The name of the instance must be “instance”, otherwise the cluster-logging-operator will fail (OCP 4.4.6).
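A quick check that the referenced storage class exists in the cluster:

oc get storageclass managed-premium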

[ openshift/agregateLogging/components_of_logging/yaml/ClusterLogging-CRD.yaml ]

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  logStore:
    type: elasticsearch
    elasticsearch:
      nodeSelector:
        node.purpose: logging
      tolerations:
      - key: "node-role"
        operator: "Equal"
        value: "logging"
        effect: "NoSchedule"
      nodeCount: 1
      redundancyPolicy: ZeroRedundancy
      storage:
        storageClassName: managed-premium
        size: 200G
      resources:
          limits:
            cpu: "800m"
            memory: "8Gi"
          requests:
            cpu: "100m"
            memory: "8Gi"
  visualization:
    type: kibana
    kibana:
      nodeSelector:
        node.purpose: logging
      tolerations:
      - key: "node-role"
        operator: "Equal"
        value: "logging"
        effect: "NoSchedule"
      replicas: 1
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 1Gi
  curation:
    type: curator
    curator:
      tolerations: 
       - key: "node-role"
         operator: "Equal"
         value: "logging"
         effect: "NoSchedule"
      resources:
        limits:
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 100Mi
      schedule: "*/5 * * * *"
  collection:
    logs:
      type: fluentd
      fluentd:
        tolerations: 
        - key: "node-role"
          operator: "Equal"
          value: "logging"
          effect: "NoSchedule"
        resources:
          limits:
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 1Gi
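Apply the CR (path as referenced above); the operator then creates the Elasticsearch, Kibana, Fluentd and Curator workloads:

oc apply -f openshift/agregateLogging/components_of_logging/yaml/ClusterLogging-CRD.yaml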

Check the status of the logging stack

oc get pods -n openshift-logging
oc get clusterlogging instance -n openshift-logging -o yaml
oc get Elasticsearch elasticsearch -n openshift-logging -o yaml
oc get pods --selector component=elasticsearch -n openshift-logging -o name
#health status for indexes
#indices is a shell script available on the pod
oc exec -n openshift-logging pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw -- indices
oc get replicaSet --selector component=elasticsearch -o name -n openshift-logging
*file: 02-fluendD-eventRouter.md *

EventRouter

A deployment that watches Kubernetes events; they are then processed by Fluentd and stored in Elasticsearch.

Events

By kubernetes events we mean log messages internal to kubernetes, accessible through the kubernetes API at /api/v1/events?watch=true and originally stored in etcd. Because etcd storage has time and performance constraints, we want to collect the events and store them permanently in EFK.

  • eventrouter is deployed to the logging project, has a service account and its own role to read events
  • eventrouter watches kubernetes events, marshals them to JSON and outputs them to its STDOUT
  • fluentd picks them up and inserts them into the Elasticsearch index of the logging project
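The same event stream the eventrouter watches can also be inspected directly through the API (a quick sketch using oc):

 # raw events from the API server
oc get --raw "/api/v1/events?limit=5" | jq '.items[] | {reason, message}'
 # or the usual aggregated view
oc get events -A --sort-by=.metadata.creationTimestamp | tail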

EventRouter Deployment

Use the template from the Red Hat documentation:
event_router_template

oc process -f eventRouter-template.yaml  | oc apply -f -

Configuring the Event Router

oc project openshift-logging
oc get ds
 # first set cluster logging to the Unmanaged state (in the web console or via oc edit)
 # then set TRANSFORM_EVENTS=true so Event Router events are processed and stored in Elasticsearch
oc set env ds/fluentd TRANSFORM_EVENTS=true
oc get clusterlogging instance -o yaml
oc edit ClusterLogging instance
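Verify that the variable is set on the daemonset:

oc set env ds/fluentd --list -n openshift-logging | grep TRANSFORM_EVENTS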

get logs:

oc exec fluentd-ht42r -n openshift-logging -- logs
 # logs is a binary to display logs 

You can send Elasticsearch logs to external devices, such as an externally-hosted Elasticsearch instance or an external syslog server. You can also configure Fluentd to send logs to an external log aggregator.

Configuring Fluentd to send logs to an external log aggregator

You can configure Fluentd to send a copy of its logs to an external log aggregator, and not the default Elasticsearch, using the secure-forward plug-in. From there, you can further process log records after the locally hosted Fluentd has processed them.

-> in practice that means using secure-forward -> another fluentd instance with the Kafka plugin -> and from there into Kafka

The fluentd shipped with OpenShift has no forward plugin for Kafka, and Red Hat does not plan to add one.
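For illustration, a generic upstream-fluentd sketch of such a forward output (plain @type forward over TLS; the secure-forward configuration shipped with OCP 4.4 has its own config file and format, see the Red Hat docs):

<match **>
  @type forward
  transport tls
  tls_verify_hostname false
  <server>
    # hypothetical external fluentd instance that has the Kafka output plugin
    host external-fluentd.example.com
    port 24224
  </server>
</match>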

A related error seen in Kibana when the kibana user has no permissions on the queried index:

[object Object]: [security_exception] no permissions for [indices:data/read/field_caps] and User [name=CN=system.logging.kibana,OU=OpenShift,O=Logging, roles=[]]

*file: 02-fluentD-OpenshiftScope.md *

FluentD

Everything that a containerized application writes to stdout or stderr is streamed somewhere by the container engine – in Docker’s case, for example, to a logging driver. These logs are usually located in the /var/log/containers directory on your host.

The fluentd component runs as a DaemonSet, which means one pod runs on each node in the cluster. As nodes are added or removed, Kubernetes ensures that there is one fluentd pod running on each node. Fluentd is configured to run as a privileged container; it is able to collect logs from all pods on the node, convert them to a structured format and pass them to the log aggregator.
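A quick way to see the one-pod-per-node layout and the raw log files fluentd reads (node name is a placeholder):

oc get ds fluentd -n openshift-logging
oc get pods -n openshift-logging -o wide --selector component=fluentd
 # raw container logs on a node
oc debug node/<node-name> -- chroot /host ls /var/log/containers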

Architecture

In Kubernetes, containerized applications that log to stdout and stderr have their log streams captured and redirected to JSON files on the nodes. The Fluentd pod tails these log files, filters log events, transforms the log data, and ships it to the Elasticsearch logging backend.

fluentD architecture

FLUENTD BASE CONFIGURATION and CUSTOMIZATION

Base configuration is stored in ConfigMap

oc get cm fluentd -o json|jq -r '.data["fluent.conf"]'|vim -
oc get cm fluentd -o json|jq -r '.data["run.sh"]'|vim -

The fluentd image shipped with the logging-operator CSV 4.4 does not include fluent-plugin-kafka.

 # list ruby gems on container
scl enable rh-ruby25 -- gem list

I built a version of fluentd with the Kafka and other plugins installed as gems. Let's try it out; as the Kafka endpoint we will use Azure Event Hubs.

TEST FLUENTD locally with podman

Plugins used for the fluentd build:

gem install fluent-config-regexp-type 
gem install fluent-mixin-config-placeholders 
gem install fluent-plugin-concat 
gem install fluent-plugin-elasticsearch 
gem install fluent-plugin-kafka 
gem install fluent-plugin-kubernetes_metadata_filter 
gem install fluent-plugin-multi-format-parser 
gem install fluent-plugin-prometheus 
gem install fluent-plugin-record-modifier 
gem install fluent-plugin-remote-syslog 
gem install fluent-plugin-remote_syslog 
gem install fluent-plugin-rewrite-tag-filter 
gem install fluent-plugin-splunk-hec 
gem install fluent-plugin-systemd 
gem install fluent-plugin-viaq_data_model
podman pull fluent/fluentd:v1.11-debian-1
 # with the config file mounted
podman run -p 8888:8888 -ti  --rm -v  /home/ts/git_repositories/work/openshift/oshi/logging:/fluentd/etc docker.io/fluent/fluentd:v1.11-debian-1 fluentd -c /fluentd/etc/fluent.conf
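To get a reusable image instead of a local test, a minimal sketch of baking plugins into the upstream image (image tag and plugin selection are illustrative, not the exact build used here; plugins with native extensions need extra build tools):

 # hypothetical Containerfile; the listed plugins are pure ruby
cat > Containerfile <<'EOF'
FROM docker.io/fluent/fluentd:v1.11-debian-1
USER root
RUN gem install fluent-plugin-kafka fluent-plugin-elasticsearch fluent-plugin-rewrite-tag-filter
USER fluent
EOF
podman build -t localhost/fluentd-kafka:v1.11 -f Containerfile .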

USE Azure EventHub instead of Kafka

Azure Event Hubs can consume Kafka output (it exposes a Kafka-compatible endpoint), so for testing purposes we will use it instead of a real Kafka cluster.

Kafka Concept vs Event Hubs Concept
Cluster        <---->     Namespace
Topic          <---->     Event Hub
Partition      <---->     Partition
Consumer Group <---->     Consumer Group
Offset         <---->     Offset

fluentD kafka output sample configuration:

  <store>
    @type kafka2
    brokers fluentd-eventhub-oshi.servicebus.windows.net:9093
    flush_interval 3s
    <buffer topic>
      @type file
      path '/var/lib/fluentd/retry_clo_default_kafka_out'
      flush_interval "#{ENV['ES_FLUSH_INTERVAL'] || '1s'}"
      flush_thread_count "#{ENV['ES_FLUSH_THREAD_COUNT'] || 2}"
      flush_at_shutdown "#{ENV['FLUSH_AT_SHUTDOWN'] || 'false'}"
      retry_max_interval "#{ENV['ES_RETRY_WAIT'] || '300'}"
      retry_forever true
      queue_limit_length "#{ENV['BUFFER_QUEUE_LIMIT'] || '32' }"
      chunk_limit_size "#{ENV['BUFFER_SIZE_LIMIT'] || '8m' }"
      overflow_action "#{ENV['BUFFER_QUEUE_FULL_ACTION'] || 'block'}"
      flush_interval 3s
    </buffer>

    # topic settings
    default_topic kafka_output

    # producer settings
    max_send_retries 1
    required_acks 1
    <format>
      @type json
    </format>
    ssl_ca_certs_from_system true

    username $ConnectionString
    password "Endpoint=sb://fluentd-eventhub-oshi.servicebus.windows.net/;SharedAccessKeyName=ss;SharedAccessKey=zeWz+9rSS/yWGanjcKrXMA2mAVCO0hL+MULhNWXHfkk=;EntityPath=kafka_output"
  </store>

PATCH the original ConfigMap with the custom configuration

oc get cm fluentd -n openshift-logging -o yaml >fluentd-cm.yaml
 # bass is fish-specific; in bash you can omit it. The fish-native way would be
 # process substitution:
 # yq w -i test.yaml 'data.[fluent.conf]' -- (cat fluent.conf|psub)
 # but that does not work here because yq gets a FIFO instead of a regular file
bass 'yq w -i fluentd-cm.yaml 'data.[fluent.conf]'  -- "$(< fluent.conf)"'
oc apply -f fluentd-cm.yaml

The custom ConfigMap is mounted into the fluentd pods at /etc/fluentd/config.d/, so the pods must be restarted:

for i in (oc get pods -o name --selector component=fluentd); oc delete $i; end
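Equivalently, in a single command:

oc delete pods -n openshift-logging --selector component=fluentd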
*file: 03-elasticSearch.md *

Elasticsearch

Elasticsearch architecture

Cluster: Any non-trivial Elasticsearch deployment consists of multiple instances forming a cluster. Distributed consensus is used to keep track of master/replica relationships.
Node: A single Elasticsearch instance.
Index: A collection of documents. This is similar to a database in the traditional terminology. Each data provider (like fluentd logs from a single Kubernetes cluster) should use a separate index to store and search logs. An index is stored across multiple nodes to make data highly available.
Shard: Because Elasticsearch is a distributed search engine, an index is usually split into elements known as shards that are distributed across multiple nodes. (Elasticsearch automatically manages the arrangement of these shards and re-balances them as necessary, so users need not worry about them.)
Replica: By default, Elasticsearch creates five primary shards and one replica for each index. This means that each index will consist of five primary shards, and each shard will have one copy.

Deployment node roles:

Client: These nodes provide the API endpoint and can be used for queries. In a Kubernetes-based deployment these are exposed as a service so that a logical DNS endpoint can be used for queries regardless of the number of client nodes.
Master: These nodes provide coordination. A single master is elected at a time by using distributed consensus. That node is responsible for deciding shard placement, reindexing and rebalancing operations.
Data: These nodes store the data and the inverted index. Clients query data nodes directly. The data is sharded and replicated so that a given number of data nodes can fail without impacting availability.
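The elasticsearch pods also ship helper scripts (like the indices script used earlier); es_util wraps an authenticated curl against the local cluster and is handy for a quick node and shard overview (pod name is a placeholder, script availability depends on the image version):

oc exec -n openshift-logging -c elasticsearch <es-pod> -- es_util --query=_cat/nodes?v
oc exec -n openshift-logging -c elasticsearch <es-pod> -- es_util --query=_cat/shards?v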

Exposing Elasticsearch as a route

For testing purposes and API queries

 # elasticsearch-route.yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: elasticsearch
  namespace: openshift-logging
spec:
  host:
  to:
    kind: Service
    name: elasticsearch
  tls:
    termination: reencrypt
    destinationCACertificate: |

 # append the CA (indented by six spaces) to the route definition, then create the route
oc extract secret/elasticsearch --to=. --keys=admin-ca
cat ./admin-ca | sed -e "s/^/      /" >> elasticsearch-route.yaml
oc create -f elasticsearch-route.yaml
set token (oc whoami -t) #get Bearer token
set routeES (oc get route -n openshift-logging elasticsearch -o json|jq -Mr '.spec.host')
 # operations index
curl -s -tlsv1.2 --insecure -H "Authorization: Bearer $token" "https://$routeES/.operations.*/_search?size=1" | jq
 # all indexes
curl -s -tlsv1.2 --insecure -H "Authorization: Bearer $token" "https://$routeES/_aliases" | jq
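Other useful queries against the exposed route (same token and host variables as above; in OCP 4.4 project indexes follow the project.<namespace>.<uid>.* naming scheme):

 # list all indexes with size and health
curl -s -tlsv1.2 --insecure -H "Authorization: Bearer $token" "https://$routeES/_cat/indices?v"
 # count documents in project indexes
curl -s -tlsv1.2 --insecure -H "Authorization: Bearer $token" "https://$routeES/project.*/_count" | jq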