Aggregated logging
Building the logging stack
OCP Logging in general
The cluster logging components are based on Elasticsearch, Fluentd, and Kibana.
- logStore: This is where the logs will be stored. The current implementation is Elasticsearch.
- collection: This is the component that collects logs from the node, formats them, and stores them in the logStore. The current implementation is Fluentd.
- visualization: This is the UI component used to view logs, graphs, charts, and so forth. The current implementation is Kibana.
- curation: This is the component that trims logs by age. The current implementation is Curator.
- event routing: This is the component that forwards events to cluster logging. The current implementation is Event Router. The Event Router communicates with the OpenShift Container Platform API and prints OpenShift Container Platform events to the log of the pod where the event occurs.
The collector, Fluentd, is deployed to each node in the OpenShift Container Platform cluster. It collects all node and container logs and writes them to Elasticsearch (ES); Kibana is used to visualize them.
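Once the stack is installed, you can confirm the collector runs on every node (daemonset and selector names below match the OCP 4.4 logging stack):
oc get ds fluentd -n openshift-logging
oc get pods -n openshift-logging --selector component=fluentd -o wide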
Provision resources for logging stack on OCP
The logging stack will be installed via operators. For Elasticsearch we will use a dedicated node with an appropriate taint, so that only pods carrying the matching toleration (part of the logging stack) can be scheduled there.
Create a custom node for logging purposes
The node taint and node label are set from the MachineSet template under spec.template.spec:
spec:
  template:
    spec:
      taints:
      - effect: NoSchedule
        key: node-role
        value: logging
      metadata:
        labels:
          node-role.kubernetes.io/logging: ""
[ openshift/agregateLogging/components_of_logging/yaml/MachineSet-logging.yaml ]
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: toshi44-l9tcd
    machine.openshift.io/cluster-api-machine-role: infra
    machine.openshift.io/cluster-api-machine-type: infra
  name: toshi44-l9tcd-logging-westeurope3
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: toshi44-l9tcd
      machine.openshift.io/cluster-api-machineset: toshi44-l9tcd-logging-westeurope3
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: toshi44-l9tcd
        machine.openshift.io/cluster-api-machine-role: infra
        machine.openshift.io/cluster-api-machine-type: infra
        machine.openshift.io/cluster-api-machineset: toshi44-l9tcd-logging-westeurope3
        node.purpose: logging
    spec:
      taints:
      - effect: NoSchedule
        key: node-role
        value: logging
      metadata:
        creationTimestamp: null
      providerSpec:
        value:
          apiVersion: azureproviderconfig.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/toshi44-l9tcd-rg/providers/Microsoft.Compute/images/toshi44-l9tcd
            sku: ""
            version: ""
          internalLoadBalancer: ""
          kind: AzureMachineProviderSpec
          location: westeurope
          managedIdentity: toshi44-l9tcd-identity
          metadata:
            creationTimestamp: null
          natRule: null
          networkResourceGroup: toshi_vnet_rg
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: ""
          resourceGroup: toshi44-l9tcd-rg
          sshPrivateKey: ""
          sshPublicKey: ""
          subnet: toshi-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D4S_v3
          vnet: toshi_vnet
          zone: "3"
Taints and labels can also be created later on:
# taint
kubectl taint nodes toshi44-l9tcd-logging-westeurope3-nb8lf node-role=logging:NoSchedule
# label
oc label nodes toshi44-l9tcd-logging-westeurope3-nb8lf node-role.kubernetes.io/logging=logging
# get all nodes taints
oc get nodes -o json|jq -r '.items[].spec.taints'
In case we need to remove the taint, use the minus convention:
kubectl taint nodes toshi44-l9tcd-logging-westeurope3-nb8lf node-role=logging:NoSchedule-
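Verify that the taint and label are in place:
oc get node toshi44-l9tcd-logging-westeurope3-nb8lf -o jsonpath='{.spec.taints}'
oc get nodes -l node-role.kubernetes.io/logging -o name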
Install ClusterLogging Operator and Elasticsearch Operator
Quite a long task, described with more information in the Red Hat documentation.
Create a Namespace for the Elasticsearch Operator
[ openshift/agregateLogging/components_of_logging/yaml/eo-namespace.yaml ]
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-operators-redhat
  annotations:
    openshift.io/node-selector: ""
  labels:
    openshift.io/cluster-logging: "true"
    openshift.io/cluster-monitoring: "true"
Create a Namespace for the Cluster Logging Operator
[ openshift/agregateLogging/components_of_logging/yaml/clo-namespace.yaml ]
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-logging
  annotations:
    openshift.io/node-selector: ""
  labels:
    openshift.io/cluster-logging: "true"
    openshift.io/cluster-monitoring: "true"
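Apply both namespace manifests from the repository paths above, e.g.:
oc apply -f eo-namespace.yaml
oc apply -f clo-namespace.yaml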
Create an Operator Group for Elasticsearch operator
[ openshift/agregateLogging/components_of_logging/yaml/eo-operatorgroup.yaml ]
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-operators-redhat
  namespace: openshift-operators-redhat
spec: {}
Create a Subscription for Elasticsearch operator
[ openshift/agregateLogging/components_of_logging/yaml/eo-subscription.yaml ]
# oc get packagemanifest elasticsearch-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: "elasticsearch-operator"
  namespace: "openshift-operators-redhat"
spec:
  # channel is the output of .status.channels[].name
  channel: "4.4"
  installPlanApproval: "Automatic"
  source: "redhat-operators"
  sourceNamespace: "openshift-marketplace"
  name: "elasticsearch-operator"
Verify the operator installation; there should be an Elasticsearch Operator in each namespace:
oc get csv --all-namespaces
Create an Operator Group for ClusterLogging operator
[ openshift/agregateLogging/components_of_logging/yaml/clo-operatorgroup.yaml ]
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  targetNamespaces:
  - openshift-logging
Create a Subscription for ClusterLogging operator
[ openshift/agregateLogging/components_of_logging/yaml/clo-subscription.yaml ]
# channel=$(oc get packagemanifest cluster-logging -n openshift-marketplace -o jsonpath='{.status.defaultChannel}')
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: "4.4"
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace
oc get csv -n openshift-logging
# an operator pod should also appear in the openshift-logging namespace
oc get deploy -n openshift-logging
# add the toleration for the taint above to the pod definition in the deployment
Create a ClusterLogging instance (CRD)
The ClusterLogging Custom Resource Definition (CRD) defines a complete cluster logging deployment that includes all the components of the logging stack to collect, store and visualize logs.
For the deployment we must set tolerations matching our node taint so the pods are allowed to schedule (they can be defined in the CRD). Elasticsearch will run with 1 node and the ZeroRedundancy configuration.
tolerations:
- key: "node-role"
  operator: "Equal"
  value: "logging"
  effect: "NoSchedule"
# or
- key: "node-role"
  operator: "Exists"
  effect: "NoSchedule"
For storage we will use the default managed-premium storageClass, but later I would like to migrate to the azureFile SC. The name of the instance must be “instance”, otherwise the cluster-logging-operator will fail (OCP 4.4.6).
[ openshift/agregateLogging/components_of_logging/yaml/ClusterLogging-CRD.yaml ]
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  logStore:
    type: elasticsearch
    elasticsearch:
      nodeSelector:
        node.purpose: logging
      tolerations:
      - key: "node-role"
        operator: "Equal"
        value: "logging"
        effect: "NoSchedule"
      nodeCount: 1
      redundancyPolicy: ZeroRedundancy
      storage:
        storageClassName: managed-premium
        size: 200G
      resources:
        limits:
          cpu: "800m"
          memory: "8Gi"
        requests:
          cpu: "100m"
          memory: "8Gi"
  visualization:
    type: kibana
    kibana:
      nodeSelector:
        node.purpose: logging
      tolerations:
      - key: "node-role"
        operator: "Equal"
        value: "logging"
        effect: "NoSchedule"
      replicas: 1
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 1Gi
  curation:
    type: curator
    curator:
      tolerations:
      - key: "node-role"
        operator: "Equal"
        value: "logging"
        effect: "NoSchedule"
      resources:
        limits:
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 100Mi
      schedule: "*/5 * * * *"
  collection:
    logs:
      type: fluentd
      fluentd:
        tolerations:
        - key: "node-role"
          operator: "Equal"
          value: "logging"
          effect: "NoSchedule"
        resources:
          limits:
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 1Gi
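Apply the manifest from the path above; the operator then creates the Elasticsearch, Kibana, Curator and Fluentd workloads:
oc apply -f ClusterLogging-CRD.yaml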
Check status of logging stack
oc get pods -n openshift-logging
oc get clusterlogging instance -n openshift-logging -o yaml
oc get Elasticsearch elasticsearch -n openshift-logging -o yaml
oc get pods --selector component=elasticsearch -n openshift-logging -o name
# health status for indexes
# indices is a shell script on the pod
oc exec -n openshift-logging pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw -- indices
oc get replicaSet --selector component=elasticsearch -o name -n openshift-logging
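Cluster health can also be queried through the es_util helper shipped on the OpenShift ES pods (same pod name as above):
oc exec -n openshift-logging -c elasticsearch pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw -- es_util --query=_cluster/health?pretty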
EventRouter
A deployment that watches Kubernetes events and writes them to its output, where Fluentd picks them up, processes them, and stores them in Elasticsearch.
Events
By Kubernetes events we understand log messages internal to Kubernetes, accessible through the Kubernetes API /api/v1/events?watch=true and originally stored in etcd. The etcd storage has time and performance constraints; therefore we would like to collect and store them permanently in EFK.
- eventrouter is deployed to logging project, has a service account and its own role to read events
- eventrouter watches kubernetes events, marshals them to JSON and outputs them to its STDOUT
- fluentd picks them up and inserts to elastic search logging project index
EventRouter Deployment
Use the template from Red Hat:
event_router_template
oc process -f eventRouter-template.yaml | oc apply -f -
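To verify it is running and emitting events (assuming the template creates a deployment named eventrouter, as the Red Hat template does):
oc get deploy eventrouter -n openshift-logging
oc logs deployment/eventrouter -n openshift-logging | head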
Configuring the Event Router
oc project openshift-logging
oc get ds
Set TRANSFORM_EVENTS=true in order to process and store event router events in Elasticsearch.
Set cluster logging to the unmanaged state in the web console first:
oc set env ds/fluentd TRANSFORM_EVENTS=true
oc get clusterlogging instance -o yaml
oc edit ClusterLogging instance
Get the logs:
oc exec fluentd-ht42r -n openshift-logging -- logs
# logs is a binary to display logs
You can send Elasticsearch logs to external devices, such as an externally-hosted Elasticsearch instance or an external syslog server. You can also configure Fluentd to send logs to an external log aggregator.
Configuring Fluentd to send logs to an external log aggregator
You can configure Fluentd to send a copy of its logs to an external log aggregator, and not the default Elasticsearch, using the secure-forward plug-in. From there, you can further process log records after the locally hosted Fluentd has processed them.
In practice this means using secure-forward to another Fluentd instance with a Kafka plugin, which then forwards to Kafka.
The shipped fluentd has no forward plugin for Kafka, and Red Hat does not plan to add one.
A permissions error you may encounter in Kibana:
[object Object]: [security_exception] no permissions for [indices:data/read/field_caps] and User [name=CN=system.logging.kibana,OU=OpenShift,O=Logging, roles=[]]
FluentD
Everything that a containerized application writes to stdout or stderr is streamed somewhere by the container engine – in Docker’s case, for example, to a logging driver. These logs are usually located in the /var/log/containers directory on your host.
The fluentd component runs as a daemonset, which means one pod runs on each node in the cluster. As nodes are added or removed, kubernetes orchestration ensures that there is one fluentd pod running on each node. Fluentd is configured to run as a privileged container. It is able to collect logs from all pods on the node, convert them to a structured format and pass them to the log aggregator.
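A quick way to see the files fluentd tails is a node debug shell; a sketch (the node name is a placeholder):
oc debug node/<node-name> -- chroot /host ls /var/log/containers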
Architecture
In Kubernetes, containerized applications that log to stdout and stderr have their log streams captured and redirected to JSON files on the nodes. The Fluentd pod tails these log files, filters log events, transforms the log data, and ships it off to the Elasticsearch logging backend.
[figure: fluentD architecture]
FLUENTD BASE CONFIGURATION and CUSTOMIZATION
The base configuration is stored in a ConfigMap:
oc get cm fluentd -o json|jq -r '.data["fluent.conf"]'|vim -
oc get cm fluentd -o json|jq -r '.data["run.sh"]'|vim -
The fluentd version shipped with the logging-operator CSV 4.4 comes without fluent-plugin-kafka.
# list ruby gems on container
scl enable rh-ruby25 -- gem list
I built a version of fluentd with Kafka and other plugins compiled in as gems. So let’s try it; as Kafka we will use Azure EventHub.
TEST FLUENTD locally with podman
plugins used for the fluentd build:
gem install fluent-config-regexp-type
gem install fluent-mixin-config-placeholders
gem install fluent-plugin-concat
gem install fluent-plugin-elasticsearch
gem install fluent-plugin-kafka
gem install fluent-plugin-kubernetes_metadata_filter
gem install fluent-plugin-multi-format-parser
gem install fluent-plugin-prometheus
gem install fluent-plugin-record-modifier
gem install fluent-plugin-remote-syslog
gem install fluent-plugin-remote_syslog
gem install fluent-plugin-rewrite-tag-filter
gem install fluent-plugin-splunk-hec
gem install fluent-plugin-systemd
gem install fluent-plugin-viaq_data_model
podman pull fluent/fluentd:v1.11-debian-1
# with the config file mounted
podman run -p 8888:8888 -ti --rm -v /home/ts/git_repositories/work/openshift/oshi/logging:/fluentd/etc docker.io/fluent/fluentd:v1.11-debian-1 fluentd -c /fluentd/etc/fluent.conf
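A minimal fluent.conf sketch for this local smoke test (the HTTP input on port 8888 matches the -p 8888:8888 mapping above; this is a test-only config, not the cluster one):
<source>
  @type http
  port 8888
  bind 0.0.0.0
</source>
<match **>
  @type stdout
</match>
Send a test event and watch it echoed on stdout:
curl -X POST -d 'json={"hello":"fluentd"}' http://localhost:8888/test.log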
USE Azure EventHub instead of Kafka
Azure EventHub is able to consume Kafka-protocol output, so for testing purposes we will use one.
Kafka Concept vs Event Hubs Concept
Cluster <----> Namespace
Topic <----> Event Hub
Partition <----> Partition
Consumer Group <----> Consumer Group
Offset <----> Offset
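Provisioning the Event Hub side, a sketch with the az CLI (the resource group is a placeholder; the Event Hub name matches the default_topic used below):
az eventhubs namespace create --name fluentd-eventhub-oshi --resource-group <rg> --location westeurope --enable-kafka true
az eventhubs eventhub create --name kafka_output --namespace-name fluentd-eventhub-oshi --resource-group <rg>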
fluentd kafka2 output sample configuration (shared access key redacted):
<store>
  @type kafka2
  brokers fluentd-eventhub-oshi.servicebus.windows.net:9093
  <buffer topic>
    @type file
    path '/var/lib/fluentd/retry_clo_default_kafka_out'
    flush_interval "#{ENV['ES_FLUSH_INTERVAL'] || '1s'}"
    flush_thread_count "#{ENV['ES_FLUSH_THREAD_COUNT'] || 2}"
    flush_at_shutdown "#{ENV['FLUSH_AT_SHUTDOWN'] || 'false'}"
    retry_max_interval "#{ENV['ES_RETRY_WAIT'] || '300'}"
    retry_forever true
    queue_limit_length "#{ENV['BUFFER_QUEUE_LIMIT'] || '32' }"
    chunk_limit_size "#{ENV['BUFFER_SIZE_LIMIT'] || '8m' }"
    overflow_action "#{ENV['BUFFER_QUEUE_FULL_ACTION'] || 'block'}"
  </buffer>
  # topic settings
  default_topic kafka_output
  # producer settings
  max_send_retries 1
  required_acks 1
  <format>
    @type json
  </format>
  ssl_ca_certs_from_system true
  username $ConnectionString
  password "Endpoint=sb://fluentd-eventhub-oshi.servicebus.windows.net/;SharedAccessKeyName=ss;SharedAccessKey=<redacted>;EntityPath=kafka_output"
</store>
PATCH original CM with custom map
oc get cm fluentd -n openshift-logging >fluentd-cm.yaml
# bass is fish-specific; in bash you can omit it. The right way to do it in fish shell
# is to use process substitution:
# yq w -i test.yaml 'data.[fluent.conf]' -- (cat fluent.conf|psub)
# but that does not work, because I get a FIFO, not a stream
bass 'yq w -i fluentd-cm.yaml 'data.[fluent.conf]' -- "$(< fluent.conf)"'
oc apply -f fluentd-cm.yaml
The custom ConfigMap is mounted into the pod at
/etc/fluentd/config.d/
so a restart of fluentd is necessary:
for i in (oc get pods -o name --selector component=fluentd); oc delete $i; end
Elasticsearch
Elasticsearch architecture
Cluster: Any non-trivial Elasticsearch deployment consists of multiple instances forming a cluster. Distributed consensus is used to keep track of master/replica relationships.
Node: A single Elasticsearch instance.
Index: A collection of documents. This is similar to a database in the traditional terminology.
Each data provider (like fluentd logs from a single Kubernetes cluster) should use a separate index to store and search logs.
An index is stored across multiple nodes to make data highly available.
Shard: Because Elasticsearch is a distributed search engine, an index is usually split into elements known as shards that are distributed across multiple nodes. (Elasticsearch automatically manages the arrangement of these shards and re-balances them as necessary, so users need not worry about them.)
Replica: By default, Elasticsearch creates five primary shards and one replica for each index. This means that each index will consist of five primary shards, and each shard will have one copy.
Deployment node types:
Client: These nodes provide the API endpoint and can be used for queries. In a Kubernetes-based deployment these are deployed as a service so that a logical DNS endpoint can be used for queries regardless of the number of client nodes.
Master: These nodes provide coordination. A single master is elected at a time by using distributed consensus. That node is responsible for deciding shard placement, reindexing and rebalancing operations.
Data: These nodes store the data and inverted index. Clients query data nodes directly. The data is sharded and replicated so that a given number of data nodes can fail without impacting availability.
Exposing Elasticsearch as a route
For testing purposes and API queries
# elasticsearch-route.yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: elasticsearch
  namespace: openshift-logging
spec:
  host:
  to:
    kind: Service
    name: elasticsearch
  tls:
    termination: reencrypt
    destinationCACertificate: |
oc extract secret/elasticsearch --to=. --keys=admin-ca
# indent the CA so it lands under destinationCACertificate
cat ./admin-ca | sed -e "s/^/      /" >> elasticsearch-route.yaml
set token (oc whoami -t) #get Bearer token
set routeES (oc get route -n openshift-logging elasticsearch -o json|jq -Mr '.spec.host')
# operations index
curl -s --tlsv1.2 --insecure -H "Authorization: Bearer $token" "https://$routeES/.operations.*/_search?size=1" | jq
# all indexes
curl -s --tlsv1.2 --insecure -H "Authorization: Bearer $token" "https://$routeES/_aliases" | jq
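Other standard Elasticsearch endpoints work the same way, e.g. cluster health and a human-readable index listing:
curl -s --tlsv1.2 --insecure -H "Authorization: Bearer $token" "https://$routeES/_cluster/health?pretty"
curl -s --tlsv1.2 --insecure -H "Authorization: Bearer $token" "https://$routeES/_cat/indices?v"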