5️⃣Logging

If you ask me where to get the most insight into infra, my answer is infrastructure issues. I spent a good first half of my career supporting and solving infra issues, be it network or server. In those days orgs depended heavily on IT support. They still do, though it has transformed into present-day tech support or cloud support.

Unfortunately, even today a reactive approach to infra issues is the norm. Proper analysis of incidents, followed by proactive measures, can prevent many of them, but this work gets ignored because it is considered non-functional or operational. The cost comes to notice only when businesses start losing customers to their competitors, review their annual performance and identify that the cause is lower QoS. The faster you can respond to and solve an issue, the better your QoS and the more trust you establish with your customers.

How many hours did you spend last month on production issues? How many downtimes did you have, and how long did it take to bring a system back up? Before I continue, spend the next few seconds working out how long those so-called non-productive hours were: rewind your memory and start noting the downtimes.

Why log?

Observability and faster debugging.
Root cause analysis and postmortems to identify the root cause of an incident and take preventive measures.
This in turn lets you commit better to your SLAs and SLOs and aim for the maximum number of 9s in them.

In short, logging helps you identify the fault lines in your infra and build a fault-tolerant service.
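To make the "number of 9s" concrete, here is a minimal sketch of the arithmetic behind an availability target. The function name and the 365-day year are my own assumptions for illustration; your SLA may measure availability over a month or a quarter instead.

```python
# Downtime budget implied by an availability target ("number of nines").
# Assumption: a 365-day year; adjust the window to match your own SLA.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the maximum yearly downtime (in hours) a target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(target)
        print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min)")
```

Each extra 9 cuts the yearly downtime budget by a factor of ten, which is why the time you need to find the right log line starts to matter so much.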

Some common challenges in logging

Missing Logs
Format Issue
Too many tools
Too many logs
Log corruption

What should be logged?

Logging must be enabled to build observability for both infrastructure and applications. It is a must for:

Infrastructure components such as network, system and storage, to monitor health, performance, incidents, configuration and access management. Unless you log, how will you know why the last deployment failed, and whether it was an infrastructure issue or an application bug?

Likewise, log collectors must be enabled by default for all applications being developed, to build observability and traceability of performance.

In addition to standard logging, enable audit logging for security purposes, for example when a user gets created or deleted, when there are multiple failed logins, or when there are too many access errors. These audit logs are a must during compliance and regulatory audits.
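As a rough illustration of what an audit log entry could look like, here is a minimal sketch using Python's standard logging module to emit one JSON object per line. The logger name "audit", the field names (event, actor, target, source_ip) and the helper build_audit_logger are hypothetical choices of mine, not a standard; pick whatever your compliance team and log collector agree on.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical audit fields for this sketch; not a standard schema.
AUDIT_FIELDS = ("event", "actor", "target", "source_ip")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in AUDIT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_audit_logger(stream=None) -> logging.Logger:
    """Return a logger that writes JSON audit lines to `stream` (stderr by default)."""
    logger = logging.getLogger("audit")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep audit lines out of the application log
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    audit = build_audit_logger()
    audit.info("user created",
               extra={"event": "user.create", "actor": "admin", "target": "alice"})
    audit.warning("failed login",
                  extra={"event": "login.failed", "actor": "bob", "source_ip": "10.0.0.5"})
```

Keeping audit events on their own logger makes it easier to ship them to separate, longer-lived storage than the regular application logs.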

Centralised Log Aggregator

One-stop station for traceability.
A standard, human-readable format, gathered with a standard log collector such as Filebeat, Fluentd or journald, to prevent missing logs and corruption issues.
Logs must be aggregated and stored in shared storage such as NAS or SAN.
Enable regular log rotation and archive logs older than 30 days. Set a log retention policy in line with your compliance needs. Most regulators, especially in the banking and financial sector, may require you to retain logs for at least a year or two. In such a case, archive them and keep them in separate storage, or else you might hit a disk space issue on your log aggregator.
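For application-level rotation, a minimal sketch with Python's standard TimedRotatingFileHandler might look like this. The helper name build_rotating_logger, the logger name "app" and the 30-day default are assumptions for illustration. Note that backupCount deletes rotated files beyond the limit, so any archiving to separate storage has to happen before that point, typically through a separate job or the aggregator's own retention settings.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_rotating_logger(path: str, keep_days: int = 30) -> logging.Logger:
    """Return a logger that rotates `path` daily and keeps `keep_days` old files.

    Assumption: rotated files older than `keep_days` are archived elsewhere
    before this handler deletes them.
    """
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",        # start a new file once a day
        backupCount=keep_days,  # rotated files beyond this count are deleted
        encoding="utf-8",
        utc=True,               # rotate on UTC midnight, not local time
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_rotating_logger("app.log")
    log.info("service started")
```

On most servers the same policy is often handled outside the application, for example by logrotate or by the log aggregator itself; the point is that rotation and retention are configured deliberately rather than left to fill the disk.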

Summary

The faster you can debug and solve an issue, the tighter the SLA or SLO you can commit to your customer. Customer sales increase as the number of 9s in the SLA increases.

For faster debugging you need a one-stop station of human-readable, formatted logs: in short, a centralised, robust logging system that is easy to operate, stable, scalable, secure and cost-effective.


Last updated 2 years ago
