We Need to Make Our Logs Useful

A Lesson in Observability

Saif Uddin Mahmud
Dabbler in Destress

--


Trawling Through Crap

We get used to calling print() to debug code meant for one-off school projects and MVPs. With time we may come to appreciate dedicated loggers and debuggers, but the unstructured logging habits carry over to production. And someday they bite you, badly…

It bit me when I was investigating my first bug-in-production at Klinify. Something had gone wrong with a large request, triggering timeouts across the board, and the Frontend started spamming the same request. This overloaded the Backend and caused performance issues. Our production infrastructure is distributed across VMs, Azure-managed services, and a Kubernetes cluster, and Azure Log Analytics (ALA) was the tool of choice for sifting through logs pouring in at a few hundred lines per second.

I was fighting ALA, trying to narrow down my query to find the problematic lines. There were a few problems making the process difficult, frustrating, slow, and imprecise:

  1. Logs from different services had different structures and nobody had documented this.
  2. It was difficult to discern the source service of a log entry. The only way to do that was to look at the ID given to the pod by AKS — which did not have the image name!
  3. Once a request hit the Backend and bounced around from service to service, understanding the causality became difficult. It was a wild game of checking timestamps and hoping the clock drift was minimal.
  4. There was a lot of noise. You had to tune every query to minimize the false positives and false negatives.

TL;DR: it was not a fun day to be trawling logs.

Why This Happened

You might be thinking, “What a bunch of scrubs. I’d fix the logging problem before it got this bad!” I doubt it. And if your organization does have good logging hygiene, it’s probably because the problem bit them at some point too. Congratulations. Regardless, it’s worth exploring why this happens in the first place.

The scenario is a manifestation of technical debt. It creeps up on you because problems like this are never a top priority for the engineering team. They don’t happen every other day, so I could have fixed the bug, moved on, and dreaded the day I had to trawl again. These are the kinds of not-so-sexy projects that have the largest ROI on your DevX. Now that our Backend team had extra firepower (aka me), we could afford to focus on it. The rest of the article documents the steps we took to make our logs useful.

Standardize Your Log Structure

I’ve pasted Klinify’s log structure from our internal wiki below; we’ll unpack parts of it and follow the reasoning. The key takeaway is that it’s crucial for all your logs, across the board, to look the same, so you know exactly which fields to extract and which fields to filter on in your query.

{
  "level": "INFO",
  "timestamp": "2021-06-30 02:51:31,310",
  "process": 20,
  "thread": 139739621476096,
  "message": "Request received",
  "name": "ResponseAction",
  "funcName": "_log_incoming_request",
  "pathname": "/app/./ResponseAction.py",
  "content": {
    ... // dict values passed into the logger will be inserted here
  },
  "request_trace": {
    "request_id": "MzA5MjAzLTE2MjUwMjU2ODA2NTI="
  },
  "pod_metadata": {
    "service": "klinify-proxy",
    "pod": "klinify-proxy-76689c6458-584j8",
    "version": "v2.13.1",
    "hash": "d949e2a"
  },
  "exc_info": "..."
}
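
To make this concrete, here is a minimal sketch of a formatter that produces roughly this structure on top of Python’s standard logging module. The field names mirror the structure above and the environment variables are the ones introduced later in this article, but KlinifyJsonFormatter itself is an illustrative assumption, not our actual implementation.

import json
import logging
import os

class KlinifyJsonFormatter(logging.Formatter):
    """Illustrative formatter: renders every record in the structure above."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "timestamp": self.formatTime(record),
            "process": record.process,
            "thread": record.thread,
            "message": record.getMessage(),
            "name": record.name,
            "funcName": record.funcName,
            "pathname": record.pathname,
            # Structured data passed by the caller ends up under "content".
            "content": getattr(record, "content", {}),
            "request_trace": {"request_id": getattr(record, "request_id", None)},
            # Injected via the Dockerfile/Kubernetes config covered later in this article.
            "pod_metadata": {
                "service": os.environ.get("MY_SERVICE_NAME"),
                "pod": os.environ.get("MY_POD_NAME"),
                "version": os.environ.get("COMMIT_REF"),
                "hash": os.environ.get("COMMIT_HASH"),
            },
        }
        if record.exc_info:
            entry["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(entry)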

Define Logging Conventions

Developers need to know exactly what to log. Leaving this decision entirely to them does two things:

  1. It creates mental overhead because they need to decide what to include/exclude.
  2. It increases the chance of missing out on important lines and including unnecessary ones.

Have some rough rules of thumb in your internal docs. The following is an excerpt from our docs. Your exact conventions will depend on your context:

When to Log
-----------
A Service can log data to record one or more of the following:
  - An intention to do something (e.g. going to send a response)
  - Successful completion of some activity (e.g. received a request / updated the db)
  - Failure to complete an activity (e.g. error while saving to the db)
  - The state of the app at a certain point (e.g. result of an intermediate computation)

Do not log PII, except as DEBUG logs. Watch for it in URL params, responses, and queries.
Do not log passwords or secrets. Be sure to scrub them from URLs and headers (a helper for this is sketched after this excerpt). Don't even log them as debug logs.
Log exceptions where they happen (e.g. call LOGGER("exception", {"message": "…"}) within the try/except block).

You must always log:
  - All requests to Services
  - All responses by Services
  - When a service has initialized successfully
  - When a service is shut down
  - When a service fails to do the main thing it's supposed to do, and there was no backup plan

For all logs other than the above, ask yourself:
  - When will I search for this log? This will typically be for troubleshooting something. If you don't have a clear answer, don't log.
  - What information will I need when I search for this log? Make sure you provide it in the log.
  - Do I want an alert every time the code reaches this state? Log, and set up the alert on Azure.
  - What problems can I not fix if I don't log this? If the information from this log can easily be inferred from other logs/data, don't log.
  - Am I building a new feature and need to watch the logs to make sure it's working fine? Add debug logs, and remove them once the feature stabilizes.
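
Scrubbing is easier to get right when a shared helper does it. The sketch below, assuming a simple deny-list of key names, shows one way to mask secrets in URL parameters and headers before they reach the logger; SENSITIVE_KEYS, scrub_url, and scrub_headers are hypothetical names, not part of our library.

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative deny-list of parameter/header names treated as sensitive.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "secret"}

def scrub_url(url):
    """Mask sensitive query parameter values before the URL is logged."""
    parts = urlsplit(url)
    query = [
        (key, "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))

def scrub_headers(headers):
    """Return a copy of the headers with sensitive values masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in headers.items()
    }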

Another example of a decision you need to make is what each of your log levels (INFO/WARN/ERROR/CRITICAL) should be used for.

Once you specify and implement your conventions, whoever trawls the logs can make safe assumptions and decrease time-to-spot-bug.

Tie Together Your Logs

You need to include a mechanism for this, especially if your architecture is service-oriented. All our requests — both from clients and internal services — are tagged with x-klinify-request-id and logged under request_trace.request_id. Now, whenever we need to investigate a certain request, we filter with that request_id and the chain of causality appears right in front of us.
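
For illustration, here is a minimal sketch of how such a tag might be attached in a Flask-style service. The header name matches ours, but the middleware itself is an assumption about how you could wire it up, not our actual code; the logger then reads this value when building the request_trace field.

import uuid
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def attach_request_id():
    # Reuse the caller's ID if one was sent, so every hop in the chain shares
    # the same request_id; otherwise mint a new one at the edge.
    g.request_id = request.headers.get("x-klinify-request-id") or uuid.uuid4().hex

@app.after_request
def propagate_request_id(response):
    # Echo the ID back (and forward it on any internal calls you make) so each
    # service logs the same request_trace.request_id.
    response.headers["x-klinify-request-id"] = g.request_id
    return response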

Include Metadata for Quick Filters

The pod_metadata tells you exactly which service and pod the log is from, what version the pod is on, and the commit hash. If you have a bug, you know which code version to look at, which is useful during rolling releases. You will also know if a bug is manifesting only in a specific pod (OOM and uneven load distribution may cause this).

To add these in, we mandated the use of git tags. You can then write shell scripts or integrate the metadata within your CI/CD pipelines to inject them into the pod. For example, this appears in our Dockerfiles:

ARG COMMIT_REF
ENV COMMIT_REF=${COMMIT_REF}
ARG COMMIT_HASH
ENV COMMIT_HASH=${COMMIT_HASH}

And this appears in our K8s deployment files:

...
- name: MY_SERVICE_NAME
valueFrom:
fieldRef:
fieldPath: metadata.labels['app']
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
...
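
On the consuming side, the logger only has to read these variables back out of the environment when it builds the pod_metadata block. A minimal sketch, assuming the variable names above (get_pod_metadata is an illustrative helper, not our library’s API):

import os

def get_pod_metadata():
    """Illustrative helper: read the injected values back into the log structure."""
    # MY_SERVICE_NAME and MY_POD_NAME come from the Kubernetes fieldRef entries,
    # COMMIT_REF and COMMIT_HASH from the Dockerfile ARG/ENV pairs populated by CI.
    return {
        "service": os.environ.get("MY_SERVICE_NAME", "unknown"),
        "pod": os.environ.get("MY_POD_NAME", "unknown"),
        "version": os.environ.get("COMMIT_REF", "unknown"),
        "hash": os.environ.get("COMMIT_HASH", "unknown"),
    }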

Define a Logging Library

Instead of reinventing the wheel for every service, you should define a logging library. The library should enforce the logging requirements, and make it super easy to do the right thing. The following is how we use our logging library in the codebase:

from klinify_core.klinify_logger import get_klinify_logger
LOGGER = get_klinify_logger("custom_name")
LOGGER("info", {"message": "human readable string, mandated by internal standards", "otherKeys": "otherValues that go into .content field"})

That’s it! The library handles all the metadata, structuring, validation, etc.
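
We can’t reproduce the whole library here, but to give a rough idea of its shape, a stripped-down sketch might look like the following, assuming a JSON formatter such as the earlier one is attached to the handlers. This is illustrative only; the real get_klinify_logger enforces more of the conventions above.

import logging

def get_klinify_logger(name):
    """Rough sketch of the wrapper's shape; the real library does more validation."""
    logger = logging.getLogger(name)

    def log(level, payload):
        payload = dict(payload)
        # Enforce the convention that every log entry has a human-readable message.
        if "message" not in payload:
            raise ValueError("log payload must contain a 'message' key")
        message = payload.pop("message")
        # Everything else lands under the structured "content" field, which a
        # JSON formatter (like the sketch earlier) picks up from the record.
        logger.log(getattr(logging, level.upper()), message, extra={"content": payload})

    return log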

Reduce The Noise

Most of your noise should be gone by now if your conventions are well defined, but it’s worth taking another look. Go through your logs, see which lines you don’t need at all, and drop them. For example, a simple fix we made was to stop logging uwsgi 2xx/3xx responses. This slashed our log volume and saved us quite a bit of money! All it took was the following 3 lines in our uwsgi.ini:

disable-logging = true
log-4xx = true
log-5xx = true

Define Reusable Queries

Define the most common queries for reuse: you can store them as functions right where you query your logs, or in your wiki for easy copy-paste. Don’t reinvent the wheel every time you need to trawl logs.


Results

  1. I don’t dread trawling logs anymore. Since I don’t have to fight my tools, it’s actually become quite fun.
  2. You feel like Batman because you can acquire evidence quickly.
  3. Time-to-spot-problem has gone down dramatically. We’ve had the chance to stress-test the new logging structure during outages.

Now that you’re armed with this wisdom, I hope your log trawling journey is much smoother than mine. Go look at your codebase with a new lens. What can you improve?

Special thanks to Yao Wei for working on a massive chunk of this problem and making my work way easier!
