What a n00b!

Debugging a segfault from Ansible

I did a thing this week that solved^w worked around a really interesting problem and helped me learn a good deal about the inner workings of Ansible. I thought I'd share, since I picked up a lot of really useful Ansible debugging tools along the way, and maybe I can help someone else who's encountering this same problem.

tl;dr:

If you are getting segfaults from Ansible on macOS, specifically while interacting with AWS services, try adding the environment variable no_proxy='*' to your ansible-playbook command.
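For example (the inventory and playbook names here are just placeholders, not from my project):

    no_proxy='*' ansible-playbook -i hosts site.yml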

The symptom

Ansible playbook runs on macOS were sometimes (often) dying with "ERROR! A worker was found in a dead state". This was happening in tasks that were looking up data via the AWS API (specifically Credstash) to render templates.

The problem was intermittent for me, but no one else on my project could reproduce it. I also couldn't reproduce it on a Linux VM, as I'll talk about in a bit. Convinced it was something in my virtualenv or on my laptop, but unable to continue without fixing this, I dug deeper.

I was using Ansible 2.3 on macOS 10.12.

The approach

My initial approach involved trying to debug using the tools built into Ansible. These work at a few different layers, which I'll talk about in a bit.

When those just led me to "yep, worker is ded", I started to reach for a debugger. I've used pdb in the past, but that didn't work because of Ansible's execution model (again, more on that in a bit). I found another blog post, with a broken link, suggesting that epdb was the way to go, but it had the same issue. I did find out later that a debugger is now built into Ansible!

With all of those options exhausted and in order to keep going, I resorted to adding some debug statements in the upstream libraries.

Built-in Ansible Debugging Tools

But first, let's talk about some of those tools built into Ansible!

Logging

My project did not have logging enabled initially. This was one of the first things I enabled, because I thought maybe it would log something a bit more useful than the generic error message I received. Unfortunately for me, enabling it only confirmed what I already suspected -- a worker was dying unexpectedly. This was generally helpful to have enabled, but not useful in my situation.

Enabling it is simple: update your ansible.cfg with a log path.
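Something like this in ansible.cfg does it (the path is whatever you want):

    [defaults]
    log_path = /path/to/ansible.log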

Verbosity settings

Ansible has a couple of options for controlling the verbosity of its output. I'm not clear on exactly what does what, but I followed the recommendation of a maintainer in an issue on GitHub. I ran with both the environment variable ANSIBLE_DEBUG=1 and -vvvv set, and gained a whole wealth of info. This also made it easier to add tracing statements, which helped in narrowing down the issue.
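Concretely, that's a run like this (same placeholder inventory and playbook as above):

    ANSIBLE_DEBUG=1 ansible-playbook -vvvv -i hosts site.yml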

ANSIBLE_KEEP_REMOTE_FILES

Ansible writes out Python files to a .ansible/tmp directory, executes them, then deletes them once it's done. If you set ANSIBLE_KEEP_REMOTE_FILES=1, it will leave those files in place for you to inspect later. This can be useful if you want to use a debugger (read the next section instead!) or otherwise inspect what's running during a task. I went on a bit of a side quest here, but it was ultimately not fruitful because the files were not being written in my situation.
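For example (again with placeholder names):

    # leftover module files end up under ~/.ansible/tmp/ on the target host
    ANSIBLE_KEEP_REMOTE_FILES=1 ansible-playbook -i hosts site.yml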

Ansible debugger

This was one I learned about while writing this: you can add strategy: debug to a play to launch a debugger when a task in that play fails. It's really handy, but it wouldn't help in this situation since the worker wasn't dying gracefully.
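As a sketch, a play using it looks something like this (the hosts pattern and role name are made up):

    # playbook.yml (sketch)
    - hosts: webservers
      strategy: debug
      roles:
        - my_role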

You can also enable this with an environment variable: ANSIBLE_STRATEGY=debug

The Fun Part

I first started encountering this error after making a somewhat major change in the project to some of the twistier parts of the environment. When I did a bit of Googling for this error, most folks were talking about running out of memory and the task being killed by the OOM killer. That didn't appear to be the case for me (unlike most of them, I had a machine with 16G of memory and quite a bit of it free). Still, I ran the playbook under time -l to get memory usage. It seemed perfectly fine, hanging out around 68M or so.
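For reference, that's macOS's BSD time with the -l flag, which prints the maximum resident set size along with the timing info; you'll probably need the full path to avoid the shell builtin (playbook name is a placeholder):

    /usr/bin/time -l ansible-playbook -i hosts site.yml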

Disabling a particular template let me get past the error (but without that file being rendered, obviously), so I was convinced that something in this particular change was causing issues. I added a dummy template earlier in the role that did another lookup, just to see if the problem was related to the content of a particular lookup, but it failed in the same way too.

I talked with a few coworkers and had them run Ansible from my branch, and no one else could reproduce it. They were on an older version of macOS than I was, but everything else was the same. I tried it on a CentOS 7 VM as well and was unable to reproduce it there either.

After that, I spent a bit of time increasing Ansible's verbosity. I enabled logging and turned on the higher verbosity settings. I was able to see that the last thing it did was a lookup in Credstash for a secret. After that, the parent process was cleaning up when it found the dead worker.

Loading LookupModule 'credstash'
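For context, a Credstash template lookup generally looks something like this (the secret name is made up, not one of ours):

    {{ lookup('credstash', 'some-secret-name') }}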

Next priority, then, was to get whatever information possible out of these workers. Ansible forks some processes to work in parallel, throwing tasks onto a queue and letting those workers chew through the jobs. This is managed in lib/ansible/executor/task_queue_manager.py, and a coworker helped point out that there were only 3 exit codes that could cause this error, found in has_dead_workers(). Some debug logging in this function (display.debug(..)) helped us figure out that the workers were segfaulting (exit code -11).
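For the curious, those exit codes are just negated signal numbers from multiprocessing; the ones Ansible treats as dead are -9, -11, and -15, if I'm reading the source right. A quick snippet (an illustration, not Ansible code) shows what they map to:

    import signal

    # A negative Process.exitcode means the worker died from a signal.
    # signal.Signals is Python 3.5+; it just turns the number into a name.
    for code in (-9, -11, -15):
        print(code, signal.Signals(-code).name)  # SIGKILL, SIGSEGV, SIGTERM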

Since we knew that the last thing it was doing was attempting to load the Credstash lookup module, we started there.

A few (dozen) display.debug() statements later, we discovered that when it segfaulted, it was in the credstash.getSecret() call. Knowing that this rabbit hole was deep, I first spent a bit of time reviewing the changelogs for both Credstash and boto3, with no success in finding anything useful.

When there was nothing useful to be found there, I started adding trace logging all over the third-party libraries (boto, botocore, and requests) and finally got to the failing call, a function inside Python's standard library: proxy_bypass(), which was being passed the DynamoDB URL for my region.
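Here's a minimal sketch of the failure mode as I understand it -- not Ansible code. The hostname is just an example, and on Python 2 the function lives at urllib.proxy_bypass instead:

    import os
    import urllib.request

    # On macOS, proxy_bypass() consults the SystemConfiguration framework,
    # which is not safe to use in a forked child process; Ansible's forked
    # workers were hitting exactly that.
    pid = os.fork()
    if pid == 0:
        urllib.request.proxy_bypass('dynamodb.us-east-1.amazonaws.com')
        os._exit(0)
    else:
        _, status = os.waitpid(pid, 0)
        if os.WIFSIGNALED(status):
            print('child died with signal', os.WTERMSIG(status))  # 11 == SIGSEGV
        else:
            print('child exited cleanly')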

One quick Google search later, I found a bug in Python related to exactly this. The workaround listed there (adding the no_proxy='*' environment variable) addressed it, and there were no more segfaults.

I haven't really decided where to go from here, aside from adding that environment variable to my Ansible calls. I can add it to my .zshrc file so it's always set, which is one way to address it... but only for myself. At some point I may take a pass at fixing the underlying bug, but a lot of folks much smarter than me have already worked on it and haven't solved it yet.
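Concretely, that's just a line like this in ~/.zshrc:

    export no_proxy='*'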
