For some reason, I retain pointless details longer than others. This has led to me being the go-to person for edge-cases on multiple teams in the past. The obscure YAML incantation won’t work? Ask Friedemann. Customer thinks feature Z is a must-have, but nobody seems to remember it? Ask Friedemann. Stuck on a Heisenbug? Ask … you get it.
This also massively helps with debugging. Good thing too, because I love doing it.
Strap in, this is going to be kind of a long story.
This happened at one of my first jobs, back in 2010. We wanted to optimize the delivery of our CSS and JavaScript assets, so I wrote a small PHP script that would aggregate all registered files by concatenating them and storing the result under its hash on disk. This was done dynamically at the end of the frontend process.
The new concat feature worked beautifully in development and testing, and was rolled out to our stores. After a while we got reports of broken pages with JS errors. Looking into it, we discovered that the aggregated assets delivered to the browser were incomplete. They stopped after an arbitrary amount of bytes. We looked at the files on disk, and here is where the weirdness starts.
Some of them were whole, and some of them were truncated. We went over the PHP code with a fine comb, but in the end it was a bunch of loops and file_get_contents/file_put_contents. Here is where I learned to appreciate the fact that it’s turtles all the way down: we looked into the source of the PHP interpreter to see if there was some magic involved. The code was unsurprising.
To ensure nothing with PHP or the execution model (mod_php) was wrong, I rewrote the procedure in C using fopen/fread/fwrite. A colleague experienced in systems programming paired with me on this and remarked that we should be using open with O_CREAT|O_EXCL, to ensure that no two processes will try to create and write the file at the same time. Solid advice! Weirdly, it didn’t solve the problem.
Next, we used flock to definitely prevent any other process from touching the file while we wrote our buffer. Still no change. We used strace to see what was sent to the kernel, but again, nothing looked wrong.
And then my colleague had the brilliant insight that on production, our frontend servers had the asset directory on NFS. Specifically, NFSv2, which did not support locking out of the box. The kernel FS implementation would cheerfully return success, that the lock was acquired. And it did have a lock, just not on the NFS server. Other clients could still concurrently change the file just fine.
Under high load, every deploy of a change in the frontend assets opened a window for a race between the web servers to write the bundle file.
As a solution, we put the aggregated assets into machine-local folders, resulting in a little duplicate work but no more race condition.
There are always tools to look into what is happening behind the curtain. With so much software being open source, that is often only a few clicks away.
You can debug the whole world if you care enough to find out how.
If you would like help with some of your bugs, let me know.