JSON conversion as a bottleneck in a Python API


tags: performance, python, software architecture
created: Wed 25 October 2017; status: published;


A new acquaintance recently discussed a scaling problem his dev team is trying to solve. Their Django/Python app has an API that reads a table with long rows but an otherwise simple structure. The API is under heavy load and needs to be fast, but their servers are struggling to keep up.

Initially, their database was doing the JSON conversion and running at high CPU. They shifted the JSON conversion to the application servers, which then took the CPU hit and were straining. The team's performance analysis pointed straight at the JSON conversion process in Python.

What should they do?

This leader, who has experience in Java and other languages but not Python, was wondering whether to move the service to Java. This can work, as even poorly written Java can beat Python for pure performance at many tasks. But at what cost? I cautioned him against that as a first step, because of these costs:

  1. The team has little Java experience, which means extra development time.
  2. Newly developed code will be buggier than existing code for a while, and handling that correctly costs extra time.
  3. The team would have to create a second versioning and deployment system for the new Java piece, outside their existing Python installs.
  4. Java has a completely different set of performance and security characteristics than Python, which the team would have to learn.

All of this means the Java implementation would be expensive to produce and then it would be expensive to maintain.

Speed Up the Python Code Path

My suggestion was to stick with Python but explore other JSON libraries that use compiled C under the hood. The Python Standard Library's json package is a pure Python implementation. This is for good reasons: it is readable, fully portable to any Python version, and doesn't require extra compilation tools. It's a friendly, pythonic library in all senses.

The json library is not fast, however, and sometimes you need it to be. I pointed out that Python allows writing extensions in C, which then execute as compiled code. This can complicate deployment, because the extension has to be compiled at some point, which introduces a new set of build tools on either a production server or an intermediate build server. On Linux, however, standard packages often provide pre-built versions, avoiding that extra work. Let's not get sidetracked by deployment issues, but it's important to be aware of them; they are the same kind of cost I raised for the Java idea in point 3 above.

Several people have written C implementations of Python's JSON library. They are, in no particular order: jsonlib2, simplejson, yajl (yet another JSON library), and lastly ujson (UltraJSON). Brett Langdon analyzed these in The Fastest Python JSON Library and compared them to the baseline json library. He found ujson to be the fastest, but they all offer 10-50x speedups in encoding and decoding.
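Benchmarks like these are worth reproducing on your own data before trusting them. Here is a minimal sketch, assuming ujson is installed and using made-up rows that mimic the "long rows, simple structure" shape described above:

```python
import json
import timeit

import ujson  # pip install ujson; swap in any candidate library

# Hypothetical rows mimicking a long-row, simple-structure table.
rows = [{"id": i, "name": "x" * 500, "score": i * 0.5} for i in range(1000)]

for name, dumps in [("json", json.dumps), ("ujson", ujson.dumps)]:
    seconds = timeit.timeit(lambda: dumps(rows), number=100)
    print(f"{name}: {seconds:.3f}s for 100 encodes")
```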

So you just switch your project to ujson or another one of these, right? No!

Now that you have a C extension, you have to think about:

  • Does it have memory leaks? ujson has pull requests open to fix several memory leaks.
  • How does it differ from the standard library, and thus from our existing JSON output? ujson handles long types differently and handles unicode differently (see the sketch after this list).
  • Is the library maintained? ujson's last release was January 2016, and that release is 35 commits behind git master. So is it abandoned? Several people and companies have forked it to fix bugs, or are considering switching libraries.
  • What is the license? ujson uses the EA License, which is basically MIT but with a special mention of Electronic Arts. Almost any license requires attribution, which you should be tracking in your project.
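One way to surface behavioral differences on your own data is to run edge-case values through both libraries and diff the output. A rough sketch; the sample values here are my own guesses at where C implementations have historically diverged from the standard library:

```python
import json

import ujson  # the candidate under evaluation

# Edge cases: big ints, non-ASCII text, float precision, nulls.
samples = [2 ** 70, -2 ** 70, "héllo \u2603", 1.0000000000000002, {"k": None}]

def safe_dumps(dumps, value):
    try:
        return dumps(value)
    except Exception as exc:  # e.g. OverflowError on out-of-range ints
        return f"<error: {exc!r}>"

for value in samples:
    std = safe_dumps(json.dumps, value)
    alt = safe_dumps(ujson.dumps, value)
    if std != alt:
        print(f"difference for {value!r}:\n  json:  {std}\n  ujson: {alt}")
```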

So, should you use ujson? Probably not, in this case. This analysis needs to be made for every new library a project considers. Go apply it to the other libraries, or to others that have arisen in the last year or two. Don't choose the fastest; choose the most correct and best-supported.

Using the New Library

I next told him that if he follows the suggestion to try alternate JSON libraries, he shouldn't just drop one in as a replacement and deploy it. The correct process has several steps, depending on how far you want to go.

At the very least, you should have (or write) a test suite for your JSON conversion: unit tests that take raw data and check that the output JSON is correct. This may seem silly when using the built-in JSON library, but when transitioning to a new library, you need to run the same raw data through both libraries and compare the results. Don't use simplistic, fake data either; use real data with as many variations as you can. Test empty data, 1 row, 2 rows, a thousand rows, data with null, "null", None, "none", empty strings, empty lists, empty dictionaries, etc.
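A sketch of what such a suite might look like with pytest, assuming ujson is the candidate; the fixtures here are stand-ins that should be replaced with real rows exported from your table:

```python
import json

import pytest
import ujson  # the candidate library

# Stand-in fixtures; replace and extend with real exported rows.
FIXTURES = [
    [],
    [{"id": 1}],
    [{"id": 1}, {"id": 2}],
    [{"value": None}, {"value": "null"}, {"value": ""}],
    [{"items": []}, {"mapping": {}}],
    [{"id": i} for i in range(1000)],
]

@pytest.mark.parametrize("data", FIXTURES)
def test_candidate_matches_stdlib(data):
    # Compare parsed objects rather than raw strings, since key order
    # and escaping may legitimately differ (more on this below).
    assert json.loads(ujson.dumps(data)) == json.loads(json.dumps(data))
```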

One complication in testing this is that Python dictionaries historically don't have a predictable order (they do preserve insertion order as of Python 3.6), so the resulting JSON in string form won't be precisely identical. If you try to compare "string output from json" with "string output from new_json", they will be different. Even with predictable-order dictionaries, the new library may emit keys in a different order. Remember that the new library is implemented in C and works with a very different data representation than the Python-implemented json library.
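One partial mitigation, assuming the candidate accepts a sort_keys argument the way the standard library does, is to ask both for a canonical key order before comparing strings:

```python
import json

import ujson  # assumption: accepts sort_keys like json does

data = {"b": 1, "a": [None, ""], "c": {"y": 2, "x": 3}}

# A canonical key order removes one source of spurious string
# differences; escaping differences can still remain.
print(json.dumps(data, sort_keys=True))
print(ujson.dumps(data, sort_keys=True))
```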

So really, you need to serialize and then be clever about the comparison. If you have mostly simple data, you can search out keys by name and check their string-encoded representations. If the data involves slashes or quotes or other things that can change string serialization, this check is really valuable.

Secondly, if you deserialize back to a Python object, you can compare the dictionaries for identical keys and values. But if you do a ujson->ujson round trip, any quirks in ujson's encoding might get correctly undone by its own deserialization, masking an intermediate difference. After all, JSON is a data exchange format, and the consumer probably isn't using ujson or whatever library you chose. The JSON has to be equivalent in both string and object form.

You could get clever and do ujson->json and json->ujson and cross-compare the deserializations. Really, if you have the dictionary-comparison process worked out, adding these cases should be trivial.
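A sketch of that cross-comparison, treating the standard library's own round trip as the reference:

```python
import itertools
import json

import ujson  # the candidate library

def cross_check(data):
    """Serialize and deserialize with every combination of libraries
    and verify all round trips agree on the resulting object."""
    libs = {"json": json, "ujson": ujson}
    results = {
        (enc, dec): libs[dec].loads(libs[enc].dumps(data))
        for enc, dec in itertools.product(libs, repeat=2)
    }
    reference = results[("json", "json")]
    mismatches = {pair: r for pair, r in results.items() if r != reference}
    return mismatches  # an empty dict means all four paths agree

assert cross_check([{"id": 1, "name": "a/b \"c\""}]) == {}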

This is the minimum you should do before attempting to use the new library in your project. You can do more!

Safe Refactoring

When code changes are risky but have to be correct, you can apply a guess-and-check approach. It requires some extra work, but is nearly bulletproof.

The idea is this: run both versions of the code, compare the results, and make a note of the inputs and outputs whenever they differ. When they do differ, use the result from the existing code. This ensures you introduce zero behavior changes to existing software while getting real-world validation.
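A minimal hand-rolled sketch of that pattern; the serialize_rows name is a hypothetical stand-in for your real serialization hook:

```python
import json
import logging

import ujson  # the candidate

logger = logging.getLogger("json_experiment")

def serialize_rows(rows):
    """Serve the old code path; try the new one on the side."""
    old_result = json.dumps(rows)       # existing, trusted path
    try:
        new_result = ujson.dumps(rows)  # candidate path
        if json.loads(new_result) != json.loads(old_result):
            logger.warning("mismatch for input %r", rows)
    except Exception:
        logger.exception("candidate failed for input %r", rows)
    return old_result                   # always serve the old result
```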

GitHub discussed how they approached deploying a very complex git change across their entire user base in an excellent blog post. They simultaneously released a Ruby library called Scientist, which has since been reimplemented in other languages. The Python version is called laboratory.
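Based on my reading of laboratory's documented interface (verify the details against its README), an experiment looks roughly like this:

```python
import json

import laboratory
import ujson

rows = [{"id": 1, "value": None}]

experiment = laboratory.Experiment()
with experiment.control() as c:
    c.record(json.dumps(rows))    # existing, trusted path
with experiment.candidate() as c:
    c.record(ujson.dumps(rows))   # candidate path

# conduct() runs both blocks, records timings and mismatches, and
# returns the control value, so callers see no behavior change.
result = experiment.conduct()
```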

When your confidence is high enough, the new code can become the default and the old path removed. This is how to swap out critical code in a system. Implement it, check it for a period in production, and ultimately deprecate the old code.

Remember that the original suggestion was about switching libraries to get better performance. If the application is already struggling to perform, this doubles the work in the critical code path.

So instead you can run the comparison asynchronously. Logging is cheap, so dump the inputs for all requests (or just a sample, say 10%). Then collect them and run the second-path analysis on another server. This accomplishes the same thing with a very small extra burden.
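A sketch of the cheap logging side, reusing the hypothetical serialize_rows hook from above; the sample rate and logger name are arbitrary choices:

```python
import json
import logging
import random

logger = logging.getLogger("json_shadow")
SAMPLE_RATE = 0.10  # log roughly 10% of requests

def serialize_rows(rows):
    payload = json.dumps(rows)  # the only work on the hot path
    if random.random() < SAMPLE_RATE:
        # Record the serialized input; a worker elsewhere replays it
        # through the candidate library and compares outputs offline.
        logger.info("sample-input %s", payload)
    return payload
```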

Summary

Ultimately the only safe solution is to approach the replacement methodically. Dropping in a faster library could introduce lots of bugs, so testing is required. The level of testing can range from a unit test suite all the way up to the two-path refactoring process.

Despite all the work this approach entails, I still believe it should be the first attempted fix. It maintains a mostly-Python environment with minimal code change in the application. Any alternative, like switching to a separate Java service, involves the same testing, plus the extra burden of new deployments, monitoring, and maintenance.

As engineers, it's important to be clever but methodical in our solutions. Performance issues are dangerous because they are usually urgent and seem easy to fix. Do not rush to a solution. Among all the shaky solutions, you must find the stable one that will strengthen your product.