Speeding up a database intensive Django command

In Python web framework Django there is an option to create custom commands to run some scripts for your website from console. In this post I’ll look at a particularly slow command from our adaptive learning system library proso-apps and what I did to make it faster.

The command is called load_flashcards and its sole purpose is to load some content from a JSON file to database. To give an example of what the content can be, let’s look at what the JSON file contains in project outlinemaps.org. There is a file geography-flashcards.json, which contains the list of all maps, the list of all terms (countries, cities, rivers, mountains, etc. ) and then a list of all flashcards. A flashcard is a pair of a terms and a its identifier on a map.

The file has grown quite large. It has about 50k lines (over 1MB) and is so large that Github won’t preview it. The command has to read all the data from the JSON file and for each item (map, term or flashcard):

  • create it, if it doesn’t exists
  • update its properties if they were changed
  • mark it as inactive (or delete – if specified by a command line option) if it’s no longer in the JSON file.

Running the load_flashcards command on this huge file takes several minutes, so it’s worth to have a look at what takes so long. Note that most of the time only a small part of the file is changed, yet it takes so long, because it has to check everything.

So what does take so long? I would guess reading from and writing to the hard drive. While the JSON file is read just once (1 I/O operation) there are many I/O operations when using the database. While there are some profiler tools for python (e.g. cProfile, or ), in this case I knew the bottleneck is database access, so I used unix time command to measure the overall time the script took to execute. Usage of the time command is pretty straightforward. Just prepend it to the actual command:

time ./manage.py load_flashcards geogrpahy-flashcards.json

Avoid Model.save()

The first thing I noticed was that the script didn’t have to call Django Model.save() method on each database object, but only on those that actually changed. So instead of blindly pouring all data from JSON objects to database objects, I started to check whether there are actually any differences between the two.

The basic idea is illustrated by the following piece of code. Note that the code assumes that variable db_flashcard contains the database ORM object and variable flashcard contains the dict parsed from the JSON file,

if db_flashcard.description != flashcard.get("description"):
    db_flashcard.description = flashcard.get("description")
    modified = True
if modified:
    db_flashcard.save()

You can see all the changes I did based on this idea in this commit.

Avoid Model.objects.get() and Model.objects.filter()

Furthermore, I found out that the command was heavily using Model.objects.get() and Model.objects.filter() to look up related objects, one at a time. For example, to look up a Term for a Flashcard. While there were thousands of flashcards, it resulted in thousands of database queries. Even though each query took just a few miliseconds, all of them combined made a significant slow-down.

The solution is to load all objects (e.g. all Term objects) by one database query and save them to a Python dictionary.

db_terms = {}
for db_term in Term.objects.all():
    db_terms[db_term.identifier] = db_term

Later we can retrieve them from the dictionary in no time.

for flashcard in flashcards:
    term = self.db_terms.get(flashcard["term"])

This is the essence of the solution. In reality, the code is a bit more complicated, as you can see in this commit and another smaller commit.

Conclusion

The main takeaway is pretty simple. Avoid database access whenever possible. It’s way too expensive.

Leave a Reply

Your email address will not be published. Required fields are marked *