Django ORM Lessons: Optimizing SQL

I recently had the pleasure of working on a project that required SQL optimization. The data being pulled from the DB was taking quite a bit of time. The code itself was pretty straightforward and just required a bit of tweaking, but since there was so much data being pulled at all times, it became clear that code optimization wasn't enough. So the next step was to rethink the data pull.

Now here's an interesting thing about the ORM: it doesn't actually pull data from the DB until you ask for it (the docs call QuerySets "lazy", which is probably the better way to put it, but I'll go with mine). What does that mean? Well, take a look:

from django.db import models

class ModelA(models.Model):
    field1 = models.CharField(blank=True, max_length=20)
    # String references so ModelB (defined below) and the other related models resolve.
    field2 = models.ForeignKey('ModelB', null=True, on_delete=models.SET_NULL)
    field3 = models.ForeignKey('ModelC', on_delete=models.CASCADE)
    field4 = models.DateTimeField()

class ModelB(models.Model):
    field1 = models.CharField(blank=True, max_length=20)
    field2 = models.ForeignKey('ModelD', null=True, on_delete=models.SET_NULL)
    field3 = models.ForeignKey('ModelE', on_delete=models.CASCADE)
    field4 = models.DateTimeField()

When I write something like qs = ModelA.objects.filter(field2__isnull=False, field3__pk=3), the qs variable gives me a QuerySet. Notice: a QuerySet, not actual SQL data. It's still in pythonic form. Only when I want to see the actual value of the variable, by iterating over it, slicing it, or calling list() on it, is the data pulled.
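
To make that concrete, here's a rough sketch of my own (using the models above):

qs = ModelA.objects.filter(field2__isnull=False, field3__pk=3)
# No SQL has run yet; qs is just a QuerySet describing what we want.
print(type(qs))  # <class 'django.db.models.query.QuerySet'>
# Evaluating it (iterating, calling list() or len() on it) is what hits the DB.
results = list(qs)  # the SELECT is executed here
print(len(results))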

It was a refreshing reminder, because that's a very pythonic thing to do: don't do the heavy lifting unless you absolutely have to. Not to mention most people just skim past it when they read the docs (as did I).

Another important thing I ran into was prefetch_related. When you access related objects in a loop, the ORM constructs and executes a separate query for each access on top of the main query (the classic N+1 problem), and the overhead on this can be huge. When I say huge I mean huge: I tried executing the same thing without prefetch_related and the jump was tremendous, from 9 queries up to around 1.4k queries. That is astounding overhead for big searches. prefetch_related reduces the time drastically because the related rows are then fetched in batches, one extra query per relationship. Pretty smart if you ask me.
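
Roughly, the difference looks like this (a sketch using the models above; modela_set is just the default reverse accessor for ModelA's field2 foreign key):

# Without prefetch_related: one query for ModelB, plus one extra query per row
# when the related ModelA objects are fetched inside the loop (the N+1 pattern).
for b in ModelB.objects.all():
    names = [a.field1 for a in b.modela_set.all()]

# With prefetch_related: one query for ModelB plus one batched query for all the
# related ModelA rows, which Django then joins up in Python.
for b in ModelB.objects.all().prefetch_related('modela_set'):
    names = [a.field1 for a in b.modela_set.all()]  # served from the prefetch cache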

I also used the same idea in another way: I distributed the fetch across two different views. On the preliminary landing page, which did not pull the actual data, I fetched all the foreign key values, passed them along via the session (as lists), and used them directly in the actual evaluation.

So instead of

SomeModel.objects.filter(ModelB__isnull=True, field2__pk__in=ModelB.objects.filter(foo=True).values_list('pk', flat=True))

you get

SomeModel.objects.filter(ModelB__isnull=True, field2__pk__in=[120, 125, 1450, 291])

Now that may seem like a very small change, but it reduced the time drastically, by almost 30 seconds in some places.
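
In code, the two-view split looked roughly like this (a sketch; the view names, template names, and session key are placeholders of mine, the fields come from the example above):

from django.shortcuts import render

def landing_view(request):
    # Preliminary landing page: fetch the foreign key pks once and stash them
    # in the session as a plain list.
    request.session['modelb_pks'] = list(
        ModelB.objects.filter(foo=True).values_list('pk', flat=True))
    return render(request, 'landing.html')

def results_view(request):
    # Actual evaluation: the __in clause now gets a literal list of ids
    # instead of a nested subquery.
    pks = request.session.get('modelb_pks', [])
    qs = SomeModel.objects.filter(ModelB__isnull=True, field2__pk__in=pks)
    return render(request, 'results.html', {'object_list': qs})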

Don't confuse splitting up the Python code with splitting up the SQL. If you were to do something like

# modelb_values is still just a lazy QuerySet at this point, not a list of ids.
modelb_values = ModelB.objects.filter(foo=True).values_list('pk', flat=True)
somemodel_values = SomeModel.objects.filter(ModelB__isnull=True, field2__pk__in=modelb_values)

the resulting SQL would still have the first query inlined as a nested subquery. So it's important to actually separate the two: either distribute the work (as above) or use prefetch_related.
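
If you do want to keep it all in one view, one simple option (a sketch, same caveats as above) is to force the first queryset to evaluate before reusing it:

# list() runs the first query immediately, so the second filter receives a
# plain list of ids rather than a nested SELECT.
modelb_values = list(ModelB.objects.filter(foo=True).values_list('pk', flat=True))
somemodel_values = SomeModel.objects.filter(ModelB__isnull=True, field2__pk__in=modelb_values)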

Another good option is to explore atomic transactions. With the ATOMIC_REQUESTS setting, Django wraps all the queries in an HTTP request in a single database transaction (and transaction.atomic() does the same for an arbitrary block of code), so the queries get bundled together and nothing is committed until the end. Now, there are plenty of reasons why you shouldn't blindly use atomic transactions, and the Django docs go into detail about that, but I found that it sped things up here.
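
For reference, this is roughly what the two flavours look like (a minimal sketch, not my exact code):

from django.db import transaction

# Wrap an arbitrary block of queries in one transaction: nothing is committed
# until the block exits, and everything rolls back if it raises.
with transaction.atomic():
    obj = ModelA.objects.get(pk=1)
    obj.field1 = 'updated'
    obj.save()

# Alternatively, setting 'ATOMIC_REQUESTS': True on the database entry in
# settings.py DATABASES wraps every view (i.e. every HTTP request) in a transaction.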

Pipelining using Celery

So, this is what I was able to do. I remember reading about pipelining in my Computer Architecture class, and that's essentially what I did here: I loaded the first 500-1000 records using Django's ListView (basically a Paginator) and fired off a Celery job for the next 500 while the first query set was being fetched.

The result of the job was stored in the result backend (RabbitMQ, which was also my broker) and pulled when the client (the browser) requested it via Ajax. All the browser had to do was send the job id back and we could fetch the result.
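
The view side of that looked roughly like this (a sketch; the class name, page size, and the tasks import path are placeholders):

from django.views.generic import ListView

from myapp.tasks import my_task  # placeholder import path

class RecordListView(ListView):
    model = Model  # the same placeholder model used in the task below
    paginate_by = 500

    def get_context_data(self, **kwargs):
        context = super(RecordListView, self).get_context_data(**kwargs)
        # Queue the next batch in the background while this page renders;
        # the template hands job_id to the Ajax call that polls for it.
        job = my_task.delay(500, 1000)
        context['job_id'] = job.id
        return context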

Here's a snippet of the Celery code:

from celery import shared_task

@shared_task(name='my_task')
def my_task(start, stop, **kwargs):
    # Some query that takes a lot of time (Model, foo and bar are placeholders).
    _objects = Model.objects.filter(foo=bar)
    object_list = list(_objects.prefetch_related('asdf')[start:stop])
    # The return value has to be serializable by the result backend (fine with
    # the pickle serializer; with JSON you'd return plain values instead).
    return object_list, start, stop

Now, initially I was trying to update the state of the Celery task from within the task itself (which was pretty stupid, as I realized), using current_task.update_state(). As it turned out, I couldn't get the task state to update from within the task; it needed to be handled from the worker side. Here's a post that brought me to my senses.

Anyway, once the task is complete, you can get the results from its .info attribute. But those results are stored in the result backend, which in my case was the broker (RabbitMQ), and that's the big issue here: Rabbit needs to be cleaned up manually. Usually restarting the worker should take care of that, but I wasn't able to confirm it.

from celery.result import AsyncResult

# Inside the view that handles the Ajax poll from the browser.
job_id = "a8098c1a-f86e-11da-bd1a-00112444be1e"  # the task's uuid, sent back by the client
task = AsyncResult(job_id)
if task.successful():
    return task.info  # the (object_list, start, stop) tuple returned by the task

So once the request was made from the browser, you simply return the data and fire another job for the next 500 records. That way you are always one page ahead of the pending data requests. It's not a seamless solution: there's a chance a job won't have completed in time, so you need to have a plan for that. But that really depends on your architecture.
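
Putting it all together, the Ajax endpoint ends up as something like this (again a sketch; the view name and import path are placeholders, and it assumes the task returned JSON-friendly data):

from celery.result import AsyncResult
from django.http import JsonResponse

from myapp.tasks import my_task  # placeholder import path

def next_batch(request):
    task = AsyncResult(request.GET['job_id'])
    if not task.successful():
        # The job hasn't finished (or it failed); this is where your fallback plan goes.
        return JsonResponse({'ready': False})
    object_list, start, stop = task.info
    # Queue the following batch so we stay one page ahead of the browser.
    next_job = my_task.delay(stop, stop + 500)
    return JsonResponse({'ready': True, 'results': object_list,
                         'next_job_id': next_job.id})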