Speeding up data store queries with self-merge joins

For a while now Feederator allows some kind of full text queries. When items are read for the first time into the datastore an index will be created. This index is currently a break down of the single words in the title, but can be extended to the description and other attributes of an item. My first attempt of implementing the full text search on the AppEngine resulted in a very costly algorithm.
The fulltext search index is stored as a set of strings. The datastore allows you to query for items on a set as if it would be a singular attribute, like a single String for example. So you can create a query for a set of strings index={“green”, “yellow”, “red”} in this way WHERE index == “red” which would be a hit in this case. I tried to do a query like this WHERE index == “red” AND index == “green” order by date which returned with errors telling me that I need to create a datastore index for that query. I thought that the error came from the fact that I had to filters on the attribute index. But I was wrong. The problem was that I wanted to sort the results at the same time. So, by forgetting about the in-database-sorting the query works and it works darn fast. And even better: you don’t even need an index for AND connected filters like that! Which means you can save storage space (in my case almost 1GB!).

Here are some rules of thumb that I learned from this lesson. Due to the way the datastore is organized (don’t forget its BigTable nature!) you have to pay attention when designing your data model.

  1. You can query easily on sets
  2. Try to avoid non-equal filters or filters that use OR. The best filters are ones that try to find an intersection between different domains on ONE entity
  3. Sort your results in-memory to avoid the need for indexes
  4. Sets are expensive to serialize/ deserialize. This costs occure when you write or read a set from datastore. To avoid serialization costs put your full text index on a separate child object. You query on the child object only for the key. With the key you get the parent object and then you can do in-memory sorting/ ranking
These links helped me a lot to understand how to optimize my full text search:
Related:  Feederator Beta out!
Now a beer! Did you like this post? It often takes me several hours of my free time to write one. If I was your neighbour would you offer me a beer for the hard work I did for you? The beauty is: beers can be teletransported with a painless donation using Paypal. A beer in a Swiss bar costs about USD $4.80. Or use this affiliate link if you order something on Banggood and I'll get a small kickback. Thank you!