February 17, 2020

Optimising Queries & The Value of Crash Reports

Sometimes things break. This post details how I responded to a Jetsam user's phone crashing by optimising a sloppy query.

I recently shoved a community app that I've been developing out into the world, kicking and screaming. It was launched in a local theatre alongside a series of talks about plastic waste, and was installed about 50 times in the first 24 hours (not exactly meteoric, but that works out to a 16% conversion rate from a very niche audience). It's a modern app, with all the usual bells and whistles when it comes to crash analytics & usage monitoring, so I was quite pleased for the first 5 days when usage was high and errors were silent.

Then, I got an email from Bugsnag: OutOfMemoryError in MainActivity

Maps and data

There's a bit of an aside required to understand the cause of the out of memory exception. One of the main app features is a heatmap that shows an aggregate of all of the metric data collected within a given time window. When the map loads, or the map viewport is moved (after a debounce), the app sends a request to the server for all of the collected data within a lat/long boundary for a given stretch of time, and the server dutifully responds with all of the data within that boundary that was created in that stretch of time.

Every bit of data.

Individually.

Maybe the data set for a given area is a few hundred data points, maybe it is a few thousand if you zoom out a bit; this is fine on a well-behaved device that allocates enough processing power and resources to each running app. I've found that this assumption goes out of the window when dealing with a certain subset of lower-end Samsung devices (not to impugn Samsung as a whole).
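The "request once the viewport settles" behaviour boils down to a debounce. Here's a minimal sketch in Python (illustrative only; Jetsam's app is not written this way, and the class name is mine):

```python
import threading

class Debouncer:
    """Run `callback` only after `delay` seconds with no new triggers,
    mirroring the map's "request once the viewport settles" behaviour."""

    def __init__(self, delay, callback):
        self.delay = delay
        self.callback = callback
        self._timer = None

    def trigger(self):
        # Each viewport move cancels the pending request and restarts the clock,
        # so only the final resting position results in a network call.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.delay, self.callback)
        self._timer.start()
```

However it's implemented, the upshot is the same: rapid pans and zooms collapse into a single request for the final boundary.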

The Devil, The Details

It's important to note that many of the app's features were developed under tight time constraints. After all, I have a full-time job and Jetsam is a part-time endeavour.

The data is retrieved by a PostGIS query that takes the requested boundary points, creates a polygon, and filters the stored metrics with ST_COVEREDBY (metrics are stored with a GEOGRAPHY location type, so some wrangling is required in the query as well). An eleventh-hour change to the query before the launch of the app also restricted the fetched columns to a weight value and the location itself, taking the overall size of a single rendered metric (in JSON format) from 680 bytes to around 150 bytes and saving processing time otherwise spent scrubbing the data on the application side.
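The boundary-to-polygon step can be sketched like this (the helper name and the query fragment in the comment are my own illustrations, not Jetsam's actual code or schema):

```python
def bbox_to_wkt(min_lng, min_lat, max_lng, max_lat):
    """Build a closed WKT polygon (lng lat order) from a viewport bounding box."""
    corners = [
        (min_lng, min_lat),
        (max_lng, min_lat),
        (max_lng, max_lat),
        (min_lng, max_lat),
        (min_lng, min_lat),  # WKT rings must end where they start
    ]
    ring = ", ".join(f"{lng} {lat}" for lng, lat in corners)
    return f"POLYGON(({ring}))"

# The resulting polygon can then back a filter along these lines
# (table and column names are assumptions):
#   SELECT weight, location
#   FROM metrics
#   WHERE ST_COVEREDBY(location::geometry,
#                      ST_GeomFromText('POLYGON((...))', 4326))
```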

As a starting point, this is totally fine. The heatmap layer rendered on the app takes a list of points and generates a heatmap. Done! Looking deeper, this completely ignores the computation required to aggregate a long list of points into a heatmap, work out the display values, generate the graphics, etc. Apparently, this is not an insignificant amount of computation when you have an incredibly aggressive process supervisor breathing down your neck.

A Quick Fix Overdue

The inefficiency of the metric retrieval was noted from the beginning, and an issue was filed on GitHub as a reminder to do something about it when there was enough time. On the other hand, the server built to support Jetsam was finished fairly early in the app development process, and didn't need much attention while the app itself was being built. This meant that less scrutiny went into how it could be optimised prior to release, and the issue filed about the (frankly wasteful) usage of network bandwidth languished on GitHub, waiting for the perfect storm.

Ultimately, the fix was quite a small change. The amount of data sent to (and subsequently processed by) the app needed to be reduced. The nature of the data meant that a lot of data points were quite close together, and the quality wouldn't noticeably suffer if the locations were fudged a little bit. If you have a dozen points that are all within a metre of each other, does it really matter whether you send a list of 12 metrics with the value 1 or a single metric with the value 12? (Spoiler: it doesn't)

The server now uses ST_SNAPTOGRID to group the data into manageable (and configurable) chunks; if the resolution of data needs to be increased or decreased this can now be achieved by altering the grid size. This has shown around an 80% reduction in the number of data points being sent to the app, with almost no noticeable decrease in the visual quality of the heatmap.
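The effect of the grid snapping can be simulated in a few lines of Python (a sketch of the idea rather than the actual server code; the grid size here is an arbitrary example, not Jetsam's configured value):

```python
from collections import defaultdict

def snap_and_aggregate(points, grid_size):
    """Mimic ST_SNAPTOGRID + GROUP BY: snap each (lng, lat, weight) point
    to the nearest grid intersection, then sum the weights per cell."""
    cells = defaultdict(int)
    for lng, lat, weight in points:
        snapped = (round(lng / grid_size) * grid_size,
                   round(lat / grid_size) * grid_size)
        cells[snapped] += weight
    return [(lng, lat, w) for (lng, lat), w in cells.items()]

# A dozen weight-1 readings within a metre of each other collapse
# into a single heavier point once snapped to a ~50 m grid.
cluster = [(0.50001 + i * 1e-6, 51.00002, 1) for i in range(12)]
aggregated = snap_and_aggregate(cluster, grid_size=0.0005)
```

Tuning the resolution is then just a matter of changing `grid_size`, which is exactly the knob the server now exposes.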

As a nice privacy-oriented bonus, it is now entirely impossible to determine the exact location that a metric was captured at, even if the value is still 1; not only does ST_SNAPTOGRID normalise the location points into a grid, but it also truncates them to the precision of the grid size.

What Next?

The next concern for querying the metric data is the efficiency of the queries as the data set grows. The database stores metrics in a TimescaleDB hypertable with a chunk interval a bit larger than the average time span queried. There are additional geo-spatial indexes on the locations of the metrics, but some deeper digging might be needed to confirm they're actually being used, given the amount of casting that happens in all of the PostGIS function calls.
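For a sense of what that setup implies, here is some illustrative DDL (held in a Python string purely for presentation; the table, columns, chunk interval, and index name are all assumptions, not Jetsam's actual schema):

```python
# Illustrative DDL only -- everything named here is an assumption.
SETUP_SQL = """
CREATE TABLE metrics (
    created_at  TIMESTAMPTZ NOT NULL,
    location    GEOGRAPHY(POINT, 4326) NOT NULL,
    weight      INTEGER NOT NULL DEFAULT 1
);

-- TimescaleDB partitions the table into time chunks behind the scenes.
SELECT create_hypertable('metrics', 'created_at',
                         chunk_time_interval => INTERVAL '7 days');

-- GiST index for the geo-spatial lookups; note that casting the column
-- (location::geometry) in a query can stop this index being used.
CREATE INDEX metrics_location_idx ON metrics USING GIST (location);
"""
```

The comment on the index hints at the casting concern above: an index on a GEOGRAPHY column doesn't automatically cover queries that cast the column to GEOMETRY first.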

All of this goes some way towards reducing the number of records that end up being queried, but when the next performance bottleneck or crash report comes in, it'll need to be revisited.

Not a bad start, though.