“You can’t sacrifice partition tolerance” is one of the more influential articles I have read on distributed systems design. It talks about applying the CAP theorem: in any distributed system design, consistency, availability and partition tolerance form a trilemma – you can choose at most two of the three.
Theoretical computer science aside, the point it makes is that any practical distributed system needs partition tolerance built in. Remote nodes will die, networks will be flaky, and packets will get lost – that is how the world is.
One of the remote nodes in our system is the transmitter SDK, which collects smartphone data and sends it to our backend APIs. Smartphones running this SDK are out and about in the world, at the mercy of patchy mobile networks. When your business operations depend on smartphones and mobile networks, resilience to bad networks is crucial. To keep location data reliable in such conditions, our SDKs are designed to be offline-first.
How it works
When the network is unavailable, requests that cannot be sent are cached locally on the device. These include requests for user actions (like ‘task completed’) and requests for location data collected in the background. By extensively caching this state, the SDK maintains a journal of changes that have not yet been committed to the backend API, which allows it to recover from app crashes and phone reboots.
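A minimal sketch of such a journal, with all names hypothetical: here it is an in-memory queue for clarity, whereas the real SDK would back it with persistent storage so it survives crashes and reboots.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a journal of uncommitted requests.
// A real implementation would persist entries (e.g. to SQLite)
// so they survive app crashes and phone reboots.
class RequestJournal {
    // A cached request: its type, payload, and creation time.
    record PendingRequest(String type, String payload, long createdAtMillis) {}

    private final Deque<PendingRequest> journal = new ArrayDeque<>();

    // Append a request that could not be sent (e.g. network unavailable).
    void append(PendingRequest request) {
        journal.addLast(request);
    }

    // Remove a request once the backend API has acknowledged it.
    void markCommitted(PendingRequest request) {
        journal.remove(request);
    }

    int uncommittedCount() {
        return journal.size();
    }
}
```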
The SDK uses the device’s persistent storage APIs for caching; both Android and iOS have built-in SQLite support. Device storage can comfortably hold location data for hours at a stretch – remember how we tracked a flight without any connectivity?
With caching, the client SDK becomes the source of truth. This introduces some interesting problems, such as relying on the device time: if the device time is set incorrectly, all of the data collected becomes unreliable. Read more on how we fix this to get the true time.
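One common way to guard against a wrong device clock – sketched here with hypothetical names; the linked post describes our actual technique – is to compute an offset against a trusted server timestamp and apply it to device timestamps:

```java
// Hypothetical sketch: correct device timestamps against a trusted
// server time. This illustrates the general clock-offset idea, not
// the SDK's exact implementation.
class TrueTime {
    private long offsetMillis = 0;

    // Record the offset between server time and device time, measured
    // when a backend response carries a trusted timestamp.
    void sync(long serverTimeMillis, long deviceTimeMillis) {
        offsetMillis = serverTimeMillis - deviceTimeMillis;
    }

    // Translate a device timestamp into corrected "true" time.
    long correct(long deviceTimeMillis) {
        return deviceTimeMillis + offsetMillis;
    }
}
```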
When the mobile network becomes available, the cached requests need to be sent efficiently. An obvious approach is to send them sequentially, one after the other – but that is inefficient, as every request has to wait for the response to the previous one. In an area with patchy connectivity, that can be disastrous, as network failures cascade.
To solve this, requests are prioritised on the client and sent in parallel wherever possible. For example, when the network is available, location data is sent in reverse chronological order, so that the latest data goes out first. This is critical for the live tracking experience and for calibrating ETAs.
For the client to send parallel requests, the backend APIs must not enforce any sequence. Specifically, in our case, the real-time filtering architecture had to be redesigned so that it can filter points arriving in any order of timestamps, while maintaining the context of preceding and following location data for high-quality filtering.
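One way to maintain that pre/post context for out-of-order arrivals – a sketch with hypothetical names, not our exact filtering pipeline – is to keep points in a time-sorted buffer, so the filter can always look up the nearest earlier and later points regardless of arrival order:

```java
import java.util.TreeMap;

// Hypothetical sketch of order-independent filtering context: points
// arriving in any timestamp order go into a time-sorted buffer, so the
// filter can always find the points immediately before and after one.
class FilterContext {
    private final TreeMap<Long, double[]> byTime = new TreeMap<>();

    void add(long timestampMillis, double lat, double lng) {
        byTime.put(timestampMillis, new double[] {lat, lng});
    }

    // Timestamp of the nearest earlier point, or null if none exists.
    Long before(long timestampMillis) {
        return byTime.lowerKey(timestampMillis);
    }

    // Timestamp of the nearest later point, or null if none exists.
    Long after(long timestampMillis) {
        return byTime.higherKey(timestampMillis);
    }
}
```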
When devices go offline, it is crucial to know why they went offline – was it the network conditions, a dead battery, or did the user disable location permissions? These reasons are available as alerts on our dashboard (see demo) and as webhooks. The journaling in the SDK caches the device context that powers these alerts.
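The journaled device context can be classified into the offline reasons above. A minimal sketch, with hypothetical names and a simplified rule order:

```java
// Hypothetical sketch: classify why a device went offline from the
// last journaled device context. The real SDK journals richer state;
// this shows only the idea.
class DeviceContext {
    enum OfflineReason { NETWORK_UNAVAILABLE, BATTERY_DEAD, LOCATION_PERMISSION_DISABLED }

    // Derive the most likely reason from the last known device state.
    // Returns null when the state does not explain going offline.
    static OfflineReason classify(boolean networkUp, int batteryPercent, boolean locationPermitted) {
        if (!locationPermitted) return OfflineReason.LOCATION_PERMISSION_DISABLED;
        if (batteryPercent == 0) return OfflineReason.BATTERY_DEAD;
        if (!networkUp) return OfflineReason.NETWORK_UNAVAILABLE;
        return null;
    }
}
```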
Our offline-first architecture is available on Android and iOS without any additional configuration. If your business operations suffer from bad mobile networks, sign up and give it a spin!