In the last blog post, we covered some fundamental concepts about Kafka. This time, let's discuss how we use Kafka in Smily, how we got where we are now, what the decision drivers were, and how the overall architecture has evolved over time.
Like most technology startups, Smily (or rather BookingSync at the time) started as a single monolithic application. Yet, almost ten years ago (yes, that's correct - in early 2014), the ecosystem began to grow significantly. Not only did new ideas appear on the roadmap that were distinct enough to become separate applications (communicating with the existing application - let's call it "Core"), but we were also looking into opening our ecosystem to external partners interested in building integrations with us.
Being a company still in its early stage meant looking for the simplest solution to the problem. Under those circumstances, the natural way to go was an HTTP API, which resulted in the release of API v3 - the API that, at the time of writing, is still in use by our own applications and external Partners.
There were multiple advantages to doing so back then:
All these points led to an architectural model (the "Core-centric Model") that could be visualized in the following way:
This model was built upon two fundamental Ruby gems:
2. Synced, a tool for keeping local models in sync with their equivalent API v3 Resources, based on long polling: periodically fetching, in subsequent queries, the records updated since the timestamp of the previous synchronization (a simplified sketch of that flow follows below)
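To make that flow more tangible, here is a minimal sketch of the idea behind it. The class, client, and method names here are hypothetical and not synced's actual internals - it only illustrates the "fetch what changed since the last sync" loop:

```ruby
# A simplified sketch of the updated_since flow (hypothetical class and
# client names - not synced's actual internals).
class RentalsSynchronizer
  def call
    last_synced_at = SyncTimestamp.fetch("rentals")

    # Fetch only the records updated since the previous synchronization
    # and upsert them into the local database.
    api_client.rentals(updated_since: last_synced_at).each do |attributes|
      Rental.upsert(attributes)
    end

    SyncTimestamp.store("rentals", Time.current)
  end

  private

  def api_client
    # A hypothetical API v3 client wrapper
    ApiV3::Client.new(access_token: ENV.fetch("API_V3_TOKEN"))
  end
end
```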
On top of HTTP API v3, we also introduced webhooks based on the publish/subscribe pattern - mostly a way to implement the Event Notification pattern, so that consumer apps don't need to wait for the next polling interval (which for some Partner Apps happens every hour or even less often!) before they can act.
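In a consumer app, handling such a notification could look roughly like the sketch below (hypothetical controller, event format, and job names - not our actual implementation). The key point of the Event Notification pattern is that the webhook only says *what* changed, and the app then fetches the details via API v3 instead of waiting for the next polling run:

```ruby
# A hypothetical webhooks endpoint in a consumer app.
class WebhooksController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    event = JSON.parse(request.body.read)

    # e.g. { "event" => "rental.updated", "rental_id" => 123 }
    case event["event"]
    when "rental.created", "rental.updated"
      SyncRentalJob.perform_later(event["rental_id"])
    end

    head :ok
  end
end
```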
This architecture was sufficient and worked quite well in the beginning, only occasionally causing more significant issues. At some point, though, problems started to appear both in Core (significant database load caused by the massive API v3 traffic, requiring a considerable number of Puma processes to handle it) and in consumer apps (taking too much time to synchronize resources, OOMs in Sidekiq workers, various workarounds introduced in the synced gem for large batches and edge cases), clearly showing that this model might no longer be a good choice.
The list of suboptimal things in this architectural model could get pretty long, but these were some of the key points:
That was the right time to rethink the entire architecture. At the same time, given that we were a relatively small team back then, we wanted to avoid any significant rewrites and re-use what we had as much as possible.
In the end, this was the list of requirements we expected the new architecture to fulfill:
Under these circumstances, Kafka was a natural choice, especially since it fulfilled all our requirements, and the way we were using synced - timestamp-based offsets with the updated_since flow - was close to the dumb broker/smart consumer model implemented by Kafka. It was also straightforward to adapt the components used to serialize resources to JSON in API v3 and do the same thing when publishing to Kafka upon every change of a resource (i.e., any change that bumps the updated_at timestamp).
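As an illustration, re-using the API v3 serializers to publish a record to Kafka on every change could look roughly like this. It's a simplified sketch with hypothetical serializer and topic names, using karafka's WaterDrop-backed producer, not our exact code:

```ruby
# A simplified sketch: publish the API v3 JSON representation of a record
# to Kafka whenever it changes (hypothetical serializer/topic names).
class Rental < ApplicationRecord
  after_commit :publish_to_kafka, on: %i[create update]

  private

  def publish_to_kafka
    Karafka.producer.produce_async(
      topic: "rentals",
      key: id.to_s, # keeps all events for a given rental in the same partition
      payload: ApiV3::RentalSerializer.new(self).to_json
    )
  end
end
```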
Thanks to this change, our system turned into the following model:
It could be argued that this model was still a Core-centric one - the difference was that an extra layer (Kafka) was introduced to decouple consumer apps from the Core application. Nevertheless, it turned out to be a great success, and the change brought considerable benefits, solving the problems we used to have with the model based on synced/API v3.
Also, given how simple it was to introduce Kafka publishers in other applications (especially compared to how much effort it would take to build an HTTP API), it was pretty straightforward to turn that model into the following one:
Thanks to that, Kafka could become a data lake/event lake for the entire ecosystem if the need arises, and it will also allow us, in the future, to split bigger applications (like Core) into smaller (micro)services.
You might be wondering at this point how we made this happen so quickly - switching from HTTP-based long polling to Kafka - especially since one of the requirements was to keep using API v3 resources.
We developed our own framework on top of karafka that made it trivial to introduce new producers and consumers, thanks to the gem's powerful declarative DSL and to adopting something that could be compared to the Change Data Capture (CDC) pattern - applied not at the database level, but at the model level.
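We'll cover the framework itself in the next post, but to give a rough idea of the consuming side, here is a minimal plain-karafka sketch (not our framework's DSL; topic, model, and app names are hypothetical) that upserts local copies of the records published by Core:

```ruby
# A minimal karafka consumer sketch: consume events published by Core and
# upsert local copies of the records.
class KarafkaApp < Karafka::App
  setup do |config|
    config.kafka = { 'bootstrap.servers': ENV.fetch("KAFKA_BROKERS", "127.0.0.1:9092") }
    config.client_id = "consumer_app"
  end

  routes.draw do
    topic :rentals do
      consumer RentalsConsumer
    end
  end
end

class RentalsConsumer < Karafka::BaseConsumer
  def consume
    messages.each do |message|
      # Karafka deserializes JSON payloads by default
      Rental.upsert(message.payload.slice("id", "name", "updated_at"))
    end
  end
end
```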
And given that this is almost the end of this blog post, you probably already know what the next part of this series will be about :).
For that special occasion, we will release our framework publicly (after removing some internal dependencies and reworking the entire documentation, as it's heavily based on the details of our ecosystem), so stay tuned - this is going to be an opportunity to learn about a complete framework that allows integrating Ruby on Rails applications via Kafka in a very simple way.
In this blog post, we covered our old architecture model used to integrate our applications, what problems we experienced with it, and why we decided to switch to Kafka.
Stay tuned for the next part of this series, which is going to introduce our framework for doing CDC on the domain level.