So – after quite some time working with distributed systems, I want to share the decisions you'll face as you build an event-based distributed system.
I want to share some of the things that I've noticed, and my thoughts in general on these different approaches – what they look like, their pros and cons, etc.
Command vs Event
First things first – a Command and an Event are two distinct things. An Event is a fact that something happened and therefore can never be "invalid" – we might not be interested in it, but it is never in an invalid state – whereas a Command is a request for something to happen, passed as a message.
Message Envelope
This is the typical message envelope that I prescribe.
{
// the type of event or command i.e. JobPosted or PostJob if it's a command
type: string,
// so clients know how to handle our message
version: string,
// an id that uniquely identifies a message
id: string,
// a globally unique id for easier debugging
correlationId: string / guid,
// what generated this event / command
source: string,
// when the event happened or when the command was issued
timeStamp: long / DateTime,
// extra meta data to describe the event / command if needed
attributes: object,
// the actual payload
payload: object
}
Let’s briefly look at how you might utilise these properties and where they should come from.
Type and Version are a way to identify the message so that the client knows how to handle it.
Id is an abstract concept that can be used for deduping messages on the other end. This is typically implemented as a type of hash using various properties of the message. A very simple example for a user-generated event would be the hash of timestamp, type and user id assuming that a user can’t feasibly invoke the same type of event at the exact same time.
CorrelationId is generated by the originator of the event and is propagated throughout the rest of the system. This allows us to easily trace on a per-event basis anywhere within our system, given enough logging.
Source is set each time a new message gets produced and uniquely identifies the application and/or user that generated it. If a particular event or command can be published by more than one application, this becomes valuable for tracking down bugs upstream, and it makes filtering logs a bit easier.
Timestamp is invaluable in message-based systems – it allows us to find bottlenecks within our system and is an additional data point that we can use for lag compensation of out-of-order messages.
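Putting these properties together, here's a sketch in TypeScript of a small envelope builder, where a SHA-256 over timestamp, type and user id implements the simple dedupe id described above (the `buildEnvelope` helper, the `"job-service"` source and the version value are illustrative assumptions, not prescribed names):

```typescript
import { createHash } from "node:crypto";

// Hypothetical envelope type mirroring the fields described above.
interface MessageEnvelope {
  type: string;
  version: string;
  id: string;
  correlationId: string;
  source: string;
  timeStamp: number;
  attributes: Record<string, unknown>;
  payload: Record<string, unknown>;
}

// Deterministic id: hash of timestamp, type and user id, assuming a user
// can't feasibly invoke the same type of event at the exact same time.
function buildEnvelope(
  type: string,
  userId: number,
  payload: Record<string, unknown>,
  correlationId: string,
  timeStamp: number = Date.now()
): MessageEnvelope {
  const id = createHash("sha256")
    .update(`${timeStamp}|${type}|${userId}`)
    .digest("hex");
  return {
    type,
    version: "1.0",
    id,
    correlationId,
    source: "job-service", // hypothetical producing application
    timeStamp,
    attributes: {},
    payload,
  };
}
```

A consumer can then dedupe by remembering recently seen `id` values, and trace a request end-to-end by searching logs for the `correlationId`.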
Topic Segregation
One of the other things to consider when going for an event/message-based system is topic segregation, i.e. which events should be published to which topic?
Topic Per Event Type
In this segregation setup, every event type, i.e. JobPostUpdated, JobPostCreated etc., is published to its own topic.
This gives consumers more granularity in terms of the event types they want to listen to, without reliance on tech-specific features – although arguably, most messaging solutions such as AWS SNS/SQS and Azure Service Bus have their own filtering solutions, making this a non-issue in most cases.
Pros
- Consumers have a tech-agnostic way to only include event types they are interested in
Cons
- Consumers may have to subscribe to many topics if there are many event types they're interested in consuming
- Each event type added needs another topic to be created
Topic Per Entity
The idea here is to publish all event types for each entity to a topic, for example, Job, User, Recruiter etc.
Although consumers have no tech-agnostic way of filtering out event types they don't want to listen to, as mentioned previously, this is usually not an issue, as most messaging solutions have a filtering feature – if not, the consumer can just discard the irrelevant events anyway.
Pros
- Keeps the number of topics to a manageable number
- Adding a new event type requires no further infrastructure setup
Cons
- Consumers have no tech agnostic way to only include event types they intend to consume (not an issue in 99% of cases)
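When the messaging solution has no native filtering, the fallback mentioned above – discarding irrelevant events client-side – is trivial to implement. A sketch (the `Envelope` shape and `filterRelevant` helper are hypothetical):

```typescript
// Hypothetical minimal envelope shape; only `type` matters for filtering.
interface Envelope {
  type: string;
  payload: unknown;
}

// With Topic Per Entity, a consumer receives every event type for the
// entity; it can simply drop the ones it doesn't intend to consume.
function filterRelevant(
  messages: Envelope[],
  relevantTypes: Set<string>
): Envelope[] {
  return messages.filter(m => relevantTypes.has(m.type));
}
```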
Event Message Content
One of the things to decide when communicating via messaging is what to include in the message content – and you mainly have three ways of going about this.
1. Entire State
The idea is to simply include the entire state of the entity relating to the event. For example, you may have a JobPostUpdated event – this will include the entire state of the Job.
{
posterUserId: 123,
jobId: 567,
title: 'Job Title',
description: 'Job Description',
tags: ['c#', 'javascript', 'developer']
}
As soon as we add a new property, we simply add it to the message – simple enough; no need to change the version as it's still compatible with old clients.
But what if the clients now require the first name and last name of the user who posted the job? We have to add it to the payload.
{
posterUserId: 123,
poster: {
firstName: 'Jon',
lastName: 'Doe'
},
jobId: 567,
title: 'Job Title',
description: 'Job Description',
tags: ['c#', 'javascript', 'developer'],
}
Well… that user is probably associated with some sort of an organisation – and the clients downstream now need that as well, as part of the requirement for project ABC, etc. Where does it end?
Pros
- Clients always get the full payload without extra work
Cons
- Larger payload
- The classic problem of asking for the banana and getting the gorilla holding it, along with the entire jungle
2. Changes Only
The idea here is to simply publish the state that has changed instead of the entire payload, so instead of a single JobPostUpdated event, we split this up further into smaller cohesive events – such as JobPostTitleChanged, JobPostDescriptionChanged, JobPostTagsChanged etc.
JobTitleChanged
{
jobId: 567,
title: 'New Job Title',
}
There's an assumption made here that all clients are aware of, or up to date with, the previous complete state of the entity.
Pros
- Smaller payload
- Easier to implement on the publisher side, as it doesn't need to read the entire current state of the entity the event relates to
Cons
- May require extra work for clients to get the full picture (e.g. if they're not aware of previously published messages)
- More Event Types to publish
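Given the assumption that clients hold the previous complete state, consuming a change-only event is just a merge into the cached copy. A sketch, with hypothetical state and event shapes (the `apply` helper is illustrative):

```typescript
// Hypothetical cached read model of a job post held by the consumer.
interface JobPost {
  jobId: number;
  title: string;
  description: string;
  tags: string[];
}

// The smaller, cohesive change-only events from above.
type JobPostEvent =
  | { type: "JobPostTitleChanged"; jobId: number; title: string }
  | { type: "JobPostDescriptionChanged"; jobId: number; description: string }
  | { type: "JobPostTagsChanged"; jobId: number; tags: string[] };

// Merge a change-only event into the last known complete state.
function apply(state: JobPost, event: JobPostEvent): JobPost {
  switch (event.type) {
    case "JobPostTitleChanged":
      return { ...state, title: event.title };
    case "JobPostDescriptionChanged":
      return { ...state, description: event.description };
    case "JobPostTagsChanged":
      return { ...state, tags: event.tags };
  }
}
```

The catch is exactly the con listed above: if the consumer missed earlier messages, the merged state is silently stale.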
3. ID Only
Only publish the ID of the entity that changed, for example JobPostUpdated would only contain:
{
jobId: 567
}
The typical mechanism to get further data looks like this:
1. Publisher --> Publish Event --> Client
2. Client --> GET --> Source of Truth
The producer publishes an event, the client receives the event with the entity id, and it grabs the entire state by asking the source of truth.
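The two steps above can be sketched as a handler with an injected callback to the source of truth (`handleJobPostUpdated` and `fetchJob` are hypothetical stand-ins for your real event handler and the HTTP GET against the owning service):

```typescript
// The event carries only the entity id.
interface JobPostUpdated {
  jobId: number;
}

// Hypothetical full state returned by the source of truth.
interface JobPost {
  jobId: number;
  title: string;
}

// 1. receive the event with the id,
// 2. GET the entire state from the source of truth.
async function handleJobPostUpdated(
  event: JobPostUpdated,
  fetchJob: (jobId: number) => Promise<JobPost>
): Promise<JobPost> {
  return fetchJob(event.jobId);
}
```

Injecting the fetcher keeps the handler testable, but in production that call is exactly the availability dependency listed in the cons below.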
Pros
- Arguably the easiest to implement on the publisher side
- The published message is unlikely to ever change
- Clients take only what they need
Cons
- Extra steps required by clients to get the full payload
- Assumes the source of truth is highly available
- Can create unwanted dependencies that make the system more brittle whereby outages can cause a domino effect
Publishing Messages
Choosing the approach to publishing messages is a critical decision – depending on requirements, you may have to guarantee delivery of messages.
Part of DB Transaction
Simply include the publishing of the message as part of the DB transaction; a pseudo-code example looks like:
using (var db = _dbConnectionFactory.GetWriteConnection())
{
    db.BeginTransaction();
    try
    {
        // do a bunch of SQL queries
        var myMessagePayload = somePayload;
        // we're doing this as part of the transaction before committing
        await _publisher.PublishMessage(myMessagePayload);
        db.Commit();
    }
    catch (Exception ex)
    {
        db.Rollback();
        throw;
    }
    finally
    {
        db.Close();
    }
}
Pros
- Simple to implement
- At least once delivery
Cons
- Potential performance issue – extra IO within a transaction means a lock on the row or entire table is held for much longer, which could cause concurrency issues / slowdowns across the entire application
- Not great for a synchronous process such as a Web API endpoint, as a failure to publish means that the entire request fails – this adds an extra point of failure.
Outside of DB Transaction
Simply publish the event outside the DB transaction and hope that it publishes – this is very much the simplest approach, but of course, you lose the guarantee of message delivery. You could add retries around the publishing to reduce the chances of losing an event.
object myMessagePayload = null;
using (var db = _dbConnectionFactory.GetWriteConnection())
{
    db.BeginTransaction();
    try
    {
        // do a bunch of SQL queries
        myMessagePayload = somePayload;
        db.Commit();
    }
    catch (Exception ex)
    {
        db.Rollback();
        throw;
    }
    finally
    {
        db.Close();
    }
}
// we're doing this outside the transaction, after the commit succeeds
await _publisher.PublishMessage(myMessagePayload);
Pros
- Simple to implement
Cons
- No guaranteed delivery
Out-of-process Event Tailing
The idea here is to have an event audit table written by the application producing the events, as part of the same DB transaction. This means guaranteed storage of all events that have ever happened.
A separate process then tails this table, publishing each event and marking it as published.
This is arguably the most time-consuming option to build, but it sounds like the proper solution to this problem, as it guarantees message delivery and does not affect application performance in any way, bar the extra time needed to write the event audit.
Pros
- At least once delivery
Cons
- Another worker to maintain
- More time to build
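The tailing worker described above can be sketched as a loop over unpublished rows; `OutboxRow`, `loadUnpublished`, `publish` and `markPublished` are hypothetical stand-ins for the audit table, the DB reads/writes and the broker call:

```typescript
// Hypothetical row in the event audit table.
interface OutboxRow {
  id: number;
  payload: unknown;
  published: boolean;
}

// Tail the audit table: publish each unpublished event, then mark it.
// Publishing happens before marking – a crash in between causes a
// re-send on the next pass, which is why this gives at-least-once
// (not exactly-once) delivery; consumers still need to dedupe.
async function tailOutbox(
  loadUnpublished: () => Promise<OutboxRow[]>,
  publish: (payload: unknown) => Promise<void>,
  markPublished: (id: number) => Promise<void>
): Promise<void> {
  for (const row of await loadUnpublished()) {
    await publish(row.payload);
    await markPublished(row.id);
  }
}
```

In practice this loop would run on a timer or poll continuously, which is the "another worker to maintain" cost noted above.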