So – after quite some time working with distributed systems, I want to share the decisions you'll face as you build an event-based distributed system.
I want to share some of the things that I've noticed, and my thoughts in general on these different approaches – what they look like, their pros and cons, etc.
Command vs Event
First things first – a Command and an Event are two distinct things. An Event is a fact that something happened and therefore can never be "invalid" – we might not be interested in it, but it is never in an invalid state – whereas a Command is a request for something to happen, passed as a message.
Message Envelope
This is the typical message envelope that I prescribe.
{
// the type of event or command i.e. JobPosted or PostJob if it's a command
type: string,
// so clients know how to handle our message
version: string,
// an id that uniquely identifies a message
id: string,
// a globally unique id for easier debugging
correlationId: string / guid,
// what generated this event / command
source: string,
// when the event happened or when the command was issued
timeStamp: long / DateTime,
// extra meta data to describe the event / command if needed
attributes: object,
// the actual payload
payload: object
}
Let’s briefly look at how you might utilise these properties and where they should come from.
Type and Version are a way to identify the message so that the client knows how to handle it.
Id is an abstract concept that can be used for deduping messages on the other end. This is typically implemented as a type of hash using various properties of the message. A very simple example for a user-generated event would be the hash of timestamp, type and user id assuming that a user can’t feasibly invoke the same type of event at the exact same time.
CorrelationId is generated by the originator of the event and is propagated throughout the rest of the system. This allows us to easily trace on a per-event basis anywhere within our system, given enough logging.
Source is set each time a new message gets produced and uniquely identifies the application and/or user that generated it. If a particular event or command can be published by more than one application, this becomes valuable for tracking down bugs upstream, and it makes filtering logs a bit easier.
Timestamp is invaluable in message-based systems – it allows us to find bottlenecks within our system and is an additional data point that we can use for lag compensation of out-of-order messages.
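Putting these properties together, here's a sketch in TypeScript of a small envelope builder, where a SHA-256 over timestamp, type and user id implements the simple dedupe id described above (the `buildEnvelope` helper, the `"job-service"` source and the version value are illustrative assumptions, not prescribed names):

```typescript
import { createHash } from "node:crypto";

// Hypothetical envelope type mirroring the fields described above.
interface MessageEnvelope {
  type: string;
  version: string;
  id: string;
  correlationId: string;
  source: string;
  timeStamp: number;
  attributes: Record<string, unknown>;
  payload: Record<string, unknown>;
}

// Deterministic id: hash of timestamp, type and user id, assuming a user
// can't feasibly invoke the same type of event at the exact same time.
function buildEnvelope(
  type: string,
  userId: number,
  payload: Record<string, unknown>,
  correlationId: string,
  timeStamp: number = Date.now()
): MessageEnvelope {
  const id = createHash("sha256")
    .update(`${timeStamp}|${type}|${userId}`)
    .digest("hex");
  return {
    type,
    version: "1.0",
    id,
    correlationId,
    source: "job-service", // hypothetical producing application
    timeStamp,
    attributes: {},
    payload,
  };
}
```

A consumer can then dedupe by remembering recently seen `id` values, and trace a request end-to-end by searching logs for the `correlationId`.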
Topic Segregation
One of the other things to consider when going for an event/message-based system is topic segregation, i.e. which events should be published to which topic?
Topic Per Event Type
In this segregation setup, every event type, i.e. JobPostUpdated, JobPostCreated etc., is published to its own topic.
This gives consumers more granularity in terms of the event types they want to listen to, without reliance on tech-specific features – although arguably, most messaging solutions such as AWS SNS/SQS and Azure Service Bus have their own filtering solutions, making this a non-issue in most cases.
Pros
- Consumers have a tech-agnostic way to only include event types they are interested in
Cons
- Consumers may have to subscribe to many topics if there are many event types they're interested in consuming
- Each event type added needs another topic to be created
Topic Per Entity
The idea here is to publish all event types for each entity to a topic, for example, Job, User, Recruiter etc.
Although consumers have no tech-agnostic way of filtering out event types they don't want to listen to, as mentioned previously, this is usually not an issue, as most messaging solutions have a filtering feature – if not, the consumer can just discard the irrelevant events anyway.
Pros
- Keeps the number of topics to a manageable number
- Adding a new event type requires no further infrastructure setup
Cons
- Consumers have no tech agnostic way to only include event types they intend to consume (not an issue in 99% of cases)
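When the messaging solution has no native filtering, the fallback mentioned above – discarding irrelevant events client-side – is trivial to implement. A sketch (the `Envelope` shape and `filterRelevant` helper are hypothetical):

```typescript
// Hypothetical minimal envelope shape; only `type` matters for filtering.
interface Envelope {
  type: string;
  payload: unknown;
}

// With Topic Per Entity, a consumer receives every event type for the
// entity; it can simply drop the ones it doesn't intend to consume.
function filterRelevant(
  messages: Envelope[],
  relevantTypes: Set<string>
): Envelope[] {
  return messages.filter(m => relevantTypes.has(m.type));
}
```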
Event Message Content
One of the things to decide when communicating via messaging is what to include in the message content – and you mainly have three ways of going about this.
1. Entire State
The idea is to simply include the entire state of the entity relating to the event. For example, you may have a JobPostUpdated event – this will include the entire state of the Job.
{
posterUserId: 123,
jobId: 567,
title: 'Job Title',
description: 'Job Description',
tags: ['c#', 'javascript', 'developer']
}
As soon as we add a new property, we simply add it to the message – simple enough; no need to change the version as it's still compatible with old clients.
But what if the clients now require the first name and last name of the user who posted the job? We have to add it to the payload.
{
posterUserId: 123,
poster: {
firstName: 'Jon',
lastName: 'Doe'
},
jobId: 567,
title: 'Job Title',
description: 'Job Description',
tags: ['c#', 'javascript', 'developer'],
}
Well… that user is probably associated with some sort of an organisation – and the clients downstream now need that as well, as part of the requirement for project ABC, etc. Where does it end?
Pros
- Clients always get the full payload without extra work
Cons
- Larger payload
- The classic problem of asking for the banana and getting the gorilla holding it, along with the entire jungle
2. Changes Only
The idea here is to simply publish the state that has changed instead of the entire payload, so instead of a single JobPostUpdated event, we split this up further into smaller cohesive events – such as JobPostTitleChanged, JobPostDescriptionChanged, JobPostTagsChanged etc.
JobTitleChanged
{
jobId: 567,
title: 'New Job Title',
}
There's an assumption made here that all clients are aware of, or up to date with, the previous complete state of the entity.
Pros
- Smaller payload
- Easier to implement on the publisher side, as it doesn't need to read the entire current state of the entity the event relates to
Cons
- May require extra work for clients to get the full picture (e.g. if they're not aware of previously published messages)
- More Event Types to publish
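Given the assumption that clients hold the previous complete state, consuming a change-only event is just a merge into the cached copy. A sketch, with hypothetical state and event shapes (the `apply` helper is illustrative):

```typescript
// Hypothetical cached read model of a job post held by the consumer.
interface JobPost {
  jobId: number;
  title: string;
  description: string;
  tags: string[];
}

// The smaller, cohesive change-only events from above.
type JobPostEvent =
  | { type: "JobPostTitleChanged"; jobId: number; title: string }
  | { type: "JobPostDescriptionChanged"; jobId: number; description: string }
  | { type: "JobPostTagsChanged"; jobId: number; tags: string[] };

// Merge a change-only event into the last known complete state.
function apply(state: JobPost, event: JobPostEvent): JobPost {
  switch (event.type) {
    case "JobPostTitleChanged":
      return { ...state, title: event.title };
    case "JobPostDescriptionChanged":
      return { ...state, description: event.description };
    case "JobPostTagsChanged":
      return { ...state, tags: event.tags };
  }
}
```

The catch is exactly the con listed above: if the consumer missed earlier messages, the merged state is silently stale.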
3. ID Only
Only publish the ID of the entity that changed, for example JobPostUpdated would only contain:
{
jobId: 567
}
The typical mechanism to get further data looks like this:
1. Publisher --> Publish Event --> Client
2. Client --> GET --> Source of Truth
The producer publishes an event, the client receives the event with the entity id, and it grabs the entire state by asking the source of truth.
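The two steps above can be sketched as a handler with an injected callback to the source of truth (`handleJobPostUpdated` and `fetchJob` are hypothetical stand-ins for your real event handler and the HTTP GET against the owning service):

```typescript
// The event carries only the entity id.
interface JobPostUpdated {
  jobId: number;
}

// Hypothetical full state returned by the source of truth.
interface JobPost {
  jobId: number;
  title: string;
}

// 1. receive the event with the id,
// 2. GET the entire state from the source of truth.
async function handleJobPostUpdated(
  event: JobPostUpdated,
  fetchJob: (jobId: number) => Promise<JobPost>
): Promise<JobPost> {
  return fetchJob(event.jobId);
}
```

Injecting the fetcher keeps the handler testable, but in production that call is exactly the availability dependency listed in the cons below.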
Pros
- Arguably the easiest to implement on the publisher side
- The published message is unlikely to ever change
- Clients take only what they need
Cons
- Extra steps required by clients to get the full payload
- Assumes the source of truth is highly available
- Can create unwanted dependencies that make the system more brittle whereby outages can cause a domino effect
Publishing Messages
Choosing the approach to publishing messages is a critical decision – depending on requirements, you may have to guarantee delivery of messages.
Part of DB Transaction
Simply include the publishing of the message as part of the DB transaction; a pseudo-code example looks like:
using (var db = _dbConnectionFactory.GetWriteConnection())
{
    db.BeginTransaction();
    try
    {
        // do a bunch of SQL queries
        var myMessagePayload = somePayload;
        // we're doing this as part of the transaction before committing
        await _publisher.PublishMessage(myMessagePayload);
        db.Commit();
    }
    catch (Exception ex)
    {
        db.Rollback();
        throw;
    }
    finally
    {
        db.Close();
    }
}
Pros
- Simple to implement
- At least once delivery
Cons
- Potential performance issue – extra IO within a transaction means a lock on the row or entire table is held for much longer, which could cause concurrency issues / slowdowns across the entire application
- Not great for a synchronous process such as a Web API endpoint, as a failure to publish means that the entire request fails – this adds an extra point of failure.
Outside of DB Transaction
Simply publish the event outside the DB transaction and hope that it publishes – this is very much the simplest approach, but of course, you lose the guarantee of message delivery. You could add retries around the publishing to reduce the chances of losing an event.
object myMessagePayload = null;
using (var db = _dbConnectionFactory.GetWriteConnection())
{
    db.BeginTransaction();
    try
    {
        // do a bunch of SQL queries
        myMessagePayload = somePayload;
        db.Commit();
    }
    catch (Exception ex)
    {
        db.Rollback();
        throw;
    }
    finally
    {
        db.Close();
    }
}
// we're doing this outside the transaction, after the commit succeeds
await _publisher.PublishMessage(myMessagePayload);
Pros
- Simple to implement
Cons
- No guaranteed delivery
Out-of-process Event Tailing
The idea here is to have an event audit table written by the application producing the events, as part of the same DB transaction. This means guaranteed storage of all events that have ever happened.
A separate process then tails this table, publishing each event and marking it as published.
This is arguably the most time-consuming option to build, but it sounds like the proper solution to this problem, as it guarantees message delivery and does not affect application performance in any way, bar the extra time needed to write the event audit.
Pros
- At least once delivery
Cons
- Another worker to maintain
- More time to build
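The tailing worker described above can be sketched as a loop over unpublished rows; `OutboxRow`, `loadUnpublished`, `publish` and `markPublished` are hypothetical stand-ins for the audit table, the DB reads/writes and the broker call:

```typescript
// Hypothetical row in the event audit table.
interface OutboxRow {
  id: number;
  payload: unknown;
  published: boolean;
}

// Tail the audit table: publish each unpublished event, then mark it.
// Publishing happens before marking – a crash in between causes a
// re-send on the next pass, which is why this gives at-least-once
// (not exactly-once) delivery; consumers still need to dedupe.
async function tailOutbox(
  loadUnpublished: () => Promise<OutboxRow[]>,
  publish: (payload: unknown) => Promise<void>,
  markPublished: (id: number) => Promise<void>
): Promise<void> {
  for (const row of await loadUnpublished()) {
    await publish(row.payload);
    await markPublished(row.id);
  }
}
```

In practice this loop would run on a timer or poll continuously, which is the "another worker to maintain" cost noted above.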