Operations, Events, Exceptions and Correlation

This document clarifies concepts implemented in the platform to facilitate the monitoring and management of services deployed to the platform. The platform provides monitoring data via logs and metrics to the monitoring tools and requires a consistent approach across services to simplify building dashboards, report generation, and alerts configuration.

A clear understanding of these concepts is crucial to keeping the platform consistent and simple to maintain.

Triggers

A trigger initiates an operation. It is an event received from an external system, or raised internally, to signal that something has happened or is about to happen.

Common triggers include:

HTTP Request: An HTTP request to a RESTful API endpoint initiates an operation.
Queue or Topic Message: A message received from a queue or topic, in the form of a command or event, triggers some processing within the application.
Scheduled Job: A standalone application or function executes on a defined schedule to perform a task and is considered complete when the operation finishes.

Operations

Everything that happens within an application occurs in the context of an operation. An operation starts in response to a trigger, executes, and notifies its completion with an event (and a response in the case of a synchronous operation).

Operation Types

Commands

A command is an operation that modifies one or more resources in the application. Commands can be synchronous, optionally returning a result to the caller in the form of a response, or asynchronous, with no return value. The outcome of an asynchronous command should always be published as an event, whether the command succeeded or failed.

Queries

Queries are operations that do not modify resources in the application. They retrieve data, but they should also raise events so that user behavior can be analyzed to improve the platform.

Example: Counting how many times a user uses a feature (searching invoices, paging through search results, etc.). Queries can also produce exceptions, for instance when the user does not have permission to execute the operation; these are useful for security, auditing, and monitoring purposes.

Note: All Commands and Queries must raise events to generate enough data for observability and behavior analysis by the data platform team.
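As a rough illustration of the rule that both commands and queries raise events, the sketch below is written in Python for brevity and is not the platform's implementation. The `InvoiceService` class, its methods, the `publish` callback, and the event names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class OutcomeEvent:
    """Hypothetical event shape: a name, the outcome, and some context."""
    name: str
    succeeded: bool
    details: dict = field(default_factory=dict)


class InvoiceService:
    """Hypothetical service in which every command and query publishes an event."""

    def __init__(self, publish: Callable[[OutcomeEvent], None]) -> None:
        self._publish = publish
        self._invoices: dict[str, float] = {}

    def create_invoice(self, invoice_id: str, amount: float) -> None:
        # Command: modifies a resource and publishes the outcome either way.
        if invoice_id in self._invoices:
            self._publish(OutcomeEvent(
                "InvoiceCreationFailed", False,
                {"invoice_id": invoice_id, "reason": "duplicate"}))
            return
        self._invoices[invoice_id] = amount
        self._publish(OutcomeEvent("InvoiceCreated", True, {"invoice_id": invoice_id}))

    def search_invoices(self, min_amount: float) -> list[str]:
        # Query: reads only, but still raises an event for usage analysis.
        matches = [i for i, a in self._invoices.items() if a >= min_amount]
        self._publish(OutcomeEvent("InvoicesSearched", True, {"result_count": len(matches)}))
        return matches
```

Passing `publish` in as a callback keeps the sketch independent of any particular message broker; a real service would publish to whatever queue or topic the platform provides.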

Events

Application events are generated by the application to notify external services that an action has taken place. An event indicates whether an operation completed successfully or failed. It should carry enough context to identify the resource that changed (if any), the operation that raised the event, and, where available, the ID of the user who requested it.
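One possible shape for such an event is sketched below in Python. The field names and the JSON serialization are assumptions made for illustration; the platform's actual event contract is not specified here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ApplicationEvent:
    """Hypothetical event envelope carrying the context described above."""
    name: str                          # e.g. "OrderCreated"
    operation: str                     # the operation that raised the event
    succeeded: bool                    # outcome of the operation
    occurred_at: str                   # ISO 8601 timestamp (UTC)
    resource_id: Optional[str] = None  # the resource changed, if any
    user_id: Optional[str] = None      # the requesting user, if known


def to_message(event: ApplicationEvent) -> str:
    """Serialize the event to JSON for publication to external services."""
    return json.dumps(asdict(event), sort_keys=True)


event = ApplicationEvent(
    name="OrderCreated",
    operation="CreateOrder",
    succeeded=True,
    occurred_at=datetime.now(timezone.utc).isoformat(),
    resource_id="order-42",
    user_id="user-7",
)
message = to_message(event)
```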

Exceptions

Exceptions interrupt the current processing flow, either because the application or one of its dependent components behaved unexpectedly and can't proceed, or because the application logic determines that proceeding may cause issues. All exceptions raised in an application or its dependencies are logged by the infrastructure components, so application code does not need to add its own log entry to report that an exception has happened; doing so only produces redundant log entries.

Exceptions raised by the application must be specific and uniquely identifiable across the platform and must clearly describe the problem that caused the exception. Generic exceptions should be avoided because they will make it difficult to implement proper exception handling logic in the application and to aggregate recurrent exceptions in the logging platform. Having unique exceptions will facilitate aggregating common issues for monitoring and alerting.

Exception types can be classified as Application, Infrastructure, or Base Class Library (BCL).

Application: Exceptions raised by the application to interrupt an operation if the state of a resource or the user permissions are not valid to complete the operation. These are commonly mapped to business requirements implemented as code into the application logic.
Infrastructure: Exceptions raised by base packages used to support the application. They provide useful information to the application to handle issues that it can't handle by itself. These exceptions are generally related to data access problems, messaging, and communication issues implemented into these base packages. The application can either handle these exceptions and provide a custom exception specific to the application or let the infrastructure exception propagate and provide the user with a generic error message.
Base Class Library (BCL): Base exceptions raised by the .NET framework. They commonly propagate to the application code to signal that some base class library code failed unexpectedly, such as a failed type conversion or an invalid operation. The application should generally handle these exceptions rather than let them propagate to the user, and give the user more useful information instead.
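The classification above can be sketched as follows. This is a Python illustration, not .NET code: `KeyError` stands in for a base class library exception and `ConnectionError` for an infrastructure exception. The numeric codes, `ItemStoreUnavailableException`, `InMemoryItemStore`, and `get_item` are hypothetical.

```python
class ApplicationError(Exception):
    """Base for application exceptions; each subclass gets a unique code (assumed scheme)."""
    code = 0


class ItemDoesNotExistException(ApplicationError):
    code = 1001


class ItemStoreUnavailableException(ApplicationError):
    code = 1002


class InMemoryItemStore:
    """Hypothetical stand-in for a data access base package."""

    def __init__(self, items: dict, online: bool = True) -> None:
        self._items = items
        self._online = online

    def get(self, item_id: str) -> dict:
        if not self._online:
            # Stands in for an infrastructure exception (communication issue).
            raise ConnectionError("item store unreachable")
        # A missing key raises KeyError, standing in for a BCL exception.
        return self._items[item_id]


def get_item(store: InMemoryItemStore, item_id: str) -> dict:
    """Translate generic failures into specific, identifiable application exceptions."""
    try:
        return store.get(item_id)
    except KeyError as exc:
        raise ItemDoesNotExistException(f"Item '{item_id}' does not exist") from exc
    except ConnectionError as exc:
        raise ItemStoreUnavailableException("The item store is unavailable") from exc
```

The handlers deliberately contain no logging calls, in line with the statement above that infrastructure components already log every exception. Chaining with `from exc` keeps the original exception available for diagnosis.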

Correlation

Definition: A unique identifier used to track a triggered operation and its chain of events and exceptions.

Usage: Every triggered operation needs to be identified individually. To make this possible, we assign a CorrelationId to the operation. The CorrelationId then identifies the chain of events and exceptions that the operation produces.

A single CorrelationId can span the entire flow of a process, or each stage of the flow can use its own CorrelationId.

Example:

The CreateOrder operation is triggered by the customer at the checkout of an e-commerce site, raising the OrderCreated event.
The Stock application gets the event raised by the checkout application and triggers the ReserveStock command.

Here, a single trigger led to two operations as part of one workflow. Using a single CorrelationId in both phases of this workflow makes it easier to track all events in the chain across different systems. However, every event then shares the same CorrelationId, and in a verbose system the set of log entries under that one identifier can become very large. Other scenarios where a single CorrelationId becomes ambiguous include:

A client that did not get a response in time retries the request using the same CorrelationId.
An event is processed by multiple handlers at the same time, creating chains of events interleaved across the different handlers.
The handling of a message can't complete because of the state of the application, connectivity, or other issues, and processing of the event is re-attempted using the same CorrelationId.

In these scenarios, it would be useful to generate one CorrelationId per phase.

Example:

CreateOrder generates a CorrelationId and passes it to the event raised.
ReserveStock captures the CorrelationId of the original event, generates a new CorrelationId, and logs the transition from the old one to the new one before handling the event under the new CorrelationId.

The current sample solution implements one CorrelationId for the entire flow. It is up to the implementer to decide how they want to track the correlation between services.
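The two strategies can be compared in a small sketch. It is written in Python for illustration and is not the sample solution's code. The `Message` type, the `per_stage` flag, the log entry keys, and the `StockReserved` event name are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    name: str
    correlation_id: str
    parent_correlation_id: Optional[str] = None


def new_correlation_id() -> str:
    return str(uuid.uuid4())


def create_order(log: list) -> Message:
    # The first operation in the flow generates the CorrelationId.
    correlation_id = new_correlation_id()
    log.append({"operation": "CreateOrder", "correlation_id": correlation_id})
    return Message("OrderCreated", correlation_id)


def reserve_stock(event: Message, log: list, per_stage: bool = False) -> Message:
    if per_stage:
        # One CorrelationId per stage: log the transition so that the
        # two stages can still be joined when reading the logs.
        correlation_id = new_correlation_id()
        log.append({"transition_from": event.correlation_id,
                    "transition_to": correlation_id})
        parent = event.correlation_id
    else:
        # Single CorrelationId for the whole flow: reuse the incoming one.
        correlation_id = event.correlation_id
        parent = None
    log.append({"operation": "ReserveStock", "correlation_id": correlation_id})
    return Message("StockReserved", correlation_id, parent)
```

With `per_stage=False` every log entry in the workflow shares one identifier; with `per_stage=True` the logged transition entry is the only link between stages, so it must not be omitted.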

OperationCode

Definition: A unique code used to track an operation type.

Usage: Every operation has a type.

Example: Create Item, Delete Item, Update Item, Query for Items, etc. Each of these types will be given a code that will be used by the logging handler to record the chain of events, exceptions, or other information generated by an operation. The code is useful to decouple the operation name from other resources.

Example:

The UpdateItem command can raise the ItemUpdated event when it succeeds. If the item does not exist, an ItemDoesNotExistException is raised.
The ReserveItem command can raise the ItemUpdated and ItemSoldOut events when it succeeds. If the item does not exist, an ItemDoesNotExistException is raised.

Both commands can raise the same application events and exceptions, but in different contexts: one updates the item directly and the other indirectly.

Adding the OperationCode to these events makes it easier to:

Identify the context of an event or exception.
Group and track events or exceptions generally raised by one operation type.
Prevent operations with the same name in different contexts from being considered the same.
Simplify the tracking of existing operations by having a centralized list of operation codes.
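A minimal sketch of this idea, in Python, is shown below. The numeric values, the `OperationCode` enum layout, and the `log_entry` helper are assumptions for illustration only.

```python
from collections import Counter
from enum import IntEnum


class OperationCode(IntEnum):
    """Hypothetical centralized list of operation codes."""
    UPDATE_ITEM = 100
    RESERVE_ITEM = 101


def log_entry(operation: OperationCode, kind: str, name: str) -> dict:
    """Build a log entry annotated with the code of the operation that produced it."""
    return {"operation_code": int(operation), "kind": kind, "name": name}


entries = [
    log_entry(OperationCode.UPDATE_ITEM, "event", "ItemUpdated"),
    log_entry(OperationCode.RESERVE_ITEM, "event", "ItemUpdated"),
    log_entry(OperationCode.RESERVE_ITEM, "event", "ItemSoldOut"),
    log_entry(OperationCode.RESERVE_ITEM, "exception", "ItemDoesNotExistException"),
]

# Group entries by the operation type that raised them.
per_operation = Counter(OperationCode(e["operation_code"]).name for e in entries)

# The same event name is distinguishable by its operation context.
item_updated_contexts = [e["operation_code"] for e in entries if e["name"] == "ItemUpdated"]
```

Keeping the codes in one enum gives the centralized list mentioned above: adding an operation means adding one member, and a duplicate value is easy to spot in review.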

Unique Ids and Codes (exceptions, events, and operations codes)

As described above, an application will create or handle operations, events, and exceptions that will generate log entries in the logging platform. These log entries will be used to generate reports and alerts. To do so, they must be uniquely identifiable or easily aggregated to produce useful reports.

The names of exceptions, events, and possibly operations might conflict with names defined in different domains (e.g., AddAttachment command, AttachmentAdded event). Using the fully qualified name, including the namespace, would be a simple way to differentiate each domain-related activity. However, in some cases, such as base packages shared by multiple services, exceptions or events might have the same namespaces, making it difficult to track their relationships.

Another scenario is linking exceptions and events to the operations that triggered them. We could correlate them by CorrelationId and trace back to the operation that started the chain. An alternative is to annotate the log entries with the operation name.

To simplify these complexities, we will add unique ids and codes to exceptions, events, and operations to uniquely identify them. These ids will be attached to all log entries to:

Link events and exceptions to operations that triggered them, simplifying reporting, monitoring, and alerting processes.
Reduce the size of log entries by replacing long names with integers.
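The following Python sketch shows one way such codes could be registered and attached to log entries. The `CodeRegistry` class, the code values, and the `Billing.Attachments` namespace are hypothetical; how the platform actually assigns and stores codes is not specified here.

```python
import json


class CodeRegistry:
    """Hypothetical central registry mapping integer codes to fully qualified names."""

    def __init__(self) -> None:
        self._name_by_code: dict[int, str] = {}

    def register(self, code: int, qualified_name: str) -> int:
        existing = self._name_by_code.get(code)
        if existing is not None and existing != qualified_name:
            # A code must identify exactly one operation, event, or exception.
            raise ValueError(f"code {code} already belongs to {existing}")
        self._name_by_code[code] = qualified_name
        return code

    def name_of(self, code: int) -> str:
        return self._name_by_code[code]


registry = CodeRegistry()
ADD_ATTACHMENT = registry.register(100, "Billing.Attachments.AddAttachment")
ATTACHMENT_ADDED = registry.register(2001, "Billing.Attachments.AttachmentAdded")

# Log entry using integer codes instead of long names.
compact = json.dumps({"operation_code": ADD_ATTACHMENT, "event_code": ATTACHMENT_ADDED})

# The same entry using fully qualified names, for comparison.
verbose = json.dumps({
    "operation": registry.name_of(ADD_ATTACHMENT),
    "event": registry.name_of(ATTACHMENT_ADDED),
})
```

The registry rejects a code that is already assigned to a different name, which is what keeps the codes unique across domains that reuse the same short names.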
