The purpose of this document is to clarify the concepts implemented in the platform, in order to facilitate the monitoring and management of the services deployed to it.
The platform as a whole provides monitoring data to the monitoring tools via logs and metrics, and it requires a consistent approach across services to simplify building dashboards, generating reports and configuring alerts.
A clear understanding of these concepts is crucial to keeping the platform consistent and simple to maintain.
A trigger mechanism is required to initiate an operation. The trigger is the initiating event, either received from an external system or raised internally, that notifies the application that something has happened or is about to happen.
The common triggers are:
HTTP Request: An HTTP request sent to a RESTful API endpoint to initiate an operation.
Queue or Topic Message: A message received from a queue or topic, in the form of a command or event, that triggers some processing within the application.
Scheduled Job: A standalone application or function that runs on a defined schedule to execute a task and is considered complete when the operation finishes.
Everything that happens within an application happens in the context of an operation. An operation starts in response to a trigger, executes, and notifies its completion with an event (and with a response, in the case of a synchronous operation).
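The lifecycle described above (trigger, execution, completion event) can be sketched as follows. This is a minimal, language-neutral illustration written in Python, not the platform's implementation: `TriggerKind`, `CompletionEvent` and `run_operation` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class TriggerKind(Enum):
    """The three common triggers listed above (hypothetical enum)."""
    HTTP_REQUEST = "http_request"
    QUEUE_OR_TOPIC_MESSAGE = "queue_or_topic_message"
    SCHEDULED_JOB = "scheduled_job"


@dataclass
class CompletionEvent:
    """Event published when an operation finishes, successfully or not."""
    operation: str
    trigger: TriggerKind
    succeeded: bool
    error: Optional[str] = None


def run_operation(
    name: str,
    trigger: TriggerKind,
    work: Callable[[], object],
    publish: Callable[[CompletionEvent], None],
) -> object:
    """Execute `work` in response to a trigger and always publish the outcome."""
    try:
        result = work()
    except Exception as exc:
        # Failed outcome: publish the event, then let the exception propagate.
        publish(CompletionEvent(name, trigger, succeeded=False, error=type(exc).__name__))
        raise
    publish(CompletionEvent(name, trigger, succeeded=True))
    # A synchronous caller also receives the result as its response.
    return result
```

In this sketch the completion event is published for both outcomes, so a consumer of the events never has to infer a failure from the absence of a message.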
A command is an operation that modifies one or more resources in the application. Commands can be synchronous, in which case they may return a result (if one is required) to the caller in the form of a response, or asynchronous, with no return value. The outcome of an asynchronous command should always be published as an event, whether it succeeded or failed.
Queries are operations that do not modify the resources in the application. They are intended to retrieve data from the application, but should also raise events in order to generate insights into user behaviour for platform improvement.
Example: How many times a user makes use of a feature (searching invoices, paging through search results, and so on). Queries will probably also generate data in the form of exceptions when the user does not have permission to execute an operation, for security, auditing and monitoring purposes.
All commands and queries must raise events in order to generate enough data for observability and behaviour analysis by the data platform team.
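To make the distinction concrete, the sketch below shows a command that changes state and a query that does not, with both raising events. It is written in Python purely as an illustration; `InvoiceService`, `InvoiceAdded` and `InvoicesSearched` are hypothetical names and an in-memory list stands in for the real event publisher.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class InvoiceService:
    """Hypothetical service holding invoices in memory for the example."""
    invoices: Dict[str, float] = field(default_factory=dict)
    events: List[Tuple[str, object]] = field(default_factory=list)

    def add_invoice(self, invoice_id: str, amount: float) -> str:
        """Command: modifies a resource and raises an event with the outcome."""
        self.invoices[invoice_id] = amount
        self.events.append(("InvoiceAdded", invoice_id))
        return invoice_id

    def search_invoices(self, min_amount: float) -> List[str]:
        """Query: reads data only, but still raises an event for usage insights."""
        matches = sorted(
            invoice_id
            for invoice_id, amount in self.invoices.items()
            if amount >= min_amount
        )
        self.events.append(("InvoicesSearched", len(matches)))
        return matches
```

The query leaves `invoices` untouched, yet the `InvoicesSearched` event still records that the feature was used, which is the kind of data the behaviour analysis relies on.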
Application events are generated by the application to notify external services that an action has taken place. The event indicates whether an operation completed successfully or failed, and should contain the context information required to identify the resource that was changed (if any), the operation that raised the event and, where possible, the id of the user who requested it.
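One possible shape for such an event is sketched below in Python. The field names and the JSON serialisation are assumptions made for illustration; the document does not prescribe a schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ApplicationEvent:
    """Hypothetical event payload carrying the context described above."""
    name: str                           # e.g. "ItemUpdated"
    operation: str                      # operation that raised the event
    succeeded: bool                     # outcome of the operation
    occurred_at: str                    # UTC timestamp in ISO 8601 format
    resource_id: Optional[str] = None   # resource changed, if any
    user_id: Optional[str] = None       # user who requested the operation, if known


def make_event(
    name: str,
    operation: str,
    succeeded: bool,
    resource_id: Optional[str] = None,
    user_id: Optional[str] = None,
) -> ApplicationEvent:
    """Build an event stamped with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return ApplicationEvent(name, operation, succeeded, now, resource_id, user_id)


def to_json(event: ApplicationEvent) -> str:
    """Serialise the event for publishing to external services."""
    return json.dumps(asdict(event), sort_keys=True)
```

The optional fields mirror the wording above: not every operation changes a resource, and the requesting user is not always known.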
Exceptions are an execution flow mechanism used to interrupt the current processing flow, either because the application or one of its dependent components behaved unexpectedly and cannot proceed, or because the application logic is aware that it cannot proceed, since doing so will or may cause issues. All exceptions raised in an application or its dependencies are logged by the infrastructure components, so adding a log entry to report that an exception has happened is not required and would make the logging redundant.
Exceptions raised by the application must be specific and uniquely identifiable across the platform, and must clearly describe the problem that caused them. Generic exceptions should be avoided because they make it difficult to implement proper exception handling logic in the application, and also make it difficult to aggregate recurrent exceptions in the logging platform. Unique exceptions facilitate aggregating common issues for monitoring and alerting.
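The benefit of specific, uniquely identifiable exceptions can be illustrated with a short Python sketch. The class names and the `APP-xxxx` code format are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Iterable


class ApplicationError(Exception):
    """Base class for application exceptions; each subclass has a unique code."""
    code = "APP-0000"  # hypothetical code format


class ItemDoesNotExistError(ApplicationError):
    code = "APP-1001"

    def __init__(self, item_id: str) -> None:
        super().__init__(f"Item '{item_id}' does not exist.")
        self.item_id = item_id


class ItemSoldOutError(ApplicationError):
    code = "APP-1002"

    def __init__(self, item_id: str) -> None:
        super().__init__(f"Item '{item_id}' is sold out.")
        self.item_id = item_id


def count_by_code(errors: Iterable[ApplicationError]) -> Counter:
    """Aggregate recurring exceptions by their unique code."""
    return Counter(error.code for error in errors)
```

Had both failures been raised as a generic exception, the aggregation step would have had to parse message text to tell them apart.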
Exception types can be classified as Application, Infrastructure or BCL.
Application: Exceptions raised by the application to interrupt an operation when the state of a resource or the user's permissions are not valid for completing the operation. These are commonly mapped to business requirements implemented as code in the application logic.
Infrastructure: Exceptions raised by the base packages used to support the application. They provide useful information to the application about issues it cannot handle by itself. In general, these exceptions relate to data access problems and to messaging and communication issues in functionality implemented by these base packages. The application can either handle these exceptions and raise a custom, application-specific exception, or let the infrastructure exception propagate and show the user a generic error message.
BCL: Base exceptions raised by the .NET framework, commonly propagated to the application code to signal an unexpected failure in base class library code, such as a failed type conversion, an invalid operation and so on. In general, these exceptions should be handled properly in the application to prevent them from propagating to the user, so that the user receives more useful information.
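The three categories, and the advice to translate base library failures into application-specific exceptions, can be sketched as follows. The example is in Python, so the built-in `ValueError` stands in for a BCL exception; all class and function names are hypothetical.

```python
class ApplicationError(Exception):
    """Raised by application logic (business rules, permissions)."""


class InfrastructureError(Exception):
    """Raised by shared base packages (data access, messaging)."""


class DataAccessError(InfrastructureError):
    """Hypothetical infrastructure exception for storage failures."""


class InvalidQuantityError(ApplicationError):
    """Hypothetical application exception for an unusable quantity value."""


def parse_quantity(raw: str) -> int:
    """Convert user input, translating the base library failure."""
    try:
        return int(raw)
    except ValueError as exc:
        # Stand-in for a BCL conversion failure: wrap it in a specific
        # application exception instead of letting it reach the user.
        raise InvalidQuantityError(f"Quantity '{raw}' is not a whole number.") from exc


def classify(exc: Exception) -> str:
    """Return the category of an exception as described above."""
    if isinstance(exc, ApplicationError):
        return "Application"
    if isinstance(exc, InfrastructureError):
        return "Infrastructure"
    return "BCL"
```

Chaining with `from exc` keeps the original failure attached as the cause, so the logged exception still shows where the problem started.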
Definition: The CorrelationId is a unique identifier used to track a triggered operation and its chain of events and exceptions.
Usage: Every operation triggered by an event needs to be identified individually. To make this possible, we assign a CorrelationId to the operation, so that if two events trigger an operation, we can identify the chain of events and exceptions triggered by each event using its own CorrelationId.
A CorrelationId can either be used for the entire flow of a process or be split into one CorrelationId per stage.
- The CreateOrder operation is triggered by the customer at the checkout of an e-commerce site; this operation raises the OrderCreated event.
- The Stock application receives the event raised by the checkout application and triggers the ReserveStock command.
Here, a single trigger started two operations as part of one workflow. Using a single CorrelationId in both phases of this workflow makes it much easier to track all events in a chain across different systems. On the other hand, with this approach all events share the same CorrelationId, which can become bloated in very verbose systems. Other scenarios to consider are:
- The CorrelationId is provided by a client that did not get a response in time and timed out, and that retries the request using the same CorrelationId.
- An event might be processed by multiple handlers at the same time, creating a chain of events interleaved across different handlers.
- The handling of a message cannot complete because of the state of the application, connectivity or other issues, and processing of the event is re-attempted using the same CorrelationId.
In these scenarios, it would be useful to generate one CorrelationId per phase:
- CreateOrder generates a CorrelationId and passes it to the event it raises.
- ReserveStock captures the CorrelationId of the original event and logs the transition between the old and the new CorrelationId before handling the event with the new one.
The current sample solution implements one CorrelationId for the entire flow. It is up to the implementer to decide how to track the correlation between services.
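Both strategies can be compared in a short Python sketch. It is not the sample solution's code: `Event`, `create_order`, `reserve_stock` and the `StockReserved` event name are hypothetical, and a plain list stands in for the logging platform.

```python
import uuid
from dataclasses import dataclass
from typing import List, Tuple

LogEntry = Tuple[str, str]  # (correlation_id, message)


@dataclass(frozen=True)
class Event:
    name: str
    correlation_id: str


def create_order(log: List[LogEntry]) -> Event:
    """First phase: generate a CorrelationId and pass it to the raised event."""
    correlation_id = str(uuid.uuid4())
    log.append((correlation_id, "CreateOrder started"))
    return Event("OrderCreated", correlation_id)


def reserve_stock(
    event: Event, log: List[LogEntry], new_correlation_per_phase: bool = False
) -> Event:
    """Second phase: reuse the incoming CorrelationId or start a new one."""
    correlation_id = event.correlation_id
    if new_correlation_per_phase:
        correlation_id = str(uuid.uuid4())
        # Log the transition so the two phases can still be linked.
        log.append((correlation_id, f"Continues from CorrelationId {event.correlation_id}"))
    log.append((correlation_id, "ReserveStock started"))
    return Event("StockReserved", correlation_id)
```

With the default argument the whole flow shares one CorrelationId; with `new_correlation_per_phase=True` each phase has its own id and the logged transition is the only link between them, so it must not be omitted.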
Definition: The OperationCode is a unique code used to track an operation type.
Usage: Every operation has a type.
Example: Create Item, Delete Item, Update Item, Query for Items and so on. Each of these types is given a code that the logging handler uses to record the chain of events, exceptions and other information generated by an operation. The code is useful for decoupling the operation name from other resources.
- The UpdateItem command raises the ItemUpdated event when it succeeds; if the item does not exist, an ItemDoesNotExistException is raised.
- The ReserveItem command can raise the ItemUpdated and ItemSoldOut events when it succeeds; if the item does not exist, an ItemDoesNotExistException is raised.
In both cases, the same application events and exceptions can be raised, but they belong to different contexts: one operation updates the item directly and the other indirectly.
Adding the OperationCode to these events makes it easier to:
- Identify the context of an event or exception
- Group and track the events or exceptions generally raised by one operation type
- Prevent operations with the same name in different contexts from being treated as the same operation
- Simplify the tracking of existing operations by keeping a centralized list of operation codes
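The points above can be illustrated with a small Python sketch in which the same ItemUpdated event is raised by two different operations. The numeric values and the `LoggedEvent` type are assumptions made for the example; only the operation and event names come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable, List


class OperationCode(IntEnum):
    """Centralized list of operation codes (values are hypothetical)."""
    UPDATE_ITEM = 1001
    RESERVE_ITEM = 1002


@dataclass(frozen=True)
class LoggedEvent:
    name: str
    operation_code: OperationCode


def group_by_operation(events: Iterable[LoggedEvent]) -> Dict[OperationCode, List[str]]:
    """Group event names by the operation type that raised them."""
    grouped: Dict[OperationCode, List[str]] = defaultdict(list)
    for event in events:
        grouped[event.operation_code].append(event.name)
    return dict(grouped)
```

Because each logged event carries its OperationCode, the two ItemUpdated entries end up in different groups even though their names are identical.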
Unique Ids and Codes (exception, event and operation codes)
As described above, an application creates or handles operations, events and exceptions that generate log entries in the logging platform. These log entries are used to generate reports and alerts. To make that possible, they must be uniquely identifiable or easy to aggregate.
The names of exceptions, events and possibly operations might conflict with names defined in different domains (e.g. the AddAttachment command or the AttachmentAdded event). Using the fully qualified name, including the namespace, would be a simple way to differentiate the activity of each domain. In some cases, however, such as base packages shared by multiple services, exceptions or events might share the same namespace, which makes it difficult to track their relationships.
Another scenario is linking exceptions and events to the operations that triggered them. We could correlate them through the CorrelationId and by tracking the operation that started the flow; an alternative is to annotate the logs with the operation name.
To reduce this complexity, we will add unique ids and codes to exceptions, events and operations in order to uniquely identify them. These ids will be attached to all log entries in order to:
- Link events and exceptions to the operations that triggered them, simplifying the reporting, monitoring and alerting processes.
- Reduce the size of log entries by replacing a long name with an integer.
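A central registry is one way to guarantee that the codes stay unique. The Python sketch below is an assumption about how such a registry could look, not a description of the platform: `CodeRegistry`, `log_entry`, the `Invoices.` namespace and the numeric values are all hypothetical.

```python
from typing import Dict


class CodeRegistry:
    """Maps fully qualified names to unique integer codes."""

    def __init__(self) -> None:
        self._code_by_name: Dict[str, int] = {}
        self._name_by_code: Dict[int, str] = {}

    def register(self, qualified_name: str, code: int) -> None:
        """Register a name, rejecting duplicate names and duplicate codes."""
        if qualified_name in self._code_by_name:
            raise ValueError(f"'{qualified_name}' is already registered.")
        if code in self._name_by_code:
            raise ValueError(f"Code {code} is already used by '{self._name_by_code[code]}'.")
        self._code_by_name[qualified_name] = code
        self._name_by_code[code] = qualified_name

    def code_for(self, qualified_name: str) -> int:
        return self._code_by_name[qualified_name]

    def name_for(self, code: int) -> str:
        return self._name_by_code[code]


def log_entry(registry: CodeRegistry, operation: str, event: str, correlation_id: str) -> dict:
    """Build a compact log entry that carries integer codes instead of long names."""
    return {
        "operation_code": registry.code_for(operation),
        "event_code": registry.code_for(event),
        "correlation_id": correlation_id,
    }
```

Reports can translate the integers back with `name_for`, so the shorter log entries lose no information.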