Regarding Resilient Software – Asynchronous Interfaces
This is the third post in the series regarding resilient software. For a definition of resilient software, see what resilient software means.
I want to expand on a topic I touched on briefly in the last post on resiliency, which dealt with configuration. In that post I mentioned that the configuration store should never be on the critical path and made the following (somewhat refined here) observation:
The more critical dependencies our software has that are external and synchronous the less resilient it will be.
There are three components that make up the dependencies in this observation. First they are external. An external dependency can be considered any system that our software requires services from. Since the system is external it is running in a different OS process than our software and communication will typically occur over a network. Second the dependency is synchronous. This means our software has clients that are in a waiting state until this external system call returns and our software can respond back to the client. Lastly the dependency is considered critical. If the external system call doesn’t work then our software fails to do its job. Operators are notified. Trouble tickets are generated. Support gets brought in. Chaos ensues.
At this point you might be thinking that my observation is not exactly groundbreaking news. As we talked about before, software resiliency is all about our software being able to recover from misfortune or change. If our software isn’t able to successfully complete its job when these dependencies fail then it isn’t being resilient. The purpose of this post however is not to prove this observation (I think it’s self-evident). Instead I want to talk about what can be done to overcome it.
To start with let’s make two assumptions. First, the number of external dependencies that your software has is fixed and there is nothing you can really do about it. Many of these systems you interact with will be the sources of data that you need and you’ll have no option to refactor this functionality internally. Even if you can technically refactor it, there are often political reasons preventing this as well (and thus the enterprise’s calculator web service is born). Second, whether or not external system interfaces are considered critical is fixed and there is nothing you can really do about it. The criticality of an external service has to do with the nature of your software’s job and how this service factors into it getting completed. You might on rare occasions get to redefine the nature of your software’s job but that’s cheating. That leaves only one factor in this resiliency equation that we have any control over. The more of your critical interfaces to external systems that you can make asynchronous (and not synchronous), the more resilient your software will be.
Synchronous And Asynchronous Interfaces Defined
In the diagram below system A requires some service from the external system B. When A makes the request from the thread of execution t1, system B does not return from the call until it is finished processing. During this processing time t1 is in a blocking state and no further processing on system A (for this thread) occurs. Once B is finished it returns the processing result along with any data that A requires and t1 is released from the blocking state. It can now finish processing based upon the result of the call to B and any returned data. In this scenario system A is synchronized to the processing of the request by system B because it waits for B to return.
An asynchronous call from one system to another is characterized by the calling system not waiting on the receiving system to finish processing. In the diagram below system A makes a call to system B in the thread of execution t1 just like in the synchronous example. However in this scenario system B only stores the request from system A for later processing. It then immediately responds back to system A that the message has been saved. At this point the thread of execution t1 picks back up processing but it only knows that the message has been saved and not what the result or return data is yet. Based upon this limited information t1 finishes its processing.
At some point later system B retrieves the stored message and processes the request. When processing is complete system B initiates a call back to system A with the result and data. System A now processes the result in a new thread of execution t2.
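The two call styles described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the class SystemB, its methods process, submit and drain, and the in-memory queue are all hypothetical names standing in for a real external system and its message storage.

```python
import queue
import threading


class SystemB:
    """Hypothetical external system that supports both call styles."""

    def __init__(self):
        # In-memory stand-in for wherever B stores requests for later.
        self._stored = queue.Queue()

    def process(self, request):
        # Synchronous style: the caller's thread (t1) blocks here
        # until B has finished and returns the result.
        return f"processed:{request}"

    def submit(self, request, callback):
        # Asynchronous style: store the request and acknowledge at once.
        # The caller learns only that the message was saved.
        self._stored.put((request, callback))
        return "saved"

    def drain(self):
        # Later, on B's own schedule: process each stored request and
        # call back to A with the result (handled on thread t2).
        while not self._stored.empty():
            request, callback = self._stored.get()
            callback(self.process(request))


if __name__ == "__main__":
    b = SystemB()
    print(b.process("sync request"))            # t1 waits for the result

    results = []
    print(b.submit("async request", results.append))  # t1 gets an ack only
    t2 = threading.Thread(target=b.drain)
    t2.start()
    t2.join()
    print(results)                              # result arrived via callback
```

In the asynchronous case system A has to finish its work on t1 knowing only "saved"; the actual result shows up later through the callback.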
Making External Dependencies Asynchronous
It’s been stated that if we are trying to achieve resiliency then we want our external dependencies to be asynchronous but what does this mean? In the context of this discussion an external system dependency is considered to be asynchronous if it processes our requests asynchronously from client interaction with our software. Clients of our software (be they other software systems or humans) provide the only reason our software has to exist. It is from the client’s viewpoint alone that our software will be judged to be resilient or not. Using this definition there are two different ways to implement asynchronous external system dependencies.
The first type of implementation has the asynchronous behavior fully supported by the external system. In this implementation the interface between our software and the external system is asynchronous. When our software receives a request from a client it triggers the external system call. When the external system receives the request it stores the request in a message store and immediately responds back to our software before the message is processed. The message store is a mechanism for storing and retrieving messages in a highly reliable manner. The message store could be implemented in the same process as the external system or it could be a separate system. Typically this will be implemented as a separate database, a JMS Server, or some other message server technology. Regardless of the implementation our software interacts with the message store as though it was part of the external system.
As soon as the message is saved, the external system responds back to our software and it finishes its processing, returning a response to the client. Since our software doesn’t know the result of the external system call yet our response to the client may be limited. At some time in the future the external system processes the request and initiates a callback to our software with the result. At this point our software notifies the client of the result as well. If our client is another software system our software might initiate this as a system call in a similar manner to how the external system called us. If the client is a human, emails may be sent out or, if this is a web application, the status of the external system result may be made available somewhere that the user can see. Since the client doesn’t wait for the external system to process the request this implementation successfully makes the external system dependency asynchronous.
All of the asynchronous behavior is implemented by our software in the second type of implementation. In this scenario the external system only supports a synchronous exchange with our software. In the diagram below a client initiates a request to our software that triggers an external system request. Instead of sending the request directly to the external system, our software saves this message in a message store that is considered part of its internal system. As in the external case, this doesn’t necessarily mean the message store is running in the same process as our software but that from the viewpoint of the client and the external system this is all one logical system. Once the message is saved our software finishes processing the client’s request and returns a response to the client. As in the external implementation case at this point our software doesn’t know the result of the external system request so the response to the client may be limited.
At some time in the future our software retrieves the message from the message store (on a different thread of execution) and sends this message synchronously to the external system. Once the external system is finished processing the request it returns with the result. At this point our software can initiate a callback to the client with the resulting data in the same manner as the external asynchronous implementation. Note that although the interface between our software and the external system is synchronous the client interaction is identical to the previous implementation. Since the client interacts asynchronously with the external system call this implementation successfully makes the external system dependency asynchronous.
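Here is a rough sketch of the second (internal) implementation. Everything named here is an assumption made for illustration: OurSoftware, handle_client_request, external_system_call and shutdown are hypothetical, and the in-memory queue.Queue only stands in for a real message store, which would need to be durable (a database or a message server).

```python
import queue
import threading


def external_system_call(payload):
    """Stand-in for the external system's synchronous interface."""
    return payload.upper()


class OurSoftware:
    def __init__(self, external_call):
        self._external_call = external_call
        # In-memory stand-in for a durable message store.
        self._store = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def handle_client_request(self, payload, on_result):
        # Save the message, then answer the client right away with the
        # only thing we know: the request has been accepted.
        self._store.put((payload, on_result))
        return "accepted"

    def _drain(self):
        # Runs on a different thread of execution than the client request.
        while True:
            item = self._store.get()
            if item is None:        # sentinel placed by shutdown()
                break
            payload, on_result = item
            # Synchronous call to the external system, but no client
            # is waiting on it any more.
            result = self._external_call(payload)
            on_result(result)       # callback to the client

    def shutdown(self):
        # Let queued messages finish, then stop the worker.
        self._store.put(None)
        self._worker.join()


if __name__ == "__main__":
    received = []
    app = OurSoftware(external_system_call)
    print(app.handle_client_request("order-1", received.append))
    app.shutdown()
    print(received)
```

From the client's side this looks the same as the first implementation: an immediate acknowledgment, then a callback later.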
Why Asynchronous External Dependencies Promote Resiliency
We now know what asynchronous interfaces are and how we can make our dependencies asynchronous. It’s been asserted that asynchronous dependencies promote resiliency but with no real evidence to back this up. The following are the top reasons that asynchronous external dependencies increase resiliency.
Allows For Delayed Retries
One of the good things about synchronous interfaces is that you immediately know the result of the external system call. This is especially nice when the external system call works as expected. You can use the return data (or just the knowledge that the message has been processed) to finish the job and complete the transaction with the client. However if the call fails our software must now decide what to do. One option would be to just fail the job and return some type of error result back to the client. If we want our software to be resilient however this is not an option that we care to (easily) consider. Another option might be to just retry sending the request. Maybe the external system had some dependency of its own that had crashed but is now back up or maybe the external system was in the middle of an upgrade that just finished. Time heals all wounds. If we decide to resend the message and that fails too should we wait a little bit before trying again? Remember our external dependency is synchronous so we have clients waiting on the response back from us. We can’t keep them waiting indefinitely.
Making the external dependency asynchronous allows the request to be retried without blocking our software’s main thread of execution (that a client is waiting on). Once the message is stored (either by our software or the external system) it can be retrieved and processed as many times as required to successfully process the message. When failures occur, longer delays can be inserted between retries to give the system a chance to recover (either on its own or with the help of support). Being able to recover from misfortune or change is one of the key characteristics of any resilient software system.
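A delayed retry loop of this kind might look roughly like the following. The function name process_with_retries, its parameters, and the doubling delay schedule are all my own assumptions for the sake of the sketch; a real system would also want to distinguish errors worth retrying from ones that are not.

```python
import time


def process_with_retries(message, send, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Try to deliver a stored message, waiting longer after each failure.

    Returns whatever send(message) returns. If every attempt fails, the
    last error is re-raised so the message can stay in the store and
    humans can get involved.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message)
        except Exception:
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base, 2*base, 4*base, ...
            # giving the external system more time to recover.
            sleep(base_delay * 2 ** (attempt - 1))
```

Because this runs on the thread that drains the message store, no client is blocked while it sleeps. The sleep parameter is injectable only so the delays can be observed in a test without actually waiting.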
Reduces Potential Timeout Failures
Another concern with synchronous dependencies is timeouts. Timeouts can occur when the external system takes longer than normal to process the request. It could be that the number of requests the external system is receiving has increased and the system is struggling to keep up. Maybe another process that lives on the same machines as the external system is eating all the CPU cycles. Timeouts can happen for any number of reasons really. The only thing you can be sure about is that at some point it will happen.
There are two types of timeouts: network and user. The majority of external system interfaces are going to occur over the network. For 99% of us this means a TCP/IP network. When you make a request over the network using TCP (and if you’re not using TCP then the dependency probably isn’t critical) there is a timeout associated with that request. If the external system takes longer than this timeout to return, your machine will cancel the request and our software will probably receive some type of network timeout exception. At this point our software is in a state of limbo. It might be that the message was received and is still being processed; however, it’s also possible that the external system was so far backed up that the message was never received and won’t be processed at all. There’s nothing really that can be done at this point other than to error off and get some humans involved (captain chaos to the rescue).
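The state of limbo can be illustrated with a small sketch. Note the assumptions: this does not use a real network. A worker thread and a timed wait stand in for the TCP request and its timeout, and the names call_external and slow_external_system are hypothetical.

```python
import concurrent.futures
import time

completed = []  # what the "external system" actually finished


def slow_external_system(request):
    """Stand-in for an external system that is running behind."""
    time.sleep(0.2)
    completed.append(request)
    return f"done:{request}"


def call_external(fn, request, timeout_s):
    """Make a blocking call but give up waiting after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, request)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        # Limbo: we stopped waiting, but we cannot tell from here
        # whether the request was (or will be) processed.
        return ("unknown", None)
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call
```

In this sketch the caller gets "unknown" back while the slow call still goes on to complete, which is exactly the ambiguity that makes a timeout on a synchronous critical call so hard to handle automatically.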
The timeout for a network request is configurable. One way to try and avoid these timeouts is to set the timeout parameter really high. However by doing that you now have to worry about user timeout. User timeout can occur when somewhere upstream from our software there is an actual person waiting for this synchronous process to complete. Setting the timeout parameter to 2 hours may eliminate all your occasional network timeouts but you can’t expect your user to wait that long. At some point the user will give up and the software is considered broken even if it would have eventually returned (remember that resiliency is judged from our client’s viewpoint).
Making our external dependencies asynchronous removes the concern of client or user timeout. Our clients now wait only for as long as it takes to save the message in the message store. Storing the message should be a very low overhead operation and one that even under heavy load can be performed quickly. Once the message is stored it’s much less of an impact if the external system occasionally takes a longer than normal time to process the request because nobody is synchronized to its response.
Less Exposed To Network Failures
Sometimes the client accessing your software is not an actual user. In these cases keeping all the interfaces synchronous and configuring them with a higher than expected timeout may be an acceptable solution. In my experience if your timeout is measured in minutes then things will probably be okay. However trying to keep a network connection open for a really long time (hours) without it closing/crashing can be difficult. This is particularly true if there is no data flowing back until the end (which will normally be the case unless the external system is streaming something back). Security policies are one reason this can occur. If the external system request is routed through a firewall, the firewall will often have a maximum connection open time. Everything else I just chalk up to general network glitches (i.e. I don’t really know a lot about networks).
Implementing our interface in an asynchronous fashion removes this concern. As we just discussed, storing the message in the store should be a quick operation. Even for very large messages the external system should be able to acknowledge back to our software in seconds. At this point the connection is closed and our software is no longer exposed to network instabilities. Both implementation types remove exposure to network failures relative to the client. The internal asynchronous implementation however is still exposed to this type of failure because it makes a synchronous request to the external system.
Asynchronous Behavior Adds Complexity
Making external dependencies asynchronous increases resiliency in our software but we don’t get this for free. Asynchronous interfaces are significantly harder to design and implement than synchronous ones. As will be the case with many of the things that promote software resiliency the price you pay is more complexity. Some of the reasons that making our external dependencies asynchronous introduces more complexity include:
Requires Message Store Implementation
Asynchronous interfaces require a message store implementation on one side or the other. This is typically implemented with a database or by introducing a new system into the landscape that is designed specifically for this (MSMQ, JMS Server, etc.). Either way this adds complexity to the overall systems interaction, as there are now more touch points to configure and maintain.
Multi-Threaded Programming Model
From a programming standpoint synchronous interfaces are nice and simple. When a call to an external system is made the response from that system is returned within the same thread of execution. All the data and state (or context) that led up to the external call is readily available for the programmer to finish the processing with.
Asynchronous interfaces require a different programming model. When an external call to an asynchronous interface is made an acknowledgment of the message is received and the software must now wait for the response back from the external system. The callback from the external system is processed on a different thread of execution and the original context from the message request is no longer available. Instead this context must now be recreated based upon identifiers found within the return data and the previously saved state. We may now have synchronization issues as well. For example what happens if we get the callback from the external system before we had a chance to save context state from the original calling thread? Introducing more threads to a software system ALWAYS increases complexity.
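One common way to recreate the context is to tag each request with an identifier and save the context under that identifier before the request is sent. The sketch below assumes exactly that; the class PendingRequests, its methods, and the transport callable are hypothetical names, and the dictionary stands in for whatever saved state a real system would use.

```python
import threading
import uuid


class PendingRequests:
    """Keeps the context of in-flight requests, keyed by an identifier."""

    def __init__(self):
        self._contexts = {}
        self._lock = threading.Lock()

    def send(self, payload, context, transport):
        correlation_id = str(uuid.uuid4())
        # Save the context BEFORE the request leaves, so that even a very
        # fast callback always finds the state it needs.
        with self._lock:
            self._contexts[correlation_id] = context
        transport(correlation_id, payload)
        return correlation_id

    def on_callback(self, correlation_id, result):
        # Runs on a different thread than send(); rebuild the context
        # from the identifier carried in the return data.
        with self._lock:
            context = self._contexts.pop(correlation_id, None)
        if context is None:
            return None  # unknown or already-handled callback
        return {**context, "result": result}
```

Saving first and sending second is what avoids the race described above. If the order were reversed, a callback could arrive while the context was still missing.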
Asynchronous User Interaction
Sometimes the external system that our software interacts with will be returning data (as opposed to just a status) that is returned back to the client to process in some manner. For asynchronous dependencies this is a concern especially if the client is a user. For example, let’s assume that our software is a web application (I know – it’s a stretch) and our client is a user working from within a browser. Let’s also assume that the external system our software interacts with returns some data that the user needs to be able to view from within our application.
In the synchronous interface example when the user issues a browser request that triggers the external system logic of our software the browser request doesn’t return until the external system is done processing. At that point our software has the return data from the system on hand and can immediately display this to the user as part of the response. In the asynchronous interface example, when we receive the browser request and make the external system call we only know it was received and will be processed. We don’t know what the return data will be so what do we return to the user (remember we can’t keep them waiting indefinitely)?
In this scenario the only thing we can return to the user is what we know: the request has been received and will be processed later. Our application will need to make the response data viewable by the user when it receives the callback from the external system. When the callback is received maybe our software will send the user an email or maybe it will make this information available on the user’s home page so they can see it the next time they log in. The point is that now the user must interact with our application in an asynchronous fashion.
Certainly for most users this wouldn’t be a completely foreign concept. If you’ve ever bought something on Amazon you’re familiar with this type of interaction. After you make your purchase you don’t know for sure that the transaction has gone through until you get an email saying yes or no. But it’s still a more complex interaction for the users than the synchronous example. In the Amazon example you’re probably averaging less than one transaction a day so when you get the email notification you know exactly what it’s for. Imagine though that for our software the user averages 50 or 100 transactions a day. In the same way the software must recreate the context when it receives a callback from the external system, the user must mentally recreate their context when they receive this notification and process it accordingly.
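As a rough sketch of the home page approach, the application could keep a small status record per request and include a description with each one, so that a user with many requests a day can tell them apart. The class RequestStatusBoard and its methods are hypothetical names chosen for this example, and the in-memory dictionary stands in for real storage.

```python
class RequestStatusBoard:
    """Tracks what each user has asked for and what came back."""

    def __init__(self):
        self._requests = {}  # request_id -> record

    def accept(self, request_id, user, description):
        # Called while handling the browser request: we can only report
        # that the request was received.
        self._requests[request_id] = {
            "user": user,
            "description": description,
            "status": "pending",
            "result": None,
        }
        return "Your request was received and will be processed later."

    def complete(self, request_id, result):
        # Called when the callback from the external system arrives.
        record = self._requests[request_id]
        record["status"] = "done"
        record["result"] = result

    def home_page(self, user):
        # What the user sees the next time they log in.
        return [
            (rid, r["description"], r["status"], r["result"])
            for rid, r in self._requests.items()
            if r["user"] == user
        ]
```

The description field is there purely to help the user recreate their own context when the result finally shows up.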
Client And Server Required On Both Sides
In the synchronous scenario our software plays the role of the client and the external system plays the role of the server. In the asynchronous scenario our software must be a client (when making the external system call) and also a server (when receiving the callback message). The external system must now also be a server and client. This adds complexity to our software because the implementation of both sides has to be developed and maintained. It also can add complexity to the network topology. If one or more firewalls sit between our software and the external system they must now allow traffic to be initiated in both directions. If the clients of our software are other software systems this same type of complexity shows up here. Our software must now be both a server and a client to our software system clients (and vice versa).
This added complexity between our software and the external system only exists in the external asynchronous behavior scenario. One advantage of implementing the internal asynchronous behavior is that from the network’s standpoint and the external system’s standpoint the interface behaves synchronously. This means our software doesn’t have to be a server to the external system and the external system doesn’t have to be a client to our software. Note this decrease in complexity comes with the price of being more exposed to network failures since the interface with the external system is synchronous.
When Should External Dependencies Be Made Asynchronous?
There is no exact formula to determine if the resiliency gained by using asynchronous interfaces is worth the cost in complexity. This will be a recurring theme for most of the regarding resiliency posts. Software architecture is all about tradeoffs and very rarely does it provide us clear-cut, obvious solutions. With that being said I will leave this discussion with the following set of questions that can get you started thinking about these tradeoffs and deciding if asynchronous dependencies make sense for your situation.
Is Resiliency Really Required?
Asking this question will be another common theme in this series. It’s great that your software application can claim five nines availability but it’s an internal employee directory web app that averages about 10 hits a day. It’s probably going to be okay if it’s down on occasion. As mentioned many times before, adding resiliency to software is very rarely free. There is almost always the cost of added complexity. You should always weigh these costs in complexity against the cost of your software not being resilient before making this decision.
Is The Interface Write Only?
The best system interface candidates to make asynchronous are write-only ones. If our software (and the client) doesn’t need any return data from the external system then the callback step can be removed from the sequence. This will cause the asynchronous dependency to behave more like a synchronous interface since after the message is stored the transaction is complete. Write-only asynchronous interfaces remove three of the four complexity issues (multi-threaded programming model, asynchronous user interaction, and client and server required on both sides) while still providing all the resiliency advantages. A message store implementation is still required but if resiliency is really a requirement this will often be worth the cost.
Can The Request Be (Safely) Processed Multiple Times?
One of the main advantages of using asynchronous interfaces is that it allows for retries when message processing fails. This gives time for the external system to recover when misfortune or change occurs. One thing to keep in mind is just because the external system can retry the request multiple times doesn’t necessarily mean that it should. For example, let’s assume that when the external system goes to process the request it needs to update three other systems. If the first two system updates work but the third fails the message can be left in the message store and tried later but what about the two systems that have already been updated? When the external system goes to retry the request are the two systems that have already been updated going to now get duplicate data?
To take advantage of message retries the external system needs to be able to keep duplicate data from being posted when failures occur. If there is only one update operation by the external system that either works or fails this shouldn’t be a problem. Otherwise the external system will need to account for partial failures. One way this can be accomplished is by running the entire request under a single transaction that the external system can roll back if one or more steps fail. Depending on the type of systems being updated this may or may not be an easy thing to do. Another possibility is that the entire request process is idempotent. Idempotence is a characteristic of mathematical operations where applying the operation multiple times gives the same result as applying it once. This characteristic can be applied to software operations and interfaces as well. Idempotent software interfaces are great candidates for making asynchronous because if there is a failure, the request can be retried later from the very beginning without having to worry about data inconsistencies.
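The three-system example can be made idempotent if each downstream update is keyed by the message identifier, so that repeating it leaves the same state rather than adding a duplicate. The sketch below assumes that design; DownstreamSystem, its apply method, the fail_times knob, and process are hypothetical names used only to simulate a partial failure.

```python
class DownstreamSystem:
    """Hypothetical system whose updates are keyed by message id."""

    def __init__(self, fail_times=0):
        self.records = {}
        self._fail_times = fail_times  # simulate temporary outages

    def apply(self, message_id, data):
        if self._fail_times > 0:
            self._fail_times -= 1
            raise ConnectionError("temporarily unavailable")
        # Keyed write: applying the same message again overwrites the
        # same record instead of creating a second one.
        self.records[message_id] = data


def process(message_id, data, systems):
    """Update every downstream system; safe to rerun from the start."""
    for system in systems:
        system.apply(message_id, data)
```

If the third system fails, the message stays in the store and process can simply be run again from the beginning. The first two systems end up with one record each, not two. Had apply appended to a list instead, the retry would have produced exactly the duplicate data the question above warns about.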






