a proposal for an pragmatic approach
Warning
|
This is a work in progress at a very early stage. As of now, I would not recommend reading this ;-) |
Author: Marc Gorzala
… you want to understand why it is in general not without risk to just repeat a POST-Call in case of errors
… why it is impossible to ensure that a REST Call reaches a REST Service exactly once
you want to understand why reading messages from a message bus (eg. from Kafka), could mean reading a message more than once
how to deal with those challenges. So how make sure, that things that must happen in remote systems also really happen?
you want to learn a little bit about the basics of distributed systems in general and delivery guarantees specifically.
Let’s have a look at the following distributed System of an e-commerce company:
In this scenario, we are looking at 5 Services that exchange messages we each other to facilitate the following 4 Use Cases.
The Customer browses products and puts what he wants to buy in a shopping cart.
All interactions of the customer with the shop were sent asynchronously to the Recommendation Service. Its purpose is to compute a list of products that the customer is likely to find useful. Those recommendations were sent asynchronously to the Shop.
Once the customer decides what he likes to buy, he enters the checkout. In our example that is a synchronous call to the Checkout Service. This in turn will synchronously invoke the Payment Service to get the money from the customer.
Once we get the money, we have to make sure to send this information to our accounting system for legal reasons.
In those four Use Cases we have pretty different demand with regard to the reliability to the exchange of messages.
As described the recommendation service is getting Information from the shop to computed recommendet products.
Is seems obvious that it would not harm if eg. every 100th message would get lost.
The recommendation Service would still be able to compute recommendations that are maybe a little bit less of value, but it would not be a big problem.
On the other, it is also clear that the payment-related process has a higher demand. A requested payment should almost always result in exactly one payment. Not zero payments and also not two or more payments.
Similarly, the information we send should reach the accounting system once. It would be acceptable to send it twice, but the information should rarely get lost.
You could argue that to just always implement the communication with a high reliability. These are computers, this should not be a problem. They are good at doing things right?
Unfortunately, it turns out, that it is not that easy. It comes at costs in the same way in terms of development costs, as the operational costs.
To see why achieving a high reliability is costly let’s move on to the next section, showing you the problem of the two generals.
In this scenario, you would have some microservices communicating with each other.
You have a shop system, that let the customer browser the all products and let him put them into a shopping cart.
All interaction of the customer will be pushed asynchronously to the Message-Bus. Those messages will be read by at least two systems.
The recommendation system, that wants to know which products the customer has watched and how long he has watched them. This enables the recommendation system to determine, what could be of interest for the customer.
Once the customer has everything he wants in his shopping cart, he will perform the checkout process. He will select his payment method, and finally pay via a call to a Payment-Service.
All those systems exchange messages. It is important to understand to be aware that this applies for eg. the synchronous Rest-Calls almost in the same way, as it applies for communicating indirectly via a message bus.
In the Rest-Call we can interpret the Rest-Request being one messsage that the Rest-Client will send to the Resource-Server and the Rest-Response as the message that the Resource-Server ist sending back to the Rest-Client.
So we will be in general in this series of blog entries only talk about systemen that exchange messages.
In this small example we will also see that the business has certainly different expectations about how reliable the exchange of the message has to be:
Let us see one way where it is needed to have a very reliable exchange of messages: the communication between the Checkout and the Payment-Services. It is obivous that each time the Checkout System want’s the Payment System to get Some money, that this really also will happen.
On the other hand would it be acceptable that the Recommendation System will loose some information of the Shop system. It will still be able to compute recommendations. Thouse would be maybe a little be less helpfull, but in practice this would be acceptable.
So we see, that different use case has different needs for correctnes with regards to the communication between systems.
You could argue: just make each communication reliable. So each time one System is sending a message, make exact that message reaches it’s destination once.
Unfortunatly that it not that easy to achieve as I thought initally.
I will show you different kinds of guarantees,
Let’s start with some kind of famous joke:
There are only two hard problems in distributed systems:
2) Exactly-once delivery 1) Guaranteed order of messages 2) Exactly-once delivery
When I started developing a distributed system, I was faced with the question, of what to do, when a Rest-Call does not return with some response.
My colleuge said to me, that it is impossible to know what the state of the Resource Server than could be. You would have to query the server. But if the failed Call was a Post we would not even have an Id to query for as that would have been in the response. So quering would not be so easy. So we would have to retry the call. But then the Call could effectivly happend two times. If was verwirrt. Until than I thought something like this should never happen. The call should either be successfull at whole (und damit habe ich ein erfolgreiches benachrichtigen des Aufrufers mit einbezogen) oder gar nicht erfolgreich (eg. if the response could not have been written).
My colleuge could not refere to some source that describes me why this is so disappointing, he just know that this just is so bad.
Another point that falls interesstingly into the same category was, that we as engeneers where told, that when ever we send out an event we should in general make sure that it realy goes out (at least one time) and in case that we consume an event, that should happen idempotent because we could read the very same event two times. Why this is the case, was also not clear to me and my colleagues.
As I/we felt these time so delivery getreiben, that I did not take time to research why this sit the case. Years later, in another company in another role I came to an interessing problem in computer science: The two general problem
This opened up the door for me understanding all those questions from above quite easily and answer also many other questions.
This is a common well-known problem in computer science. It is a kind of base problem when it comes to computers communicating with each other.
Let’s start.
Back in time, there were two armies.
Let’s call one of them foo and he other one bar.
Foo’s army want to attack Bar. But bar sits in a fortified castle in a valley. It is clear that it is only possible to be successful attacking bar by performing a coordinate attack at the very same sime from opposing edeges of the city of bar. So the Army of foo is split up in two parts. One is - let’s say - in the north of bar and another one in the south. Both parts of foo’s army have one general each. Those who want to communicate with each other, need to allign on one exact time to attack bar.
Here is where the problem begins. One of the two generals (it is general A) send a messenger to the other one (this is general B) to let him know that the planned time for the coordinates attack is Sunday at noon.
This messenger takes a horse and get on it’s way to geneneral B. On his way he has to pass the mountains next to the army of bar. The mountains itself are dangerous. The paths are small and the horse with the messenger can die. But also spies from bars army are in those moutains, waiting to messenger to kill them.
So General A, can not be sure that his messenger will survive and deliver the message to General B. So General A asks General B (included in the message), to send a confirmation message back.
Then General A could start with the attack on Sunday noon. But what with General B, he could not know it the confirmation was successfully received! So he could not know if General A will attack at Sunday noon. So he should better also not start attacking!
So General A could send a confirmation for the confirmation…
It becomes clear that this sending of confirmations could never end.
You should stop here, to try to find a solution. One the let the two generals find consensus on the question when to attack bar.
Even if you know that the problem is not solvable ;-)
It brings you insights.
Th — Refer to original paper but bring an easy proof here
Maybe I sould tell you first, what I mean with dilivery gurantees.
Whenver two systems in distributed systems talk with each other, they are exchangeing messages.
When we have a situation where system talk with each other via a message broker, this is quite intuitve.
One system sends a message to the broker, the other get’s it from the broker.
But also ordinary Rest-Call are nothing else then exchanging messages.
System A send eg. a POST-Call to another system. In this call are generally some information included, that the other system will read. This is the message, that A sends to System B. B on the other hand will receive this message, leave the connection open, will in general do something, like validate the message, persist the message, do some side effects and then will send a Response back via the open connection.
This response is again the next message. This Response will in the Post case include normally an Id of on created identifier in the case that the Post-Call was successful and in case of problems some information what went wrong.
Not all developer are aware that this ubiquitious Rest-Think is in that way also just exchanging messages. Neither a message broker ist needed for Systems to let them exchange messages, nor is theire any need to have it implemented in an asynchronous was.
It is just: whenever two systems interact in any way, they exchange messages and talking about delivery gurantees hold when system exhang messages:
we can have in general three guarantees. We will explain in terms of the two generals problem, and how that maps to our distributed services
This would mean, that General A send only one messenger to General B. General B could send an acknoledge, but regardless if General A reveives an acknowledge or not, it will not resend the message. So General A could be in a situation where it does not know if General B got the message. Still General A, must make a decision when to attack! So General A, is a situation where he has to live with the uncertainity that his message was lost.
General B is in a similar situation. In case he got a message he does not know wehter General A got the acknoledgement. Bad for him. So he will have to believe that the acknolege went through and attack in this trust!
Talking about Rest-Calls, at most once would be the situation when you just perform one let’s say Post-Call and ignore the Response. At least you will not issue another Post-Call in case of an error response.
So the system that issues the Post-Call can not be certain about the question that the call went through and was processed at the rest-service
This is the situation when General A is sending so long messengers (either paralell or sequentielly) as it receives for one message an acknoledge.
This could in theory never terminate, but in practice this will in general terminate eventually.
Keep in mind that even when General A then knows that General B got the message, unfortunately General B does not know that his acknowledgment went through. This is one of the things that have no solutions in the two general problem. In practice, we will see how to tackle this problem in real world.
Talking about Rest-Calls at least once would be the situation, when we implement a loop that tries to make a rest-call until that is succuessfull.
If a Rest-Call is successfull can in general only be found out, be the Response of the call. Unfortunatly, there could be Rest-Call, that were succuessfull, but the Response got lost. In that case our loop would make a call again even if one was succussfull.
This could be a problem. The Two-General-Problem is not a good example here, as it would not be a Problem for General B when it got two distinct Messengers with the same message.
But let us assume for the Rest-Service, that it is a Payent Service. It has a Post-Endpoint that can be used to create Outgoing Payment to Persons. If we just retry this Post-Endpoint as described, it could be that the call (the message) will be delivered two times and also processed two times. Leading in the person getting twice or even more the money!
An exactly once delivery, would mean that General A will just send a messager once to General B. We have just seen, that this is impossible.
Talking about Rest-Calls this is also just not possible. We can not gurantee to make really exactly one Rest-Call reaching the the Rest-Service! This also applies for all other communications between systems in not reliable networks!
But we can achive something, that is almost as good as exactly once delivery. It is often called exactly onces semantics. The idea is: by sending a message to another system we expect in general something to happen. This could be, that General B store in his calender the time to attack. Or make a payment to a customer, could be something. This something if often called the side effect. This was happens when we send the message. The name side effect is someway misleading here, as it normally is the main effect ;-) But this has sich eingebürgert ;-)
So, if we would gurantee that this side effect happens only once, when reading the message, we would be safe:
Storing the attack-time in the calender ist safe in the way. When we save two time the same time to attack, then the result would be the same, as when we only save the time once!
Sending out money to a customer is quite different here. Doing this more often than intended means just losing money ;-)
So we would need to make sure, that also some message were delivered possible more then once, we should either process them only once, or construct this procesing in that way that it will not lead to more than one time the side effect. (this is not really correct here. should it introduce idempotent behaviour earlier and than refer to this?)
I will tell you now, how to achieve this in a little abstract way just here and later how to achive this concrete with Rest and Kafka in later Parts of the Blog.
--- This is something that could be dropeed? It’s is someway interessting, but how can I include it?
But the Two generals could implement the following protocol two achieve something that is being called Exactly once semantics (in contrast to exactly once delivery). This means that, although the message will be received more than once the side effect that the message should cause (Tell General B, when to attackt) will be same when receiving the message multiple times or just once.
For that to happen, the Two Generals could allign on the following protocol (there are plenty of other protocolls).
Both general accept that one acknowledge will never be reacknoledged. So something like this will not happen:
General A send out 100 messengers at once. All with the same message. Then General A waits for acks.
General B waits for incoming messages. When the first one arrives he will ask the messenger how long hier trip was. Than i wait times as long. This will mean that he will wait for all messangers that are not more slower than 10 times slower than the fastest one.
When General B finished waiting, he can compute how long one average messenger needs for reaching him (remember that as average_time_to_travel). He also know how many of them got’s lost. (remembering that as succuess_ratio).
General B can now send back a number of messengers based on the sucuess ratio to General A. Based on the average_tie_to_travel he will now start waiting for possible retries of General A. As they aggreed that an ackk will not be acked again, he will just wait for normal messages. If those will now come in for let’s say average_time_to_travel * 3), then he will be sure, that a consensus was reached.
What you see, ist that in some way
Guarantee | Description | Pro’s | Con’s |
---|---|---|---|
At most once |
General A is sending just one messenger. It doesn’t really matter if General B sends an acknolege, as General A will never resend the message with another messenger. |
Not |
General A could not know, that B got the message when it did not r |
At least once |
A message if being delivered in case of problems the delivery will be r |
Column 3, row 2 |
Column 4, row 2 |
Column 1, row 3 |
Column 2, row 3 |
Column 3, row 3 |
Column 4, row 3 |
Point out, that every Rest-Call has the same problem. And not even every Rest-Call, just every communication between two systems. So also send Messages to a System like Kafka.
As we in general have quite many problems where it is very importand that a message really arrives, we should find a way out of this problem…
First, back to the problem with the two armies. Let’s think about way’s to solve the problem at least aproximately. So even if the two generals could not be 100% sure, let’s them be enough sure, to start that war with that uncertainity.
You can stop here and first look for solutions on your own. Maybe you will come up with similar ideas…
"Gang of Four" [kleppmann01]
So let’s assume we are having a Rest-Endpoint that is a POST-Call to create a Payment:
This Post Call will create an outgoing payment to a customer of dancier.
Let’s assume
In this example "Some Service" can not know it the request was successful. As no HTTP-Response arrives at Something it could be that Payment Service was unable to perform the the outgoing payment and faild sending a http 500. It could also be that the request was succussfull, the payment servcie send the money to the recepient, but failed to send the http-200 to "some service".
Now some service has several options:
Some Service ca try to find out if the request was succuessfull by some other Rest-Endppoints. Unfortunatly it could not just make a get to the possible created rest-resource, as the id for that get-query whould be included in the response that just failed. So it could query for all payments for the given user and amount. This is an option that would require extra effort, and in some enviroements that would be much effort or even impossible or at least error prone. At least we could not generically handle that issue. We will not look at this option further
we could just retry the call, but than we would risk that the call will succeed more than once, with the risk of sending the money twice or even more times. Good for the receipent, bad for us. So we need to make sure, that the money will only go out once, and exactly once. How can we achive this? In theory this is called that we want to achive idempotency for that call. So even
We could make the nature of the call idempotent. Let’s assume that we maintain a balance of what we owe to the customer or the customer owes us. So as an example, if the customer buys something, he will owe us money. We would express this we a postive balance. If he owes us 10€ the balance would be +10€. If he than pays 10€ we will decrease the balance by 10€. So than the balance would be +10€-10€=0€. So we would just aim for a situation that the balacne is always 0€. (if we owe the customer let’s say 5€, than the balance would be -5€, and we would make a payment to the customer). So if we would change the rest-endpoint, we would have a PUT-Endpoint that set’s the expected balance.
show example at that we would not send the money out two time with this solution
drawback is, we will have shared state in that distributed system. "Some Systemen and Payment System will have to maintain the shared data of the balacne"
This would also mean that we would to work arround that technical issue we will heavily impact our api!
How else could we make sure, that the payment only goes out once and exactly once?
We could introdurce an idempotent key.
No, I will show how this idempotenc key things works.
what to do on the calling side and what to do on the receiving side.
The impor