At Piiano we are developing a Vault, a type of storage database dedicated to PII. Vault is implemented in Go, which is a great language for implementing cloud services.
One of the basic requirements of any server is to implement a request timeout. Every request the server handles should have a timeout, typically around 15-30 seconds on web servers these days. When that timeout expires, the user should receive an HTTP 503 timeout response, and the handling of the request on the backend should be stopped.
As this was my first contribution to Vault, and also my first time writing Go, it has definitely been a learning experience. I thought that sharing a few lessons from implementing this feature would be helpful for other developers.
1. Go supports timeouts out-of-the-box
In Go, many functions (and certainly all I/O functions) accept a Context argument that makes a request cancelable, or subject to a deadline, in a cooperative manner. This means that the implementation of any such function should honor the cancel signal in the context, for example by select-ing on the context's Done channel.
This is pretty standard in Go, and when you are looking to add a 10-second timeout to your request-handling code, you could write something like the following:
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
2. Error handling with timeouts (part 1)
Error handling in Go is also pretty standardized. When you get an error from an internal function, you would normally wrap it e.g. using:
fmt.Errorf("unable to add person: %w", err)
Then, when you go up the call tree, before returning the HTTP error to the user, you could check if the error was initiated by a timeout or not, and if it’s from a timeout, handle it differently.
You can implement this check using code such as:
If IsTimeoutError returns true, return a 503 timeout error; otherwise, return your standard 4xx error.
3. There are strange errors that are actually caused by timeouts
The above solution worked almost perfectly. However, one of our tests was flaky: once every 10 runs or so it would return a 500 error when it was supposed to return a 503 timeout error. Essentially, it was treating a timeout error as a "generic" error.
The reason: some errors that were caused by timeouts got translated into non-timeout errors. Why? Read on for the situation we ran into.
Imagine you have a bit of code that starts a DB transaction, makes an INSERT query, and then commits the transaction. If the timeout expires during the INSERT then the DB API call will fail with a timeout error, which would be easy to detect.
However, if the timeout expires exactly in the window after the INSERT but before the COMMIT of the transaction, the SQL library will roll back the transaction, because the timeout was sent to the DB when the transaction was started. When we then try to COMMIT the transaction, instead of a timeout error we get an error saying we're trying to commit a transaction that has already been rolled back. This error will NOT be a timeout error, so our code will treat it as some unrelated error. A beautiful race-condition bug.
This is easy to reproduce if you add a time.Sleep(10 * time.Second) between the last query and the commit. (How to discover this is left as an exercise for the reader 🙂)
4. Error handling with timeouts (part 2)
Fortunately, in the above case Go comes to our rescue. If the timeout on our context expires, we are guaranteed that ctx.Err() will be context.DeadlineExceeded (or an error wrapping it). So to handle our error, our code should now be:
Where …HTTPTimeoutError… should be replaced with your own code to return the appropriate HTTP error.
5. Testing your timeout code
How should we test our code? Testing timeouts can be particularly tricky. First, we need to understand the requirements of our timeout test:
- The test should fail if the server ignores the timeout parameter (i.e. API calls are allowed to run indefinitely)
- The test shouldn’t be slow – it should run under 1 second
- The test shouldn’t be flaky – its success should not depend on the speed of the machine on which it runs
- The test should verify that the HTTP return for a regular call is correct, and with a short timeout it should be HTTP 503
- The test should verify that the backend processing was stopped due to the timeout – e.g. a transaction inserting values to the database should not be committed.
This is quite a tall order. Here is my initial approach:
- Pick a particular standard API call. In our case AddPerson.
- Call it with a standard timeout. It should succeed and you should be able to observe its effects (a person was added.)
- Call it with a very short timeout.
- It should fail, and you should be able to observe that no change was made to the state of your system (no second person was added). This makes sure that the backend processing was halted.
The first complication:
Due to the way the API is structured, observing the state of the system is done with an API call. This is all fine and dandy until you realize that reading the state of the system will also fail due to a timeout, because the server is still configured with the very short timeout.
The solution: resetting the server with a new, longer timeout, without resetting the DB.
The second complication:
On fast machines, the action that should fail will actually succeed – the timeout is not short enough. If it’s too short, the server won’t even start processing the request and you’re not testing anything. So you need to find a sweet spot for the timeout in the test. Let’s say, 10ms.
However, on some faster machines, even that is enough time for the action to sometimes succeed, which makes the test flaky.
The solution? Fault injection. We will add a configuration option that is not user-visible and will only be used in testing, called FaultInjection. For some particular value, the AddPerson API call will make a SQL query with pg_sleep(1) (note: this is Postgres-specific). This is sufficient to make sure the call fails on timeout, and our test for the timeout is now stable.
So there you have it, timeouts implemented with a stable test that proves that they work.