When implementing a Serverless system for the first time, you can find plenty of advice on how to put the technology together, but far less on how to manage it once it’s operational. So you end up with something that meets the functional requirements and scales well, but that is hard to operate and hard to diagnose when problems arise.
From an operational perspective, you used to be able to watch the underlying virtual hardware: if memory and CPU both went to 100%, you knew you had a problem. But you simply can’t look at those measures within a Serverless function.
Recently, I was involved in implementing a Serverless system that provides a public API in front of various legacy back-end systems, combining their results into new and more useful functions.
With this level of complexity, many things can go wrong. If your Serverless functions take too long to complete, they time out silently without returning a result. Those less-than-stable legacy systems can lock up, run slowly, or do something else unexpected on a regular basis. So we need to be able to answer some key operational questions on an ongoing basis:
Serverless performance: What is our transaction throughput for each API? How many requests are we receiving and how long are they taking to process? What error types and error volumes are we seeing in the logs?
Back-end performance: What transaction volumes are we sending to our back-end systems? How long are they taking to process? What error types and error volumes are being returned to us?
We get answers to these questions by deliberately writing the measures we are interested in to the Serverless logs and then letting our monitoring software derive the statistics from those logs. The result is real-time operational telemetry that we can graph and that raises alarms when something goes wrong.
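To make the idea concrete, here is a minimal Python sketch of what writing measures to the logs might look like. It is an illustration under assumptions, not the system described in this article: the helper names (measured, summarize), the JSON field names, and the identifiers get_customer and legacy_crm are all hypothetical. In a real deployment the aggregation step would normally be done by your monitoring software; summarize only stands in for it here to show that the questions above can be answered from log lines alone.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def measured(kind, name, emit=print):
    """Time a block of work and emit one JSON log line describing it.

    kind: "api" for an inbound request, "backend" for an outbound call.
    name: hypothetical identifier for the API or back-end system.
    emit: where the log line goes (print, i.e. stdout, by default).
    """
    record = {"metric": True, "kind": kind, "name": name, "outcome": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Record the error type, then let the exception propagate.
        record["outcome"] = "error"
        record["error_type"] = type(exc).__name__
        raise
    finally:
        # Runs on success and failure, so every call leaves a log line.
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 1)
        emit(json.dumps(record))


def summarize(log_lines):
    """Stand-in for monitoring software: count, mean duration, errors by type."""
    stats = defaultdict(
        lambda: {"count": 0, "total_ms": 0.0, "errors": defaultdict(int)}
    )
    for line in log_lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # ordinary, non-JSON log output
        if not isinstance(rec, dict) or not rec.get("metric"):
            continue  # JSON, but not one of our metric records
        entry = stats[(rec["kind"], rec["name"])]
        entry["count"] += 1
        entry["total_ms"] += rec["duration_ms"]
        if rec["outcome"] == "error":
            entry["errors"][rec["error_type"]] += 1
    return {
        key: {
            "count": e["count"],
            "mean_ms": e["total_ms"] / e["count"],
            "errors": dict(e["errors"]),
        }
        for key, e in stats.items()
    }


if __name__ == "__main__":
    lines = []
    # One inbound API request that makes one back-end call.
    with measured("api", "get_customer", emit=lines.append):
        with measured("backend", "legacy_crm", emit=lines.append):
            pass  # the call to the legacy system would go here
    print(summarize(lines))
```

Wrapping both the inbound request and each outbound call in the same helper keeps the Serverless and back-end figures in one log format, so throughput, duration, and error volumes can be compared side by side.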
When creating Serverless systems, give yourself time to meet this implicit reporting requirement, because it can’t always be met in the same way as it is in server-based solutions.