Observe everything + Alarm (The Ultimate Guide to AWS Lambda Development, Chapter 4)

George Mao
7 min read · Feb 13, 2024

Chapter 3 of this guide focused on themes you should follow to optimize all aspects of Lambda. Remember, cost is everyone’s responsibility. This final chapter focuses on tips for monitoring your Serverless applications and the things you should alarm on.

Account level health

Let’s start at the top and make sure your AWS account as a whole is healthy.

ClaimedAccountConcurrency

This is a new metric that reports the sum of concurrency actively in use across the entire account, which consists of: unreserved concurrency, allocated reserved concurrency, and configured Provisioned Concurrency. Track this metric to understand how much concurrency remains available for on-demand invocations. Values that climb steadily toward your account quota (default 1,000) indicate you either need a quota increase or have Lambda functions consuming more concurrency than they should. Look for bad patterns in the metric. Ultimately, this metric needs to be compared against the account quota.


Here’s a CloudWatch metric widget snippet you can use to quickly create this metric:

{
  "metrics": [
    [ { "expression": "(claimedAccountConcurrency/SERVICE_QUOTA(concurrentExecutions)) * 100", "label": "concurrencyUtilization", "id": "concurrencyUtilization" } ],
    [ "AWS/Lambda", "ClaimedAccountConcurrency", { "id": "claimedAccountConcurrency" } ],
    [ ".", "ConcurrentExecutions", { "id": "concurrentExecutions" } ]
  ],
  "view": "timeSeries",
  "stacked": false,
  "region": "us-east-1",
  "stat": "Maximum",
  "period": 300,
  "title": "Account Concurrency Utilization",
  "yAxis": {
    "left": {
      "label": "Count",
      "showUnits": false
    }
  }
}

Finally, create an Alarm off the concurrencyUtilization metric and set up actions as needed.
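Here’s a minimal sketch of what that Alarm could look like with the AWS SDK for JavaScript (v3), assuming an ES module where top-level await is available. The alarm name, threshold, evaluation periods, and SNS topic are placeholder assumptions; swap in your own values.

import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "us-east-1" });

// Alarm when account concurrency utilization stays above 80% (threshold is an assumption)
await cw.send(
  new PutMetricAlarmCommand({
    AlarmName: "account-concurrency-utilization", // hypothetical name
    ComparisonOperator: "GreaterThanThreshold",
    Threshold: 80,
    EvaluationPeriods: 3,
    TreatMissingData: "notBreaching",
    Metrics: [
      {
        // Same metric math expression as the dashboard widget above
        Id: "concurrencyUtilization",
        Expression: "(claimedAccountConcurrency/SERVICE_QUOTA(concurrentExecutions)) * 100",
        Label: "concurrencyUtilization",
        ReturnData: true,
      },
      {
        Id: "claimedAccountConcurrency",
        MetricStat: {
          Metric: { Namespace: "AWS/Lambda", MetricName: "ClaimedAccountConcurrency" },
          Period: 300,
          Stat: "Maximum",
        },
        ReturnData: false,
      },
      {
        Id: "concurrentExecutions",
        MetricStat: {
          Metric: { Namespace: "AWS/Lambda", MetricName: "ConcurrentExecutions" },
          Period: 300,
          Stat: "Maximum",
        },
        ReturnData: false,
      },
    ],
    // Hypothetical SNS topic for notifications; replace with your own action
    AlarmActions: ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
  })
);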

Other Account level Metrics

You should also monitor the account-level metrics: Duration, Throttles, and Errors. I review these metrics every 1–2 weeks and look for patterns and outliers that indicate overall account health.
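If you want these on a dashboard too, a widget similar to the one above works. With no FunctionName dimension, the AWS/Lambda metrics aggregate across the whole account; the statistics and region below are assumptions to adapt.

{
  "metrics": [
    [ "AWS/Lambda", "Errors", { "stat": "Sum" } ],
    [ ".", "Throttles", { "stat": "Sum" } ],
    [ ".", "Duration", { "stat": "Average" } ]
  ],
  "view": "timeSeries",
  "stacked": false,
  "region": "us-east-1",
  "period": 300,
  "title": "Account-level Lambda Health"
}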

Per function health

Lambda provides various metrics out of the box

Some examples are: ConcurrentExecutions, Invocations, Errors, Throttles, Duration, IteratorAge, AsyncEventsReceived

Make sure you’re reading these metrics with the correct Statistic; otherwise the data will mislead you and your analysis will be wrong. For example, ConcurrentExecutions should be read using the Max statistic. In the example below, the function runs somewhere between 50 and 300 concurrent executions, peaking at 296.

Max statistic

If you accidentally read the same Metric using the Sum statistic, you’ll see drastically different data that gives you the wrong picture. This function did not run anywhere near millions of concurrent executions :)

Sum statistic
Read Metrics using the right Statistic
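The same rule applies if you pull metrics programmatically: request the statistic explicitly. Here’s a sketch with the AWS SDK for JavaScript (v3), using a hypothetical function name.

import {
  CloudWatchClient,
  GetMetricDataCommand,
} from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "us-east-1" });
const now = new Date();
const oneDayAgo = new Date(now.getTime() - 24 * 60 * 60 * 1000);

const data = await cw.send(
  new GetMetricDataCommand({
    StartTime: oneDayAgo,
    EndTime: now,
    MetricDataQueries: [
      {
        Id: "concurrency",
        MetricStat: {
          Metric: {
            Namespace: "AWS/Lambda",
            MetricName: "ConcurrentExecutions",
            // Hypothetical function name; drop Dimensions for the account-level view
            Dimensions: [{ Name: "FunctionName", Value: "my-function" }],
          },
          Period: 60,
          Stat: "Maximum", // Sum would add samples together and wildly overstate concurrency
        },
      },
    ],
  })
);

console.log(data.MetricDataResults?.[0]?.Values);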

Create a Custom metric for all Poll based events

For poll-based event sources (SQS, Kinesis, Kafka), Lambda operates a highly scalable, optimized poller on your behalf. That poller retrieves messages from the data source and invokes your function, with none of the polling overhead living in your code.

re:Invent 2022 (SVS404) — Julian Wood’s session

There is one major item that AWS does not provide:

You can’t tell how many records the Lambda poller is delivering to each invocation of Lambda!

By default you have to rely on the IteratorAge metric to understand how well you are keeping up with the incoming data. Instead, you can write a custom metric that tells you how many records the poller is delivering per invocation. This will help you tune the poller configuration for Batch Size and Batch Window.

The easiest way to do this is to output an Embedded Metric Format (EMF) log entry. CloudWatch will automatically pick this up and create a metric for you, and it’s much cheaper than calling the PutMetricData API. I suggest using AWS Lambda Powertools to do this. Here’s a snippet of code written in Node. It creates a new metric under the compositeSfTest namespace called recordCount.

import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';

const metrics = new Metrics({
  namespace: 'compositeSfTest',
  serviceName: 'recordCount',
});

export const handler = async (event) => {
  // The event parameter contains the batch of records delivered by the poller
  // Record the size of that batch on every invocation
  metrics.addMetric('recordCount', MetricUnits.Count, event.Records.length);
  metrics.publishStoredMetrics();
};

This will generate an EMF entry in your function’s Log Group. The Logs service will detect the EMF entry and automatically create a custom metric under the compositeSfTest namespace.

{
  "_aws": {
    "Timestamp": 1706828479853,
    "CloudWatchMetrics": [
      {
        "Namespace": "compositeSfTest",
        "Dimensions": [
          [
            "service"
          ]
        ],
        "Metrics": [
          {
            "Name": "recordCount",
            "Unit": "Count"
          }
        ]
      }
    ]
  },
  "service": "recordCount",
  "recordCount": 1
}

Analysis tools

There are two features that are very powerful analysis tools:

Lambda Insights — this is an extension that has access to detailed resource consumption information for your function. It emits an EMF log entry with data such as CPU utilization, Network IO, idle time, and memory utilization. This can help you track down memory leaks, performance issues, and other things that are not visible in the standard REPORT line.

CloudWatch Logs Insights — this is a query tool that can be used to aggregate and analyze huge numbers of log entries. One of my favorite queries compares the number of cold starts to total invocations across a certain time frame. You can do that with this query:

filter @type = "REPORT"
| parse @message /Init Duration: (?<init>\S+)/
| stats count() as total,
        count(init) as coldStartCount,
        coldStartCount/total*100 as coldPercent, # % of cold invokes
        avg(init) as avgInitDuration,
        max(init) as maxInitDuration,
        min(@duration) as minDuration,
        max(@duration) as maxDuration,
        avg(@duration) as avgDuration,
        avg(@maxMemoryUsed)/1024/1024 as memoryused
  by bin(30min) # group all stats by 30-minute windows
  • filter keeps only the REPORT log entries.
  • Then we parse the message looking for the Init Duration attribute (only present on cold-start invokes).
  • Compute a bunch of stats like min, max, and avg.
  • bin groups all stats into 30-minute windows.

Make sure you run the report with a time window wide enough to capture meaningful data. The console defaults to 3 hours, but I generally like to see at least 1 day of data.
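You can also run the same query programmatically, which makes it easy to standardize on a wider window. Here’s a sketch using the AWS SDK for JavaScript (v3) against a hypothetical log group, scanning the last day of data with a trimmed-down version of the query above.

import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from "@aws-sdk/client-cloudwatch-logs";

const logs = new CloudWatchLogsClient({ region: "us-east-1" });

const end = Math.floor(Date.now() / 1000);
const start = end - 24 * 60 * 60; // last 1 day of data

const { queryId } = await logs.send(
  new StartQueryCommand({
    logGroupName: "/aws/lambda/my-function", // hypothetical log group
    startTime: start,
    endTime: end,
    queryString: `filter @type = "REPORT"
| parse @message /Init Duration: (?<init>\\S+)/
| stats count() as total,
        count(init) as coldStartCount,
        coldStartCount/total*100 as coldPercent,
        avg(init) as avgInitDuration,
        max(init) as maxInitDuration,
        avg(@duration) as avgDuration
  by bin(30min)`,
  })
);

// Poll until the query finishes, then print the rows
let results;
do {
  await new Promise((resolve) => setTimeout(resolve, 1000));
  results = await logs.send(new GetQueryResultsCommand({ queryId }));
} while (results.status === "Running" || results.status === "Scheduled");

console.log(results.results);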

Invokes vs Cold starts

Every row is 30 minutes’ worth of data. You can reduce the bin value if you need more granular windows, or increase it if you have a large amount of data to analyze. You should read the data from bottom to top:

  • Starting at row 10, there were ~90k total invokes and 9 cold starts. Those cold starts averaged 994ms and worst case was 1137ms.
  • Traffic begins to increase from here forward. Move up to row 9 and there were 104k invokes but only 4 cold starts, averaging 963 ms.
  • Traffic continues to increase as we approach the first row (the most recent)
  • The worst case for cold starts is row 3, at 76 out of ~150k invokes, but init times are stable at under 1 second

CloudWatch Logs Insights parsed more than 31 million records, filtered down to 4 million records, and analyzed them in 8.3 seconds!

Just keep in mind that the current cost is $0.005 per GB of data scanned. This example scanned 4 GB so it cost me 2 cents.

Things to be aware of

Alarms can directly invoke Lambda now!

This is a new feature that allows you to invoke a Lambda function directly when an Alarm triggers. Previously, this required an intermediate messaging service such as SNS between the Alarm and the function. This is a really good example of why you should re-optimize on a regular basis.
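A rough sketch of wiring this up with the AWS SDK for JavaScript (v3): grant CloudWatch Alarms permission to invoke the function, then set the function ARN as the alarm action. The ARNs, alarm name, and threshold are placeholders for illustration.

import { LambdaClient, AddPermissionCommand } from "@aws-sdk/client-lambda";
import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const lambda = new LambdaClient({ region: "us-east-1" });
const cw = new CloudWatchClient({ region: "us-east-1" });

// Hypothetical ARNs for illustration
const functionArn = "arn:aws:lambda:us-east-1:123456789012:function:alarm-handler";
const alarmArn = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:lambda-throttles";

// 1. Allow CloudWatch Alarms to invoke the function
await lambda.send(
  new AddPermissionCommand({
    FunctionName: functionArn,
    StatementId: "cloudwatch-alarm-invoke",
    Action: "lambda:InvokeFunction",
    Principal: "lambda.alarms.cloudwatch.amazonaws.com",
    SourceArn: alarmArn,
  })
);

// 2. Point the alarm action directly at the function ARN (no SNS in between)
await cw.send(
  new PutMetricAlarmCommand({
    AlarmName: "lambda-throttles",
    Namespace: "AWS/Lambda",
    MetricName: "Throttles",
    Statistic: "Sum",
    Period: 60,
    EvaluationPeriods: 1,
    Threshold: 0,
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: [functionArn],
  })
);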

Some Metrics don’t generate data when there is nothing to report

By default, Alarms will treat missing data as non-breaching. This means missing data in a Metric will not trigger an alarm. You can read more about this in the AWS docs. One example of this is something we talked about in Chapter 3: Provisioned Concurrency autoscaling.

The Autoscaling alarms that are created by default monitor the ProvisionedConcurrencyUtilization metric. This metric only reports data if there are actual invokes on the function. If your workload stops and you do not invoke the function, there is no data!

This means the Alarm will never breach while there are 0 invocations on the function, and your PC allocation will never scale down!

If you want missing data to be treated as a breach, change the Alarm’s advanced settings for missing data treatment from missing to bad.

In CloudFormation, set the TreatMissingData attribute to breaching.
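As a sketch, an alarm resource with that attribute could look like the snippet below in a CloudFormation template. The function name, alias, dimensions, threshold, and evaluation settings are assumptions for illustration; adapt them to your own Provisioned Concurrency setup.

{
  "PCUtilizationAlarm": {
    "Type": "AWS::CloudWatch::Alarm",
    "Properties": {
      "AlarmName": "pc-utilization-scale-down",
      "Namespace": "AWS/Lambda",
      "MetricName": "ProvisionedConcurrencyUtilization",
      "Dimensions": [
        { "Name": "FunctionName", "Value": "my-function" },
        { "Name": "Resource", "Value": "my-function:live" }
      ],
      "Statistic": "Maximum",
      "Period": 60,
      "EvaluationPeriods": 5,
      "Threshold": 0.5,
      "ComparisonOperator": "LessThanThreshold",
      "TreatMissingData": "breaching"
    }
  }
}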

TL;DR

  • Observe and alarm at both the Account level and Function level
  • Metrics must be read using the right Statistic
  • Some of the best features are built right into CloudWatch (Insights)

Summary

This completes the Ultimate Guide to AWS Lambda Development! I’m glad you made it this far and hope this improves your Serverless developer experience. Reach out to me or join us on Discord #BelieveInServerless to talk more Serverless!


George Mao

Distinguished Engineer @ Capital One leading all things Serverless | Ex-AWS WW Serverless Tech Lead.