Building Real-Time Alarm Systems with AWS Lambda and DynamoDB
A deep dive into how we built a scalable alarm evaluation system for IoT devices using AWS serverless technologies.
Building reliable alarm systems for IoT devices presents unique challenges. In this post, we'll walk through how we designed and implemented a real-time alarm evaluation system using AWS Lambda, DynamoDB, and EventBridge.
The Problem
Our IoT platform monitors thousands of devices in the field. Each device sends telemetry data at regular intervals, and we need to:
- Detect when devices stop reporting data
- Identify anomalous sensor readings
- Alert operations teams within minutes of an issue
- Auto-resolve alarms when conditions normalize
Architecture Overview
Our alarm system consists of several key components:
IoT Devices → DynamoDB → EventBridge → Lambda Evaluators → SNS → Slack/PagerDuty
DynamoDB Streams
We use DynamoDB Streams to trigger alarm evaluation whenever device data changes. This gives us near-real-time processing without polling.
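import type { DynamoDBStreamHandler } from "aws-lambda";

// Evaluate alarms for every new or updated item; REMOVE events are ignored.
const streamHandler: DynamoDBStreamHandler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === "INSERT" || record.eventName === "MODIFY") {
      // NewImage is in DynamoDB attribute-value format and is only present
      // when the stream view type includes new images.
      await evaluateAlarms(record.dynamodb?.NewImage);
    }
  }
};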
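The `NewImage` on a stream record uses DynamoDB's attribute-value encoding (`{ S: "..." }`, `{ N: "21.5" }`) rather than a plain object, so an evaluator usually decodes it first. In a real handler, `unmarshall` from `@aws-sdk/util-dynamodb` does this. The hand-rolled decoder below is only a dependency-free sketch covering a subset of types, and the field names (`deviceId`, `temperature`, `online`) are hypothetical, not taken from our schema.

```typescript
// Subset of DynamoDB's attribute-value encoding (sets and binary are omitted).
type AttributeValue =
  | { S: string }
  | { N: string }
  | { BOOL: boolean }
  | { NULL: true }
  | { L: AttributeValue[] }
  | { M: Record<string, AttributeValue> };

type Plain = string | number | boolean | null | Plain[] | { [key: string]: Plain };

// Convert a single attribute value into a plain JavaScript value.
function decodeValue(value: AttributeValue): Plain {
  if ("S" in value) return value.S;
  if ("N" in value) return Number(value.N); // numbers travel as strings
  if ("BOOL" in value) return value.BOOL;
  if ("NULL" in value) return null;
  if ("L" in value) return value.L.map(decodeValue);
  return decodeImage(value.M);
}

// Convert a whole item image (a map of attribute values) into a plain object.
function decodeImage(image: Record<string, AttributeValue>): { [key: string]: Plain } {
  const out: { [key: string]: Plain } = {};
  for (const key of Object.keys(image)) {
    out[key] = decodeValue(image[key]);
  }
  return out;
}

// Hypothetical telemetry item as it might appear in record.dynamodb.NewImage.
const newImage: Record<string, AttributeValue> = {
  deviceId: { S: "device-123" },
  temperature: { N: "21.5" },
  online: { BOOL: true },
};

const decoded = decodeImage(newImage);
console.log(decoded);
```

Decoding once at the edge of the handler keeps the evaluators free of DynamoDB-specific types.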
Alarm Evaluators
Each alarm type has its own evaluator that implements a common interface:
interface AlarmEvaluator {
  evaluate(context: EvaluationContext): Promise<AlarmResult>;
  getThreshold(): number;
  getSeverity(): AlarmSeverity;
}
This design allows us to:
- Add new alarm types without modifying existing code
- Test evaluators in isolation
- Tune thresholds per alarm type
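To make the interface concrete, here is a sketch of what one evaluator might look like, using the "device stopped reporting" case from the problem statement. The shapes of `EvaluationContext`, `AlarmResult`, and `AlarmSeverity` are not shown in this post, so the definitions below are assumptions for illustration, as are the `StaleDataEvaluator` class and the `isSilentTooLong` helper.

```typescript
// Hypothetical shapes: the real types may carry more fields.
type AlarmSeverity = "INFO" | "WARNING" | "CRITICAL";

interface EvaluationContext {
  deviceId: string;
  lastReportedAtMs: number; // epoch milliseconds of the latest telemetry
  nowMs: number;            // injected clock, which keeps the evaluator testable
}

interface AlarmResult {
  deviceId: string;
  alarmType: string;
  active: boolean;
  severity: AlarmSeverity;
}

interface AlarmEvaluator {
  evaluate(context: EvaluationContext): Promise<AlarmResult>;
  getThreshold(): number;
  getSeverity(): AlarmSeverity;
}

// Pure check: has the device been silent for strictly longer than the threshold?
function isSilentTooLong(context: EvaluationContext, thresholdMs: number): boolean {
  return context.nowMs - context.lastReportedAtMs > thresholdMs;
}

class StaleDataEvaluator implements AlarmEvaluator {
  private readonly thresholdMs: number;

  constructor(thresholdMs: number) {
    this.thresholdMs = thresholdMs;
  }

  getThreshold(): number {
    return this.thresholdMs;
  }

  getSeverity(): AlarmSeverity {
    return "CRITICAL";
  }

  async evaluate(context: EvaluationContext): Promise<AlarmResult> {
    return {
      deviceId: context.deviceId,
      alarmType: "STALE_DATA",
      active: isSilentTooLong(context, this.thresholdMs),
      severity: this.getSeverity(),
    };
  }
}

// Usage: a device last seen at t=0, evaluated ten minutes later with a five-minute threshold.
const evaluator = new StaleDataEvaluator(5 * 60 * 1000);
evaluator
  .evaluate({ deviceId: "device-123", lastReportedAtMs: 0, nowMs: 10 * 60 * 1000 })
  .then((result) => console.log(result));
```

Passing the clock in through the context, rather than calling `Date.now()` inside the evaluator, is what makes isolated tests straightforward.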
State Management
We track alarm state in DynamoDB with a TTL-based auto-resolution mechanism:
- Active alarms have a TTL set to now + resolutionTimeout
- Resolved alarms are moved to a history table
- If conditions persist, the TTL is refreshed
This approach ensures alarms don't stay active forever if the underlying issue is resolved but we miss the resolution event.
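The TTL arithmetic above can be sketched as follows. DynamoDB's TTL attribute is a number holding epoch seconds, and expired items are removed in the background rather than at the exact expiry time, so it is usually worth treating an expired alarm as resolved on read. The attribute name `expiresAt`, the 15-minute timeout, and the helper names are assumptions for illustration.

```typescript
// Hypothetical alarm record; expiresAt plays the role of the DynamoDB TTL attribute.
interface AlarmState {
  deviceId: string;
  alarmType: string;
  status: "ACTIVE" | "RESOLVED";
  expiresAt: number; // epoch seconds, the unit DynamoDB TTL expects
}

const RESOLUTION_TIMEOUT_SECONDS = 15 * 60; // assumed value

// now + resolutionTimeout, converted from milliseconds to epoch seconds.
function ttlFrom(nowMs: number, timeoutSeconds: number): number {
  return Math.floor(nowMs / 1000) + timeoutSeconds;
}

// Called on each evaluation where the alarm condition still holds.
function refreshAlarm(alarm: AlarmState, nowMs: number): AlarmState {
  return { ...alarm, expiresAt: ttlFrom(nowMs, RESOLUTION_TIMEOUT_SECONDS) };
}

// TTL deletion is not immediate, so readers filter out expired items themselves.
function isEffectivelyActive(alarm: AlarmState, nowMs: number): boolean {
  return alarm.status === "ACTIVE" && alarm.expiresAt > Math.floor(nowMs / 1000);
}

const t0 = 1700000000000; // arbitrary fixed timestamp in milliseconds
const minute = 60 * 1000;

const created: AlarmState = {
  deviceId: "device-123",
  alarmType: "STALE_DATA",
  status: "ACTIVE",
  expiresAt: ttlFrom(t0, RESOLUTION_TIMEOUT_SECONDS),
};

// Condition still holds ten minutes later, so the TTL is pushed out again.
const refreshed = refreshAlarm(created, t0 + 10 * minute);

console.log(isEffectivelyActive(created, t0 + 16 * minute));   // unrefreshed alarm has expired
console.log(isEffectivelyActive(refreshed, t0 + 16 * minute)); // refreshed alarm is still active
```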
Key Learnings
1. Idempotency is Critical
Lambda can invoke the handler with the same DynamoDB Streams record more than once, for example when a failed batch is retried. Every evaluator must be idempotent:
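// Bad: creates a duplicate alarm on every redelivery
await createAlarm(deviceId, alarmType);

// Good: conditional write that only succeeds if no active alarm exists
try {
  await dynamodb.put({
    TableName: "alarms",
    Item: alarm,
    ConditionExpression: "attribute_not_exists(pk) OR #status = :resolved",
    ExpressionAttributeNames: { "#status": "status" },
    ExpressionAttributeValues: { ":resolved": "RESOLVED" },
  });
} catch (err) {
  // An active alarm already exists, so this delivery is a duplicate: ignore it.
  if ((err as Error).name !== "ConditionalCheckFailedException") throw err;
}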
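To see why the condition makes redelivery harmless, the sketch below replays the same record against an in-memory stand-in for the table. It mirrors the condition `attribute_not_exists(pk) OR #status = :resolved` in plain code; the `table` object, the `deviceId#alarmType` key layout, and the function names are assumptions for illustration, not the production code.

```typescript
interface Alarm {
  pk: string;
  status: "ACTIVE" | "RESOLVED";
  createdAt: number;
}

// In-memory stand-in for the alarms table, keyed by pk.
const table: Record<string, Alarm> = {};

// Mirrors: attribute_not_exists(pk) OR #status = :resolved
function putIfAbsentOrResolved(item: Alarm): boolean {
  const existing = table[item.pk];
  if (existing !== undefined && existing.status !== "RESOLVED") {
    return false; // DynamoDB would reject this with ConditionalCheckFailedException
  }
  table[item.pk] = item;
  return true;
}

// Handling a record is safe to repeat: only the first delivery creates the alarm.
function handleRecord(deviceId: string, alarmType: string, nowMs: number): "CREATED" | "DUPLICATE" {
  const pk = deviceId + "#" + alarmType; // hypothetical key layout
  const written = putIfAbsentOrResolved({ pk, status: "ACTIVE", createdAt: nowMs });
  return written ? "CREATED" : "DUPLICATE";
}

const firstDelivery = handleRecord("device-123", "STALE_DATA", 1000);
const redelivery = handleRecord("device-123", "STALE_DATA", 2000);
console.log(firstDelivery, redelivery);
```

The redelivered record leaves the original alarm untouched, including its `createdAt`, while a resolved alarm can still be reopened by a later record.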
2. Fan-Out for Scale
Initially, we had one Lambda function evaluating all alarm types. This created a bottleneck, so we refactored to a fan-out design in which a router invokes a dedicated evaluator Lambda per alarm type:
// Route to the evaluator Lambda for this alarm type
const evaluatorArn = `arn:aws:lambda:${region}:${account}:function:alarm-evaluator-${alarmType}`;
// Payload must be serialized (a string or bytes), not a raw object
await lambda.invoke({ FunctionName: evaluatorArn, Payload: JSON.stringify(event) });
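One way to keep the routing step testable is to build the invoke parameters in a pure function and hand them to the Lambda client separately. The sketch below assumes a fixed list of alarm types and an asynchronous invocation (`InvocationType: "Event"`), so the router does not wait for the evaluator to finish. The names `KNOWN_ALARM_TYPES` and `buildInvokeParams`, and the example region and account ID, are hypothetical.

```typescript
// The subset of Lambda Invoke parameters used here.
interface InvokeParams {
  FunctionName: string;
  InvocationType: "Event" | "RequestResponse";
  Payload: string;
}

// Hypothetical list of alarm types that have a deployed evaluator function.
const KNOWN_ALARM_TYPES: string[] = ["stale-data", "sensor-anomaly"];

function buildInvokeParams(
  region: string,
  account: string,
  alarmType: string,
  event: unknown
): InvokeParams {
  // Reject unknown types early instead of invoking a function that does not exist.
  if (KNOWN_ALARM_TYPES.indexOf(alarmType) === -1) {
    throw new Error("No evaluator registered for alarm type: " + alarmType);
  }
  return {
    FunctionName: `arn:aws:lambda:${region}:${account}:function:alarm-evaluator-${alarmType}`,
    InvocationType: "Event", // assumed: asynchronous, fire-and-forget
    Payload: JSON.stringify(event),
  };
}

const params = buildInvokeParams("eu-west-1", "123456789012", "stale-data", {
  deviceId: "device-123",
});
console.log(params.FunctionName);
```

With asynchronous invocation the router gets no result back, so evaluators need their own error handling, for example a dead-letter queue or on-failure destination.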
3. Observability First
We instrument everything with structured logging and CloudWatch metrics:
// Compute the duration once so the log line and the metric agree
const duration = Date.now() - startTime;

logger.info("Alarm evaluated", {
  deviceId,
  alarmType,
  result: "ACTIVE",
  evaluationDurationMs: duration,
});

await cloudwatch.putMetricData({
  Namespace: "AlarmSystem",
  MetricData: [
    { MetricName: "AlarmEvaluations", Value: 1, Unit: "Count" },
    { MetricName: "EvaluationDuration", Value: duration, Unit: "Milliseconds" },
  ],
});
Results
After deploying this system:
- Detection latency: < 60 seconds from issue to alert
- False positive rate: < 1%
- Auto-resolution rate: 85% of alarms resolve without intervention
- Cost: ~$50/month for 100K devices
Conclusion
Building a reliable alarm system requires careful attention to idempotency, scale, and observability. AWS serverless technologies provide the building blocks, but the design patterns matter.
In future posts, we'll dive deeper into specific alarm types and how we handle edge cases like network partitions and clock skew.
Have questions? Reach out to our team!