Building Real-Time Alarm Systems with AWS Lambda and DynamoDB

Building reliable alarm systems for IoT devices presents unique challenges. In this post, we'll walk through how we designed and implemented a real-time alarm evaluation system using AWS Lambda, DynamoDB, and EventBridge.

The Problem

Our IoT platform monitors thousands of devices in the field. Each device sends telemetry data at regular intervals, and we need to:

Detect when devices stop reporting data
Identify anomalous sensor readings
Alert operations teams within minutes of an issue
Auto-resolve alarms when conditions normalize

Architecture Overview

Our alarm system consists of several key components:

IoT Devices → DynamoDB → EventBridge → Lambda Evaluators → SNS → Slack/PagerDuty

DynamoDB Streams

We use DynamoDB Streams to trigger alarm evaluation whenever device data changes. This gives us near-real-time processing without polling.

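import type { DynamoDBStreamHandler } from "aws-lambda";

// Evaluate alarms for every new or updated device record in the batch.
const streamHandler: DynamoDBStreamHandler = async (event) => {
  for (const record of event.Records) {
    const newImage = record.dynamodb?.NewImage;
    // REMOVE events carry no NewImage, so they are skipped here.
    if (newImage && (record.eventName === "INSERT" || record.eventName === "MODIFY")) {
      await evaluateAlarms(newImage);
    }
  }
};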
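The NewImage passed to evaluateAlarms arrives in DynamoDB's attribute-value encoding ({ S: ... }, { N: ... }) rather than as a plain object. In practice the `unmarshall` helper from `@aws-sdk/util-dynamodb` handles the conversion; the hand-rolled sketch below only illustrates what that step does for a subset of types. The names `decodeValue` and `decodeImage` are hypothetical and not part of our codebase or the SDK.

```typescript
// Minimal subset of DynamoDB's attribute-value encoding (illustrative, not exhaustive).
type AttributeValue =
  | { S: string }
  | { N: string } // numbers are transmitted as strings
  | { BOOL: boolean }
  | { NULL: true }
  | { M: { [key: string]: AttributeValue } }
  | { L: AttributeValue[] };

type Plain = string | number | boolean | null | Plain[] | { [key: string]: Plain };

// Convert a single attribute value into a plain JavaScript value.
function decodeValue(value: AttributeValue): Plain {
  if ("S" in value) return value.S;
  if ("N" in value) return Number(value.N);
  if ("BOOL" in value) return value.BOOL;
  if ("NULL" in value) return null;
  if ("L" in value) return value.L.map((item) => decodeValue(item));
  return decodeImage(value.M);
}

// Convert a whole stream image (attribute name -> attribute value) into a plain object.
function decodeImage(image: { [key: string]: AttributeValue }): { [key: string]: Plain } {
  const out: { [key: string]: Plain } = {};
  for (const key of Object.keys(image)) {
    out[key] = decodeValue(image[key]);
  }
  return out;
}
```

With this in place, an evaluator can read `decodeImage(newImage).temperature` as a number instead of unpacking `{ N: "21.5" }` by hand.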

Alarm Evaluators

Each alarm type has its own evaluator that implements a common interface:

interface AlarmEvaluator {
  evaluate(context: EvaluationContext): Promise<AlarmResult>;
  getThreshold(): number;
  getSeverity(): AlarmSeverity;
}

This design allows us to:

Add new alarm types without modifying existing code
Test evaluators in isolation
Tune thresholds per alarm type
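To make the interface concrete, here is a minimal sketch of an evaluator for the "device stopped reporting" case. The shapes of `EvaluationContext`, `AlarmResult`, and `AlarmSeverity`, the `StaleDataEvaluator` class, and the `isSilentTooLong` helper are all assumptions made for illustration; the real types carry more fields.

```typescript
// Hypothetical supporting types; the real definitions are not shown in this post.
type AlarmSeverity = "INFO" | "WARNING" | "CRITICAL";

interface EvaluationContext {
  deviceId: string;
  lastReportedAtMs: number; // epoch milliseconds of the most recent telemetry
  nowMs: number; // injected clock so the evaluator is deterministic in tests
}

interface AlarmResult {
  deviceId: string;
  alarmType: string;
  active: boolean;
  severity: AlarmSeverity;
}

interface AlarmEvaluator {
  evaluate(context: EvaluationContext): Promise<AlarmResult>;
  getThreshold(): number;
  getSeverity(): AlarmSeverity;
}

// Pure check, kept separate so it can be unit-tested without async plumbing.
function isSilentTooLong(context: EvaluationContext, thresholdMs: number): boolean {
  return context.nowMs - context.lastReportedAtMs > thresholdMs;
}

class StaleDataEvaluator implements AlarmEvaluator {
  private readonly thresholdMs: number;

  constructor(thresholdMs: number) {
    this.thresholdMs = thresholdMs;
  }

  getThreshold(): number {
    return this.thresholdMs;
  }

  getSeverity(): AlarmSeverity {
    return "CRITICAL";
  }

  evaluate(context: EvaluationContext): Promise<AlarmResult> {
    return Promise.resolve({
      deviceId: context.deviceId,
      alarmType: "STALE_DATA",
      active: isSilentTooLong(context, this.thresholdMs),
      severity: this.getSeverity(),
    });
  }
}
```

Because the clock is passed in through the context rather than read from `Date.now()`, the same input always yields the same result, which is what makes testing in isolation practical.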

State Management

We track alarm state in DynamoDB with a TTL-based auto-resolution mechanism:

Active alarms have a TTL set to now + resolutionTimeout
Resolved alarms are moved to a history table
If conditions persist, the TTL is refreshed

This approach ensures alarms don't stay active forever if the underlying issue is resolved but we miss the resolution event.
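The TTL bookkeeping can be sketched as below. Two DynamoDB details shape it: the TTL attribute must be a Number holding epoch seconds, and expired items are deleted in the background rather than at the exact expiry time, so readers should compare the TTL against the current time instead of assuming an expired item is already gone. The `ActiveAlarm` shape and the function names are hypothetical.

```typescript
// Hypothetical alarm record; only the fields needed for TTL handling are shown.
interface ActiveAlarm {
  pk: string;
  status: "ACTIVE" | "RESOLVED";
  ttl: number; // epoch seconds, as DynamoDB TTL expects
}

// TTL value for "now + resolutionTimeout", converting milliseconds to whole seconds.
function ttlFrom(nowMs: number, resolutionTimeoutSec: number): number {
  return Math.floor(nowMs / 1000) + resolutionTimeoutSec;
}

// Called when the alarm condition still holds: push the expiry further out.
function refreshTtl(alarm: ActiveAlarm, nowMs: number, resolutionTimeoutSec: number): ActiveAlarm {
  return { pk: alarm.pk, status: alarm.status, ttl: ttlFrom(nowMs, resolutionTimeoutSec) };
}

// Read-side check that does not depend on DynamoDB having deleted the item yet.
function isStillActive(alarm: ActiveAlarm, nowMs: number): boolean {
  return alarm.status === "ACTIVE" && alarm.ttl > Math.floor(nowMs / 1000);
}
```

For example, with a 600-second resolution timeout, an alarm refreshed at time t is treated as active until t + 600 seconds, whether or not the item has physically been removed from the table.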

Key Learnings

1. Idempotency is Critical

A Lambda function reading from DynamoDB Streams can receive the same record more than once, for example when a failed batch is retried. Every evaluator must therefore be idempotent:

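// Bad: creates a duplicate alarm every time the record is reprocessed
await createAlarm(deviceId, alarmType);

// Good: conditional write that succeeds only if no unresolved alarm exists
try {
  await dynamodb.put({
    TableName: "alarms",
    Item: alarm,
    ConditionExpression: "attribute_not_exists(pk) OR #status = :resolved",
    ExpressionAttributeNames: { "#status": "status" },
    ExpressionAttributeValues: { ":resolved": "RESOLVED" },
  });
} catch (err) {
  // A failed condition means the alarm is already active, so treat it as a no-op.
  if ((err as { name?: string }).name !== "ConditionalCheckFailedException") throw err;
}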
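The effect of that condition is easier to see against an in-memory stand-in for the table. The sketch below is a simulation, not DynamoDB code: `AlarmTable`, `alarmKey`, and `putIfAbsentOrResolved` are hypothetical names, and the key format is an assumption. The important part is that the key is derived only from the device and alarm type, so a replayed record maps to the same item.

```typescript
interface Alarm {
  pk: string;
  status: "ACTIVE" | "RESOLVED";
  createdAtMs: number;
}

// In-memory stand-in for the alarms table, keyed by partition key.
type AlarmTable = { [pk: string]: Alarm };

// Deterministic key: the same device and alarm type always produce the same item key.
function alarmKey(deviceId: string, alarmType: string): string {
  return "ALARM#" + deviceId + "#" + alarmType;
}

// Mirrors "attribute_not_exists(pk) OR #status = :resolved".
// Returns true if the write happened, false if the condition failed.
function putIfAbsentOrResolved(table: AlarmTable, alarm: Alarm): boolean {
  const existing: Alarm | undefined = table[alarm.pk];
  if (existing !== undefined && existing.status !== "RESOLVED") {
    return false;
  }
  table[alarm.pk] = alarm;
  return true;
}
```

Processing the same stream record twice calls `putIfAbsentOrResolved` twice with the same key; the second call returns false and leaves the original alarm untouched, which is the behavior the conditional write gives us in DynamoDB.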

2. Fan-Out for Scale

Initially, we had one Lambda function evaluating all alarm types. This created a bottleneck. We refactored to fan-out:

// Route to the evaluator Lambda for this alarm type
const evaluatorArn = `arn:aws:lambda:${region}:${account}:function:alarm-evaluator-${alarmType}`;
// The payload must be serialized JSON, not a raw object
await lambda.invoke({ FunctionName: evaluatorArn, Payload: JSON.stringify(event) });
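The routing step can be separated from the SDK call so it is testable on its own. The sketch below builds one invocation request per alarm type. It assumes asynchronous invocation (`InvocationType: "Event"`), which lets the router return without waiting for each evaluator; whether that fits depends on how you want failures surfaced. `InvokeRequest`, `evaluatorArnFor`, and `buildFanOut` are hypothetical names.

```typescript
// Subset of the Lambda Invoke parameters used for fan-out.
interface InvokeRequest {
  FunctionName: string;
  InvocationType: "Event"; // asynchronous invocation (an assumption of this sketch)
  Payload: string; // serialized JSON
}

// ARN naming convention taken from the snippet above.
function evaluatorArnFor(region: string, account: string, alarmType: string): string {
  return `arn:aws:lambda:${region}:${account}:function:alarm-evaluator-${alarmType}`;
}

// Build one request per alarm type; every evaluator receives the same event payload.
function buildFanOut(
  region: string,
  account: string,
  alarmTypes: string[],
  event: { [key: string]: unknown }
): InvokeRequest[] {
  const payload = JSON.stringify(event);
  return alarmTypes.map(
    (alarmType): InvokeRequest => ({
      FunctionName: evaluatorArnFor(region, account, alarmType),
      InvocationType: "Event",
      Payload: payload,
    })
  );
}
```

Each returned request can then be passed to `lambda.invoke`. Serializing the event once and reusing the string avoids repeating the work for every alarm type.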

3. Observability First

We instrument everything with structured logging and CloudWatch metrics:

// Compute the duration once so the log line and the metric agree
const duration = Date.now() - startTime;

logger.info("Alarm evaluated", {
  deviceId,
  alarmType,
  result: "ACTIVE",
  evaluationDurationMs: duration,
});

await cloudwatch.putMetricData({
  Namespace: "AlarmSystem",
  MetricData: [
    { MetricName: "AlarmEvaluations", Value: 1, Unit: "Count" },
    { MetricName: "EvaluationDuration", Value: duration, Unit: "Milliseconds" },
  ],
});
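Since each alarm type runs in its own evaluator, it can help to tag the metrics with the alarm type so they can be graphed separately. The sketch below builds the `MetricData` array with an `AlarmType` dimension. The dimension name and the `buildEvaluationMetrics` helper are assumptions for illustration; the metric names match the snippet above.

```typescript
// Subset of CloudWatch's metric datum fields used here.
interface Dimension {
  Name: string;
  Value: string;
}

interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: "Count" | "Milliseconds";
  Dimensions: Dimension[];
}

// Build the per-evaluation metrics, tagged with the alarm type.
function buildEvaluationMetrics(alarmType: string, durationMs: number): MetricDatum[] {
  const dimensions: Dimension[] = [{ Name: "AlarmType", Value: alarmType }];
  return [
    { MetricName: "AlarmEvaluations", Value: 1, Unit: "Count", Dimensions: dimensions },
    { MetricName: "EvaluationDuration", Value: durationMs, Unit: "Milliseconds", Dimensions: dimensions },
  ];
}
```

The result can be passed as `MetricData` in the `putMetricData` call, for example `MetricData: buildEvaluationMetrics(alarmType, duration)`.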

Results

After deploying this system:

Detection latency: < 60 seconds from issue to alert
False positive rate: < 1%
Auto-resolution rate: 85% of alarms resolve without intervention
Cost: ~$50/month for 100K devices

Conclusion

Building a reliable alarm system requires careful attention to idempotency, scale, and observability. AWS serverless technologies provide the building blocks, but the design patterns matter.

In future posts, we'll dive deeper into specific alarm types and how we handle edge cases like network partitions and clock skew.

Have questions? Reach out to our team!