Task Monitoring Using Redis

We’ve started using Redis a lot and it’s been great. We’ve already used it to improve performance in our analytics pipeline, and to power our usage-based billing system.

Most recently we’ve written a leaner, faster version of our background task monitoring system based on Redis. Here’s how we did it…

The Goal

All background tasks use a simple Producer-Consumer model, built on Amazon SQS and Python. Our goal is to track and identify periods of low/high task throughput and adjust the number of available consumers accordingly.

Transactional email traffic can be very bursty, so it’s important that we’re able to monitor and provision workers in real-time.

At any time, we need to know:

How many workers are currently working on Task X
How long has it been since Task X was last completed

Using Redis Hashes

We use Redis Hashes to track this information. A Redis Hash looks like this:

HASH_KEY:
    FIELD_ONE: 0
    FIELD_TWO: 14
    FIELD_THREE: 6

There are two important features of hashes that make them great for our use case:

Hash values can be incremented/decremented atomically using HINCRBY
Undefined fields have an implicit value of 0

This means that we can safely increment any field in a Hash even if it hasn’t been set yet. This is great because we never need to initialize the data structure when adding new fields, and if we ever need to, we can delete a Hash and start over.
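
To make those two properties concrete, here is a small in-memory model of the HINCRBY behaviour described above. It is only an illustration in plain Python: the hincrby helper and the dict-of-dicts layout are assumptions made for this sketch, not Redis client code, and unlike Redis it offers no atomicity across processes.

```python
from collections import defaultdict

# Illustrative stand-in for Redis: maps hash key -> {field: integer value}.
hashes = defaultdict(dict)

def hincrby(key, field, amount):
    """Mimic HINCRBY: a missing field counts as 0; the new value is returned."""
    new_value = hashes[key].get(field, 0) + amount
    hashes[key][field] = new_value
    return new_value

hincrby("ACTIVE_TASKS", "TASK_A", 1)   # field did not exist before, now 1
hincrby("ACTIVE_TASKS", "TASK_A", 1)   # now 2
hincrby("ACTIVE_TASKS", "TASK_A", -1)  # back to 1

# "Starting over" is just deleting the hash; the next increment recreates it.
del hashes["ACTIVE_TASKS"]
hincrby("ACTIVE_TASKS", "TASK_B", 1)
```

No setup step appears anywhere: the first increment of a field is also what creates it.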

Tracking Active Tasks

We use an ‘ACTIVE_TASKS’ Hash to track how many tasks are currently active.

ACTIVE_TASKS:
    TASK_A: 7
    TASK_B: 0

When a worker starts performing Task X, it will increment the TASK_X field by 1. Once it has completed the task, it will decrement the field by 1.

We can then use HGET ACTIVE_TASKS TASK_X to fetch the number of workers performing Task X at any given moment.

In the above example, there are 7 active TASK_As, and 0 active TASK_Bs. All other task types are undefined and have an implicit value of 0.

Tracking Idle Time

We use a second Hash called ‘IDLE_SINCE’ to track how long it’s been since a particular task was performed.

IDLE_SINCE:
    TASK_A: 1410080563
    TASK_B: 1410080042

This Hash stores a Unix Timestamp for each task type, indicating when the most recent task of that type was completed.

In the example above:

TASK_A was last completed at 2014/09/07 9:02:43 UTC
TASK_B was last completed at 2014/09/07 8:54:02 UTC

Maintaining these values is pretty simple: every time a worker completes a task, it sets the timestamp for that task type to the current time.
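
The stored values are plain seconds since the Unix epoch, so they are easy to decode when debugging. The sketch below uses only the standard library and assumes the timestamps should be read as UTC; the describe helper is a name made up for this example.

```python
from datetime import datetime, timezone

def describe(ts):
    """Format a Unix timestamp (in seconds) as a UTC date-time string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y/%m/%d %H:%M:%S")

# The example IDLE_SINCE values from above.
idle_since = {"TASK_A": 1410080563, "TASK_B": 1410080042}

for task, ts in idle_since.items():
    print(task, "was last completed at", describe(ts))
```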

Detecting Idle Tasks

Now for the tricky part – detecting when tasks are idle.

To do this, we fetch values from both Hashes for a specific task and run the following pseudo-code:

if ACTIVE_TASKS.TASK_X > 0:
    return NOT_IDLE
if (NOW - IDLE_SINCE.TASK_X) < IDLE_THRESHOLD:
    return NOT_IDLE
return IDLE

The value of IDLE_THRESHOLD defines how long a task type must go without a completed task before it is considered “idle”.

This threshold can be whatever you need – we use different values depending on the task type.
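
As a rough sketch of how per-task thresholds might plug into the rule above, here is the same check as a pure Python function that works on values already fetched from the two Hashes. The threshold numbers, the IDLE_THRESHOLDS table and the default are all made up for illustration; they are not our production settings.

```python
import time

# Hypothetical per-task thresholds, in seconds.
IDLE_THRESHOLDS = {"TASK_A": 60, "TASK_B": 900}
DEFAULT_IDLE_THRESHOLD = 300  # used for task types not listed above

def is_idle(task_type, active_count, idle_since, now=None):
    """Apply the idle rule to one task type's active count and timestamp."""
    if now is None:
        now = int(time.time())
    if active_count > 0:
        return False  # a worker is on it right now
    threshold = IDLE_THRESHOLDS.get(task_type, DEFAULT_IDLE_THRESHOLD)
    if now - idle_since < threshold:
        return False  # a task of this type completed recently enough
    return True
```

Passing now explicitly keeps the function easy to test: is_idle("TASK_A", 0, 1410080563, now=1410080563 + 61) is idle under the 60 second threshold, while the same call with now=1410080563 + 30 is not.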

What it Looks Like in Python

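Here’s some code to get you started (Gist). It assumes the redis-py client and a Redis server on localhost:

import time
import redis

# decode_responses=True makes HGET return str instead of bytes
r = redis.Redis(decode_responses=True)

def now():
    return int(time.time())

def start_task(task_type):
    r.hincrby('ACTIVE_TASKS', task_type, 1)

def stop_task(task_type):
    # MULTI/EXEC: record the completion time and decrement in one transaction
    with r.pipeline(transaction=True) as pipe:
        pipe.hset('IDLE_SINCE', task_type, now())
        pipe.hincrby('ACTIVE_TASKS', task_type, -1)
        pipe.execute()

def get_active_tasks(task_type):
    # HGET returns None for an undefined field, which we treat as 0
    return int(r.hget('ACTIVE_TASKS', task_type) or 0)

def is_task_idle(task_type, idle_threshold_seconds):
    # Read both fields in one transaction so they form a consistent snapshot
    with r.pipeline(transaction=True) as pipe:
        pipe.hget('ACTIVE_TASKS', task_type)
        pipe.hget('IDLE_SINCE', task_type)
        active_tasks, idle_since = (int(v or 0) for v in pipe.execute())

    if active_tasks > 0:
        return False
    return (now() - idle_since) > idle_threshold_seconds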

Notes and Warnings

Undefined tasks are infinitely idle. This is because both the ACTIVE_TASKS count and the IDLE_SINCE timestamp default to 0 if the task has not yet been seen, so the task looks as if it last completed at the Unix epoch. In our use case this behaviour is ideal, but it may require more thought in other situations.

Use Multi/Exec to prevent some race conditions. Wrapping a series of commands in Redis MULTI and EXEC queues them up and runs them as one atomic operation. We use it to prevent race conditions between idle checks and tasks starting/stopping.

Always call stop_task. It’s easy to run into problems if a worker calls start_task and fails to call stop_task. Make sure you’re wrapping your tasks properly and using intelligent error handling.
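
One way to make that wrapping hard to forget is a context manager with a try/finally inside. This is a minimal sketch rather than our production wrapper: tracked_task is a name invented for the example, and it takes the start and stop functions as arguments so it can be demonstrated here with simple stubs instead of a live Redis connection.

```python
from contextlib import contextmanager

@contextmanager
def tracked_task(task_type, start_task, stop_task):
    """Call start_task on entry and stop_task on exit, even if the body raises."""
    start_task(task_type)
    try:
        yield
    finally:
        stop_task(task_type)

# Stub callbacks that only record what happened, standing in for the real
# start_task/stop_task functions.
calls = []

try:
    with tracked_task("TASK_A",
                      lambda t: calls.append(("start", t)),
                      lambda t: calls.append(("stop", t))):
        raise RuntimeError("task failed")
except RuntimeError:
    pass  # the error still reaches the caller, but the count was released first
```

Note that a worker process killed outright never reaches the finally block, so a wrapper like this reduces leaked counts but cannot rule them out.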

Protect against negative counts. It’s also possible to call stop_task too many times, giving a negative ACTIVE_TASKS count. Our production code has some additional checks in place to catch and prevent this scenario, as well as log appropriate warnings.
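
Our production checks aren't shown here, but one possible guard looks like the sketch below: because HINCRBY returns the new value, a worker can notice that its own decrement pushed the count below zero and undo it. Everything in it is an assumption for illustration, including the decrement_active name and the InMemoryClient stand-in, which mimics only the hincrby(name, key, amount) method of redis-py so the example runs without a server.

```python
import logging

log = logging.getLogger("task_monitor")

class InMemoryClient:
    """Tiny stand-in exposing only hincrby, so the sketch runs without Redis."""
    def __init__(self):
        self.hashes = {}

    def hincrby(self, name, key, amount=1):
        fields = self.hashes.setdefault(name, {})
        fields[key] = fields.get(key, 0) + amount
        return fields[key]

def decrement_active(client, task_type):
    """Decrement the active count, undoing the decrement if it went negative."""
    remaining = client.hincrby("ACTIVE_TASKS", task_type, -1)
    if remaining < 0:
        log.warning("stop_task without matching start_task for %s", task_type)
        # Undo only our own decrement; other workers' updates are left alone.
        remaining = client.hincrby("ACTIVE_TASKS", task_type, 1)
    return remaining

client = InMemoryClient()
decrement_active(client, "TASK_A")  # never started: logs a warning, count stays 0
```

The count can be briefly negative between the two calls, so a reader that must never see a negative value would need to clamp it on read as well.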

Garbage collection. One of the hardest parts about using Redis is key management and garbage collection. Fortunately this solution only uses two keys: ACTIVE_TASKS and IDLE_SINCE. So garbage collection is super easy.

Resetting tracking data. Because we have protection in place against negative ACTIVE_TASKS values, the Hashes used can be cleared at any time – calling stop_task on an empty Hash will simply set its value to 0.
