ENGINEERING

5-Layer Anti-Loop Protection: Automation Safety at Scale

Andres Muguira · February 26, 2026 · 7 min read
Automations, Safety, Architecture

The Automation Loop Problem

Automation is the most dangerous feature in any CRM. A user creates a rule: "When deal stage changes to Won, send a congratulations email." Harmless. Then another user creates: "When an email is sent, update the activity log." Still harmless. Then a third rule: "When the activity log is updated, check if deal stage should change." Now you have a loop. The deal stage triggers an email, the email triggers a log update, the log update triggers a stage check, the stage check triggers another email. Infinitely.

This is not a hypothetical scenario. It happened to us in beta. A customer's workspace ran 47,000 automation executions in 11 minutes before we manually killed the process. Their Supabase usage spiked, their email provider flagged the account, and the activity log became an unreadable wall of automated entries. That incident led us to build five distinct layers of loop protection, each catching scenarios the others miss.

Every automation platform eventually discovers the loop problem. The question is whether you discover it before or after your customer sends 47,000 emails in 11 minutes.

The 5-layer anti-loop protection stack: Source Tag, Depth Limit, Hourly Cap, Dedup Window, and Circuit Breaker.

Layer 1: Source Tagging

Every mutation in SalesSheet carries a source tag. When a user manually updates a deal stage, the source is user. When an automation updates it, the source is automation:{rule_id}. When an API call updates it, the source is api:{key_id}. This tag is attached to the database write and propagated to any events that the write triggers.

The first anti-loop rule is simple: an automation cannot trigger itself. If automation rule #42 updates a field, and that field change would normally trigger rule #42 again, we check the source tag and skip the execution. This catches the simplest loops -- a single rule that modifies the same field it watches.
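
A minimal sketch of the self-trigger check, in Python. The `Mutation` type and the function names are hypothetical and only illustrate the idea; the tag formats (`user`, `automation:{rule_id}`, `api:{key_id}`) are the ones described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    """A write to a record, tagged with where it came from (hypothetical type)."""
    record_id: str
    field: str
    source: str  # "user", "automation:{rule_id}", or "api:{key_id}"


def automation_source(rule_id: int) -> str:
    """Build the source tag for a write performed by an automation rule."""
    return f"automation:{rule_id}"


def should_skip_self_trigger(rule_id: int, mutation: Mutation) -> bool:
    """Return True when the mutation was written by the same rule it would trigger."""
    # Exact comparison, so rule 4 is never confused with rule 42.
    return mutation.source == automation_source(rule_id)


# Rule 42 watches the "stage" field and also writes to it.
by_user = Mutation("deal-1", "stage", "user")
by_rule_42 = Mutation("deal-1", "stage", automation_source(42))
by_rule_7 = Mutation("deal-1", "stage", automation_source(7))

print(should_skip_self_trigger(42, by_user))     # False: user edits still trigger the rule
print(should_skip_self_trigger(42, by_rule_42))  # True: rule 42 cannot trigger itself
print(should_skip_self_trigger(42, by_rule_7))   # False: a different rule may trigger it
```

The last case is the important one: a write made by rule #7 still triggers rule #42, so this check on its own does nothing about chains that pass through more than one rule.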

Why Source Tagging Alone Is Not Enough

Source tagging catches self-loops but misses multi-rule loops. Rule A triggers Rule B, which triggers Rule C, which triggers Rule A. Each rule is triggered by a different rule, so no rule is triggering itself. We need deeper protection.

Layer 2: Execution Depth Limit

Every automation execution carries a depth counter. When a user action triggers an automation, the depth is 1. If that automation's side effects trigger another automation, the depth is 2. We hard-cap at depth 3. Any automation that would execute at depth 4 or higher is silently dropped, and we log the event for the workspace admin to review.

Depth 3 is generous enough for legitimate chains. A common pattern is a deal stage change (depth 1) that triggers a Slack notification (depth 2), which in turn logs an activity (depth 3). Three levels deep is a real use case. Four levels deep is almost certainly a loop or an over-engineered workflow that should be simplified.
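
The depth counter can be sketched in a few lines of Python. `ExecutionContext`, `child_context`, and `try_execute` are hypothetical names used for illustration; only the cap of 3 and the log-and-drop behavior come from the description above.

```python
from dataclasses import dataclass

MAX_DEPTH = 3  # hard cap described in the text


@dataclass(frozen=True)
class ExecutionContext:
    """Carried along with every automation execution (hypothetical type)."""
    depth: int  # 1 when a user action triggers the automation directly


def child_context(parent: ExecutionContext) -> ExecutionContext:
    """Context for an automation triggered by another automation's side effects."""
    return ExecutionContext(depth=parent.depth + 1)


def try_execute(rule_id: int, ctx: ExecutionContext, dropped_log: list) -> bool:
    """Return True if the rule may run; otherwise record the drop and return False."""
    if ctx.depth > MAX_DEPTH:
        dropped_log.append({"rule_id": rule_id, "depth": ctx.depth})
        return False
    return True


# Three rules that trigger each other in a cycle: 1 -> 2 -> 3 -> 1 -> ...
next_rule = {1: 2, 2: 3, 3: 1}
executed, dropped = [], []
rule, ctx = 1, ExecutionContext(depth=1)
while try_execute(rule, ctx, dropped):
    executed.append((rule, ctx.depth))
    rule, ctx = next_rule[rule], child_context(ctx)

print(executed)  # [(1, 1), (2, 2), (3, 3)]
print(dropped)   # [{'rule_id': 1, 'depth': 4}]
```

The three-rule cycle that source tagging cannot see stops on its fourth hop, and the dropped entry is what a workspace admin would later review.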

Layer 3: Hourly Rate Caps

Even with source tagging and depth limits, a non-looping automation can cause damage at scale. Imagine a bulk import of 10,000 contacts, each triggering a "send welcome email" automation. That is 10,000 emails in seconds -- legitimate, but probably not what the user intended.

We enforce per-rule hourly rate caps.

When a rule hits its cap, we pause it, notify the workspace admin via email and in-app notification, and provide a one-click resume button. The admin can also adjust the cap higher if the volume is intentional.
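
One way to model the pause-and-resume behavior is sketched below in Python. The class name, the rolling one-hour window, and the tiny cap used in the demo are all assumptions made for illustration; the actual cap values are not reproduced here.

```python
from collections import deque

HOUR_SECONDS = 3600


class RuleRateCap:
    """Per-rule hourly cap with pause and resume (hypothetical sketch)."""

    def __init__(self, cap_per_hour: int):
        self.cap_per_hour = cap_per_hour
        self.paused = False
        self._timestamps = deque()  # execution times within the last hour

    def allow(self, now: float) -> bool:
        """Return True if the rule may execute at time `now` (in seconds)."""
        if self.paused:
            return False
        # Forget executions that are an hour old or older.
        while self._timestamps and now - self._timestamps[0] >= HOUR_SECONDS:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.cap_per_hour:
            self.paused = True  # a real system would also notify the admin here
            return False
        self._timestamps.append(now)
        return True

    def resume(self, new_cap=None) -> None:
        """Admin action: un-pause the rule, optionally raising the cap."""
        if new_cap is not None:
            self.cap_per_hour = new_cap
        self.paused = False


cap = RuleRateCap(cap_per_hour=3)  # illustrative value only
print([cap.allow(now=float(t)) for t in range(5)])  # [True, True, True, False, False]
print(cap.paused)                                   # True

cap.resume(new_cap=10)
print(cap.allow(now=5.0))                           # True
```

In this sketch, resuming without raising the cap would pause the rule again on the next trigger inside the same hour, which is why the resume action and the cap adjustment are offered together.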

Layer 4: Deduplication Window

The dedup window catches a subtle failure mode: the same automation firing on the same record multiple times within a short window. This happens when a field is rapidly updated -- for example, a rep changing a deal value three times in quick succession as they negotiate. Without dedup, each change triggers the automation separately, potentially sending three notifications about the same deal.

We maintain a rolling 60-second dedup window per rule per record. If rule #42 fires on contact #789 at timestamp T, and the same rule would fire on the same contact at T+30 seconds, we skip the second execution. The window resets after 60 seconds of no executions. This is implemented with a Redis sorted set where the score is the timestamp and the member is a composite key of rule_id:record_id.
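
The dedup check itself is small. The sketch below uses an in-memory dictionary in place of Redis so that it runs on its own; the class and method names are hypothetical. It also assumes that skipped attempts do not extend the window, which is one reading of "60 seconds of no executions". In Redis, the same lookup and update could plausibly map onto `ZSCORE` and `ZADD` against the sorted set described above.

```python
DEDUP_WINDOW_SECONDS = 60


class DedupWindow:
    """Per-rule, per-record dedup window (in-memory stand-in for Redis)."""

    def __init__(self, window_seconds: float = DEDUP_WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._last_run = {}  # "rule_id:record_id" -> timestamp of last execution

    def should_run(self, rule_id: int, record_id: int, now: float) -> bool:
        """Return True if the rule may fire on this record at time `now`."""
        key = f"{rule_id}:{record_id}"  # composite key, as in the text
        last = self._last_run.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window; the window is not extended
        self._last_run[key] = now
        return True


dedup = DedupWindow()
T = 1000.0
print(dedup.should_run(42, 789, now=T))       # True: first execution
print(dedup.should_run(42, 789, now=T + 30))  # False: same rule, same record, within 60 s
print(dedup.should_run(42, 790, now=T + 30))  # True: different record
print(dedup.should_run(7, 789, now=T + 30))   # True: different rule
print(dedup.should_run(42, 789, now=T + 60))  # True: the window has elapsed
```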

Layer 5: Circuit Breaker

Circuit breaker state machine: Closed (normal) to Open (tripped) to Half-Open (testing at 50% rate).

The circuit breaker is the last line of defense. It operates at the workspace level, not the rule level. If total automation executions for a workspace exceed 5,000 in a rolling 5-minute window, the circuit breaker trips. All automations for that workspace are paused. The admin receives an urgent notification with a summary of which rules fired the most and a button to re-enable automations.

The circuit breaker uses a state machine with three states:

  1. Closed (normal operation) -- automations execute freely
  2. Open (tripped) -- all automations paused, admin notified
  3. Half-open (recovery) -- after admin re-enables, we allow automations at 50% rate for 10 minutes. If the rate stays healthy, we return to Closed. If it spikes again, we return to Open.
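
The three states can be sketched as a small Python class. The names are hypothetical, and the sketch assumes that "50% rate" means half the normal threshold during recovery; the thresholds and durations are the ones given above.

```python
from collections import deque
from enum import Enum

TRIP_THRESHOLD = 5000        # executions per workspace (from the text)
WINDOW_SECONDS = 5 * 60      # rolling window (from the text)
HALF_OPEN_SECONDS = 10 * 60  # recovery period (from the text)


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class WorkspaceCircuitBreaker:
    """Workspace-level breaker over a rolling execution count (hypothetical sketch)."""

    def __init__(self, threshold: int = TRIP_THRESHOLD):
        self.threshold = threshold
        self.state = State.CLOSED
        self._executions = deque()  # timestamps inside the rolling window
        self._half_open_since = None

    def _limit(self) -> int:
        # Assumption: "50% rate" is modeled as half the normal threshold.
        return self.threshold // 2 if self.state is State.HALF_OPEN else self.threshold

    def allow(self, now: float) -> bool:
        """Return True if an automation may execute at time `now` (in seconds)."""
        if self.state is State.OPEN:
            return False
        if self.state is State.HALF_OPEN and now - self._half_open_since >= HALF_OPEN_SECONDS:
            self.state = State.CLOSED  # stayed healthy for the whole recovery period
        while self._executions and now - self._executions[0] >= WINDOW_SECONDS:
            self._executions.popleft()
        if len(self._executions) >= self._limit():
            self.state = State.OPEN  # a real system would notify the admin here
            return False
        self._executions.append(now)
        return True

    def admin_reenable(self, now: float) -> None:
        """Admin clicks re-enable: move from Open to Half-open."""
        if self.state is State.OPEN:
            self.state = State.HALF_OPEN
            self._half_open_since = now
            self._executions.clear()


breaker = WorkspaceCircuitBreaker(threshold=4)  # tiny threshold for the demo
print([breaker.allow(now=float(t)) for t in range(6)])
# [True, True, True, True, False, False]
print(breaker.state)  # State.OPEN

breaker.admin_reenable(now=10.0)
print([breaker.allow(now=10.0 + t) for t in range(3)])  # [True, True, False]
print(breaker.state)  # State.OPEN (spiked again during recovery)

breaker.admin_reenable(now=20.0)
breaker.allow(now=21.0)
breaker.allow(now=20.0 + HALF_OPEN_SECONDS)
print(breaker.state)  # State.CLOSED
```

For scale: the beta incident averaged roughly 4,300 executions per minute (47,000 over 11 minutes). Assuming a uniform rate, a threshold of 5,000 would be crossed in a little over a minute.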

The circuit breaker is not about preventing loops. It is about limiting blast radius. Even if layers 1 through 4 fail, the circuit breaker ensures that no workspace can consume unbounded resources.

Automation execution log showing successful runs alongside blocked ones with "Loop Detected" badges and reasons.

The Admin Dashboard

All five layers feed into an automation health dashboard visible to workspace admins.

This visibility is critical because most automation problems are not emergencies. They are slow leaks -- a rule that fires 10% more often than it should, gradually cluttering the activity log. The dashboard makes these patterns visible before they become crises.

Lessons Learned

Building these five layers taught us that automation safety is not a single problem. It is a spectrum of failure modes, each requiring a different detection mechanism. Source tagging catches self-loops. Depth limits catch multi-rule chains. Rate caps catch volume spikes. Dedup catches rapid-fire triggers. Circuit breakers catch everything else. No single layer is sufficient, but together they have prevented every loop incident since we deployed them eight months ago.

The 47,000-execution incident was painful, but it was also the best thing that happened to our automation system. It forced us to think about safety as a first-class architectural concern, not an afterthought. If you are building automations in your product, start with these five layers. Your customers' inboxes will thank you.
