At some point, I realised that most of the API testing approaches I'd relied on for years were built on one assumption — that systems respond immediately.
Send a request. Get a response. Validate it. Move on.
That assumption held up for a long time. It stopped holding up last year, when I was working with a large e-commerce platform based out of Vancouver — a distributed system handling inventory, payments, and fulfilment across dozens of services, all running on AWS.
That project changed how I think about testing.
The Problem No One Talks About
Traditional API testing is synchronous by nature. You call an endpoint, you assert against the response. Clean, predictable, reassuring.
Event-driven systems break that model completely. A single request can trigger a chain: API Gateway fires a Lambda, drops a message onto SQS, which wakes up another service, which eventually writes to a database three hops downstream. By the time you're validating the response, the interesting work is still happening, or has already failed, somewhere you weren't looking.
That's exactly what happened to us. Messages were silently dropping in queues. Lambdas were partially executing. Retries were masking real defects. Contract mismatches between services were slipping through undetected.
The tests were passing. The system wasn't working.
Validating the API response alone gave us false confidence. We needed to see deeper.
Testing Interactions, Not Endpoints
The platform ran on a fairly typical AWS stack — API Gateway, Lambda, SQS, SNS, multiple microservices owning different domains of the e-commerce workflow. High transaction volume, asynchronous processing across the board.
The shift in mindset was significant. We weren't testing endpoints anymore. We were testing the behaviour between services — the handoffs, the message contracts, the sequencing. That distinction changed everything.
I landed on a layered approach. Not because it looked good on a whiteboard, but because each layer caught failures the others missed.
Trigger → Event → Processing → Observability → Validation
Playwright handled both API and UI triggers, simulating real user behaviour. At the event layer, we inspected what actually moved through SQS and SNS: payloads, schemas, sequencing. At the processing layer, we verified Lambda executions and the interactions between services. Datadog gave us distributed tracing across the entire chain. And at the validation layer, we confirmed final state: API responses, database records, UI where needed.
No single layer was sufficient. Validation had to be layered, not single-point.
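To make that concrete, here is a minimal sketch of the trigger and validation layers as a Playwright test. The endpoint paths, payload fields, and status values are illustrative rather than the platform's real API, and the event- and processing-layer checks that slot in between are sketched further down.

```typescript
import { test, expect, request } from '@playwright/test';
import { randomUUID } from 'node:crypto';

test('order submission eventually reaches a confirmed state', async () => {
  const correlationId = randomUUID();
  const api = await request.newContext({ baseURL: process.env.API_BASE_URL });

  // Trigger layer: fire the flow through the public API, tagging it with a
  // correlation ID so every downstream hop can be traced back to this run.
  const trigger = await api.post('/orders', {
    headers: { 'x-correlation-id': correlationId },
    data: { sku: 'TEST-SKU-001', quantity: 1 },
  });
  expect(trigger.status()).toBe(202); // accepted, not completed: the chain is still running

  // Validation layer: poll the read side until the async chain settles,
  // instead of asserting immediately against a response that can't know yet.
  await expect(async () => {
    const res = await api.get(`/orders/by-correlation/${correlationId}`);
    expect(res.ok()).toBeTruthy();
    expect((await res.json()).status).toBe('CONFIRMED');
  }).toPass({ timeout: 60_000, intervals: [2_000, 5_000, 10_000] });
});
```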
The glue that held all of it together? Correlation IDs. Every flow got a unique identifier, injected at the trigger point and propagated across every service, every queue message, every log entry. Without correlation, debugging distributed systems is guesswork. With it, you can trace a single user action across fifteen services and know exactly where something broke.
That's not a testing technique. It's an architectural decision — and the single most valuable pattern I've used in complex distributed environments.
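For illustration, this is roughly what propagation looks like inside a consuming Lambda, assuming the correlation ID rides along as an SQS message attribute. The queue URL, attribute name, and log shape are assumptions, not the platform's actual conventions.

```typescript
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import type { SQSHandler } from 'aws-lambda';

const sqs = new SQSClient({});

// Propagation sketch: read the correlation ID off the incoming message,
// stamp it on every log line, and carry it forward on anything we publish.
// DOWNSTREAM_QUEUE_URL and the attribute name are illustrative values.
export const handler: SQSHandler = async (event) => {
  for (const record of event.Records) {
    const correlationId =
      record.messageAttributes?.correlationId?.stringValue ?? 'unknown';

    console.log(JSON.stringify({ correlationId, msg: 'processing record' }));

    await sqs.send(new SendMessageCommand({
      QueueUrl: process.env.DOWNSTREAM_QUEUE_URL,
      MessageBody: record.body,
      MessageAttributes: {
        correlationId: { DataType: 'String', StringValue: correlationId },
      },
    }));
  }
};
```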
Where AI Helped — And Where It Didn't
This was mid-2025. The agentic AI coding tools we have now didn't really exist yet — the MCP ecosystem for AWS had just entered developer preview, and mature agentic workflows weren't available.
I was using ChatGPT and GitHub Copilot — same tools most teams had at the time. Nothing exotic. But even then, they were genuinely useful for troubleshooting distributed systems.
Pasting CloudWatch log chunks into ChatGPT and asking it to find patterns across services, spot timing anomalies, and correlate failures turned hours of manual log-reading into a focused conversation. Copilot handled the repetitive-but-precise code: polling logic with exponential backoff, SQS message validators, schema assertion utilities. ChatGPT helped generate JSON schema validators and diff payloads against expected contracts, which was faster than hand-writing them for deeply nested e-commerce data structures.
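A sketch of the kind of utility I mean, assuming the correlation ID travels as a message attribute; the queue URL, attempt counts, and delays are placeholders rather than the project's real values.

```typescript
import { SQSClient, ReceiveMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// Poll a queue with exponential backoff until a message carrying the expected
// correlation ID shows up, or give up. In practice this would point at a
// test-only audit queue so the test doesn't compete with real consumers.
export async function pollForMessage(
  queueUrl: string,
  correlationId: string,
  { maxAttempts = 6, baseDelayMs = 1_000 } = {},
): Promise<Record<string, unknown>> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { Messages = [] } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 5,              // long polling
      MessageAttributeNames: ['All'],
    }));

    const match = Messages.find(
      (m) => m.MessageAttributes?.correlationId?.StringValue === correlationId,
    );
    if (match?.Body) return JSON.parse(match.Body);

    // Exponential backoff: 1s, 2s, 4s, 8s, ...
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  throw new Error(`No message with correlation ID ${correlationId} after ${maxAttempts} attempts`);
}
```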
One of the more valuable uses: describing a complex flow — a partial payment on a split order with one out-of-stock item — and asking ChatGPT to generate test scenarios I might be missing. It was genuinely good at thinking through failure modes in distributed systems.
Practical, not revolutionary. But it shifted time from mechanical debugging to actual analysis.
If I Were Doing This Today
Nine months later, I'd approach the same project very differently. The shift isn't incremental — it's a different category of tooling entirely.
Claude Code with Opus 4.6. A 1M token context window means loading an entire service's codebase, CloudWatch logs, Datadog traces, and SQS message history into a single session without context rot. Back in July, I was pasting log snippets into ChatGPT and losing context between conversations. That constraint is gone. Adaptive thinking and effort controls add another dimension — set effort to max for a complex cross-service timing investigation, drop to medium for boilerplate script generation.
MCP for AWS. This is the biggest shift. AWS MCP servers are now production-ready — Claude Code connects directly to SQS, Lambda, CloudWatch, S3 through standardised connectors. Instead of manually exporting and piping logs, an agent queries CloudWatch directly, inspects SQS dead-letter queues in real time, pulls Lambda execution metrics, and correlates everything autonomously. The old "export → paste → analyse" loop becomes a single agentic workflow.
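For context, wiring one of these servers into Claude Code is a small project-level config entry. The snippet below follows the .mcp.json shape as I understand it; the package name is a placeholder to swap for whichever AWS MCP server you actually use, and the profile and region are illustrative.

```json
{
  "mcpServers": {
    "aws-cloudwatch": {
      "command": "uvx",
      "args": ["<aws-cloudwatch-mcp-server-package>"],
      "env": { "AWS_PROFILE": "qa", "AWS_REGION": "us-west-2" }
    }
  }
}
```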
Agent Teams. For a distributed system with distinct domains, you can spin up a team: one agent investigating order-service event flows, another validating payment contract schemas, another tracing inventory sync failures — all running simultaneously. The lead coordinates, merges findings, and spawns targeted subagents to investigate specific failure points deeper.
This maps directly to how distributed systems actually fail. Problems don't live in one service. They live between services. Parallel agents investigating different domains and converging their findings mirrors that reality.
In practice, the workflow looks like this: Playwright triggers a complex e-commerce flow. An agent team fans out — one monitors SQS propagation via MCP, another queries CloudWatch for Lambda traces, another validates final database state. The lead correlates everything using the correlation ID injected at trigger. If something fails, a subagent spins up to investigate with full observability context fed in automatically.
Nine months ago, every one of those steps was manual.
What Doesn't Change
Better tooling doesn't eliminate the hard parts. It makes them more visible.
Eventual consistency, race conditions, flaky async validations, environment instability — these are inherent to distributed systems, not to your test framework. Most of the failures I dealt with existed in the spaces between services — dropped messages, schema mismatches, retry logic silently swallowing errors.
The patterns that consistently helped: correlation IDs everywhere, polling with intelligent backoff, idempotent test design, layered validation, and fallback paths when the primary assertion point was unavailable.
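Two of those patterns are easy to show in a few lines: idempotent test design and a fallback assertion path. This is a sketch under assumptions, not the project's code; checkReadApi and checkEventLog are hypothetical probes standing in for whatever primary and secondary signals you have.

```typescript
import { randomUUID } from 'node:crypto';

// Idempotent test design: every run mints its own identifiers, so retries and
// parallel runs never collide with data left behind by earlier runs.
export function newTestRunContext() {
  const runId = randomUUID();
  return {
    runId,
    correlationId: `test-${runId}`,
    orderRef: `qa-order-${runId}`,   // unique per run: safe to retry
  };
}

// Fallback path: if the primary assertion point (the read API) is unavailable,
// fall back to a secondary signal rather than failing blind.
export async function validateOrderConfirmed(
  correlationId: string,
  checkReadApi: (id: string) => Promise<boolean>,
  checkEventLog: (id: string) => Promise<boolean>,
): Promise<boolean> {
  try {
    return await checkReadApi(correlationId);   // primary assertion point
  } catch {
    return await checkEventLog(correlationId);  // fallback assertion point
  }
}
```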
And one thing I keep coming back to: Playwright belongs at the edges. It triggers flows and confirms outcomes. It's not for event inspection or internal service validation. The right tool for the right layer matters more than having one tool that does everything poorly.
The Bigger Point
Testing event-driven systems is less about writing tests and more about designing observability-driven validation. If you can't see what's happening between services, you can't test it. The testing strategy and the observability strategy have to be designed together — they're not separate concerns.
AI tooling is increasingly part of that observability layer. Not replacing it. Making it actionable at a speed that wasn't possible even nine months ago.
The tools will keep changing. The principles won't.