A flaky test is a test that sometimes passes and sometimes fails without any meaningful change in the product code. That unpredictability is what makes flaky tests so frustrating, especially when they show up as intermittent CI failures in a pipeline that otherwise looks healthy.

For QA engineers, SDETs, developers, and DevOps teams, flaky tests are not just a nuisance. They reduce trust in the test suite, slow down releases, and make it harder to tell whether a failure points to a real regression or just test flakiness. In a mature automation strategy, the goal is not to eliminate every possible source of variation, but to make the suite stable enough that a failure carries a clear signal.

A flaky test is dangerous because it turns the output of your test suite into a guessing game. Once people start assuming failures are random, real defects can slip through.

Flaky test definition

A flaky test is an automated test with nondeterministic outcomes. Given the same code, same environment, and same inputs, it may pass one run and fail the next. The failure is often intermittent, difficult to reproduce, and not tied to a stable product bug.

This is different from a consistently failing test. A consistently failing test is usually a legitimate defect, a broken assertion, or a changed expectation. A flaky test fails only sometimes, which means the root cause can be hidden in timing, environmental conditions, data dependencies, or brittle test design.

In software testing terms, flakiness is a property of the test and its execution context, not necessarily of the application itself. Flaky behavior tends to emerge where tests interact with real browsers, networks, APIs, queues, databases, or asynchronous systems.

Why flaky tests matter

Flaky tests create a specific kind of operational debt. They do not usually break production directly, but they damage the value of the test suite in several ways:

  • They waste debugging time.
  • They slow down merges because teams rerun pipelines to confirm whether a failure is real.
  • They can block releases if CI gates are strict.
  • They reduce confidence in regression coverage.
  • They create alert fatigue, especially when unstable tests generate noisy notifications.

If a team has too many flaky tests, developers stop trusting red builds. That trust erosion is expensive, because a reliable pipeline is one of the best ways to keep delivery fast while still catching regressions early. This is particularly important in continuous integration, where each commit is validated repeatedly under automated control.

Common causes of test flakiness

Flakiness usually comes from a mismatch between what a test assumes and how the system actually behaves under load, latency, or variance. The most common causes are timing, async UI behavior, data dependencies, browser differences, unstable environments, order dependence, and brittle assertions.

1. Timing and race conditions

Timing issues are one of the most common sources of flaky tests. The test may try to click a button, read text, or assert a state before the application has actually reached that state.

Examples include:

  • Waiting for a spinner to disappear without confirming the underlying request is done.
  • Clicking a button before it becomes enabled.
  • Asserting on DOM content before a render or animation completes.
  • Reading a database record before an asynchronous job finishes writing it.

These failures are especially common in browser automation and integration tests. The application may be working correctly, but the test moves faster than the system.

Fixed sleeps such as sleep(2000) are a fragile pattern. Sometimes the delay is too short and the test fails, and the rest of the time it is longer than needed and slows the suite. They do not adapt to actual system state.

```typescript
// Better than a hard sleep: wait for a specific condition.
// `page` is the Playwright Page object available inside a test.
await page.getByRole('button', { name: 'Save' }).click();
await page.getByText('Saved successfully').waitFor({ state: 'visible' });
```
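The same idea applies outside the browser, for example when a test must wait for an asynchronous job to finish writing a record. The following is a minimal sketch of a polling helper. The name `waitUntil` and its default timeout and interval are assumptions made for illustration, not part of any library.

```typescript
// Hypothetical helper: polls a condition until it holds or a deadline passes.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  // Check first, then sleep, so a condition that already holds returns immediately.
  while (!(await condition())) {
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Unlike a fixed sleep, this returns as soon as the condition is true and fails with a clear message when it never becomes true.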

2. Asynchronous UI states

Modern front ends often update in stages. A page may render skeleton content, fetch data, update the DOM, then hydrate interactive components. Tests that assume a single stable moment can become flaky when they observe the page in the middle of that process.

Common examples:

  • A list appears before all rows are loaded.
  • A modal opens, but focus management completes later.
  • A button exists in the DOM, but is not visible or actionable yet.
  • Virtualized lists render only part of the content until scrolled.

For UI automation, the test should wait for user-visible behavior, not implementation details. If the workflow depends on network or client-side state, use explicit waits for state transitions instead of arbitrary delays.

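```typescript
// Wait for user-visible state rather than a fixed delay.
// The heading must be visible to the user.
await expect(page.getByRole('heading', { name: 'Orders' })).toBeVisible();
// Exactly one element with this test id must be present; the assertion retries until it is.
await expect(page.getByTestId('orders-table')).toHaveCount(1);
```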

3. Unstable test data

Test data problems are a major cause of flaky tests in API, integration, and end-to-end layers. A test can fail if it depends on data that other tests mutate, if records are reused across runs, or if cleanup is incomplete.

Typical issues include:

  • Reusing the same user account across parallel runs.
  • Creating records with non-unique identifiers.
  • Depending on shared fixture data that other tests update or delete.
  • Leaving background jobs or stateful services in a dirty condition.

A test suite becomes fragile when it assumes a single writer, single reader model in an environment that is actually concurrent. The fix is usually better isolation, not more retries.

Good practices include:

  • Generating unique identifiers per run.
  • Creating data through APIs or factories rather than shared fixtures.
  • Cleaning up after each test, or using ephemeral environments.
  • Making test data ownership explicit.

4. Browser and platform differences

Cross-browser behavior can expose flakiness in locators, rendering, and event timing. A test might pass in one browser and fail in another because of differences in:

  • Element rendering and layout timing.
  • Focus handling.
  • Native dialog behavior.
  • Scroll or viewport calculations.
  • Network and cache behavior.

These differences are not always bugs in the application. Sometimes they reveal that the test depends too closely on browser-specific implementation details. A locator that works in Chromium may fail in Firefox if it relies on fragile DOM structure or assumes a particular rendering order.

The best defense is to target stable user-facing selectors, avoid overfitting to CSS structure, and validate the test across the browsers you actually support.

5. Unstable environments and infrastructure

Sometimes the test logic is fine, but the environment is noisy. Shared CI runners, overloaded containers, throttled resources, flaky third-party services, and network latency can all produce intermittent CI failures.

Examples include:

  • CPU starvation causing the browser to miss timing windows.
  • A test environment with inconsistent seeded data.
  • A mock server that times out under parallel load.
  • A dependency service that responds slowly or rate-limits requests.
  • Docker containers with different resource limits across runs.

This is especially relevant in distributed test automation, where several services must cooperate during a single test. If one dependency is unstable, the test failure may look like a product defect even when the root cause is infrastructure-related.

6. Order dependence and hidden state

Tests that pass only when run in a certain order are a classic source of unreliability. This often means one test leaves behind data, cookies, local storage, shared mocks, or server-side state that changes another test’s assumptions.

If a test suite passes when run alone but fails in a batch, hidden state is a likely culprit. Order dependence is especially common in suites that grew incrementally without strong isolation boundaries.

7. Brittle assertions

Some tests are flaky because the assertion is too exact. A check for a pixel-perfect layout, an exact timestamp, or a full JSON object that may include extra metadata can fail even though the behavior is correct.

Examples:

  • Asserting exact text when the wording includes dynamic timestamps.
  • Comparing full API responses instead of the stable contract fields.
  • Checking a precise CSS class order that the app does not guarantee.
  • Expecting the exact sequence of async log events.

A better strategy is to assert the parts of the output that matter to the user or the contract being tested.
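One way to express this is to compare only the fields that belong to the contract. The sketch below assumes a hypothetical `OrderResponse` shape in which `updatedAt` and `requestId` change on every call; `mismatchedFields` is an illustrative helper, not a library function.

```typescript
// Hypothetical API response: `id` and `status` are the contract under test,
// while `updatedAt` and `requestId` vary between calls.
interface OrderResponse {
  id: string;
  status: string;
  updatedAt: string;
  requestId: string;
}

// Returns the names of expected fields whose values differ in `actual`.
// Fields that are not listed in `expected` are ignored.
function mismatchedFields<T extends object>(actual: T, expected: Partial<T>): string[] {
  return (Object.keys(expected) as Array<keyof T>)
    .filter((key) => actual[key] !== expected[key])
    .map(String);
}

const response: OrderResponse = {
  id: 'order-42',
  status: 'confirmed',
  updatedAt: new Date().toISOString(),
  requestId: Math.random().toString(36).slice(2),
};

// The check ignores the volatile fields and looks only at the contract.
const mismatches = mismatchedFields(response, { id: 'order-42', status: 'confirmed' });
if (mismatches.length > 0) {
  throw new Error(`Contract fields differ: ${mismatches.join(', ')}`);
}
```

Test runners often ship an equivalent matcher; Jest's `toMatchObject`, for example, performs this kind of partial comparison.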

Flaky tests vs real product bugs

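One of the hardest parts of dealing with flaky tests is telling them apart from real defects. A real bug is usually repeatable under defined conditions, while a flaky test is inconsistent by definition. The line is not always clean, because a genuine race condition in the product can also fail only intermittently.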

A practical way to separate them is to ask:

  1. Can I reproduce the failure locally or in a controlled environment?
  2. Does the failure happen every time under the same conditions?
  3. Does the application behavior violate a clear requirement?
  4. Does rerunning the same test often make the problem disappear?

If the answer to the last question is yes, you may be looking at test flakiness. That said, rerun-based diagnosis should be used carefully. A test that passes on rerun is not automatically safe to ignore. Sometimes the rerun succeeds because the environment changed, not because the issue disappeared.
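Questions 2 and 4 can be approximated mechanically by running the same check several times under the same conditions. This is a minimal sketch; `classifyByReruns` and the `Verdict` labels are hypothetical names, and the default run count is arbitrary.

```typescript
type Verdict = 'stable-pass' | 'consistent-failure' | 'flaky';

// Runs a check repeatedly and classifies the outcome by how often it fails.
function classifyByReruns(
  check: () => void,
  runs = 10,
): { verdict: Verdict; failures: number } {
  let failures = 0;
  for (let i = 0; i < runs; i += 1) {
    try {
      check();
    } catch {
      failures += 1;
    }
  }
  const verdict: Verdict =
    failures === 0 ? 'stable-pass' : failures === runs ? 'consistent-failure' : 'flaky';
  return { verdict, failures };
}
```

A `stable-pass` over a handful of runs is weak evidence, since a failure that appears once in a few hundred runs will usually not show up in ten. A mixed result still needs root cause analysis before it is written off as flakiness.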

How to reduce flaky tests

Reducing flakiness is mostly about engineering for determinism. The right fix depends on the failure mode, but the strategies below cover most cases.

Use explicit waits, not fixed sleeps

If the system under test is asynchronous, wait for a real condition. Wait for elements to appear, disappear, become enabled, or expose stable text.

In Playwright, locator actions and web-first assertions such as toBeVisible() wait and retry automatically, which reduces test flakiness when they are used instead of manual sleeps.

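```typescript
// click() waits for the button to be actionable; toBeVisible() retries until it passes or times out.
await page.getByRole('button', { name: 'Submit' }).click();
await expect(page.getByText('Submission complete')).toBeVisible();
```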

In Selenium, prefer explicit waits over implicit assumptions.

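```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is assumed to be an already-created WebDriver instance.
# Wait up to 10 seconds for the banner to become visible.
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, 'success-banner')))
```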

Isolate test data

Make each test own its data. Use unique IDs, fresh users, isolated tenant accounts, or temporary namespaces. If a test must read shared data, treat that data as read-only and stable.

Useful patterns include:

  • Create data in setup, delete it in teardown.
  • Use API calls to seed exact test state.
  • Reset database state between runs when feasible.
  • Avoid relying on data created by another test.

Stabilize locators

Choose selectors that describe user intent instead of DOM structure. Good locators tend to be resilient to styling and layout changes.

Prefer:

  • Accessible roles
  • Labels
  • Stable data attributes
  • Meaningful text where appropriate

Avoid:

  • Deep CSS chains
  • Index-based selectors unless the list is intentionally positional
  • Auto-generated IDs that change across builds

Reduce dependence on external services

When a test does not need a real third-party service, stub or mock it. For API and integration testing, use contract-aware mocks or service virtualization so failures do not depend on someone else’s uptime.

That said, do not overmock. A test suite with too much mocking can become green while the real integration breaks. The goal is to isolate unstable dependencies selectively, not remove all realism.
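A selective stub might look like the sketch below. The `ExchangeRateProvider` interface, the `convert` function, and the rates are all invented for illustration, and the provider is synchronous only to keep the example short; a real client would normally be asynchronous.

```typescript
// Hypothetical dependency: in production this would call a third-party API.
interface ExchangeRateProvider {
  getRate(from: string, to: string): number;
}

// Code under test receives the provider instead of reaching out to the network itself.
function convert(
  amount: number,
  from: string,
  to: string,
  rates: ExchangeRateProvider,
): number {
  // Round to two decimal places.
  return Math.round(amount * rates.getRate(from, to) * 100) / 100;
}

// Stub with fixed answers; currency pairs it does not know fail loudly
// instead of silently returning a default.
const stubRates: ExchangeRateProvider = {
  getRate(from, to) {
    const table: Record<string, number> = { 'USD->EUR': 0.5, 'EUR->USD': 2 };
    const rate = table[`${from}->${to}`];
    if (rate === undefined) {
      throw new Error(`No stubbed rate for ${from}->${to}`);
    }
    return rate;
  },
};
```

Tests of the conversion logic use `stubRates`, while a smaller set of integration tests can still exercise the real provider so that a broken integration does not go unnoticed.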

Make environments reproducible

If a test passes locally but fails in CI, look at environment drift:

  • Different browser versions
  • Different time zones or locales
  • Different container resources
  • Different feature flags
  • Different data volumes

A reproducible environment is one of the strongest defenses against flaky tests. Dockerized test runners, pinned browser versions, and consistent CI configuration can help make failures easier to reason about.
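Time zone and locale drift can often be removed in the code or the test itself by making both explicit. The sketch below uses the standard `Intl.DateTimeFormat` API; `formatOrderDate` and the sample timestamp are hypothetical.

```typescript
// Formats a timestamp with an explicit locale and time zone, so the result
// does not depend on the settings of the machine that runs the test.
function formatOrderDate(instant: Date): string {
  return new Intl.DateTimeFormat('en-US', {
    timeZone: 'UTC',
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
  }).format(instant);
}

// 23:30 UTC on 15 January is already 16 January in zones east of UTC, so a
// formatter relying on the runner's default zone could disagree between
// a laptop and a CI container.
const lateEvening = new Date(Date.UTC(2024, 0, 15, 23, 30));
console.log(formatOrderDate(lateEvening)); // 01/15/2024
```

Where the application itself reads the system zone, the alternative is to pin the zone and locale in the test runner or container configuration.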

Improve cleanup and teardown

Many test flakiness issues come from incomplete cleanup. After each run, ensure the test leaves no state behind that can affect the next one.

This includes:

  • Clearing local storage and cookies
  • Deleting created records
  • Resetting queues and background jobs
  • Releasing locks or reserved resources

Track and classify failures

Not every failure should be treated the same way. It helps to classify failures into buckets such as:

  • Product bug
  • Test bug
  • Environment issue
  • Data issue
  • Unknown, needs investigation

This makes it easier to spot patterns. If the same suite fails only in one browser, or only when run in parallel, the pattern may reveal the cause.
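A small tally is often enough to surface such patterns. In this sketch the `FailureRecord` shape, the bucket names, and the sample records are all made up for illustration and are not real results.

```typescript
type FailureBucket = 'product-bug' | 'test-bug' | 'environment' | 'data' | 'unknown';

interface FailureRecord {
  test: string;
  bucket: FailureBucket;
  browser: string;
}

// Counts records per key so that clusters stand out.
function countBy<T>(items: T[], keyOf: (item: T) => string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const item of items) {
    const key = keyOf(item);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Illustrative records only.
const failures: FailureRecord[] = [
  { test: 'checkout', bucket: 'test-bug', browser: 'firefox' },
  { test: 'checkout', bucket: 'test-bug', browser: 'firefox' },
  { test: 'login', bucket: 'environment', browser: 'chromium' },
  { test: 'checkout', bucket: 'unknown', browser: 'firefox' },
];

const byBrowser = countBy(failures, (f) => f.browser);
const byBucket = countBy(failures, (f) => f.bucket);
```

With this sample data, `byBrowser` would show three of the four failures in one browser, which is the kind of cluster worth investigating first.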

A practical debugging workflow

When a flaky test fails, resist the urge to immediately add a retry. Retries can hide the issue while increasing suite runtime and reducing confidence.

Instead, use a structured debugging approach:

  1. Re-run the test multiple times in the same environment.
  2. Capture screenshots, videos, logs, network traces, and console output.
  3. Check whether the failure correlates with timing, parallel execution, or a specific browser.
  4. Compare the failing run to a passing run.
  5. Reduce the test to a minimal reproducible case.

If you cannot explain why a test failed, do not assume it is harmless. Assumption-based triage is how flaky suites become normal.

For UI suites, browser devtools traces, HAR files, and test runner artifacts are often enough to reveal whether the failure was caused by an element not being ready, a request timing out, or a selector resolving unexpectedly.

Should you retry flaky tests?

Retries are a tradeoff, not a fix. They can reduce pipeline noise in the short term, but they also risk masking instability.

Retries make sense when:

  • The failure is caused by a known external dependency.
  • You have already identified the root cause and are temporarily mitigating it.
  • The suite is low-risk and the test is not a release gate.

Retries are risky when:

  • They are added before root cause analysis.
  • They become the default response to every intermittent failure.
  • They are used to silence failures in critical release checks.

A better policy is to treat retries as a controlled stopgap, while tracking flakiness rates and assigning fixes to the underlying cause.
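If retries are used, the retry itself should leave a trace. The following sketch records how many attempts a test needed, so a pass on retry is counted as flakiness rather than hidden. `runWithRetries` and `RetryResult` are hypothetical names, and the default of three attempts is arbitrary.

```typescript
interface RetryResult {
  passed: boolean;
  attempts: number;
  // True when the test failed at least once before it passed.
  flaky: boolean;
}

function runWithRetries(test: () => void, maxAttempts = 3): RetryResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      test();
      return { passed: true, attempts: attempt, flaky: attempt > 1 };
    } catch {
      // Ignore this failure and try again; exhausting all attempts is reported below.
    }
  }
  return { passed: false, attempts: maxAttempts, flaky: false };
}
```

Feeding the `flaky` flag into a dashboard or issue tracker keeps the stopgap visible. Some runners already do this; Playwright Test, for instance, reports a test that passes on retry as flaky.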

Designing for automated test stability

Stable suites are designed, not hoped for. If automated test stability is a priority, make it part of the testing strategy from the start.

A stable automation strategy usually includes:

  • Clear boundaries between unit, integration, and end-to-end tests
  • Small, focused tests where possible
  • Explicit synchronization around async behavior
  • Deterministic test data setup
  • Environment parity between local and CI runs
  • Strong observability for failures
  • Ownership for fixing flaky tests quickly

It also helps to keep the suite layered. Unit tests should catch logic errors quickly, integration tests should validate service contracts, and end-to-end tests should cover critical user journeys. If the only coverage for a feature is a huge end-to-end test, every small timing or environment issue becomes expensive.

What good flaky test management looks like

A team with mature test automation does not pretend flakiness will disappear completely. Instead, it manages it like any other technical debt:

  • Flaky tests are visible in dashboards or issue trackers.
  • Teams distinguish genuine regressions from unstable tests.
  • Repeated failures are assigned and fixed.
  • CI policies balance speed, signal quality, and practicality.
  • Test design reviews include stability considerations.

A useful rule is that every flaky test should have an owner, a root cause hypothesis, and a target fix path. If a test has no owner, it tends to stay flaky indefinitely.

Quick checklist for diagnosing a flaky test

Use this checklist when a test fails intermittently:

  • Does it use fixed sleeps instead of explicit waits?
  • Does it depend on shared data or shared accounts?
  • Is it sensitive to browser, locale, or time zone differences?
  • Does it fail only in CI or only in parallel runs?
  • Does it depend on a service that is slow or outside your control?
  • Does it leave state behind instead of cleaning up after itself?
  • Is the assertion too strict?
  • Does it rely on hidden state from previous tests?

If you answer yes to more than one of these, the test likely needs redesign, not another rerun.

Final take

A flaky test is an automated test with inconsistent behavior that undermines trust in the suite. The most common causes are timing issues, asynchronous UI states, unstable test data, browser differences, environmental noise, and brittle assertions. The best fixes usually involve better synchronization, stronger isolation, more stable locators, reproducible environments, and disciplined cleanup.

For teams building and maintaining test automation, dealing with flaky tests is part of the job, not an edge case. The goal is to keep the suite reliable enough that a failure means something, which is the whole point of automation in the first place.