Our Mission

To make the world's software more reliable, performant, and secure by default.

Overview

Software is being written faster than ever, but not more reliably. From big tech[1] to startups[2], language models have dramatically sped up programming. In doing so, they have exposed and created deep structural weaknesses in how we validate and maintain code. We now introduce changes with less human understanding because traditional mechanisms of assurance don't scale with today's pace of development.

In light of this, our goal is to build faster, more effective feedback mechanisms for developers. We're shortening the path from writing code to understanding its behavior by pioneering adaptive, exhaustive, and autonomous testing.

What's Broken Today

Engineers and organizations risk introducing unreliable and insecure code into production environments, undermining the efficiency gains models are meant to provide. Review tools operate on the wrong primitives, and testing practices lack sufficient automation.

Automated code review has fundamental limitations

One proposed solution to the issues code generation has introduced is automated review bots living on GitHub and other source-code-hosting platforms. We believe this is the wrong abstraction.

First, not all exceptions and security issues can be caught with static analysis. Particularly for interpreted languages, you need to run code to see if it behaves correctly. Even for compiled languages, static analysis suffers from state space explosion: as code complexity increases, the number of possible execution paths grows exponentially, making it computationally infeasible to verify all possible behaviors without actually executing code[3].
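
To make this concrete, here is a minimal, hypothetical Python example (the normalize function is illustrative): it passes linting and type checking, yet one class of inputs fails only when the code is actually run.

```python
# Hypothetical example: static checks pass, but a whole class of inputs
# fails only at runtime.

def normalize(scores: list[float]) -> list[float]:
    """Scale scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]  # divides by zero when scores sum to 0

print(normalize([1.0, 3.0]))   # [0.25, 0.75] -- the happy path looks correct
print(normalize([0.0, 0.0]))   # ZeroDivisionError -- only execution exposes it
```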

Second, we believe that developers need local-first tooling for linting, testing, review, and debugging that provides immediate feedback as they write code. Autonomous review in a remote repository is an unnecessarily loose feedback mechanism.

Tests encode behavior sparsely and rigidly

Industry-standard testing practices (unit, integration, and end-to-end tests) validate only specific input/output combinations without deeply exploring the range of possible behaviors, creating a false sense of security.
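
A minimal, hypothetical sketch of that false sense of security: the test below passes, yet the function is wrong for every input beneath the lower bound, a region the hand-picked examples never touch.

```python
def clamp(value: int, low: int, high: int) -> int:
    # Bug: values below `low` pass through unchanged.
    return min(high, value) if value > low else value

def test_clamp():
    assert clamp(5, 0, 10) == 5    # in range: passes
    assert clamp(15, 0, 10) == 10  # above range: passes
    # Nothing exercises value < low, so clamp(-5, 0, 10) == -5 goes unnoticed.
```

(The correct implementation is max(low, min(high, value)); the property-based sketch later in this document finds the discrepancy automatically.)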

Worse, tests typically fail to adapt to rapidly evolving codebases and create a second source of truth that can drift out of sync with the implementation. The truth is, any tests committed to a repo add a maintenance burden. Teams waste hours wrestling with configuration and manually updating suites that should automatically adjust to changes in function signatures and call graphs.

What Needs to Exist

We need systems that bridge the gap between rapid coding and rigorous validation.

Runtime-aware with intent inference

We must analyze what the code does, not just what it says. While static analysis can catch syntax errors, it can't validate actual runtime behavior. Modern correctness tools need to observe code during execution, capturing inputs, outputs, and state transitions across various execution paths. Additionally, these tools should infer the intended behavior from function signatures, docstrings, and execution patterns — creating living specifications that evolve with the code itself.
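
As a rough sketch of the idea (not a description of any particular product), a simple decorator can record each call's inputs, output, errors, and timing; the function and log names below are illustrative.

```python
import functools
import json
import time

TRACE_LOG = []  # in practice, records would stream to durable storage

def observe(fn):
    """Record inputs, outputs, errors, and timing for every call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"function": fn.__qualname__, "args": repr(args), "kwargs": repr(kwargs)}
        started = time.time()
        try:
            result = fn(*args, **kwargs)
            record.update(outcome="ok", result=repr(result))
            return result
        except Exception as exc:
            record.update(outcome="error", error=repr(exc))
            raise
        finally:
            record["duration_s"] = time.time() - started
            TRACE_LOG.append(record)
    return wrapper

@observe
def parse_port(value: str) -> int:
    return int(value)

parse_port("8080")
try:
    parse_port("http")   # the failure is captured, not lost
except ValueError:
    pass
print(json.dumps(TRACE_LOG, indent=2))
```

From enough of these records, tooling can begin to infer which behaviors are intended and which are regressions.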

This approach is now viable thanks to advances in runtime tracing and telemetry. OpenTelemetry, structured logging, and modern language runtimes make it possible to cheaply and comprehensively capture runtime behavior: inputs, outputs, side effects, call graphs, and timing — creating a rich dataset from which to derive behavioral models.
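
For example, here is a minimal sketch using the OpenTelemetry Python SDK (assuming the opentelemetry-api and opentelemetry-sdk packages are installed); the attribute names are illustrative, not a standard convention.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout so the captured behavior is visible immediately.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("behavior-capture")

def parse_port(value: str) -> int:
    with tracer.start_as_current_span("parse_port") as span:
        span.set_attribute("input.value", value)
        result = int(value)
        span.set_attribute("output.result", result)
        return result

parse_port("8080")  # the exported span records input, output, and timing
```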

Behaviorally grounded through lightweight formal methods

Correctness should be measured by behavioral properties rather than isolated test cases. Traditional unit tests encode only a limited number of specific cases, but property-based testing can verify that code behaves correctly across the entire input space. These properties — like symmetry, idempotence, or invariant preservation — capture the true intent of the code in ways that individual test cases cannot. By focusing on these general properties rather than specific inputs and outputs, we avoid the limitations of traditional testing while gaining many of the benefits of formal verification without its complexity.
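
As a sketch of invariant preservation in practice, the property test below (written with Hypothesis, one of the frameworks discussed next) re-checks the buggy clamp from the earlier example against its defining invariant rather than hand-picked cases; the strategy bounds are arbitrary.

```python
from hypothesis import given, strategies as st

def clamp(value: int, low: int, high: int) -> int:
    # Same bug as the earlier sketch: values below `low` pass through.
    return min(high, value) if value > low else value

@given(value=st.integers(), low=st.integers(-1000, 0), high=st.integers(1, 1000))
def test_clamp_preserves_its_invariant(value, low, high):
    # Invariant: the result always lies within [low, high].
    assert low <= clamp(value, low, high) <= high
```

Run under pytest, Hypothesis searches the input space and reports a shrunk counterexample (for instance value=-1, low=0, high=1) that the example-based test never reached.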

The tools for this approach have matured in recent years. Property-based testing frameworks like Hypothesis, JQF, and libFuzzer have proven their ability to find real bugs. However, they remain siloed and inaccessible to many developers. With LLMs now enabling semantic evaluation of outputs, we can bridge the gap where exact equality checks fail, scoring whether two outputs 'mean the same thing' — even for complex, underspecified responses.
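
A heavily simplified sketch of that idea; call_llm is a placeholder rather than a real API, and stands in for whatever completion client a team already uses.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: wire this to your completion client of choice.
    raise NotImplementedError

def semantically_equivalent(expected: str, actual: str) -> bool:
    """Ask a model whether two answers convey the same meaning."""
    verdict = call_llm(
        "Do these two answers convey the same meaning? Reply YES or NO.\n"
        f"A: {expected}\nB: {actual}"
    )
    return verdict.strip().upper().startswith("YES")

expected = "The request failed because the user is not authenticated."
actual = "401: authentication is required to access this resource."
assert expected != actual                      # exact equality rejects the pair
# semantically_equivalent(expected, actual)    # a semantic oracle can accept it
```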

Automatically enforced with continuous validation

Validation must run continuously, without depending on humans to remember edge cases or enforce specs by hand. Modern systems should automatically generate property-based tests, run sophisticated fuzzing campaigns, and surface potential issues before they reach production. This continuous validation pipeline should integrate seamlessly with development workflows, providing immediate feedback when changes violate the inferred specifications or introduce new edge cases that require attention. The goal is to make rigor and validation as frictionless as the coding experience itself.
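
One minimal, local-first sketch of that loop (the src/ and tests/ layout is assumed): poll the source tree and rerun the validation suite the moment anything changes, so feedback arrives while the code is still in the editor.

```python
import subprocess
import time
from pathlib import Path

WATCHED = Path("src")                  # assumed project layout
TEST_CMD = ["pytest", "-q", "tests/"]  # e.g. the property tests shown earlier

def snapshot() -> dict[str, float]:
    """Map each source file to its last-modified time."""
    return {str(p): p.stat().st_mtime for p in WATCHED.rglob("*.py")}

def watch(poll_seconds: float = 1.0) -> None:
    last = snapshot()
    while True:
        time.sleep(poll_seconds)
        current = snapshot()
        if current != last:
            last = current
            print("change detected, revalidating...")
            subprocess.run(TEST_CMD)   # surface violations immediately

if __name__ == "__main__":
    watch()
```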

WASM and containerized runtimes now make this approach practical at scale. Executing, comparing, and mutating behavior across environments — especially for regression testing — is increasingly feasible. Furthermore, the cultural window for adoption is wide open. AI-assisted development has made engineers receptive to new approaches for ensuring correctness, as traditional review processes struggle to keep pace with AI-generated code.

Assert Labs exists to make this vision a reality. We're building infrastructure that captures real and synthetic runtime behavior; learns what code is expected to do from past runs, tests, docs, or LLM-generated specs; continuously checks for semantic drift, invariant violations, and unintended changes; and flags regressions in behavior, not just structure.

Our goal is not to replace testing, types, or review — but to evolve these practices by grounding them in execution. We need to shift from simplistic unit testing and manual review to continuous, automated behavioral validation — combining the intuitive speed of modern development with the rigor of formal methods, but without the complexity that has historically limited their adoption.

References

[1]
More than a quarter of new code at Google is generated by AI
Jay Peters (2024). https://www.theverge.com/2024/10/29/24282757/google-new-code-generated-ai-q3-2024
[2]
For 25% of the Winter 2025 batch, 95% of lines of code are LLM generated
Garry Tan (2025). https://x.com/garrytan/status/1897303270311489931
[3]
A few billion lines of code later: using static analysis to find bugs in the real world
Al Bessey et al., Communications of the ACM (2010). https://dl.acm.org/doi/10.1145/1646353.1646374
[4]
The Oracle Problem in Software Testing: A Survey
Earl T. Barr et al., IEEE Transactions on Software Engineering (2015). https://ieeexplore.ieee.org/document/6963470
[5]
Running Evals in Vitest
David Cramer (2025). https://cra.mr/vitest-evals