Data is a Liability: Replacing Persistent Retrieval with Ephemeral GPU Compute

TL;DR

Most AI systems rely on storage-first architectures that retain sensitive data in vector databases, caches, logs, and cloud infrastructure. StatelessLaw explores an alternative approach: a non-persistent inference pipeline based on transient compute, memory-only processing, and reduced forensic surface area. Instead of persisting private legal context, the system streams data through isolated execution environments, performs GPU-accelerated reranking in volatile memory, and minimizes long-lived state where possible. The goal is to reduce application-level persistence and long-term retention of sensitive context.

The common assumption in modern AI infrastructure is that more stored data creates more value. In practice, especially in legal and financial systems, persistent storage also increases operational and forensic risk.

Most AI platforms rely on storage-first architectures: documents are uploaded, indexed, embedded, cached, and retained inside persistent infrastructure to enable low-latency retrieval. Even when providers state that customer data is not used for training, the data often still exists at rest within vector databases, object storage, logs, or intermediate processing layers.

At Stateless Logic, we are exploring a different approach with StatelessLaw: an ephemeral retrieval architecture designed to minimize persistence surfaces and reduce long-term data exposure. Instead of treating sensitive legal context as an asset to retain, we treat it as transient compute state.


The Persistence Problem

Persistent infrastructure creates centralized aggregation points for sensitive information. In legal workflows, this can include:

  • internal strategy documents

  • case notes

  • contracts

  • privileged communications

  • regulatory analysis

Traditional AI retrieval pipelines frequently introduce multiple persistence layers:

  • vector indexes

  • cloud object storage

  • inference logs

  • temporary caches

  • observability pipelines

  • backup snapshots

Each layer increases the potential forensic surface.

Our goal is not to claim “perfect invisibility” or “zero trace” computing. Any real system leaves some state in layers outside the application's control, including hardware, the operating system, the hypervisor, and the network. Our architectural objective is narrower and more practical: reduce unnecessary persistence and minimize retained sensitive state wherever possible.


Compute, Don’t Collect

StatelessLaw is being designed around a non-persistent inference pipeline. Rather than permanently indexing user-provided legal material into long-lived storage systems, contextual data is streamed into isolated processing environments and handled primarily in volatile memory during inference execution.

This approach shifts the emphasis from storage-heavy retrieval to high-throughput transient compute.

1. Memory-Only Processing

We are currently experimenting with in-memory document processing inside isolated inference execution loops. Text is streamed in, tokenized, processed in volatile memory, and released once inference completes.

We are currently evaluating NVIDIA Triton Inference Server and TensorRT-based pipelines because they offer tighter control over execution and are built for low-latency, high-throughput GPU inference, which matters for large-scale reranking workloads. The objective is not to eliminate all possible traces across every system layer, but to reduce long-lived application-level persistence and avoid retaining searchable customer datasets by default.
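The stream, process, and release cycle described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the StatelessLaw implementation: `transient_buffer`, `run_ephemeral`, and `count_tokens` are hypothetical names, and the "inference" step is a whitespace token count standing in for a real model call. The overwrite is best-effort only, since the Python runtime may leave copies elsewhere in process memory.

```python
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator


@contextmanager
def transient_buffer(chunks: Iterable[bytes]) -> Iterator[bytearray]:
    """Collect streamed bytes in a mutable buffer and wipe it on exit."""
    buf = bytearray()
    try:
        for chunk in chunks:
            buf.extend(chunk)
        yield buf
    finally:
        # Best-effort wipe: overwrite the live buffer with zeros, then drop it.
        # Assumption: this does not reach copies the runtime may have made
        # (for example during reallocation), nor OS or hardware layers.
        buf[:] = b"\x00" * len(buf)
        buf.clear()


def count_tokens(buf: bytearray) -> int:
    """Hypothetical stand-in for inference: count whitespace-separated tokens."""
    return len(buf.split())


def run_ephemeral(chunks: Iterable[bytes], infer: Callable[[bytearray], int]) -> int:
    """Run `infer` over streamed chunks; only the result leaves the scope."""
    with transient_buffer(chunks) as buf:
        return infer(buf)


if __name__ == "__main__":
    print(run_ephemeral([b"lease agreement ", b"clause 4"], count_tokens))
```

The design point is that only the derived result crosses the boundary of the `with` block; the source text itself is never written to disk or kept in a long-lived structure by the application.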

2. Accelerated Exhaustive Retrieval

Many retrieval systems rely heavily on approximate nearest-neighbor (ANN) indexing because exhaustive semantic comparison becomes computationally expensive on CPUs. Our research direction explores whether modern GPU parallelism makes high-recall brute-force reranking practical at production latency.

Using GPU-accelerated retrieval components (including NVIDIA cuVS for vector search and cross-encoder rerankers), we are testing architectures where:

  • large candidate pools remain transient,

  • reranking occurs directly in GPU memory,

  • and retrieval quality improves without requiring permanently stored embedding indexes for user-private data.

This effectively trades persistent storage complexity for parallel compute throughput.
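To make the contrast with ANN indexing concrete, here is a minimal sketch of exhaustive (exact) top-k scoring over a transient candidate pool. It is a pure-Python CPU stand-in, not the cuVS API and not our production path: on a GPU the same computation would typically be a batched matrix product followed by a top-k selection. The function names and the toy vectors are assumptions made for illustration.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exhaustive_top_k(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Score every candidate against the query (no index, no approximation)
    and return the k best as (candidate_position, score) pairs."""
    scored = ((i, cosine(query, vec)) for i, vec in enumerate(candidates))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy 2-d embeddings, built in memory and discarded after the call.
    pool = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    print(exhaustive_top_k([1.0, 0.0], pool, k=2))
```

Because every candidate is scored, recall is exact by construction; the cost is linear in the size of the pool, which is the part GPU parallelism is meant to absorb.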

3. Deterministic Legal Context Routing

Legal systems are fundamentally relational: statutes reference precedents, amendments override earlier sections, and regulations interact across jurisdictions. To model these dependencies, we structure public legal sources such as Finlex and EUR-Lex as graph-linked legal context layers.

The graph layer handles deterministic context routing, while accelerated reranking handles relevance scoring across large transient candidate sets. This separation allows:

  • deterministic legal structure,

  • probabilistic semantic matching,

  • and reduced reliance on persistent user-specific indexing.
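The split between deterministic routing and probabilistic scoring can be illustrated with a small sketch. Everything here is hypothetical: the node identifiers are invented and are not real Finlex or EUR-Lex references, `route` is a plain bounded breadth-first traversal, and `rerank` accepts any scoring callable as a stand-in for a cross-encoder.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical citation graph: each node lists the provisions it points to.
LEGAL_GRAPH: Dict[str, List[str]] = {
    "act:A/s1": ["act:A/s2", "reg:EU-1/art5"],
    "act:A/s2": ["amend:A-2020/s2"],
    "reg:EU-1/art5": [],
    "amend:A-2020/s2": [],
    "act:B/s9": [],
}


def route(graph: Dict[str, List[str]], start: str, max_hops: int) -> List[str]:
    """Deterministic step: collect nodes within max_hops of start.
    Neighbors are visited in sorted order so the result is reproducible."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nxt in sorted(graph.get(node, [])):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order


def rerank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    """Probabilistic step: order candidates by descending score,
    breaking ties by identifier so the output stays stable."""
    return sorted(candidates, key=lambda c: (-score(c), c))


if __name__ == "__main__":
    pool = route(LEGAL_GRAPH, "act:A/s1", max_hops=2)
    # Toy scorer standing in for a model: prefer amendments.
    print(rerank(pool, lambda c: 1.0 if c.startswith("amend") else 0.0))
```

The graph decides which provisions are eligible at all, and the scorer only orders that eligible set, so a relevance model cannot pull in a provision the legal structure does not link to.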


Current Status

Our current proof-of-concept validates the orchestration model and transient ingestion pipeline. The initial prototype used:

  • stateless cloud sandbox execution,

  • CPU-based reranking,

  • and external LLM APIs.

While suitable for validating retrieval behavior, this architecture still introduces external trust boundaries and persistence concerns that are not ideal for sensitive legal workloads. Our ongoing work focuses on:

  • migrating heavy reranking workloads onto dedicated GPU infrastructure (NVIDIA L40S/H100),

  • replacing external APIs with locally hosted open-source models,

  • and improving execution isolation within self-hosted inference environments.

The long-term goal is not “perfect secrecy,” but tighter operational control over data lifecycle, retention behavior, execution locality, and forensic exposure.


Toward Reduced-Persistence AI Systems

AI infrastructure does not necessarily require permanent retention to deliver high-quality retrieval and reasoning. For sensitive domains such as legal technology, reducing persistence surfaces may become as important as improving model capability itself.

We believe future AI systems will increasingly differentiate between:

  • state that must persist,

  • and state that should disappear once execution ends.

StatelessLaw is being built around that assumption.
