Step-by-step walkthrough for training a model to search and reason over documents using RL. The example uses patents, but the same architecture applies to any document domain: legal filings, SEC documents, medical literature, enterprise knowledge bases. Stack: verifiers, ChromaDB, BigQuery, Lab, Qwen3-4B, Llama-3.2-3B

Environments Overview

| Level | Name | Reward type | What the agent does |
|---|---|---|---|
| Level 1 | Metadata Retrieval | Exact match | Single tool call, parse structured response |
| Level 2 | Multi-step Computation | Binary on deterministic answers | 2+ tool calls, arithmetic, cross-patent comparison |
| Level 3 | Open-ended Analysis | LLM judge, normalized | Full patent text reading, synthesis, cross-patent reasoning |

Prerequisites

  • Google Cloud account with BigQuery access. The free tier is sufficient: it includes 1 TB/month of query processing. Downloading ~1,500 patents with full text uses ~3 GB.
  • Prime Intellect account with Lab access. You’ll push environments to the Environments Hub and launch training runs from the CLI.
  • OpenAI API key, which is used for generating Level 3 Q&A ground truth and the LLM judge at training time.
  • Python 3.10+ with verifiers, chromadb, openai, and the prime CLI. Install the Prime CLI with pip install prime-cli.

Step 1 - Get Patent Data

Use Google’s patents-public-data BigQuery dataset. It has full patent text (abstract, claims, description) and supports SQL filtering by company.

BigQuery query

-- fetch_patents.sql
-- title/abstract/claims/description are localized repeated fields,
-- so pull the English text out of each.
SELECT
  publication_number,
  (SELECT text FROM UNNEST(title_localized) WHERE language = 'en' LIMIT 1) AS title,
  (SELECT text FROM UNNEST(abstract_localized) WHERE language = 'en' LIMIT 1) AS abstract,
  (SELECT text FROM UNNEST(claims_localized) WHERE language = 'en' LIMIT 1) AS claims,
  (SELECT text FROM UNNEST(description_localized) WHERE language = 'en' LIMIT 1) AS description,
  filing_date,
  grant_date,
  ARRAY_LENGTH(claims_localized) AS claim_count
FROM
  `patents-public-data.patents.publications`
WHERE
  -- assignee_harmonized is a repeated field; UNNEST it to filter by name
  EXISTS (
    SELECT 1 FROM UNNEST(assignee_harmonized) AS a
    WHERE a.name IN (
      'QUALCOMM INCORPORATED',
      'ERICSSON',
      'NOKIA',
      'SAMSUNG',
      'HUAWEI',
      'SONY'
    )
  )
  AND country_code = 'US'
  AND grant_date > 0  -- grant_date is 0 for publications without a grant
LIMIT 1500
Export the results to a GCS bucket or download directly as JSON. You can start with just ~1,500 patents; it produces meaningful training signal without excessive embedding costs.
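One detail to normalize at ingest time: in this dataset, filing_date and grant_date are stored as YYYYMMDD integers (e.g. 20190315) rather than the ISO strings the rest of the pipeline expects. A minimal conversion sketch:

```python
# dates.py — convert the dataset's YYYYMMDD integer dates to ISO strings
def to_iso_date(yyyymmdd: int) -> str:
    """20190315 -> '2019-03-15'"""
    s = str(yyyymmdd)
    return f"{s[:4]}-{s[4:6]}-{s[6:]}"
```

Apply this to both date fields once at ingest, so later date parsing works on consistent strings.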

Format each patent

Structure each patent as a markdown document and store metadata separately. Preserving document structure matters for Level 3 because the agent needs to navigate named sections.
# format_patents.py
def format_patent(row):
    doc = f"""# {row['title']}

## Abstract
{row['abstract']}

## Claims
{row['claims']}

## Description
{row['description']}
"""
    metadata = {
        "patent_id": row["publication_number"],
        "title": row["title"],
        "filing_date": row["filing_date"],
        "grant_date": row["grant_date"],
        "claim_count": row["claim_count"],
    }
    return doc, metadata

Step 2 - Design Three RL Environments

The three levels impose a curriculum. For example, an agent that can’t reliably call get_metadata() and parse a date string shouldn’t attempt multi-patent technical comparisons. Each level introduces qualitatively harder capabilities.

Level 1 - Single-tool metadata retrieval

Questions that require exactly one tool call, with no computation or reasoning:

| Question type | Example |
|---|---|
| Filing date | “When was patent X filed?” |
| Grant date | “When was patent X granted?” |
| Claim count | “How many claims does patent X have?” |
| Title | “What is the title of patent X?” |
Dataset size: 6,000 Q&A pairs. Every answer is deterministically verifiable from the metadata.

Reward: Binary. 1.0 for exact string match, 0.0 otherwise.

Tools available: search_patents(query), get_metadata(patent_id)

Level 1 is a pipeline validation step. If reward doesn’t climb toward 1.0 within 15 steps, something is wrong with your tool definitions, reward function, or data formatting. It isn’t a model problem.

Level 2 - Multi-step computation and comparison

Questions requiring 2 or more tool calls followed by arithmetic or string comparison.
| Question type | Example | What the agent does |
|---|---|---|
| Days to grant | “How many days between filing and grant for patent X?” | get_metadata(), then parse two dates and subtract |
| Filed first | “Which was filed first: X or Y?” | get_metadata() twice, then compare |
| Claim difference | “How many more claims does X have than Y?” | get_metadata() twice, then subtract |
| Search + retrieve | “Find a patent about SSB and tell me its filing date” | search_patents(), then get_metadata() |
| Abstract content | “Does the abstract of patent X mention ‘5G’?” | get_abstract(), then string check |
Dataset size: 500 Q&A pairs.

Reward: Binary on deterministic answers. For search questions with multiple valid answers, the reward function checks against a precomputed set of valid answers.

New tool: get_abstract(patent_id), which returns the abstract text.

The environment does not do the math. The agent receives raw dates and counts from tool calls and must compute the answer itself; this is intentional, to train the reasoning.

Level 3 - Open-ended technical analysis

Questions that require reading patent content, understanding technical concepts, and synthesizing answers that can’t be verified by pure string matching.
| Question type | Example |
|---|---|
| Technical summary | “Summarize the key technical innovation in patent X in 2-3 sentences” |
| Problem identification | “What problem does patent X solve?” |
| Cross-patent comparison | “What is the key technical difference between patents X and Y?” |
| Standards identification | “What wireless communication standards are referenced in patent X?” |
| Claim feature extraction | “What are the key limiting features in claim 1 of patent X?” |
| Abstract-claim mapping | “Which claims are most directly described in the abstract?” |
Dataset size: 500 Q&A pairs (LLM-generated with ground truth).

Reward: LLM judge, normalized.

New tools: view_sections(patent_id), read_section(patent_id, section_name)

Step 3 - Generate Verifiable Q&A Pairs

Levels 1 and 2

Iterate over the patent dataset, select random patents (or pairs for comparison questions), compute the answer directly from the structured data, and emit the Q&A pair. Every answer is verified against the source data. For search-type questions, precompute all valid answers across the entire dataset. For example, if a question asks to identify “SSB” patents and three patents mention “SSB” in their title or abstract, all three filing dates are valid answers. Store this set; the reward function will check against it at training time.
# generate_qa.py
import random
from datetime import datetime

def generate_date_question(patents: list[dict]) -> dict:
    patent = random.choice(patents)
    return {
        "question": f"When was patent {patent['patent_id']} filed?",
        "answer": patent["filing_date"],  # e.g. "2019-03-15"
        "patent_ids": [patent["patent_id"]],
        "level": 1,
        "type": "filing_date",
    }

def generate_days_to_grant(patents: list[dict]) -> dict:
    patent = random.choice([p for p in patents if p["grant_date"]])
    filed = datetime.strptime(patent["filing_date"], "%Y-%m-%d")
    granted = datetime.strptime(patent["grant_date"], "%Y-%m-%d")
    days = (granted - filed).days
    return {
        "question": f"How many days elapsed between the filing and grant of patent {patent['patent_id']}?",
        "answer": str(days),
        "patent_ids": [patent["patent_id"]],
        "level": 2,
        "type": "days_to_grant",
    }
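The precomputed valid-answer set for search-type questions can be sketched like this (the function name is illustrative):

```python
# build_valid_answers.py — illustrative sketch of precomputing valid answers
def valid_filing_dates_for_term(patents: list[dict], term: str) -> set[str]:
    """Filing dates of every patent mentioning `term` in its title or abstract."""
    term = term.lower()
    return {
        p["filing_date"]
        for p in patents
        if term in p["title"].lower() or term in p["abstract"].lower()
    }
```

At training time, the reward function for search questions checks the agent’s answer for membership in this set.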

Level 3 - LLM-generated ground truth

Each Level 3 entry needs a structured reference that the judge can use at training time. Use the LLM to generate three components per question:
  • answer - the reference answer, written in prose
  • key_points - specific factual claims the answer must contain
  • source_quotes - direct quotes from the patent text supporting each key point
# generate_l3_qa.py
import json
from openai import OpenAI

client = OpenAI()

GENERATION_PROMPT = """You are generating ground truth for a patent analysis training dataset.

Patent text:
{patent_text}

Question: {question}

Return ONLY valid JSON with no preamble:
{{
    "answer": "prose answer to the question",
    "key_points": ["specific factual claim 1", "specific factual claim 2"],
    "source_quotes": ["exact quote from patent supporting point 1", "exact quote supporting point 2"]
}}"""

def generate_l3_ground_truth(patent_text: str, question: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model you trust for ground truth
        messages=[{
            "role": "user",
            "content": GENERATION_PROMPT.format(
                patent_text=patent_text,
                question=question
            )
        }],
        temperature=0
    )
    return json.loads(response.choices[0].message.content)

Validate your dataset

Run these checks on every generated pair before training:
  • Every patent ID referenced in a question exists in the dataset
  • Every Level 3 ground truth has at least one key point
  • Source quotes from Level 3 ground truth actually appear in the patent text (substring match)
Manually review a stratified sample across all question types. This surfaces systematic prompt issues that automated checks miss, such as questions conflating problem and solution, rubrics with internal contradictions, or answers with technical inaccuracies.

Step 4 - Reward Design

Levels 1 and 2

Reward design for Levels 1 and 2 is straightforward.
# rewards.py
def reward_exact(agent_answer: str, ground_truth: str) -> float:
    """Binary reward for single-answer questions."""
    return 1.0 if normalize(agent_answer) == normalize(ground_truth) else 0.0

def reward_set(agent_answer: str, valid_set: set[str]) -> float:
    """For search questions with multiple valid answers."""
    return 1.0 if normalize(agent_answer) in valid_set else 0.0

def normalize(s: str) -> str:
    return s.strip().lower().replace(",", "").replace(".", "")

Level 3 - LLM judge

Getting the judge right took three iterations. Here’s what failed and what ended up working.
  • Iteration 1 - didn’t work: Multi-dimensional weighted scoring (accuracy, completeness, reasoning, conciseness; 0-10 each). Regex parsing of scores was fragile, and per-category weight tuning is a second optimization problem on top of the first.
  • Iteration 2 - didn’t work: Per-question custom rubrics (5 criteria x 2 points each, LLM-generated per question). More principled, but the LLM-generated rubrics introduced contradictions, like asking for content that would actually weaken an otherwise correct answer.
  • Iteration 3 - works: A universal rubric with content-specific ground truth. The ground truth (answer, key_points, source_quotes) already provides all the content specificity needed.
| Criterion | Points | What it catches |
|---|---|---|
| Factually accurate relative to patent | 3 | Wrong technical details |
| Free of hallucinated information | 3 | Made-up claims, features, standards |
| Covers key points from ground truth | 2 | Missing important content |
| Directly answers the question asked | 1 | Off-topic or evasive responses |
| References specific patent content | 1 | Unsupported assertions |
The judge receives the question, the agent’s response, the reference answer, key points, and source quotes. It returns structured JSON, normalized to [0, 1]:
// judge_output_schema.json
{
    "key_points": [
        {"point": "SSB spatial overloading concept", "covered": true},
        {"point": "Spatially separated beams", "covered": true},
        {"point": "Reduced bandwidth overhead", "covered": false}
    ],
    "hallucination": false,
    "factual_error": false,
    "final_score": 7
}
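Assembling the judge input can be sketched as follows (the prompt wording here is an assumption for illustration, not the exact prompt used in the run):

```python
# judge_prompt.py — illustrative assembly of the judge input
JUDGE_PROMPT = """You are grading an answer about a patent.

Question: {question}
Agent answer: {agent_answer}
Reference answer: {reference}
Key points the answer should cover: {key_points}
Supporting quotes from the patent: {source_quotes}

Apply the rubric and return ONLY JSON with keys:
key_points (list of {{"point", "covered"}} objects), hallucination (bool),
factual_error (bool), final_score (0-10)."""

def build_judge_prompt(question: str, agent_answer: str, gt: dict) -> str:
    """Fill the template from a Level 3 ground-truth record."""
    return JUDGE_PROMPT.format(
        question=question,
        agent_answer=agent_answer,
        reference=gt["answer"],
        key_points="; ".join(gt["key_points"]),
        source_quotes="; ".join(gt["source_quotes"]),
    )
```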
# rewards.py
def reward_l3(judge_output: dict) -> float:
    score = judge_output["final_score"] / 10.0  # normalize to [0, 1]
    if judge_output.get("hallucination"):
        score *= 0.2  # hallucination is catastrophic in patent context
    elif judge_output.get("factual_error"):
        score *= 0.5
    return score
Note on the 0.2 multiplier: zeroing out hallucinated responses entirely creates a sharp gradient that can cause instability. A 0.2 multiplier still strongly penalizes hallucination while preserving a small gradient signal. In a patent context, hallucinated technical claims have direct commercial and legal consequences, so we err on the side of under-rewarding correct answers rather than ever rewarding hallucinations.
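Plugging numbers in makes the shaping concrete (a standalone copy of reward_l3 so the snippet runs on its own):

```python
# same logic as reward_l3 above, repeated so this snippet is self-contained
def reward_l3(judge_output: dict) -> float:
    score = judge_output["final_score"] / 10.0
    if judge_output.get("hallucination"):
        score *= 0.2  # strong penalty, but not zero
    elif judge_output.get("factual_error"):
        score *= 0.5
    return score

reward_l3({"final_score": 7})                          # 0.7
reward_l3({"final_score": 7, "hallucination": True})   # ~0.14: heavily penalized, not zeroed
reward_l3({"final_score": 7, "factual_error": True})   # ~0.35
```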

Step 5 - Build the Tool Environment

The environment is implemented using the verifiers library as a ToolEnv. The agent receives a system prompt describing its role as a patent analyst with access to a dataset of N patents, a question, and a set of tools.

Tool definitions

# patent_env.py
import chromadb
from verifiers import ToolEnv

class PatentEnv(ToolEnv):
    def __init__(self, patents: list[dict], level: int):
        self.corpus = {p["patent_id"]: p for p in patents}
        self.level = level
        self._init_vector_store(patents)
        super().__init__(tools=self._get_tools(level))

    def _init_vector_store(self, patents):
        self.chroma = chromadb.Client()
        collection = self.chroma.create_collection("patents")
        collection.add(
            ids=[p["patent_id"] for p in patents],
            documents=[f"{p['title']} {p['abstract']}" for p in patents],
        )
        self.collection = collection

    def search_patents(self, query: str) -> list[dict]:
        """Returns patent IDs and titles matching a semantic query."""
        results = self.collection.query(query_texts=[query], n_results=5)
        return [
            {"patent_id": id_, "title": self.corpus[id_]["title"]}
            for id_ in results["ids"][0]
        ]

    def get_metadata(self, patent_id: str) -> dict:
        """Returns title, filing_date, grant_date, claim_count."""
        p = self.corpus[patent_id]
        return {
            "title": p["title"],
            "filing_date": p["filing_date"],
            "grant_date": p["grant_date"],
            "claim_count": p["claim_count"],
        }

    def get_abstract(self, patent_id: str) -> str:
        return self.corpus[patent_id]["abstract"]

    def view_sections(self, patent_id: str) -> list[str]:
        """Lists available section names in the full patent text."""
        doc = self.corpus[patent_id]["full_text"]
        return [
            line.lstrip("# ").strip()
            for line in doc.split("\n")
            if line.startswith("## ")
        ]

    def read_section(self, patent_id: str, section_name: str) -> str:
        """Returns the full text of a named section."""
        doc = self.corpus[patent_id]["full_text"]
        start = doc.find(f"## {section_name}")
        if start == -1:
            return f"Section '{section_name}' not found."
        end = doc.find("\n## ", start + 1)
        return doc[start:end if end != -1 else len(doc)]

    def _get_tools(self, level: int):
        tools = [self.search_patents, self.get_metadata]
        if level >= 2:
            tools.append(self.get_abstract)
        if level >= 3:
            tools += [self.view_sections, self.read_section]
        return tools
Keep tools stateless. The agent can call get_metadata() on the same patent multiple times with no side effects.
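On a toy document, the section-navigation tools behave like this (standalone copies of the two methods above, for illustration):

```python
# toy demo of view_sections / read_section on a Step 1-style document
DOC = """# Example Patent

## Abstract
A method for beam management.

## Claims
1. A method comprising...
"""

def view_sections(doc: str) -> list[str]:
    """List the '## '-level section names."""
    return [
        line.lstrip("# ").strip()
        for line in doc.split("\n")
        if line.startswith("## ")
    ]

def read_section(doc: str, section_name: str) -> str:
    """Return one section's text via string slicing."""
    start = doc.find(f"## {section_name}")
    if start == -1:
        return f"Section '{section_name}' not found."
    end = doc.find("\n## ", start + 1)
    return doc[start:end if end != -1 else len(doc)]
```

Here view_sections(DOC) yields ["Abstract", "Claims"], and read_section(DOC, "Abstract") returns just that section's text.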

Schema to tools generalization

Patents have standardized sections (Abstract, Claims, Description) and coded metadata, which makes them well-suited to tool-based environments. The same structure applies to any document domain with a known schema: legal filings map to get_case_metadata(), SEC filings to read_section("Risk Factors"), and so on.

Step 6 - Parsing Strategy

Avoid regex in the pipeline: fragile patterns fail silently on edge cases, and a reward function that mis-parses answers is an opening for reward hacking. Use simple string methods and structured JSON instead:
# parsing.py
import json

def extract_claims_text(full_text: str) -> str:
    """Extract the Claims section using string slicing, not regex."""
    start = full_text.find("## Claims")
    if start == -1:
        return ""
    end = full_text.find("\n## ", start + 1)
    return full_text[start:end if end != -1 else len(full_text)]

def parse_patent_id(full_code: str) -> str:
    """Normalize 'US-10123456-B2' -> 'US10123456B2' deterministically."""
    return full_code.replace("-", "")

def parse_judge_output(raw: str) -> dict:
    """Parse judge JSON with a robust fallback."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: find final_score by character iteration
        key = '"final_score":'
        idx = raw.find(key)
        if idx == -1:
            return {"final_score": 0, "hallucination": True, "key_points": []}
        score_start = idx + len(key)
        digits = ""
        for ch in raw[score_start:].lstrip():
            if ch.isdigit():
                digits += ch
            elif digits:
                break
        return {
            "final_score": int(digits) if digits else 0,
            "hallucination": "hallucin" in raw.lower(),
            "key_points": [],
        }

Step 7 - Train on Lab

Prime Intellect’s Lab platform handles GPU orchestration and multi-tenant LoRA deployments. You push your environment to the Environments Hub, define a training config, and launch the run from the CLI.

Step 7.1 - Push environment to the Hub

# Push each level as a separate named environment
prime env push --name basic-patent-q-and-a
prime env push --name advanced-patent-q-and-a
prime env push --name patent-technical-analysis

Step 7.2 - Define a training config

# config.toml
model = "meta-llama/Llama-3.2-3B-Instruct"
max_steps = 100
batch_size = 128
rollouts_per_example = 8

[sampling]
max_tokens = 1024

[[env]]
id = "primeintellect/advanced-patent-q-and-a"

Step 7.3 - Calibrate model to environment difficulty

Before committing to a full training run, verify the base model finds the task challenging but not impossible. A model that starts too high has nothing to learn; one that starts too low can’t generate useful gradient signal. Target a starting reward of roughly 0.15 to 0.35. Do a short 10-step run first and read the reward curve, which plots mean reward per step across all rollouts in the batch. If starting reward is outside that range, adjust difficulty or model size before launching the full run.
# Quick calibration: 10 steps to measure starting reward
prime rl run config.toml \
  --env-var OPENAI_API_KEY=sk-... \
  --override max_steps=10
If starting reward is above 0.7, make the questions harder or use a smaller base model. If it’s below 0.1, the questions may be too hard or the tools too opaque.

Step 7.4 - Launch the full run

prime rl run config.toml --env-var OPENAI_API_KEY=sk-...
You can pass secrets at runtime with --env-var, or set them in the environment settings; both work. Once the run is live, the Lab dashboard shows per-step metrics including mean reward, reward standard deviation, response length, and tool call count. Tool call count is worth checking early: if the agent isn’t calling tools in the first few steps, the system prompt isn’t clear enough about what tools are available and when to use them.

[Screenshots: reward and metrics dashboard; rollout viewer]

Config parameters to tune

| Parameter | What it controls | Starting point |
|---|---|---|
| max_steps | Training duration | 50 for L1, 100 for L2/L3 |
| batch_size | Examples per step | 128 |
| rollouts_per_example | Trajectories sampled per question | 8; increase for noisy L3 rewards |
| max_tokens | Max agent response length | 1024; L3 may need more |

Results

We trained Qwen3-4B-Instruct and Llama-3.2-3B-Instruct across all three levels. The rollout viewer in Lab lets you inspect individual trajectories turn by turn: each tool call, its response, and the final answer. It’s the most direct way to see what the model is actually learning to do at each level and to understand why the reward curve behaves the way it does.

Level 1

| Checkpoint | Reward |
|---|---|
| Start | ~0.05 |
| Step 15 | ~1.0 |
[Figure: Level 1 reward curve]

Reward saturates within 15 steps and stays there, confirming the pipeline works end-to-end. Once the agent learns the pattern of calling get_metadata() and reading the response, it gets nearly everything right.

Level 2

| Checkpoint | Reward |
|---|---|
| Start | ~0.25 |
| Step 50 | ~0.70 |
[Figure: Level 2 reward curve]

Clear upward trend with variance throughout. Some batches land on straightforward date subtraction; others require chaining a search into a metadata call and then doing a comparison. The curve hasn’t plateaued at step 50, so longer runs should yield further gains.

Level 3

| Checkpoint | Reward |
|---|---|
| Start | ~0.30 |
| Step 100 | ~0.50 |
[Figure: Level 3 reward curve]

Noisy, as expected from LLM-judged reward. Individual batches swing between 0.15 and 0.9. The upward trend is real but needs larger batches and more steps to converge. The noise has two sources: judge subjectivity, and variation in question difficulty (a technical summary and a cross-patent comparison are not the same task).

Extending This to Other Domains

The core architecture (schema-derived tools, progressive difficulty levels, a universal rubric with content-specific ground truth) is not patent-specific. To adapt it to another domain:
  • Legal case search: Replace read_section("Claims") with read_section("Holding"). The judgment or ruling is your Level 3 answer target.
  • SEC filings: 10-K documents have standardized sections (Risk Factors, MD&A, Financial Statements).
  • Medical literature: PubMed abstracts for Level 1, full-text PMC articles for Level 3. Use MeSH terms for structured metadata.
  • Enterprise knowledge bases: Internal docs with known schemas. Level 3 judge needs domain-appropriate ground truth generation.
The main work in each new domain is: (1) acquiring the data with full text, (2) defining the question types that capture the actual analytic tasks, and (3) writing the Level 3 ground truth generation prompt with domain-specific few-shot examples.
All environments are published on the Environments Hub. The patent dataset is on HuggingFace.