Step-by-step walkthrough for training a model to search and reason over documents using RL. The example uses patents, but the same architecture applies to any document domain: legal filings, SEC documents, medical literature, enterprise knowledge bases.
Stack: verifiers, ChromaDB, BigQuery, Lab, Qwen3-4B, Llama-3.2-3B
Environments Overview
| Level | Name | Reward type | What the agent does |
|---|---|---|---|
| Level 1 | Metadata Retrieval | Exact match | Single tool call, parse structured response |
| Level 2 | Multi-step Computation | Binary on deterministic answers | 2+ tool calls, arithmetic, cross-patent comparison |
| Level 3 | Open-ended Analysis | LLM judge, normalized | Full patent text reading, synthesis, cross-patent reasoning |
Prerequisites
- Google Cloud account with BigQuery access. The free tier is sufficient: it includes 1 TB of query processing per month, and downloading ~1,500 patents with full text uses ~3 GB.
- Prime Intellect account with Lab access. You’ll push environments to the Environments Hub and launch training runs from the CLI.
- OpenAI API key, which is used for generating Level 3 Q&A ground truth and the LLM judge at training time.
- Python 3.10+ with verifiers, chromadb, openai, and the prime CLI. Install the Prime CLI with pip install prime-cli.
Step 1 - Get Patent Data
Use Google’s patents-public-data BigQuery dataset. It has full patent text (abstract, claims, description) and supports SQL filtering by company.
BigQuery query
-- fetch_patents.sql
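SELECT
  publication_number,
  -- The *_localized columns are arrays of (text, language) records; take the English text.
  (SELECT text FROM UNNEST(title_localized) WHERE language = 'en' LIMIT 1) AS title,
  (SELECT text FROM UNNEST(abstract_localized) WHERE language = 'en' LIMIT 1) AS abstract,
  (SELECT text FROM UNNEST(claims_localized) WHERE language = 'en' LIMIT 1) AS claims,
  (SELECT text FROM UNNEST(description_localized) WHERE language = 'en' LIMIT 1) AS description,
  filing_date,
  grant_date,
  ARRAY_LENGTH(claims_localized) AS claim_count
FROM
  `patents-public-data.patents.publications`
WHERE
  -- assignee_harmonized is a repeated record, so match against its unnested names.
  EXISTS (
    SELECT 1
    FROM UNNEST(assignee_harmonized) AS a
    WHERE a.name IN (
      'QUALCOMM INCORPORATED',
      'ERICSSON',
      'NOKIA',
      'SAMSUNG',
      'HUAWEI',
      'SONY'
    )
  )
  AND country_code = 'US'
  AND grant_date IS NOT NULL
LIMIT 1500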
Export the results to a GCS bucket or download them directly as JSON. Starting with ~1,500 patents is enough to produce a meaningful training signal without excessive embedding costs.
Structure each patent as a markdown document and store metadata separately. Preserving document structure matters for Level 3 because the agent needs to navigate named sections.
# format_patents.py
def format_patent(row):
    doc = f"""# {row['title']}
## Abstract
{row['abstract']}
## Claims
{row['claims']}
## Description
{row['description']}
"""
    metadata = {
        "patent_id": row["publication_number"],
        "title": row["title"],
        "filing_date": row["filing_date"],
        "grant_date": row["grant_date"],
        "claim_count": row["claim_count"],
    }
    return doc, metadata
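The Q&A generators later in this guide parse dates as YYYY-MM-DD strings. Depending on how you export from BigQuery, date columns may instead arrive as YYYYMMDD integers, with 0 standing in for a missing date. The sketch below assumes that case and normalizes values before they are stored as metadata. The helper name `to_iso_date` is illustrative and not part of the original pipeline.

```python
from datetime import datetime
from typing import Optional, Union


def to_iso_date(value: Union[int, str, None]) -> Optional[str]:
    """Convert a YYYYMMDD value (int or str) to 'YYYY-MM-DD'.

    Assumption: missing dates are exported as 0, "0", None, or an empty string.
    Values that are already ISO-formatted are validated and passed through.
    """
    if value in (0, None, "", "0"):
        return None
    text = str(value)
    if len(text) == 10 and text[4] == "-" and text[7] == "-":
        datetime.strptime(text, "%Y-%m-%d")  # raises ValueError if malformed
        return text
    return datetime.strptime(text, "%Y%m%d").strftime("%Y-%m-%d")
```

Calling it on `row["filing_date"]` and `row["grant_date"]` inside `format_patent` would keep the metadata in one consistent format regardless of the export path.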
Step 2 - Design Three RL Environments
The three levels impose a curriculum. For example, an agent that can’t reliably call get_metadata() and parse a date string shouldn’t attempt multi-patent technical comparisons. Each level introduces qualitatively harder capabilities.
Level 1 - Metadata retrieval
Questions that require exactly one tool call, with no computation or reasoning.
| Question type | Example |
|---|---|
| Filing date | “When was patent X filed?” |
| Grant date | “When was patent X granted?” |
| Claim count | “How many claims does patent X have?” |
| Title | “What is the title of patent X?” |
Dataset size: 6,000 Q&A pairs. Every answer is deterministically verifiable from the metadata.
Reward: Binary - 1.0 for exact string match, 0.0 otherwise.
Tools available: search_patents(query), get_metadata(patent_id)
Level 1 is a pipeline validation step. If reward doesn’t climb toward 1.0 within 15 steps, something is wrong with your tool definitions, reward function, or data formatting. It isn’t a model problem.
Level 2 - Multi-step computation and comparison
Questions requiring 2 or more tool calls followed by arithmetic or string comparison.
| Question type | Example | What the agent does |
|---|---|---|
| Days to grant | “How many days between filing and grant for patent X?” | get_metadata() then parse two dates and subtract |
| Filed first | “Which was filed first: X or Y?” | get_metadata() twice, then compare |
| Claim difference | “How many more claims does X have than Y?” | get_metadata() twice, then subtract |
| Search + retrieve | “Find a patent about SSB and tell me its filing date” | search_patents() then get_metadata() |
| Abstract content | “Does the abstract of patent X mention ‘5G’?” | get_abstract() then string check |
Dataset size: 500 Q&A pairs.
Reward: Binary on deterministic answers. For search questions with multiple valid answers, the reward function checks against a precomputed set of valid answers.
New tool: get_abstract(patent_id) returns abstract text.
The environment does not do the math. The agent receives raw dates and counts from tool calls and must compute the answer itself. This is intentional: the computation is the reasoning skill being trained.
Level 3 - Open-ended technical analysis
Questions that require reading patent content, understanding technical concepts, and synthesizing answers that can’t be verified by pure string matching.
| Question type | Example |
|---|---|
| Technical summary | “Summarize the key technical innovation in patent X in 2-3 sentences” |
| Problem identification | “What problem does patent X solve?” |
| Cross-patent comparison | “What is the key technical difference between patent X and Y?” |
| Standards identification | “What wireless communication standards are referenced in patent X?” |
| Claim feature extraction | “What are the key limiting features in claim 1 of patent X?” |
| Abstract-claim mapping | “Which claims are most directly described in the abstract?” |
Dataset size: 500 Q&A pairs (LLM-generated with ground truth).
Reward: LLM judge, normalized.
New tools: view_sections(patent_id), read_section(patent_id, section_name)
Step 3 - Generate Verifiable Q&A Pairs
Levels 1 and 2
Iterate over the patent dataset, select random patents (or pairs for comparison questions), compute the answer directly from structured data, and generate the pair. Every answer is verified against the source data.
For search-type questions, precompute all valid answers across the entire dataset. For example, if a question asks for the filing date of an “SSB” patent and three patents mention “SSB” in their title or abstract, all three filing dates are valid answers. Store this set so the reward function can check against it at training time.
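One way the precomputation could look is sketched below. It assumes each patent is a dict with `title`, `abstract`, and the answer field, and that the search term is a single word. The function names are hypothetical. Matching is done on whole word tokens rather than raw substrings, so that “SSB” does not accidentally match a word like “passband”.

```python
def _tokens(text: str) -> set[str]:
    """Lowercase word tokens, split on anything that is not a letter or digit."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def precompute_valid_answers(
    patents: list[dict], term: str, field: str = "filing_date"
) -> set[str]:
    """Collect every acceptable answer for 'find a patent about <term>, report <field>'."""
    needle = term.lower()
    return {
        str(p[field])
        for p in patents
        if needle in _tokens(p["title"]) or needle in _tokens(p["abstract"])
    }


# Illustrative data, not taken from the real dataset.
patents = [
    {"title": "SSB spatial overloading", "abstract": "Beam management.", "filing_date": "2019-03-15"},
    {"title": "Beam sweeping", "abstract": "Transmits an SSB per beam.", "filing_date": "2020-01-02"},
    {"title": "Passband filter", "abstract": "Analog filtering.", "filing_date": "2018-07-09"},
]
valid = precompute_valid_answers(patents, "SSB")
# valid now holds the filing dates of the first two patents only
```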
# generate_qa.py
import random
from datetime import datetime

def generate_date_question(patents: list[dict]) -> dict:
    patent = random.choice(patents)
    return {
        "question": f"When was patent {patent['patent_id']} filed?",
        "answer": patent["filing_date"],  # e.g. "2019-03-15"
        "patent_ids": [patent["patent_id"]],
        "level": 1,
        "type": "filing_date",
    }

def generate_days_to_grant(patents: list[dict]) -> dict:
    patent = random.choice([p for p in patents if p["grant_date"]])
    filed = datetime.strptime(patent["filing_date"], "%Y-%m-%d")
    granted = datetime.strptime(patent["grant_date"], "%Y-%m-%d")
    days = (granted - filed).days
    return {
        "question": f"How many days elapsed between the filing and grant of patent {patent['patent_id']}?",
        "answer": str(days),
        "patent_ids": [patent["patent_id"]],
        "level": 2,
        "type": "days_to_grant",
    }
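Comparison questions draw a pair of patents instead of a single one. A minimal sketch for the “filed first” type follows, assuming the same dict layout as the generators above and ISO-formatted filing dates. The function name and the choice to return `None` on ties are assumptions, not part of the original code.

```python
import random
from datetime import datetime
from typing import Optional


def generate_filed_first(
    patents: list[dict], rng: Optional[random.Random] = None
) -> Optional[dict]:
    """Build a 'which was filed first' question from two distinct patents.

    Returns None when both share a filing date, since no single answer is correct.
    """
    rng = rng or random.Random()
    a, b = rng.sample(patents, 2)
    filed_a = datetime.strptime(a["filing_date"], "%Y-%m-%d")
    filed_b = datetime.strptime(b["filing_date"], "%Y-%m-%d")
    if filed_a == filed_b:
        return None  # ambiguous pair; the caller should draw again
    first = a if filed_a < filed_b else b
    return {
        "question": f"Which was filed first: {a['patent_id']} or {b['patent_id']}?",
        "answer": first["patent_id"],
        "patent_ids": [a["patent_id"], b["patent_id"]],
        "level": 2,
        "type": "filed_first",
    }
```

Dropping tied pairs keeps every stored answer deterministically verifiable, which is the property the binary reward depends on.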
Level 3 - LLM-generated ground truth
Each Level 3 entry needs a structured reference that the judge can use at training time. Use an LLM to generate three components per question:
- answer - the reference answer in writing
- key_points - specific factual claims the answer must contain
- source_quotes - direct quotes from the patent text supporting each key point
# generate_l3_qa.py
import json
from openai import OpenAI

client = OpenAI()

GENERATION_PROMPT = """You are generating ground truth for a patent analysis training dataset.
Patent text:
{patent_text}
Question: {question}
Return ONLY valid JSON with no preamble:
{{
  "answer": "prose answer to the question",
  "key_points": ["specific factual claim 1", "specific factual claim 2"],
  "source_quotes": ["exact quote from patent supporting point 1", "exact quote supporting point 2"]
}}"""

def generate_l3_ground_truth(patent_text: str, question: str) -> dict:
    response = client.chat.completions.create(
        model="model",  # placeholder: substitute the model name you use
        messages=[{
            "role": "user",
            "content": GENERATION_PROMPT.format(
                patent_text=patent_text,
                question=question
            )
        }],
        temperature=0
    )
    # Raises json.JSONDecodeError if the model returns anything other than JSON.
    return json.loads(response.choices[0].message.content)
Validate your dataset
Run these checks on every generated pair before training:
- Every patent ID referenced in a question exists in the dataset
- Every Level 3 ground truth has at least one key point
- Source quotes from Level 3 ground truth actually appear in the patent text (substring match)
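The three checks above can be sketched as a single function that returns a list of problems per pair. This assumes each pair carries `patent_ids` and `level`, that Level 3 pairs also carry `key_points` and `source_quotes`, and that the corpus maps each patent ID to a dict with a `full_text` field. Those field names are assumptions chosen to match the rest of this guide.

```python
def validate_pair(pair: dict, corpus: dict[str, dict]) -> list[str]:
    """Return the problems found in one Q&A pair (an empty list means it passed)."""
    problems = []

    # Check 1: every referenced patent ID exists in the dataset.
    for pid in pair["patent_ids"]:
        if pid not in corpus:
            problems.append(f"unknown patent id: {pid}")

    if pair["level"] == 3:
        # Check 2: at least one key point.
        if not pair.get("key_points"):
            problems.append("no key points")

        # Check 3: every source quote is a literal substring of a referenced patent.
        texts = [corpus[pid]["full_text"] for pid in pair["patent_ids"] if pid in corpus]
        for quote in pair.get("source_quotes", []):
            if not any(quote in text for text in texts):
                problems.append(f"quote not found in patent text: {quote[:40]!r}")

    return problems
```

Exact substring matching is strict: a quote that the LLM lightly paraphrased or re-wrapped will fail, which is usually the desired behavior for ground truth.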
Manually review a stratified sample across all question types. This surfaces systematic prompt issues that automated checks miss, such as questions that conflate problem and solution, rubrics with contradictions, or answers with technical inaccuracies.
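Drawing the review sample can be as simple as grouping pairs by level and question type and taking a fixed number from each group. The sketch below assumes the `level` and `type` fields used by the generators; the function name and the fixed seed are illustrative.

```python
import random
from collections import defaultdict


def stratified_sample(pairs: list[dict], per_type: int, seed: int = 0) -> list[dict]:
    """Pick up to `per_type` pairs from every (level, type) group."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair["level"], pair["type"])].append(pair)

    sample = []
    for key in sorted(groups):
        group = groups[key]
        sample.extend(rng.sample(group, min(per_type, len(group))))
    return sample
```

Sampling per group rather than uniformly keeps rare question types from being crowded out of the review by the 6,000 Level 1 pairs.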
Step 4 - Reward Design
Levels 1 and 2
The Level 1 and 2 rewards are straightforward: an exact match against a single answer, or membership in a precomputed set of valid answers.
# rewards.py
def reward_exact(agent_answer: str, ground_truth: str) -> float:
    """Binary reward for single-answer questions."""
    return 1.0 if normalize(agent_answer) == normalize(ground_truth) else 0.0

def reward_set(agent_answer: str, valid_set: set[str]) -> float:
    """For search questions with multiple valid answers."""
    # Normalize both sides so stored answers compare on equal footing.
    return 1.0 if normalize(agent_answer) in {normalize(v) for v in valid_set} else 0.0

def normalize(s: str) -> str:
    return s.strip().lower().replace(",", "").replace(".", "")
Level 3 - LLM judge
Getting the judge right required three iterations. Here’s what failed and what ended up working.
Iteration 1 - didn’t work: Multi-dimensional weighted scoring (accuracy, completeness, reasoning, conciseness, 0-10 each). Failed because regex parsing of scores was fragile, and per-category weight tuning is a second optimization problem on top of the first.
Iteration 2 - didn’t work: Per-question custom rubrics (5 criteria x 2 points each, LLM-generated per question). More principled, but LLM-generated rubrics introduced contradictions like asking for content that would actually weaken an otherwise correct answer.
Iteration 3 - works: Universal rubric with content-specific ground truth. The ground truth (answer, key_points, source_quotes) already provides all content specificity needed.
| Criterion | Points | What it catches |
|---|---|---|
| Factually accurate relative to patent | 3 | Wrong technical details |
| Free of hallucinated information | 3 | Made-up claims, features, standards |
| Covers key points from ground truth | 2 | Missing important content |
| Directly answers the question asked | 1 | Off-topic or evasive responses |
| References specific patent content | 1 | Unsupported assertions |
The judge receives the question, the agent’s response, the reference answer, key points, and source quotes. It returns structured JSON with a final_score out of 10, which the reward function normalizes to [0, 1]:
// judge_output_schema.json
{
  "key_points": [
    {"point": "SSB spatial overloading concept", "covered": true},
    {"point": "Spatially separated beams", "covered": true},
    {"point": "Reduced bandwidth overhead", "covered": false}
  ],
  "hallucination": false,
  "factual_error": false,
  "final_score": 7
}
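The exact judge prompt is not reproduced in this guide. As a rough sketch of how the universal rubric and the per-question ground truth could be combined into one prompt, consider the following. The wording, the `RUBRIC` constant, and `build_judge_prompt` are all illustrative assumptions; only the criteria and point values come from the rubric table.

```python
import json

# Criteria and points copied from the rubric table; they sum to 10.
RUBRIC = [
    ("Factually accurate relative to patent", 3),
    ("Free of hallucinated information", 3),
    ("Covers key points from ground truth", 2),
    ("Directly answers the question asked", 1),
    ("References specific patent content", 1),
]


def build_judge_prompt(question: str, response: str, ground_truth: dict) -> str:
    """Assemble a judge prompt from the fixed rubric and one ground-truth entry."""
    rubric_lines = "\n".join(f"- {name} ({points} pts)" for name, points in RUBRIC)
    max_score = sum(points for _, points in RUBRIC)
    return (
        "You are grading an answer to a patent analysis question.\n\n"
        f"Question: {question}\n\n"
        f"Agent response: {response}\n\n"
        f"Reference answer: {ground_truth['answer']}\n\n"
        f"Key points: {json.dumps(ground_truth['key_points'])}\n\n"
        f"Source quotes: {json.dumps(ground_truth['source_quotes'])}\n\n"
        f"Rubric (total {max_score} points):\n{rubric_lines}\n\n"
        "Return ONLY valid JSON with the fields key_points, hallucination, "
        "factual_error, and final_score."
    )
```

Because the rubric is fixed, only the ground-truth fields change between questions, which is what removes the per-question rubric generation that failed in iteration 2.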
# rewards.py
def reward_l3(judge_output: dict) -> float:
    score = judge_output["final_score"] / 10.0  # normalize to [0, 1]
    if judge_output.get("hallucination"):
        score *= 0.2  # hallucination is catastrophic in patent context
    elif judge_output.get("factual_error"):
        score *= 0.5
    return score
Note on the 0.2 multiplier: zeroing out hallucinated responses entirely creates a sharp reward cliff that can destabilize training. A 0.2 multiplier still penalizes hallucination heavily while leaving a small gradient signal. In a patent context, hallucinated technical claims have direct commercial and legal consequences, so we would rather under-reward a borderline response than positively reinforce a hallucination.
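To make the penalty concrete, here is the same arithmetic applied to the example judge score of 7 under each flag. The function restates the logic of `reward_l3` with explicit arguments so the block runs on its own; the name `penalized_reward` is illustrative.

```python
def penalized_reward(
    final_score: int, hallucination: bool = False, factual_error: bool = False
) -> float:
    """Same arithmetic as reward_l3, restated so this block is self-contained."""
    score = final_score / 10.0
    if hallucination:
        score *= 0.2
    elif factual_error:
        score *= 0.5
    return score


for label, flags in [
    ("clean", {}),
    ("factual error", {"factual_error": True}),
    ("hallucination", {"hallucination": True}),
]:
    print(f"{label}: {penalized_reward(7, **flags):.2f}")
# clean: 0.70
# factual error: 0.35
# hallucination: 0.14
```

Because of the `elif`, a response flagged for both hallucination and factual error receives only the 0.2 multiplier; the penalties do not stack.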
Step 5 - Implement the Environment
The environment is implemented using the verifiers library as a ToolEnv. The agent receives a system prompt describing its role as a patent analyst with access to a dataset of N patents, a question, and a set of tools.
# patent_env.py
import chromadb
from verifiers import ToolEnv

class PatentEnv(ToolEnv):
    def __init__(self, patents: list[dict], level: int):
        self.corpus = {p["patent_id"]: p for p in patents}
        self.level = level
        self._init_vector_store(patents)
        super().__init__(tools=self._get_tools(level))

    def _init_vector_store(self, patents):
        self.chroma = chromadb.Client()
        collection = self.chroma.create_collection("patents")
        collection.add(
            ids=[p["patent_id"] for p in patents],
            documents=[f"{p['title']} {p['abstract']}" for p in patents],
        )
        self.collection = collection

    def search_patents(self, query: str) -> list[dict]:
        """Returns patent IDs and titles matching a semantic query."""
        results = self.collection.query(query_texts=[query], n_results=5)
        return [
            {"patent_id": id_, "title": self.corpus[id_]["title"]}
            for id_ in results["ids"][0]
        ]

    def get_metadata(self, patent_id: str) -> dict:
        """Returns title, filing_date, grant_date, claim_count."""
        p = self.corpus[patent_id]
        return {
            "title": p["title"],
            "filing_date": p["filing_date"],
            "grant_date": p["grant_date"],
            "claim_count": p["claim_count"],
        }

    def get_abstract(self, patent_id: str) -> str:
        return self.corpus[patent_id]["abstract"]

    def view_sections(self, patent_id: str) -> list[str]:
        """Lists available section names in the full patent text."""
        doc = self.corpus[patent_id]["full_text"]
        return [
            line.lstrip("# ").strip()
            for line in doc.split("\n")
            if line.startswith("## ")
        ]

    def read_section(self, patent_id: str, section_name: str) -> str:
        """Returns the full text of a named section."""
        doc = self.corpus[patent_id]["full_text"]
        start = doc.find(f"## {section_name}")
        if start == -1:
            return f"Section '{section_name}' not found."
        end = doc.find("\n## ", start + 1)
        return doc[start:end if end != -1 else len(doc)]

    def _get_tools(self, level: int):
        tools = [self.search_patents, self.get_metadata]
        if level >= 2:
            tools.append(self.get_abstract)
        if level >= 3:
            tools += [self.view_sections, self.read_section]
        return tools
Keep tools stateless. The agent can call get_metadata() on the same patent multiple times with no side effects.
Patents have standardized sections (Abstract, Claims, Description) and coded metadata, which makes them well-suited for tool-based environments. The same structure applies to any document domain with a known schema: legal filings map to get_case_metadata(), SEC filings to read_section("Risk Factors"), and so on.
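As a sketch of how the section tools carry over to another domain, the functions below work on any markdown-style document that uses `## ` headings, shown here on a made-up 10-K excerpt. They are standalone helpers written for illustration, not methods of the environment class, and they match section names exactly rather than by prefix.

```python
def list_sections(doc: str) -> list[str]:
    """Names of all '## ' sections, in document order."""
    return [line[3:].strip() for line in doc.split("\n") if line.startswith("## ")]


def read_section(doc: str, section_name: str) -> str:
    """Body text of the section whose heading matches `section_name` exactly."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in doc.split("\n"):
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    if section_name not in sections:
        return f"Section '{section_name}' not found."
    return "\n".join(sections[section_name]).strip()


# Illustrative document, not real filing text.
filing = (
    "# Example Corp 10-K\n"
    "## Risk Factors\n"
    "Supply chain concentration.\n"
    "## MD&A\n"
    "Revenue grew.\n"
)
sections = list_sections(filing)
risk_text = read_section(filing, "Risk Factors")
```

Exact-name matching means a request for “Risk” does not silently return “Risk Factors”; the agent gets a clear not-found message and can call the listing tool to recover.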
Step 6 - Parsing Strategy
Avoid regex in the pipeline: a poorly written pattern can match unintended text and open the door to reward hacking. Use simple string methods and structured JSON instead:
# parsing.py
import json

def extract_claims_text(full_text: str) -> str:
    """Extract the Claims section using string slicing, not regex."""
    start = full_text.find("## Claims")
    if start == -1:
        return ""
    end = full_text.find("\n## ", start + 1)
    return full_text[start:end if end != -1 else len(full_text)]

def parse_patent_id(full_code: str) -> str:
    """Return the text before the first hyphen; IDs without one (e.g. 'US10123456B2') pass through unchanged."""
    return full_code.split("-")[0] if "-" in full_code else full_code

def parse_judge_output(raw: str) -> dict:
    """Parse judge JSON with a robust fallback."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: find final_score by character iteration
        key = '"final_score":'
        idx = raw.find(key)
        if idx == -1:
            # No score at all: treat the output as worthless.
            return {"final_score": 0, "hallucination": True, "key_points": []}
        score_start = idx + len(key)
        digits = ""
        for ch in raw[score_start:].lstrip():
            if ch.isdigit():
                digits += ch
            elif digits:
                break
        return {
            "final_score": int(digits) if digits else 0,
            # Note: this matches the field name itself, so any malformed output
            # that contains a hallucination field is flagged.
            "hallucination": "hallucin" in raw.lower(),
            "key_points": [],
        }
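Even when the judge output parses as JSON, individual fields can come back with the wrong type or an out-of-range score. A small sanitizing pass before computing the reward is one way to guard against that. The sketch below is an addition to the original pipeline: the function name is hypothetical, and the 0 to 10 bounds follow the rubric total.

```python
def sanitize_judge_output(parsed: dict) -> dict:
    """Coerce judge fields to expected types and clamp the score to [0, 10]."""
    try:
        score = float(parsed.get("final_score", 0))
    except (TypeError, ValueError):
        score = 0.0
    if score != score:  # NaN is the only value not equal to itself
        score = 0.0
    score = max(0.0, min(10.0, score))

    key_points = parsed.get("key_points")
    return {
        "final_score": score,
        # Only a literal JSON true counts; strings such as "true" do not.
        "hallucination": parsed.get("hallucination") is True,
        "factual_error": parsed.get("factual_error") is True,
        "key_points": key_points if isinstance(key_points, list) else [],
    }
```

Clamping matters because an unbounded score is an easy reward-hacking target: a judge that returns 42 would otherwise produce a reward of 4.2.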
Step 7 - Train on Lab
Prime Intellect’s Lab platform handles GPU orchestration and multi-tenant LoRA deployments. You push your environment to the Environments Hub, define a training config, and launch the run from the CLI.
Step 7.1 - Push environment to the Hub
# Push each level as a separate named environment
prime env push --name basic-patent-q-and-a
prime env push --name advanced-patent-q-and-a
prime env push --name patent-technical-analysis
Step 7.2 - Define a training config
# config.toml
model = "meta-llama/Llama-3.2-3B-Instruct"
max_steps = 100
batch_size = 128
rollouts_per_example = 8
[sampling]
max_tokens = 1024
[[env]]
id = "primeintellect/advanced-patent-q-and-a"
Step 7.3 - Calibrate model to environment difficulty
Before committing to a full training run, verify the base model finds the task challenging but not impossible. A model that starts too high has nothing to learn; one that starts too low can’t generate useful gradient signal. Target starting reward is around 0.15 to 0.35.
Do a short 10-step run first and read the reward curve. It plots mean reward per step across all rollouts in the batch. If starting reward is outside that range, adjust difficulty or model size before launching the full run.
# Quick calibration: 10 steps to measure starting reward
prime train run config.toml \
--env-var OPENAI_API_KEY=sk-... \
--override max_steps=10
If starting reward is above 0.7, make the questions harder or use a smaller base model. If it’s below 0.1, the questions may be too hard or the tools too opaque.
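The calibration thresholds above can be written down as a small helper for reading the first step of a short run. The function name and the returned messages are illustrative; the numeric bands are the ones stated in this step.

```python
def calibration_advice(starting_reward: float) -> str:
    """Map the mean reward at step 0 to the guidance from this step."""
    if starting_reward > 0.7:
        return "too easy: make the questions harder or use a smaller base model"
    if starting_reward < 0.1:
        return "too hard: the questions may be too hard or the tools too opaque"
    if 0.15 <= starting_reward <= 0.35:
        return "in the target band: launch the full run"
    return "outside the target band: consider adjusting difficulty or model size"
```

For example, a starting reward of 0.25 falls in the target band, while 0.5 is trainable but leaves less headroom than the target suggests.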
Step 7.4 - Launch the full run
prime train run config.toml --env-var OPENAI_API_KEY=sk-...
You can pass secrets at runtime with --env-var, or set them in the environment settings; both work.
Once the run is live, the Lab dashboard shows per-step metrics including mean reward, reward standard deviation, response length, and tool call count. Tool call count is worth checking early. If the agent isn’t calling tools in the first few steps, the system prompt isn’t clear enough about what tools are available and when to use them.
Config parameters to tune
| Parameter | What it controls | Starting point |
|---|---|---|
| max_steps | Training duration | 50 for L1, 100 for L2/L3 |
| batch_size | Examples per step | 128 |
| rollouts_per_example | Trajectories sampled per question | 8, increase for noisy L3 rewards |
| max_tokens | Max agent response length | 1024, L3 may need more |
Results
We trained Qwen3-4B-Instruct and Llama-3.2-3B-Instruct across all three levels. The rollout viewer in Lab lets you inspect individual trajectories turn by turn: each tool call, its response, and the final answer. It’s the most direct way to see what the model is actually learning at each level and why the reward curve behaves as it does.
Level 1
| Checkpoint | Reward |
|---|---|
| Start | ~0.05 |
| Step 15 | ~1.0 |
Saturates within 15 steps and stays there. Confirms the pipeline works end-to-end. Once the agent learns the pattern of calling get_metadata() and reading the response, it gets nearly everything right.
Level 2
| Checkpoint | Reward |
|---|---|
| Start | ~0.25 |
| Step 50 | ~0.70 |
Clear upward trend with variance throughout. Some batches land on straightforward date subtraction, others require chaining a search into a metadata call and then doing a comparison. The curve hasn’t plateaued at step 50, so longer runs should yield further gains.
Level 3
| Checkpoint | Reward |
|---|---|
| Start | ~0.30 |
| Step 100 | ~0.50 |
Noisy, as expected from LLM-judged reward. Individual batches swing between 0.15 and 0.9. The upward trend is real but needs larger batches and more steps to converge. The noise has two sources: judge subjectivity, and variation in question difficulty (e.g., a technical summary and a cross-patent comparison are different tasks of different difficulty).
Extending This to Other Domains
The core architecture (schema-derived tools, progressive difficulty levels, universal rubric with content-specific ground truth) is not patent-specific. Some starting points for other domains:
- Legal case search: Replace read_section("Claims") with read_section("Holding"). The judgment or ruling is your Level 3 answer target.
- SEC filings: 10-K documents have standardized sections (Risk Factors, MD&A, Financial Statements).
- Medical literature: PubMed abstracts for Level 1, full-text PMC articles for Level 3. Use MeSH terms for structured metadata.
- Enterprise knowledge bases: Internal docs with known schemas. Level 3 judge needs domain-appropriate ground truth generation.
The main work in each new domain is: (1) acquiring the data with full text, (2) defining the question types that capture the actual analytic tasks, and (3) writing the Level 3 ground truth generation prompt with domain-specific few-shot examples.
All environments are published on the Environments Hub. The patent dataset is on HuggingFace.