Prime Intellect Inference provides OpenAI-compatible API access to state-of-the-art language models. Our inference service routes requests to various model providers, offering flexible model selection suited for running large-scale evaluations.
The Inference API is currently in closed beta. Features and model availability may change as we continue to improve the service.

Getting Started

1. Get Your API Key

First, obtain your API key from the Prime Intellect Platform:
  1. Navigate to your account settings
  2. Go to the API Keys section
  3. Generate a new API key for inference access

2. Set Up Authentication

Set your API key as an environment variable:
export PRIME_API_KEY="your-api-key-here"

3. Access through the CLI or API

You can use Prime Inference in two ways: through the Prime CLI or through the OpenAI-compatible API directly. The Prime CLI provides easy access to inference models and is especially useful for running evaluations:
# List available models
prime inference models

# Use with environment evaluations (most common use case)
prime env eval gsm8k -m meta-llama/llama-3.1-70b-instruct -n 25
For evaluations: see the Environment Evaluations guide for comprehensive examples and best practices.

Direct API Access (OpenAI-Compatible)

import os

import openai

# Configure the client to use Prime Inference
client = openai.OpenAI(
    api_key=os.environ.get("PRIME_API_KEY"),
    base_url="https://api.pinference.ai/api/v1"
)

# Make a chat completion request
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "What is Prime Intellect?"}
    ]
)

print(response.choices[0].message.content)

Available Models

Prime Inference provides access to various state-of-the-art language models. You can list all available models using the models endpoint:

Get All Available Models

# List all available models
prime inference models
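
Because the API is OpenAI-compatible, you can also list models programmatically. A minimal sketch, reusing the client configured in the Direct API Access section above and assuming the standard models listing endpoint:

# Iterate over the models exposed by the API and print their IDs
for model in client.models.list():
    print(model.id)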

Using the Inference Endpoint

Basic Chat Completion

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=500,
    temperature=0.7
)
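
The response object follows the OpenAI format, so the generated text and token counts can be read directly (the usage fields match those shown in the API Reference below):

# Print the generated text and the total tokens consumed
print(response.choices[0].message.content)
print(response.usage.total_tokens)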

Streaming Responses

For real-time applications, use streaming to receive responses as they’re generated:
stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "Write a short story about a robot."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
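
If you also need the complete text once streaming finishes, one option is to accumulate the chunks as you print them; a small variant of the loop above:

# Collect streamed chunks while printing them, then join into the full text
chunks = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="")
        chunks.append(delta)

full_text = "".join(chunks)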

Advanced Parameters

Prime Inference supports all standard OpenAI API parameters:
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Your prompt here"}],

    # Generation parameters
    max_tokens=1000,
    temperature=0.8,
    top_p=0.9,
    frequency_penalty=0.1,
    presence_penalty=0.1,

    # Advanced options
    stream=False,
    stop=["END", "\n\n"],
    logprobs=True,
    top_logprobs=3
)
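
When logprobs are requested, the OpenAI response format exposes per-token log probabilities on each choice. Support may vary by model provider, so treat this as a sketch:

# Inspect per-token log probabilities (only present if the provider returns them)
if response.choices[0].logprobs is not None:
    for token_info in response.choices[0].logprobs.content:
        print(token_info.token, token_info.logprob)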

Pricing and Billing

Prime Inference uses token-based pricing with competitive rates:
  • Input tokens: Charged for tokens in your prompt
  • Output tokens: Charged for tokens in the model’s response
  • Billing: Automatic deduction from your Prime Intellect account balance
Pricing varies by model. We will provide more details on pricing soon and make it available through the models API.
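
Every response includes a usage object with token counts, so you can estimate per-request spend yourself. The per-token rates below are placeholders, not published prices:

# Placeholder rates in USD per 1M tokens -- actual pricing varies by model
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 1.50

usage = response.usage  # from any chat completion response above
estimated_cost = (
    usage.prompt_tokens * INPUT_PRICE_PER_M
    + usage.completion_tokens * OUTPUT_PRICE_PER_M
) / 1_000_000
print(f"Estimated cost: ${estimated_cost:.6f}")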

Support and Feedback

Since the Inference API is in closed beta, we welcome your feedback.

API Reference

The Prime Intellect Inference API provides OpenAI-compatible endpoints:

Available Endpoints

GET /models
Returns a list of all available models that you can use for inference requests.

Response:
{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/llama-3.1-70b-instruct",
      "object": "model",
      "owned_by": "meta",
      "created": 1693721698
    }
  ]
}
GET /models/{model_id}
Retrieves detailed information about a specific model.

Parameters:
  • model_id (path): The ID of the model to retrieve
Response:
{
  "id": "meta-llama/llama-3.1-70b-instruct",
  "object": "model",
  "owned_by": "meta",
  "created": 1693721698
}
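
With the OpenAI Python client, this maps to the models retrieve call; a minimal sketch, assuming the endpoint follows the standard OpenAI path layout:

# Fetch metadata for a single model by ID
model = client.models.retrieve("meta-llama/llama-3.1-70b-instruct")
print(model.id, model.owned_by)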
POST /chat/completions
Creates a model response for the given chat conversation.

Request Body:
{
  "model": "meta-llama/llama-3.1-70b-instruct",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 100,
  "temperature": 0.7,
  "stream": false
}
Response:
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1693721698,
  "model": "meta-llama/llama-3.1-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 9,
    "total_tokens": 17
  }
}
Interactive API Documentation: Full interactive API documentation with request/response examples will be available soon. The current endpoints are fully compatible with OpenAI’s API format.

Next Steps