
Basic Chat Completion

Send a standard chat completion request with the OpenAI-compatible client:

# client is the OpenAI SDK client configured for your Prime Inference endpoint
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=500,
    temperature=0.7
)
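
The SDK returns a structured completion object. A minimal sketch of reading the reply text, finish reason, and token usage, assuming the standard OpenAI SDK response shape:

# Inspect the completion (standard OpenAI SDK response fields)
choice = response.choices[0]
print(choice.message.content)   # the assistant's reply text
print(choice.finish_reason)     # e.g. "stop" or "length"

# Token accounting for the request
print(response.usage.prompt_tokens,
      response.usage.completion_tokens,
      response.usage.total_tokens)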

Streaming Responses

For real-time applications, use streaming to receive responses as they’re generated:
stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "Write a short story about a robot."}
    ],
    stream=True
)

for chunk in stream:
    # Each chunk carries an incremental delta; print tokens as they arrive
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
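
If you also need the full text after streaming finishes (for logging or post-processing), collect the deltas as they arrive. A small sketch that replaces the loop above:

# Accumulate streamed deltas into the complete reply
parts = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
        parts.append(delta)

full_text = "".join(parts)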

Advanced Parameters

Prime Inference supports all standard OpenAI API parameters:
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Your prompt here"}],

    # Generation parameters
    max_tokens=1000,
    temperature=0.8,
    top_p=0.9,
    frequency_penalty=0.1,
    presence_penalty=0.1,

    # Advanced options
    stream=False,
    stop=["END", "\n\n"],
    logprobs=True,
    top_logprobs=3
)
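
Because the request above sets logprobs=True with top_logprobs=3, each generated token is returned with its log probability and the top alternative tokens. A minimal sketch of reading them, assuming the standard OpenAI SDK logprobs shape:

import math

# Walk the per-token log probabilities returned with the completion
for token_info in response.choices[0].logprobs.content:
    prob = math.exp(token_info.logprob)
    alternatives = [(alt.token, round(alt.logprob, 3)) for alt in token_info.top_logprobs]
    print(f"{token_info.token!r}: p={prob:.3f}, top-3={alternatives}")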

Next Steps

I