Unleash the Power of Ollama with Python: A Beginner's API Guide

Want to use Python to cast spells on your local Ollama large language models? Whether you want to create a chatbot, process text data in bulk, or freely manage your local model library, this guide will get you started!

We'll explore how to have simple conversations with models, how to receive streaming responses like watching AI type, how to create/copy/delete models, and even dive into more advanced custom clients and asynchronous programming to make your AI applications run faster and smoother!

Ready? Let's begin your Python + Ollama maker journey!

Preparing the Groundwork: Environment Setup

Before you start casting spells, make sure your "magic tools" are ready:

  • Python Environment: Ensure you have Python 3.8 or later installed on your computer. Python is a popular language in the AI world and is used everywhere.
  • pip Tool: This is Python's "app store," used to install various useful third-party libraries. It's usually included when you install Python.
  • Install the ollama Library: Open your terminal (command line) and type the following command to install the "translator" that communicates with Ollama:
bash
pip install ollama

Done! Now you can start summoning the dragon!
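
To confirm everything is wired up, a quick sanity check is to import the library and list your local models. A minimal sketch, assuming the Ollama server is already running at the default http://localhost:11434:

python
import ollama

# If this prints your local model list, both the library and the Ollama server are working
try:
    print(ollama.list())
except Exception as e:
    print("Couldn't reach the Ollama server:", e)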

Taking it for a Spin: Quick Start Conversation

Let's look at a simple example to get your Python program chatting with an Ollama model:

python
# Import the chat function from the ollama library we just installed
from ollama import chat
# Also import the response type for code hinting (optional but recommended)
from ollama import ChatResponse

# Start chatting! Tell Ollama which model to use (e.g., llama3.1) and pass your question
# messages is a list, where each dictionary represents a message, 'role' is the role ('user' is you), and 'content' is the content
response: ChatResponse = chat(model='llama3.1', messages=[
  {
    'role': 'user',
    'content': "Why is the sky blue? Give me an interesting explanation!",
  },
])

# response contains the model's reply plus some metadata; it supports both
# dictionary-style and attribute-style access:
print("--- Method 1: Accessing it like a dictionary ---")
print(response['message']['content'])

# Or, if you use the ChatResponse type hint, you can access it more elegantly like this:
print("\n--- Method 2: Accessing it like an object property ---")
print(response.message.content)

Run it. Simple, isn't it? Your first Python Ollama program is born!

Watching AI Type: Streaming Responses

Sometimes the model's response is long, or you want users to feel that the AI is thinking and typing in real time rather than suddenly dumping a big block of text after a long wait. That's where "streaming responses" come in handy!

Just add stream=True when calling chat, and the result comes out bit by bit, like squeezing toothpaste.

python
from ollama import chat

# Note that stream=True has been added here
stream = chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Tell me a long joke about programmers and coffee?'}],
    stream=True,
)

print("AI is typing...")
# stream is now a "generator", we can use a for loop to continuously retrieve the small pieces of content it generates
for chunk in stream:
  # chunk is also a dictionary, we only print the reply content part
  # end='' prevents print from automatically adding a newline, and flush=True ensures that the content is displayed immediately
  print(chunk['message']['content'], end='', flush=True)

print("\nAI is finished!") # Add a newline at the end

Run this example and you'll see the AI's answer appear word by word (or snippet by snippet). Isn't the experience better?
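
If you also want the complete reply afterwards (for logging, saving to a file, and so on), just collect the chunks as they arrive. A minimal sketch reusing the same stream=True call:

python
from ollama import chat

stream = chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Tell me a long joke about programmers and coffee?'}],
    stream=True,
)

# Print each chunk as it arrives AND keep it for later use
pieces = []
for chunk in stream:
    piece = chunk['message']['content']
    print(piece, end='', flush=True)
    pieces.append(piece)

full_text = ''.join(pieces)
print(f"\nTotal reply length: {len(full_text)} characters")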

Letting Machines Understand AI's Intentions: Structured Output (JSON)

Sometimes we don't just want the AI to answer a question; we also want it to return information in a specific "format", such as a JSON data structure that our program can parse and use directly. This is especially useful for building automated pipelines or storing AI results in a database!

Why Structured Output?

  • Machine-Friendly: Programs can directly read specific fields (such as "capital," "population") without having to painstakingly extract words from a large paragraph of natural language.
  • Controllable Results: You can require the AI to include certain information in a specific format.
  • Easy Storage and Analysis: The JSON format is particularly convenient for storing in databases and performing data analysis.

Here's an example of having AI return JSON information about the United States and using the pydantic library to define and validate the JSON structure:

python
# First make sure pydantic is installed: pip install pydantic
from pydantic import BaseModel
from ollama import chat
import json

# Use Pydantic to define a "data blueprint" to tell the program what the expected JSON structure looks like
class CountryInfo(BaseModel):
    # All three fields are required; the names must match the keys in the JSON the model returns
    capital: str
    population: str  # str is used here because the model may return text with units
    area: str

# Call chat, this time with format="json" and options added
response = chat(
    model='llama3.1',
    messages=[{
        'role': 'user',
        'content': "Please return basic information about the United States in JSON format, including the three fields 'capital', 'population', and 'area'."
                   " For example: {\"首都\": \"xxx\", \"人口\": \"约 YYY\", \"占地面积\": \"ZZZ 平方公里\"}" # Giving an example usually works better
    }],
    format="json", # Tell Ollama, I want JSON!
    options={'temperature': 0}, # Set temperature to 0 to make the output more stable and closer to the instruction
)

# Extract content from the response
response_content = response["message"]["content"]
print(f"Original JSON string returned by Ollama:\n{response_content}")

# Make sure the returned value is not an empty string
if not response_content:
    raise ValueError("Ollama returned an empty JSON content, please check the prompt or model.")

# Convert the JSON string into a Python dictionary
try:
    json_response = json.loads(response_content)
    print(f"\nParsed Python dictionary:\n{json_response}")

    # Use the CountryInfo blueprint we defined to validate and transform this dictionary
    country_data = CountryInfo.model_validate(json_response)
    print(f"\nObject validated and transformed by Pydantic:\n{country_data}")
    print(f"\nYou can directly access properties: Capital - {country_data.capital}, Population - {country_data.population}")

except json.JSONDecodeError:
    print("\nError: Ollama returned an invalid JSON format.")
except Exception as e:
    print(f"\nError processing JSON or during Pydantic validation: {e}")

This way, we can ensure that we get structured data from AI that is properly formatted and contains the expected content!
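
By the way, recent versions of Ollama and ollama-python also let you pass a JSON Schema as the format argument, which constrains the model's output to that schema and pairs nicely with Pydantic. A hedged sketch, assuming you are on a version that supports schema-constrained output:

python
from pydantic import BaseModel
from ollama import chat

class CountryInfo(BaseModel):
    capital: str
    population: str
    area: str

response = chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': "Return basic information about the United States: capital, population, and area."}],
    # Pass the Pydantic-generated JSON Schema instead of the plain string "json"
    format=CountryInfo.model_json_schema(),
    options={'temperature': 0},
)

# Validate straight from the JSON string in the reply
country = CountryInfo.model_validate_json(response.message.content)
print(country)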

Your Ollama "Remote Control": An API Overview

The ollama-python library is like a full-featured remote control for Ollama, providing many convenient functions to operate the Ollama service:

  • ollama.chat(): Chat and converse with the model (we've already used it).
  • ollama.generate(): Let the model generate text based on your prompts (simpler and more direct than chat, suitable for pure text generation tasks).
  • ollama.list(): See which models you have downloaded locally.
  • ollama.show(): Display detailed information about a model (such as its Modelfile content, parameters, etc.).
  • ollama.create(): Create your own model version based on a Modelfile (e.g., add a specific system prompt to the model).
  • ollama.copy(): Copy an existing model and give it a new name.
  • ollama.delete(): Delete unwanted models locally to free up space.
  • ollama.pull(): Download new models from the Ollama Hub (the official model repository) to your local machine.
  • ollama.push(): Push your local model (usually one that you have created or modified) to the Ollama Hub to share with others (requires login).
  • ollama.embeddings() / ollama.embed(): Convert text into a string of numbers (embedding vectors) that represent the "meaning" of the text. This is very useful for advanced tasks such as text similarity comparison and semantic search (see the small similarity sketch after the examples below).
  • ollama.ps(): See which models are currently running in the Ollama service and how much resources they are using.

Simple Usage Examples at a Glance:

python
import ollama

# --- Dialogue and Generation ---
# Chat (used before)
# response = ollama.chat(model='llama3.1', messages=[{'role': 'user', 'content': "What's the weather like today?"}])
# print(response['message']['content'])

# Simple generation
# response = ollama.generate(model='llama3.1', prompt='Write a poem about learning Python:')
# print(response['response'])

# --- Model Management ---
print("--- Local Model List ---")
print(ollama.list())

print("\n--- View llama3.1 Model Details ---")
# print(ollama.show('llama3.1')) # The output content is large, uncomment when needed

# Create a model that plays Mario (requires llama3.1 locally)
print("\n--- Create Mario Model ---")
mario_modelfile='''
FROM llama3.1
SYSTEM You are Mario from Super Mario Bros. Answer all questions in Mario's voice. Wahoo!
'''
try:
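    # Note: modelfile= works with older ollama-python releases; recent versions
    # (0.4+) changed create() to take keyword arguments such as from_= and system= instead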
    ollama.create(model='mario-llama', modelfile=mario_modelfile)
    print("Mario model 'mario-llama' created successfully or already exists!")
    # Try chatting with Mario
    # mario_response = ollama.chat(model='mario-llama', messages=[{'role': 'user', 'content': 'Who are you?'}])
    # print("Mario replied:", mario_response['message']['content'])
except Exception as e:
    print(f"Failed to create model: {e}")


# Copy model (if you want to back up or rename)
# print("\n--- Copy Model ---")
# try:
#     ollama.copy('mario-llama', 'luigi-llama') # Copy Mario to Luigi?
#     print("Model copied successfully!")
# except Exception as e:
#     print(f"Failed to copy model: {e}")

# Delete model (be careful!)
# print("\n--- Delete Model ---")
# try:
#     ollama.delete('luigi-llama') # Delete the Luigi you just copied
#     print("Model deleted successfully!")
# except Exception as e:
#     print(f"Failed to delete model: {e}")

# Download model from Hub
# print("\n--- Download New Model (e.g. gemma:2b) ---")
# try:
#     ollama.pull('gemma:2b')
#     print("Model 'gemma:2b' downloaded successfully!")
# except Exception as e:
#     print(f"Failed to download model: {e}")

# Push model (requires you to have an Ollama Hub account, and the model name must be in the 'username/modelname' format)
# print("\n--- Push Model (needs configuration and login) ---")
# try:
#     # Assume you created a model called 'my-username/my-custom-model'
#     # ollama.push('my-username/my-custom-model')
#     print("Push operation requires login and correct naming, this is just an example.")
# except Exception as e:
#     print(f"Failed to push model: {e}")

# --- Advanced Features ---
# Generate text embedding vectors
print("\n--- Generate Embedding Vectors ---")
prompt_for_embedding = 'A cat stretching in the sunlight'
try:
    embedding_response = ollama.embeddings(model='llama3.1', prompt=prompt_for_embedding)
    print(f"Embedding vector (part) of '{prompt_for_embedding}': {embedding_response['embedding'][:5]}...") # Show only the first 5 numbers

    # You can also process texts in batches with embed()
    # embed_response = ollama.embed(model='llama3.1', input=['Hello world', 'Goodbye moon'])
    # print(f"Number of batch embedding vectors: {len(embed_response['embeddings'])}")
except Exception as e:
    print(f"Failed to generate embedding vectors: {e}")


# View Ollama process status
print("\n--- View Ollama Running Status ---")
try:
    print(ollama.ps())
except Exception as e:
    print(f"Failed to view status: {e}")

Use these "remote control" buttons flexibly according to your needs!

More Advanced Ways to Play: Customize Your Ollama "Connector"

By default, the ollama library's functions connect to the local http://localhost:11434 service. But if you want to connect to another address (such as an Ollama service on another machine), or set the request timeout, add custom request headers, etc., you need to create a custom client.

There are two types of clients:

  • Client (Synchronous): Simple and direct, send a request, wait for it to return, and then do the next thing. Like a phone call, you can only talk to one person at a time.
  • AsyncClient (Asynchronous): A high-performance player, it can send other requests while waiting for a response to one request. Like a waiter who can take care of several tables of guests at the same time, it is suitable for scenarios that require handling many requests simultaneously.

What can be configured?

  • host: The address and port of the Ollama service.
  • timeout: The maximum time to wait for a response (seconds).
  • Other httpx supported parameters: such as headers (custom request headers), proxies (proxy), etc. ollama-python uses the powerful httpx library to send network requests under the hood.

Synchronous Client: Simple and Direct

python
from ollama import Client

# Create a custom synchronous client
# Suppose the Ollama service is on port 11434 of 192.168.1.100 and you want to add a request header
client = Client(
    host='http://192.168.1.100:11434', # Change to your target address
    timeout=60,                       # Set the timeout to 60 seconds
    headers={'x-app-name': 'my-cool-ollama-app'} # Add a custom request header
)

# Use this client object to call the API method
try:
    response = client.chat(model='llama3.1', messages=[
        {
            'role': 'user',
            'content': 'Tell me a joke about synchronous programming.',
        },
    ])
    print("Reply from custom client:")
    print(response['message']['content'])
except Exception as e:
    print(f"Error using custom synchronous client: {e}")

Asynchronous Client: Concurrent Master

Asynchronous programming is a bit more involved and relies on Python's asyncio library. It keeps your program from "freezing" while it waits for time-consuming operations such as network responses; instead it can work on something else in the meantime, which greatly improves efficiency, especially when you need to fire off multiple requests at once.

python
import asyncio
from ollama import AsyncClient
# If you are running asynchronous code in Jupyter Notebook or similar environments, you may need this library to resolve event loop conflicts
import nest_asyncio
nest_asyncio.apply()

# Define an asynchronous function
async def run_async_chat():
    # Create an asynchronous client, you can configure host, timeout, etc. like synchronous
    async_client = AsyncClient() # Default connection to localhost:11434
    message = {'role': 'user', 'content': 'Why is asynchronous programming useful?'}
    try:
        # Note the await here, which means waiting for the asynchronous operation to complete
        response = await async_client.chat(model='llama3.1', messages=[message])
        print("Reply from asynchronous client:")
        print(response['message']['content'])
    except Exception as e:
        print(f"Error using asynchronous client: {e}")

# Run this asynchronous function
print("Starting asynchronous chat...")
asyncio.run(run_async_chat())
print("Asynchronous chat finished.")

Asynchronous streaming response? No problem!

python
import asyncio
from ollama import AsyncClient
import nest_asyncio
nest_asyncio.apply()

async def run_async_stream_chat():
    async_client = AsyncClient()
    message = {'role': 'user', 'content': 'Write a short story about a time-traveling cat.'}
    print("AI is asynchronously typing...")
    try:
        # Use async for to iterate over the asynchronous generator
        async for part in await async_client.chat(model='llama3.1', messages=[message], stream=True):
            print(part['message']['content'], end='', flush=True)
        print("\nAI is asynchronously finished!")
    except Exception as e:
        print(f"\nError using asynchronous streaming client: {e}")

print("Starting asynchronous streaming chat...")
asyncio.run(run_async_stream_chat())
print("Asynchronous streaming chat finished.")

Performance Comparison: Synchronous vs. Asynchronous (Why Asynchronous Might Be Faster?)

The following code simulates a scenario where multiple requests are initiated simultaneously to compare the total time it takes for synchronous and asynchronous to complete the same task.

Where does asynchronous shine? In handling multiple concurrent requests. The synchronous client must handle requests one by one: until the previous request finishes, the next one has to wait. The asynchronous client can send several requests at once and process each one as soon as it comes back, which greatly reduces the total waiting time, especially when network latency is high or Ollama needs time to process each request.

python
import time
import asyncio
from ollama import Client, AsyncClient
import nest_asyncio
nest_asyncio.apply()

# --- Configuration ---
OLLAMA_HOST = 'http://localhost:11434' # Or your Ollama address
MODEL_NAME = 'llama3.1' # Model used for testing
TEST_MESSAGES = [{'role': 'user', 'content': 'Simple greeting, hello!'}] # Simple request content
NUM_REQUESTS = 10 # Number of requests initiated simultaneously

# --- Client Initialization ---
sync_client = Client(host=OLLAMA_HOST)
async_client = AsyncClient(host=OLLAMA_HOST)

# --- Synchronous Test ---
def run_sync_test(num_requests):
    print(f"\n--- Starting Synchronous Test ({num_requests} requests) ---")
    start_total_time = time.time()
    durations = []
    for i in range(num_requests):
        print(f"Initiating synchronous request {i+1}/{num_requests}...")
        start_req_time = time.time()
        try:
            sync_client.chat(model=MODEL_NAME, messages=TEST_MESSAGES)
            end_req_time = time.time()
            duration = end_req_time - start_req_time
            durations.append(duration)
            print(f"Synchronous request {i+1} completed, time spent: {duration:.2f} seconds")
        except Exception as e:
            print(f"Synchronous request {i+1} failed: {e}")
            durations.append(float('inf')) # Mark as failed
    end_total_time = time.time()
    total_time = end_total_time - start_total_time
    successful_sync = [d for d in durations if d != float('inf')]
    avg_time_per_req_sync = sum(successful_sync) / len(successful_sync) if successful_sync else 0
    print(f"--- Synchronous Test Completed ---")
    print(f"Total time spent: {total_time:.2f} seconds")
    print(f"Average time spent per successful request (serial): {avg_time_per_req_sync:.2f} seconds")
    return total_time

# --- Asynchronous Test ---
async def async_single_request(req_id):
    print(f"Initiating asynchronous request {req_id+1}/{NUM_REQUESTS}...")
    start_req_time = time.time()
    try:
        await async_client.chat(model=MODEL_NAME, messages=TEST_MESSAGES)
        end_req_time = time.time()
        duration = end_req_time - start_req_time
        print(f"Asynchronous request {req_id+1} completed, time spent: {duration:.2f} seconds")
        return duration
    except Exception as e:
        print(f"Asynchronous request {req_id+1} failed: {e}")
        return float('inf') # Mark as failed

async def run_async_test(num_requests):
    print(f"\n--- Starting Asynchronous Test ({num_requests} requests) ---")
    start_total_time = time.time()
    # Create a bunch of tasks and let them execute concurrently
    tasks = [async_single_request(i) for i in range(num_requests)]
    # Wait for all tasks to complete
    durations = await asyncio.gather(*tasks)
    end_total_time = time.time()
    total_time = end_total_time - start_total_time
    successful_async = [d for d in durations if d != float('inf')]
    avg_time_per_req_async = sum(successful_async) / len(successful_async) if successful_async else 0
    print(f"--- Asynchronous Test Completed ---")
    print(f"Total time spent (concurrent): {total_time:.2f} seconds")
    # Note: The average time spent here is the completion time of a single request, not a direct reflection of concurrent efficiency
    # print(f"Average time spent per successful request (internal): {avg_time_per_req_async:.2f} seconds")
    return total_time

# --- Run Test ---
sync_total_time = run_sync_test(NUM_REQUESTS)
async_total_time = asyncio.run(run_async_test(NUM_REQUESTS))

print("\n--- Performance Comparison Summary ---")
print(f"Total time spent processing {NUM_REQUESTS} requests synchronously: {sync_total_time:.2f} seconds")
print(f"Total time spent processing {NUM_REQUESTS} requests asynchronously: {async_total_time:.2f} seconds")
if async_total_time < sync_total_time:
    print(f"Asynchronous is about {(sync_total_time - async_total_time):.2f} seconds faster than synchronous (improved by about {((sync_total_time - async_total_time) / sync_total_time * 100):.1f}%)")
else:
    print("In this test, the advantage of asynchronous is not obvious or synchronous is faster (possibly because the requests are too fast or the concurrency is not high enough).")

Running this code, you'll usually see that the total time spent by asynchronous is significantly less than synchronous, especially when the number of requests is large or the time spent on a single request is long.

Gracefully Handling "Minor Accidents": Error Handling

Network requests can always encounter problems, such as misspelling the model name, the Ollama service not starting, network interruptions, etc. We need to use try...except to catch these possible errors to make the program more robust.

python
import ollama

model_to_try = 'a-model-that-doesnt-exist-probably'

try:
    print(f"Trying to chat with model '{model_to_try}'...")
    ollama.chat(model=model_to_try, messages=[{'role':'user', 'content':'Hi!'}])
    print("Succeeded?! This shouldn't happen...") # If the model really exists, this will be printed
except ollama.ResponseError as e:
    # ollama.ResponseError is a specific error type defined by the library
    print(f"\nAn error occurred! Error message: {e.error}")
    print(f"HTTP status code: {e.status_code}")
    # A common 404 error indicates that the model was not found
    if e.status_code == 404:
        print(f"It seems that the model '{model_to_try}' is not local.")
        user_choice = input("Do you want to try downloading it from Ollama Hub? (y/n): ").lower()
        if user_choice == 'y':
            try:
                print(f"Downloading model '{model_to_try}'...")
                ollama.pull(model_to_try)
                print(f"Model '{model_to_try}' downloaded successfully! You can try using it again.")
            except Exception as pull_e:
                print(f"An error occurred while downloading the model: {pull_e}")
        else:
            print("Okay, not downloading.")
    else:
        # Other types of errors
        print("Encountered another type of Ollama response error.")
except Exception as e:
    # Other unexpected errors, such as network connection problems
    print(f"\nAn unexpected error occurred: {e}")

This way, even if there is a problem, your program can give a friendly prompt and even try to solve the problem (such as downloading a missing model).
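
If you prefer a non-interactive version of this pattern (for scripts or services), you can wrap it in a small helper that pulls the missing model once and retries. A sketch under the same assumptions as the example above; chat_with_autopull is just a hypothetical helper name:

python
import ollama

def chat_with_autopull(model, messages):
    """Try to chat; if the model is missing locally (404), pull it once and retry."""
    try:
        return ollama.chat(model=model, messages=messages)
    except ollama.ResponseError as e:
        if e.status_code != 404:
            raise  # not a "model not found" problem, let the caller handle it
        print(f"Model '{model}' not found locally, pulling it first...")
        ollama.pull(model)
        return ollama.chat(model=model, messages=messages)

# Usage (model name is just an example):
# reply = chat_with_autopull('llama3.1', [{'role': 'user', 'content': 'Hi!'}])
# print(reply['message']['content'])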