Generate embeddings for whole files using OpenAI in Python

Generate embeddings for PDF and DOCX files using OpenAI models in Python. In this notebook, you will learn how to read the files, divide their content into chunks and generate embeddings for each of them.

This notebook was created with MLJAR Studio

MLJAR Studio is a Python code editor with interactive code recipes and a local AI assistant.
The code recipes UI is displayed at the top of code cells.

Documentation

All of the required packages are imported automatically:

# import packages
import os
from dotenv import load_dotenv
from openai import OpenAI, AuthenticationError
from pypdf import PdfReader
from docx import Document

Create the OpenAI client connection:

# load .env file
load_dotenv()

# get api key from environment
api_key = os.environ["OPENAI_KEY"]

# create OpenAI client
def create_client(api_key):
    try:
        client = OpenAI(api_key=api_key)
        client.models.list()
        return client
    except AuthenticationError:
        print("Incorrect API key")
    return None

client = create_client(api_key)

Turning a whole file into a single embedding is rarely useful: a long file may exceed the model's input limit, and one vector blurs together everything the file says. So first divide the file into chunks, then generate an embedding for each chunk.

We have the recipe for PDF and DOCX files and we would like to show you both, starting with PDF:

# set file path
filePath=r"../../../../data/test.pdf"

# read file
reader = PdfReader(filePath)

# declare lists
chunks = []
embeddings = []

# text division: one chunk per page, skipping pages with no extractable text
for page in reader.pages:
    chunk = page.extract_text()
    if chunk and chunk.strip():
        chunks.append(chunk)

# create embeddings
for chunk in chunks:
    embedding = client.embeddings.create(
        input = chunk,
        model = "text-embedding-3-small"
    )
    embeddings.append(embedding.data[0].embedding)
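
The loop above sends one request per chunk. The embeddings endpoint also accepts a list of strings as `input`, so several chunks can be sent in one call. Below is a minimal sketch of that idea. The helper name `embed_in_batches` and the default `batch_size=100` are assumptions made for illustration, not part of the recipe, and the batch size that works for you depends on the request limits of your account and model.

```python
def embed_in_batches(client, texts, model="text-embedding-3-small", batch_size=100):
    """Return one embedding per text, sending up to batch_size texts per request.

    `client` is expected to be an OpenAI client (hypothetical helper, not part of the recipe).
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # each returned item carries the index of its input within the batch;
        # sorting by it keeps the vectors aligned with the input order
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

With the `client` and `chunks` from the cell above, `embeddings = embed_in_batches(client, chunks)` would take the place of the per-chunk loop.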

And now with DOCX:

# set file path
filePath=r"../../../../Downloads/test.docx"

# read file
doc = Document(filePath)

# declare lists
chunks = []
embeddings = []

# text division: one chunk per paragraph, skipping empty paragraphs
for paragraph in doc.paragraphs:
    chunk = paragraph.text
    if chunk.strip():
        chunks.append(chunk)

# create embeddings
for chunk in chunks:
    embedding = client.embeddings.create(
        input = chunk,
        model = "text-embedding-3-small"
    )
    embeddings.append(embedding.data[0].embedding)
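
DOCX paragraphs are often very short (headings, list items, blank lines), so one embedding per paragraph can give many tiny chunks. One possible alternative is to merge consecutive paragraphs into larger chunks. The sketch below assumes a character budget per chunk. The helper name `merge_paragraphs` and the default `max_chars=1000` are illustrative assumptions, not part of the recipe.

```python
def merge_paragraphs(paragraphs, max_chars=1000):
    """Group consecutive non-empty paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    (Hypothetical helper, not part of the recipe.)
    """
    chunks = []
    current = ""
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        candidate = f"{current}\n{text}" if current else text
        if len(candidate) <= max_chars or not current:
            # still within budget, or starting a new chunk
            current = candidate
        else:
            # budget exceeded: close the current chunk and start a new one
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks
```

With the `doc` object from the cell above, `chunks = merge_paragraphs(p.text for p in doc.paragraphs)` would replace the text division step.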

Conclusions

Now you know how to generate embeddings for PDF and DOCX files. But how can you use them?

Check the "How to search the cosine similarity using OpenAI in Python" notebook!
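
As a rough illustration of one common use, the embedding of a query can be compared with the stored chunk embeddings to find the closest chunk. The sketch below uses plain Python and assumes non-zero vectors of equal length. The helper names `cosine_similarity` and `most_similar` are hypothetical, and the linked notebook may take a different approach.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_embedding, embeddings):
    """Return the index of the stored embedding most similar to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in embeddings]
    return max(range(len(scores)), key=scores.__getitem__)
```

Given a query embedding created with the same model, `chunks[most_similar(query_embedding, embeddings)]` would be the chunk whose embedding is closest to the query.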

Packages used in the embeddings-files-python.ipynb

List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.

openai>=1.35.14

python-dotenv>=1.0.1

pypdf>=4.1.0

python-docx>=1.1.2