Generate embeddings for whole files using OpenAI in Python

Generate embeddings for PDF and DOCX files using OpenAI models in Python. In this notebook, you will learn how to read the files, divide their content into chunks, and generate an embedding for each chunk.

Import the required packages:

# import packages
import os
from dotenv import load_dotenv
from openai import OpenAI, AuthenticationError
from pypdf import PdfReader
from docx import Document

Create the OpenAI client connection:

# load variables from the .env file into the environment
load_dotenv()

# get api key from environment
api_key = os.environ["OPENAI_KEY"]

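# create OpenAI client and verify the key with a lightweight request
def create_client(api_key):
    try:
        client = OpenAI(api_key=api_key)
        # the constructor does not contact the API, so list models to validate the key
        client.models.list()
        return client
    except AuthenticationError:
        print("Incorrect API key")
    return None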

client = create_client(api_key)

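Embedding a whole file as a single vector is rarely useful: the text may exceed the model's input limit, and one vector blurs together everything the file covers. Instead, divide the file into chunks first and then generate an embedding for each chunk.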
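
The recipes in this notebook use natural boundaries (pages and paragraphs) as chunks. When a single page or paragraph is still too long, one option is to cut it into fixed-size pieces. The sketch below is only an illustration under assumptions: `split_text`, `max_chars`, and `overlap` are hypothetical names that do not come from the original notebook, and character counts are just a rough proxy for the token limit the model actually enforces.

```python
def split_text(text, max_chars=2000, overlap=200):
    """Split text into pieces of at most max_chars characters.

    Consecutive pieces share `overlap` characters, so text near a
    boundary appears in both neighbouring pieces.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")

    step = max_chars - overlap  # how far the window moves each time
    pieces = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars]
        if piece.strip():  # ignore whitespace-only pieces
            pieces.append(piece)
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
    return pieces
```

Each returned piece can then be treated as one chunk. The default sizes are arbitrary starting points, not recommendations from the OpenAI documentation.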

The recipe covers both PDF and DOCX files, starting with PDF:

# set file path (placeholder; replace with the path to your PDF file)
filePath = "example.pdf"

# read file
reader = PdfReader(filePath)

# declare lists
chunks = []
embeddings = []

# text division: one chunk per page, skipping pages without extractable text
for page in reader.pages:
    chunk = page.extract_text()
    if chunk.strip():
        chunks.append(chunk)

# create embeddings
for chunk in chunks:
    response = client.embeddings.create(
        input=chunk,
        model="text-embedding-3-small"
    )
    # the response holds one embedding vector per input
    embeddings.append(response.data[0].embedding)
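
The loop above sends one request per chunk. The embeddings endpoint also accepts a list of strings as `input`, so several chunks can go into one request. The following is a minimal sketch of that idea, not part of the original recipe: `embed_in_batches` and `batch_size` are hypothetical names, and the batch size of 100 is an arbitrary assumption.

```python
def embed_in_batches(client, chunks, model="text-embedding-3-small", batch_size=100):
    """Return one embedding vector per chunk, sending several chunks per request."""
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # each returned item carries the index of its input within the batch;
        # sorting by it makes the chunk-to-vector order explicit
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

With the client created earlier, a call such as `embeddings = embed_in_batches(client, chunks)` would fill the list with fewer requests.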

And now with DOCX:

# set file path (placeholder; replace with the path to your DOCX file)
filePath = "example.docx"

# read file
doc = Document(filePath)

# declare lists
chunks = []
embeddings = []

# text division: one chunk per paragraph, skipping empty paragraphs
for paragraph in doc.paragraphs:
    chunk = paragraph.text
    if chunk.strip():
        chunks.append(chunk)

# create embeddings
for chunk in chunks:
    response = client.embeddings.create(
        input=chunk,
        model="text-embedding-3-small"
    )
    # the response holds one embedding vector per input
    embeddings.append(response.data[0].embedding)
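
DOCX paragraphs are often very short (headings, single lines), so one embedding per paragraph can carry little context. One possible refinement, sketched here under assumptions, is to join consecutive paragraphs until a size limit is reached. `merge_paragraphs` and `max_chars` are hypothetical names, and the limit of 1000 characters is arbitrary.

```python
def merge_paragraphs(paragraphs, max_chars=1000):
    """Join consecutive non-empty paragraphs into chunks of roughly max_chars."""
    merged = []
    current = ""
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        candidate = f"{current}\n{text}" if current else text
        if current and len(candidate) > max_chars:
            # adding this paragraph would exceed the limit: close the chunk
            merged.append(current)
            current = text
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged
```

For example, `chunks = merge_paragraphs(p.text for p in doc.paragraphs)` could replace the text division step. A single paragraph longer than `max_chars` is kept as its own chunk rather than being cut.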


Now you know how to generate embeddings for PDF and DOCX files. But how can you use them?

