Generate embeddings for whole files using OpenAI in Python
Generate embeddings for PDF and DOCX files using OpenAI models in Python. In this notebook, you will learn how to read the files, divide their content into chunks and generate embeddings for each of them.
MLJAR Studio is a Python code editor with interactive code recipes and a local AI assistant.
The code recipes UI is displayed at the top of code cells.
All of the required packages are imported automatically:
# import packages
import os
from dotenv import load_dotenv
from openai import OpenAI, AuthenticationError
from pypdf import PdfReader
from docx import Document
Create the OpenAI client connection:
# load .env file
load_dotenv()
# get api key from environment
api_key = os.environ["OPENAI_KEY"]
# create OpenAI client
def create_client(api_key):
    try:
        client = OpenAI(api_key=api_key)
        # listing models forces a request, so an invalid key fails here
        client.models.list()
        return client
    except AuthenticationError:
        print("Incorrect API key")
        return None
client = create_client(api_key)
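Reading the key with os.environ["OPENAI_KEY"] raises a bare KeyError when the variable is missing. Below is a minimal sketch of a friendlier lookup. The helper name read_api_key and the wording of the error message are assumptions made for illustration, not part of the recipe.

```python
import os


def read_api_key(name="OPENAI_KEY"):
    """Return the API key stored in the environment variable `name`.

    Hypothetical helper: raises a RuntimeError with a readable message
    when the variable is missing or blank, instead of a bare KeyError.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return key
```

After load_dotenv() has run, the key could then be read with api_key = read_api_key() and passed to create_client as before.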
A single embedding for a whole file is rarely useful, because it blurs all of the file's topics into one vector. Instead, first divide the file into chunks and then generate an embedding for each chunk.
There are recipes for both PDF and DOCX files. We will show you both, starting with PDF:
# set file path
filePath=r"../../../../data/test.pdf"
# read file
reader = PdfReader(filePath)
# declare lists
chunks = []
embeddings = []
# text division
for page in reader.pages:
    chunk = page.extract_text()
    chunks.append(chunk)
# create embeddings
for chunk in chunks:
    embedding = client.embeddings.create(
        input=chunk,
        model="text-embedding-3-small"
    )
    embeddings.append(embedding.data[0].embedding)
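Using one page as one chunk can go wrong in two ways: a scanned or blank page yields empty text, which the embeddings endpoint does not accept as input, and a dense page may exceed the model's input limit. The sketch below skips empty pages and splits long ones into overlapping character windows. The function names split_text and pages_to_chunks and the default sizes are assumptions chosen for illustration. Character counts are only a rough stand-in for the token limit.

```python
def split_text(text, max_chars=2000, overlap=200):
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    text = text.strip()
    step = max_chars - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
    return pieces


def pages_to_chunks(page_texts, max_chars=2000, overlap=200):
    """Drop empty pages and split the remaining ones into windows."""
    chunks = []
    for page_text in page_texts:
        if page_text and page_text.strip():
            chunks.extend(split_text(page_text, max_chars, overlap))
    return chunks
```

With the reader from the recipe, the chunks list could be built as pages_to_chunks(page.extract_text() for page in reader.pages) and then embedded exactly as above.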
And now with DOCX:
# set file path
filePath=r"../../../../Downloads/test.docx"
# read file
doc = Document(filePath)
# declare lists
chunks = []
embeddings = []
# text division
for paragraph in doc.paragraphs:
    chunk = paragraph.text
    chunks.append(chunk)
# create embeddings
for chunk in chunks:
    embedding = client.embeddings.create(
        input=chunk,
        model="text-embedding-3-small"
    )
    embeddings.append(embedding.data[0].embedding)
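DOCX paragraphs are often blank lines or very short headings, which make poor chunks on their own, and a blank paragraph is not valid input for the embeddings endpoint. One possible refinement, sketched here under the assumption that grouping by character count is good enough, is to merge consecutive non-empty paragraphs into larger chunks. The helper name merge_paragraphs and the 1000-character default are hypothetical.

```python
def merge_paragraphs(paragraphs, max_chars=1000):
    """Join consecutive non-empty paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip blank paragraphs
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # adding this paragraph would overflow: close the current chunk
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the document from the recipe, the chunks list could be built as merge_paragraphs(p.text for p in doc.paragraphs).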
Conclusions
Now you know how to generate embeddings for PDF and DOCX files. But how can you use them?
Check the "How to search the cosine similarity using OpenAI in Python" notebook!
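As a quick preview of that idea, here is a minimal pure-Python sketch of cosine similarity between embedding vectors. It is not the code from the linked notebook, and the function names cosine_similarity and most_similar are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query_embedding, embeddings):
    """Return the index of the embedding closest to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in embeddings]
    return max(range(len(scores)), key=scores.__getitem__)
```

Given an embedding of a question, most_similar(question_embedding, embeddings) returns the position of the best-matching chunk, so chunks[position] is the text to show.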
Packages used in the embeddings-files-python.ipynb
List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.
openai>=1.35.14
python-dotenv>=1.0.1
pypdf>=4.1.0
python-docx>=1.1.2