How to search the cosine similarity using OpenAI in Python
See how to find the text most similar to a user query using the OpenAI package in Python. This notebook covers creating embeddings, calculating cosine similarity, and returning the most similar text chunk.
MLJAR Studio is a Python code editor with interactive code recipes and a local AI assistant.
The code recipes UI is displayed at the top of the code cells.
Packages are automatically imported:
# import packages
import os
from dotenv import load_dotenv
from openai import OpenAI, AuthenticationError
from docx import Document
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
OpenAI client connection:
# load .env file
load_dotenv()
# get api key from environment
api_key = os.environ["OPENAI_KEY"]
# create OpenAI client
def create_client(api_key):
    try:
        client = OpenAI(api_key=api_key)
        # list models to verify that the key is accepted
        client.models.list()
        return client
    except AuthenticationError:
        print("Incorrect API key")
        return None
client = create_client(api_key)
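Note that `os.environ["OPENAI_KEY"]` raises a bare `KeyError` when the variable is missing from the `.env` file. A minimal sketch of a friendlier lookup is shown here; the helper name `read_api_key` and the error message are hypothetical and not part of the original notebook, and the demo uses a plain dictionary with a placeholder value instead of the real environment.

```python
import os


def read_api_key(env=os.environ, name="OPENAI_KEY"):
    """Return the API key stored under `name`, or raise a readable error."""
    # hypothetical helper: `env` is any mapping, os.environ by default
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key


# demo with a plain dict standing in for os.environ (placeholder value)
demo_key = read_api_key({"OPENAI_KEY": "sk-placeholder"})
print(demo_key)
```

In the notebook itself, the call would presumably be `api_key = read_api_key()` placed after `load_dotenv()`.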
The cell below creates embeddings. You can find more information in our "Generate embeddings for whole files using OpenAI in Python" notebook.
# set file path
filePath=r"../../../../Downloads/example.docx"
# read file
doc = Document(filePath)
# declare lists
chunks = []
embeddings = []
# text division
for paragraph in doc.paragraphs:
    chunks.append(paragraph.text)
# create embeddings
for chunk in chunks:
    embedding = client.embeddings.create(
        input=chunk,
        model="text-embedding-3-small"
    )
    embeddings.append(embedding.data[0].embedding)
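Word documents often contain blank paragraphs, and the OpenAI API reference states that the embeddings input cannot be an empty string, so the loop above may fail on such a file. One possible guard, sketched here with a hypothetical helper `non_empty_chunks` and made-up sample texts, is to drop blank paragraphs before requesting embeddings.

```python
def non_empty_chunks(paragraph_texts):
    """Return stripped paragraph texts, dropping blank ones."""
    cleaned = (text.strip() for text in paragraph_texts)
    return [text for text in cleaned if text]


# made-up texts standing in for the paragraph.text values of a document
sample = [
    "Mercury is the smallest planet.",
    "",
    "   ",
    "It is named after a Roman god.",
]
filtered = non_empty_chunks(sample)
print(filtered)
```

In the notebook this could be applied as `chunks = non_empty_chunks(chunks)` before the embedding loop, so that `chunks` and `embeddings` keep the same indices.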
Now you can find which of the embeddings you created earlier is most similar to the query embedding:
# define user query
user_query = "Why is Mercury called Mercury?"
# generate embedding
response = client.embeddings.create(
    input=user_query,
    model="text-embedding-3-small"
)
query_embedding = response.data[0].embedding
# find most similar id
best_match_id = cosine_similarity(np.array(embeddings), np.array(query_embedding).reshape(1,-1)).argmax()
# print most similar text
chunks[best_match_id]
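To make the computation above more concrete, here is a minimal NumPy sketch of what cosine similarity measures: the dot product of two vectors divided by the product of their lengths. It also returns the top few matches instead of only the best one. The helper names `cosine_scores` and `top_k` and the toy 2-D vectors are illustrative assumptions, not part of the original notebook, and the sketch assumes no vector is all zeros.

```python
import numpy as np


def cosine_scores(embeddings, query_embedding):
    """Cosine similarity between each row of `embeddings` and the query."""
    matrix = np.asarray(embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # dot product of every row with the query
    dots = matrix @ query
    # product of vector lengths (assumes no zero vectors)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms


def top_k(scores, k=3):
    """Indices of the k highest scores, best first."""
    return np.argsort(scores)[::-1][:k].tolist()


# toy 2-D vectors standing in for real embeddings (illustrative only)
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.0]

scores = cosine_scores(toy_embeddings, toy_query)
ranking = top_k(scores, k=2)
print(scores, ranking)
```

With the notebook's variables, `top_k(cosine_scores(embeddings, query_embedding))` would give the indices of the three closest chunks, which can be useful when a single paragraph does not contain the whole answer.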
Conclusions
Cosine similarity is useful in the RAG (retrieval-augmented generation) process, where it selects the chunks most relevant to a query before they are passed to a language model.
Maybe we should create a notebook with RAG?
Packages used in the python-cosine-similarity.ipynb
List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.
openai>=1.35.14
python-dotenv>=1.0.1
pypdf>=4.1.0
python-docx>=1.1.2
numpy>=1.26.4
scikit-learn>=1.5.1