How to search text with cosine similarity using OpenAI in Python

See how to find the text most similar to a user query using the OpenAI Python library. This notebook covers creating embeddings, calculating cosine similarity, and returning the most similar text chunk.

This notebook was created with MLJAR Studio

MLJAR Studio is a Python code editor with interactive code recipes and a local AI assistant.
The code recipes UI is displayed at the top of code cells.

Documentation

Packages are automatically imported:

# import packages
import os
from dotenv import load_dotenv
from openai import OpenAI, AuthenticationError
from docx import Document
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

OpenAI client connection:

# load .env file
load_dotenv()

# get api key from environment
api_key = os.environ["OPENAI_KEY"]

# create OpenAI client
def create_client(api_key):
    try:
        client = OpenAI(api_key=api_key)
        client.models.list()
        return client
    except AuthenticationError:
        print("Incorrect API")
    return None

client = create_client(api_key)
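Note that create_client() returns None when the API key is rejected. A minimal guard before running the next cells could look like this (a sketch, not part of the original recipe):

# stop early if the client could not be created
if client is None:
    raise SystemExit("OpenAI client was not created - check the OPENAI_KEY value in your .env file")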

The cell below creates embeddings. You can find more information in our Generate embeddings for whole files using OpenAI in Python notebook.

# set file path
filePath=r"../../../../Downloads/example.docx"

# read file
doc = Document(filePath)

# declare lists
chunks = []
embeddings = []

# text division - collect paragraph texts, skipping empty paragraphs
# (the embeddings endpoint rejects empty input strings)
for paragraph in doc.paragraphs:
    chunk = paragraph.text
    if chunk.strip():
        chunks.append(chunk)

# create embeddings - one request per chunk
for chunk in chunks:
    embedding = client.embeddings.create(
        input = chunk,
        model = "text-embedding-3-small"
    )
    embeddings.append(embedding.data[0].embedding)
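The loop above sends one request per chunk. The embeddings endpoint also accepts a list of strings, so as an optional variation (a sketch, assuming none of the chunks is empty) all chunks can be embedded in a single call:

# alternative: embed all chunks in one request
response = client.embeddings.create(
    input = chunks,
    model = "text-embedding-3-small"
)
embeddings = [item.embedding for item in response.data]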

Now, you can find the closest match between the embeddings you created earlier and the query embedding:

# define user query
user_query = "Why Mercury is called Mercury?"

# generate embedding
response = client.embeddings.create(
    input = user_query,
    model = "text-embedding-3-small"
)
query_embedding = response.data[0].embedding

# find the id of the most similar chunk
similarities = cosine_similarity(np.array(embeddings), np.array(query_embedding).reshape(1, -1))
best_match_id = similarities.argmax()

# print most similar text
chunks[best_match_id]
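For reference, cosine_similarity() from scikit-learn divides the dot product of two vectors by the product of their norms. A minimal NumPy sketch of the same ranking, using the embeddings and query_embedding variables created above:

# cosine similarity by hand: dot(a, b) / (||a|| * ||b||)
emb_matrix = np.array(embeddings)
query_vec = np.array(query_embedding)

scores = emb_matrix @ query_vec / (
    np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(query_vec)
)
print(chunks[scores.argmax()])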

Conclusions

Cosine similarity is a key step in the retrieval part of RAG (retrieval-augmented generation), which is an interesting topic on its own.

Maybe we should create a notebook about RAG? 🤔

Recipes used in the python-cosine-similarity.ipynb

All code recipes used in this notebook are listed below. You can click them to check their documentation.

Packages used in the python-cosine-similarity.ipynb

List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.

openai>=1.35.14

python-dotenv>=1.0.1

python-docx>=1.1.2

numpy>=1.26.4

scikit-learn>=1.5.1