How to Make a Semantic Search
with ChatGPT and SentenceTransformers

This is an experiment in semantic full-text search over the book "Die Verwandlung" by Franz Kafka. The book is first indexed by mapping each of its sentences to a vector, called an "embedding." The embeddings are created with the SentenceTransformers library, specifically the all-MiniLM-L6-v2 model, and stored in a pickle file.

The search runs in two steps. In the first step, the search query is transformed into the same vector space, and the 20 most similar sentences are retrieved, with similarity measured by cosine similarity. In the second step, these sentences are passed to ChatGPT as the context for the question (refer to the resulting prompt).
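
The retrieval step described above can be illustrated without any libraries. The following is a minimal sketch, not the code used in this experiment: the function names and the three-dimensional toy vectors are made up for illustration, whereas real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, corpus_vecs, k):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings", one per corpus sentence.
corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
query_vec = [2.0, 0.0, 0.0]

# Indices 0 and 2 point in the most similar direction to the query.
hits = top_k_similar(query_vec, corpus_vecs, k=2)
```

Cosine similarity depends only on direction, so the query vector scores 1.0 against the first corpus vector although the two differ in length.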

Examples:

What is the name of the book?
What is ticking on the chest of drawers?
What is the job of Samsa?
What happened to Gregor Samsa one morning?

Source code snippets ...

Download the book from a URL, split the content into sentences, create the embeddings, and save them to a pickle file.

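import pickle

import requests
from bs4 import BeautifulSoup
# Assumption: sent_tokenize comes from NLTK, which needs its Punkt tokenizer data installed
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

def create_pkl(url, filename):

    # Download the page and strip the HTML markup
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    text = text.replace("\n", " ")
    text = text.replace("\r", " ")

    # Split the plain text into sentences
    corpus_sentences = sent_tokenize(text)

    # Encode every sentence as an embedding vector
    model = SentenceTransformer('all-MiniLM-L6-v2')
    corpus_embeddings = model.encode(corpus_sentences)

    # Store sentences and embeddings together so that indices stay aligned
    with open("./data/" + filename + ".pkl", "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)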
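
To make the stored structure concrete, here is a small round-trip sketch using only the standard library. The sentences and two-dimensional vectors are placeholders, not real embeddings, and the temporary file name is arbitrary.

```python
import os
import pickle
import tempfile

# Placeholder data standing in for real sentences and their embeddings.
index = {
    "sentences": ["First sentence.", "Second sentence."],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")
    # Write the index the same way create_pkl does ...
    with open(path, "wb") as f_out:
        pickle.dump(index, f_out)
    # ... and read it back the way the search step does.
    with open(path, "rb") as f_in:
        loaded = pickle.load(f_in)

# Row i of the embeddings belongs to sentence i, so both lists must stay aligned.
assert len(loaded["sentences"]) == len(loaded["embeddings"])
```

Note that unpickling can execute arbitrary code, so only load pickle files you created yourself.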

Read the query, create its embedding, and fetch the most similar sentences from the corpus.


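import pickle

import torch
from sentence_transformers import SentenceTransformer, util

# `request` is the web framework's request object (request.form suggests Flask)
query = request.form['q']

# Embed the query with the same model that was used for the corpus
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode(query, convert_to_tensor=True)

# Load the sentences and embeddings written by create_pkl
with open("./data/kafka.pkl", "rb") as fIn:
    cache_data = pickle.load(fIn)
    corpus_sentences = cache_data['sentences']
    corpus_embeddings = cache_data['embeddings']

# Never ask for more hits than there are sentences
top_k = min(20, len(corpus_sentences))

# Cosine similarity of the query against every sentence, then the top_k best scores
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
top_results = torch.topk(cos_scores, k=top_k)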

Format a prompt for ChatGPT with the user query and the relevant sentences as context, and send it to the OpenAI API.


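import openai

# The query goes between ### markers, followed by one bullet per context sentence
prompt = "Input: ### " + query + " ###\n\n"
prompt += "Given context: \n"

# top_results[1] holds the indices of the best-matching sentences
for idx in top_results[1]:
    prompt += "* " + corpus_sentences[int(idx)].replace("\n", " ") + "\n"

# Legacy Completions call (openai package before 1.0);
# the text-davinci-003 model has since been retired by OpenAI
try:
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=40,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
except Exception as e:
    # This snippet runs inside the request handler, hence the return
    return "Error: " + str(e)

print(res['choices'][0]['text'])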
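
The prompt assembly can also be factored into a small function, which makes the format easy to test in isolation. This is only a sketch: build_prompt is a hypothetical helper name, and the two context sentences are invented placeholders rather than quotes from the book.

```python
def build_prompt(query, sentences):
    """Assemble the prompt: the query between ### markers, then one bullet per sentence."""
    lines = ["Input: ### " + query + " ###", "", "Given context: "]
    for sentence in sentences:
        # Keep every context sentence on a single line
        lines.append("* " + sentence.replace("\n", " "))
    return "\n".join(lines) + "\n"

# Invented placeholder sentences, standing in for the retrieved top results.
prompt = build_prompt(
    "What is the job of Samsa?",
    ["Samsa was a travelling salesman.", "He had to catch\nthe early train."],
)
```

The resulting string has the same layout as the prompt built step by step above, so it could be passed to the API call unchanged.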