This is an experiment in semantic full-text search over Franz Kafka's book "Die Verwandlung". The book was first indexed by mapping each sentence into a vector space; the resulting vectors are called "embeddings". The embeddings were created with the SentenceTransformers library, specifically the all-MiniLM-L6-v2 model, and are stored in a pickle file.
The search runs in two steps. In the first step, the search query is transformed into the same vector space, and the 20 most similar sentences are retrieved; similarity is measured by cosine similarity. In the second step, these sentences are passed to ChatGPT as the context for the question (refer to the resulting prompt).
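The ranking step is plain cosine similarity between the query vector and every sentence vector. A minimal sketch with NumPy, using toy 3-dimensional vectors in place of the model's 384-dimensional MiniLM embeddings:

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product of L2-normalized row vectors
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

query = np.array([[1.0, 0.0, 0.0]])          # toy query embedding
corpus = np.array([
    [0.9, 0.1, 0.0],                         # points almost the same way as the query
    [0.0, 1.0, 0.0],                         # orthogonal to the query
    [0.5, 0.5, 0.0],                         # in between
])

scores = cos_sim(query, corpus)[0]
ranking = np.argsort(-scores)                # corpus indices, most similar first
```

`util.cos_sim` from sentence_transformers computes the same quantity on tensors; `torch.topk` then replaces the full `argsort` when only the top 20 results are needed.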
Examples:
Download the book from a URL, tokenize the content into sentences, compute the embeddings, and save them to a pickle file:
import pickle

import requests
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

def create_pkl(url, filename):
    # Download the book and strip the HTML markup
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    text = text.replace("\n", " ").replace("\r", " ")
    # Split the text into sentences and embed each one
    corpus_sentences = sent_tokenize(text)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    corpus_embeddings = model.encode(corpus_sentences)
    # Store sentences and embeddings together for later retrieval
    with open("./data/" + filename + ".pkl", "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
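The pickle round-trip can be sketched without downloading the model; the toy embeddings below are random stand-ins for the 384-dimensional MiniLM vectors:

```python
import os
import pickle
import tempfile

import numpy as np

sentences = ["Gregor erwachte.", "Er lag auf dem Rücken."]
# Random stand-ins for the MiniLM embeddings (384 dimensions per sentence)
embeddings = np.random.rand(len(sentences), 384).astype("float32")

# Save sentences and embeddings together, as create_pkl does
path = os.path.join(tempfile.gettempdir(), "kafka_demo.pkl")
with open(path, "wb") as fOut:
    pickle.dump({"sentences": sentences, "embeddings": embeddings}, fOut)

# Load them back, as the search step does
with open(path, "rb") as fIn:
    cache = pickle.load(fIn)
```

Storing sentences and embeddings in the same file keeps the index aligned: position i in the score vector always maps back to sentence i.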
Get the query, compute its embedding, and fetch the most similar sentences from the corpus:
import pickle

import torch
from flask import request  # assuming a Flask endpoint, since the query arrives via request.form
from sentence_transformers import SentenceTransformer, util

query = request.form['q']
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode(query, convert_to_tensor=True)
# Load the precomputed sentences and embeddings
with open("./data/kafka.pkl", "rb") as fIn:
    cache_data = pickle.load(fIn)
    corpus_sentences = cache_data['sentences']
    corpus_embeddings = cache_data['embeddings']
# Rank all corpus sentences by cosine similarity to the query
top_k = min(20, len(corpus_sentences))
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
top_results = torch.topk(cos_scores, k=top_k)
Format a prompt for ChatGPT with the user query and the relevant sentences as context, and send it to the OpenAI API:
import openai

# Build the prompt: the query wrapped in ### markers, then the context sentences
prompt = "Input: ### " + query + " ###\n\n"
prompt += "Given context: \n"
for score, idx in zip(top_results[0], top_results[1]):
    prompt += "* " + corpus_sentences[idx].replace("\n", " ") + "\n"
try:
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,        # deterministic output
        max_tokens=40,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
except Exception as e:
    return "Error: " + str(e)
print(res['choices'][0]['text'])
This is the prompt sent to the OpenAI API after the context has been added. The context is the result of the first step of the search on the server: the similarity to the search query is calculated for every sentence in the book, and the sentences with the highest similarity are passed as context to the OpenAI API. ChatGPT then interprets the question and formulates an answer based on the given context.
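As a concrete illustration of the template built above (the query and context sentences here are made-up placeholders, not actual retrieval results):

```python
query = "Was geschieht mit Gregor Samsa?"  # hypothetical example query
context_sentences = [  # stand-ins for the top-ranked corpus sentences
    "Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in seinem Bett zu einem ungeheueren Ungeziefer verwandelt.",
    "Er lag auf seinem panzerartig harten Rücken.",
]

# Same template as in the server code: query in ### markers, context as bullets
prompt = "Input: ### " + query + " ###\n\n"
prompt += "Given context: \n"
for sentence in context_sentences:
    prompt += "* " + sentence.replace("\n", " ") + "\n"
print(prompt)
```

The ### markers delimit the user input, and the bulleted context gives the model grounded material to answer from, so the completion stays tied to the book rather than to the model's general knowledge.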