Introduction to Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) enhances AI by retrieving relevant external data (stored as vector embeddings) and using it as context for generating responses. This allows AI to answer questions beyond its pre-trained knowledge, making it ideal for creating specialized chatbots, like a fashion AI assistant.
Why Use RAG?
- Addressing Knowledge Gaps: RAG supplements AI with up-to-date, specialized information beyond its training cutoff.
- Incorporating Private Data: RAG allows AI to utilize secure, domain-specific data that isn't part of its pre-existing knowledge.
What are Embeddings?
Embeddings are vector representations of data that capture semantic meaning, enabling AI to understand and retrieve similar information. They are generated by processing text data through an AI model, which converts the data into high-dimensional vectors. Different AI providers may offer embeddings with varying dimensions, affecting the level of detail in the representation.
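As a concrete illustration, here is a minimal sketch (assuming the langchain-openai package is installed and an OPENAI_API_KEY environment variable is set) of turning a sentence into an embedding and inspecting its size:

from langchain_openai import OpenAIEmbeddings

# Embed a single sentence with OpenAI's text-embedding-3-small model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
vector = embeddings.embed_query("What should I wear to a summer wedding?")

# The result is a list of floats whose length matches the requested dimensions
print(len(vector))   # 512
print(vector[:5])    # the first few components; exact values vary

Semantically similar sentences produce vectors that lie close together (for example, by cosine similarity), which is what makes retrieval by meaning possible.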
Exploring Vector Stores
A vector store, such as MongoDB Atlas, Weaviate, Pinecone, or Postgres with PgVector, is a storage solution that supports vector operations and indexing for fast retrieval of relevant data. Storing embeddings in a vector store allows them to be quickly retrieved and used as context in AI models.
Building a Fashion AI Assistant with RAG
In this tutorial, we'll show you how to set up a fashion AI assistant using RAG. You'll learn how to combine Python, LangChain, OpenAI embeddings, and MongoDB Atlas to enhance AI with domain-specific knowledge.
As prerequisites, we need to set up these credentials as environment variables:
export OPENAI_API_KEY="..."
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="..."
export MONGODB_CONN_STRING="..."
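If you want to confirm these credentials are picked up before running the scripts, a quick illustrative check looks like this:

import os

# Fail early if any required credential is missing
for var in ("OPENAI_API_KEY", "LANGCHAIN_API_KEY", "MONGODB_CONN_STRING"):
    if not os.getenv(var):
        raise RuntimeError(f"Missing environment variable: {var}")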
There are two main components to setting up a Fashion AI Assistant with RAG: (1) Indexing, and (2) Retrieval and Generation.
1. Indexing
We use a fashion dataset from HuggingFace, transform it into vector embeddings, and store these embeddings in MongoDB Atlas with a vector index for efficient retrieval.
## ingest.py
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient
from datasets import load_dataset
import pandas as pd
import tiktoken
import os
mongodb_conn_string = os.getenv("MONGODB_CONN_STRING")
db_name = "fashion_shop_faq"
collection_name = "faq_assistant"
ai_model = "text-embedding-3-small"
vector_dimension = 512
# Connect to MongoDB Atlas
client = MongoClient(mongodb_conn_string)
db = client[db_name]
collection = db[collection_name]
# Load dataset
dataset = load_dataset("Quangnguyen711/Fashion_Shop_Consultant", split="train")
# Convert dataset to a Pandas DataFrame
df = pd.DataFrame(dataset)
# Only keep records where the Question and Answer fields are not null
df = df[df["Question"].notna() & df["Answer"].notna()]
# Combine Question and Answer fields into a single text field
# axis=1: This means the function is applied row-wise
df["text"] = df.apply(lambda row: f"[Question]{row['Question']}[Answer]{row['Answer']}", axis=1)
# Convert the combined text column to a list of strings
texts = df["text"].tolist()
The dataset is first loaded from HuggingFace and converted into a Pandas DataFrame, keeping only the records where both the Question and Answer fields are non-null. For each record, a new text field is created by combining the question and answer, and the combined texts are collected into a list for embedding.
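Before generating embeddings, it can help to spot-check the prepared data (a quick illustrative check, not part of the ingestion script itself):

# How many records survived the null filter, and what a combined record looks like
print(len(texts))
print(texts[0][:120])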
## ingest.py
# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(model=ai_model, dimensions=vector_dimension)
# Initialize the tokenizer for the specific model
tokenizer = tiktoken.encoding_for_model(ai_model)
# Initialize a variable to keep track of the total tokens used
total_tokens_used = 0
# Define a reasonable batch size
batch_size = 50
# Process the dataset in batches
for i in range(0, len(texts), batch_size):
    batch_texts = texts[i:i + batch_size]
    # Calculate total tokens used for the current batch
    batch_tokens_used = sum(len(tokenizer.encode(text)) for text in batch_texts)
    total_tokens_used += batch_tokens_used
    # Generate embeddings for the current batch
    embeddings_list = embeddings.embed_documents(batch_texts)
    # Prepare documents with embeddings to insert into MongoDB
    documents = []
    for j, (index, row) in enumerate(df.iloc[i:i + batch_size].iterrows()):
        document = {
            "text": row["text"],
            "embedding": embeddings_list[j]
        }
        documents.append(document)
    # Insert the batch of documents into MongoDB
    collection.insert_many(documents)
    # Print total tokens used in the current batch
    print(f"Processed and inserted batch {i // batch_size + 1}, tokens used: {batch_tokens_used}")
print(f"Embeddings generated and stored in MongoDB! Total tokens used: {total_tokens_used}")
# Close the MongoDB connection
client.close()
We use the OpenAI text-embedding-3-small model to generate embeddings with a vector dimension of 512. This model is significantly more efficient than the previous text-embedding-ada-002 model, with pricing reduced by 5X.
We generate the embeddings in batches and insert the batch of documents into MongoDB. We use tiktoken to calculate the tokens used for embedding generation. Here's the total token usage and its cost:
Total tokens used: 31678
Price per 1k tokens: $0.00002
OpenAI API cost: 31678 tokens / 1000 * $0.00002 ≈ $0.00063
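The retrieval step below assumes a vector search index named vector_index exists on the embedding field of this collection. You can create it through the Atlas UI, or programmatically; here is a sketch assuming pymongo 4.6+ and an Atlas cluster that supports vector search:

import os
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.getenv("MONGODB_CONN_STRING"))
collection = client["fashion_shop_faq"]["faq_assistant"]

# Define a vector search index over the 512-dimensional embedding field
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 512,
                "similarity": "cosine"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)
collection.create_search_index(model=index_model)
client.close()

The index takes a short while to build and must be active before the similarity search in the next step returns results.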
2. Retrieval and Generation
This involves converting user queries into embeddings, performing a vector search in MongoDB Atlas using LangChain, and passing the retrieved documents along with the user query to the AI model to generate informed responses specific to the fashion domain.
## query.py
from langchain_openai import OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
from pymongo import MongoClient
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import warnings
import argparse
import os
mongodb_conn_string = os.getenv("MONGODB_CONN_STRING")
db_name = "fashion_shop_faq"
collection_name = "faq_assistant"
ai_model = "text-embedding-3-small"
vector_dimension = 512
index_name = "vector_index"
# Process arguments
parser = argparse.ArgumentParser(description='Fashion Shop Assistant')
parser.add_argument('-q', '--question', help="The question to ask")
args = parser.parse_args()
query = args.question
print("\nYour question:")
print("-------------")
print(query)
# Connect to MongoDB Atlas
client = MongoClient(mongodb_conn_string)
db = client[db_name]
collection = db[collection_name]
# OpenAI embedding model
embeddings = OpenAIEmbeddings(model=ai_model, dimensions=vector_dimension)
# Initialize MongoDBAtlasVectorSearch with correct keys
vectorStore = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=embeddings,        # Your embedding model
    text_key="text",             # Field in MongoDB for the text you want to retrieve
    embedding_key="embedding",   # Field in MongoDB for the stored embeddings
    index_name=index_name,       # Name of Vector Index in MongoDB Atlas
    relevance_score_fn="cosine"  # Use cosine similarity
)
print(f"User question: {query}\n")
### get relevant docs from MongoDB
# Perform the similarity search
similar_docs = vectorStore.similarity_search(query)
print("\nQuery Response:")
print("---------------")
# Access the closest matching document
if similar_docs:
    # Iterate through each document and print its content
    for i, doc in enumerate(similar_docs):
        print(f"Doc {i+1}: {doc.page_content}")
    closest_match = similar_docs[0]
    # print("Closest Match:", closest_match)
else:
    print("No matching document found.")
This script can be run as shown below, where your query is passed in after the -q flag.
python3 query.py -q "Suggestion for summer outfit"
We use MongoDBAtlasVectorSearch from langchain_mongodb to perform the vector search. First, the user query is transformed into an embedding using the same model used in the Indexing phase. Then, we showcase the functionality of vectorStore.similarity_search(), which retrieves similar documents from the vector store based on the user query. By default, it returns four matching documents, but you can adjust this using the k option.
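For instance, to retrieve only the top two matches instead of the default four (an illustrative variation, not part of the script above):

# Return the 2 most similar documents instead of the default 4
top_docs = vectorStore.similarity_search(query, k=2)
for doc in top_docs:
    print(doc.page_content)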
Next, we set up the actual RAG chain to process the user query:
## query.py
### Set up RAG chain
llm = ChatOpenAI(model="gpt-4o-mini")
retriever = vectorStore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
def format_docs(docs):
    print("\nRetriever - prepare context for prompt:")
    print("--------------------------------------")
    for i, doc in enumerate(docs):
        print(f"Doc {i+1}: {doc.page_content}")
    return "\n\n".join(
        [f"{doc.page_content}" for doc in docs]
    )
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
response = rag_chain.invoke(query)
print("\nRAG Chain Response:")
print("-------------------")
print(response)
# Close the MongoDB connection
client.close()
The RAG chain is composed of retrieval and generation. MongoDBAtlasVectorSearch is set up as the retriever, which uses similarity_search by default, so the documents it returns should match the results of the earlier vectorStore.similarity_search(query) call. You can also specify other search types, such as MMR, using the search_type option.
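For example, switching the retriever to MMR (maximal marginal relevance) to favor more diverse matches is a small change (shown here as an illustrative variation):

# Use MMR instead of plain similarity search: fetch 20 candidates, return 4 diverse ones
retriever = vectorStore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}
)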
The prompt is used to instruct the LLM in the next stage to answer the user's question based on the given context. We use a pre-defined prompt from LangChain (rlm/rag-prompt), which accepts input parameters of context and question. The matching results string is passed as context, and the user question is passed as question to the prompt.
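If you prefer not to pull the prompt from the LangChain hub, a roughly equivalent local prompt can be defined instead (the wording here is illustrative, not the exact hub template):

from langchain_core.prompts import ChatPromptTemplate

# A local stand-in for rlm/rag-prompt with the same input variables
prompt = ChatPromptTemplate.from_template(
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know.\n\n"
    "Question: {question}\n\nContext: {context}\n\nAnswer:"
)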
We use the OpenAI gpt-4o-mini model as the LLM. The prompt is then passed to this LLM, which generates a response for the user.
Conclusion
In this tutorial, we've demonstrated how to build a fashion AI assistant using Retrieval-Augmented Generation (RAG). By leveraging external data stored as vector embeddings and using MongoDB Atlas as a vector store, we created a specialized chatbot capable of providing informed responses in the fashion domain. This approach showcases the power of RAG in expanding AI's knowledge base and enhancing its ability to deliver domain-specific insights.
Explore the Code
If you're interested in the technical details, you can access the full implementation on GitHub and try it out directly in Google Colab: