Building a RAG Application with Memory Capabilities Using Google's Gemma Model

In this tutorial, we will walk through building a Retrieval-Augmented Generation (RAG) application with conversational memory. We will run Google's open Gemma model locally through Ollama, use LangChain to handle the retrieval and memory logic, and use Chainlit as the chat framework. The application lets users upload a PDF document and interact with it through a chat interface.


Introduction to Gemma and Ollama

Gemma is a family of lightweight open models developed by Google DeepMind. The models are available in two sizes, Gemma 2B and Gemma 7B, each with pre-trained and instruction-tuned variants. They are built from the same research and technology used to create the Gemini models and are known for their efficiency relative to their size.
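
To make sure Gemma is reachable before building the full app, you can call it directly through LangChain's ChatOllama wrapper. This is a minimal sketch, assuming a local Ollama server is running and the gemma:7b model has already been pulled:

from langchain_community.chat_models import ChatOllama

# Assumes a local Ollama server with the gemma:7b model already pulled
llm = ChatOllama(model="gemma:7b")
reply = llm.invoke("In one sentence, what is retrieval-augmented generation?")
print(reply.content)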

Ollama lets us serve open models locally, including embedding models. For this tutorial we use its nomic-embed-text embedding model, which supports an 8,192-token context window and, according to its authors, outperforms comparable OpenAI text embedding models on both short- and long-context tasks.
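
As a quick sanity check that the embedding model is available, you can embed a short string directly. This is a minimal sketch, assuming the nomic-embed-text model has already been pulled into a running Ollama instance:

from langchain_community.embeddings import OllamaEmbeddings

# Assumes a local Ollama server with the nomic-embed-text model already pulled
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("Gemma is a family of lightweight open models.")
print(len(vector))  # dimensionality of the returned embedding vector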


Application Architecture



1. Upload and Extract Text from PDF: Users upload a PDF document, and the application extracts text from it.

2. Text Chunking and Embedding: The extracted text is split into manageable chunks and converted into embeddings using Ollama's embedding model.

3. Vector Store Creation: These embeddings are stored in a vector database for efficient similarity searches.

4. Question Handling: When a user asks a question, the application creates an embedding of the question, searches the vector store for similar content, and retrieves the most relevant chunks (this extract-chunk-embed-retrieve loop is sketched in isolation right after this list).

5. Response Generation: The relevant chunks are passed to the Gemma model to generate a refined answer, which is then displayed to the user along with source references.
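
Before wiring everything into Chainlit, steps 1 through 4 can be exercised on their own. The following is a minimal sketch, assuming a local Ollama server with nomic-embed-text pulled; sample.pdf is a placeholder file name:

import PyPDF2
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Step 1: extract raw text from the PDF (sample.pdf is a placeholder path)
reader = PyPDF2.PdfReader("sample.pdf")
pdf_text = "".join(page.extract_text() or "" for page in reader.pages)

# Step 2: split the text into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=50)
chunks = splitter.split_text(pdf_text)

# Step 3: embed the chunks and store them in a Chroma vector store
store = Chroma.from_texts(chunks, OllamaEmbeddings(model="nomic-embed-text"))

# Step 4: embed a question and retrieve the most similar chunks
for doc in store.similarity_search("What is this document about?", k=3):
    print(doc.page_content[:120])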


Implementation Steps



Installation Packages

Install the following Python packages (for example with pip). You will also need Ollama installed locally, with the gemma:7b and nomic-embed-text models pulled.

chainlit==1.0.200

langchain

langchain_community

PyPDF2

chromadb


Code

Save the script below (for example as app.py) and start the app with chainlit run app.py -w.

import PyPDF2

from langchain_community.embeddings import OllamaEmbeddings

from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.vectorstores import Chroma

from langchain.chains import ConversationalRetrievalChain

from langchain_community.chat_models import ChatOllama

from langchain.memory import ChatMessageHistory, ConversationBufferMemory

import chainlit as cl




@cl.on_chat_start

async def on_chat_start():

    files = None  # Initialize variable to store uploaded files


    # Wait for the user to upload a file

    while files is None:

        files = await cl.AskFileMessage(

            content="Please upload a pdf file to begin!",

            accept=["application/pdf"],

            max_size_mb=100,  # Optionally limit the file size

            timeout=180,  # Set a timeout for the user response

        ).send()


    file = files[0] # Get the first uploaded file

    print(file) # Print the file object for debugging

    

    # Sending an image with the local file path

    elements = [

        cl.Image(name="image", display="inline", path="pic.jpg")

    ]

    # Inform the user that processing has started

    msg = cl.Message(content=f"Processing `{file.name}`...", elements=elements)

    await msg.send()


    # Read the PDF file

    pdf = PyPDF2.PdfReader(file.path)

    pdf_text = ""

    for page in pdf.pages:

        pdf_text += page.extract_text() or ""  # extract_text() may return None for some pages

        


    # Split the text into chunks

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=50)

    texts = text_splitter.split_text(pdf_text)


    # Create a metadata for each chunk

    metadatas = [{"source": f"{i}-pl"} for i in range(len(texts))]


    # Create a Chroma vector store

    embeddings = OllamaEmbeddings(model="nomic-embed-text")

    docsearch = await cl.make_async(Chroma.from_texts)(

        texts, embeddings, metadatas=metadatas

    )

    

    # Initialize message history for conversation

    message_history = ChatMessageHistory()

    

    # Memory for conversational context

    memory = ConversationBufferMemory(

        memory_key="chat_history",

        output_key="answer",

        chat_memory=message_history,

        return_messages=True,

    )


    # Create a chain that uses the Chroma vector store

    chain = ConversationalRetrievalChain.from_llm(

        ChatOllama(model="gemma:7b"),

        chain_type="stuff",

        retriever=docsearch.as_retriever(),

        memory=memory,

        return_source_documents=True,

    )


    # Let the user know that the system is ready

    msg.content = f"Processing `{file.name}` done. You can now ask questions!"

    await msg.update()

    #store the chain in user session

    cl.user_session.set("chain", chain)



@cl.on_message

async def main(message: cl.Message):

    # Retrieve the chain from the user session

    chain = cl.user_session.get("chain")

    # Chainlit callback handler for the asynchronous LangChain call

    cb = cl.AsyncLangchainCallbackHandler()

    

    # Call the chain with the user's message; pass the callback handler via the run config

    res = await chain.ainvoke(message.content, config={"callbacks": [cb]})

    answer = res["answer"]

    source_documents = res["source_documents"] 


    text_elements = [] # Initialize list to store text elements

    

    # Process source documents if available

    if source_documents:

        for source_idx, source_doc in enumerate(source_documents):

            source_name = f"source_{source_idx}"

            # Create the text element referenced in the message

            text_elements.append(

                cl.Text(content=source_doc.page_content, name=source_name)

            )

        source_names = [text_el.name for text_el in text_elements]

        

        # Add source references to the answer

        if source_names:

            answer += f"\nSources: {', '.join(source_names)}"

        else:

            answer += "\nNo sources found"

    # Return the answer along with the source references

    await cl.Message(content=answer, elements=text_elements).send()
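
The ConversationBufferMemory attached to the chain is what lets follow-up questions build on earlier turns. To see this outside of Chainlit, here is a minimal standalone sketch, assuming a local Ollama server with gemma:7b and nomic-embed-text pulled; the two example sentences stand in for real document chunks:

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Tiny stand-in corpus instead of an uploaded PDF
texts = [
    "Gemma is a family of lightweight open models from Google DeepMind.",
    "Ollama runs open models such as Gemma on a local machine.",
]
docsearch = Chroma.from_texts(texts, OllamaEmbeddings(model="nomic-embed-text"))

memory = ConversationBufferMemory(
    memory_key="chat_history", output_key="answer", return_messages=True
)
chain = ConversationalRetrievalChain.from_llm(
    ChatOllama(model="gemma:7b"),
    retriever=docsearch.as_retriever(),
    memory=memory,
    return_source_documents=True,
)

print(chain.invoke({"question": "Who develops Gemma?"})["answer"])
# The follow-up relies on the first turn being stored in memory
print(chain.invoke({"question": "And how can I run it on my own machine?"})["answer"])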




Final Thoughts

This RAG application shows how Google's Gemma model and Ollama-served embeddings can be combined into an interactive, efficient document-querying system with conversational memory. By leveraging these tools, we can build applications that answer complex questions about a document with contextual accuracy. If you have any questions or feedback, please leave a comment.
