Two new Qwen models came out recently: Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507. Both offer a generous context length of 256K, and knowing this, I thought to myself, "Why not build a RAG that puts that context length to good use?" It's worth mentioning that the Qwen3 family includes a large variety of models for coding, thinking, embedding, and reranking. To build our RAG as well as possible, we'll use Qwen3's embedding model and reranker model alongside the instruct model.
Don't worry, we won't start off by building the RAG! We'll first look at each of these models individually and then put them together.
Qwen3 models: Background
Developed by Alibaba Cloud, multiple Qwen3 models were launched a few months back. As an improvement to these, two new models, Qwen3-Instruct-2507 and Qwen3-Thinking-2507, were recently introduced in three sizes: 235B-A22B, 30B-A3B, and 4B. Note that we'll primarily focus on the 4B variant of Qwen3-Instruct-2507 in this article. All these models are open-source and readily available on Hugging Face and Kaggle. It's also worth mentioning that the Qwen3 models offer multilingual support, covering 119 languages and dialects. Let's see a few of the Qwen3 models in action, and later build the RAG we're here for.
Qwen3 Models Demo
Let's start with the text generation model, but before that, make sure to get your Hugging Face access token from your Hugging Face account settings.
Note: We'll be doing this demo on Google Colab. After opening a new notebook, add the access token as HF_TOKEN in the Secrets tab on the left, and enable notebook access for that secret.
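For reference, here's a minimal sketch of reading that secret inside Colab and logging in to the Hugging Face Hub (it assumes the secret is saved under the name HF_TOKEN, as above):

from google.colab import userdata
from huggingface_hub import login

# Read the HF_TOKEN secret (requires notebook access to be enabled for it)
hf_token = userdata.get("HF_TOKEN")
login(token=hf_token)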
Qwen3-4B-Instruct-2507
This is an updated version of the Qwen3-4B non-thinking mode and has a meaty context length of 256K. As the name suggests, the model has 4 billion parameters, which is relatively lightweight and comfortable to run on Colab. Let's fire up this model using Hugging Face Transformers and see it in action.
Note: Change the runtime type to T4 GPU to handle the model.
# Setup and Dependencies
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModel
# Instruct Model (Text Generation)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    torch_dtype="auto",
    device_map="auto"
)
prompt = "Explain what machine learning is in simple terms."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=256
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
Here's the output I got for the prompt "Explain what machine learning is in simple terms.":
Sure! In simple terms, **machine learning** is a way for computers to learn from data and improve at doing tasks without being explicitly programmed to do so.
Imagine you have a robot that learns to recognize cats in photos. Instead of writing step-by-step instructions like "look for a round face, big eyes, and a nose," you just show it many pictures of cats and pictures of dogs. Over time, the robot starts to figure out what makes a cat different from a dog — it learns by itself through examples.
That’s machine learning: the computer looks at data (like pictures, numbers, or text), finds patterns, and gets better at making predictions or decisions based on those patterns.
Think of it like how a child learns to identify different animals by seeing them over and over. The machine does something similar — it learns from experience (data) and gets smarter with time.
So, in short:
**Machine learning = teaching computers to learn from data and improve on their own.** 😊
Qwen3-Embedding-0.6B
This embedding model converts text into dense vector representations that capture the relationships between texts. It's an essential part of the RAG we'll build later, forming the heart of the retriever in Retrieval Augmented Generation (RAG).
Let's define a reusable pooling function, compute the embeddings, and use them to measure similarity between texts. I'm passing 4 strings in the 'texts' list.
# Qwen3-Embedding-0.6B (Text Embeddings)
def last_token_pool(last_hidden_states, attention_mask):
    """Pool each sequence by taking the hidden state of its last non-padding token."""
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-0.6B', padding_side="left")
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-0.6B')

texts = [
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language.",
    "The weather is sunny today.",
    "Artificial intelligence is transforming industries."
]

batch_dict = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
)

outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Pairwise cosine similarity between all texts
scores = (embeddings @ embeddings.T)
print(scores.tolist())
Output:
[[1.0, 0.4834885597229004, 0.3609130382537842, 0.6805511713027954], [0.4834885597229004, 1.0000001192092896, 0.44289979338645935, 0.4494439363479614], [0.3609130382537842, 0.44289979338645935, 1.0000001192092896, 0.4508340656757355], [0.6805511713027954, 0.4494439363479614, 0.4508340656757355, 1.0]]
This matrix contains the pairwise similarity scores between the texts. Look at the first row for an easier read: the similarity of a text with itself is always 1, the next highest score is 0.68, between the machine learning sentence and the AI sentence, while the lowest is 0.36, between the machine learning sentence and the weather sentence, which makes sense.
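To read such a matrix programmatically, here's a small sketch (reusing the scores tensor and texts list from above) that prints the most similar pair of distinct sentences:

# Find the most similar pair of distinct sentences in the similarity matrix
sims = scores.detach().clone()
sims.fill_diagonal_(-1.0)  # ignore self-similarity on the diagonal
flat_idx = torch.argmax(sims).item()
i, j = divmod(flat_idx, sims.shape[1])
print(f"Most similar pair (score {sims[i, j].item():.2f}):")
print(f"- {texts[i]}")
print(f"- {texts[j]}")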
Qwen3-Reranker-0.6B
The reranker takes the chunks retrieved by vector search with the embedding model and scores each one against the query. Those scores let us reorder the retrieved documents by relevance, or keep only the top-scoring subset. We'll see this model in full action in the upcoming section, but here's a quick preview first.
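The snippet below is a minimal, illustrative sketch of scoring a single query-document pair with Qwen3-Reranker-0.6B. It mirrors the prompt format used by the rerank_documents function later in this article rather than any official usage recipe: the relevance score is the probability of the "yes" token at the final position.

# Minimal sketch: score one query-document pair with the reranker
reranker_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side="left")
reranker_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B")

query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence."
pair = (
    "<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n"
    f"<Query>: {query}\n<Document>: {document}"
)

inputs = reranker_tokenizer([pair], padding=True, truncation=True, max_length=8192, return_tensors="pt")
with torch.no_grad():
    logits = reranker_model(**inputs).logits[:, -1, :]

yes_id = reranker_tokenizer.convert_tokens_to_ids("yes")
no_id = reranker_tokenizer.convert_tokens_to_ids("no")
relevance = torch.softmax(logits[:, [no_id, yes_id]], dim=1)[:, 1]
print(f"Relevance score: {relevance.item():.3f}")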
Building A RAG using the Qwen models
We'll build a RAG over Analytics Vidhya blogs (~40 articles) using the three Qwen3 models above. The data is processed sequentially, and to keep memory usage in check, each model is loaded only when it's needed and unloaded right after use. Let's look at the steps and then dive into the script.
- Step 1: Download the data. Here’s the link to my repository where you can find the data and the scripts.
- Step 2: Install the requirements:
!pip install faiss-cpu PyPDF2
- Step 3: Unzip the data into a folder:
!unzip Data.zip
- Step 4: For easier execution, you can just add the qwen_rag.py script to the Colab environment and run it using:
!python qwen_rag.py
Breaking down the script:
- We're using the PyPDF2 library to load the content of the articles in PDF format. A function is defined to read blog content in .txt or .pdf format.
- We're splitting the content into chunks of size 800 with an overlap of 100 to maintain context across consecutive chunks (see the sketch after this list).
- We're using FAISS to create a vector store, and for a given query, we retrieve the top-15 chunks by similarity.
- We then run the reranker over these 15 documents to keep the top 3, using this function:
def rerank_documents(query, candidates, k_rerank=3):
    """Rerank documents using reranker model"""
    print("Reranking documents...")
    tokenizer, model = load_reranker_model()

    # Prepare inputs
    pairs = []
    for doc, _ in candidates:
        pair = f"<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: {query}\n<Document>: {doc['content']}"
        pairs.append(pair)

    # Tokenize
    inputs = tokenizer(pairs, padding=True, truncation=True, max_length=8192, return_tensors="pt").to(device)

    # Get scores
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits[:, -1, :]

    # Get yes/no token scores
    token_false_id = tokenizer.convert_tokens_to_ids("no")
    token_true_id = tokenizer.convert_tokens_to_ids("yes")
    true_scores = logits[:, token_true_id]
    false_scores = logits[:, token_false_id]
    batch_scores = torch.stack([false_scores, true_scores], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()

    # Unload reranker model
    del tokenizer, model
    clear_memory()
    print("Reranker model unloaded")

    # Sort by score
    documents = [doc for doc, _ in candidates]
    doc_scores = list(zip(documents, scores))
    doc_scores.sort(key=lambda x: x[1], reverse=True)

    return doc_scores[:k_rerank]
- Once we have these top-3 documents, we pass them as context to the instruct model to get the final answer from our Retrieval Augmented Generation pipeline.
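For reference, here's a simplified sketch of the chunking, indexing, and retrieval steps described above. The helper names (chunk_text, build_faiss_index, retrieve) are illustrative and not necessarily the exact ones used in qwen_rag.py, and the embeddings are assumed to come from the Qwen3-Embedding-0.6B code shown earlier (last-token pooling plus L2 normalization).

import faiss
import numpy as np

def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into overlapping chunks so context carries over between consecutive chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def build_faiss_index(embeddings):
    """Build a FAISS index over L2-normalized embeddings (inner product equals cosine similarity)."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index

def retrieve(index, query_embedding, chunks, k=15):
    """Return the top-k (chunk, score) pairs for a query embedding."""
    query_embedding = np.asarray(query_embedding, dtype=np.float32).reshape(1, -1)
    scores, ids = index.search(query_embedding, k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])]

In the actual script, these (chunk, score) candidates go through rerank_documents(), and the top-3 reranked chunks are concatenated into the instruct model's prompt, which produces the answer below.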
Output:
Question: What is the difference between LoRA and QLoRA?
Answer: LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) are both parameter-efficient fine-tuning techniques used to adapt large language models (LLMs) to specific tasks. While they share similarities, they differ in their approach and efficiency.
1. **LoRA (Low-Rank Adaptation)**:
- LoRA introduces low-rank matrices (A and B) into the self-attention module of each layer in the pre-trained model. These matrices act as adapters that allow the model to adapt and specialize for specific tasks while minimizing the number of additional parameters needed.
- LoRA reduces parameter overhead by focusing on optimizing trainable low-rank matrices instead of fine-tuning all parameters. This makes it much more memory-efficient and computationally cheaper.
- LoRA allows the pre-trained model to be shared across multiple tasks, facilitating efficient task-switching during deployment.
- LoRA does not introduce any additional inference latency compared to fully fine-tuned models, making it suitable for real-time applications.
2. **QLoRA (Quantized Low-Rank Adaptation)**:
- QLoRA is an extension of LoRA that further introduces quantization to enhance parameter efficiency during fine-tuning. It builds on the principles of LoRA while introducing 4-bit NormalFloat (NF4) quantization and Double Quantization techniques.
- NF4 quantization leverages the inherent distribution of pre-trained neural network weights, transforming all weights to a fixed distribution that fits within the range of NF4 (-1 to 1). This allows for effective quantization without the need for expensive quantile estimation algorithms.
- Double Quantization addresses the memory overhead of quantization constants by quantizing the quantization constants themselves. This significantly reduces the memory footprint without compromising performance.
- QLoRA achieves even higher memory efficiency by introducing quantization, making it particularly valuable for deploying large models on resource-constrained devices.
- Despite its parameter-efficient nature, QLoRA retains high model quality, performing on par or even better than fully fine-tuned models on various downstream tasks.
In summary, while LoRA focuses on reducing the number of trainable parameters through low-rank adaptation, QLoRA further enhances this efficiency by incorporating quantization techniques, making it more suitable for deployment on devices with limited computational resources.
Sources: fine_tuning.txt, Parameter-Efficient Fine-Tuning of Large Language Models with LoRA and QLoRA.pdf, Parameter-Efficient Fine-Tuning of Large Language Models with LoRA and QLoRA.pdf
Note: You can refer to the log file 'rag_retrieval_log.txt' for more details about the retrieved documents, their similarity scores against the query, and the reranker scores.
Conclusion
By combining Qwen3's instruct, embedding, and reranker models, we've built a practical RAG pipeline that makes full use of their strengths, and the outputs look promising. With a 256K context length and multilingual support, the Qwen3 family proves versatile for real-world tasks. As next steps, you could increase the number of documents passed to the instruct model or swap in a thinking model for a different use case. I also suggest evaluating the RAG on metrics like Faithfulness and Answer Relevancy to ensure the LLM stays largely free of hallucinations for your task or use case.
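If you go down the evaluation route, here's a rough sketch using the open-source ragas library. Treat it as a starting point only: it assumes the older ragas API with question/answer/contexts column names (newer versions rename these), and ragas needs a judge LLM configured (by default via an OpenAI API key) to compute the scores. The rag_answer and top_docs names are assumed to come from the pipeline above.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# rag_answer (string) and top_docs ((doc, score) pairs) are assumed from the RAG run above
eval_data = Dataset.from_dict({
    "question": ["What is the difference between LoRA and QLoRA?"],
    "answer": [rag_answer],
    "contexts": [[doc["content"] for doc, _ in top_docs]],
})
print(evaluate(eval_data, metrics=[faithfulness, answer_relevancy]))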
Frequently Asked Questions
Q1. What is chunking in RAG?
A. Chunking is the process of splitting large text into smaller overlapping segments to maintain context while enabling efficient retrieval.
Q2. What is a vector store?
A. A vector store is a database that stores text embeddings for fast similarity search and retrieval.
Q3. How do you evaluate a RAG system?
A. You can evaluate a RAG using metrics like accuracy, relevance, and response consistency across different queries.
Q4. How many retrieved documents should be passed to the LLM?
A. It depends on your context length limit; typically, 3–5 top-ranked documents work well.
Q5. What does a reranker do?
A. A reranker scores retrieved documents against a query to reorder them by relevance before passing them to the LLM.