Build Gen-AI LLM RAG API Ecosystem on a Node Server

Denuwan Himanga Hettiarachchi
May 23, 2024


Hello World,

In this post, I’ll explain how to build your own Generative AI Large Language Model API ecosystem to get context-aware answers from PDF documents using Node.js, LangChain libraries, and the Retrieval Augmented Generation (RAG) technique. The purpose of this article is to provide comprehensive knowledge of a cohesive, production-ready REST API ecosystem that you can easily integrate with any product.

Prerequisites

  • Install the latest version of Node.js on your PC
  • Basic knowledge of plain JavaScript
  • OpenAI API Key
  • Enthusiasm

What is Retrieval Augmented Generation (RAG)?

You can skip this theoretical part if you have a basic knowledge of the RAG technique.

We are transitioning from an instruction-driven era to an intention-driven application era, where Generative AI Large Language Models (LLMs) are increasingly influencing every aspect of software development and enhancing human experiences. However, LLMs have certain limitations:

1. Hallucination
LLMs are trained on publicly available data and may lack access to unique or private domain-specific information. If you ask a question related to such private knowledge, LLMs might provide completely incorrect answers, a phenomenon known as hallucination.

2. Real-World Data Limitation
LLMs are not always up to date with real-world data. For example, early versions of OpenAI's GPT-4 were trained on data only up to September 2021, so any events occurring after that cutoff are beyond their knowledge scope.

RAG is introduced to mitigate these challenges. By using a retrieval technique, we can provide context to the LLMs and ask questions based on that context. However, due to prompt limitations, we can’t input the entire context into the LLMs directly.

To overcome this issue, we process our context source and organize the data in a prompt-friendly manner. For instance, if you have a document with 1 million words, we break it into 1,000-word chunks. Through a retrieval process, we identify the relevant chunks related to your question. We then provide the identified relevant chunk and the question to the LLM, asking it to answer based on the given context.

For the retrieval part, we use an embedding mechanism. Each 1,000-word chunk is embedded into a numerical format and stored in a vector database. In the vector database, related embedded words are located near each other. For example, “Apple” and “Phone” might reside close to each other, but “Orange” and “Phone” do not. When a new query comes from a user, we embed the question and check for the relevant word chunks related to the question from the vector database.
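
To make the idea of "closeness" concrete, here is a tiny illustrative sketch (separate from the server we build below) that compares embedding vectors with cosine similarity. The three-number vectors are made up purely for illustration; real embeddings come from a model such as OpenAI's text-embedding-ada-002 and have 1,536 dimensions.

// Toy sketch: cosine similarity is one way a vector store measures "closeness".
// The vectors below are invented; real ones come from an embedding model.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const apple = [0.9, 0.1, 0.3];  // imagined embedding for "Apple"
const phone = [0.8, 0.2, 0.4];  // imagined embedding for "Phone"
const orange = [0.1, 0.9, 0.2]; // imagined embedding for "Orange"

console.log(cosineSimilarity(apple, phone));  // ~0.98 — close in vector space
console.log(cosineSimilarity(orange, phone)); // ~0.40 — much further apart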

RAG is a simple yet powerful mechanism to overcome LLM limitations.

Retrieval Augmented Generation (RAG) Ecosystem

Build a Node.js Express API Server

We are building a cohesive API ecosystem responsible for all data processing tasks, which can be divided into two parts: data pre-processing and data retrieval.

Data Pre-processing: This involves document chunking, embedding, and storing the chunks in the vector database.

Data Retrieval: When a user asks a question, the Node server retrieves the relevant data from the vector database related to the question. Using LangChain’s RetrievalQAChain connected to the OpenAI ChatGPT model, the server generates an output relevant to the document context.

1. Create a Node Project & Install Libraries

To build a Node.js Express server, you first need to create a Node project and install a few dependencies. The practical usage of each library is explained in the sub-sections below, each of which covers one REST API endpoint of the Express server.

npm init -y
npm i @lancedb/vectordb-linux-x64-gnu   # pick the vector database binary that matches your operating system
npm i @langchain/community
npm i @langchain/openai
npm i express
npm i faiss-node
npm i langchain
npm i multer
npm i pdf-parse
npm i vectordb
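
One thing to note: the code samples below use ES module import syntax, so the package.json generated by npm init needs a "type": "module" entry (or the files must use the .mjs extension). A minimal sketch of the relevant fields, with a placeholder project name, looks like this:

{
  "name": "rag-node-server",
  "version": "1.0.0",
  "type": "module",
  "main": "index.js",
  "scripts": {
    "start": "node index.js"
  }
}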

2. Upload-Document

curl --location '{HOST}/upload' \
--form 'file=@"{FILE-LOCATION}"'

To process the file, we first need to upload it to the Node server. The upload document API uses Multer, a Node.js middleware for handling multipart/form-data. Once the file upload is successful, the API returns the file name (generated by the server using Date.now()), which is saved on the server. This file name serves as a unique ID for future processes.

Inside the index.js file, we define the /upload endpoint like this:

import express from 'express';
import multer from 'multer';
import fs from 'fs';
import path from 'path';
import bodyParser from 'body-parser';
import { fileURLToPath } from 'url';
import { dirname } from 'path';

const app = express();
const port = 3000;
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

app.use(bodyParser.json());

/**
 * Set the storage engine for multer
 * and create the destination folder for uploaded files.
 */
const storage = multer.diskStorage({
  destination: function (req, file, cb) {
    const folder = './uploads/';

    if (!fs.existsSync(folder)) {
      fs.mkdirSync(folder, { recursive: true });
    }

    cb(null, folder);
  },
  filename: function (req, file, cb) {
    // Use the upload timestamp as a unique file name, keeping the original extension.
    cb(null, Date.now() + path.extname(file.originalname));
  }
});

// Initialize multer
const upload = multer({
  storage: storage,
});

/**
 * Define a route to handle file uploads.
 * If the file was uploaded successfully, send a success response.
 */
app.post('/upload', upload.single('file'), (req, res) => {
  if (!req.file) {
    return res.status(400).json({ error: 'No file uploaded' });
  }

  res.json({ message: 'File uploaded successfully', fileName: req.file.filename });
});

/**
 * Start the server
 */
app.listen(port, () => {
  console.log(`Server is listening at http://localhost:${port}`);
});
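
A successful call returns a small JSON payload; the fileName below is just an example of what a Date.now()-based name could look like:

{
  "message": "File uploaded successfully",
  "fileName": "1716450000000.pdf"
}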

3. Embedding-Document

curl --location 'http://{HOST}/embedding-document?document={DOCUMENT-ID}'

The uploaded PDF now sits in the server's ./uploads/ directory, and from this point on every interaction with the server uses the document ID (the file name) returned in the previous step.

Inside the index.js file, we define the /embedding-document?document={document-id} endpoint like this:

import dataLoader from './util/data-loader.js';
import doc_splitter from './util/doc-splitter.js';
import vectorizer from './util/vectorizer.js';
import express from 'express';
import multer from 'multer';
import bodyParser from 'body-parser';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import { dirname } from 'path';

const app = express();
const port = 3000;
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

app.use(bodyParser.json());

/**
 * Embed a document: load it, split it into chunks, then embed and store the chunks.
 */
app.get('/embedding-document', async (req, res) => {
  try {
    const filePath = path.resolve(__dirname, './uploads/' + req.query.document);
    const docs = await dataLoader.load_documents(filePath);
    const splitted_doc = await doc_splitter.split_documents(docs);
    await vectorizer.embed_and_store(req.query.document, splitted_doc);

    res.send({ status: 'SUCCESS' });
  } catch (error) {
    res.send({ status: 'FAILED', message: "I've encountered an unexpected error. :)" });
  }
});

/**
 * Start the server
 */
app.listen(port, () => {
  console.log(`Server is listening at http://localhost:${port}`);
});

Inside the Embedding-Document REST endpoint, we perform three main tasks:

3.1. Load Document: Using the document ID, we load the document with LangChain's PDFLoader and pass it on to the next steps.

Inside the util/data-loader.js file, we define the data loader method like this:

import { PDFLoader } from "langchain/document_loaders/fs/pdf";

const data_loader = {

  load_documents: async function (file_location) {

    const loader = new PDFLoader(file_location, {
      splitPages: true,
    });

    const docs = await loader.load();

    return docs;
  }

};

export default data_loader;

3.2. Split Document: Using LangChain's RecursiveCharacterTextSplitter, we split the loaded document into chunks for the embedding process. You can adjust the chunkSize and chunkOverlap values based on your requirements. Smaller chunks often yield more precise retrieval, but the best settings depend on the specific use case.

Inside the util/doc-splitter.js file, we define the document splitter method like this:

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const doc_splitter = {

  split_documents: async function (documents) {

    const splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 100,
    });

    const docOutput = await splitter.splitDocuments(documents);

    return docOutput;
  }

};

export default doc_splitter;

3.3. Embed & Store in Vector Database: The split chunks are passed to the embedding method. We use the OpenAIEmbeddings library to embed the chunked document and store the data in a FaissStore vector database. The vector database is named after the document ID, allowing us to easily retrieve data related to the specific document during the retrieval phase.

Inside the util/vectorizer.js file, we define the embedding and storing method like this:

import { OpenAIEmbeddings } from "@langchain/openai";
import { FaissStore } from "@langchain/community/vectorstores/faiss";

const vectorizer = {

  embed_and_store: async function (vector_store, split_documents) {

    // Load the docs into the vector store
    const vectorStore = await FaissStore.fromDocuments(
      split_documents,
      new OpenAIEmbeddings({ openAIApiKey: '{OpenAI-API-KEY}' })
    );

    // Save the vector store to a directory
    const directory = "./vector-db/faiss-store/" + vector_store;

    await vectorStore.save(directory);
  }

};

export default vectorizer;
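
Note that '{OpenAI-API-KEY}' is a placeholder you must replace with your own key. Rather than hardcoding it, you could read it from an environment variable, for example (a sketch, assuming OPENAI_API_KEY is exported in your shell):

import { OpenAIEmbeddings } from "@langchain/openai";

// Read the key from the environment instead of committing it to source control.
const embeddings = new OpenAIEmbeddings({ openAIApiKey: process.env.OPENAI_API_KEY });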

4. Answer to Your Query

curl --location 'http://{HOST}/?question={QUESTION}&document={DOCUMENT-ID}'

This endpoint orchestrates the retrieval part of the process. When a user asks a question related to a document, if the document has already been processed and has a document ID, we can simply pass the document ID and question as query parameters. If not, we first need to perform the previous two steps to obtain a document ID.

Inside the index.js file, we define the /?document={document-id}&question={question} endpoint like this:

import retrieval_qa_chain from './util/retrieval-qa-chain.js';
import express from 'express';
import multer from 'multer';
import bodyParser from 'body-parser';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import { dirname } from 'path';

const app = express();
const port = 3000;
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

app.use(bodyParser.json());

/**
 * Define a GET route
 */
app.get('/', async (req, res) => {

  try {
    const documentID = req.query.document;

    const answer = await retrieval_qa_chain.ask_question(documentID, req.query.question, []);
    res.send(answer);
  } catch (error) {
    res.send({ status: 'FAILED', answer: "Ooops, I've encountered an unexpected error. :)" });
  }

});

/**
 * Start the server
 */
app.listen(port, () => {
  console.log(`Server is listening at http://localhost:${port}`);
});

Inside the ask_question method, we perform three main tasks:

4.1. Load the Vector Store:
Using the document ID, we load the FaissStore vector database belonging to the relevant document with the FaissStore.load method.

4.2. Create ConversationalRetrievalQAChain:
To define a ConversationalRetrievalQAChain, we need two inputs: the LLM model and a retriever over the loaded vector store, which supplies the relevant document chunks. Additionally, we set returnSourceDocuments to true to get the reference document data.

4.3. Invoke ConversationalRetrievalQAChain:
When invoking the ConversationalRetrievalQAChain, we can pass the chat history as well, allowing for a rich conversation experience, not limited to just one question.

Once all three steps are successful, you can send the response to the user. For ease of processing and to compress the JSON object, I performed some data orchestration, but this part is completely up to you.
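
Because the ask_question helper (shown in the next block) accepts a chat_history array, a caller can chain questions into a multi-turn conversation. A rough sketch, assuming documentID and the retrieval_qa_chain import are already in scope and using made-up questions:

// Sketch only: reuse the chat_history returned by the first call for a follow-up question.
const first = await retrieval_qa_chain.ask_question(documentID, 'What is the warranty period?', []);
const followUp = await retrieval_qa_chain.ask_question(documentID, 'Does it cover accidental damage?', first.chat_history);

console.log(followUp.answer);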

4.1, 4.2 & 4.3 activities are defined inside the util/retrieval-qa-chain.js file like this.

import { OpenAIEmbeddings, OpenAI } from "@langchain/openai";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { ConversationalRetrievalQAChain } from "langchain/chains";
import { HumanMessage, AIMessage } from "@langchain/core/messages";
import { ChatMessageHistory } from "langchain/stores/message/in_memory";

const retrieval_qa_chain = {

  ask_question: async function (document_id, question, chat_history = []) {

    const directory = "./vector-db/faiss-store/" + document_id;
    const model = new OpenAI({ openAIApiKey: '{OpenAI-API-KEY}' });

    // Load the vector store from the same directory
    const loadedVectorStore = await FaissStore.load(
      directory,
      new OpenAIEmbeddings({ openAIApiKey: '{OpenAI-API-KEY}' })
    );

    const chain = ConversationalRetrievalQAChain.fromLLM(
      model,
      loadedVectorStore.asRetriever(),
      {
        returnSourceDocuments: true,
      }
    );

    const response = await chain.invoke({ question: question, chat_history: chat_history });

    const history = new ChatMessageHistory();
    await history.addMessage(new HumanMessage(question));
    await history.addMessage(new AIMessage(response.text));

    chat_history.push(history.messages[0]);
    chat_history.push(history.messages[1]);

    const answer = {
      answer: response.text,
      chat_history: chat_history,
      source: response.sourceDocuments
    };

    return answer;
  }

};

export default retrieval_qa_chain;
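
Putting the three endpoints together, an end-to-end run against a local server (port 3000, as in the code above) looks like this; the file path, the Date.now()-style document ID, and the question are placeholders:

curl --location 'http://localhost:3000/upload' \
  --form 'file=@"/path/to/your-document.pdf"'
# -> { "message": "File uploaded successfully", "fileName": "1716450000000.pdf" }

curl --location 'http://localhost:3000/embedding-document?document=1716450000000.pdf'
# -> { "status": "SUCCESS" }

curl --location 'http://localhost:3000/?question=What%20is%20this%20document%20about%3F&document=1716450000000.pdf'
# -> { "answer": "...", "chat_history": [...], "source": [...] }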

Congratulations! You’ve successfully built a comprehensive RAG ecosystem on a Node server. Remember, fine-tuning the default settings can lead to even more accurate answers to your queries. This is just the first step on a long journey of innovation and exploration.

Happy Coding…
