🏦 Bank Negara Malaysia · Foreign Exchange Policy

Evaluation of an AI-Assisted Policy Retrieval Chatbot

Research & Feasibility Assessment — Foreign Exchange Policy (FEP)
Can AI reliably answer FEP enquiries for internal officers?

FEP Division, BNM · March 2026 · Internal Proof-of-Concept
48
FEP Documents
434
Indexed Chunks
7
Policy Notices
100%
Guardrail Accuracy
🤖 Launch alBOrT Chatbot
Authorised BNM FEP officers only

The Bottom Line

Five findings that shaped this research.

📚

FEP is Complex by Design

Dense, fragmented across many documents, frequently updated — exactly the type of content where users struggle to find accurate answers quickly.

🚫

LLM Alone is Insufficient

A base language model has no access to the latest FEP documents, cannot cite sources, and can confidently hallucinate policy rules.

📎

RAG Enables Grounded Answers

Retrieval-Augmented Generation grounds every response in actual policy documents, with mandatory citations. It does not answer from memory.

⚠️

Risks Exist but Are Mitigable

Hallucination, misinterpretation, and regulatory exposure are real risks — but each has a defined mitigation strategy built into the architecture.

🎯

Recommendation: Internal Pilot

Start controlled and internal. Validate performance with real officers before any consideration of wider deployment.

Why FEP Enquiries Are Hard

FEP presents exactly the type of challenge where users struggle — not just to find information, but to interpret it correctly.

  • 📄

    Dense, Technical Documentation

    7 Notices, consolidated policies, preambles, MDD guides, FAQs — each requiring expert-level reading.

  • 🗂️

    Fragmented Across Sources

    Answers often span multiple documents. A single question may require cross-referencing 3–4 different PDFs.

  • 🔄

    Frequent Policy Updates

    Notices are amended regularly. Officers must track which version is current and what changed.

  • 🎓

    Requires Expert Interpretation

    Policy language is precise and context-dependent. Misreading a condition can lead to compliance risk.

  • ⏱️

    Slow Turnaround

    Officers spend significant time finding, reading, and cross-checking documents before responding to enquiries.

  • 📊

    Inconsistent Answers

    Different officers may interpret the same policy differently, creating inconsistent guidance to enquirers.

  • 👤

    Heavy Reliance on Subject Experts

    Institutional knowledge is siloed. Junior officers are bottlenecked waiting for senior review.

  • ⚖️

    Compliance Risk

    An incorrect answer to an FEP enquiry has real regulatory and reputational consequences.

Knowledge workers spend 20% of their working week — roughly 1.8 hours per day — searching for internal information and tracking down colleagues who can help.
— McKinsey Global Institute, The Social Economy

What is a Large Language Model?

Before evaluating a solution, we need to understand exactly what LLMs are — and more importantly, what they are not.

  • 🧠

    Trained on Massive Text Data

    LLMs learn from billions of text samples — books, articles, websites. They develop a statistical model of language.

  • 🔮

    Generates by Pattern Prediction

    They produce responses by predicting the most probable next word, not by "knowing" or "understanding" facts.

  • 📅

    Static Knowledge (Training Cutoff)

    Their knowledge is frozen at training time. They have no awareness of documents published or updated after that date.

  • 🎭

    Confident, Not Necessarily Correct

    LLMs can generate fluent, confident-sounding text that is factually wrong — a phenomenon called hallucination.

⚠️ Why LLM Alone Fails for Policy Use

🎲

Hallucination

Fabricates policy rules that sound plausible but are wrong

📵

No Current Docs

Unaware of the latest FEP notices and amendments

🔗

No Citations

Cannot reliably point to the specific rule it referenced

⚖️

Regulatory Risk

In a regulated domain, errors have real consequences

Domain-specific hallucination rates

  • Legal / Regulatory: 69–88%
  • General knowledge: ~9%
  • Top models (general): <1%

Source: allaboutai.com AI Hallucination Report 2026

Not All "AI Chatbots" Are Actually AI

Many systems labeled as "AI chatbots" in the market are actually deterministic, rule-based workflows. This distinction matters enormously.

Rule-Based / Decision Tree

Traditional "Chatbot"

  • Logic: Predefined scripts
  • Flexibility: Low — fixed flows only
  • Complex queries: Cannot handle
  • Updates: Manual reprogramming
  • Risk: Low, predictable
  • Tells you what it doesn't know: No — falls back to default

RAG-Based AI

What We Built

  • Logic: Semantic understanding
  • Flexibility: High — handles novel queries
  • Complex queries: Attempts with citations
  • Updates: Re-ingest new documents
  • Risk: Requires strong controls
  • Tells you what it doesn't know: Yes — by design
"If it always answers perfectly — but only for expected questions — it's probably not AI."
There's a misconception in the market. Many systems labeled as AI are actually deterministic workflows. What we are proposing here is fundamentally different — and carries both greater value and greater responsibility.

Retrieval-Augmented Generation (RAG)

RAG ensures the AI answers using actual policy documents — not memory. This is what makes it suitable for regulated environments.

💬

Officer's Question

Natural language query

🔍

Semantic Search

Find relevant document chunks

📄

Retrieved Context

Actual FEP policy text

🤖

AI Generation

Answer grounded in context only

Answer + Citations

Doc name, page, text snippet
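Conceptually, the flow above reduces to: embed the question, rank indexed chunks by similarity, and assemble a prompt that confines the LLM to the retrieved text. The sketch below illustrates this with a toy word-overlap "embedding" so it stays self-contained; the POC itself uses ChromaDB's semantic embeddings and the phi model via Ollama, and the function names here are illustrative, not the actual POC code.

```python
import math

def embed(text: str) -> dict:
    """Toy bag-of-words 'embedding' -- a stand-in for the real semantic model."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[dict], k: int = 5) -> list[dict]:
    """Semantic search: rank indexed chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]

def grounded_prompt(question: str, top: list[dict]) -> str:
    """Confine the LLM to retrieved text; citations travel with each chunk."""
    context = "\n".join(f'[{c["source"]}, p.{c["page"]}] {c["text"]}' for c in top)
    return (
        "Answer ONLY from the excerpts below. If they do not contain the "
        f"answer, reply 'I don't know.'\n\n{context}\n\nQuestion: {question}"
    )
```

In the POC, this prompt goes to the local phi model at temperature 0.1, and the `[document, page]` tags become the mandatory citations shown to the officer.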

  • 📎

    Grounded in Documents

    The LLM is only allowed to answer from the retrieved text. It cannot use its training memory.

  • 🏷️

    Mandatory Source Citations

    Every response includes the document name, page number, and the exact text passage retrieved.

  • 🔄

    No Retraining Needed

    When FEP documents are updated, simply re-index the new PDFs. No model fine-tuning required.

  • 🛑

    Knows When to Say "I Don't Know"

    If retrieved documents don't contain a confident answer, the system refuses to respond rather than guessing.

  • 🎯

    Topic-Scoped

    Questions unrelated to FEP are rejected immediately, before the LLM is even invoked.

  • 💻

    Fully Local — No Data Leaves

    The entire stack (embeddings, vector DB, LLM) runs on local hardware. No external API calls.

RAG is Battle-Tested

This is not an experimental approach. RAG is widely adopted across industries, particularly for internal knowledge access and compliance-related queries.

0%
of organisations augment LLMs with RAG
Writer Enterprise AI Report 2025
0%
of public sector employees now use AI
Gallup, Q4 2025
0%
YoY growth in vector DBs supporting RAG
Vectara, 2025
0%
of work hours lost searching for information
McKinsey Global Institute

🏦 Where RAG is Already in Use

🏦
Banking
Internal policy assistants for compliance officers
🏛️
Government
Regulatory guidance and policy retrieval tools
⚖️
Legal
Document-grounded contract and case analysis
🏥
Healthcare
Clinical protocol retrieval for practitioners

The FEP Assistant — Our POC

A fully local, internal RAG chatbot built on BNM's own FEP documents. No data leaves the machine. No external APIs.

System Architecture

Officer's Question → Topic Filter → Semantic Embeddings → ChromaDB Vector Search → Top-5 Relevant Chunks → Confidence Threshold Check → phi LLM (Ollama, local) → Answer + Citations + Disclaimer
48
PDF Documents
434
Vector Chunks
631K
Chars of Policy Text
1.6GB
LLM Size (phi)
0
External API Calls
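Re-indexing is the whole update path: extract text per page, split it into overlapping chunks that carry their source metadata, and embed the chunks into ChromaDB. A minimal chunking sketch (the chunk and overlap sizes here are illustrative, not the POC's actual settings):

```python
def chunk_page(text: str, source: str, page: int,
               size: int = 1500, overlap: int = 200) -> list[dict]:
    """Split one page of extracted PDF text into overlapping chunks.

    Overlap keeps a rule that straddles a chunk boundary retrievable from
    both sides. Each chunk carries the metadata later shown as its citation.
    """
    chunks, start = [], 0
    while start < len(text):
        piece = text[start:start + size]
        chunks.append({"text": piece, "source": source, "page": page})
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks
```

Running this over all 48 PDFs and writing the results into the vector store is the entire update step, which is why amending a Notice requires re-indexing rather than any model retraining.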

🛡️ Built-in Guardrails

🚫

Topic Filter

Non-FEP questions are rejected before the LLM is even called. Keyword + semantic check.

📏

Confidence Threshold

If no retrieved chunk meets the similarity threshold (0.42), the system says "I don't know" rather than guessing.

📎

Mandatory Citations

Every answer includes the document name, page number, and exact retrieved text. Officers can verify instantly.

🌡️

Low Temperature (0.1)

LLM temperature is set to near-zero, forcing factual, conservative responses with minimal creative generation.

⚠️

Persistent Disclaimer

Every single response carries a POC disclaimer directing users to verify with official FEP docs or contact BNM.
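Chained together, these guardrails form a short pipeline in front of the LLM. A sketch, where the 0.42 threshold mirrors the description above but the keyword list and helper names are illustrative (the POC pairs the keyword screen with a semantic check; only the keyword half is shown here):

```python
FEP_KEYWORDS = {"fep", "ringgit", "foreign", "exchange", "resident",
                "non-resident", "notice", "borrowing", "remittance"}

SIMILARITY_THRESHOLD = 0.42  # below this, refuse rather than guess

def on_topic(question: str) -> bool:
    """Cheap keyword screen -- runs before the LLM is ever called."""
    return any(w in FEP_KEYWORDS for w in question.lower().split())

def guarded_answer(question: str, best_score: float, draft: str) -> str:
    """Apply topic filter, confidence threshold, and persistent disclaimer."""
    if not on_topic(question):
        return "Out of scope: this assistant only answers FEP questions."
    if best_score < SIMILARITY_THRESHOLD:
        return "I don't know -- the FEP documents I have do not cover this."
    return draft + "\n\n[POC disclaimer: verify against the official FEP documents.]"
```

Note the ordering: the topic filter is free, the threshold check costs one retrieval, and the LLM only runs for questions that survive both.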

🧪 POC Test Results

Question · Expected Behaviour · Result

  • Foreign currency borrowing limit for resident companies · Answer with Notice 2 citation · ✅ Answered
  • QRI Programme eligibility criteria · Answer with QRI document citation · ✅ Answered
  • Can a traveller carry ringgit out of Malaysia? · Answer with Notice 6 citation · ✅ Answered
  • Non-resident rules for buying and selling FX · Answer with FAQ citation · ✅ Answered
  • Notice 7 reporting requirements for resident exporters · Answer with Notice 7 citation · ✅ Answered
  • What is the penalty for violating FEP rules? · Refuse — not in FEP documents · ❓ I Don't Know
  • "Help me write a Python script" · Reject as off-topic · 🚫 Out of Scope
  • "What is the best recipe for nasi lemak?" · Reject as off-topic · 🚫 Out of Scope

8/8 guardrails behaved as designed. "I Don't Know" is a feature — not a failure.

Risks & Mitigations

These risks are real — and must be addressed upfront, not after deployment.

🎲 Hallucination

LLM may generate plausible-sounding but incorrect policy details.

✓ RAG grounds answers in docs only. Confidence threshold forces "I don't know" on weak matches.

📖 Misinterpretation

A policy rule may be retrieved correctly but interpreted out of context.

✓ Source snippets shown in full. Officer verifies against the cited passage before acting.

⚖️ Regulatory Exposure

Incorrect guidance on FEP compliance could have legal consequences.

✓ Mandatory disclaimer on every response. System positioned as a retrieval aid, not authoritative advice.

📣 Reputational Damage

Incorrect public-facing AI responses would damage institutional credibility.

✓ Internal-only deployment. Human-in-the-loop review before any response reaches enquirers.

🗓️ Stale Documents

FEP policies change. Outdated knowledge base gives wrong answers.

✓ Re-indexing pipeline takes <10 minutes. Update process documented and repeatable.

🔒 Data Governance

Sending policy documents to external AI services creates data exposure risk.

✓ Fully local stack. No data leaves the machine. No external API calls at any stage.

Internal vs Public Deployment

Internal deployment is the only responsible starting point. The risk profiles are fundamentally different.

✅ Recommended

🔒 Internal Deployment

  • Controlled, known user base (FEP officers)
  • Users understand the POC context
  • Errors caught internally before they propagate
  • Easier to monitor and validate responses
  • Human-in-the-loop is practical
  • Lower regulatory risk
  • Fast iteration and improvement cycle
⚠️ Not Yet

🌐 Public-Facing Deployment

  • Unpredictable, diverse user queries
  • Users may treat AI response as authoritative
  • Errors have public reputational impact
  • Requires significantly more guardrails
  • Human review at scale is impractical
  • Higher regulatory and legal exposure
  • Needs formal governance framework first

Is It Feasible?

This is technically feasible today. The main challenge is operational governance — not technology.

Data Availability High ✓

All FEP policy documents are available and downloadable from bnm.gov.my/fep. 48 documents indexed.

Technology Maturity High ✓

RAG, vector databases, and local LLMs are mature, production-ready technologies used at scale globally.

Infrastructure Requirements Low ✓

POC ran on a standard MacBook. Production deployment requires a modest internal server — no GPU needed.

Integration Complexity Moderate

Connecting to internal BNM systems (if needed) adds complexity. Standalone deployment is straightforward.

Governance & Policy Moderate

Requires formal AI governance policy, user guidelines, review process, and escalation paths for uncertain answers.

Answer Accuracy Improving

The POC uses a 1.6GB model; a larger production model would significantly improve answer precision and nuance.

Proceed with an Internal Pilot

The solution is viable — but only with the right architecture and controls. A phased internal rollout is the safest and most effective path forward.

✅ Feasible — With Governance
🎯

Phase 1: Internal Pilot

Deploy to a small group of FEP officers. Collect feedback on answer quality, citation relevance, and edge cases.

📊

Phase 2: Evaluate

Measure accuracy, officer satisfaction, time savings, and "I don't know" rates. Refine the knowledge base and guardrails.

⬆️

Phase 3: Scale Internally

If Phase 2 metrics are satisfactory, broaden to the full FEP team. Upgrade the LLM model for better accuracy.

🔮

Future: Governance Review

Only after strong internal performance and a formal governance framework should any public-facing deployment be considered.

Evaluation Metrics for Pilot Phase
📏 Answer accuracy rate
🎯 Citation relevance score
⏱️ Time-to-answer vs. manual
👤 Officer satisfaction score
🛑 Safe fallback rate
🔄 Knowledge base freshness
A high safe-fallback rate is a sign the system is working correctly — not a failure.
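Assuming each pilot interaction is logged with its outcome and a reviewer verdict, the headline metrics reduce to simple counting. A sketch (the log field names here are hypothetical, not a defined BNM schema):

```python
def pilot_metrics(logs: list[dict]) -> dict:
    """Compute headline pilot metrics from interaction logs.

    Each entry is assumed to hold 'outcome' in {'answered', 'fallback',
    'rejected'} and, for answered queries, a reviewer verdict 'correct' (bool).
    """
    answered = [e for e in logs if e["outcome"] == "answered"]
    fallbacks = [e for e in logs if e["outcome"] == "fallback"]
    total = len(logs)
    return {
        "answer_accuracy": (
            sum(e["correct"] for e in answered) / len(answered) if answered else 0.0
        ),
        "safe_fallback_rate": len(fallbacks) / total if total else 0.0,
    }
```

The quantity to minimise is not the fallback rate itself but fallbacks on questions the documents do cover; fallbacks on genuinely out-of-scope questions are the system working as designed.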