🏦 Bank Negara Malaysia · Foreign Exchange Policy

Evaluation of an AI-Assisted Policy Retrieval Chatbot

Research & Feasibility Assessment — Foreign Exchange Policy (FEP)
Can AI reliably answer FEP enquiries for internal officers?

FEP Division, BNM · March 2026 · Internal Proof-of-Concept
48
FEP Documents
434
Indexed Chunks
7
Policy Notices
100%
Guardrail Accuracy
🤖 Launch alBOrT Chatbot
Authorised BNM FEP officers only

The Bottom Line

Five findings that shaped this research.

📚

FEP is Complex by Design

Dense, fragmented across many documents, frequently updated — exactly the type of content where users struggle to find accurate answers quickly.

🚫

LLM Alone is Insufficient

A base language model has no access to the latest FEP documents, cannot cite sources, and can confidently hallucinate policy rules.

📎

RAG Enables Grounded Answers

Retrieval-Augmented Generation grounds every response in actual policy documents, with mandatory citations. It does not answer from memory.

⚠️

Risks Exist but Are Mitigable

Hallucination, misinterpretation, and regulatory exposure are real risks — but each has a defined mitigation strategy built into the architecture.

🎯

Recommendation: Internal Pilot

Start controlled and internal. Validate performance with real officers before any consideration of wider deployment.

Why FEP Enquiries Are Hard

FEP presents exactly the type of challenge where users struggle — not just to find information, but to interpret it correctly.

  • 📄

    Dense, Technical Documentation

    7 Notices, consolidated policies, preambles, MDD guides, FAQs — each requiring expert-level reading.

  • 🗂️

    Fragmented Across Sources

    Answers often span multiple documents. A single question may require cross-referencing 3–4 different PDFs.

  • 🔄

    Frequent Policy Updates

    Notices are amended regularly. Officers must track which version is current and what changed.

  • 🎓

    Requires Expert Interpretation

    Policy language is precise and context-dependent. Misreading a condition can lead to compliance risk.

  • ⏱️

    Slow Turnaround

    Officers spend significant time finding, reading, and cross-checking documents before responding to enquiries.

  • 📊

    Inconsistent Answers

    Different officers may interpret the same policy differently, creating inconsistent guidance to enquirers.

  • 👤

    Heavy Reliance on Subject Experts

    Institutional knowledge is siloed. Junior officers are bottlenecked waiting for senior review.

  • ⚖️

    Compliance Risk

    An incorrect answer to an FEP enquiry has real regulatory and reputational consequences.

Knowledge workers spend 20% of their working week — roughly 1.8 hours per day — searching for internal information and tracking down colleagues who can help.
— McKinsey Global Institute, The Social Economy

What is a Large Language Model?

Before evaluating a solution, we need to understand exactly what LLMs are — and more importantly, what they are not.

  • 🧠

    Trained on Massive Text Data

    LLMs learn from billions of text samples — books, articles, websites. They develop a statistical model of language.

  • 🔮

    Generates by Pattern Prediction

    They produce responses by predicting the most probable next word, not by "knowing" or "understanding" facts.

  • 📅

    Static Knowledge (Training Cutoff)

    Their knowledge is frozen at training time. They have no awareness of documents published or updated after that date.

  • 🎭

    Confident, Not Necessarily Correct

    LLMs can generate fluent, confident-sounding text that is factually wrong — a phenomenon called hallucination.

⚠️ Why LLM Alone Fails for Policy Use

🎲

Hallucination

Fabricates policy rules that sound plausible but are wrong

📵

No Current Docs

Unaware of the latest FEP notices and amendments

🔗

No Citations

Cannot reliably point to the specific rule it referenced

⚖️

Regulatory Risk

In a regulated domain, errors have real consequences

Domain-specific hallucination rates

  • Legal / Regulatory: 69–88%
  • General knowledge: ~9%
  • Top models (general): <1%

Source: allaboutai.com AI Hallucination Report 2026

Not All "AI Chatbots" Are Actually AI

Many systems labeled as "AI chatbots" in the market are actually deterministic, rule-based workflows. This distinction matters enormously.

Rule-Based / Decision Tree

Traditional "Chatbot"

  • Logic: Predefined scripts
  • Flexibility: Low — fixed flows only
  • Complex queries: Cannot handle
  • Updates: Manual reprogramming
  • Risk: Low, predictable
  • Tells you what it doesn't know: No — falls back to default

RAG-Based AI

What We Built

  • Logic: Semantic understanding
  • Flexibility: High — handles novel queries
  • Complex queries: Attempts with citations
  • Updates: Re-ingest new documents
  • Risk: Requires strong controls
  • Tells you what it doesn't know: Yes — by design
"If it always answers perfectly — but only for expected questions — it's probably not AI."
There's a misconception in the market. Many systems labeled as AI are actually deterministic workflows. What we are proposing here is fundamentally different — and carries both greater value and greater responsibility.

Retrieval-Augmented Generation (RAG)

RAG ensures the AI answers using actual policy documents — not memory. This is what makes it suitable for regulated environments.

💬

Officer's Question

Natural language query

🔍

Semantic Search

Find relevant document chunks

📄

Retrieved Context

Actual FEP policy text

🤖

AI Generation

Answer grounded in context only

Answer + Citations

Doc name, page, text snippet
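Conceptually, the flow above reduces to: embed the question, rank indexed chunks by similarity, and assemble a prompt that confines the LLM to the retrieved text. The sketch below illustrates this with a toy word-overlap "embedding" so it stays self-contained; the POC itself uses ChromaDB's semantic embeddings and the phi model via Ollama, and the function names here are illustrative, not the actual POC code.

```python
import math

def embed(text: str) -> dict:
    """Toy bag-of-words 'embedding' -- a stand-in for the real semantic model."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[dict], k: int = 5) -> list[dict]:
    """Semantic search: rank indexed chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]

def grounded_prompt(question: str, top: list[dict]) -> str:
    """Confine the LLM to retrieved text; citations travel with each chunk."""
    context = "\n".join(f'[{c["source"]}, p.{c["page"]}] {c["text"]}' for c in top)
    return (
        "Answer ONLY from the excerpts below. If they do not contain the "
        f"answer, reply 'I don't know.'\n\n{context}\n\nQuestion: {question}"
    )
```

In the POC, this prompt goes to the local phi model at temperature 0.1, and the `[document, page]` tags become the mandatory citations shown to the officer.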

  • 📎

    Grounded in Documents

    The LLM is only allowed to answer from the retrieved text. It cannot use its training memory.

  • 🏷️

    Mandatory Source Citations

    Every response includes the document name, page number, and the exact text passage retrieved.

  • 🔄

    No Retraining Needed

    When FEP documents are updated, simply re-index the new PDFs. No model fine-tuning required.

  • 🛑

    Knows When to Say "I Don't Know"

    If retrieved documents don't contain a confident answer, the system refuses to respond rather than guessing.

  • 🎯

    Topic-Scoped

    Questions unrelated to FEP are rejected immediately, before the LLM is even invoked.

  • 💻

    Fully Local — No Data Leaves

    The entire stack (embeddings, vector DB, LLM) runs on local hardware. No external API calls.

RAG is Battle-Tested

This is not an experimental approach. RAG is widely adopted across industries, particularly for internal knowledge access and compliance-related queries.

0%
of organisations augment LLMs with RAG
Writer Enterprise AI Report 2025
0%
of public sector employees now use AI
Gallup, Q4 2025
0%
YoY growth in vector DBs supporting RAG
Vectara, 2025
0%
of work hours lost searching for information
McKinsey Global Institute

🏦 Where RAG is Already in Use

🏦
Banking
Internal policy assistants for compliance officers
🏛️
Government
Regulatory guidance and policy retrieval tools
⚖️
Legal
Document-grounded contract and case analysis
🏥
Healthcare
Clinical protocol retrieval for practitioners

The FEP Assistant — Our POC

A fully local, internal RAG chatbot built on BNM's own FEP documents. No data leaves the machine. No external APIs.

System Architecture

Officer's Question → Topic Filter → Semantic Embeddings → ChromaDB Vector Search → Top-5 Relevant Chunks → Confidence Threshold Check → phi LLM (Ollama, local) → Answer + Citations + Disclaimer
48
PDF Documents
434
Vector Chunks
631K
Chars of Policy Text
1.6GB
LLM Size (phi)
0
External API Calls
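Re-indexing is the whole update path: extract text per page, split it into overlapping chunks that carry their source metadata, and embed the chunks into ChromaDB. A minimal chunking sketch (the chunk and overlap sizes here are illustrative, not the POC's actual settings):

```python
def chunk_page(text: str, source: str, page: int,
               size: int = 1500, overlap: int = 200) -> list[dict]:
    """Split one page of extracted PDF text into overlapping chunks.

    Overlap keeps a rule that straddles a chunk boundary retrievable from
    both sides. Each chunk carries the metadata later shown as its citation.
    """
    chunks, start = [], 0
    while start < len(text):
        piece = text[start:start + size]
        chunks.append({"text": piece, "source": source, "page": page})
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks
```

Running this over all 48 PDFs and writing the results into the vector store is the entire update step, which is why amending a Notice requires re-indexing rather than any model retraining.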

🛡️ Built-in Guardrails

🚫

Topic Filter

Non-FEP questions are rejected before the LLM is even called. Keyword + semantic check.

📏

Confidence Threshold

If no retrieved chunk meets the similarity threshold (0.42), the system says "I don't know" rather than guessing.

📎

Mandatory Citations

Every answer includes the document name, page number, and exact retrieved text. Officers can verify instantly.

🌡️

Low Temperature (0.1)

LLM temperature is set to near-zero, forcing factual, conservative responses with minimal creative generation.

⚠️

Persistent Disclaimer

Every single response carries a POC disclaimer directing users to verify with official FEP docs or contact BNM.
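Chained together, these guardrails form a short pipeline in front of the LLM. A sketch, where the 0.42 threshold mirrors the description above but the keyword list and helper names are illustrative (the POC pairs the keyword screen with a semantic check; only the keyword half is shown here):

```python
FEP_KEYWORDS = {"fep", "ringgit", "foreign", "exchange", "resident",
                "non-resident", "notice", "borrowing", "remittance"}

SIMILARITY_THRESHOLD = 0.42  # below this, refuse rather than guess

def on_topic(question: str) -> bool:
    """Cheap keyword screen -- runs before the LLM is ever called."""
    return any(w in FEP_KEYWORDS for w in question.lower().split())

def guarded_answer(question: str, best_score: float, draft: str) -> str:
    """Apply topic filter, confidence threshold, and persistent disclaimer."""
    if not on_topic(question):
        return "Out of scope: this assistant only answers FEP questions."
    if best_score < SIMILARITY_THRESHOLD:
        return "I don't know -- the FEP documents I have do not cover this."
    return draft + "\n\n[POC disclaimer: verify against the official FEP documents.]"
```

Note the ordering: the topic filter is free, the threshold check costs one retrieval, and the LLM only runs for questions that survive both.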

🧪 POC Test Results

Question · Expected Behaviour · Result

  • Foreign currency borrowing limit for resident companies · Answer with Notice 2 citation · ✅ Answered
  • QRI Programme eligibility criteria · Answer with QRI document citation · ✅ Answered
  • Can a traveller carry ringgit out of Malaysia? · Answer with Notice 6 citation · ✅ Answered
  • Non-resident rules for buying and selling FX · Answer with FAQ citation · ✅ Answered
  • Notice 7 reporting requirements for resident exporters · Answer with Notice 7 citation · ✅ Answered
  • What is the penalty for violating FEP rules? · Refuse — not in FEP documents · ❓ I Don't Know
  • "Help me write a Python script" · Reject as off-topic · 🚫 Out of Scope
  • "What is the best recipe for nasi lemak?" · Reject as off-topic · 🚫 Out of Scope

8/8 guardrails behaved as designed. "I Don't Know" is a feature — not a failure.

Risks & Mitigations

These risks are real — and must be addressed upfront, not after deployment.

🎲 Hallucination

LLM may generate plausible-sounding but incorrect policy details.

✓ RAG grounds answers in docs only. Confidence threshold forces "I don't know" on weak matches.

📖 Misinterpretation

A policy rule may be retrieved correctly but interpreted out of context.

✓ Source snippets shown in full. Officer verifies against the cited passage before acting.

⚖️ Regulatory Exposure

Incorrect guidance on FEP compliance could have legal consequences.

✓ Mandatory disclaimer on every response. System positioned as a retrieval aid, not authoritative advice.

📣 Reputational Damage

Incorrect public-facing AI responses would damage institutional credibility.

✓ Internal-only deployment. Human-in-the-loop review before any response reaches enquirers.

🗓️ Stale Documents

FEP policies change. Outdated knowledge base gives wrong answers.

✓ Re-indexing pipeline takes <10 minutes. Update process documented and repeatable.

🔒 Data Governance

Sending policy documents to external AI services creates data exposure risk.

✓ Fully local stack. No data leaves the machine. No external API calls at any stage.

Internal vs Public Deployment

Internal deployment is the only responsible starting point. The risk profiles are fundamentally different.

✅ Recommended

🔒 Internal Deployment

  • Controlled, known user base (FEP officers)
  • Users understand the POC context
  • Errors caught internally before they propagate
  • Easier to monitor and validate responses
  • Human-in-the-loop is practical
  • Lower regulatory risk
  • Fast iteration and improvement cycle
⚠️ Not Yet

🌐 Public-Facing Deployment

  • Unpredictable, diverse user queries
  • Users may treat AI response as authoritative
  • Errors have public reputational impact
  • Requires significantly more guardrails
  • Human review at scale is impractical
  • Higher regulatory and legal exposure
  • Needs formal governance framework first

Is It Feasible?

This is technically feasible today. The main challenge is operational governance — not technology.

Data Availability High ✓

All FEP policy documents are available and downloadable from bnm.gov.my/fep. 48 documents indexed.

Technology Maturity High ✓

RAG, vector databases, and local LLMs are mature, production-ready technologies used at scale globally.

Infrastructure Requirements Low ✓

POC ran on a standard MacBook. Production deployment requires a modest internal server — no GPU needed.

Integration Complexity Moderate

Connecting to internal BNM systems (if needed) adds complexity. Standalone deployment is straightforward.

Governance & Policy Moderate

Requires formal AI governance policy, user guidelines, review process, and escalation paths for uncertain answers.

Answer Accuracy Improving

The POC uses a 1.6GB model; a larger production model would significantly improve answer precision and nuance.

Proceed with an Internal Pilot

The solution is viable — but only with the right architecture and controls. A phased internal rollout is the safest and most effective path forward.

✅ Feasible — With Governance
🎯

Phase 1: Internal Pilot

Deploy to a small group of FEP officers. Collect feedback on answer quality, citation relevance, and edge cases.

📊

Phase 2: Evaluate

Measure accuracy, officer satisfaction, time savings, and "I don't know" rates. Refine the knowledge base and guardrails.

⬆️

Phase 3: Scale Internally

If Phase 2 metrics are satisfactory, broaden to the full FEP team. Upgrade the LLM model for better accuracy.

🔮

Future: Governance Review

Only after strong internal performance and a formal governance framework should any public-facing deployment be considered.

Evaluation Metrics for Pilot Phase
📏 Answer accuracy rate
🎯 Citation relevance score
⏱️ Time-to-answer vs. manual
👤 Officer satisfaction score
🛑 Safe fallback rate
🔄 Knowledge base freshness
A high safe-fallback rate is a sign the system is working correctly — not a failure.
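Assuming each pilot interaction is logged with its outcome and a reviewer verdict, the headline metrics reduce to simple counting. A sketch (the log field names here are hypothetical, not a defined BNM schema):

```python
def pilot_metrics(logs: list[dict]) -> dict:
    """Compute headline pilot metrics from interaction logs.

    Each entry is assumed to hold 'outcome' in {'answered', 'fallback',
    'rejected'} and, for answered queries, a reviewer verdict 'correct' (bool).
    """
    answered = [e for e in logs if e["outcome"] == "answered"]
    fallbacks = [e for e in logs if e["outcome"] == "fallback"]
    total = len(logs)
    return {
        "answer_accuracy": (
            sum(e["correct"] for e in answered) / len(answered) if answered else 0.0
        ),
        "safe_fallback_rate": len(fallbacks) / total if total else 0.0,
    }
```

The quantity to minimise is not the fallback rate itself but fallbacks on questions the documents do cover; fallbacks on genuinely out-of-scope questions are the system working as designed.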