Research & Feasibility Assessment — Foreign Exchange Policy (FEP)
Can AI reliably answer FEP enquiries for internal officers?
Five findings that shaped this research.
Dense, fragmented across many documents, frequently updated — exactly the type of content where users struggle to find accurate answers quickly.
A base language model has no access to the latest FEP documents, cannot cite sources, and can confidently hallucinate policy rules.
Retrieval-Augmented Generation grounds every response in actual policy documents, with mandatory citations. It does not answer from memory.
Hallucination, misinterpretation, and regulatory exposure are real risks — but each has a defined mitigation strategy built into the architecture.
Start controlled and internal. Validate performance with real officers before any consideration of wider deployment.
FEP presents exactly the type of challenge where users struggle — not just to find information, but to interpret it correctly.
7 Notices, consolidated policies, preambles, MDD guides, FAQs — each requiring expert-level reading.
Answers often span multiple documents. A single question may require cross-referencing 3–4 different PDFs.
Notices are amended regularly. Officers must track which version is current and what changed.
Policy language is precise and context-dependent. Misreading a condition can lead to compliance risk.
Officers spend significant time finding, reading, and cross-checking documents before responding to enquiries.
Different officers may interpret the same policy differently, creating inconsistent guidance to enquirers.
Institutional knowledge is siloed. Junior officers are bottlenecked waiting for senior review.
An incorrect answer to an FEP enquiry has real regulatory and reputational consequences.
Before evaluating a solution, we need to understand exactly what LLMs are — and more importantly, what they are not.
LLMs learn from billions of text samples — books, articles, websites. They develop a statistical model of language.
They produce responses by predicting the most probable next word, not by "knowing" or "understanding" facts.
Their knowledge is frozen at training time. They have no awareness of documents published or updated after that date.
LLMs can generate fluent, confident-sounding text that is factually wrong — a phenomenon called hallucination.
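The "predict the most probable next word" mechanism can be illustrated with a toy bigram model. This is a drastic simplification (raw word counts instead of billions of parameters), and the corpus below is invented for the example:

```python
from collections import Counter, defaultdict

# Toy bigram model: count which word follows which in a tiny corpus,
# then "generate" by always picking the most frequent successor.
# Note it has no concept of whether "approval" is actually correct;
# it only knows which continuation was statistically most common.
corpus = ("the policy requires approval "
          "the policy requires registration "
          "the policy requires approval").split()

successors = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    successors[word][nxt] += 1

def predict_next(word):
    """Return the most probable next word seen in the training data."""
    counts = successors[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("requires"))  # "approval" (seen twice vs once)
```

Scaled up enormously, this is why a base model sounds fluent yet can assert a plausible-but-wrong policy rule: it reproduces probable text, not verified facts.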
Fabricates policy rules that sound plausible but are wrong
Unaware of the latest FEP notices and amendments
Cannot reliably point to the specific rule it referenced
In a regulated domain, errors have real consequences
Many systems labeled as "AI chatbots" in the market are actually deterministic, rule-based workflows. This distinction matters enormously.
RAG ensures the AI answers using actual policy documents — not memory. This is what makes it suitable for regulated environments.
Query — natural language question from the officer
Retrieve — find relevant document chunks
Context — actual FEP policy text
Generate — answer grounded in context only
Cite — doc name, page, text snippet
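The pipeline stages above can be sketched end to end. This is a minimal illustration, not the POC implementation: a bag-of-words retriever stands in for real embeddings, the two chunks are invented, and the LLM call is replaced by returning the retrieved text with its citation:

```python
import math
from collections import Counter

# Tiny stand-in for an indexed FEP corpus: (doc name, page, chunk text).
# Document names and contents are illustrative only.
CHUNKS = [
    ("Notice-1.pdf", 4, "resident entities may obtain foreign currency borrowing subject to limits"),
    ("Notice-3.pdf", 2, "investment abroad by residents requires registration above the threshold"),
]

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, top_k=1):
    """Rank chunks by similarity to the query."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), doc, page, text)
              for doc, page, text in CHUNKS]
    return sorted(scored, reverse=True)[:top_k]

def answer(query):
    score, doc, page, text = retrieve(query)[0]
    # In the real system the retrieved text is passed to the LLM as the
    # ONLY permitted context; here we just return it with its citation.
    return {"answer": text, "source": f"{doc}, p.{page}", "score": round(score, 2)}

print(answer("Can a resident borrow in foreign currency?"))
```

The key property is visible even in the sketch: the answer can only ever be drawn from indexed policy text, and every response carries a document name and page for verification.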
The LLM is only allowed to answer from the retrieved text. It cannot use its training memory.
Every response includes the document name, page number, and the exact text passage retrieved.
When FEP documents are updated, simply re-index the new PDFs. No model fine-tuning required.
If retrieved documents don't contain a confident answer, the system refuses to respond rather than guessing.
Questions unrelated to FEP are rejected immediately, before the LLM is even invoked.
The entire stack (embeddings, vector DB, LLM) runs on local hardware. No external API calls.
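The "re-index the new PDFs" step reduces to re-running a chunking routine over the extracted text and re-embedding the chunks. A sketch of the chunking half, with illustrative parameters rather than the POC's actual settings:

```python
def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping word windows, a common RAG indexing step.

    The overlap keeps a rule that straddles a chunk boundary retrievable
    from either side. When a notice is amended, the old chunks are dropped
    and the new document is chunked and embedded the same way.
    """
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Because updating the knowledge base is just this re-index, no model retraining or fine-tuning is involved when FEP notices change.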
This is not an experimental approach. RAG is widely adopted across industries, particularly for internal knowledge access and compliance-related queries.
A fully local, internal RAG chatbot built on BNM's own FEP documents. No data leaves the machine. No external APIs.
Non-FEP questions are rejected before the LLM is even called. Keyword + semantic check.
If no retrieved chunk meets the similarity threshold (0.42), the system says "I don't know" rather than guessing.
Every answer includes the document name, page number, and exact retrieved text. Officers can verify instantly.
LLM temperature is set near zero, minimizing randomness so responses stay conservative and closely tied to the retrieved text.
Every single response carries a POC disclaimer directing users to verify with official FEP docs or contact BNM.
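The scope gate, similarity threshold, and disclaimer guardrails compose into one decision path. A simplified sketch, in which the keyword list is invented for illustration (the POC also applies a semantic check) and the LLM call is replaced by a placeholder:

```python
SIMILARITY_THRESHOLD = 0.42   # threshold used in the POC
FEP_KEYWORDS = {"fep", "foreign", "exchange", "notice", "resident", "borrowing"}  # illustrative
DISCLAIMER = "POC output: verify against official FEP documents or contact BNM."

def in_scope(query):
    """Cheap keyword gate, applied before the LLM is ever invoked."""
    return bool(FEP_KEYWORDS & set(query.lower().split()))

def guarded_answer(query, retrieved):
    """retrieved: list of (similarity_score, chunk_text) from the vector DB."""
    if not in_scope(query):
        return "Out of scope: this assistant only answers FEP questions."
    best = max(retrieved, default=(0.0, ""))
    if best[0] < SIMILARITY_THRESHOLD:
        return "I don't know. No policy passage matched confidently. " + DISCLAIMER
    # Here the real system calls the local LLM with the best chunks as context.
    return f"[answer grounded in: {best[1]!r}] {DISCLAIMER}"
```

Note the ordering: the cheapest check runs first, the refusal path fires before any generation, and even a confident answer still carries the disclaimer.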
8/8 guardrails behaved as designed. "I Don't Know" is a feature — not a failure.
These risks are real — and must be addressed upfront, not after deployment.
LLM may generate plausible-sounding but incorrect policy details.
A policy rule may be retrieved correctly but interpreted out of context.
Incorrect guidance on FEP compliance could have legal consequences.
Incorrect responses from a public-facing AI would damage institutional credibility.
FEP policies change. Outdated knowledge base gives wrong answers.
Sending policy documents to external AI services creates data exposure risk.
Internal deployment is the only responsible starting point. The risk profiles are fundamentally different.
This is technically feasible today. The main challenge is operational governance — not technology.
All FEP policy documents are available and downloadable from bnm.gov.my/fep. 48 documents indexed.
RAG, vector databases, and local LLMs are mature, production-ready technologies used at scale globally.
POC ran on a standard MacBook. Production deployment requires a modest internal server — no GPU needed.
Connecting to internal BNM systems (if needed) adds complexity. Standalone deployment is straightforward.
Requires formal AI governance policy, user guidelines, review process, and escalation paths for uncertain answers.
POC uses a 1.6GB model. Production with a larger model significantly improves answer precision and nuance.
The solution is viable — but only with the right architecture and controls. A phased internal rollout is the safest and most effective path forward.
Deploy to a small group of FEP officers. Collect feedback on answer quality, citation relevance, and edge cases.
Measure accuracy, officer satisfaction, time savings, and "I don't know" rates. Refine the knowledge base and guardrails.
If Phase 2 metrics are satisfactory, broaden to the full FEP team. Upgrade the LLM model for better accuracy.
Only after strong internal performance and a formal governance framework should any public-facing deployment be considered.