RegNLP - Resources

This page is a curated overview of regulatory NLP resources. It collects publications, datasets, models, and events from the wider community to support researchers and practitioners.
If you would like to suggest additional resources, please contact tuba.gokhan@mbzuai.ac.ae.
Unless explicitly marked as a RegNLP contribution, the resources listed on this page are external and were not produced by the RegNLP team.

Publications

Proceedings of the First Workshop on Regulatory Natural Language Processing (RegNLP 2025).
Association for Computational Linguistics, Abu Dhabi, UAE.
View Anthology
Tuba Gokhan and Ted Briscoe. 2025.
Grounded Answers from Multi-Passage Regulations: Learning-to-Rank for Regulatory RAG.
In Proceedings of the Natural Legal Language Processing Workshop 2025, pages 135–146, Suzhou, China. Association for Computational Linguistics. Link
Jivitesh Jain, Nivedhitha Dhanasekaran, and Mona T. Diab. 2025.
From Complexity to Clarity: AI/NLP’s Role in Regulatory Compliance.
In Findings of the Association for Computational Linguistics: ACL 2025. Link
Tuba Gokhan, Kexin Wang, Iryna Gurevych, and Ted Briscoe. 2024.
RIRAG: Regulatory Information Retrieval and Answer Generation.
arXiv preprint arXiv:2409.05677. Link
Ilias Chalkidis et al. 2023.
LexFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development.
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Link
Sallam Abualhaija et al. 2022.
Automated Question Answering for Improved Understanding of Compliance Requirements: A Multi-Document Study.
In Proceedings of the IEEE International Requirements Engineering Conference (RE 2022). Link
Ilias Chalkidis et al. 2022.
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English.
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Link
Peter Henderson, Mark S. Krass, et al. 2022.
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset.
Advances in Neural Information Processing Systems (NeurIPS). Link

Datasets

ObliQA Dataset

A specialized dataset for obligation extraction and regulatory question answering over ADGM (Abu Dhabi Global Market) regulations.

View on GitHub

ObliQA-MP Dataset

An extension of ObliQA for regulatory information retrieval and answer generation in which each answer draws on evidence from multiple passages.

View on GitHub

Community Datasets

MultiEURLEX: A multilingual dataset of 65k EU laws in 23 official EU languages, annotated with EUROVOC concepts. [Link]
LEDGAR: A dataset for legal provision classification in contracts; the version included in LexGLUE contains 80k labeled provisions. [Link]
BillSum: A dataset for summarization of US Congressional and California state bills. [Link]
CUAD: The Contract Understanding Atticus Dataset, expert-annotated for legal contract review. [Link]
LexGLUE: A benchmark suite of seven legal NLP tasks (ECtHR, EUR-LEX, LEDGAR, UNFAIR-ToS, CaseHOLD, etc.) for legal language understanding. [Link]
Pile of Law: A 256GB corpus of diverse legal texts (case law, statutes, regulations) designed for training and evaluating legal-domain models. [Link]

Models & Code

You can find our official open-source contributions and models on the RegNLP GitHub organization:

RegNLP GitHub Organization

Selected RegNLP Repositories

ObliQADataset – full pipeline and data for the ObliQA regulatory QA benchmark.
GitHub
ObliQA-MultiPassage – multi-passage extension of ObliQA, including validation and splits.
GitHub
RePASs – Regulatory Passage Answer Stability Score, an evaluation metric for regulatory QA answers.
GitHub
MultiPassage-RegulatoryRAG – end-to-end RAG pipeline for multi-passage regulatory QA (BM25 + dense + RRF + Learning-to-Rank).
GitHub
ObligationClassifier – fine-tuning LegalBERT for obligation vs. non-obligation classification in regulatory texts.
GitHub
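
The MultiPassage-RegulatoryRAG entry above mentions RRF (Reciprocal Rank Fusion) as the step that combines BM25 and dense retrieval results. As a rough illustration only, and not the repository's actual code, the sketch below fuses two ranked lists of passage IDs. The function name, the passage IDs, and the constant k=60 (a commonly used default) are assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage IDs into a single ranking.

    Each passage receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k=60 is an assumed default, not a value
    taken from the RegNLP repositories.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical retriever outputs (best match first).
bm25_ranking = ["p3", "p1", "p7"]
dense_ranking = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Passages that rank well in both lists rise to the top, while passages found by only one retriever are kept but ranked lower. In a pipeline of the kind described, a fused list like this would typically be the candidate set passed to a learning-to-rank stage.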

Community Models

SaulLM-7B: A large language model (based on Mistral 7B) designed and fine-tuned for the legal domain.
HuggingFace Link
LegalBERT: A family of BERT models pre-trained on large-scale legal corpora (legislation, court cases, contracts).
HuggingFace Link
CaseLaw-BERT: A BERT variant pre-trained on US case law and evaluated as a baseline in the LexGLUE benchmark.
HuggingFace Link
PoL-LegalBERT: LegalBERT models trained on the Pile of Law corpus, suitable for broad legal NLP and retrieval tasks.
HuggingFace Link
Lawformer: A Longformer-based pre-trained model for long Chinese legal documents and case materials.
HuggingFace Link

Related Events and Workshops

NLLP Workshop: The Workshop on Natural Legal Language Processing (typically co-located with EMNLP/NAACL).
JURIX: The International Conference on Legal Knowledge and Information Systems.
ICAIL: International Conference on Artificial Intelligence and Law.
FinNLP: Workshop on Financial Technology and Natural Language Processing.