Resources
Publications, Datasets, and Models

Publications


  • Proceedings of the First Workshop on Regulatory Natural Language Processing (RegNLP 2025).
    Association for Computational Linguistics, Abu Dhabi, UAE.
    View Anthology
  • Tuba Gokhan and Ted Briscoe. 2025.
    Grounded Answers from Multi-Passage Regulations: Learning-to-Rank for Regulatory RAG.
    In Proceedings of the Natural Legal Language Processing Workshop 2025, pages 135–146, Suzhou, China. Association for Computational Linguistics. Link
  • Jivitesh Jain, Nivedhitha Dhanasekaran, and Mona T. Diab. 2025.
    From Complexity to Clarity: AI/NLP’s Role in Regulatory Compliance.
    Findings of the Association for Computational Linguistics: ACL 2025. Link
  • Tuba Gokhan, Kexin Wang, Iryna Gurevych, and Ted Briscoe. 2024.
    RIRAG: Regulatory Information Retrieval and Answer Generation.
    arXiv preprint arXiv:2409.05677. Link
  • Ilias Chalkidis et al. 2023.
    LexFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development.
    In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Link
  • Sallam Abualhaija et al. 2022.
    Automated Question Answering for Improved Understanding of Compliance Requirements: A Multi-Document Study.
    In Proceedings of the IEEE International Requirements Engineering Conference (RE 2022). Link
  • Ilias Chalkidis et al. 2022.
    LexGLUE: A Benchmark Dataset for Legal Language Understanding in English.
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Link
  • Peter Henderson, Mark S. Krass, et al. 2022.
    Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset.
    Advances in Neural Information Processing Systems (NeurIPS). Link

Datasets


ObliQA Dataset

A specialized dataset for RegNLP research, focused on obligation extraction and regulatory question answering over Abu Dhabi Global Market (ADGM) regulations.

View on GitHub
ObliQA-MP Dataset

An extension of ObliQA for regulatory information retrieval and answer generation, where answers draw on evidence from multiple passages.

View on GitHub
Community Datasets
  • MultiEURLEX: A multilingual dataset of 65k EU laws in 23 official EU languages, annotated with EUROVOC concepts. [Link]
  • LEDGAR: A dataset for legal provision classification in contracts, containing over 80k labeled provisions (part of LexGLUE). [Link]
  • BillSum: A dataset for summarization of US Congressional and California state bills. [Link]
  • CUAD: The Contract Understanding Atticus Dataset, expert-annotated for legal contract review. [Link]
  • LexGLUE: A benchmark suite of seven legal NLP tasks (ECtHR Tasks A and B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, and CaseHOLD) for legal language understanding. [Link]
  • Pile of Law: A 256GB corpus of diverse legal texts (case law, statutes, regulations) designed for training and evaluating legal-domain models. [Link]

Models & Code


You can find our official open-source contributions and models on the RegNLP GitHub organization:

RegNLP GitHub Organization
Selected RegNLP Repositories
  • ObliQADataset – full pipeline and data for the ObliQA regulatory QA benchmark.
    GitHub
  • ObliQA-MultiPassage – multi-passage extension of ObliQA, including validation and splits.
    GitHub
  • RePASs – Regulatory Passage Answer Stability Score, an evaluation metric for regulatory QA answers.
    GitHub
  • MultiPassage-RegulatoryRAG – end-to-end RAG pipeline for multi-passage regulatory QA (BM25 + dense + RRF + Learning-to-Rank).
    GitHub
  • ObligationClassifier – fine-tuning LegalBERT for obligation vs. non-obligation classification in regulatory texts.
    GitHub
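
The MultiPassage-RegulatoryRAG entry above mentions fusing BM25 and dense retrieval with RRF (Reciprocal Rank Fusion). As a rough illustration of that one step, the sketch below fuses two ranked lists with the standard RRF formula, score(d) = sum over rankers of 1 / (k + rank of d). It is not taken from the repository: the function name, the passage IDs, and the choice k = 60 (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of passage IDs into a single ranking.

    Each list contributes 1 / (k + rank) to a passage's score, with ranks
    starting at 1. Passages missing from a list contribute nothing for it.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by passage ID for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical retriever outputs (passage IDs are made up for illustration).
bm25_ranking = ["p3", "p1", "p7"]
dense_ranking = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

In a pipeline like the one described, the fused list would typically be passed on to a learning-to-rank model for final reordering; that stage is not shown here.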
Community Models
  • SaulLM-7B: A large language model (based on Mistral 7B) designed and fine-tuned for the legal domain.
    HuggingFace Link
  • LegalBERT: A family of BERT models pre-trained on large-scale legal corpora (legislation, court cases, contracts).
    HuggingFace Link
  • CaseLaw-BERT: A BERT variant trained on US case law, used widely in LexGLUE benchmarks for case-based reasoning tasks.
    HuggingFace Link
  • PoL-LegalBERT: LegalBERT models trained on the Pile of Law corpus, suitable for broad legal NLP and retrieval tasks.
    HuggingFace Link
  • Lawformer: A Longformer-style transformer pre-trained for long Chinese legal documents and case materials.
    HuggingFace Link

Related Events and Workshops


  • NLLP Workshop: The Workshop on Natural Legal Language Processing (typically co-located with EMNLP/NAACL).
  • JURIX: The International Conference on Legal Knowledge and Information Systems.
  • ICAIL: The International Conference on Artificial Intelligence and Law.
  • FinNLP: The Workshop on Financial Technology and Natural Language Processing.