ArReqConflicts

Detect and Resolve Conflicts in Arabic Software Requirements using LLMs

Requirement Conflict Benchmark (RCB): 26,545 requirement pairs generated from four source datasets (PURE, WorldVista, OPENCOSS, and UAV). (Open source)

Evaluation (zero- and few-shot LLM settings):
Fanar (Thinking Mode), GPT-4o, DeepSeek-Reasoner, and LLaMA 3.1 8B
Performance (average micro-F1 across all datasets): 93% zero-shot, 94% few-shot
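Micro-F1 pools correct and incorrect decisions over all requirement pairs before computing precision and recall. A minimal sketch of how such a score can be computed (the label strings are illustrative assumptions, not necessarily the benchmark's exact tag set):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over parallel lists of single labels per pair."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    fp = sum(1 for g, p in zip(gold, pred) if g != p)
    # In single-label classification, every wrong prediction is counted
    # once as a false positive (for the predicted class) and once as a
    # false negative (for the gold class), so fn == fp here.
    fn = fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical labels for four requirement pairs:
gold = ["Conflict", "No Conflict", "Conflict", "Conflict"]
pred = ["Conflict", "No Conflict", "No Conflict", "Conflict"]
score = micro_f1(gold, pred)  # 3 of 4 pairs correct -> 0.75
```

Note that for single-label multi-class classification, micro-F1 reduces to plain accuracy, which is why it is a common headline metric for this kind of pairwise labeling task.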

  • Evaluation of 4 LLMs
    We evaluate GPT-4o, Fanar, DeepSeek-Reasoner, and LLaMA 3.1 8B Instruct on four datasets (PURE, WorldVista, OPENCOSS, and UAV) under zero-shot and few-shot settings. See article [1] for details.
  • GitHub: Evaluation source code.

    AraREQ Datasets (Datasets only)

  • Tymaa Hammouda, Alaa Aljabari, Nagham Hamad, Mustafa Jarrar: AraREQ: A Dataset and End-to-End Conflict Detection and Resolution in Software Requirements. In Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026), Palma, Mallorca (Spain).
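In the zero-shot setting described above, each model is queried once per requirement pair with no labeled examples. A minimal sketch of how such a pairwise prompt could be assembled (the instruction wording and label set are assumptions for illustration, not the paper's exact prompts):

```python
def build_zero_shot_prompt(req1: str, req2: str) -> str:
    """Assemble a hypothetical zero-shot conflict-detection prompt
    for one requirement pair; the LLM is expected to answer with a
    single label."""
    return (
        "You are given two software requirements.\n"
        "Decide whether they conflict with each other.\n"
        "Answer with exactly one label: Conflict or No Conflict.\n\n"
        f"Requirement 1: {req1}\n"
        f"Requirement 2: {req2}\n"
        "Label:"
    )

prompt = build_zero_shot_prompt(
    "The system shall lock the account after 3 failed logins.",
    "The system shall allow unlimited login attempts.",
)
```

A few-shot variant would prepend a handful of labeled example pairs in the same format before the target pair.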