Establishing robust benchmarks for evaluating contextual reasoning in large language models. JRPS. 2025;16(1):215-228. doi:10.36676/jrps.v16.i1.43