Establishing robust benchmarks for evaluating contextual reasoning in large language models
DOI: https://doi.org/10.36676/jrps.v16.i1.43
Keywords: Contextual reasoning, robust benchmarks, evaluation metrics, large language models, natural language processing, deep learning, AI interpretability, performance assessment.
Abstract
The growing prevalence of large language models in real-world applications necessitates a deeper understanding of their contextual reasoning capabilities. Despite impressive performance on a variety of tasks, these models often struggle to consistently interpret and integrate complex contextual information, highlighting a critical gap in current evaluation practices. This paper introduces a novel suite of robust benchmarks specifically designed to assess contextual reasoning in large language models. By incorporating diverse and challenging test cases that mirror real-world ambiguity and multi-layered context, our benchmarks aim to uncover both the strengths and limitations of these systems. Extensive experimental evaluations reveal significant variability in performance across different models, emphasizing the need for standardized, context-aware assessment tools. The insights gained from this study not only advance our understanding of contextual reasoning in AI but also provide a solid foundation for the development of next-generation models with improved interpretative and reasoning capabilities.
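To make the evaluation setup concrete, the following is a minimal sketch of how a contextual-reasoning benchmark harness could score a model. It assumes a multiple-choice item format, a simple predict-style model interface, and toy test items; none of these reflect the paper's actual benchmark suite or metrics, and all names are hypothetical.

```python
"""Minimal sketch of a contextual-reasoning benchmark harness.

Illustrative only: the item format, the model interface, and the toy
examples are assumptions, not the benchmark described in the paper.
"""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ContextItem:
    """One multiple-choice test case whose answer depends on the context."""
    context: str
    question: str
    choices: List[str]
    answer_idx: int


# Toy items standing in for the "diverse and challenging test cases"
# mentioned in the abstract; a real suite would contain many more.
ITEMS = [
    ContextItem(
        context="Ana handed the trophy to Maria because she had won the race.",
        question="Who won the race?",
        choices=["Ana", "Maria"],
        answer_idx=1,
    ),
    ContextItem(
        context="The bank was closed, so Lee walked along the river instead.",
        question="Which sense of 'bank' fits the second clause?",
        choices=["financial institution", "riverside"],
        answer_idx=1,
    ),
]

# A model is abstracted as a function from (context, question, choices)
# to the index of its chosen answer.
Model = Callable[[str, str, List[str]], int]


def evaluate(model: Model, items: List[ContextItem]) -> float:
    """Return the model's accuracy on the contextual test items."""
    correct = sum(
        model(it.context, it.question, it.choices) == it.answer_idx
        for it in items
    )
    return correct / len(items)


def context_sensitivity(model: Model, items: List[ContextItem]) -> float:
    """Accuracy drop when the context is withheld; a larger drop suggests
    the model genuinely relies on context rather than surface cues."""
    with_ctx = evaluate(model, items)
    without_ctx = evaluate(lambda _ctx, q, ch: model("", q, ch), items)
    return with_ctx - without_ctx


if __name__ == "__main__":
    # Placeholder "model" that always picks the first choice.
    baseline: Model = lambda ctx, q, ch: 0
    print(f"accuracy: {evaluate(baseline, ITEMS):.2f}")
    print(f"context sensitivity: {context_sensitivity(baseline, ITEMS):.2f}")
```

A harness along these lines would report both raw accuracy and a context-sensitivity gap per model, which is one plausible way to surface the cross-model variability the abstract describes.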
License
Copyright (c) 2025 International Journal for Research Publication and Seminar

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.