Multi-dimensional Constraint-based Test Case Generation and Evaluation Framework for Large Language Models

Authors

  • Xuebing Wang, Yonyou Network Technology Co., Ltd., Beijing, China

DOI:

https://doi.org/10.62677/IJETAA.2504134

Keywords:

Large Language Models, Test case generation, Multi-dimensional constraints, Reinforcement learning, Functional testing

Abstract

To address the challenges of complex test case design and insufficient coverage in functional testing of large language models, this paper presents a multi-dimensional constraint-based test case generation framework. The framework defines constraint rules across four dimensions: syntactic correctness, semantic consistency, task relevance, and boundary conditions. It then employs reinforcement learning to optimize the test case generation process: by designing reward function-based generation strategies, the system automatically produces high-quality functional test samples covering core tasks such as text classification, sentiment analysis, and machine translation. Experimental results show that test cases generated by this method improve functional coverage by 42% and increase the defect detection rate by 28% compared with random generation. Ablation experiments further validate the effectiveness of each constraint dimension, providing a systematic solution for large language model quality assurance.
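
As a rough illustration of the reward-function idea described in the abstract, the sketch below scores a candidate test case along the four constraint dimensions and combines the scores into a single scalar reward that a reinforcement-learning generator could maximize. The scoring rules, weights, and function names are hypothetical placeholders for illustration only, not the paper's actual formulation.

```python
# Hypothetical sketch of a four-dimension composite reward for generated test cases.
# All scoring rules and weights below are illustrative placeholders.

def score_syntax(case: str) -> float:
    """Placeholder syntactic check: non-empty, printable text counts as well-formed."""
    return 1.0 if case.strip() and case.isprintable() else 0.0

def score_semantics(case: str) -> float:
    """Placeholder consistency proxy: penalize degenerate token repetition."""
    tokens = case.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def score_relevance(case: str, task_keywords: set[str]) -> float:
    """Placeholder task-relevance proxy: fraction of task keywords mentioned."""
    words = set(case.lower().split())
    return len(words & task_keywords) / len(task_keywords) if task_keywords else 0.0

def score_boundary(case: str, max_len: int = 512) -> float:
    """Placeholder boundary score: reward cases that approach the length limit."""
    return min(len(case), max_len) / max_len

def composite_reward(case: str, task_keywords: set[str],
                     weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the four per-dimension scores, each in [0, 1]."""
    scores = (score_syntax(case), score_semantics(case),
              score_relevance(case, task_keywords), score_boundary(case))
    return sum(w * s for w, s in zip(weights, scores))

# Example: score a candidate test case for a sentiment-analysis task.
print(composite_reward("Classify the sentiment of: 'The battery life is terrible.'",
                       {"sentiment", "classify"}))
```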

Published

2025-05-30

Issue

Vol. 2 No. 4 (2025)

Section

Research Articles

How to Cite

[1] X. Wang, “Multi-dimensional Constraint-based Test Case Generation and Evaluation Framework for Large Language Models”, ijetaa, vol. 2, no. 4, pp. 1–9, May 2025, doi: 10.62677/IJETAA.2504134.
