SOFTWARE AND SOCIETAL SYSTEMS DEPARTMENT TECHNICAL REPORT ABSTRACTS

CMU-S3D-25-108
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University

Embeddings-Assisted Requirements Extraction and Elicitation

Yuchen Shen

July 2025

Ph.D. Thesis
Software Engineering

CMU-S3D-25-108.pdf

Keywords: Requirements engineering, machine learning, natural language processing, large language models

Companies use personalization techniques to tailor user experiences. Personalization appears in search engines and online stores in the form of salutations and statistically learned correlations over search, browsing, and purchase histories. However, users have a wider variety of substantive, domain-specific preferences that influence their choices when they use directory services, and these preferences have largely been overlooked in both scientific research and state-of-the-art practice. Specifically, users have preferences about what they are looking for, and they rely on services with varying levels of personalization to help them discover items of interest. In requirements engineering (RE), requirements analysts endeavor to gather, comprehend, and prioritize requirements, with an important focus on stakeholder preferences and needs, employing diverse requirements elicitation techniques. Advances in Machine Learning (ML) and Natural Language Processing (NLP) have revolutionized the way people understand and interact with natural language, and have opened up new opportunities to enhance and automate various facets of requirements engineering, including stakeholder requirements elicitation. These technologies enable the analysis of vast amounts of textual data that were previously impractical to process manually, improving the accuracy and efficiency of understanding stakeholder needs and preferences. Organizations now have the potential to automatically extract and prioritize requirements from diverse sources that contain large amounts of natural language, which not only accelerates the requirements engineering process but also reduces the likelihood of human error, leading to more reliable and user-centered software solutions. It also allows for continuous feedback loops and real-time updates, helping the project stay aligned with evolving stakeholder demands.

Despite these benefits, much remains to be explored about how emerging technologies may be used to aid and improve both domain knowledge modeling and requirements acquisition. Human access to domain knowledge is challenging because such knowledge is often tacit and specialized. Requirements elicitation practices often see imbalances in domain knowledge between requirements analysts and stakeholders. This thesis aims to address such imbalances and to explore the potential of embeddings-assisted techniques to enhance stakeholder preference extraction and elicitation practices, particularly by using semantic information encoded in natural language embeddings to fill gaps in stakeholder knowledge, guide elicitation with the obtained knowledge, and in turn support requirements acquisition. We believe natural language embeddings can be used to improve: 1) domain knowledge modeling: identifying and obtaining requirement-related domain elements; and 2) guided elicitation: using requirement-related domain elements to refine requirements artifacts, such as interviews.

Specifically, we demonstrate the following embeddings-assisted preference extraction and elicitation methods: 1) domain knowledge modeling with user-authored scenarios: we study how stakeholder preferences are expressed in text scenarios and report our success in extracting and modeling domain knowledge from scenarios using classifiers and linkers; 2) domain knowledge modeling with MLM: we study the efficacy of identifying and obtaining requirement-related domain knowledge from word embeddings using a BERT-based Masked Language Model (MLM), with the aim of discovering missing relationships for requirements, and report our success in discovering associated actors, actions, and modifiers for constructing the domain model; 3) guided elicitation for interviews: we study a method to refine requirements artifacts, in this case interviews, by guiding real-time requirements elicitation interviews with domain knowledge extracted from the MLM, and report improved elicitation outcomes, both in eliciting more concepts and in eliciting more specific concepts; 4) interview question generation: we describe a framework that outlines common interviewer mistake types and supports evaluating follow-up question quality, and we conduct studies to test the capability of GPT-4o to generate interview questions. These studies show that minimally guided LLM-generated questions are neither better nor worse than human-authored questions with respect to clarity, relevancy, and informativeness, and that LLM-generated questions outperform human-authored questions in mistake-guided question generation. The outcome of the thesis sheds light on how embeddings-assisted techniques can be integrated into existing requirements extraction and elicitation practices to enhance and enrich them, while advancing our understanding of how stakeholder requirements may be elicited more effectively and comprehensively.

131 pages

Thesis Committee:
Travis Breaux (Chair)
Christian Kästner
Bogdan Vasilescu
Fabiano Dalpiaz (Utrecht University)

Nicolas Christin, Head, Software and Societal Systems Department
Martial Hebert, Dean, School of Computer Science
