ASTER: Natural and Multi-Language Unit Test Generation with LLMs
Published in Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)
🏆 Distinguished Paper Award at ICSE-SEIP 2025
Implementing automated unit tests is an important but time-consuming activity in software development. While many automated test generation (ATG) techniques have been developed over the decades, their tests suffer from poor readability and do not resemble developer-written tests, inhibiting their adoption in practice. This paper presents ASTER along with a rigorous investigation of how Large Language Models (LLMs) can bridge this gap.
The Problem
Automated test generation tools have achieved considerable success in generating high-coverage test suites with good fault-detection ability, but they have several key limitations:
Limitations of Existing ATG Tools
Poor Test Naturalness:
- ❌ Tests lack readability and comprehensibility
- ❌ Non-meaningful test and variable names (e.g., test0(), stringArray1)
- ❌ Trivial or ineffective assertions
- ❌ Contain anti-patterns and test smells
- ❌ Do not resemble developer-written tests
Developer Perception Study Results:
- EvoSuite (Java): 42% of developers would NOT add tests to regression suite, 21% only after significant changes
- CodaMosa (Python): 73% would not add tests, 14% only after significant changes
- Consequence: Developers find automatically generated tests hard to maintain and are reluctant to adopt them
Limited Language Support:
- Most tools work for very few programming languages
- Significant engineering effort required for each new language
- Complex applications requiring mocking often not supported
Our Approach: ASTER
ASTER (Automated Software TEsting with LLMs for improved Readability) is a generic pipeline that incorporates static analysis to guide LLMs in generating compilable, high-coverage, and natural test cases.
Key Innovation
Unlike existing approaches that rely on:
- Symbolic analysis (Pex, KLEE)
- Search-based techniques (EvoSuite, Pynguin)
- Random testing (Randoop)
ASTER leverages:
- LLM’s inherent ability to synthesize natural-looking code
- Static program analysis to provide rich context
- Multi-language support through LLM’s intrinsic knowledge
- Mocking capability for complex enterprise applications
Multi-Language Support
ASTER is implemented for:
- Java (both Java SE and Java EE applications)
- Python
This demonstrates the feasibility of building multilingual unit test generators with LLMs guided by lightweight program analysis.
Architecture and Workflow
Source Code → Preprocessing (Analysis) → LLM Prompting (Generation) → Postprocessing (Repair) → Enhanced Test Suite
Three-Phase Pipeline
1. Preprocessing Phase: Performs static analysis to extract context for LLM prompts:
- Testing scope identification: Public/protected methods, inherited implementations
- Constructor chains: Relevant constructors for creating required objects
- Auxiliary methods: Getters/setters for object state manipulation
- Private method call chains: Enabling indirect testing of private methods
- Mocking candidates: Fields, types, and methods requiring mocking
2. LLM Prompting: Stage-specific prompts for:
- ① Initial test generation
- ② Test repair for compilation/runtime errors
- ③ Coverage augmentation
3. Postprocessing Phase:
- Output sanitization: Remove extraneous content, ensure syntax correctness
- Error remediation: Rule-based fixes + LLM-guided repair
- Coverage augmentation: Target uncovered code lines
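To make the flow concrete, here is a minimal Java sketch of how such a three-phase pipeline could be wired together. All names in it (TestGenerationPipeline, LlmClient, FocalContext, and the helper stubs) are illustrative assumptions, not ASTER's actual API:

import java.util.List;

// Minimal sketch of a preprocess → prompt → postprocess loop.
// All names here are illustrative assumptions, not ASTER's actual API.
public class TestGenerationPipeline {

    // Stand-in for whichever LLM backend is plugged in (GPT-4, Granite, Llama, ...)
    interface LlmClient { String complete(String prompt); }

    // Phase 1 output: context gathered by static analysis for one focal method
    record FocalContext(String focalMethod, List<String> constructors,
                        List<String> auxiliaryMethods, List<String> mockCandidates) {}

    FocalContext preprocess(String focalMethod) {
        // In ASTER this step resolves testing scope, constructor chains,
        // getters/setters, private-method call chains, and mocking candidates.
        return new FocalContext(focalMethod, List.of(), List.of(), List.of());
    }

    // Phase 2: assemble a stage-specific prompt and query the model
    String generate(FocalContext ctx, LlmClient llm) {
        String prompt = "Generate a JUnit test for:\n" + ctx.focalMethod()
                + "\nRelevant constructors:\n" + String.join("\n", ctx.constructors())
                + "\nCandidates to mock:\n" + String.join("\n", ctx.mockCandidates());
        return llm.complete(prompt);
    }

    // Phase 3: sanitize the raw output, then repair compilation errors
    String postprocess(String rawOutput, LlmClient llm) {
        // Output sanitization: drop markdown fences and surrounding prose
        String test = rawOutput.replace("```java", "").replace("```", "").trim();
        for (int attempt = 0; attempt < 3 && !compiles(test); attempt++) {
            // LLM-guided repair; rule-based fixes would be tried first
            test = llm.complete("Fix the compilation errors in this test:\n" + test);
        }
        return test; // coverage augmentation would target uncovered lines next
    }

    private boolean compiles(String source) {
        return true; // stub; a real pipeline would invoke javax.tools.JavaCompiler
    }
}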
Mocking Support
ASTER incorporates comprehensive mocking capability for Java:
Systematic Mocking Identification:
- Fields and Types: Identifies mockable APIs in focal class and method parameters
- Method Stubs: Determines scope for “when-then” clauses
Supported Mocking Patterns:
- Field mocking
- Constructor mocking
- Static API mocking
- Non-static API mocking
Example: Apache Commons JXPath test with mocking:
@Test
public void testisLeaf() throws Exception {
    // Mock the DOM node so that it reports having no children
    Node node = mock(Node.class);
    DOMNodePointer domnodepointer = new DOMNodePointer(node, new Locale("en"), "id");
    when(node.hasChildNodes()).thenReturn(false);
    // A node without children is a leaf
    assertTrue(domnodepointer.isLeaf());
}
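The example above mocks a dependency passed to the focal class's constructor. For contrast, here is a hedged sketch of the static-API-mocking pattern using Mockito's MockedStatic; the classes StaticHolder and Greeter and their methods are hypothetical, invented for this sketch rather than taken from the paper:

// Hypothetical illustration of static-API mocking with Mockito.
// StaticHolder, currentUser(), and Greeter are invented names;
// mockStatic requires mockito-inline or mockito-core 3.4+.
@Test
public void testGreetUsesCurrentUser() {
    try (MockedStatic<StaticHolder> mocked = mockStatic(StaticHolder.class)) {
        // Stub the static call for the scope of this try-with-resources block
        mocked.when(StaticHolder::currentUser).thenReturn("alice");
        assertEquals("Hello, alice", new Greeter().greet());
    }
}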
Evaluation Setup
Models Evaluated (6 LLMs)
| Model | Provider | Size | Type | License |
|---|---|---|---|---|
| GPT-4-turbo | OpenAI | 1.76T†| Generic | Closed |
| Llama3-70b | Meta | 70B | Generic | Open |
| CodeLlama-34b | Meta | 34B | Code | Open |
| Granite-34b | IBM | 34B | Code | Open |
| Llama3-8b | Meta | 8B | Generic | Open |
| Granite-8b | IBM | 8B | Code | Open |
† Estimated parameter count; not officially disclosed by OpenAI.
Datasets
Java SE Applications (4):
- Apache Commons CLI (310 methods)
- Apache Commons Codec (977 methods)
- Apache Commons Compress (5,003 methods)
- Apache Commons JXPath (1,501 methods)
Java EE Applications (4):
- CargoTracker (1,074 methods)
- DayTrader (1,481 methods)
- PetClinic (238 methods)
- App X (proprietary enterprise app, 1,402 methods)
Python Applications (283 modules):
- 20 projects from CodaMosa benchmark
- 2,216 methods total
Baselines
- Java: EvoSuite 1.2.0 (state-of-the-art evolutionary testing)
- Python: CodaMosa (latest LLM-enhanced search-based testing)
Evaluation Results
RQ1: Code Coverage Effectiveness
Java SE Applications:
ASTER is very competitive with EvoSuite:
| Application | Line Coverage | Branch Coverage | Method Coverage |
|---|---|---|---|
| Commons CLI | EvoSuite: 86% / ASTER: 82% | EvoSuite: 71% / ASTER: 74% | EvoSuite: 90% / ASTER: 88% |
| Commons Codec | EvoSuite: 84% / ASTER: 77% | EvoSuite: 76% / ASTER: 69% | EvoSuite: 89% / ASTER: 86% |
| Commons Compress | Similar low coverage for both tools (complex compression APIs) | | |
| Commons JXPath | EvoSuite: 13% / ASTER: 66% | EvoSuite: 11% / ASTER: 51% | EvoSuite: 16% / ASTER: 67% |
Key Finding: ASTER significantly outperforms on JXPath (4-5x better) due to mocking support.
Java EE Applications:
ASTER significantly outperforms EvoSuite:
| Application | Improvement |
|---|---|
| DayTrader | +26.4% line, +10.6% branch, +18.5% method |
| App X | +7x line coverage, +84x branch coverage |
Python Applications:
| Metric | CodaMosa | ASTER (GPT-4) | Improvement |
|---|---|---|---|
| Line Coverage | 44.0% | 78.0% | +77% |
| Branch Coverage | 53.7% | 77.2% | +44% |
| Method Coverage | 61.2% | 86.7% | +42% |
Finding: ASTER substantially outperforms CodaMosa across all metrics.
Model Performance Comparison
Surprising Result: Smaller models (Granite-34b, Llama3-8b) perform competitively with larger models:
- Granite-34b: only -0.1% line coverage vs. GPT-4
- Llama3-8b: only -1.35% line coverage vs. GPT-4
Implications:
- Cost-effective alternative to GPT-4
- Suitable for enterprise deployment (on-premise, privacy-preserving)
- Reduced carbon footprint
- Can operate on workstations
RQ2: Developer Perception
Survey Methodology:
- 161 professional software developers
- 9 focal methods with test pairs
- Questions on comprehensibility and usability
Survey Results:
Java Tests:
Question: "I understand what this test case is doing"
- ASTER: 91.6% positive (agree/strongly agree)
- EvoSuite: 67.7% positive
- Developer-written: 69.1% positive
Python Tests:
Question: "I understand what this test case is doing"
- ASTER: 100% positive
- CodaMosa: 45.1% positive
- Developer-written: 84.0% positive
Readiness for Regression Suite:
| Tool | Add without change | Minor changes | Significant changes | Will not add |
|---|---|---|---|---|
| ASTER (Java) | 30% | 40% | 22% | 8% |
| EvoSuite | 8% | 29% | 21% | 42% |
| ASTER (Python) | 48% | 40% | 10% | 2% |
| CodaMosa | 2% | 10% | 16% | 72% |
Key Finding: 70% (Java) and 88% (Python) of developers would add ASTER tests with no or minor changes.
RQ3: Test Naturalness
Automated Naturalness Metrics:
1. Test Name Meaningfulness:
- ASTER: ~15% more natural than EvoSuite test names
- ASTER: ~23% more natural than developer-written test names
- Metric captures focal-method inclusion and semantic relevance (a rough sketch of such a check follows this list)
2. Variable Name Meaningfulness:
EvoSuite Example:
String[] stringArray1 = ...
ASTER Example:
String[] flattenedArguments = ...
3. Assertion Quality:
- ASTER: 99% of Python tests have assertions (vs. 0% for CodaMosa baseline)
- ASTER: Fewer exception-related tests than EvoSuite (38% reduction)
- ASTER: More meaningful assertions checking expected transformations
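As a rough illustration of how a name-meaningfulness check might work, the following is an assumption-based sketch, not the paper's actual metric; it only tests whether a test name lexically embeds the focal method's name, whereas the paper's metric also considers semantic relevance:

// Assumption-based sketch of a test-name meaningfulness check.
// The paper's actual metric also uses semantic relevance, which this
// simple lexical check does not capture.
static boolean mentionsFocalMethod(String testName, String focalMethod) {
    // Normalize underscores and numeric suffixes before comparing
    String normalizedTest = testName.replaceAll("[_0-9]", "").toLowerCase();
    String normalizedFocal = focalMethod.toLowerCase();
    return normalizedTest.contains(normalizedFocal);
}
// Example: mentionsFocalMethod("testFlattenWithShortOption_c5_o0", "flatten") -> true
//          mentionsFocalMethod("test0", "flatten") -> false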
Example Comparison:
EvoSuite Generated:
@Test
public void test0() throws Throwable {
BasicParser basicParser0 = new BasicParser();
String[] stringArray0 = new String[4];
String[] stringArray1 = basicParser0.flatten(
(Options)null, stringArray0, true);
assertEquals(4, stringArray1.length);
}
ASTER Generated:
@Test
public void testFlattenWithShortOption_c5_o0() {
Options options = new Options();
options.addOption("option1", "o", true, "description1");
String[] arguments = new String[] {
"-o", "value1", "--option2", "value2"
};
BasicParser parser = new BasicParser();
String[] flattenedArguments = parser.flatten(
options, arguments, false);
assertArrayEquals(new String[] {
"-o", "value1", "-option2", "value2"
}, flattenedArguments);
}
Improvements:
- ✅ Meaningful test name describing functionality
- ✅ Descriptive variable names (flattenedArguments vs. stringArray1)
- ✅ Assertion checks expected transformation, not just array length
Key Advantages of ASTER
1. Superior Naturalness:
- Meaningful test and variable names
- Quality assertions with semantic understanding
- Resembles developer-written tests
- High developer acceptance (70%+ Java, 88%+ Python)
2. Competitive/Superior Coverage:
- Java SE: Matches EvoSuite (-0.5% to +5.1%)
- Java EE: Significantly outperforms (+10.6% to +84x)
- Python: Substantially better than CodaMosa (+42% to +77%)
3. Multi-Language Support:
- Java (SE and EE)
- Python
- Extensible to other languages with lightweight analysis
4. Mocking Capability:
- Systematic identification of mocking candidates
- Supports complex enterprise applications
- Handles database operations, services, libraries
5. Model Flexibility:
- Works with 6 different LLMs
- Smaller models competitive with GPT-4
- Pluggable architecture
6. Cost-Effective:
- Smaller models (8B-34B) perform well
- On-premise deployment possible
- Privacy-preserving for enterprise use
Industry Deployment
ASTER’s Java test generation capability is offered in:
- IBM watsonx Code Assistant for Enterprise Java Applications
This demonstrates real-world applicability and industrial validation.
Lessons Learned
1. Multi-Language/Framework Support
LLMs combined with lightweight program analysis offer versatile support for multiple languages and frameworks:
- Extends to new languages with minimal engineering effort
- Leverages LLM’s intrinsic knowledge of syntax and semantics
- Critical for legacy and continuous development contexts
2. Naturalness is Achievable
LLMs trained on developer-written code inherently produce natural output:
- Superior to conventional ATG tools
- When combined with program analysis, achieves high coverage AND naturalness
- Research directions: enhance conventional tests, incorporate naturalness in pretraining
3. Affordability Through Smaller Models
Smaller models (8B-34B parameters) are competitive:
- Cost-effective alternatives to GPT-4
- Enable on-premise/local deployment
- Address privacy concerns in enterprise settings
- Future: develop efficient, quantized test-generation-specific models
Related Work Comparison
vs. Conventional ATG:
- Symbolic analysis (KLEE, Pex)
- Search-based (EvoSuite, Pynguin)
- Random testing (Randoop)
- ASTER: Leverages LLMs for naturalness + multi-language support
vs. Other LLM-based Testing:
- AthenaTest: Fine-tuning pipeline
- TestPilot: JavaScript/TypeScript specific
- ChatUnitTest/ChatTester: GPT-specific, lacks program analysis
- Recent arXiv work: Builds on CodaMosa (lower baseline), single language
- ASTER: Multi-language, from-scratch generation, higher coverage, program analysis guidance
vs. Industry Studies:
- Meta study [Alshahwan et al. 2024]: 73% acceptance rate
- ASTER: 70%+ (Java), 88%+ (Python) acceptance
- Common finding: Naturalness key to acceptance
- ASTER contribution: Mocking support, multi-language, model size comparison, rigorous naturalness study
Future Research Directions
1. Extended Language Support
- Apply to more programming languages (C++, Go, Rust)
- Different testing levels (integration, system testing)
2. Fine-Tuned Models
- Create test-generation-specific models
- Train on APIs-guru dataset (4,000+ specifications)
- Reduce cost of LLM interactions
- Lightweight models for commodity CPUs
3. Enhanced Fault Detection
- Beyond code coverage: fault-detection ability
- Semantic error detection
- Security vulnerability identification
- Mutation testing integration
4. RAG-Based Approaches
- Incorporate retrieval-augmented generation
- In-context learning with relevant examples
- Project-specific test pattern learning
5. Assertion Improvement
- Reduce tests without assertions
- Generate more meaningful assertions
- Property-based testing integration
Key Contributions
- ASTER Technique: Generic LLM-assisted test generation pipeline guided by static analysis
- Multi-Language Implementation: Demonstrated for Java and Python
- Mocking Support: Systematic approach for complex applications
- Comprehensive Evaluation: 6 LLMs, 8 Java applications, and 283 Python modules
- Superior Results: Competitive coverage + significantly better naturalness
- Developer Study: 161 professionals confirming developer preference
- Naturalness Analysis: Automated metrics + qualitative assessment
- Industry Deployment: IBM watsonx Code Assistant
- Open-Source Artifact: Available for research community
Implications
For Practitioners:
- Dramatically improved test quality and usability
- Tests ready for regression suites with minimal changes
- Support for complex enterprise applications
- Multi-language capability
For Researchers:
- Demonstrates LLM effectiveness for test generation
- Smaller models viable for resource-constrained settings
- Opens directions for naturalness-focused research
- Provides baseline and artifact for future work
For Tool Developers:
- Clear integration path for LLMs in testing tools
- Program analysis enhances LLM effectiveness
- Pluggable architecture supports different models
- Scalable solution across languages
BibTeX
@INPROCEEDINGS{pan2025aster,
author={Pan, Rangeet and Kim, Myeongsoo and Krishna, Rahul and Pavuluri, Raju and Sinha, Saurabh},
booktitle={2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)},
title={ASTER: Natural and Multi-Language Unit Test Generation with LLMs},
year={2025},
pages={413-424},
doi={10.1109/ICSE-SEIP66354.2025.00042},
address={Ottawa, ON, Canada},
month={May},
publisher={IEEE Computer Society},
keywords={Java;Large language models;Pipelines;Static analysis;Software;Test pattern generators;Standards;Software engineering;Python;Software development management}
}
Recommended citation: R. Pan, M. Kim, R. Krishna, R. Pavuluri and S. Sinha, "ASTER: Natural and Multi-Language Unit Test Generation with LLMs," in 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Ottawa, ON, Canada, 2025, pp. 413-424, doi: 10.1109/ICSE-SEIP66354.2025.00042.