LlamaRestTest: Effective REST API Testing with Small Language Models

Published in Proceedings of the ACM on Software Engineering, Volume 2, FSE, 2025

Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. Although Large Language Models (LLMs) have shown promising test-generation abilities, their application to REST API testing remains mostly unexplored. This paper presents LlamaRestTest, a novel approach that demonstrates how small language models can perform as well as, or better than, large language models in REST API testing.

Key Innovation

LlamaRestTest employs two custom LLMs created by fine-tuning and quantizing the Llama3-8B model:

  1. LlamaREST-IPD: Identifies inter-parameter dependencies by analyzing parameter descriptions and server responses
  2. LlamaREST-EX: Generates semantically valid test inputs that conform to API domain requirements
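Inter-parameter dependencies follow the rule types catalogued in Martin-Lopez's work (e.g., Requires, Or, OnlyOne, AllOrNone). As a minimal sketch of what a detected rule means in practice, here is how an OnlyOne dependency might be enforced when assembling a request; the rule and parameter names are illustrative, not taken from the paper:

```python
def satisfies_only_one(request_params: dict, group: tuple) -> bool:
    """Model an OnlyOne(p1, p2, ...) inter-parameter dependency:
    the request is valid only when exactly one member of the group is set."""
    present = [p for p in group if p in request_params]
    return len(present) == 1

# Hypothetical API where `username` and `email` are mutually
# exclusive identifiers for the same lookup operation.
assert satisfies_only_one({"email": "a@b.com"}, ("username", "email"))
assert not satisfies_only_one({"username": "u", "email": "a@b.com"},
                              ("username", "email"))
```

A test generator that knows this rule can avoid wasting requests that would be rejected outright for violating the constraint.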

These models were fine-tuned using a comprehensive dataset combining:

  • Martin-Lopez’s dependency database (551 parameters with IPDs)
  • Over 1.8 million API operation parameters mined from APIs-guru

Approach

Fine-Tuning Strategy:

  • Used Parameter-Efficient Fine-Tuning (PEFT) with QLoRA technique
  • Trained on custom tokenized formats with special tokens for IPD rules and example values
  • Applied early stopping to prevent overfitting (35 epochs for LlamaREST-EX, 51 for LlamaREST-IPD)
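The training samples pair a parameter's context with the expected output, wrapped in special tokens so the model learns where IPD rules and example values begin and end. A sketch of what such a formatter might look like; the token names and rule syntax below are illustrative, not the vocabulary used in the paper:

```python
# Hypothetical special tokens; the paper's actual token set may differ.
IPD_OPEN, IPD_CLOSE = "<ipd>", "</ipd>"
EX_OPEN, EX_CLOSE = "<example>", "</example>"

def make_ipd_sample(description: str, rule: str) -> str:
    """Format one LlamaREST-IPD fine-tuning sample: description -> IPD rule."""
    return f"{description} {IPD_OPEN}{rule}{IPD_CLOSE}"

def make_ex_sample(description: str, value: str) -> str:
    """Format one LlamaREST-EX fine-tuning sample: description -> example value."""
    return f"{description} {EX_OPEN}{value}{EX_CLOSE}"

print(make_ipd_sample(
    "Parameter 'radius' is required if 'center' is present.",
    "IF center THEN radius",
))
print(make_ex_sample("User email address.", "alice@example.com"))
```

Delimiting the target span this way gives the decoder an unambiguous stopping point, which simplifies extracting the rule or value from generated text.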

Quantization Exploration:

  • Evaluated 2-bit, 4-bit, and 8-bit quantization levels
  • Balanced model-size reduction against accuracy loss
  • 8-bit quantization achieved optimal effectiveness-efficiency tradeoff
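A back-of-the-envelope estimate of weight storage for an 8B-parameter model at each quantization level helps explain the tradeoff (raw weights only, ignoring quantization metadata and activation memory):

```python
def approx_weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Rough weight-storage estimate: parameters * bits, converted to GB."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{approx_weight_size_gb(8e9, bits):.1f} GB")
# A 16-bit baseline of ~16 GB shrinks to ~8, ~4, and ~2 GB
# at 8-, 4-, and 2-bit precision.
```

The shrinking footprint is what makes CPU-only deployment feasible, while the accuracy results above show where compression starts to cost effectiveness.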

Integration with ARAT-RL:

  • Leverages reinforcement learning for adaptive API exploration
  • Uses ε-greedy strategy to balance exploration and exploitation
  • Activates LLMs upon repeated failures to refine parameter selection
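The ε-greedy step in ARAT-RL's exploration loop can be sketched as follows; the operation names and value estimates are made up for illustration:

```python
import random

def pick_operation(q_values: dict, epsilon: float,
                   rng: random.Random) -> str:
    """Epsilon-greedy selection: explore a random operation with
    probability epsilon, otherwise exploit the highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)

q = {"GET /users": 0.7, "POST /orders": 0.3, "GET /items": 0.5}
rng = random.Random(0)
# With epsilon = 0 the choice is purely greedy:
assert pick_operation(q, 0.0, rng) == "GET /users"
```

In LlamaRestTest this loop is the scaffolding: the fine-tuned models are only consulted when repeated failures suggest the current parameter choices need refinement.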

Dynamic Server Feedback:

  • Analyzes server response messages in real-time
  • Refines test inputs based on error messages and API behavior
  • Discovers hidden dependencies not documented in specifications
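A simplified sketch of that feedback loop: on an error response, the server's message is used to decide which parameter to repair and regenerate. In LlamaRestTest the repair decision comes from the fine-tuned models; the regex parsing and message format here are a naive, hypothetical stand-in:

```python
import re

def refine_request(params: dict, error_message: str) -> dict:
    """Naive stand-in for LLM-driven repair: read the server's error
    message and mark the offending parameter for regeneration."""
    updated = dict(params)
    # Hypothetical message shape: "parameter 'email' is invalid"
    m = re.search(r"parameter '(\w+)' is (missing|invalid)", error_message)
    if m:
        name = m.group(1)
        updated[name] = f"<regenerated value for {name}>"
    return updated

fixed = refine_request({"email": "not-an-email"},
                       "parameter 'email' is invalid")
assert fixed["email"] != "not-an-email"
```

Because the repair is driven by what the server actually says, this loop can surface constraints that never appear in the OpenAPI specification.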

Evaluation Results

Evaluated on 12 real-world services (including Spotify, LanguageTool, and Ohsome), comparing against RESTGPT, RESTler, EvoMaster, MoRest, and ARAT-RL:

Value Generation Performance

Semantically Valid Input Generation:

  • LlamaREST-EX (fine-tuned): 72.44% precision
  • RESTGPT: 68.82% precision
  • Llama3-8B (base model): 22.94% precision

LlamaREST-EX outperformed RESTGPT on challenging APIs:

  • Ohsome: 92.16% vs 76.83%
  • Genome-Nexus: 57.02% vs 36.25%

Inter-Parameter Dependency Detection

IPD Rules Correctly Identified:

  • LlamaREST-IPD (fine-tuned): 12 out of 17 rules
  • RESTGPT: 9 rules
  • Llama3-8B (base model): 2 rules

Code Coverage (Open-Source Services)

Average Coverage with 8-bit Quantization:

  • Method Coverage: 55.8% (vs 41.7-45.8% for other tools)
    • +14.1 to +22.7 percentage points improvement
  • Branch Coverage: 28.3% (vs 9.6-17.8% for other tools)
    • +9.2 to +18.7 percentage points improvement
  • Line Coverage: 55.3% (vs 33.2-45.3% for other tools)
    • +10.0 to +22.1 percentage points improvement

Operation Coverage (Online Services)

Operations Successfully Processed:

  • LlamaRestTest: 37.5 operations
  • ARAT-RL: 11.6 operations
  • EvoMaster: 11.2 operations
  • MoRest: 9.9 operations
  • RESTler: 10.6 operations

Notable achievements:

  • Only tool to process operations on Ohsome (24.5 operations vs 0 for others)
  • Best performance on Spotify (7 operations vs 3.9-5.6 for others)

Fault Detection

Internal Server Errors (500) Detected:

  • LlamaRestTest: 204 errors
  • ARAT-RL: 160 errors (44 fewer than LlamaRestTest)
  • MoRest: 140 errors (64 fewer than LlamaRestTest)
  • EvoMaster: 130 errors (74 fewer than LlamaRestTest)
  • RESTler: 130 errors (74 fewer than LlamaRestTest)

Fine-Tuning vs Quantization Impact

Fine-Tuning Alone:

  • Modest improvements: +0.9 to +1.4 percentage points over base model
  • 24.5 operations processed (+1.5 over base model’s 23)

8-bit Quantization with Fine-Tuning:

  • Substantial improvements: +5.1 to +6.8 percentage points
  • 37.5 operations processed (+14.5 over base model)
  • Best balance of effectiveness and efficiency

4-bit Quantization:

  • Strong performance maintained
  • 34.4 operations processed (+11.4 over base model)
  • +4.7 to +5.0 percentage points coverage improvement

2-bit Quantization:

  • Moderate improvements but limitations in complex scenarios
  • 25.4 operations processed (+2.4 over base model)
  • +2.5 to +4.1 percentage points coverage improvement

Ablation Study Results

Impact of removing each component (percentage points decrease):

  Component Removed             | Method | Branch | Line
  ------------------------------|--------|--------|------
  Full LlamaRestTest            | 55.8%  | 28.3%  | 55.3%
  Without LlamaREST-IPD         | -9.2   | -6.2   | -8.4
  Without Server Response       | -6.1   | -4.4   | -6.3
  Without Parameter Description | -5.9   | -1.4   | -4.5
  Without LlamaREST-EX          | -3.7   | -4.5   | -3.0

Key Findings:

  • LlamaREST-IPD (dependency detection) is the most critical component
  • Server response analysis significantly contributes to effectiveness
  • All components contribute meaningfully to overall performance

Key Contributions

  1. First LM-based REST API Testing with Runtime Feedback: Demonstrates that fine-tuned small language models can dynamically interact with servers during testing

  2. Fine-Tuning Effectiveness: Shows that fine-tuning enables smaller models (8B parameters) to outperform much larger models (GPT-3.5 Turbo) in specialized tasks

  3. Quantization Benefits: Demonstrates that quantization (especially 8-bit) improves both effectiveness and efficiency through faster inference and regularization effects

  4. Comprehensive Evaluation: Extensive comparison across 12 real-world services with multiple state-of-the-art tools using code coverage, operation coverage, and fault detection metrics

  5. Open-Source Artifact: Complete tool, fine-tuned models, training datasets, and evaluation results publicly available

Technical Highlights

Model Architecture:

  • Base: Llama3-8B with Llama2 tokenization
  • Fine-tuning: QLoRA (Quantized Low-Rank Adaptation)
  • Quantization: 2-bit, 4-bit, and 8-bit variants using llama.cpp

Training Configuration:

  • Batch size: 8 (optimized for 40GB A100 GPU)
  • Early stopping based on loss plateau
  • Custom tokens for IPD rules and example values

Inference Performance:

  • Fine-tuned model: 48.9 seconds per IPD
  • 8-bit quantized: 36.9 seconds per IPD
  • 4-bit quantized: 26.2 seconds per IPD
  • 2-bit quantized: 26.1 seconds per IPD

Practical Applications

Advantages:

  • Runs on consumer-grade hardware (CPU-based systems supported)
  • Significantly lower cost than GPT-based approaches
  • No exposure of sensitive API information to external services
  • Real-time adaptation to server feedback

Use Cases:

  • Testing complex APIs with intricate parameter constraints
  • Discovering undocumented parameter dependencies
  • Generating domain-specific valid inputs (emails, IDs, coordinates)
  • APIs with informative error messages
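The last point about domain-specific inputs is worth making concrete: a value can satisfy the declared type in the specification yet still be semantically invalid for the domain. A tiny email-format check illustrates the distinction (this permissive regex is illustrative; real RFC 5322 validation is far more involved):

```python
import re

# Permissive email shape check: local part, "@", domain with a dot.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def is_semantically_valid_email(value: str) -> bool:
    """True if the string looks like an email, not merely a string."""
    return bool(EMAIL_RE.match(value))

assert is_semantically_valid_email("qa.bot+test@example.com")
# A perfectly valid *string* that the server would still reject:
assert not is_semantically_valid_email("hello world")
```

Random string generators routinely produce the second kind of value; LlamaREST-EX's job is to produce the first kind for formats the specification only hints at.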

BibTeX

@article{10.1145/3715737,
  author = {Kim, Myeongsoo and Sinha, Saurabh and Orso, Alessandro},
  title = {LlamaRestTest: Effective REST API Testing with Small Language Models},
  year = {2025},
  issue_date = {July 2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {2},
  number = {FSE},
  url = {https://doi.org/10.1145/3715737},
  doi = {10.1145/3715737},
  journal = {Proc. ACM Softw. Eng.},
  month = {jun},
  articleno = {FSE022},
  numpages = {24},
  keywords = {Automated REST API Testing, Language Models for Testing}
}

Recommended citation: Myeongsoo Kim, Saurabh Sinha, and Alessandro Orso. 2025. LlamaRestTest: Effective REST API Testing with Small Language Models. Proc. ACM Softw. Eng. 2, FSE, Article FSE022 (July 2025), 24 pages. https://doi.org/10.1145/3715737