Leveraging Large Language Models to Improve REST API Testing

Published in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER 2024), 2024

The widespread adoption of REST APIs, coupled with their growing complexity and size, has led to the need for automated REST API testing tools. However, current tools focus on structured data in REST API specifications but often neglect valuable insights available in unstructured natural-language descriptions, leading to suboptimal test coverage.

The Problem

Automated REST API testing tools derive test cases primarily from API specifications like OpenAPI. However, they struggle to achieve high code coverage due to difficulties in comprehending the semantics and constraints present in parameter names and descriptions.

Limitations of Existing Assistant Tools

NLP2REST (NLP-based Approach):

  • ✅ Extracts constraints using keyword-driven extraction
  • ❌ Limited precision (50% without validation, 79% with validation)
  • ❌ Requires expensive validation process with deployed service
  • ❌ Demands significant engineering effort
  • ❌ Limited to specific rule types

ARTE (Knowledge Base Approach):

  • ✅ Queries DBPedia for parameter values
  • ❌ Low accuracy (16.93% on average)
  • ❌ Context-blind: generates irrelevant values
  • ❌ Cannot understand parameter context from descriptions
  • ❌ Only 17% of its generated inputs are syntactically and semantically valid

Our Approach: RESTGPT

We introduce RESTGPT, an innovative approach that harnesses the power and intrinsic context-awareness of Large Language Models (LLMs) to improve REST API testing. RESTGPT extracts machine-interpretable rules and generates example parameter values from natural-language descriptions in API specifications.

Key Innovation

Unlike existing approaches that rely on:

  • Keyword matching (NLP2REST)
  • External knowledge bases (ARTE)

RESTGPT leverages:

  • Semantic understanding of LLMs
  • Context-awareness from descriptions
  • Few-shot learning for precise outputs
  • No validation required - high precision without expensive checking

Architecture and Workflow

OpenAPI Spec → Specification Parser → Rule Generator (GPT-3.5 Turbo)
                                              ↓
                                    Specification Builder
                                              ↓
                                    Enhanced Specification

Component Details

1. Specification Parser:

  • Extracts machine-readable sections
  • Identifies human-readable natural language descriptions
  • Prepares data for rule extraction
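As a minimal sketch of this parsing step (our illustration, not RESTGPT's actual code), one can walk an OpenAPI document and pair each parameter's machine-readable fields with its natural-language description for the Rule Generator:

```python
# Hypothetical parser sketch: collect parameter records from an OpenAPI dict,
# keeping the natural-language description alongside the structured fields.

def extract_parameters(spec):
    """Collect (operation, name, type, description) records from an OpenAPI dict."""
    records = []
    for path, operations in spec.get("paths", {}).items():
        for method, op in operations.items():
            for param in op.get("parameters", []):
                records.append({
                    "operation": f"{method.upper()} {path}",
                    "name": param.get("name"),
                    "type": param.get("schema", {}).get("type"),
                    "description": param.get("description", ""),
                })
    return records

# Toy spec based on the FDIC Bank Data API parameter discussed later:
spec = {
    "paths": {
        "/banks": {
            "get": {
                "parameters": [{
                    "name": "sort_order",
                    "schema": {"type": "string"},
                    "description": "Indicator if ascending (ASC) or descending (DESC)",
                }]
            }
        }
    }
}
records = extract_parameters(spec)
# records[0]["description"] now carries the text the Rule Generator analyzes
```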

2. Rule Generator: Uses carefully crafted prompts to extract four constraint types:

  1. Operational Constraints: Dependencies between operations
  2. Parameter Constraints: Value ranges, formats, restrictions
  3. Parameter Type and Format: Data types and structures
  4. Parameter Examples: Sample valid values
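As a hypothetical illustration of what these four constraint types look like once extracted (the field names here are ours, not RESTGPT's actual output schema), consider a `sort_order` parameter whose description reads "Indicator if ascending (ASC) or descending (DESC)":

```python
# Illustrative machine-interpretable rules for the `sort_order` parameter.
sort_order_rules = {
    "operational_constraints": None,          # no inter-operation dependency
    "parameter_constraints": "sort_order == ASC || sort_order == DESC",
    "type_and_format": {"type": "string"},
    "examples": ["ASC", "DESC"],
}

# Sanity check: every generated example should satisfy the extracted constraint.
allowed = {"ASC", "DESC"}
assert all(value in allowed for value in sort_order_rules["examples"])
```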

3. Specification Builder:

  • Combines extracted rules with original specification
  • Ensures no conflicts between restrictions
  • Produces enhanced OpenAPI specification
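The merge-without-conflicts step might look like the following sketch (our illustration, assuming an enum-style restriction): combine extracted rules with the original parameter schema, rejecting contradictory restrictions rather than silently overwriting the spec.

```python
# Hypothetical specification-builder sketch: merge extracted rules into a
# parameter schema and refuse conflicting restrictions.

def merge_rules(param_schema, extracted):
    merged = dict(param_schema)
    if "enum" in extracted:
        if "enum" in merged and set(merged["enum"]) != set(extracted["enum"]):
            raise ValueError("conflicting enum restrictions")
        merged["enum"] = extracted["enum"]
    if "examples" in extracted:
        # keep any examples already present in the original spec
        merged.setdefault("examples", extracted["examples"])
    return merged

original = {"type": "string"}
enhanced = merge_rules(original, {"enum": ["ASC", "DESC"], "examples": ["ASC"]})
# enhanced == {"type": "string", "enum": ["ASC", "DESC"], "examples": ["ASC"]}
```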

Prompt Engineering Strategy

RESTGPT’s effectiveness stems from its sophisticated prompt design with four core components:

1. Guidelines:

- Identify the parameter using its name and description
- Extract logical constraints from the parameter description
- Interpret the description in the least constraining way

2. Cases (Chain-of-Thought): Decomposes rule extraction into specific, manageable pieces:

  • Case 1: Non-definitive descriptions → Output “None”
  • Case 10: Complex parameter relationships → Combine rules

3. Grammar Highlights: Emphasizes key operators the model should recognize:

  • Relational: <, >, <=, >=, ==, !=
  • Arithmetic: +, -, *, /
  • Dependency: AllOrNone, ZeroOrOne, etc.

4. Output Configurations: Defines structured output format for easy processing
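Putting the four components together, a prompt might be assembled along these lines (the wording is abbreviated and illustrative; RESTGPT's full prompts are in the paper's artifact):

```python
# Sketch of assembling a rule-extraction prompt from the four components.
GUIDELINES = (
    "- Identify the parameter using its name and description.\n"
    "- Extract logical constraints from the parameter description.\n"
    "- Interpret the description in the least constraining way."
)
CASES = "Case 1: If the description is non-definitive, output 'None'."
GRAMMAR = "Relational: < > <= >= == !=; Arithmetic: + - * /; Dependency: AllOrNone, ZeroOrOne"
OUTPUT_FORMAT = "Respond with a single rule in the grammar above, or 'None'."

def build_prompt(name, description):
    return (
        f"Guidelines:\n{GUIDELINES}\n\n"
        f"Cases:\n{CASES}\n\n"
        f"Grammar highlights:\n{GRAMMAR}\n\n"
        f"Output configuration:\n{OUTPUT_FORMAT}\n\n"
        f"Parameter: {name}\nDescription: {description}"
    )

prompt = build_prompt("sort_order", "Indicator if ascending (ASC) or descending (DESC)")
```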

Few-Shot Learning

RESTGPT employs few-shot prompts with:

  • Concise, contextually rich instructions
  • Representative examples
  • Clear output formatting

Together, these ensure relevant and precise model outputs.

Motivating Example: FDIC Bank Data API

Consider this parameter from the FDIC Bank Data API specification:

- name: filters
  description: |
    The filter for the bank search. Examples:
    * Filter by State name: STNAME:"West Virginia"
    * Filter for multiple States: STNAME:("West Virginia", "Delaware")

- name: sort_order
  description: |
    Indicator if ascending (ASC) or descending (DESC)

Tool Comparison

ARTE (Knowledge Base Approach):

  • filters: ❌ No relevant values (DBPedia has no context)
  • sort_order: ❌ Generates “List of colonial heads of Portuguese Timor”

NLP2REST (Keyword-Driven):

  • filters: ✓ Captures examples with “example” keyword
  • sort_order: ❌ Fails without identifiable keywords

RESTGPT (LLM-based):

  • filters: ✓ Generates STNAME: "California", STNAME: ("California", "New York")
  • sort_order: ✓ Correctly identifies “ASC” and “DESC”

Result: RESTGPT demonstrates superior contextual understanding.

Evaluation Results

Evaluated on 9 RESTful services from the NLP2REST study with ground truth data.

Rule Extraction Effectiveness

| Tool | Precision | Recall | F1 Score |
|------|-----------|--------|----------|
| NLP2REST (no validation) | 50% | 94% | 65% |
| NLP2REST (with validation) | 79% | 91% | 85% |
| RESTGPT | 97% | 92% | 94% |
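The F1 scores above are the harmonic mean of precision and recall, rounded to a whole percent:

```python
# F1 = harmonic mean of precision (P) and recall (R), rounded to whole percent.
def f1(p, r):
    return round(2 * p * r / (p + r))

assert f1(97, 92) == 94   # RESTGPT
assert f1(79, 91) == 85   # NLP2REST with validation
assert f1(50, 94) == 65   # NLP2REST without validation
```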

Key Findings:

  • 97% precision without any validation process
  • Eliminates need for expensive validation infrastructure
  • Outperforms NLP2REST even with validation (79%)
  • Maintains high recall (92%) while achieving superior precision

Detailed Results by Service

| Service | NLP2REST (no val) | NLP2REST (with val) | RESTGPT |
|---------|-------------------|---------------------|---------|
| FDIC | 54% / 93% | 93% / 93% | 100% / 98% |
| Genome Nexus | 96% / 98% | 96% / 98% | 100% / 93% |
| Language Tool | 63% / 100% | 90% / 90% | 100% / 86% |
| OCVN | 88% / 88% | 93% / 76% | 88% / 94% |
| OhSome | 16% / 93% | 52% / 80% | 80% / 86% |
| OMDb | 100% / 100% | 100% / 100% | 100% / 100% |
| REST Countries | 97% / 88% | 100% / 88% | 100% / 94% |
| Spotify | 55% / 94% | 75% / 93% | 98% / 96% |
| YouTube | 19% / 88% | 76% / 82% | 92% / 75% |

Format: Precision / Recall

Value Generation Accuracy

Accuracy measures percentage of syntactically and semantically valid inputs generated:

| Service | ARTE | RESTGPT | Improvement |
|---------|------|---------|-------------|
| FDIC | 25.35% | 77.46% | +206% |
| Genome Nexus | 9.21% | 38.16% | +314% |
| Language Tool | 0% | 82.98% | N/A |
| OCVN | 33.73% | 39.76% | +18% |
| OhSome | 4.88% | 87.80% | +1,700% |
| OMDb | 36.00% | 96.00% | +167% |
| REST Countries | 29.66% | 92.41% | +212% |
| Spotify | 14.79% | 76.06% | +414% |
| YouTube | 0% | 65.33% | N/A |
| Average | 16.93% | 72.68% | +329% |
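The "Improvement" column is the relative gain of RESTGPT's accuracy over ARTE's, rounded to a whole percent (undefined where ARTE's accuracy is 0%):

```python
# Relative improvement of RESTGPT's accuracy over ARTE's, as a whole percent.
def relative_improvement(arte, restgpt):
    if arte == 0:
        return None  # undefined: cannot divide by a 0% baseline
    return round((restgpt - arte) / arte * 100)

assert relative_improvement(16.93, 72.68) == 329   # overall average
assert relative_improvement(25.35, 77.46) == 206   # FDIC
assert relative_improvement(36.00, 96.00) == 167   # OMDb
```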

Key Achievement: RESTGPT generates valid inputs for 73% of parameters vs. 17% for ARTE.

Why RESTGPT Outperforms

Context-Awareness Example (LanguageTool service):

  • ARTE: Generates generic language names: “Arabic”, “Chinese”, “English”, “Spanish”
  • RESTGPT: Understands context and generates proper language codes: “en-US”, “de-DE”

This demonstrates LLM’s superior ability to understand parameter context from descriptions.

Key Advantages

1. No Validation Required:

  • Achieves 97% precision without expensive validation
  • No need for deployed service instance
  • Reduced engineering effort

2. Superior Context Understanding:

  • Interprets semantic meaning from descriptions
  • Generates contextually relevant values
  • Understands domain-specific conventions

3. Broader Rule Coverage:

  • Not limited to specific keywords or patterns
  • Handles complex constraint expressions
  • Adapts to diverse specification styles

4. High Accuracy:

  • 329% improvement over ARTE in value generation
  • 94% improvement over NLP2REST without validation
  • 23% improvement over NLP2REST with validation

5. Scalability:

  • Works across diverse API domains
  • No manual rule configuration required
  • Generalizes to unseen specifications

Benchmark Services

Evaluated on 9 real-world REST APIs:

  • FDIC - Federal Deposit Insurance Corporation Bank Data
  • Genome Nexus - Genomics data service
  • LanguageTool - Grammar checking service
  • OCVN - Open Contracting Vietnam
  • OhSome - OpenStreetMap data analysis
  • OMDb - Open Movie Database
  • REST Countries - Country information service
  • Spotify - Music streaming API
  • YouTube - Video platform API

Technical Implementation

LLM Selection: GPT-3.5 Turbo

  • Chosen for accuracy and efficiency balance
  • Demonstrated in OpenAI reports
  • Cost-effective for practical deployment

Prompt Components:

  1. Guidelines: Framework for rule interpretation
  2. Cases: Chain-of-Thought decomposition
  3. Grammar: Domain-specific operators
  4. Output Format: Structured result templates

Few-Shot Learning:

  • Provides contextually rich examples
  • Ensures precise, relevant outputs
  • Reduces hallucination and errors

Future Research Directions

1. Model Improvement

Task-Specific Fine-Tuning:

  • Fine-tune LLMs using APIs-guru (4,000+ specifications)
  • Incorporate RapidAPI dataset
  • Enhance understanding of diverse API contexts

Lightweight Model Development:

  • Create models for commodity CPUs
  • Model pruning and compression
  • Retain essential neurons for REST API tasks

2. Enhanced Fault Detection

Beyond 500 Status Codes:

  • Detect CRUD semantic errors
  • Identify producer-consumer relationship discrepancies
  • Recognize logical inconsistencies

Broader Bug Categories:

  • Security vulnerabilities
  • Performance issues
  • Specification-implementation mismatches

3. LLM-Based Testing Tool

Server Message Leveraging:

  • Parse and understand server error messages
  • Autonomously generate corrective test cases
  • Adaptive testing based on feedback

Dynamic Test Generation:

  • Real-time test adaptation
  • Context-aware test sequences
  • Intelligent exploration strategies

Key Contributions

  1. RESTGPT Framework: First application of LLMs to REST API specification enhancement
  2. Superior Performance: 97% precision in rule extraction, 73% accuracy in value generation
  3. Validation-Free Approach: High accuracy without expensive validation infrastructure
  4. Context-Aware Generation: Semantic understanding enables contextually relevant outputs
  5. Open-Source Artifact: Publicly available tool and experimental data
  6. Future Roadmap: Clear directions for advancing LLM-based REST API testing

Implications

For Practitioners:

  • Dramatically improved test input quality
  • Reduced manual effort in test creation
  • Better coverage without additional engineering

For Researchers:

  • Demonstrates LLM effectiveness for specification analysis
  • Opens new directions for AI-assisted testing
  • Provides baseline for future comparisons

For Tool Developers:

  • Clear integration path for LLMs in testing tools
  • Validation-free approach reduces complexity
  • Scalable solution for diverse APIs

BibTeX

@inproceedings{kim2024restapi,
  author = {Kim, Myeongsoo and Stennett, Tyler and Shah, Dhruv and Sinha, Saurabh and Orso, Alessandro},
  title = {Leveraging Large Language Models to Improve REST API Testing},
  year = {2024},
  isbn = {9798400705007},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3639476.3639769},
  doi = {10.1145/3639476.3639769},
  booktitle = {Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results},
  pages = {37–41},
  numpages = {5},
  keywords = {large language models for testing, OpenAPI specification analysis},
  location = {Lisbon, Portugal},
  series = {ICSE-NIER'24}
}

Recommended citation: Myeongsoo Kim, Tyler Stennett, Dhruv Shah, Saurabh Sinha, and Alessandro Orso. 2024. Leveraging Large Language Models to Improve REST API Testing. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER 2024). Association for Computing Machinery, New York, NY, USA, 37–41.