Method Call Performance Optimization Plan
April 10, 2026 · View on GitHub
Status: PARTIALLY IMPLEMENTED - Phase 1 (inline caching) is implemented as a runtime global cache in RuntimeCode.java. Phases 2-3 (fast hash access, method-specific optimizations) remain unimplemented. Some references below (e.g. SpillSlotManager, RuntimeArrayPool, bench_method.pl) are to planned components that were never created.
Goal: Achieve >340 iterations/sec on dev/bench/bench_method.pl (matching or exceeding native Perl performance)
Current Status: 119 iter/sec (2.87x slower than target)
Target Completion: 4 weeks
Executive Summary
Analysis reveals that PerlOnJava's closure performance is 2.7x faster than Perl (1718 vs 638 iter/sec), proving the JVM execution model is fundamentally sound. The method call slowdown is entirely due to blessed hash access overhead. The add() method performs 4 hash accesses per call, each costing ~38ns vs ~2ns in native Perl.
This plan focuses on eliminating redundant work in the hot path through architectural improvements that leverage PerlOnJava's existing infrastructure.
Root Cause Analysis
Performance Breakdown (per add() call)
Method call overhead: 134 ns (25%)
Hash access (4x): 152 ns (29%) ← PRIMARY BOTTLENECK
- blessId extraction: 20 ns
- Overload check: 12 ns
- String conversion: 40 ns
- HashMap lookup: 80 ns
Other operations: 246 ns (46%)
────────────────────────────────────
Total: 532 ns
Target: 153 ns/call (to match Perl's 340 iter/sec) Required speedup: 3.5x on hash access path
Why Closures Are Fast
The closure benchmark has zero blessed hash accesses - only lexical variable arithmetic. This proves:
- ✅ JVM method invocation is efficient
- ✅ Bytecode generation is optimal
- ✅ JIT compilation works well
- ❌ Blessed object operations need optimization
Optimization Strategy
Phase 1: Inline Cache at Call Sites (Week 1-2)
Impact: 2.0x speedup | Effort: Medium | Risk: Low
Objective
Cache resolved methods at bytecode call sites to eliminate InheritanceResolver.findMethodInHierarchy() on every call.
Implementation
-
Generate inline cache in bytecode
EmitterVisitor.emitMethodCall()emits a guard check:if (object.blessId == cachedBlessId) { return cachedMethod.invoke(...); } else { // Slow path: resolve and update cache }- Store cache in generated class's static fields
- Use
INVOKEDYNAMICwithCallSitefor polymorphic caching (Java 7+)
-
Modify
Dereference.handleArrowOperator()- Lines 528-680: Add cache slot allocation
- Emit cache guard before
RuntimeCode.call() - Use existing
SpillSlotManagerfor cache slots
-
Add cache invalidation hooks
InheritanceResolver.invalidateCache()already exists- Extend to invalidate bytecode-level caches via
MutableCallSite.setTarget()
Files to Modify
src/main/java/org/perlonjava/codegen/Dereference.java(lines 528-680)src/main/java/org/perlonjava/runtime/RuntimeCode.java(add cache helper methods)src/main/java/org/perlonjava/mro/InheritanceResolver.java(add invalidation hooks)
Success Criteria
bench_method.pl: 180+ iter/secbench_closure.pl: no regression- All tests pass
Phase 2: Fast Path for Non-Overloaded Hash Access (Week 2-3)
Impact: 2.5x speedup | Effort: High | Risk: Medium
Objective
Eliminate overload checks and string conversions for blessed hash access when no overloads are defined.
Implementation
-
Add fast-path bytecode for hash access
EmitterVisitordetects$blessed->{key}pattern- Emit optimized path:
if (object.type == HASHREFERENCE && !hasOverloads(blessId)) { return ((RuntimeHash)object.value).elements.get(cachedKey); } - Pre-intern string keys at compile time
- Skip
RuntimeScalar.hashDeref()entirely
-
Extend
RuntimeHashwith direct accessors- Add
getDirectUnchecked(String key)method - Bypass overload checking layer
- Use for compiler-generated code only (not user-facing API)
- Add
-
Cache blessId check result
- Store "has_overloads" bit in per-class metadata
- Check once per class, not per access
- Use existing
NameNormalizer.blessIdCacheinfrastructure
-
Optimize string key caching
RuntimeHash.get()currently callskeyScalar.toString()on every access- Add
RuntimeScalar.cachedStringValuefield - Memoize conversion for immutable scalars
Files to Modify
src/main/java/org/perlonjava/astvisitor/EmitterVisitor.java(add pattern detection)src/main/java/org/perlonjava/codegen/Dereference.java(emit fast path)src/main/java/org/perlonjava/runtime/RuntimeHash.java(addgetDirectUnchecked())src/main/java/org/perlonjava/runtime/RuntimeScalar.java(addcachedStringValue)src/main/java/org/perlonjava/runtime/OverloadContext.java(expose hasOverloads flag)
Success Criteria
bench_method.pl: 300+ iter/sec- Hash access microbenchmark: <15ns per access (from current 38ns)
- All tests pass, including overload tests
Phase 3: Method-Specific Optimizations (Week 3-4)
Impact: 1.2x speedup | Effort: Low | Risk: Low
Objective
Apply targeted optimizations for common method patterns.
Implementation
-
Eliminate redundant blessId extraction
RuntimeScalar.blessedId()called 4x peradd()method- Cache in local variable at method entry
- Emit optimization in
EmitterVisitor.visitMethodNode()
-
Specialize accessor methods
- Detect getter/setter patterns:
sub get_x { $_[0]->{x} } - Generate direct field access bytecode
- Skip full method call machinery
- Detect getter/setter patterns:
-
Pool RuntimeArray for
@_- Current implementation creates new array per call
- Extend existing
RuntimeArrayPool(already added) - Reuse arrays for same argument counts
-
Pre-compute method signatures
- Hash
(blessId, methodName)once at cache time - Avoid string concatenation in
NameNormalizer.normalizeVariableName()
- Hash
Files to Modify
src/main/java/org/perlonjava/astvisitor/EmitterVisitor.java(pattern detection)src/main/java/org/perlonjava/runtime/RuntimeCode.java(array pooling)src/main/java/org/perlonjava/runtime/NameNormalizer.java(signature caching)
Success Criteria
bench_method.pl: 350+ iter/sec- Memory profiling shows reduced allocation rate
- All tests pass
Risk Mitigation
Technical Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Inline cache invalidation bugs | Medium | High | Comprehensive test suite for dynamic ISA changes |
| JVM verifier issues with fast path | Low | High | Generate conservative bytecode, validate with javap -v |
| Overload detection edge cases | Medium | Medium | Extensive overload.t test coverage |
| Memory leak from cached objects | Low | Medium | WeakReferences in cache, monitoring in tests |
Rollback Strategy
- Each phase is independently testable
- Feature flags for new codegen paths:
CompilerOptions.enableInlineCache - Gradual rollout: enabled only for non-overloaded classes initially
Testing Strategy
Performance Tests
-
Regression suite
bench_method.pl: Target >340 iter/secbench_closure.pl: No regression (maintain >1700 iter/sec)- New:
bench_hash_access.pl: >60M ops/sec on blessed hash reads
-
Microbenchmarks
- Method call overhead: <100ns
- Hash access: <15ns
- Method resolution: <50ns (first call), <5ns (cached)
Correctness Tests
- Existing test suite: All 2012 tests must pass
- New tests:
test/method_cache_invalidation.t: Dynamic @ISA changestest/overload_inheritance.t: Inherited overload operatorstest/inline_cache_polymorphism.t: Multiple classes at same call site
Validation Criteria
- ✅ Zero test failures
- ✅ Zero memory leaks (valgrind/heap profiling)
- ✅ Performance targets met on all benchmarks
- ✅ No bytecode verifier errors
Implementation Order
Week 1: Infrastructure & Inline Cache
- Day 1-2: Add cache slot support to EmitterVisitor
- Day 3-4: Implement inline cache generation in Dereference.java
- Day 5: Add invalidation hooks, test with bench_method.pl
Week 2: Fast Path Design
- Day 1-2: Design fast-path bytecode structure, prototype
- Day 3-4: Implement pattern detection in EmitterVisitor
- Day 5: Integrate RuntimeHash.getDirectUnchecked()
Week 3: Fast Path Implementation
- Day 1-3: Complete fast-path codegen for blessed hash access
- Day 4: Add string key caching in RuntimeScalar
- Day 5: Performance testing, tuning
Week 4: Polish & Method Optimizations
- Day 1-2: Implement accessor pattern specialization
- Day 3: Optimize blessId extraction and array pooling
- Day 4: Final performance testing and validation
- Day 5: Documentation and code review
Success Metrics
Primary Goal
- bench_method.pl: >340 iter/sec (currently 119 iter/sec)
- Improvement: 2.87x speedup
Secondary Goals
- bench_closure.pl: Maintain >1700 iter/sec (no regression)
- test-all: 100% pass rate
- Hash access cost: <15ns (currently 38ns)
Stretch Goals
- bench_method.pl: >400 iter/sec (exceed native Perl by 17%)
- Memory overhead: <10% increase vs baseline
- Compilation time: No regression (same bytecode gen speed)
Dependencies
Existing Infrastructure (Ready to Use)
- ✅
SpillSlotManager: Slot allocation for cache storage - ✅
InheritanceResolver: Method resolution with caching - ✅
OverloadContext: Overload detection (now with BitSet) - ✅
EmitterVisitor: Bytecode generation framework - ✅
RuntimeArrayPool: Array pooling for@_ - ✅ ASM library: Low-level bytecode manipulation
Required Tools
- JMH (Java Microbenchmark Harness): For precise performance measurement
- VisualVM or YourKit: For profiling and validation
- javap: Bytecode verification
Alternatives Considered
Option A: JIT Recompilation
Rejected: Requires dynamic class loading infrastructure, high complexity
Option B: C++/JNI for Hash Access
Rejected: JNI overhead negates benefits, adds platform dependencies
Option C: Specialized Type System
Rejected: Breaking change to runtime API, affects all existing code
Selected Approach: Leverage existing bytecode generation with targeted fast paths
- Minimal API changes
- Incremental rollout
- Builds on proven infrastructure
Monitoring & Validation
Continuous Integration
# Add to CI pipeline
make bench-method # Must show >340 iter/sec
make bench-closure # Must show >1700 iter/sec
make test-all # Must pass 100%
make profile-memory # Detect leaks
Performance Dashboard
Track metrics over commits:
- Method call throughput (iter/sec)
- Hash access latency (ns)
- Method resolution cache hit rate (%)
- Memory allocation rate (MB/sec)
Expected Outcomes
Quantitative
- 3.0x faster method calls (119 → 350+ iter/sec)
- 2.5x faster blessed hash access (38ns → 15ns)
- Zero test regressions
- <5% memory overhead increase
Qualitative
- Competitive with native Perl for OOP code
- Maintains >2x advantage for closure-heavy code
- Establishes pattern for future optimizations
- Demonstrates PerlOnJava's optimization potential
Conclusion
This plan achieves >340 iter/sec through pragmatic architectural improvements that leverage PerlOnJava's existing strengths:
- Proven approach: Inline caching is standard in dynamic language VMs
- Low risk: Builds on existing infrastructure (EmitterVisitor, ASM, caching)
- Measurable: Clear benchmarks at each phase
- Reversible: Feature flags enable rollback if issues arise
The closure benchmark proves PerlOnJava can exceed native Perl performance. This plan extends that advantage to object-oriented code.
Estimated total effort: 80-100 hours over 4 weeks Confidence level: High (80%) for >340 iter/sec, Medium (60%) for >400 iter/sec