Improve benchmarking system and performance tests #34

Closed
opened 2026-03-18 23:55:20 +00:00 by david · 2 comments
Owner

User Story

As a developer, I want reliable, accurate performance benchmarks that run consistently across different environments (development, CI with coverage, etc.), so that I can detect performance regressions without false positives from coverage instrumentation or timing variation.

Rationale

The current benchmark tests are failing intermittently when running with coverage instrumentation:

FAILED tests/test_benchmark.py::TestBenchmarkNullDisplay::test_effects_minimum_throughput

This occurs because:

  1. Coverage instrumentation significantly slows down Python execution (typically 2-5x slower)
  2. The minimum FPS thresholds (10,000 for effects) are not adjusted for coverage runs
  3. Timing-sensitive tests are unreliable in CI environments with varying load
  4. There's no way to distinguish between "actual performance regression" and "coverage overhead"

Proposed Improvements

  1. Adjust thresholds for coverage mode: Detect if coverage is active and adjust minimum FPS thresholds accordingly
  2. Add environment-aware benchmarks: Different thresholds for CI vs local development
  3. Use statistical methods: Instead of single-run timing, use median of multiple runs with outlier rejection
  4. Add performance baseline tracking: Store historical performance data to detect trends
  5. Make benchmark tests optional: Mark them with a custom marker like @pytest.mark.performance so they can be skipped when not needed
  6. Add configurable thresholds: Allow thresholds to be set via environment variables for different CI environments
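
Items 1, 2, and 6 above could be combined into a small helper. The sketch below is illustrative only: the function names (`_is_coverage_active`, `effective_min_fps`), the `MIN_FPS_THRESHOLD` variable, and the 25% scaling factor are assumptions, not existing code.

```python
# Hypothetical sketch of coverage-aware, env-overridable FPS thresholds.
import os
import sys

def _is_coverage_active() -> bool:
    """Heuristic: coverage.py imports itself and installs a trace function."""
    return "coverage" in sys.modules or sys.gettrace() is not None

def effective_min_fps(base_min_fps: float) -> float:
    """Return the threshold to enforce: env override first, then coverage scaling."""
    override = os.environ.get("MIN_FPS_THRESHOLD")  # assumed variable name
    if override is not None:
        return float(override)
    if _is_coverage_active():
        return base_min_fps * 0.25  # coverage typically slows runs 2-5x
    return base_min_fps
```

A benchmark test would then assert `fps >= effective_min_fps(10_000)` instead of hard-coding the threshold.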

Acceptance Criteria

  • Benchmark tests pass consistently when running with coverage
  • Performance thresholds are adjusted based on execution environment
  • Documentation explains how to run and interpret benchmarks
  • CI pipeline includes performance regression detection

Related Issues

  • Part of broader performance monitoring work
  • Related to issue #26 (Streaming display backend), which may have performance implications
Author
Owner

Implementation Plan (2026-03-18)

Phase 1: More Coverage + Better Isolation

Based on our planning, here's what we'll implement:

  1. Add more test coverage:

    • Test performance of all display backends
    • Test effect pipeline execution throughput
    • Test WebSocket streaming performance
    • Target: 80% coverage for the benchmark system
  2. Improve test isolation:

    • Better fixtures for benchmark tests
    • Mock external dependencies (network, file I/O)
    • Use subprocess isolation for timing-critical tests
  3. Fix CI interference:

    • Review mise.toml for the duplication issue (CI runs tasks twice)
    • Ensure benchmarks don't run during coverage tests
    • The test-cov task already uses -m "not benchmark" to exclude benchmarks

Current Issues Found

  1. mise.toml CI task has redundant configuration:
    • ci task has both depends list AND run string that repeats the same commands
    • This causes tasks to run twice
    • Fix: Clean up mise.toml to use either depends OR run, not both
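
A minimal sketch of the `depends`-only form of the fix; the task names and commands below are assumptions for illustration, not the project's actual `mise.toml`:

```toml
# Illustrative mise.toml fragment: the ci task declares dependencies only,
# instead of also repeating the same commands in a `run` string.
[tasks.lint]
run = "ruff check ."

[tasks.test-cov]
run = 'pytest -m "not benchmark" --cov'

[tasks.ci]
depends = ["lint", "test-cov"]  # no duplicate `run` string, so nothing runs twice
```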

Testing Strategy

  • Use @pytest.mark.benchmark marker for performance tests
  • Use @pytest.mark.slow for tests that take >1 second
  • Run benchmarks in isolated subprocess to avoid coverage interference
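
The subprocess-isolation point can be sketched as follows. This is a hypothetical harness, not the project's code: the snippet body, helper name, and JSON protocol are all assumptions. The idea is that a fresh interpreter carries no coverage trace function, so the timing is unaffected by instrumentation in the parent process.

```python
# Hypothetical sketch: time a workload in a clean child interpreter and
# report the result back to the parent as JSON.
import json
import subprocess
import sys

BENCH_SNIPPET = """
import json, time
start = time.perf_counter()
for _ in range(100_000):
    pass  # placeholder for the real effect-pipeline call
elapsed = time.perf_counter() - start
print(json.dumps({"elapsed": elapsed}))
"""

def run_isolated_benchmark() -> float:
    """Run the snippet with a fresh interpreter and parse its timing output."""
    result = subprocess.run(
        [sys.executable, "-c", BENCH_SNIPPET],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)["elapsed"]
```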

Should we proceed with implementing these improvements?

Author
Owner

Progress Update

Implemented in commit 3a2138a:

  • Coverage-aware FPS thresholds (_is_coverage_active())
  • Adjusted min_fps to 25% of the baseline when coverage is active
  • Reduced iterations in coverage mode
  • Added BENCHMARK_ITERATIONS env var override
  • Added separate benchmark task in mise.toml
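
The iteration-count logic described above might look like the sketch below. The helper name and default counts are assumptions, not the exact code from commit 3a2138a; only the `BENCHMARK_ITERATIONS` variable name comes from this update.

```python
# Sketch of an iteration-count helper: env override first, then a reduced
# count when coverage instrumentation is detected.
import os
import sys

def benchmark_iterations(default: int = 1000) -> int:
    """Honour BENCHMARK_ITERATIONS, else cut the count under coverage."""
    override = os.environ.get("BENCHMARK_ITERATIONS")
    if override is not None:
        return int(override)
    if "coverage" in sys.modules or sys.gettrace() is not None:
        return default // 10  # fewer iterations when instrumentation is active
    return default
```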

The transform.py performance fix (c57617b) also addresses the root cause: PIL-based height estimation was slowing the benchmarks.

david closed this issue 2026-03-19 20:13:19 +00:00
Reference: klubhaus/sideline#34