Feature #5
open
Enhance BackfillOrchestrator to auto-fix problem games with metadata issues
0%
Description
Overview¶
Now that we fixed the admin dashboard bug (https://docs.v2.icydata.hockey/issues/4), we have an accurate list of "problem games". The BackfillOrchestrator should automatically detect and fix these problem games.
Current State¶
The BackfillOrchestrator currently handles:
- ✅ Incomplete features - Games with missing GameFeature records
- ✅ Stale scheduled games - FUT/PRE games that should have started (checked in check_stale_scheduled_games)
But the admin dashboard also identifies:
3. ❌ Metadata problems - Final games with missing scores, venue, or teams
4. ❌ Stale live games - Live games not updated in 1+ hours
Proposed Enhancement¶
Add a new method to BackfillOrchestrator to detect and fix metadata problems and stale live games:
def check_problem_games
  # Find final games with metadata problems
  problem_games = Game.joins(:season)
                      .where(seasons: { enabled: true })
                      .where(game_state: 'FINAL')
                      .where(
                        "home_score IS NULL OR away_score IS NULL OR venue IS NULL OR " \
                        "home_team_id IS NULL OR away_team_id IS NULL"
                      )
                      .order(:game_date)
                      .limit(10)

  problem_games.each do |game|
    logger.info("BackfillOrchestrator: Refetching game #{game.external_id} with metadata problems")
    update_game_state(game)
  end

  # Also check for stale live games
  stale_live = Game.joins(:season)
                   .where(seasons: { enabled: true })
                   .where(game_state: ['LIVE', 'CRIT'])
                   .where("last_polled_at IS NULL OR last_polled_at < ?", 1.hour.ago)
                   .limit(5)

  stale_live.each do |game|
    logger.info("BackfillOrchestrator: Updating stale live game #{game.external_id}")
    update_game_state(game)
  end
end
Implementation Plan¶
- Add check_problem_games method to BackfillOrchestrator
- Call it in the process method (before or after check_stale_scheduled_games)
- Limit the number of problem games fixed per run (e.g., 10) to avoid overwhelming the queue
- Log which games are being fixed and why
- Update tests to cover the new functionality
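As a rough sketch of the plan above, the hook into process might look like the following. This is plain Ruby with stand-in methods, not the real service: live_games_exist?, process_incomplete_games, the calls recorder, and the PROBLEM_GAME_LIMIT constant are hypothetical names, and the actual structure of process in backfill_orchestrator.rb may differ.

```ruby
# Hypothetical skeleton. Only check_stale_scheduled_games and
# check_problem_games are names taken from this issue.
class BackfillOrchestratorSketch
  PROBLEM_GAME_LIMIT = 10 # cap per run so the queue is not overwhelmed

  attr_reader :calls

  def initialize(live_games: false)
    @live_games = live_games
    @calls = [] # records which steps ran, in order
  end

  def process
    # Assumed guard: skip the whole run while live games exist.
    return calls if live_games_exist?

    check_stale_scheduled_games
    check_problem_games # new step, ahead of incomplete-feature processing
    process_incomplete_games
    calls
  end

  private

  def live_games_exist?
    @live_games
  end

  def check_stale_scheduled_games
    calls << :check_stale_scheduled_games
  end

  def check_problem_games
    calls << :check_problem_games
  end

  def process_incomplete_games
    calls << :process_incomplete_games
  end
end
```

Calling BackfillOrchestratorSketch.new.process returns the steps in the order they ran, which makes the ordering easy to assert in a test.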
Benefits¶
- Automatic self-healing for data quality issues
- Reduces manual intervention needed
- Ensures admin dashboard "Problem Games" section stays clean
- Catches games that may have had API failures during initial fetch
Notes¶
- Should run BEFORE the regular incomplete games processing
- Should respect the same "skip if live games exist" logic
- Should handle API errors gracefully (don't crash the entire orchestrator run)
- Consider adding metrics/alerts for persistent problem games
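One way to satisfy the error-handling note is to rescue per game, so a single failed refetch is logged and the loop moves on. The sketch below is plain Ruby: refetch_each and its block are hypothetical stand-ins for the loop around update_game_state, and the returned list of failed IDs is only a suggestion for feeding metrics or alerts.

```ruby
require "logger"

# Yields every game ID to the block. A failure on one game is logged and
# does not stop the remaining games. Returns the IDs that failed.
def refetch_each(game_ids, logger: Logger.new($stdout))
  failed = []
  game_ids.each do |id|
    begin
      yield id
    rescue StandardError => e
      logger.warn("BackfillOrchestrator: refetch failed for game #{id}: #{e.message}")
      failed << id
    end
  end
  failed
end
```

In the orchestrator, the block body would be the update_game_state(game) call; here it can be any code that may raise.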
Related¶
- Issue #4: Admin dashboard bug fix (now accurately identifies problem games)
- app/services/backfill_orchestrator.rb
- app/controllers/api/v1/admin/seasons_controller.rb: find_problem_games method, for reference
Updated by William Lang about 2 months ago
Real Production Example Found¶
Game 2023020775 (NSH @ OTT, 2024-01-29) has goalie_stats stuck in processing:
Status: processing
Fetched At: nil
Created At: 2025-12-29 12:07:43 UTC (2 days ago)
Updated At: 2025-12-31 04:12:02 UTC (2 minutes ago)
Fetch Attempts: 0
Is stale (>5 min): false
Root Cause¶
The FeatureManager checks game_feature.updated_at < 5.minutes.ago to detect stale "processing" features (line 57 in feature_manager.rb). However:
- Feature was stuck in "processing" for 2 days
- Something recently touched the record (updated_at was refreshed)
- This reset the stale timer, so it's no longer detected as stale
- BackfillOrchestrator won't retry it until another 5 minutes pass with no update to the record
The Bug¶
Using updated_at to determine staleness is unreliable because:
- Any code that touches the record resets the timer
- Updates unrelated to fetch status can mask stuck features
- Features can be stuck indefinitely if touched periodically
Better Approach¶
Instead of using updated_at, track when the feature was marked as "processing":
- Add a processing_started_at timestamp to GameFeature
- Set it when status changes to "processing"
- Check processing_started_at < 5.minutes.ago for staleness
- This won't be affected by unrelated updates
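A minimal sketch of the processing_started_at idea in plain Ruby, outside ActiveRecord. GameFeatureSketch, mark_processing!, touch!, and stale? are hypothetical names, and the 5-minute threshold is written as 300 seconds because 5.minutes needs ActiveSupport.

```ruby
STALE_AFTER_SECONDS = 5 * 60

GameFeatureSketch = Struct.new(:status, :updated_at, :processing_started_at) do
  # Record the moment the feature entered "processing".
  def mark_processing!(now)
    self.status = "processing"
    self.processing_started_at = now
    self.updated_at = now
  end

  # An unrelated write: bumps updated_at but leaves the processing clock alone.
  def touch!(now)
    self.updated_at = now
  end

  # Stale when processing began more than the threshold ago.
  def stale?(now)
    status == "processing" &&
      !processing_started_at.nil? &&
      processing_started_at < now - STALE_AFTER_SECONDS
  end
end
```

With this shape, a feature marked as processing two days ago and touched two minutes ago still reports stale, because touch! never moves processing_started_at.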
Alternatively, check against created_at if fetched_at is nil:
if game_feature.status == 'processing'
  stale_threshold = game_feature.fetched_at || game_feature.created_at
  if stale_threshold && stale_threshold < 5.minutes.ago
    # Feature is stale
  end
end
Verification Command¶
docker -H ssh://git@v2.icydata.hockey exec icydata-ror-deploy-web-1 bin/rails runner "..."
Updated by William Lang about 2 months ago
Starting work on fixing stale processing detection in FeatureManager.
Updated by William Lang about 2 months ago
Fix Implemented¶
Root Cause:
FeatureManager was using updated_at to detect stale "processing" features, but any code touching the GameFeature record would reset this timestamp, preventing detection of truly stuck features.
Solution:
Changed staleness detection to use a baseline timestamp that won't be affected by unrelated updates:
baseline_timestamp = game_feature.fetched_at || game_feature.created_at
Logic:
- For first fetch (fetched_at is nil): Use created_at as baseline
- For refetch (fetched_at exists): Use fetched_at as baseline
- Check if baseline_timestamp < 5.minutes.ago to detect staleness
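The difference between the old and new checks can be reproduced with the timestamps from the production example. The sketch below is plain Ruby (no ActiveSupport, so 5.minutes.ago becomes now - 300); FeatureRow, stale_by_updated_at?, and stale_by_baseline? are illustrative names, not the code in feature_manager.rb.

```ruby
STALE_SECONDS = 5 * 60

FeatureRow = Struct.new(:status, :fetched_at, :created_at, :updated_at, keyword_init: true)

# Old check: any write to the row resets the clock.
def stale_by_updated_at?(row, now)
  row.status == "processing" && row.updated_at < now - STALE_SECONDS
end

# New check: baseline is fetched_at for a refetch, created_at for a first fetch.
def stale_by_baseline?(row, now)
  baseline = row.fetched_at || row.created_at
  row.status == "processing" && !baseline.nil? && baseline < now - STALE_SECONDS
end

# Values from the goalie_stats record of game 2023020775.
row = FeatureRow.new(
  status: "processing",
  fetched_at: nil,
  created_at: Time.utc(2025, 12, 29, 12, 7, 43),
  updated_at: Time.utc(2025, 12, 31, 4, 12, 2)
)
now = row.updated_at + 120 # the record was updated "2 minutes ago"

puts stale_by_updated_at?(row, now) # prints false: masked by the recent update
puts stale_by_baseline?(row, now)   # prints true: created well over 5 minutes ago
```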
Files Modified:
- app/services/feature_manager.rb (lines 55-82)
Testing:
- ✅ Simulated production scenario:
  - OLD LOGIC: Would NOT detect stuck feature (updated 2 min ago)
  - NEW LOGIC: DETECTED! (created 48 hours ago)
- ✅ All FeatureManager tests pass (8 tests, 14 assertions, 0 failures)
Example:
Game 2023020775 with goalie_stats stuck for 48 hours will now be detected and retried by BackfillOrchestrator.
Impact:
- BackfillOrchestrator will now detect truly stuck features
- Features stuck for days won't be masked by recent timestamp updates
- Automatic self-healing for stuck processing features
Next Steps:
- Deploy to production
- Monitor BackfillOrchestrator logs for stuck feature detection
- Verify game 2023020775 gets auto-fixed