Feature #5

open

Enhance BackfillOrchestrator to auto-fix problem games with metadata issues

Added by William Lang about 2 months ago. Updated about 2 months ago.

Status: Backlog
Priority: Low
Assignee: -
Start date: 12/31/2025
Due date:
% Done: 0%
Estimated time:

Description

Overview

Now that the admin dashboard bug is fixed (https://docs.v2.icydata.hockey/issues/4), the dashboard shows an accurate list of "problem games". The BackfillOrchestrator should detect and fix these games automatically.

Current State

The BackfillOrchestrator currently handles:

  1. Incomplete features - Games with missing GameFeature records
  2. Stale scheduled games - FUT/PRE games that should have started (checked in check_stale_scheduled_games)

The admin dashboard also identifies two categories that the BackfillOrchestrator does not handle yet:

  3. ❌ Metadata problems - Final games with missing scores, venue, or teams
  4. ❌ Stale live games - Live games not updated in 1+ hours

Proposed Enhancement

Add a new method to BackfillOrchestrator to detect and fix metadata problems:

def check_problem_games
  # Find final games with metadata problems
  problem_games = Game.joins(:season)
    .where(seasons: { enabled: true })
    .where(game_state: 'FINAL')
    .where(
      "home_score IS NULL OR away_score IS NULL OR venue IS NULL OR " \
      "home_team_id IS NULL OR away_team_id IS NULL"
    )
    .order(:game_date)
    .limit(10)
  
  problem_games.each do |game|
    logger.info("BackfillOrchestrator: Refetching game #{game.external_id} with metadata problems")
    update_game_state(game)
  end
  
  # Also check for stale live games
  stale_live = Game.joins(:season)
    .where(seasons: { enabled: true })
    .where(game_state: ['LIVE', 'CRIT'])
    .where("last_polled_at IS NULL OR last_polled_at < ?", 1.hour.ago)
    .limit(5)
  
  stale_live.each do |game|
    logger.info("BackfillOrchestrator: Updating stale live game #{game.external_id}")
    update_game_state(game)
  end
end

Implementation Plan

  1. Add check_problem_games method to BackfillOrchestrator
  2. Call it in the process method (before or after check_stale_scheduled_games)
  3. Limit the number of problem games fixed per run (e.g., 10) to avoid overwhelming the queue
  4. Log which games are being fixed and why
  5. Update tests to cover the new functionality

Benefits

  • Automatic self-healing for data quality issues
  • Reduces manual intervention needed
  • Ensures admin dashboard "Problem Games" section stays clean
  • Catches games that may have had API failures during initial fetch

Notes

  • Should run BEFORE the regular incomplete games processing
  • Should respect the same "skip if live games exist" logic
  • Should handle API errors gracefully (don't crash the entire orchestrator run)
  • Consider adding metrics/alerts for persistent problem games
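
The "handle API errors gracefully" note could be met by isolating each game's refetch, so one failure is logged and the run moves on. The sketch below is an illustration only: refetch_each is a hypothetical helper name, and the block passed to it stands in for the call to update_game_state.

```ruby
require "logger"

# Hypothetical helper (not part of the existing orchestrator): yields each
# game to the block and rescues per-game failures so that one bad game
# cannot abort the rest of the run.
def refetch_each(games, logger:)
  failed = []
  games.each do |game|
    begin
      yield game
    rescue StandardError => e
      # Log and keep going; the failed game is reported back to the caller.
      logger.warn("BackfillOrchestrator: refetch failed for #{game.inspect}: #{e.class}: #{e.message}")
      failed << game
    end
  end
  failed
end
```

Inside check_problem_games this might be called as refetch_each(problem_games, logger: logger) { |game| update_game_state(game) }. The returned list of failures could also feed the metrics/alerts idea for persistent problem games.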

Related

  • Issue #4: Admin dashboard bug fix (now accurately identifies problem games)
  • app/services/backfill_orchestrator.rb
  • app/controllers/api/v1/admin/seasons_controller.rb - find_problem_games method for reference
Actions #1

Updated by William Lang about 2 months ago

Real Production Example Found

Game 2023020775 (NSH @ OTT, 2024-01-29) has goalie_stats stuck in processing:

Status: processing
Fetched At: nil
Created At: 2025-12-29 12:07:43 UTC (2 days ago)
Updated At: 2025-12-31 04:12:02 UTC (2 minutes ago)
Fetch Attempts: 0
Is stale (>5 min): false

Root Cause

The FeatureManager checks game_feature.updated_at < 5.minutes.ago to detect stale "processing" features (line 57 in feature_manager.rb). However:

  1. Feature was stuck in "processing" for 2 days
  2. Something recently touched the record (updated_at was refreshed)
  3. This reset the stale timer, so it's no longer detected as stale
  4. BackfillOrchestrator won't retry it until 5 minutes pass without another update

The Bug

Using updated_at to determine staleness is unreliable because:

  • Any code that touches the record resets the timer
  • Updates unrelated to fetch status can mask stuck features
  • Features can be stuck indefinitely if touched periodically
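
The masking effect can be reproduced without Rails. This is a simplified sketch: Feature is a stand-in Struct rather than the real GameFeature model, the timestamps come from the production example above, the "now" value assumes a point about 2 minutes after Updated At, and the initial updated_at is assumed equal to created_at for illustration.

```ruby
# Minimal stand-in for a GameFeature row; only the fields needed here.
Feature = Struct.new(:status, :created_at, :updated_at, keyword_init: true)

STALE_AFTER = 5 * 60 # seconds, mirroring the 5-minute window

# The check as described in this issue: staleness judged from updated_at.
def stale_by_updated_at?(feature, now:)
  feature.status == "processing" && feature.updated_at < now - STALE_AFTER
end

now = Time.utc(2025, 12, 31, 4, 14, 2)
feature = Feature.new(
  status: "processing",
  created_at: Time.utc(2025, 12, 29, 12, 7, 43),
  updated_at: Time.utc(2025, 12, 29, 12, 7, 43)
)

before_touch = stale_by_updated_at?(feature, now: now) # stuck for days: detected

# An unrelated write refreshes updated_at and resets the stale timer.
feature.updated_at = Time.utc(2025, 12, 31, 4, 12, 2)
after_touch = stale_by_updated_at?(feature, now: now)  # same stuck feature: missed
```

Here before_touch is true and after_touch is false, even though the feature never left "processing".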

Better Approach

Instead of using updated_at, track when the feature was marked as "processing":

  1. Add a processing_started_at timestamp to GameFeature
  2. Set it when status changes to "processing"
  3. Check processing_started_at < 5.minutes.ago for staleness
  4. This won't be affected by unrelated updates
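
A plain-Ruby sketch of the processing_started_at idea follows. TrackedFeature and its methods are hypothetical stand-ins for GameFeature, and processing_started_at is the proposed column, which does not exist in the schema yet; in the real app this would be a migration plus a model change.

```ruby
# Stand-in for GameFeature with the proposed processing_started_at field.
TrackedFeature = Struct.new(:status, :processing_started_at, :updated_at, keyword_init: true) do
  # Record the moment the feature entered "processing".
  def start_processing!(now)
    self.status = "processing"
    self.processing_started_at = now
    self.updated_at = now
  end

  # An unrelated write: bumps updated_at but leaves processing_started_at alone.
  def touch!(now)
    self.updated_at = now
  end

  # Stale when the feature has been "processing" for longer than the window.
  def stale?(now, window: 5 * 60)
    status == "processing" &&
      !processing_started_at.nil? &&
      processing_started_at < now - window
  end
end
```

Because touch! never changes processing_started_at, a periodically touched feature would still be reported as stale once the window has passed.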

Alternatively, check against created_at if fetched_at is nil:

if game_feature.status == 'processing'
  # Baseline: fetched_at if set, otherwise created_at
  baseline_timestamp = game_feature.fetched_at || game_feature.created_at
  if baseline_timestamp && baseline_timestamp < 5.minutes.ago
    # Feature is stale
  end
end

Verification Command

docker -H ssh://git@v2.icydata.hockey exec icydata-ror-deploy-web-1 bin/rails runner "..."
Actions #2

Updated by William Lang about 2 months ago

Starting work on fixing stale processing detection in FeatureManager.

Actions #3

Updated by William Lang about 2 months ago

Fix Implemented

Root Cause:
FeatureManager was using updated_at to detect stale "processing" features, but any code touching the GameFeature record would reset this timestamp, preventing detection of truly stuck features.

Solution:
Changed staleness detection to use a baseline timestamp that won't be affected by unrelated updates:

baseline_timestamp = game_feature.fetched_at || game_feature.created_at

Logic:

  • For first fetch (fetched_at is nil): Use created_at as baseline
  • For refetch (fetched_at exists): Use fetched_at as baseline
  • Check if baseline_timestamp < 5.minutes.ago to detect staleness
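
As a worked example of the logic above, here is a minimal plain-Ruby sketch. FeatureTimes and stale_processing? are illustrative names, not the actual code in feature_manager.rb, and the window is written in seconds because 5.minutes.ago needs ActiveSupport.

```ruby
# Only the fields the baseline rule needs.
FeatureTimes = Struct.new(:status, :fetched_at, :created_at, keyword_init: true)

# fetched_at when a previous fetch exists, otherwise created_at.
def baseline_timestamp(feature)
  feature.fetched_at || feature.created_at
end

# Stale when a "processing" feature's baseline is older than the window.
def stale_processing?(feature, now:, window: 5 * 60)
  return false unless feature.status == "processing"

  baseline = baseline_timestamp(feature)
  !baseline.nil? && baseline < now - window
end
```

With the production values (fetched_at nil, created_at 2025-12-29 12:07:43 UTC), the check returns true at any time more than 5 minutes after creation, regardless of updated_at. One consequence that may be worth checking against the real code: for a refetch, the baseline is the previous fetch time, so a feature that has just re-entered "processing" could look stale right away if its last fetch was more than 5 minutes earlier.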

Files Modified:

  • app/services/feature_manager.rb (lines 55-82)

Testing:

  • ✅ Simulated production scenario:
    • OLD LOGIC: Would NOT detect stuck feature (updated 2 min ago)
    • NEW LOGIC: DETECTED! (created 48 hours ago)
  • ✅ All FeatureManager tests pass (8 tests, 14 assertions, 0 failures)

Example:
Game 2023020775 with goalie_stats stuck for 48 hours will now be detected and retried by BackfillOrchestrator.

Impact:

  • BackfillOrchestrator will now detect truly stuck features
  • Features stuck for days won't be masked by recent timestamp updates
  • Automatic self-healing for stuck processing features

Next Steps:

  • Deploy to production
  • Monitor BackfillOrchestrator logs for stuck feature detection
  • Verify game 2023020775 gets auto-fixed