# Architecture Review: Legacy vs New — Critical Infrastructure Improvements

> Pre-implementation review. This system controls traffic/tunnel cameras in critical infrastructure. Every failure mode must be addressed. The system may run on Windows, Linux, or Android tablets in the future.

## 1. Side-by-Side Failure Mode Comparison

### 1.1 Camera Server Unreachable

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | Driver `IsConnected` check every 2 seconds | HTTP timeout (5s) | Legacy better — faster detection |
| Recovery | `CameraServerDriverReconnectService` retries every 2s | **None** — user must click retry button | **Critical gap** |
| Partial failure | Skips disconnected drivers, other servers still work | Each bridge is independent — OK | Equal |
| State on reconnect | Reloads media channels, fires `DriverConnected` event | No state resync after reconnect | **Gap** |

### 1.2 Coordination Layer Down (AppServer / PRIMARY)

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | SignalR built-in disconnect detection | Not implemented yet | Equal (both need this) |
| Recovery | SignalR auto-reconnect: 0s, 5s, 10s, 15s fixed delays | Not implemented yet | To be built |
| Degraded mode | CrossSwitch/PTZ work, locks/sequences don't | Same design — correct | Equal |
| State on reconnect | Hub client calls `GetLockedCameraIds()`, `GetRunningSequences()` | Not implemented yet | Must match |

### 1.3 Network Failure

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | `NetworkAvailabilityWorker` polls every 5s (checks NIC status) | **None** — no network detection | **Critical gap** |
| UI feedback | `NetworkAvailabilityState` updates UI commands | Connection status bar (manual) | **Gap** |
| Recovery | Automatic — reconnect services activate when NIC comes back | **Manual only** — user clicks retry | **Critical gap** |

### 1.4 Bridge Process Crash

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | N/A (SDK was in-process) | HTTP timeout → connection status false | OK |
| Recovery | N/A (app restarts) | **None** — bridge stays dead | **Critical gap** |
| Prevention | N/A | Process supervision needed | Must add |

### 1.5 Flutter App Crash

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Recovery | App restarts, reconnects in ~5s | App restarts, must reinitialize | Equal |
| State recovery | Queries AppServer for locks, sequences, viewer states | Queries bridges for monitor states, alarms | Equal |
| Lock state | Restored via `GetLockedCameraIds()` | Restored from coordination service | Equal |

## 2. Critical Improvements Required

### 2.1 Automatic Reconnection (MUST HAVE)

The legacy system reconnects automatically at every level. Our Flutter app does not. For tunnel/traffic camera control, an operator cannot be expected to click a retry button during an emergency.
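As a minimal Dart sketch of what such an auto-reconnect loop could look like in the Flutter app — the class and the `openWebSocket`/`resyncState` callbacks are illustrative assumptions, not the existing `BridgeService` API:

```dart
// Sketch only: exponential-backoff reconnect loop with jitter.
// `openWebSocket` and `resyncState` are assumed callbacks, not existing APIs.
import 'dart:async';
import 'dart:math';

class AutoReconnector {
  final Future<void> Function() openWebSocket;
  final Future<void> Function() resyncState;
  final _rng = Random();

  AutoReconnector({required this.openWebSocket, required this.resyncState});

  /// 1s, 2s, 4s, 8s, ... capped at 30s, plus up to 20% jitter so several
  /// keyboards don't hammer a recovering bridge in lockstep.
  Duration _delayFor(int attempt) {
    final seconds = min(30, 1 << min(attempt, 5));
    final jitterMs = (_rng.nextDouble() * seconds * 200).round();
    return Duration(seconds: seconds, milliseconds: jitterMs);
  }

  /// Retries until the connection is up, then resyncs state (never trust
  /// cached state after a reconnect) before returning.
  Future<void> reconnect() async {
    for (var attempt = 0; ; attempt++) {
      try {
        await openWebSocket();
        await resyncState();
        return;
      } catch (_) {
        await Future<void>.delayed(_delayFor(attempt));
      }
    }
  }
}
```

The same loop serves all layers below; only the `openWebSocket`/`resyncState` callbacks differ per target (bridge vs coordinator).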
**Required reconnection layers:**

```
Layer 1: Bridge Health Polling
  Flutter → periodic GET /health to each bridge
  If bridge was down and comes back → auto-reconnect WebSocket + resync state

Layer 2: WebSocket Auto-Reconnect
  On disconnect → exponential backoff retry (1s, 2s, 4s, 8s, max 30s)
  On reconnect → resync state from bridge

Layer 3: Coordination Auto-Reconnect
  On PRIMARY disconnect → retry connection with backoff
  After 6s → STANDBY promotion (if configured)
  On reconnect to (new) PRIMARY → resync lock/sequence state

Layer 4: Network Change Detection
  Monitor network interface status
  On network restored → trigger reconnection at all layers
```

**Legacy equivalent:**

- Camera drivers: 2-second reconnect loop (`CameraServerDriverReconnectService`)
- SignalR: built-in auto-reconnect with `HubRetryPolicy` (0s, 5s, 10s, 15s)
- Network: 5-second NIC polling (`NetworkAvailabilityWorker`)

### 2.2 Process Supervision (MUST HAVE)

Every .NET process (bridges + coordination service) must auto-restart on crash. An operator should never have to SSH into a machine to restart a bridge.

| Platform | Supervision Method |
|----------|--------------------|
| Windows | Windows Service (via `Microsoft.Extensions.Hosting.WindowsServices`) or NSSM |
| Linux | systemd units with `Restart=always` |
| Docker | `restart: always` policy |
| Android tablet | Bridges run on server, not locally |

**Proposed process tree:**

```
LattePanda Sigma (per keyboard)
├── copilot-geviscope-bridge.service   (auto-restart)
├── copilot-gcore-bridge.service       (auto-restart)
├── copilot-geviserver-bridge.service  (auto-restart)
├── copilot-coordinator.service        (auto-restart, PRIMARY only)
└── copilot-keyboard.service           (auto-restart, Flutter desktop)
    or browser tab (Flutter web)
```

### 2.3 Health Monitoring Dashboard (SHOULD HAVE)

The operator must see at a glance what's working and what's not.
```
┌──────────────────────────────────────────────────────────┐
│ System Status                                            │
│ ┌────────────┐ ┌────────────┐ ┌────────────────────┐     │
│ │ GeViScope  │ │ G-Core     │ │ Coordination       │     │
│ │ ● Online   │ │ ● Online   │ │ ● PRIMARY active   │     │
│ │ 12 cams    │ │ 8 cams     │ │ 2 keyboards        │     │
│ │ 6 viewers  │ │ 4 viewers  │ │ 1 lock active      │     │
│ └────────────┘ └────────────┘ └────────────────────┘     │
│                                                          │
│ ⚠ G-Core bridge reconnecting (attempt 3/∞)               │
└──────────────────────────────────────────────────────────┘
```

### 2.4 Command Retry with Idempotency (SHOULD HAVE)

Critical commands (CrossSwitch) should retry on transient failure:

```dart
Future<bool> viewerConnectLive(int viewer, int channel) async {
  for (int attempt = 1; attempt <= 3; attempt++) {
    try {
      final response = await _client.post('/viewer/connect-live', ...);
      if (response.statusCode == 200) return true;
      // Non-200 response falls through and is retried on the next iteration.
    } catch (e) {
      if (attempt == 3) rethrow;
      await Future.delayed(Duration(milliseconds: 200 * attempt));
    }
  }
  return false;
}
```

PTZ commands should NOT retry (they're continuous — a stale retry would cause unexpected movement).

### 2.5 State Verification After Reconnection (MUST HAVE)

After any reconnection event, the app must not trust its cached state:

```
On bridge reconnect:
  1. Query GET /monitors → rebuild monitor state
  2. Query GET /alarms/active → rebuild alarm state
  3. Re-subscribe WebSocket events

On coordination reconnect:
  1. Query locks → rebuild lock state
  2. Query running sequences → update sequence state
  3. Re-subscribe lock/sequence change events
```

Legacy does this: `ViewerStatesInitWorker` rebuilds viewer state on startup/reconnect. `ConfigurationService.OnChangeAvailability` resyncs config when AppServer comes back.

## 3. Platform Independence Analysis

### 3.1 Current Platform Assumptions

| Component | Current Assumption | Future Need |
|-----------|-------------------|-------------|
| C# Bridges | Run locally on Windows (LattePanda) | Linux, Docker, remote server |
| Flutter App | Windows desktop or browser | Linux, Android tablet, browser |
| Coordination | Runs on PRIMARY keyboard (Windows) | Linux, Docker, any host |
| Hardware I/O | USB Serial + HID on local machine | Remote keyboard via network, or Bluetooth |
| Bridge URLs | `http://localhost:7720` | `http://192.168.x.y:7720` (already configurable) |

### 3.2 Architecture for Platform Independence

```mermaid
graph TB
    subgraph "Deployment A: LattePanda (Current)"
        LP_App["Flutter Desktop"]
        LP_Bridge1["GeViScope Bridge"]
        LP_Bridge2["G-Core Bridge"]
        LP_Coord["Coordinator"]
        LP_Serial["USB Serial/HID"]
        LP_App --> LP_Bridge1
        LP_App --> LP_Bridge2
        LP_App --> LP_Coord
        LP_Serial --> LP_App
    end
    subgraph "Deployment B: Android Tablet (Future)"
        AT_App["Flutter Android"]
        AT_BT["Bluetooth Keyboard"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge1["Bridge on Server"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge2["Bridge on Server"]
        AT_App -->|"WebSocket"| Remote_Coord["Coordinator on Server"]
        AT_BT --> AT_App
    end
    subgraph "Deployment C: Linux Kiosk (Future)"
        LX_App["Flutter Linux"]
        LX_Bridge1["GeViScope Bridge"]
        LX_Bridge2["G-Core Bridge"]
        LX_Coord["Coordinator"]
        LX_Serial["USB Serial/HID"]
        LX_App --> LX_Bridge1
        LX_App --> LX_Bridge2
        LX_App --> LX_Coord
        LX_Serial --> LX_App
    end
    Remote_Bridge1 --> CS1["Camera Server 1"]
    Remote_Bridge2 --> CS2["Camera Server 2"]
    LP_Bridge1 --> CS1
    LP_Bridge2 --> CS2
    LX_Bridge1 --> CS1
    LX_Bridge2 --> CS2
```

### 3.3 Key Design Rules for Platform Independence

1. **Flutter app never assumes bridges are on localhost.** Bridge URLs come from `servers.json`. Already the case.
2. **Bridges are deployable anywhere .NET 8 runs.** Currently Windows x86/x64. Must also build for linux-x64 and linux-arm64.
3. **Coordination service is just another network service.** Flutter app connects to it like a bridge — via configured URL.
4. **Hardware I/O is abstracted behind a service interface.** `KeyboardService` interface has platform-specific implementations:
   - `NativeSerialKeyboardService` (desktop with USB)
   - `WebSerialKeyboardService` (browser with Web Serial API)
   - `BluetoothKeyboardService` (tablet with BT keyboard, future)
   - `EmulatedKeyboardService` (development/testing)
5. **No platform-specific code in business logic.** All platform differences are in the service layer, injected via DI.

## 4. Coordination Service Design (Option B)

### 4.1 Service Overview

A minimal .NET 8 ASP.NET Core application (~400 lines) running on the PRIMARY keyboard:

```
copilot-coordinator/
├── Program.cs                 # Minimal API setup, WebSocket, endpoints
├── Services/
│   ├── LockManager.cs         # Camera lock state (ported from legacy CameraLocksService)
│   ├── SequenceRunner.cs      # Sequence execution (ported from legacy SequenceService)
│   └── KeyboardRegistry.cs    # Track connected keyboards
├── Models/
│   ├── CameraLock.cs          # Lock state model
│   ├── SequenceState.cs       # Running sequence model
│   └── Messages.cs            # WebSocket message types
└── appsettings.json           # Lock timeout, heartbeat interval config
```

### 4.2 REST API

```
GET  /health                                          → Service health
GET  /status                                          → Connected keyboards, active locks, sequences
POST /locks/try      {cameraId, keyboardId, priority} → Acquire lock
POST /locks/release  {cameraId, keyboardId}           → Release lock
POST /locks/takeover {cameraId, keyboardId, priority} → Request takeover
POST /locks/confirm  {cameraId, keyboardId, confirm}  → Confirm/reject takeover
POST /locks/reset    {cameraId, keyboardId}           → Reset expiration
GET  /locks                                           → All active locks
GET  /locks/{keyboardId}                              → Locks held by keyboard
POST /sequences/start {viewerId, sequenceId}          → Start sequence
POST /sequences/stop  {viewerId}                      → Stop sequence
GET  /sequences/running                               → Active sequences
WS   /ws                                              → Real-time events
```

### 4.3 WebSocket Events (broadcast to all connected keyboards)

```json
{"type": "lock_acquired", "cameraId": 5, "keyboardId": "KB1", "expiresAt": "..."}
{"type": "lock_released", "cameraId": 5}
{"type": "lock_expiring", "cameraId": 5, "keyboardId": "KB1", "expiresIn": 60}
{"type": "lock_takeover", "cameraId": 5, "from": "KB1", "to": "KB2"}
{"type": "sequence_started", "viewerId": 1001, "sequenceId": 3}
{"type": "sequence_stopped", "viewerId": 1001}
{"type": "keyboard_online", "keyboardId": "KB2"}
{"type": "keyboard_offline", "keyboardId": "KB2"}
{"type": "heartbeat"}
```

### 4.4 Failover (Configured STANDBY)

```
keyboards.json:
{
  "keyboards": [
    {"id": "KB1", "role": "PRIMARY", "coordinatorPort": 8090},
    {"id": "KB2", "role": "STANDBY", "coordinatorPort": 8090}
  ]
}
```

- PRIMARY starts coordinator service on `:8090`
- STANDBY monitors PRIMARY's `/health` endpoint
- If PRIMARY unreachable for 6 seconds → STANDBY starts its own coordinator
- When old PRIMARY recovers → checks if another coordinator is running → defers (becomes STANDBY)
- Lock state after failover: **empty** (locks expire naturally in ≤5 minutes, same as legacy AppServer restart behavior)

## 5. Improvement Summary: Legacy vs New

### What the New System Does BETTER

| Improvement | Detail |
|-------------|--------|
| No central server hardware | Coordinator runs on keyboard, not separate machine |
| Alarm reliability | Query + Subscribe + Periodic sync (legacy had event-only + hourly refresh) |
| Direct command path | CrossSwitch/PTZ bypass coordinator entirely (legacy routed some through AppServer) |
| Multiplatform | Flutter + .NET 8 run on Windows, Linux, Android. Legacy was Windows-only WPF |
| No SDK dependency in UI | Bridges abstract SDKs behind REST. UI never touches native code |
| Independent operation | Each keyboard works standalone for critical ops. Legacy needed AppServer for several features |
| Deployable anywhere | Bridges + coordinator can run on any server, not just the keyboard |

### What the New System Must MATCH (Currently Missing)

| Legacy Feature | Legacy Implementation | New Implementation Needed |
|---------------|----------------------|---------------------------|
| Auto-reconnect to camera servers | 2-second periodic retry service | Bridge health polling + WebSocket auto-reconnect |
| Auto-reconnect to AppServer | SignalR built-in (0s, 5s, 10s, 15s) | Coordinator WebSocket auto-reconnect with backoff |
| Network detection | 5-second NIC polling worker | `connectivity_plus` package or periodic health checks |
| State resync on reconnect | `ViewerStatesInitWorker`, config resync on availability change | Query bridges + coordinator on any reconnect event |
| Graceful partial failure | `Parallel.ForEach` with per-driver try-catch | Already OK (each bridge independent) |
| Process watchdog | Windows Service | systemd / Windows Service / Docker restart policy |
| Media channel refresh | 10-minute periodic refresh | Periodic bridge status query |

### What the New System Should Do BETTER THAN Legacy

| Improvement | Legacy Gap | New Approach |
|-------------|-----------|--------------|
| Exponential backoff | Fixed delays (0, 5, 10, 15s) — no backoff | Exponential: 1s, 2s, 4s, 8s, max 30s with jitter |
| Circuit breaker | None — retries forever even if server is gone | After N failures, back off to slow polling (60s) |
| Command retry | None — single attempt | Retry critical commands (CrossSwitch) 3x with 200ms delay |
| Health visibility | Hidden in logs | Operator-facing status dashboard in UI |
| Structured logging | Basic ILogger | JSON structured logging → ELK (already in design) |
| Graceful degradation UI | Commands silently disabled | Clear visual indicator: "Degraded mode — locks unavailable" |

## 6. Proposed Resilience Architecture

```mermaid
graph TB
    subgraph "Flutter App"
        UI["UI Layer"]
        BLoCs["BLoC Layer"]
        RS["ReconnectionService"]
        HS["HealthService"]
        BS["BridgeService"]
        CS["CoordinationClient"]
        KS["KeyboardService"]
    end
    subgraph "Health & Reconnection"
        RS -->|"periodic /health"| Bridge1["GeViScope Bridge"]
        RS -->|"periodic /health"| Bridge2["G-Core Bridge"]
        RS -->|"periodic /health"| Coord["Coordinator"]
        RS -->|"on failure"| BS
        RS -->|"on failure"| CS
        HS -->|"status stream"| BLoCs
    end
    subgraph "Normal Operation"
        BS -->|"REST commands"| Bridge1
        BS -->|"REST commands"| Bridge2
        BS -->|"WebSocket events"| Bridge1
        BS -->|"WebSocket events"| Bridge2
        CS -->|"REST + WebSocket"| Coord
    end
    BLoCs --> UI
    KS -->|"Serial/HID"| BLoCs
```

**New services needed in Flutter app:**

| Service | Responsibility |
|---------|---------------|
| `ReconnectionService` | Polls bridge `/health` endpoints, auto-reconnects WebSocket, triggers state resync |
| `HealthService` | Aggregates health of all bridges + coordinator, exposes stream to UI |
| `CoordinationClient` | REST + WebSocket client to coordinator (locks, sequences, heartbeat) |

## 7. Action Items Before Implementation

- [ ] **Create coordination service** (.NET 8 minimal API, ~400 lines)
- [ ] **Add `ReconnectionService`** to Flutter app (exponential backoff, health polling)
- [ ] **Add `HealthService`** to Flutter app (status aggregation for UI)
- [ ] **Add `CoordinationClient`** to Flutter app (locks, sequences)
- [ ] **Fix WebSocket auto-reconnect** in `BridgeService`
- [ ] **Add command retry** for CrossSwitch (3x with backoff)
- [ ] **Add bridge process supervision** (systemd/Windows Service configs)
- [ ] **Add state resync** on every reconnect event
- [ ] **Build health status UI** component
- [ ] **Update `servers.json`** schema to include coordinator URL
- [ ] **Build for Linux** — verify .NET 8 bridges compile for linux-x64
- [ ] **Abstract keyboard input** behind `KeyboardService` interface with platform impls
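For the `servers.json` action item above, the extended schema might look like the sketch below. Only the bridge URLs and ports 7720/8090 come from this document; every other field name and the concrete IP addresses are assumptions to be settled during implementation:

```json
{
  "bridges": [
    { "id": "geviscope", "url": "http://192.168.1.10:7720" },
    { "id": "gcore",     "url": "http://192.168.1.10:7721" }
  ],
  "coordinator": {
    "url": "http://192.168.1.10:8090",
    "failoverUrl": "http://192.168.1.11:8090"
  },
  "health": {
    "pollIntervalSeconds": 5,
    "timeoutSeconds": 5
  }
}
```

Keeping the coordinator and failover URLs in the same file as the bridge URLs preserves design rule 3.3.1: the app reads every endpoint from configuration and never assumes localhost.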