COPILOT/Docs/legacy-architecture/architecture-review.md
klas 40143734fc Initial commit: COPILOT D6 Flutter keyboard controller
Flutter web app replacing legacy WPF CCTV surveillance keyboard controller.
Includes wall overview, section view with monitor grid, camera input,
PTZ control, alarm/lock/sequence BLoCs, and legacy-matching UI styling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 14:57:38 +01:00


# Architecture Review: Legacy vs New — Critical Infrastructure Improvements

Pre-implementation review. This system controls traffic/tunnel cameras in critical infrastructure. Every failure mode must be addressed. The system may run on Windows, Linux, or Android tablets in the future.

## 1. Side-by-Side Failure Mode Comparison

### 1.1 Camera Server Unreachable

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|---|---|---|---|
| Detection | Driver `IsConnected` check every 2 seconds | HTTP timeout (5s) | Legacy better — faster detection |
| Recovery | `CameraServerDriverReconnectService` retries every 2s | None — user must click retry button | Critical gap |
| Partial failure | Skips disconnected drivers, other servers still work | Each bridge is independent — OK | Equal |
| State on reconnect | Reloads media channels, fires `DriverConnected` event | No state resync after reconnect | Gap |

### 1.2 Coordination Layer Down (AppServer / PRIMARY)

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|---|---|---|---|
| Detection | SignalR built-in disconnect detection | Not implemented yet | Equal (both need this) |
| Recovery | SignalR auto-reconnect: 0s, 5s, 10s, 15s fixed delays | Not implemented yet | To be built |
| Degraded mode | CrossSwitch/PTZ work, locks/sequences don't | Same design — correct | Equal |
| State on reconnect | Hub client calls `GetLockedCameraIds()`, `GetRunningSequences()` | Not implemented yet | Must match |

### 1.3 Network Failure

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|---|---|---|---|
| Detection | `NetworkAvailabilityWorker` polls every 5s (checks NIC status) | None — no network detection | Critical gap |
| UI feedback | `NetworkAvailabilityState` updates UI commands | Connection status bar (manual) | Gap |
| Recovery | Automatic — reconnect services activate when NIC comes back | Manual only — user clicks retry | Critical gap |

### 1.4 Bridge Process Crash

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|---|---|---|---|
| Detection | N/A (SDK was in-process) | HTTP timeout → connection status false | OK |
| Recovery | N/A (app restarts) | None — bridge stays dead | Critical gap |
| Prevention | N/A | Process supervision needed | Must add |

### 1.5 Flutter App Crash

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|---|---|---|---|
| Recovery | App restarts, reconnects in ~5s | App restarts, must reinitialize | Equal |
| State recovery | Queries AppServer for locks, sequences, viewer states | Queries bridges for monitor states, alarms | Equal |
| Lock state | Restored via `GetLockedCameraIds()` | Restored from coordination service | Equal |

## 2. Critical Improvements Required

### 2.1 Automatic Reconnection (MUST HAVE)

The legacy system reconnects automatically at every level. Our Flutter app does not. For tunnel/traffic camera control, an operator cannot be expected to click a retry button during an emergency.

Required reconnection layers:

```
Layer 1: Bridge Health Polling
  Flutter → periodic GET /health to each bridge
  If bridge was down and comes back → auto-reconnect WebSocket + resync state

Layer 2: WebSocket Auto-Reconnect
  On disconnect → exponential backoff retry (1s, 2s, 4s, 8s, max 30s)
  On reconnect → resync state from bridge

Layer 3: Coordination Auto-Reconnect
  On PRIMARY disconnect → retry connection with backoff
  After 6s → STANDBY promotion (if configured)
  On reconnect to (new) PRIMARY → resync lock/sequence state

Layer 4: Network Change Detection
  Monitor network interface status
  On network restored → trigger reconnection at all layers
```
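The Layer 2 schedule above can be made concrete. A minimal sketch, in Python purely for illustration (the real logic belongs in the Flutter `ReconnectionService`; the jitter parameter is an assumption beyond the schedule as stated):

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                    jitter: float = 0.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, 8s, ..., capped at 30s.

    `attempt` is 1-based. A non-zero `jitter` spreads each delay by up
    to +/- that fraction, so many keyboards recovering from the same
    outage do not reconnect in lockstep.
    """
    delay = min(base * 2 ** (attempt - 1), cap)
    if jitter:
        delay += delay * random.uniform(-jitter, jitter)
    return delay
```

Attempts 1 through 4 yield 1s, 2s, 4s and 8s; from attempt 6 onward the delay stays pinned at the 30s cap.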

Legacy equivalent:

- Camera drivers: 2-second reconnect loop (`CameraServerDriverReconnectService`)
- SignalR: built-in auto-reconnect with `HubRetryPolicy` (0s, 5s, 10s, 15s)
- Network: 5-second NIC polling (`NetworkAvailabilityWorker`)

### 2.2 Process Supervision (MUST HAVE)

Every .NET process (bridges + coordination service) must auto-restart on crash. An operator should never have to SSH into a machine to restart a bridge.

| Platform | Supervision Method |
|---|---|
| Windows | Windows Service (via `Microsoft.Extensions.Hosting.WindowsServices`) or NSSM |
| Linux | systemd units with `Restart=always` |
| Docker | `restart: always` policy |
| Android tablet | Bridges run on server, not locally |

Proposed process tree:

```
LattePanda Sigma (per keyboard)
├── copilot-geviscope-bridge.service    (auto-restart)
├── copilot-gcore-bridge.service        (auto-restart)
├── copilot-geviserver-bridge.service   (auto-restart)
├── copilot-coordinator.service         (auto-restart, PRIMARY only)
└── copilot-keyboard.service            (auto-restart, Flutter desktop)
    or browser tab (Flutter web)
```
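For the Linux row, each bridge gets a unit with `Restart=always`. A sketch of one such unit, using a service name from the process tree above; the description, install path and binary name are placeholders, not the actual deployment layout:

```ini
# /etc/systemd/system/copilot-gcore-bridge.service
# Sketch only: ExecStart path and binary name are placeholders.
[Unit]
Description=COPILOT G-Core camera bridge
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/copilot/bridges/gcore-bridge
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target
```

`RestartSec=2` mirrors the legacy 2-second reconnect cadence; `Restart=always` covers crashes as well as clean exits.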

### 2.3 Health Monitoring Dashboard (SHOULD HAVE)

The operator must see at a glance what's working and what's not.

```
┌──────────────────────────────────────────────────────────┐
│  System Status                                            │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐  │
│  │ GeViScope  │  │  G-Core    │  │  Coordination      │  │
│  │  ● Online  │  │  ● Online  │  │  ● PRIMARY active  │  │
│  │  12 cams   │  │  8 cams    │  │  2 keyboards       │  │
│  │  6 viewers │  │  4 viewers │  │  1 lock active     │  │
│  └────────────┘  └────────────┘  └────────────────────┘  │
│                                                           │
│  ⚠ G-Core bridge reconnecting (attempt 3/∞)              │
└──────────────────────────────────────────────────────────┘
```
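Behind a dashboard like this sits an aggregation rule. A sketch (Python for illustration) of how a `HealthService` might fold per-component statuses into one overall state; the three-state model is an assumption:

```python
from enum import Enum


class Health(Enum):
    ONLINE = 0
    RECONNECTING = 1
    OFFLINE = 2


def overall_health(components: dict) -> Health:
    """Worst status wins: one reconnecting bridge degrades the overall
    state, one offline component marks the system offline."""
    return max(components.values(), key=lambda h: h.value)


# The dashboard example above: G-Core is mid-reconnect.
status = overall_health({
    "geviscope": Health.ONLINE,
    "gcore": Health.RECONNECTING,
    "coordination": Health.ONLINE,
})
```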

### 2.4 Command Retry with Idempotency (SHOULD HAVE)

Critical commands (CrossSwitch) should retry on transient failure:

```dart
Future<bool> viewerConnectLive(int viewer, int channel) async {
  for (int attempt = 1; attempt <= 3; attempt++) {
    try {
      final response = await _client.post('/viewer/connect-live', ...);
      if (response.statusCode == 200) return true;
    } catch (e) {
      if (attempt == 3) rethrow;
    }
    // Back off before the next attempt (200 ms, then 400 ms), whether
    // the failure was an exception or a non-200 response.
    if (attempt < 3) {
      await Future.delayed(Duration(milliseconds: 200 * attempt));
    }
  }
  return false;
}
```

PTZ commands should NOT retry (they're continuous — a stale retry would cause unexpected movement).
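That retry/no-retry split generalizes to a single helper. A sketch (Python for illustration only), assuming each command is tagged as idempotent (CrossSwitch) or not (PTZ):

```python
import time


def send_with_retry(send, idempotent, attempts=3, base_delay=0.2):
    """Retry only idempotent commands. Continuous commands such as PTZ
    get exactly one attempt, since a stale retry could move a camera
    long after the operator released the joystick."""
    tries = attempts if idempotent else 1
    for attempt in range(1, tries + 1):
        try:
            if send():
                return True
        except OSError:
            if attempt == tries:
                raise
        if attempt < tries:
            time.sleep(base_delay * attempt)  # 200 ms, then 400 ms
    return False
```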

### 2.5 State Verification After Reconnection (MUST HAVE)

After any reconnection event, the app must not trust its cached state:

```
On bridge reconnect:
  1. Query GET /monitors → rebuild monitor state
  2. Query GET /alarms/active → rebuild alarm state
  3. Re-subscribe WebSocket events

On coordination reconnect:
  1. Query locks → rebuild lock state
  2. Query running sequences → update sequence state
  3. Re-subscribe lock/sequence change events
```

Legacy does this: `ViewerStatesInitWorker` rebuilds viewer state on startup/reconnect. `ConfigurationService.OnChangeAvailability` resyncs config when AppServer comes back.
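The bridge half of this checklist, as code. A sketch with a hypothetical `client` wrapper over the bridge REST/WebSocket API; the point is that state is rebuilt from fresh queries, never from the pre-disconnect cache:

```python
def resync_bridge(client):
    """Rebuild all cached bridge state after a reconnect event.

    `client` is a hypothetical wrapper exposing the bridge endpoints
    named above; nothing cached before the disconnect survives.
    """
    monitors = client.get("/monitors")       # 1. rebuild monitor state
    alarms = client.get("/alarms/active")    # 2. rebuild alarm state
    client.resubscribe()                     # 3. re-subscribe WS events
    return {"monitors": monitors, "alarms": alarms}
```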

## 3. Platform Independence Analysis

### 3.1 Current Platform Assumptions

| Component | Current Assumption | Future Need |
|---|---|---|
| C# Bridges | Run locally on Windows (LattePanda) | Linux, Docker, remote server |
| Flutter App | Windows desktop or browser | Linux, Android tablet, browser |
| Coordination | Runs on PRIMARY keyboard (Windows) | Linux, Docker, any host |
| Hardware I/O | USB Serial + HID on local machine | Remote keyboard via network, or Bluetooth |
| Bridge URLs | `http://localhost:7720` | `http://192.168.x.y:7720` (already configurable) |

### 3.2 Architecture for Platform Independence

```mermaid
graph TB
    subgraph "Deployment A: LattePanda (Current)"
        LP_App["Flutter Desktop"]
        LP_Bridge1["GeViScope Bridge"]
        LP_Bridge2["G-Core Bridge"]
        LP_Coord["Coordinator"]
        LP_Serial["USB Serial/HID"]
        LP_App --> LP_Bridge1
        LP_App --> LP_Bridge2
        LP_App --> LP_Coord
        LP_Serial --> LP_App
    end

    subgraph "Deployment B: Android Tablet (Future)"
        AT_App["Flutter Android"]
        AT_BT["Bluetooth Keyboard"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge1["Bridge on Server"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge2["Bridge on Server"]
        AT_App -->|"WebSocket"| Remote_Coord["Coordinator on Server"]
        AT_BT --> AT_App
    end

    subgraph "Deployment C: Linux Kiosk (Future)"
        LX_App["Flutter Linux"]
        LX_Bridge1["GeViScope Bridge"]
        LX_Bridge2["G-Core Bridge"]
        LX_Coord["Coordinator"]
        LX_Serial["USB Serial/HID"]
        LX_App --> LX_Bridge1
        LX_App --> LX_Bridge2
        LX_App --> LX_Coord
        LX_Serial --> LX_App
    end

    Remote_Bridge1 --> CS1["Camera Server 1"]
    Remote_Bridge2 --> CS2["Camera Server 2"]
    LP_Bridge1 --> CS1
    LP_Bridge2 --> CS2
    LX_Bridge1 --> CS1
    LX_Bridge2 --> CS2
```

### 3.3 Key Design Rules for Platform Independence

  1. Flutter app never assumes bridges are on localhost. Bridge URLs come from servers.json. Already the case.

  2. Bridges are deployable anywhere .NET 8 runs. Currently Windows x86/x64. Must also build for Linux x64 and linux-arm64.

  3. Coordination service is just another network service. Flutter app connects to it like a bridge — via configured URL.

  4. Hardware I/O is abstracted behind a service interface. `KeyboardService` has platform-specific implementations:

     - `NativeSerialKeyboardService` (desktop with USB)
     - `WebSerialKeyboardService` (browser with Web Serial API)
     - `BluetoothKeyboardService` (tablet with BT keyboard, future)
     - `EmulatedKeyboardService` (development/testing)

  5. No platform-specific code in business logic. All platform differences are in the service layer, injected via DI.
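Rules 4 and 5 in miniature (Python stand-in for the Dart interfaces; class and method names here are illustrative, not the actual API): business logic sees only the interface, and a single composition root picks the platform implementation.

```python
from abc import ABC, abstractmethod


class KeyboardService(ABC):
    """The hardware-I/O boundary: everything above this interface is
    platform-free business logic."""

    @abstractmethod
    def key_events(self):
        """Yield key events from the physical (or emulated) keyboard."""


class EmulatedKeyboardService(KeyboardService):
    """Development/testing implementation: replays a scripted sequence."""

    def __init__(self, script):
        self._script = list(script)

    def key_events(self):
        yield from self._script


def make_keyboard_service(platform: str) -> KeyboardService:
    """Composition root: the only place that knows about platforms."""
    if platform == "test":
        return EmulatedKeyboardService(["MON", "1", "CAM", "5", "ENTER"])
    raise NotImplementedError(f"no KeyboardService for {platform!r}")
```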

## 4. Coordination Service Design (Option B)

### 4.1 Service Overview

A minimal .NET 8 ASP.NET Core application (~400 lines) running on the PRIMARY keyboard:

```
copilot-coordinator/
├── Program.cs              # Minimal API setup, WebSocket, endpoints
├── Services/
│   ├── LockManager.cs      # Camera lock state (ported from legacy CameraLocksService)
│   ├── SequenceRunner.cs   # Sequence execution (ported from legacy SequenceService)
│   └── KeyboardRegistry.cs # Track connected keyboards
├── Models/
│   ├── CameraLock.cs       # Lock state model
│   ├── SequenceState.cs    # Running sequence model
│   └── Messages.cs         # WebSocket message types
└── appsettings.json        # Lock timeout, heartbeat interval config
```

### 4.2 REST API

```
GET  /health                              → Service health
GET  /status                              → Connected keyboards, active locks, sequences

POST /locks/try         {cameraId, keyboardId, priority}  → Acquire lock
POST /locks/release     {cameraId, keyboardId}             → Release lock
POST /locks/takeover    {cameraId, keyboardId, priority}   → Request takeover
POST /locks/confirm     {cameraId, keyboardId, confirm}    → Confirm/reject takeover
POST /locks/reset       {cameraId, keyboardId}             → Reset expiration
GET  /locks                                                → All active locks
GET  /locks/{keyboardId}                                   → Locks held by keyboard

POST /sequences/start   {viewerId, sequenceId}             → Start sequence
POST /sequences/stop    {viewerId}                          → Stop sequence
GET  /sequences/running                                     → Active sequences

WS   /ws                                                    → Real-time events
```
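The heart of the coordinator is `/locks/try`. A sketch of plausible semantics (Python for illustration; the real rules come from the ported `CameraLocksService`, and the re-entrancy and expiry behavior shown here are assumptions):

```python
import time


class LockManager:
    """Minimal camera-lock table: re-entrant for its holder, expires
    after `timeout` seconds, blocks other keyboards until then.
    Priority/takeover handling from the API above is omitted."""

    def __init__(self, timeout: float = 300.0):  # 5 min, as in section 4.4
        self._timeout = timeout
        self._locks = {}  # cameraId -> (keyboardId, expires_at)

    def try_acquire(self, camera_id: int, keyboard_id: str) -> bool:
        now = time.monotonic()
        held = self._locks.get(camera_id)
        if held and held[0] != keyboard_id and held[1] > now:
            return False  # live lock held by another keyboard
        self._locks[camera_id] = (keyboard_id, now + self._timeout)
        return True

    def release(self, camera_id: int, keyboard_id: str) -> None:
        if self._locks.get(camera_id, (None, 0))[0] == keyboard_id:
            del self._locks[camera_id]
```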

### 4.3 WebSocket Events (broadcast to all connected keyboards)

```json
{"type": "lock_acquired",    "cameraId": 5, "keyboardId": "KB1", "expiresAt": "..."}
{"type": "lock_released",    "cameraId": 5}
{"type": "lock_expiring",    "cameraId": 5, "keyboardId": "KB1", "expiresIn": 60}
{"type": "lock_takeover",    "cameraId": 5, "from": "KB1", "to": "KB2"}
{"type": "sequence_started", "viewerId": 1001, "sequenceId": 3}
{"type": "sequence_stopped", "viewerId": 1001}
{"type": "keyboard_online",  "keyboardId": "KB2"}
{"type": "keyboard_offline", "keyboardId": "KB2"}
{"type": "heartbeat"}
```

### 4.4 Failover (Configured STANDBY)

`keyboards.json`:

```json
{
  "keyboards": [
    {"id": "KB1", "role": "PRIMARY", "coordinatorPort": 8090},
    {"id": "KB2", "role": "STANDBY", "coordinatorPort": 8090}
  ]
}
```
- PRIMARY starts coordinator service on `:8090`
- STANDBY monitors PRIMARY's `/health` endpoint
- If PRIMARY unreachable for 6 seconds → STANDBY starts its own coordinator
- When old PRIMARY recovers → checks if another coordinator is running → defers (becomes STANDBY)
- Lock state after failover: empty (locks expire naturally in ≤5 minutes, same as legacy AppServer restart behavior)
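The 6-second promotion rule reduces to counting consecutive failed probes. A sketch (Python for illustration), assuming the STANDBY polls `/health` on a fixed 2-second interval; the interval itself is an assumption, only the 6-second threshold is specified above:

```python
class StandbyMonitor:
    """Decides when the STANDBY should start its own coordinator."""

    def __init__(self, probe_interval: float = 2.0, threshold: float = 6.0):
        # Promotion after `threshold` seconds of silence, expressed as a
        # count of consecutive failed probes (3 with the defaults).
        self._needed = int(threshold / probe_interval)
        self._failures = 0

    def record_probe(self, primary_healthy: bool) -> bool:
        """Feed one probe result; True means: promote to PRIMARY now."""
        self._failures = 0 if primary_healthy else self._failures + 1
        return self._failures >= self._needed
```

Any successful probe resets the counter, so a brief network blip never triggers a spurious second coordinator.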

## 5. Improvement Summary: Legacy vs New

### What the New System Does BETTER

| Improvement | Detail |
|---|---|
| No central server hardware | Coordinator runs on keyboard, not separate machine |
| Alarm reliability | Query + Subscribe + Periodic sync (legacy had event-only + hourly refresh) |
| Direct command path | CrossSwitch/PTZ bypass coordinator entirely (legacy routed some through AppServer) |
| Multiplatform | Flutter + .NET 8 run on Windows, Linux, Android. Legacy was Windows-only WPF |
| No SDK dependency in UI | Bridges abstract SDKs behind REST. UI never touches native code |
| Independent operation | Each keyboard works standalone for critical ops. Legacy needed AppServer for several features |
| Deployable anywhere | Bridges + coordinator can run on any server, not just the keyboard |

### What the New System Must MATCH (Currently Missing)

| Legacy Feature | Legacy Implementation | New Implementation Needed |
|---|---|---|
| Auto-reconnect to camera servers | 2-second periodic retry service | Bridge health polling + WebSocket auto-reconnect |
| Auto-reconnect to AppServer | SignalR built-in (0s, 5s, 10s, 15s) | Coordinator WebSocket auto-reconnect with backoff |
| Network detection | 5-second NIC polling worker | `connectivity_plus` package or periodic health checks |
| State resync on reconnect | `ViewerStatesInitWorker`, config resync on availability change | Query bridges + coordinator on any reconnect event |
| Graceful partial failure | `Parallel.ForEach` with per-driver try-catch | Already OK (each bridge independent) |
| Process watchdog | Windows Service | systemd / Windows Service / Docker restart policy |
| Media channel refresh | 10-minute periodic refresh | Periodic bridge status query |

### What the New System Should Do BETTER THAN Legacy

| Improvement | Legacy Gap | New Approach |
|---|---|---|
| Exponential backoff | Fixed delays (0, 5, 10, 15s) — no backoff | Exponential: 1s, 2s, 4s, 8s, max 30s with jitter |
| Circuit breaker | None — retries forever even if server is gone | After N failures, back off to slow polling (60s) |
| Command retry | None — single attempt | Retry critical commands (CrossSwitch) 3x with 200ms delay |
| Health visibility | Hidden in logs | Operator-facing status dashboard in UI |
| Structured logging | Basic `ILogger` | JSON structured logging → ELK (already in design) |
| Graceful degradation UI | Commands silently disabled | Clear visual indicator: "Degraded mode — locks unavailable" |
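The circuit-breaker row deserves one concrete rule. A sketch (Python for illustration) combining the exponential schedule with the 60-second slow-poll fallback; the failure threshold of 10 is an assumption standing in for the "N" above:

```python
class ReconnectBreaker:
    """Exponential backoff that trips to slow polling after repeated
    failures, so a decommissioned server is not hammered forever."""

    def __init__(self, max_failures: int = 10, slow_poll: float = 60.0,
                 cap: float = 30.0):
        self._max = max_failures
        self._slow = slow_poll
        self._cap = cap
        self._failures = 0

    def next_delay(self) -> float:
        """Call after each failed attempt; returns the wait before the next."""
        self._failures += 1
        if self._failures > self._max:
            return self._slow          # circuit open: 60s slow polling
        return min(2.0 ** (self._failures - 1), self._cap)

    def on_success(self) -> None:
        self._failures = 0             # circuit closes on any success
```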

## 6. Proposed Resilience Architecture

```mermaid
graph TB
    subgraph "Flutter App"
        UI["UI Layer"]
        BLoCs["BLoC Layer"]
        RS["ReconnectionService"]
        HS["HealthService"]
        BS["BridgeService"]
        CS["CoordinationClient"]
        KS["KeyboardService"]
    end

    subgraph "Health & Reconnection"
        RS -->|"periodic /health"| Bridge1["GeViScope Bridge"]
        RS -->|"periodic /health"| Bridge2["G-Core Bridge"]
        RS -->|"periodic /health"| Coord["Coordinator"]
        RS -->|"on failure"| BS
        RS -->|"on failure"| CS
        HS -->|"status stream"| BLoCs
    end

    subgraph "Normal Operation"
        BS -->|"REST commands"| Bridge1
        BS -->|"REST commands"| Bridge2
        BS -->|"WebSocket events"| Bridge1
        BS -->|"WebSocket events"| Bridge2
        CS -->|"REST + WebSocket"| Coord
    end

    BLoCs --> UI
    KS -->|"Serial/HID"| BLoCs
```

New services needed in Flutter app:

| Service | Responsibility |
|---|---|
| `ReconnectionService` | Polls bridge `/health` endpoints, auto-reconnects WebSocket, triggers state resync |
| `HealthService` | Aggregates health of all bridges + coordinator, exposes stream to UI |
| `CoordinationClient` | REST + WebSocket client to coordinator (locks, sequences, heartbeat) |

## 7. Action Items Before Implementation

- Create coordination service (.NET 8 minimal API, ~400 lines)
- Add `ReconnectionService` to Flutter app (exponential backoff, health polling)
- Add `HealthService` to Flutter app (status aggregation for UI)
- Add `CoordinationClient` to Flutter app (locks, sequences)
- Fix WebSocket auto-reconnect in `BridgeService`
- Add command retry for CrossSwitch (3x with backoff)
- Add bridge process supervision (systemd/Windows Service configs)
- Add state resync on every reconnect event
- Build health status UI component
- Update `servers.json` schema to include coordinator URL
- Build for Linux — verify .NET 8 bridges compile for linux-x64
- Abstract keyboard input behind `KeyboardService` interface with platform impls