# Architecture Review: Legacy vs New — Critical Infrastructure Improvements
> Pre-implementation review. This system controls traffic/tunnel cameras in critical infrastructure. Every failure mode must be addressed. The system may run on Windows, Linux, or Android tablets in the future.
## 1. Side-by-Side Failure Mode Comparison
### 1.1 Camera Server Unreachable
| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | Driver `IsConnected` check every 2 seconds | HTTP timeout (5s) | Legacy better — faster detection |
| Recovery | `CameraServerDriverReconnectService` retries every 2s | **None** — user must click retry button | **Critical gap** |
| Partial failure | Skips disconnected drivers, other servers still work | Each bridge is independent — OK | Equal |
| State on reconnect | Reloads media channels, fires `DriverConnected` event | No state resync after reconnect | **Gap** |
### 1.2 Coordination Layer Down (AppServer / PRIMARY)
| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | SignalR built-in disconnect detection | Not implemented yet | To be built |
| Recovery | SignalR auto-reconnect: 0s, 5s, 10s, 15s fixed delays | Not implemented yet | To be built |
| Degraded mode | CrossSwitch/PTZ work, locks/sequences don't | Same design — correct | Equal |
| State on reconnect | Hub client calls `GetLockedCameraIds()`, `GetRunningSequences()` | Not implemented yet | Must match |
### 1.3 Network Failure
| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | `NetworkAvailabilityWorker` polls every 5s (checks NIC status) | **None** — no network detection | **Critical gap** |
| UI feedback | `NetworkAvailabilityState` updates UI commands | Connection status bar (manual) | **Gap** |
| Recovery | Automatic — reconnect services activate when NIC comes back | **Manual only** — user clicks retry | **Critical gap** |
### 1.4 Bridge Process Crash
| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | N/A (SDK was in-process) | HTTP timeout → connection status false | OK |
| Recovery | N/A (app restarts) | **None** — bridge stays dead | **Critical gap** |
| Prevention | N/A | Process supervision needed | Must add |
### 1.5 Flutter App Crash
| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Recovery | App restarts, reconnects in ~5s | App restarts, must reinitialize | Equal |
| State recovery | Queries AppServer for locks, sequences, viewer states | Queries bridges for monitor states, alarms | Equal |
| Lock state | Restored via `GetLockedCameraIds()` | Restored from coordination service | Equal |
## 2. Critical Improvements Required
### 2.1 Automatic Reconnection (MUST HAVE)
The legacy system reconnects automatically at every level. Our Flutter app does not. For tunnel/traffic camera control, an operator cannot be expected to click a retry button during an emergency.
**Required reconnection layers:**
```
Layer 1: Bridge Health Polling
  Flutter → periodic GET /health to each bridge
  If bridge was down and comes back → auto-reconnect WebSocket + resync state

Layer 2: WebSocket Auto-Reconnect
  On disconnect → exponential backoff retry (1s, 2s, 4s, 8s, max 30s)
  On reconnect → resync state from bridge

Layer 3: Coordination Auto-Reconnect
  On PRIMARY disconnect → retry connection with backoff
  After 6s → STANDBY promotion (if configured)
  On reconnect to (new) PRIMARY → resync lock/sequence state

Layer 4: Network Change Detection
  Monitor network interface status
  On network restored → trigger reconnection at all layers
```
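The Layer 2/3 backoff schedule can be sketched as a pure function. This is a minimal sketch, assuming the schedule above; the function name `reconnectDelay` and the 250 ms jitter range are illustrative, not part of the design:

```dart
import 'dart:math';

/// Delay before reconnect attempt [attempt] (1-based):
/// 1s, 2s, 4s, 8s, 16s, then capped at 30s, plus optional jitter
/// so several keyboards do not retry in lockstep.
Duration reconnectDelay(int attempt, {Random? jitter}) {
  final exp = min(attempt - 1, 5);            // cap the exponent at 2^5 = 32s
  final seconds = min(1 << exp, 30);          // 1, 2, 4, 8, 16, 30, 30, ...
  final jitterMs = jitter?.nextInt(250) ?? 0; // spread simultaneous retries
  return Duration(seconds: seconds, milliseconds: jitterMs);
}
```

Without a `jitter` source the schedule is deterministic, which keeps it easy to unit-test; production code would pass a shared `Random`.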
**Legacy equivalent:**
- Camera drivers: 2-second reconnect loop (`CameraServerDriverReconnectService`)
- SignalR: built-in auto-reconnect with `HubRetryPolicy` (0s, 5s, 10s, 15s)
- Network: 5-second NIC polling (`NetworkAvailabilityWorker`)
### 2.2 Process Supervision (MUST HAVE)
Every .NET process (bridges + coordination service) must auto-restart on crash. An operator should never have to SSH into a machine to restart a bridge.
| Platform | Supervision Method |
|----------|--------------------|
| Windows | Windows Service (via `Microsoft.Extensions.Hosting.WindowsServices`) or NSSM |
| Linux | systemd units with `Restart=always` |
| Docker | `restart: always` policy |
| Android tablet | Bridges run on server, not locally |
**Proposed process tree:**
```
LattePanda Sigma (per keyboard)
├── copilot-geviscope-bridge.service (auto-restart)
├── copilot-gcore-bridge.service (auto-restart)
├── copilot-geviserver-bridge.service (auto-restart)
├── copilot-coordinator.service (auto-restart, PRIMARY only)
└── copilot-keyboard.service (auto-restart, Flutter desktop)
    or browser tab (Flutter web)
```
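For the Linux deployments, one such unit could look like the following sketch. The paths and unit name are illustrative; `Restart=always` with a short `RestartSec` is the key setting:

```ini
# /etc/systemd/system/copilot-gcore-bridge.service (illustrative path/name)
[Unit]
Description=COPILOT G-Core bridge
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/copilot/gcore-bridge/CopilotGCoreBridge
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target
```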
### 2.3 Health Monitoring Dashboard (SHOULD HAVE)
The operator must see at a glance what's working and what's not.
```
┌──────────────────────────────────────────────────────────┐
│  System Status                                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐  │
│  │ GeViScope  │  │ G-Core     │  │ Coordination       │  │
│  │ ● Online   │  │ ● Online   │  │ ● PRIMARY active   │  │
│  │ 12 cams    │  │ 8 cams     │  │ 2 keyboards        │  │
│  │ 6 viewers  │  │ 4 viewers  │  │ 1 lock active      │  │
│  └────────────┘  └────────────┘  └────────────────────┘  │
│                                                          │
│  ⚠ G-Core bridge reconnecting (attempt 3/∞)              │
└──────────────────────────────────────────────────────────┘
```
### 2.4 Command Retry with Idempotency (SHOULD HAVE)
Critical commands (CrossSwitch) should retry on transient failure:
```dart
Future<bool> viewerConnectLive(int viewer, int channel) async {
  for (int attempt = 1; attempt <= 3; attempt++) {
    try {
      final response = await _client.post('/viewer/connect-live', ...);
      if (response.statusCode == 200) return true;
    } catch (e) {
      if (attempt == 3) rethrow; // give up after the third attempt
      await Future.delayed(Duration(milliseconds: 200 * attempt)); // linear backoff
    }
  }
  return false;
}
```
PTZ commands should NOT retry (they're continuous — a stale retry would cause unexpected movement).
### 2.5 State Verification After Reconnection (MUST HAVE)
After any reconnection event, the app must not trust its cached state:
```
On bridge reconnect:
  1. Query GET /monitors → rebuild monitor state
  2. Query GET /alarms/active → rebuild alarm state
  3. Re-subscribe WebSocket events

On coordination reconnect:
  1. Query locks → rebuild lock state
  2. Query running sequences → update sequence state
  3. Re-subscribe lock/sequence change events
```
Legacy does this: `ViewerStatesInitWorker` rebuilds viewer state on startup/reconnect. `ConfigurationService.OnChangeAvailability` resyncs config when AppServer comes back.
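The bridge side of this resync can be sketched as follows. The fetcher/replacer callbacks stand in for the `GET /monitors` and `GET /alarms/active` calls and the app's state stores; re-subscribing WebSocket events (step 3) is left out of this sketch:

```dart
/// Rebuild cached bridge state after a reconnect. All names here are
/// illustrative; the point is the shape: fetch fresh, replace wholesale.
Future<void> resyncBridgeState({
  required Future<List<int>> Function() fetchMonitors,
  required Future<List<int>> Function() fetchActiveAlarms,
  required void Function(List<int>) replaceMonitors,
  required void Function(List<int>) replaceAlarms,
}) async {
  // Never merge with cached state after a reconnect — replace it wholesale.
  replaceMonitors(await fetchMonitors());
  replaceAlarms(await fetchActiveAlarms());
}
```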
## 3. Platform Independence Analysis
### 3.1 Current Platform Assumptions
| Component | Current Assumption | Future Need |
|-----------|-------------------|-------------|
| C# Bridges | Run locally on Windows (LattePanda) | Linux, Docker, remote server |
| Flutter App | Windows desktop or browser | Linux, Android tablet, browser |
| Coordination | Runs on PRIMARY keyboard (Windows) | Linux, Docker, any host |
| Hardware I/O | USB Serial + HID on local machine | Remote keyboard via network, or Bluetooth |
| Bridge URLs | `http://localhost:7720` | `http://192.168.x.y:7720` (already configurable) |
### 3.2 Architecture for Platform Independence
```mermaid
graph TB
    subgraph "Deployment A: LattePanda (Current)"
        LP_App["Flutter Desktop"]
        LP_Bridge1["GeViScope Bridge"]
        LP_Bridge2["G-Core Bridge"]
        LP_Coord["Coordinator"]
        LP_Serial["USB Serial/HID"]
        LP_App --> LP_Bridge1
        LP_App --> LP_Bridge2
        LP_App --> LP_Coord
        LP_Serial --> LP_App
    end
    subgraph "Deployment B: Android Tablet (Future)"
        AT_App["Flutter Android"]
        AT_BT["Bluetooth Keyboard"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge1["Bridge on Server"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge2["Bridge on Server"]
        AT_App -->|"WebSocket"| Remote_Coord["Coordinator on Server"]
        AT_BT --> AT_App
    end
    subgraph "Deployment C: Linux Kiosk (Future)"
        LX_App["Flutter Linux"]
        LX_Bridge1["GeViScope Bridge"]
        LX_Bridge2["G-Core Bridge"]
        LX_Coord["Coordinator"]
        LX_Serial["USB Serial/HID"]
        LX_App --> LX_Bridge1
        LX_App --> LX_Bridge2
        LX_App --> LX_Coord
        LX_Serial --> LX_App
    end
    Remote_Bridge1 --> CS1["Camera Server 1"]
    Remote_Bridge2 --> CS2["Camera Server 2"]
    LP_Bridge1 --> CS1
    LP_Bridge2 --> CS2
    LX_Bridge1 --> CS1
    LX_Bridge2 --> CS2
```
### 3.3 Key Design Rules for Platform Independence
1. **Flutter app never assumes bridges are on localhost.** Bridge URLs come from `servers.json`. Already the case.
2. **Bridges are deployable anywhere .NET 8 runs.** Currently Windows x86/x64. Must also build for linux-x64 and linux-arm64.
3. **Coordination service is just another network service.** Flutter app connects to it like a bridge — via configured URL.
4. **Hardware I/O is abstracted behind a service interface.** `KeyboardService` interface has platform-specific implementations:
- `NativeSerialKeyboardService` (desktop with USB)
- `WebSerialKeyboardService` (browser with Web Serial API)
- `BluetoothKeyboardService` (tablet with BT keyboard, future)
- `EmulatedKeyboardService` (development/testing)
5. **No platform-specific code in business logic.** All platform differences are in the service layer, injected via DI.
## 4. Coordination Service Design (Option B)
### 4.1 Service Overview
A minimal .NET 8 ASP.NET Core application (~400 lines) running on the PRIMARY keyboard:
```
copilot-coordinator/
├── Program.cs               # Minimal API setup, WebSocket, endpoints
├── Services/
│   ├── LockManager.cs       # Camera lock state (ported from legacy CameraLocksService)
│   ├── SequenceRunner.cs    # Sequence execution (ported from legacy SequenceService)
│   └── KeyboardRegistry.cs  # Track connected keyboards
├── Models/
│   ├── CameraLock.cs        # Lock state model
│   ├── SequenceState.cs     # Running sequence model
│   └── Messages.cs          # WebSocket message types
└── appsettings.json         # Lock timeout, heartbeat interval config
```
### 4.2 REST API
```
GET  /health                                           → Service health
GET  /status                                           → Connected keyboards, active locks, sequences
POST /locks/try       {cameraId, keyboardId, priority} → Acquire lock
POST /locks/release   {cameraId, keyboardId}           → Release lock
POST /locks/takeover  {cameraId, keyboardId, priority} → Request takeover
POST /locks/confirm   {cameraId, keyboardId, confirm}  → Confirm/reject takeover
POST /locks/reset     {cameraId, keyboardId}           → Reset expiration
GET  /locks                                            → All active locks
GET  /locks/{keyboardId}                               → Locks held by keyboard
POST /sequences/start {viewerId, sequenceId}           → Start sequence
POST /sequences/stop  {viewerId}                       → Stop sequence
GET  /sequences/running                                → Active sequences
WS   /ws                                               → Real-time events
```
### 4.3 WebSocket Events (broadcast to all connected keyboards)
```json
{"type": "lock_acquired", "cameraId": 5, "keyboardId": "KB1", "expiresAt": "..."}
{"type": "lock_released", "cameraId": 5}
{"type": "lock_expiring", "cameraId": 5, "keyboardId": "KB1", "expiresIn": 60}
{"type": "lock_takeover", "cameraId": 5, "from": "KB1", "to": "KB2"}
{"type": "sequence_started", "viewerId": 1001, "sequenceId": 3}
{"type": "sequence_stopped", "viewerId": 1001}
{"type": "keyboard_online", "keyboardId": "KB2"}
{"type": "keyboard_offline", "keyboardId": "KB2"}
{"type": "heartbeat"}
```
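On the Flutter side, consuming these broadcasts reduces to a dispatch on the `type` field. A minimal sketch: the function name and the returned descriptions are illustrative, and real code would forward to the lock/sequence BLoCs instead of returning strings:

```dart
/// Dispatch a decoded coordinator event (one JSON object from /ws).
String describeCoordinatorEvent(Map<String, dynamic> event) {
  switch (event['type'] as String) {
    case 'lock_acquired':
      return 'camera ${event['cameraId']} locked by ${event['keyboardId']}';
    case 'lock_released':
      return 'camera ${event['cameraId']} released';
    case 'lock_expiring':
      return 'camera ${event['cameraId']} lock expires in ${event['expiresIn']}s';
    case 'heartbeat':
      return 'heartbeat';
    default:
      return 'unhandled event: ${event['type']}'; // log, never crash
  }
}
```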
### 4.4 Failover (Configured STANDBY)
```
keyboards.json:
{
  "keyboards": [
    {"id": "KB1", "role": "PRIMARY", "coordinatorPort": 8090},
    {"id": "KB2", "role": "STANDBY", "coordinatorPort": 8090}
  ]
}
```
- PRIMARY starts coordinator service on `:8090`
- STANDBY monitors PRIMARY's `/health` endpoint
- If PRIMARY unreachable for 6 seconds → STANDBY starts its own coordinator
- When old PRIMARY recovers → checks if another coordinator is running → defers (becomes STANDBY)
- Lock state after failover: **empty** (locks expire naturally in ≤5 minutes, same as legacy AppServer restart behavior)
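The promotion rule above boils down to one predicate. A sketch, assuming the STANDBY polls `/health` at a fixed interval (the function name and parameters are illustrative):

```dart
/// Promote the STANDBY once the PRIMARY's /health endpoint has been
/// unreachable for at least 6 seconds of consecutive missed checks.
bool shouldPromote({required int missedChecks, required Duration interval}) =>
    missedChecks * interval.inMilliseconds >= 6000;
```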
## 5. Improvement Summary: Legacy vs New
### What the New System Does BETTER
| Improvement | Detail |
|-------------|--------|
| No central server hardware | Coordinator runs on keyboard, not separate machine |
| Alarm reliability | Query + Subscribe + Periodic sync (legacy had event-only + hourly refresh) |
| Direct command path | CrossSwitch/PTZ bypass coordinator entirely (legacy routed some through AppServer) |
| Multiplatform | Flutter + .NET 8 run on Windows, Linux, Android. Legacy was Windows-only WPF |
| No SDK dependency in UI | Bridges abstract SDKs behind REST. UI never touches native code |
| Independent operation | Each keyboard works standalone for critical ops. Legacy needed AppServer for several features |
| Deployable anywhere | Bridges + coordinator can run on any server, not just the keyboard |
### What the New System Must MATCH (Currently Missing)
| Legacy Feature | Legacy Implementation | New Implementation Needed |
|---------------|----------------------|---------------------------|
| Auto-reconnect to camera servers | 2-second periodic retry service | Bridge health polling + WebSocket auto-reconnect |
| Auto-reconnect to AppServer | SignalR built-in (0s, 5s, 10s, 15s) | Coordinator WebSocket auto-reconnect with backoff |
| Network detection | 5-second NIC polling worker | `connectivity_plus` package or periodic health checks |
| State resync on reconnect | `ViewerStatesInitWorker`, config resync on availability change | Query bridges + coordinator on any reconnect event |
| Graceful partial failure | `Parallel.ForEach` with per-driver try-catch | Already OK (each bridge independent) |
| Process watchdog | Windows Service | systemd / Windows Service / Docker restart policy |
| Media channel refresh | 10-minute periodic refresh | Periodic bridge status query |
### What the New System Should Do BETTER THAN Legacy
| Improvement | Legacy Gap | New Approach |
|-------------|-----------|--------------|
| Exponential backoff | Fixed delays (0, 5, 10, 15s) — no backoff | Exponential: 1s, 2s, 4s, 8s, max 30s with jitter |
| Circuit breaker | None — retries forever even if server is gone | After N failures, back off to slow polling (60s) |
| Command retry | None — single attempt | Retry critical commands (CrossSwitch) 3x with 200ms delay |
| Health visibility | Hidden in logs | Operator-facing status dashboard in UI |
| Structured logging | Basic ILogger | JSON structured logging → ELK (already in design) |
| Graceful degradation UI | Commands silently disabled | Clear visual indicator: "Degraded mode — locks unavailable" |
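The circuit-breaker row can be sketched as a small state holder. The class name, threshold, and polling intervals here are illustrative assumptions, not settled design values:

```dart
/// After [threshold] consecutive failures the breaker "opens" and the
/// caller drops from fast retries to slow 60-second polling.
class CircuitBreaker {
  CircuitBreaker({this.threshold = 5});
  final int threshold;
  int _consecutiveFailures = 0;

  void recordSuccess() => _consecutiveFailures = 0;
  void recordFailure() => _consecutiveFailures++;

  bool get isOpen => _consecutiveFailures >= threshold;

  /// Fast polling while closed, slow polling once open.
  Duration get nextPoll =>
      isOpen ? const Duration(seconds: 60) : const Duration(seconds: 5);
}
```

A single success resets the counter, so a flaky link does not stay stuck in slow polling once the server is back.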
## 6. Proposed Resilience Architecture
```mermaid
graph TB
    subgraph "Flutter App"
        UI["UI Layer"]
        BLoCs["BLoC Layer"]
        RS["ReconnectionService"]
        HS["HealthService"]
        BS["BridgeService"]
        CS["CoordinationClient"]
        KS["KeyboardService"]
    end
    subgraph "Health & Reconnection"
        RS -->|"periodic /health"| Bridge1["GeViScope Bridge"]
        RS -->|"periodic /health"| Bridge2["G-Core Bridge"]
        RS -->|"periodic /health"| Coord["Coordinator"]
        RS -->|"on failure"| BS
        RS -->|"on failure"| CS
        HS -->|"status stream"| BLoCs
    end
    subgraph "Normal Operation"
        BS -->|"REST commands"| Bridge1
        BS -->|"REST commands"| Bridge2
        BS -->|"WebSocket events"| Bridge1
        BS -->|"WebSocket events"| Bridge2
        CS -->|"REST + WebSocket"| Coord
    end
    BLoCs --> UI
    KS -->|"Serial/HID"| BLoCs
```
**New services needed in Flutter app:**
| Service | Responsibility |
|---------|---------------|
| `ReconnectionService` | Polls bridge `/health` endpoints, auto-reconnects WebSocket, triggers state resync |
| `HealthService` | Aggregates health of all bridges + coordinator, exposes stream to UI |
| `CoordinationClient` | REST + WebSocket client to coordinator (locks, sequences, heartbeat) |
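The `HealthService` aggregation can be sketched as a pure function over per-service status. The enum and function names are illustrative; real code would feed this from the `ReconnectionService` status stream:

```dart
enum SystemHealth { healthy, degraded, down }

/// Healthy only when every bridge and the coordinator report OK,
/// degraded when some do, down when none do.
SystemHealth aggregate(Map<String, bool> serviceUp) {
  final up = serviceUp.values.where((ok) => ok).length;
  if (serviceUp.isNotEmpty && up == serviceUp.length) {
    return SystemHealth.healthy;
  }
  return up > 0 ? SystemHealth.degraded : SystemHealth.down;
}
```

`degraded` is the state that drives the "Degraded mode — locks unavailable" indicator from §5.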
## 7. Action Items Before Implementation
- [ ] **Create coordination service** (.NET 8 minimal API, ~400 lines)
- [ ] **Add `ReconnectionService`** to Flutter app (exponential backoff, health polling)
- [ ] **Add `HealthService`** to Flutter app (status aggregation for UI)
- [ ] **Add `CoordinationClient`** to Flutter app (locks, sequences)
- [ ] **Fix WebSocket auto-reconnect** in `BridgeService`
- [ ] **Add command retry** for CrossSwitch (3x with backoff)
- [ ] **Add bridge process supervision** (systemd/Windows Service configs)
- [ ] **Add state resync** on every reconnect event
- [ ] **Build health status UI** component
- [ ] **Update `servers.json`** schema to include coordinator URL
- [ ] **Build for Linux** — verify .NET 8 bridges compile for linux-x64
- [ ] **Abstract keyboard input** behind `KeyboardService` interface with platform impls