Flutter web app replacing legacy WPF CCTV surveillance keyboard controller. Includes wall overview, section view with monitor grid, camera input, PTZ control, alarm/lock/sequence BLoCs, and legacy-matching UI styling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Architecture Review: Legacy vs New — Critical Infrastructure Improvements

> Pre-implementation review. This system controls traffic/tunnel cameras in critical infrastructure. Every failure mode must be addressed. The system may run on Windows, Linux, or Android tablets in the future.

## 1. Side-by-Side Failure Mode Comparison

### 1.1 Camera Server Unreachable

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | Driver `IsConnected` check every 2 seconds | HTTP timeout (5s) | Legacy better: faster detection |
| Recovery | `CameraServerDriverReconnectService` retries every 2s | **None**: user must click retry button | **Critical gap** |
| Partial failure | Skips disconnected drivers, other servers still work | Each bridge is independent, OK | Equal |
| State on reconnect | Reloads media channels, fires `DriverConnected` event | No state resync after reconnect | **Gap** |

### 1.2 Coordination Layer Down (AppServer / PRIMARY)

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | SignalR built-in disconnect detection | Not implemented yet | To be built (both designs need it) |
| Recovery | SignalR auto-reconnect: 0s, 5s, 10s, 15s fixed delays | Not implemented yet | To be built |
| Degraded mode | CrossSwitch/PTZ work, locks/sequences don't | Same design, correct | Equal |
| State on reconnect | Hub client calls `GetLockedCameraIds()`, `GetRunningSequences()` | Not implemented yet | Must match |

### 1.3 Network Failure

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | `NetworkAvailabilityWorker` polls every 5s (checks NIC status) | **None**: no network detection | **Critical gap** |
| UI feedback | `NetworkAvailabilityState` updates UI commands | Connection status bar (manual) | **Gap** |
| Recovery | Automatic: reconnect services activate when NIC comes back | **Manual only**: user clicks retry | **Critical gap** |

### 1.4 Bridge Process Crash

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | N/A (SDK was in-process) | HTTP timeout → connection status false | OK |
| Recovery | N/A (app restarts) | **None**: bridge stays dead | **Critical gap** |
| Prevention | N/A | Process supervision needed | Must add |

### 1.5 Flutter App Crash

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Recovery | App restarts, reconnects in ~5s | App restarts, must reinitialize | Equal |
| State recovery | Queries AppServer for locks, sequences, viewer states | Queries bridges for monitor states, alarms | Equal |
| Lock state | Restored via `GetLockedCameraIds()` | Restored from coordination service | Equal |

## 2. Critical Improvements Required

### 2.1 Automatic Reconnection (MUST HAVE)

The legacy system reconnects automatically at every level. Our Flutter app does not. For tunnel/traffic camera control, an operator cannot be expected to click a retry button during an emergency.

**Required reconnection layers:**

```
Layer 1: Bridge Health Polling
  Flutter → periodic GET /health to each bridge
  If bridge was down and comes back → auto-reconnect WebSocket + resync state

Layer 2: WebSocket Auto-Reconnect
  On disconnect → exponential backoff retry (1s, 2s, 4s, 8s, max 30s)
  On reconnect → resync state from bridge

Layer 3: Coordination Auto-Reconnect
  On PRIMARY disconnect → retry connection with backoff
  After 6s → STANDBY promotion (if configured)
  On reconnect to (new) PRIMARY → resync lock/sequence state

Layer 4: Network Change Detection
  Monitor network interface status
  On network restored → trigger reconnection at all layers
```

**Legacy equivalent:**

- Camera drivers: 2-second reconnect loop (`CameraServerDriverReconnectService`)
- SignalR: built-in auto-reconnect with `HubRetryPolicy` (0s, 5s, 10s, 15s)
- Network: 5-second NIC polling (`NetworkAvailabilityWorker`)

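The Layer 2 schedule (1s, 2s, 4s, 8s, capped at 30s) can be sketched in Dart as below. This is a minimal illustration rather than the project's implementation: `backoffDelay`, `reconnectWithBackoff`, the `connect` callback and the 20% jitter share are all hypothetical names and values chosen for the example.

```dart
import 'dart:async';
import 'dart:math';

/// Delay before reconnect attempt [attempt] (0-based): 1s, 2s, 4s, 8s, ...
/// capped at [cap]. With [jitter] > 0, a random share of up to that fraction
/// is subtracted so several keyboards do not retry in lockstep.
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 1),
  Duration cap = const Duration(seconds: 30),
  double jitter = 0.0,
  Random? random,
}) {
  // Clamp the exponent so the shift stays small during long outages.
  final exponent = attempt.clamp(0, 20).toInt();
  final uncapped = base.inMilliseconds * (1 << exponent);
  final capped = min(uncapped, cap.inMilliseconds);
  if (jitter <= 0) return Duration(milliseconds: capped);
  final rng = random ?? Random();
  final reduction = (capped * jitter * rng.nextDouble()).round();
  return Duration(milliseconds: capped - reduction);
}

Future<void> _defaultSleep(Duration d) => Future<void>.delayed(d);

/// Calls [connect] until it reports success, waiting with backoff in between.
/// [connect] is a placeholder for "open the WebSocket and resync state".
Future<void> reconnectWithBackoff(
  Future<bool> Function() connect, {
  Future<void> Function(Duration) sleep = _defaultSleep,
}) async {
  for (var attempt = 0;; attempt++) {
    if (await connect()) return;
    await sleep(backoffDelay(attempt, jitter: 0.2));
  }
}
```

The `sleep` parameter exists only so the loop can be driven without real delays in tests; production code would keep the default.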
### 2.2 Process Supervision (MUST HAVE)

Every .NET process (bridges + coordination service) must auto-restart on crash. An operator should never have to SSH into a machine to restart a bridge.

| Platform | Supervision Method |
|----------|--------------------|
| Windows | Windows Service (via `Microsoft.Extensions.Hosting.WindowsServices`) or NSSM |
| Linux | systemd units with `Restart=always` |
| Docker | `restart: always` policy |
| Android tablet | Bridges run on server, not locally |

**Proposed process tree:**

```
LattePanda Sigma (per keyboard)
├── copilot-geviscope-bridge.service   (auto-restart)
├── copilot-gcore-bridge.service       (auto-restart)
├── copilot-geviserver-bridge.service  (auto-restart)
├── copilot-coordinator.service        (auto-restart, PRIMARY only)
└── copilot-keyboard.service           (auto-restart, Flutter desktop)
                                       or browser tab (Flutter web)
```

### 2.3 Health Monitoring Dashboard (SHOULD HAVE)

The operator must see at a glance what's working and what's not.

```
┌──────────────────────────────────────────────────────────┐
│  System Status                                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐  │
│  │ GeViScope  │  │ G-Core     │  │ Coordination       │  │
│  │ ● Online   │  │ ● Online   │  │ ● PRIMARY active   │  │
│  │ 12 cams    │  │ 8 cams     │  │ 2 keyboards        │  │
│  │ 6 viewers  │  │ 4 viewers  │  │ 1 lock active      │  │
│  └────────────┘  └────────────┘  └────────────────────┘  │
│                                                          │
│  ⚠ G-Core bridge reconnecting (attempt 3/∞)              │
└──────────────────────────────────────────────────────────┘
```

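### 2.4 Command Retry with Idempotency (SHOULD HAVE)

Critical commands (CrossSwitch) should retry on transient failure. Retrying is safe here because the command is idempotent: sending the same viewer/channel pair twice has the same effect as sending it once.
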
```dart
Future<bool> viewerConnectLive(int viewer, int channel) async {
  const maxAttempts = 3;
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      final response = await _client.post('/viewer/connect-live', ...);
      if (response.statusCode == 200) return true;
    } catch (e) {
      // Out of attempts: surface the transport error to the caller.
      if (attempt == maxAttempts) rethrow;
    }
    // Linear backoff before the next attempt (200 ms, then 400 ms).
    // Applies to non-200 responses as well as to thrown errors.
    if (attempt < maxAttempts) {
      await Future.delayed(Duration(milliseconds: 200 * attempt));
    }
  }
  return false;
}
```

PTZ commands should NOT be retried: they are continuous, so a stale retry would cause unexpected camera movement.

### 2.5 State Verification After Reconnection (MUST HAVE)

After any reconnection event, the app must not trust its cached state:

```
On bridge reconnect:
  1. Query GET /monitors → rebuild monitor state
  2. Query GET /alarms/active → rebuild alarm state
  3. Re-subscribe WebSocket events

On coordination reconnect:
  1. Query locks → rebuild lock state
  2. Query running sequences → update sequence state
  3. Re-subscribe lock/sequence change events
```

The legacy system already does this: `ViewerStatesInitWorker` rebuilds viewer state on startup and on reconnect, and `ConfigurationService.OnChangeAvailability` resyncs configuration when AppServer comes back.

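The "rebuild, do not merge" rule for monitor state might look like the following Dart sketch. The shape of the state (monitor id mapped to the camera id it shows) and the names `ResyncResult` and `resyncMonitors` are assumptions made for illustration; the real `/monitors` payload is not specified in this review.

```dart
/// Result of replacing cached monitor state with a fresh bridge snapshot.
class ResyncResult {
  ResyncResult(this.state, this.changedMonitors);

  /// Authoritative state after resync: monitor id -> camera id shown.
  final Map<int, int> state;

  /// Monitors whose camera differs from the cache, or that appeared/vanished.
  final Set<int> changedMonitors;
}

/// Discards [cached] in favour of [fresh] and reports what changed, so the
/// UI can redraw only the affected monitors.
ResyncResult resyncMonitors(Map<int, int> cached, Map<int, int> fresh) {
  final changed = <int>{};
  for (final id in <int>{...cached.keys, ...fresh.keys}) {
    if (cached[id] != fresh[id]) changed.add(id);
  }
  // The snapshot wins outright; nothing from the cache is merged back in.
  return ResyncResult(Map<int, int>.unmodifiable(fresh), changed);
}
```

Returning the set of changed monitors is a design choice for this sketch: it lets a BLoC emit one targeted update per monitor instead of rebuilding the whole wall view.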
## 3. Platform Independence Analysis

### 3.1 Current Platform Assumptions

| Component | Current Assumption | Future Need |
|-----------|-------------------|-------------|
| C# Bridges | Run locally on Windows (LattePanda) | Linux, Docker, remote server |
| Flutter App | Windows desktop or browser | Linux, Android tablet, browser |
| Coordination | Runs on PRIMARY keyboard (Windows) | Linux, Docker, any host |
| Hardware I/O | USB Serial + HID on local machine | Remote keyboard via network, or Bluetooth |
| Bridge URLs | `http://localhost:7720` | `http://192.168.x.y:7720` (already configurable) |

### 3.2 Architecture for Platform Independence

```mermaid
graph TB
    subgraph "Deployment A: LattePanda (Current)"
        LP_App["Flutter Desktop"]
        LP_Bridge1["GeViScope Bridge"]
        LP_Bridge2["G-Core Bridge"]
        LP_Coord["Coordinator"]
        LP_Serial["USB Serial/HID"]
        LP_App --> LP_Bridge1
        LP_App --> LP_Bridge2
        LP_App --> LP_Coord
        LP_Serial --> LP_App
    end

    subgraph "Deployment B: Android Tablet (Future)"
        AT_App["Flutter Android"]
        AT_BT["Bluetooth Keyboard"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge1["Bridge on Server"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge2["Bridge on Server"]
        AT_App -->|"WebSocket"| Remote_Coord["Coordinator on Server"]
        AT_BT --> AT_App
    end

    subgraph "Deployment C: Linux Kiosk (Future)"
        LX_App["Flutter Linux"]
        LX_Bridge1["GeViScope Bridge"]
        LX_Bridge2["G-Core Bridge"]
        LX_Coord["Coordinator"]
        LX_Serial["USB Serial/HID"]
        LX_App --> LX_Bridge1
        LX_App --> LX_Bridge2
        LX_App --> LX_Coord
        LX_Serial --> LX_App
    end

    Remote_Bridge1 --> CS1["Camera Server 1"]
    Remote_Bridge2 --> CS2["Camera Server 2"]
    LP_Bridge1 --> CS1
    LP_Bridge2 --> CS2
    LX_Bridge1 --> CS1
    LX_Bridge2 --> CS2
```

### 3.3 Key Design Rules for Platform Independence

1. **Flutter app never assumes bridges are on localhost.** Bridge URLs come from `servers.json`. Already the case.

2. **Bridges are deployable anywhere .NET 8 runs.** Currently Windows x86/x64. Must also build for `linux-x64` and `linux-arm64`.

3. **Coordination service is just another network service.** Flutter app connects to it like a bridge, via a configured URL.

4. **Hardware I/O is abstracted behind a service interface.** `KeyboardService` interface has platform-specific implementations:
   - `NativeSerialKeyboardService` (desktop with USB)
   - `WebSerialKeyboardService` (browser with Web Serial API)
   - `BluetoothKeyboardService` (tablet with BT keyboard, future)
   - `EmulatedKeyboardService` (development/testing)

5. **No platform-specific code in business logic.** All platform differences are in the service layer, injected via DI.

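Rules 4 and 5 could take roughly the following shape in Dart. Only the class names `KeyboardService` and `EmulatedKeyboardService` come from this review; the `keys` stream of string key names, the `KeyboardBackend` enum and the `createKeyboardService` factory are hypothetical, and the hardware-backed implementations are deliberately left out of the sketch.

```dart
import 'dart:async';

/// Hardware-agnostic key source consumed by the BLoC layer.
abstract class KeyboardService {
  /// Key names as they arrive from the device (encoding is an assumption).
  Stream<String> get keys;

  Future<void> dispose();
}

/// In-memory implementation for development and tests.
class EmulatedKeyboardService implements KeyboardService {
  final _controller = StreamController<String>.broadcast();

  @override
  Stream<String> get keys => _controller.stream;

  /// Simulates a key press coming from the hardware.
  void press(String key) => _controller.add(key);

  @override
  Future<void> dispose() => _controller.close();
}

/// Which implementation the DI container should provide.
enum KeyboardBackend { nativeSerial, webSerial, bluetooth, emulated }

/// Single place where the platform decision is made; business logic only
/// ever sees the [KeyboardService] interface.
KeyboardService createKeyboardService(KeyboardBackend backend) {
  switch (backend) {
    case KeyboardBackend.emulated:
      return EmulatedKeyboardService();
    case KeyboardBackend.nativeSerial:
    case KeyboardBackend.webSerial:
    case KeyboardBackend.bluetooth:
      throw UnimplementedError('$backend is not part of this sketch');
  }
}
```

A BLoC would receive the `KeyboardService` through its constructor and listen to `keys`, so swapping USB serial for Bluetooth changes only the factory argument.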
## 4. Coordination Service Design (Option B)

### 4.1 Service Overview

A minimal .NET 8 ASP.NET Core application (~400 lines) running on the PRIMARY keyboard:

```
copilot-coordinator/
├── Program.cs               # Minimal API setup, WebSocket, endpoints
├── Services/
│   ├── LockManager.cs       # Camera lock state (ported from legacy CameraLocksService)
│   ├── SequenceRunner.cs    # Sequence execution (ported from legacy SequenceService)
│   └── KeyboardRegistry.cs  # Track connected keyboards
├── Models/
│   ├── CameraLock.cs        # Lock state model
│   ├── SequenceState.cs     # Running sequence model
│   └── Messages.cs          # WebSocket message types
└── appsettings.json         # Lock timeout, heartbeat interval config
```

### 4.2 REST API

```
GET  /health                                              → Service health
GET  /status                                              → Connected keyboards, active locks, sequences

POST /locks/try          {cameraId, keyboardId, priority} → Acquire lock
POST /locks/release      {cameraId, keyboardId}           → Release lock
POST /locks/takeover     {cameraId, keyboardId, priority} → Request takeover
POST /locks/confirm      {cameraId, keyboardId, confirm}  → Confirm/reject takeover
POST /locks/reset        {cameraId, keyboardId}           → Reset expiration
GET  /locks                                               → All active locks
GET  /locks/{keyboardId}                                  → Locks held by keyboard

POST /sequences/start    {viewerId, sequenceId}           → Start sequence
POST /sequences/stop     {viewerId}                       → Stop sequence
GET  /sequences/running                                   → Active sequences

WS   /ws                                                  → Real-time events
```

### 4.3 WebSocket Events (broadcast to all connected keyboards)

```json
{"type": "lock_acquired", "cameraId": 5, "keyboardId": "KB1", "expiresAt": "..."}
{"type": "lock_released", "cameraId": 5}
{"type": "lock_expiring", "cameraId": 5, "keyboardId": "KB1", "expiresIn": 60}
{"type": "lock_takeover", "cameraId": 5, "from": "KB1", "to": "KB2"}
{"type": "sequence_started", "viewerId": 1001, "sequenceId": 3}
{"type": "sequence_stopped", "viewerId": 1001}
{"type": "keyboard_online", "keyboardId": "KB2"}
{"type": "keyboard_offline", "keyboardId": "KB2"}
{"type": "heartbeat"}
```

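On the Flutter side, these frames could be decoded into typed events before they reach the BLoCs. The sketch below covers three of the nine message types to show the pattern. The class names are hypothetical, and it assumes `expiresAt` is an ISO 8601 timestamp, which this review does not state.

```dart
import 'dart:convert';

/// Base type for messages received on the coordinator's /ws endpoint.
abstract class CoordinatorEvent {
  const CoordinatorEvent();

  /// Decodes one JSON text frame. Returns null for unknown types so that an
  /// older client tolerates events added by a newer coordinator.
  static CoordinatorEvent? parse(String frame) {
    final json = jsonDecode(frame) as Map<String, dynamic>;
    final type = json['type'] as String?;
    switch (type) {
      case 'lock_acquired':
        return LockAcquired(
          json['cameraId'] as int,
          json['keyboardId'] as String,
          // Assumption: expiresAt is ISO 8601, e.g. 2025-01-01T12:00:00Z.
          DateTime.parse(json['expiresAt'] as String),
        );
      case 'lock_released':
        return LockReleased(json['cameraId'] as int);
      case 'heartbeat':
        return const Heartbeat();
      default:
        return null;
    }
  }
}

class LockAcquired extends CoordinatorEvent {
  const LockAcquired(this.cameraId, this.keyboardId, this.expiresAt);
  final int cameraId;
  final String keyboardId;
  final DateTime expiresAt;
}

class LockReleased extends CoordinatorEvent {
  const LockReleased(this.cameraId);
  final int cameraId;
}

class Heartbeat extends CoordinatorEvent {
  const Heartbeat();
}
```

The remaining event types would follow the same pattern, one case and one small class each.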
### 4.4 Failover (Configured STANDBY)

```
keyboards.json:
{
  "keyboards": [
    {"id": "KB1", "role": "PRIMARY", "coordinatorPort": 8090},
    {"id": "KB2", "role": "STANDBY", "coordinatorPort": 8090}
  ]
}
```

- PRIMARY starts coordinator service on `:8090`
- STANDBY monitors PRIMARY's `/health` endpoint
- If PRIMARY unreachable for 6 seconds → STANDBY starts its own coordinator
- When old PRIMARY recovers → checks if another coordinator is running → defers (becomes STANDBY)
- Lock state after failover: **empty** (locks expire naturally in ≤5 minutes, same as legacy AppServer restart behavior)

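The 6-second promotion rule can be expressed as a small state holder that is fed each `/health` probe result. This is a sketch in Dart for illustration only; the review does not say which process performs the monitoring or how often it probes, and `StandbyMonitor` and `record` are hypothetical names.

```dart
/// Tracks PRIMARY health probes and decides when STANDBY should take over.
class StandbyMonitor {
  StandbyMonitor({this.threshold = const Duration(seconds: 6)});

  /// How long PRIMARY must stay unreachable before promotion.
  final Duration threshold;

  DateTime? _firstFailure;

  /// Records one /health probe result taken at [now]. Returns true once
  /// PRIMARY has been continuously unreachable for at least [threshold].
  bool record({required bool healthy, required DateTime now}) {
    if (healthy) {
      _firstFailure = null; // any success resets the outage window
      return false;
    }
    _firstFailure ??= now;
    return now.difference(_firstFailure!) >= threshold;
  }
}
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the rule testable without waiting six real seconds.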
## 5. Improvement Summary: Legacy vs New

### What the New System Does BETTER

| Improvement | Detail |
|-------------|--------|
| No central server hardware | Coordinator runs on keyboard, not separate machine |
| Alarm reliability | Query + Subscribe + Periodic sync (legacy had event-only + hourly refresh) |
| Direct command path | CrossSwitch/PTZ bypass coordinator entirely (legacy routed some through AppServer) |
| Multiplatform | Flutter + .NET 8 run on Windows, Linux, Android. Legacy was Windows-only WPF |
| No SDK dependency in UI | Bridges abstract SDKs behind REST. UI never touches native code |
| Independent operation | Each keyboard works standalone for critical ops. Legacy needed AppServer for several features |
| Deployable anywhere | Bridges + coordinator can run on any server, not just the keyboard |

### What the New System Must MATCH (Currently Missing)

| Legacy Feature | Legacy Implementation | New Implementation Needed |
|---------------|----------------------|---------------------------|
| Auto-reconnect to camera servers | 2-second periodic retry service | Bridge health polling + WebSocket auto-reconnect |
| Auto-reconnect to AppServer | SignalR built-in (0s, 5s, 10s, 15s) | Coordinator WebSocket auto-reconnect with backoff |
| Network detection | 5-second NIC polling worker | `connectivity_plus` package or periodic health checks |
| State resync on reconnect | `ViewerStatesInitWorker`, config resync on availability change | Query bridges + coordinator on any reconnect event |
| Graceful partial failure | `Parallel.ForEach` with per-driver try-catch | Already OK (each bridge independent) |
| Process watchdog | Windows Service | systemd / Windows Service / Docker restart policy |
| Media channel refresh | 10-minute periodic refresh | Periodic bridge status query |

### What the New System Should Do BETTER THAN Legacy

| Improvement | Legacy Gap | New Approach |
|-------------|-----------|--------------|
| Exponential backoff | Fixed delays (0, 5, 10, 15s), no backoff | Exponential: 1s, 2s, 4s, 8s, max 30s with jitter |
| Circuit breaker | None: retries forever even if server is gone | After N failures, back off to slow polling (60s) |
| Command retry | None: single attempt | Retry critical commands (CrossSwitch) up to 3 attempts with linear backoff (200 ms, 400 ms) |
| Health visibility | Hidden in logs | Operator-facing status dashboard in UI |
| Structured logging | Basic ILogger | JSON structured logging → ELK (already in design) |
| Graceful degradation UI | Commands silently disabled | Clear visual indicator: "Degraded mode: locks unavailable" |

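The circuit-breaker row leaves N open. One way to read it is a polling policy that switches to the 60-second interval after N consecutive failures and returns to the normal interval on the first success. In the Dart sketch below, `PollingPolicy`, the default of 5 failures and the 2-second normal interval (borrowed from the legacy reconnect loop) are assumptions; only the 60-second slow interval comes from the table.

```dart
/// Chooses the next health-poll interval from the consecutive failure count.
class PollingPolicy {
  PollingPolicy({
    this.failureThreshold = 5,
    this.normalInterval = const Duration(seconds: 2),
    this.slowInterval = const Duration(seconds: 60),
  });

  /// N in "after N failures": consecutive failures that open the breaker.
  final int failureThreshold;
  final Duration normalInterval;
  final Duration slowInterval;

  int _failures = 0;

  /// True while the breaker is open and polling is slowed down.
  bool get isOpen => _failures >= failureThreshold;

  /// A single successful probe closes the breaker again.
  void recordSuccess() => _failures = 0;

  void recordFailure() => _failures++;

  Duration get nextInterval => isOpen ? slowInterval : normalInterval;
}
```

Unlike a classic circuit breaker, this variant never stops probing entirely, which matches the requirement that recovery stay automatic.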
## 6. Proposed Resilience Architecture

```mermaid
graph TB
    subgraph "Flutter App"
        UI["UI Layer"]
        BLoCs["BLoC Layer"]
        RS["ReconnectionService"]
        HS["HealthService"]
        BS["BridgeService"]
        CS["CoordinationClient"]
        KS["KeyboardService"]
    end

    subgraph "Health & Reconnection"
        RS -->|"periodic /health"| Bridge1["GeViScope Bridge"]
        RS -->|"periodic /health"| Bridge2["G-Core Bridge"]
        RS -->|"periodic /health"| Coord["Coordinator"]
        RS -->|"on failure"| BS
        RS -->|"on failure"| CS
        HS -->|"status stream"| BLoCs
    end

    subgraph "Normal Operation"
        BS -->|"REST commands"| Bridge1
        BS -->|"REST commands"| Bridge2
        BS -->|"WebSocket events"| Bridge1
        BS -->|"WebSocket events"| Bridge2
        CS -->|"REST + WebSocket"| Coord
    end

    BLoCs --> UI
    KS -->|"Serial/HID"| BLoCs
```

**New services needed in Flutter app:**

| Service | Responsibility |
|---------|---------------|
| `ReconnectionService` | Polls bridge `/health` endpoints, auto-reconnects WebSocket, triggers state resync |
| `HealthService` | Aggregates health of all bridges + coordinator, exposes stream to UI |
| `CoordinationClient` | REST + WebSocket client to coordinator (locks, sequences, heartbeat) |

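As a rough idea of the `HealthService` contract, the Dart sketch below keeps one health value per component and publishes a snapshot whenever something changes. The `ComponentHealth` enum, the string component keys and the method names are hypothetical; how `ReconnectionService` feeds it is left open.

```dart
import 'dart:async';

enum ComponentHealth { online, reconnecting, offline }

/// Aggregates per-component health and exposes it as a stream for the UI.
class HealthService {
  final _states = <String, ComponentHealth>{};
  final _controller =
      StreamController<Map<String, ComponentHealth>>.broadcast();

  /// Emits an immutable snapshot after every change.
  Stream<Map<String, ComponentHealth>> get status => _controller.stream;

  /// Current state of every component reported so far.
  Map<String, ComponentHealth> get snapshot =>
      Map<String, ComponentHealth>.unmodifiable(_states);

  /// True while at least one known component is not online.
  bool get isDegraded =>
      _states.values.any((h) => h != ComponentHealth.online);

  /// Called by whoever probes the component, e.g. after a /health poll.
  void report(String component, ComponentHealth health) {
    if (_states[component] == health) return; // unchanged, no event
    _states[component] = health;
    _controller.add(snapshot);
  }

  Future<void> dispose() => _controller.close();
}
```

A status BLoC could subscribe to `status` and map each snapshot to the dashboard tiles, while `isDegraded` drives a degraded-mode banner.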
## 7. Action Items Before Implementation

- [ ] **Create coordination service** (.NET 8 minimal API, ~400 lines)
- [ ] **Add `ReconnectionService`** to Flutter app (exponential backoff, health polling)
- [ ] **Add `HealthService`** to Flutter app (status aggregation for UI)
- [ ] **Add `CoordinationClient`** to Flutter app (locks, sequences)
- [ ] **Fix WebSocket auto-reconnect** in `BridgeService`
- [ ] **Add command retry** for CrossSwitch (3x with backoff)
- [ ] **Add bridge process supervision** (systemd/Windows Service configs)
- [ ] **Add state resync** on every reconnect event
- [ ] **Build health status UI** component
- [ ] **Update `servers.json`** schema to include coordinator URL
- [ ] **Build for Linux**: verify .NET 8 bridges compile for `linux-x64`
- [ ] **Abstract keyboard input** behind `KeyboardService` interface with platform impls