Initial commit: COPILOT D6 Flutter keyboard controller
Flutter web app replacing legacy WPF CCTV surveillance keyboard controller. Includes wall overview, section view with monitor grid, camera input, PTZ control, alarm/lock/sequence BLoCs, and legacy-matching UI styling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
399
Docs/legacy-architecture/architecture-review.md
Normal file
@@ -0,0 +1,399 @@
# Architecture Review: Legacy vs New (Critical Infrastructure Improvements)

> Pre-implementation review. This system controls traffic/tunnel cameras in critical infrastructure. Every failure mode must be addressed. The system may run on Windows, Linux, or Android tablets in the future.

## 1. Side-by-Side Failure Mode Comparison

### 1.1 Camera Server Unreachable

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | Driver `IsConnected` check every 2 seconds | HTTP timeout (5s) | Legacy better: faster detection |
| Recovery | `CameraServerDriverReconnectService` retries every 2s | **None**: user must click retry button | **Critical gap** |
| Partial failure | Skips disconnected drivers, other servers still work | Each bridge is independent, so this is OK | Equal |
| State on reconnect | Reloads media channels, fires `DriverConnected` event | No state resync after reconnect | **Gap** |

### 1.2 Coordination Layer Down (AppServer / PRIMARY)

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | SignalR built-in disconnect detection | Not implemented yet | To be built (must match) |
| Recovery | SignalR auto-reconnect: 0s, 5s, 10s, 15s fixed delays | Not implemented yet | To be built |
| Degraded mode | CrossSwitch/PTZ work, locks/sequences don't | Same design, which is correct | Equal |
| State on reconnect | Hub client calls `GetLockedCameraIds()`, `GetRunningSequences()` | Not implemented yet | Must match |

### 1.3 Network Failure

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | `NetworkAvailabilityWorker` polls every 5s (checks NIC status) | **None**: no network detection | **Critical gap** |
| UI feedback | `NetworkAvailabilityState` updates UI commands | Connection status bar (manual) | **Gap** |
| Recovery | Automatic: reconnect services activate when NIC comes back | **Manual only**: user clicks retry | **Critical gap** |

### 1.4 Bridge Process Crash

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Detection | N/A (SDK was in-process) | HTTP timeout → connection status false | OK |
| Recovery | N/A (app restarts) | **None**: bridge stays dead | **Critical gap** |
| Prevention | N/A | Process supervision needed | Must add |

### 1.5 Flutter App Crash

| Aspect | Legacy (WPF) | New (Flutter) | Verdict |
|--------|-------------|---------------|---------|
| Recovery | App restarts, reconnects in ~5s | App restarts, must reinitialize | Equal |
| State recovery | Queries AppServer for locks, sequences, viewer states | Queries bridges for monitor states, alarms | Equal |
| Lock state | Restored via `GetLockedCameraIds()` | Restored from coordination service | Equal |

## 2. Critical Improvements Required

### 2.1 Automatic Reconnection (MUST HAVE)

The legacy system reconnects automatically at every level. Our Flutter app does not. For tunnel/traffic camera control, an operator cannot be expected to click a retry button during an emergency.

**Required reconnection layers:**

```
Layer 1: Bridge Health Polling
  Flutter → periodic GET /health to each bridge
  If bridge was down and comes back → auto-reconnect WebSocket + resync state

Layer 2: WebSocket Auto-Reconnect
  On disconnect → exponential backoff retry (1s, 2s, 4s, 8s, max 30s)
  On reconnect → resync state from bridge

Layer 3: Coordination Auto-Reconnect
  On PRIMARY disconnect → retry connection with backoff
  After 6s → STANDBY promotion (if configured)
  On reconnect to (new) PRIMARY → resync lock/sequence state

Layer 4: Network Change Detection
  Monitor network interface status
  On network restored → trigger reconnection at all layers
```
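
The Layer 2 schedule above (1s, 2s, 4s, 8s, capped at 30s) could be computed along the following lines. This is a minimal sketch, not the app's actual code: `backoffDelay`, `withJitter`, and `reconnectWithBackoff` are hypothetical names, and the ±20% jitter fraction is an assumption, since the review only asks for "jitter" without giving a figure.

```dart
import 'dart:math';

/// Delay before reconnect attempt number [attempt] (0-based):
/// 1s, 2s, 4s, 8s, ... capped at [maxDelay].
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 1),
  Duration maxDelay = const Duration(seconds: 30),
}) {
  // Clamp the exponent so the shift cannot overflow during long outages.
  final exponent = min(max(attempt, 0), 16);
  final ms = base.inMilliseconds * (1 << exponent);
  return Duration(milliseconds: min(ms, maxDelay.inMilliseconds));
}

/// Spreads retries by up to +/- [fraction] so several keyboards do not
/// reconnect in lockstep. The 20% default is an assumed value.
Duration withJitter(Duration delay, Random rng, {double fraction = 0.2}) {
  final factor = 1 + (rng.nextDouble() * 2 - 1) * fraction;
  return Duration(milliseconds: (delay.inMilliseconds * factor).round());
}

/// Calls [connect] until it reports success, sleeping between attempts.
/// [connect] stands in for whatever opens the bridge WebSocket.
Future<void> reconnectWithBackoff(Future<bool> Function() connect) async {
  final rng = Random();
  for (var attempt = 0;; attempt++) {
    if (await connect()) return;
    await Future<void>.delayed(withJitter(backoffDelay(attempt), rng));
  }
}
```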

**Legacy equivalent:**

- Camera drivers: 2-second reconnect loop (`CameraServerDriverReconnectService`)
- SignalR: built-in auto-reconnect with `HubRetryPolicy` (0s, 5s, 10s, 15s)
- Network: 5-second NIC polling (`NetworkAvailabilityWorker`)

### 2.2 Process Supervision (MUST HAVE)

Every .NET process (bridges + coordination service) must auto-restart on crash. An operator should never have to SSH into a machine to restart a bridge.

| Platform | Supervision Method |
|----------|--------------------|
| Windows | Windows Service (via `Microsoft.Extensions.Hosting.WindowsServices`) or NSSM |
| Linux | systemd units with `Restart=always` |
| Docker | `restart: always` policy |
| Android tablet | Bridges run on server, not locally |

**Proposed process tree:**

```
LattePanda Sigma (per keyboard)
├── copilot-geviscope-bridge.service   (auto-restart)
├── copilot-gcore-bridge.service       (auto-restart)
├── copilot-geviserver-bridge.service  (auto-restart)
├── copilot-coordinator.service        (auto-restart, PRIMARY only)
└── copilot-keyboard.service           (auto-restart, Flutter desktop)
    or browser tab (Flutter web)
```

### 2.3 Health Monitoring Dashboard (SHOULD HAVE)

The operator must see at a glance what's working and what's not.

```
┌──────────────────────────────────────────────────────────┐
│ System Status                                            │
│ ┌────────────┐  ┌────────────┐  ┌────────────────────┐   │
│ │ GeViScope  │  │ G-Core     │  │ Coordination       │   │
│ │ ● Online   │  │ ● Online   │  │ ● PRIMARY active   │   │
│ │ 12 cams    │  │ 8 cams     │  │ 2 keyboards        │   │
│ │ 6 viewers  │  │ 4 viewers  │  │ 1 lock active      │   │
│ └────────────┘  └────────────┘  └────────────────────┘   │
│                                                          │
│ ⚠ G-Core bridge reconnecting (attempt 3/∞)               │
└──────────────────────────────────────────────────────────┘
```

### 2.4 Command Retry with Idempotency (SHOULD HAVE)

Critical commands (CrossSwitch) should retry on transient failure. Retrying is safe because CrossSwitch is idempotent: connecting the same viewer to the same channel twice leaves the system in the same state.

```dart
Future<bool> viewerConnectLive(int viewer, int channel) async {
  const maxAttempts = 3;
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      final response = await _client.post('/viewer/connect-live', ...);
      if (response.statusCode == 200) return true;
    } catch (e) {
      // Out of attempts: surface the transport error to the caller.
      if (attempt == maxAttempts) rethrow;
    }
    // Linear backoff (200 ms, then 400 ms) before the next attempt, whether
    // the failure was an exception or a non-200 response.
    if (attempt < maxAttempts) {
      await Future.delayed(Duration(milliseconds: 200 * attempt));
    }
  }
  return false;
}
```

PTZ commands should NOT retry: they are continuous, so a stale retry would cause unexpected movement.

### 2.5 State Verification After Reconnection (MUST HAVE)

After any reconnection event, the app must not trust its cached state:

```
On bridge reconnect:
  1. Query GET /monitors → rebuild monitor state
  2. Query GET /alarms/active → rebuild alarm state
  3. Re-subscribe WebSocket events

On coordination reconnect:
  1. Query locks → rebuild lock state
  2. Query running sequences → update sequence state
  3. Re-subscribe lock/sequence change events
```
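
The bridge-reconnect steps above might look roughly like this in Dart. It is only a sketch under stated assumptions: `BridgeApi`, `BridgeSnapshot`, and `resyncBridge` are hypothetical names, not the existing `BridgeService` API, and the payloads are left as untyped maps because the review does not define the response schemas.

```dart
/// Hypothetical handle on one bridge. The three callbacks stand in for
/// GET /monitors, GET /alarms/active, and the WebSocket subscription.
class BridgeApi {
  final Future<List<Map<String, dynamic>>> Function() getMonitors;
  final Future<List<Map<String, dynamic>>> Function() getActiveAlarms;
  final Future<void> Function() resubscribeEvents;

  const BridgeApi({
    required this.getMonitors,
    required this.getActiveAlarms,
    required this.resubscribeEvents,
  });
}

/// Fresh state read from the bridge. It replaces the cached state outright
/// rather than being merged into it.
class BridgeSnapshot {
  final List<Map<String, dynamic>> monitors;
  final List<Map<String, dynamic>> alarms;

  const BridgeSnapshot(this.monitors, this.alarms);
}

/// Runs the three resync steps in the order listed in the review.
Future<BridgeSnapshot> resyncBridge(BridgeApi bridge) async {
  final monitors = await bridge.getMonitors(); // 1. rebuild monitor state
  final alarms = await bridge.getActiveAlarms(); // 2. rebuild alarm state
  await bridge.resubscribeEvents(); // 3. re-subscribe WebSocket events
  return BridgeSnapshot(monitors, alarms);
}
```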

The legacy system already does this: `ViewerStatesInitWorker` rebuilds viewer state on startup/reconnect, and `ConfigurationService.OnChangeAvailability` resyncs config when AppServer comes back.

## 3. Platform Independence Analysis

### 3.1 Current Platform Assumptions

| Component | Current Assumption | Future Need |
|-----------|-------------------|-------------|
| C# Bridges | Run locally on Windows (LattePanda) | Linux, Docker, remote server |
| Flutter App | Windows desktop or browser | Linux, Android tablet, browser |
| Coordination | Runs on PRIMARY keyboard (Windows) | Linux, Docker, any host |
| Hardware I/O | USB Serial + HID on local machine | Remote keyboard via network, or Bluetooth |
| Bridge URLs | `http://localhost:7720` | `http://192.168.x.y:7720` (already configurable) |

### 3.2 Architecture for Platform Independence

```mermaid
graph TB
    subgraph "Deployment A: LattePanda (Current)"
        LP_App["Flutter Desktop"]
        LP_Bridge1["GeViScope Bridge"]
        LP_Bridge2["G-Core Bridge"]
        LP_Coord["Coordinator"]
        LP_Serial["USB Serial/HID"]
        LP_App --> LP_Bridge1
        LP_App --> LP_Bridge2
        LP_App --> LP_Coord
        LP_Serial --> LP_App
    end

    subgraph "Deployment B: Android Tablet (Future)"
        AT_App["Flutter Android"]
        AT_BT["Bluetooth Keyboard"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge1["Bridge on Server"]
        AT_App -->|"HTTP over WiFi"| Remote_Bridge2["Bridge on Server"]
        AT_App -->|"WebSocket"| Remote_Coord["Coordinator on Server"]
        AT_BT --> AT_App
    end

    subgraph "Deployment C: Linux Kiosk (Future)"
        LX_App["Flutter Linux"]
        LX_Bridge1["GeViScope Bridge"]
        LX_Bridge2["G-Core Bridge"]
        LX_Coord["Coordinator"]
        LX_Serial["USB Serial/HID"]
        LX_App --> LX_Bridge1
        LX_App --> LX_Bridge2
        LX_App --> LX_Coord
        LX_Serial --> LX_App
    end

    Remote_Bridge1 --> CS1["Camera Server 1"]
    Remote_Bridge2 --> CS2["Camera Server 2"]
    LP_Bridge1 --> CS1
    LP_Bridge2 --> CS2
    LX_Bridge1 --> CS1
    LX_Bridge2 --> CS2
```

### 3.3 Key Design Rules for Platform Independence

1. **Flutter app never assumes bridges are on localhost.** Bridge URLs come from `servers.json`. This is already the case.

2. **Bridges are deployable anywhere .NET 8 runs.** They currently target Windows x86/x64 and must also build for linux-x64 and linux-arm64.

3. **Coordination service is just another network service.** The Flutter app connects to it like a bridge, via a configured URL.

4. **Hardware I/O is abstracted behind a service interface.** The `KeyboardService` interface has platform-specific implementations:
   - `NativeSerialKeyboardService` (desktop with USB)
   - `WebSerialKeyboardService` (browser with Web Serial API)
   - `BluetoothKeyboardService` (tablet with BT keyboard, future)
   - `EmulatedKeyboardService` (development/testing)

5. **No platform-specific code in business logic.** All platform differences are in the service layer, injected via DI.
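
As an illustration of rule 4, the interface and its emulated implementation might be shaped as follows. Only the class names `KeyboardService` and `EmulatedKeyboardService` come from the review; the `KeyEvent` fields and the `events`, `connect`, `press`, and `dispose` members are assumptions made for this sketch.

```dart
import 'dart:async';

/// A key press or release decoded from the keyboard hardware.
/// The fields are illustrative; the real event model is not given here.
class KeyEvent {
  final String key;
  final bool pressed;

  const KeyEvent(this.key, {required this.pressed});
}

/// Platform-neutral contract that business logic depends on. Serial, Web
/// Serial, and Bluetooth variants would implement the same three members.
abstract class KeyboardService {
  Stream<KeyEvent> get events;
  Future<void> connect();
  Future<void> dispose();
}

/// Development/testing implementation: events are injected by the caller
/// instead of being read from hardware.
class EmulatedKeyboardService implements KeyboardService {
  final _controller = StreamController<KeyEvent>.broadcast();

  @override
  Stream<KeyEvent> get events => _controller.stream;

  @override
  Future<void> connect() async {} // nothing to open for the emulator

  /// Simulates a hardware key press.
  void press(String key) => _controller.add(KeyEvent(key, pressed: true));

  @override
  Future<void> dispose() => _controller.close();
}
```

Because callers only see `KeyboardService`, swapping the emulator for a hardware-backed implementation is a dependency-injection change rather than a change to business logic.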

## 4. Coordination Service Design (Option B)

### 4.1 Service Overview

A minimal .NET 8 ASP.NET Core application (~400 lines) running on the PRIMARY keyboard:

```
copilot-coordinator/
├── Program.cs              # Minimal API setup, WebSocket, endpoints
├── Services/
│   ├── LockManager.cs      # Camera lock state (ported from legacy CameraLocksService)
│   ├── SequenceRunner.cs   # Sequence execution (ported from legacy SequenceService)
│   └── KeyboardRegistry.cs # Track connected keyboards
├── Models/
│   ├── CameraLock.cs       # Lock state model
│   ├── SequenceState.cs    # Running sequence model
│   └── Messages.cs         # WebSocket message types
└── appsettings.json        # Lock timeout, heartbeat interval config
```

### 4.2 REST API

```
GET  /health                                              → Service health
GET  /status                                              → Connected keyboards, active locks, sequences

POST /locks/try        {cameraId, keyboardId, priority}   → Acquire lock
POST /locks/release    {cameraId, keyboardId}             → Release lock
POST /locks/takeover   {cameraId, keyboardId, priority}   → Request takeover
POST /locks/confirm    {cameraId, keyboardId, confirm}    → Confirm/reject takeover
POST /locks/reset      {cameraId, keyboardId}             → Reset expiration
GET  /locks                                               → All active locks
GET  /locks/{keyboardId}                                  → Locks held by keyboard

POST /sequences/start  {viewerId, sequenceId}             → Start sequence
POST /sequences/stop   {viewerId}                         → Stop sequence
GET  /sequences/running                                   → Active sequences

WS   /ws                                                  → Real-time events
```

### 4.3 WebSocket Events (broadcast to all connected keyboards)

```json
{"type": "lock_acquired", "cameraId": 5, "keyboardId": "KB1", "expiresAt": "..."}
{"type": "lock_released", "cameraId": 5}
{"type": "lock_expiring", "cameraId": 5, "keyboardId": "KB1", "expiresIn": 60}
{"type": "lock_takeover", "cameraId": 5, "from": "KB1", "to": "KB2"}
{"type": "sequence_started", "viewerId": 1001, "sequenceId": 3}
{"type": "sequence_stopped", "viewerId": 1001}
{"type": "keyboard_online", "keyboardId": "KB2"}
{"type": "keyboard_offline", "keyboardId": "KB2"}
{"type": "heartbeat"}
```
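
On the Flutter side, these messages could be decoded by switching on the `type` field. The sketch below covers only three of the nine event types and is not the planned `CoordinationClient`; the class names and the `decodeEvent` function are hypothetical, and it assumes Dart 3 for `sealed` classes.

```dart
import 'dart:convert';

/// Base type for decoded coordinator events.
sealed class CoordinatorEvent {
  const CoordinatorEvent();
}

class LockAcquired extends CoordinatorEvent {
  final int cameraId;
  final String keyboardId;

  const LockAcquired(this.cameraId, this.keyboardId);
}

class LockReleased extends CoordinatorEvent {
  final int cameraId;

  const LockReleased(this.cameraId);
}

class Heartbeat extends CoordinatorEvent {
  const Heartbeat();
}

/// Keeps unrecognised types visible instead of silently dropping them, so
/// a newer coordinator does not break an older keyboard.
class UnknownEvent extends CoordinatorEvent {
  final String type;

  const UnknownEvent(this.type);
}

/// Decodes one WebSocket text frame into a typed event.
CoordinatorEvent decodeEvent(String raw) {
  final json = jsonDecode(raw) as Map<String, dynamic>;
  switch (json['type']) {
    case 'lock_acquired':
      return LockAcquired(
        json['cameraId'] as int,
        json['keyboardId'] as String,
      );
    case 'lock_released':
      return LockReleased(json['cameraId'] as int);
    case 'heartbeat':
      return const Heartbeat();
    default:
      return UnknownEvent(json['type']?.toString() ?? '');
  }
}
```

The remaining event types would follow the same pattern, one subclass and one `case` each.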

### 4.4 Failover (Configured STANDBY)

```
keyboards.json:
{
  "keyboards": [
    {"id": "KB1", "role": "PRIMARY", "coordinatorPort": 8090},
    {"id": "KB2", "role": "STANDBY", "coordinatorPort": 8090}
  ]
}
```

- PRIMARY starts the coordinator service on `:8090`
- STANDBY monitors PRIMARY's `/health` endpoint
- If PRIMARY is unreachable for 6 seconds → STANDBY starts its own coordinator
- When the old PRIMARY recovers → it checks whether another coordinator is running → it defers (becomes STANDBY)
- Lock state after failover: **empty** (locks expire naturally in ≤5 minutes, same as legacy AppServer restart behavior)
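
The 6-second promotion rule above can be expressed as a small state holder that is fed each `/health` probe result. This is a sketch only: `StandbyMonitor` and `onProbe` are hypothetical names, the probe itself is left to the caller, and the clock is passed in so the rule can be exercised without real waiting.

```dart
/// Decides when a STANDBY keyboard should start its own coordinator.
class StandbyMonitor {
  /// How long PRIMARY must stay unreachable before promotion (6s per 4.4).
  final Duration promoteAfter;

  DateTime? _unreachableSince;

  StandbyMonitor({this.promoteAfter = const Duration(seconds: 6)});

  /// Feed the result of each /health probe. Returns true once PRIMARY has
  /// been continuously unreachable for at least [promoteAfter].
  bool onProbe({required bool healthy, required DateTime now}) {
    if (healthy) {
      // Any successful probe resets the outage window.
      _unreachableSince = null;
      return false;
    }
    _unreachableSince ??= now;
    return now.difference(_unreachableSince!) >= promoteAfter;
  }
}
```

Measuring elapsed time rather than counting failed probes keeps the rule independent of the polling interval, which the review does not fix.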

## 5. Improvement Summary: Legacy vs New

### What the New System Does BETTER

| Improvement | Detail |
|-------------|--------|
| No central server hardware | Coordinator runs on keyboard, not separate machine |
| Alarm reliability | Query + Subscribe + Periodic sync (legacy had event-only + hourly refresh) |
| Direct command path | CrossSwitch/PTZ bypass coordinator entirely (legacy routed some through AppServer) |
| Multiplatform | Flutter targets Windows, Linux, Android, and web; the .NET 8 services target Windows and Linux, and Android clients reach them over the network. Legacy was Windows-only WPF |
| No SDK dependency in UI | Bridges abstract SDKs behind REST. UI never touches native code |
| Independent operation | Each keyboard works standalone for critical ops. Legacy needed AppServer for several features |
| Deployable anywhere | Bridges + coordinator can run on any server, not just the keyboard |

### What the New System Must MATCH (Currently Missing)

| Legacy Feature | Legacy Implementation | New Implementation Needed |
|---------------|----------------------|---------------------------|
| Auto-reconnect to camera servers | 2-second periodic retry service | Bridge health polling + WebSocket auto-reconnect |
| Auto-reconnect to AppServer | SignalR built-in (0s, 5s, 10s, 15s) | Coordinator WebSocket auto-reconnect with backoff |
| Network detection | 5-second NIC polling worker | `connectivity_plus` package or periodic health checks |
| State resync on reconnect | `ViewerStatesInitWorker`, config resync on availability change | Query bridges + coordinator on any reconnect event |
| Graceful partial failure | `Parallel.ForEach` with per-driver try-catch | Already OK (each bridge independent) |
| Process watchdog | Windows Service | systemd / Windows Service / Docker restart policy |
| Media channel refresh | 10-minute periodic refresh | Periodic bridge status query |

### What the New System Should Do BETTER THAN Legacy

| Improvement | Legacy Gap | New Approach |
|-------------|-----------|--------------|
| Exponential backoff | Fixed delays (0, 5, 10, 15s), no backoff | Exponential: 1s, 2s, 4s, 8s, max 30s with jitter |
| Circuit breaker | None: retries forever even if server is gone | After N failures, back off to slow polling (60s) |
| Command retry | None: single attempt | Retry critical commands (CrossSwitch) 3x with 200ms delay |
| Health visibility | Hidden in logs | Operator-facing status dashboard in UI |
| Structured logging | Basic ILogger | JSON structured logging → ELK (already in design) |
| Graceful degradation UI | Commands silently disabled | Clear visual indicator: "Degraded mode: locks unavailable" |
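
The circuit-breaker row above ("after N failures, back off to slow polling") could be reduced to a counter that picks the next poll interval. This is a sketch under assumptions: `PollingCircuitBreaker` is a hypothetical name, and both the threshold of 5 and the 2-second normal interval are placeholder values, since the review fixes only the 60-second slow interval.

```dart
/// Chooses the next health-poll interval for one endpoint.
class PollingCircuitBreaker {
  /// The "N" from the table; 5 is an assumed default.
  final int failureThreshold;

  /// Interval while the endpoint is healthy or only briefly failing
  /// (assumed value, not taken from the review).
  final Duration normalInterval;

  /// Slow-polling interval once the breaker is open (60s per the table).
  final Duration slowInterval;

  int _consecutiveFailures = 0;

  PollingCircuitBreaker({
    this.failureThreshold = 5,
    this.normalInterval = const Duration(seconds: 2),
    this.slowInterval = const Duration(seconds: 60),
  });

  /// True once [failureThreshold] consecutive polls have failed.
  bool get isOpen => _consecutiveFailures >= failureThreshold;

  /// Records one poll result and returns how long to wait before the next.
  /// A single success closes the breaker again.
  Duration record({required bool success}) {
    _consecutiveFailures = success ? 0 : _consecutiveFailures + 1;
    return isOpen ? slowInterval : normalInterval;
  }
}
```

Unlike a classic circuit breaker, polling never stops entirely here; it only slows down, so a server that comes back is still noticed within a minute.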

## 6. Proposed Resilience Architecture

```mermaid
graph TB
    subgraph "Flutter App"
        UI["UI Layer"]
        BLoCs["BLoC Layer"]
        RS["ReconnectionService"]
        HS["HealthService"]
        BS["BridgeService"]
        CS["CoordinationClient"]
        KS["KeyboardService"]
    end

    subgraph "Health & Reconnection"
        RS -->|"periodic /health"| Bridge1["GeViScope Bridge"]
        RS -->|"periodic /health"| Bridge2["G-Core Bridge"]
        RS -->|"periodic /health"| Coord["Coordinator"]
        RS -->|"on failure"| BS
        RS -->|"on failure"| CS
        HS -->|"status stream"| BLoCs
    end

    subgraph "Normal Operation"
        BS -->|"REST commands"| Bridge1
        BS -->|"REST commands"| Bridge2
        BS -->|"WebSocket events"| Bridge1
        BS -->|"WebSocket events"| Bridge2
        CS -->|"REST + WebSocket"| Coord
    end

    BLoCs --> UI
    KS -->|"Serial/HID"| BLoCs
```

**New services needed in Flutter app:**

| Service | Responsibility |
|---------|---------------|
| `ReconnectionService` | Polls bridge `/health` endpoints, auto-reconnects WebSocket, triggers state resync |
| `HealthService` | Aggregates health of all bridges + coordinator, exposes stream to UI |
| `CoordinationClient` | REST + WebSocket client to coordinator (locks, sequences, heartbeat) |
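
To make the `HealthService` responsibility more concrete, here is one possible shape for it. Only the class name comes from the table; the `ComponentStatus` values, the string component keys, and the `report`, `stream`, and `isDegraded` members are assumptions made for this sketch.

```dart
import 'dart:async';

/// Coarse status for one bridge or the coordinator (assumed set of values).
enum ComponentStatus { online, reconnecting, offline }

/// Aggregates per-component status and exposes it as a stream for BLoCs.
class HealthService {
  final _statuses = <String, ComponentStatus>{};
  final _controller =
      StreamController<Map<String, ComponentStatus>>.broadcast();

  /// Emits a fresh snapshot whenever any component changes status.
  Stream<Map<String, ComponentStatus>> get stream => _controller.stream;

  /// True when any tracked component is not online; this could drive a
  /// degraded-mode indicator in the UI.
  bool get isDegraded =>
      _statuses.values.any((s) => s != ComponentStatus.online);

  /// Called by whatever detects status changes, for example the
  /// reconnection logic after a failed or recovered health poll.
  void report(String component, ComponentStatus status) {
    if (_statuses[component] == status) return; // unchanged: no event
    _statuses[component] = status;
    _controller.add(Map.unmodifiable(_statuses));
  }

  Future<void> dispose() => _controller.close();
}
```

Emitting an immutable copy on each change means listeners cannot mutate the service's internal map, and duplicate reports do not cause redundant UI rebuilds.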

## 7. Action Items Before Implementation

- [ ] **Create coordination service** (.NET 8 minimal API, ~400 lines)
- [ ] **Add `ReconnectionService`** to Flutter app (exponential backoff, health polling)
- [ ] **Add `HealthService`** to Flutter app (status aggregation for UI)
- [ ] **Add `CoordinationClient`** to Flutter app (locks, sequences)
- [ ] **Fix WebSocket auto-reconnect** in `BridgeService`
- [ ] **Add command retry** for CrossSwitch (3x with backoff)
- [ ] **Add bridge process supervision** (systemd/Windows Service configs)
- [ ] **Add state resync** on every reconnect event
- [ ] **Build health status UI** component
- [ ] **Update `servers.json`** schema to include coordinator URL
- [ ] **Build for Linux**: verify .NET 8 bridges compile for linux-x64
- [ ] **Abstract keyboard input** behind `KeyboardService` interface with platform impls