CORTEX: Camera Optimized Realtime Transmission Exchange

Battery-aware VLM middleware SDK for wearable devices like smart glasses. A 4-layer pipeline sits between the camera and AI APIs — gating redundant frames, cropping to salient regions, and routing to the right model — targeting 60%+ payload reduction and 2× battery extension without degrading answer quality.

Role

Solo Researcher & Engineer

Tech Stack
Python 3.11+ · OpenCV · scikit-image (SSIM) · NumPy · Pillow · httpx · Laplacian Blur Detection · MSER Text Detection · Spectral Saliency

The Challenge

Smart glasses stream ~30 fps to VLM APIs even when the scene is static, a whiteboard hasn't changed, or the frame is too blurry to be useful. Each API call costs tokens and drains the wearable battery. The challenge was building a drop-in middleware layer that makes the decision "should this frame be sent?" in real time, then compresses and crops the accepted frames to the minimum payload that still lets the VLM answer correctly — all without requiring hardware changes or model retraining.

Architecture & Deep Dive

System Architecture

4-Layer Middleware: Gate → Crop → Route → Memory

Smart Glasses Camera → L1 IMU Gate → L1 Blur Detector → L1 SSIM Filter → L2 Scene Classifier → L2 ROI Crop + Encode → L3 VLM Router (with Circuit Breaker) → L4 Context Memory → Claude / GPT / Gemini

L1: Capture — "Should we process this frame?"

markdown
New camera frame arrives
  │
  ├─ IMU Gate (motion sensor check)
  │    └─ acceleration below threshold? → skip (camera still, likely static scene)
  │
  ├─ Blur Detector (Laplacian variance)
  │    └─ variance < blur_threshold? → skip (frame too blurry for VLM)
  │
  ├─ Scene Change Detector (SSIM vs last accepted frame)
  │    ├─ SSIM ≥ similarity_threshold (e.g. 0.92)? → skip (scene unchanged)
  │    └─ SSIM < threshold → ACCEPT
  │         └─ update "last accepted frame" reference
  │
  Result: maximizes camera-off time; only meaningful frames pass to L2
  Target: 60%+ of frames filtered before any encode/API call
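
For concreteness, a minimal sketch of the L1 blur check. The function name and default threshold are illustrative, not the SDK's actual API:

python
import cv2
import numpy as np

def is_too_blurry(frame: np.ndarray, blur_threshold: float = 100.0) -> bool:
    """Reject frames with too few sharp edges for a VLM to read.

    Variance of the Laplacian response is a standard sharpness proxy:
    a motion-blurred frame has weak edge responses, hence low variance.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < blur_threshold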

L2: Compress — "What part matters? How small can it be?"

markdown
Accepted frame (from L1)
  │
  ├─ Scene Classifier
  │    ├─ "text-heavy" → MSER text region detection → crop to text bounding box
  │    ├─ "face/object" → spectral saliency map → crop to salient region
  │    └─ "general"    → center-weighted crop (safe fallback)
  │
  ├─ Hybrid ROI Crop
  │    └─ expand bounding box by context_margin (avoid cutting off context)
  │         └─ clamp to frame boundary
  │
  ├─ Adaptive Encoder
  │    ├─ WiFi detected      → JPEG quality 85, max 1024px
  │    ├─ LTE detected       → JPEG quality 70, max 768px
  │    └─ low-battery mode   → JPEG quality 55, max 512px
  │
  └─ Output: compressed JPEG bytes (typically 40-120 KB vs 2+ MB raw)
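
The adaptive encoder step fits in a few lines. The profile values mirror the table above; the function name and profile keys are assumptions for illustration:

python
import cv2
import numpy as np

# (JPEG quality, max edge px) per network/battery state, mirroring the table above
ENCODE_PROFILES = {"wifi": (85, 1024), "lte": (70, 768), "low_battery": (55, 512)}

def encode_adaptive(frame: np.ndarray, profile: str) -> bytes:
    """Downscale the ROI crop to the profile's max edge, then JPEG-encode it."""
    quality, max_px = ENCODE_PROFILES[profile]
    h, w = frame.shape[:2]
    scale = min(1.0, max_px / max(h, w))
    if scale < 1.0:
        frame = cv2.resize(frame, (round(w * scale), round(h * scale)),
                           interpolation=cv2.INTER_AREA)
    ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, quality])
    if not ok:
        raise RuntimeError("JPEG encode failed")
    return buf.tobytes()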

L1 Core — SSIM scene change gate

python
# Imports required by this snippet (it is a method of the L1 gate class)
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def should_process_frame(self, frame: np.ndarray) -> tuple[bool, float]:
    """
    Returns (should_process, ssim_score).
    Rejects frames too similar to the last accepted frame.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # First frame always accepted
    if self._last_frame is None:
        self._last_frame = gray
        return True, 0.0

    score, _ = ssim(self._last_frame, gray, full=True)

    if score >= self.similarity_threshold:   # e.g. 0.92
        return False, score                  # scene unchanged → skip

    self._last_frame = gray                  # update reference
    return True, score

Technical Trade-offs

  • SSIM over pixel diff: Structural Similarity is robust to minor lighting changes and JPEG compression artifacts that fool naive pixel subtraction. Tunable threshold (0.85–0.95) lets integrators trade sensitivity for battery savings.
  • Laplacian blur before SSIM: Computing Laplacian variance is a single cheap convolution pass; SSIM's sliding-window comparison is substantially more expensive per frame. Running the blur check first drops ~15% of frames before the costlier similarity comparison.
  • Adaptive JPEG quality (not resolution scaling): Dropping quality 85→55 cuts payload ~3× with minimal VLM accuracy loss on text/object recognition. Resolution scaling can cut off context; quality reduction is safer for most VLM tasks.
  • Stateful gating across frames: The SSIM reference frame and ROI classifier state persist between frames, enabling temporal coherence without a full video encoder.
  • Circuit breaker in L3 (planned): VLM router tracks error rate per provider. On >3 consecutive failures, it opens the circuit (stops sending to that provider for 30 s) and falls back to the next configured model; a minimal sketch follows below.
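
L3 is still planned, so the following is only a minimal sketch of the breaker behavior described above; class and method names are placeholders, not shipped code:

python
import time

class ProviderCircuitBreaker:
    """Opens after >3 consecutive failures; retries after a 30 s cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                           # closed: provider healthy
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None                 # half-open: allow one probe
            self.failures = 0
            return True
        return False                              # open: fall back to next model

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures > self.failure_threshold:
            self.opened_at = time.monotonic()     # trip the breaker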

Reliability & Validation

Test Coverage

80+ pytest tests, 99% coverage on L1 + L2. SSIM gate, blur detector, ROI crop, and adaptive encoder each have isolated unit tests with synthetic frames.
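
A representative test, assuming a hypothetical SsimGate class that wraps the should_process_frame method shown earlier:

python
import numpy as np

def test_unchanged_scene_is_skipped():
    gate = SsimGate(similarity_threshold=0.92)    # hypothetical L1 gate class
    rng = np.random.default_rng(seed=0)
    frame = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)

    accepted, _ = gate.should_process_frame(frame)
    assert accepted                               # first frame is always accepted

    accepted, score = gate.should_process_frame(frame.copy())
    assert not accepted and score > 0.99          # identical frame is skipped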

Validation

A webcam demo with a real-time HUD overlays accepted/rejected frame counts, SSIM scores, and payload sizes. Validated against three scene types: static whiteboard, hand movement, and walking.

Edge cases validated
  • All-black / all-white frames — Laplacian variance = 0 → rejected as blurry before SSIM
  • Scene flash (sudden brightness change) — SSIM drops to ~0.4 even with no semantic change → threshold tuning guide in docs
  • MSER on low-contrast text — falls back to center crop when fewer than 5 text regions detected
  • Tiny frames (<64×64) — spatial pyramid pooling skipped, direct encode
  • Network switch WiFi→LTE mid-session — adaptive encoder re-reads network state per frame (no restart needed)

Error Handling Strategy

  • L1 gate errors (OpenCV decode failure) return (False, 0.0) — frame is dropped, not crashed.
  • L2 ROI crop out-of-bounds is clamped to frame size with np.clip (sketched below); it never raises IndexError.
  • L3 VLM call failure increments error counter; circuit breaker opens after threshold (L3 planned).
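
The clamp in the second bullet can be sketched as follows; the helper name and margin default are illustrative:

python
import numpy as np

def clamp_roi(x: int, y: int, w: int, h: int,
              frame_w: int, frame_h: int,
              margin: int = 16) -> tuple[int, int, int, int]:
    """Expand an ROI by the context margin, then clamp it inside the frame."""
    x0 = int(np.clip(x - margin, 0, frame_w))
    y0 = int(np.clip(y - margin, 0, frame_h))
    x1 = int(np.clip(x + w + margin, 0, frame_w))
    y1 = int(np.clip(y + h + margin, 0, frame_h))
    # slicing frame[y0:y1, x0:x1] with these bounds can never raise IndexError
    return x0, y0, x1 - x0, y1 - y0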

Impact & Collaboration

  • Targets 60%+ reduction in frames sent to VLM APIs, directly cutting token costs and API latency.
  • Targets 2× battery extension on wearables by maximizing camera-off time via IMU + SSIM gating.
  • 99% test coverage on L1 + L2 with 80+ pytest cases — production-ready filtering pipeline.
  • Drop-in SDK design: integrators call pipeline.process(frame) and receive (should_call_vlm, payload_bytes) — zero changes to existing VLM integration code (see the usage sketch below).
  • Accompanied by an arXiv paper (cortex-arxiv-v5) formalizing the 4-layer architecture and benchmark methodology.
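
A usage sketch of the drop-in API; the CortexPipeline class name and import path are assumptions for illustration:

python
import cv2
from cortex import CortexPipeline   # hypothetical import path

pipeline = CortexPipeline(similarity_threshold=0.92)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    should_call_vlm, payload_bytes = pipeline.process(frame)
    if should_call_vlm:
        # hand the compressed payload to the existing VLM client (placeholder)
        send_to_vlm(payload_bytes)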