Battery-aware VLM middleware SDK for wearable devices like smart glasses. A 4-layer pipeline sits between the camera and AI APIs — gating redundant frames, cropping to salient regions, and routing to the right model — targeting 60%+ payload reduction and 2× battery extension without degrading answer quality.
Solo Researcher & Engineer
Smart glasses stream ~30 fps to VLM APIs even when the scene is static, a whiteboard hasn't changed, or the frame is too blurry to be useful. Each API call costs tokens and drains the wearable battery. The challenge was building a drop-in middleware layer that makes the decision "should this frame be sent?" in real time, then compresses and crops the accepted frames to the minimum payload that still lets the VLM answer correctly — all without requiring hardware changes or model retraining.
4-Layer Middleware: Gate → Crop → Route → Memory
New camera frame arrives
│
├─ IMU Gate (motion sensor check)
│ └─ acceleration below threshold? → skip (camera still, likely static scene)
│
├─ Blur Detector (Laplacian variance)
│ └─ variance < blur_threshold? → skip (frame too blurry for VLM)
│
├─ Scene Change Detector (SSIM vs last accepted frame)
│ ├─ SSIM > similarity_threshold (e.g. 0.92)? → skip (scene unchanged)
│ └─ SSIM ≤ threshold → ACCEPT
│ └─ update "last accepted frame" reference
│
Result: maximizes camera-off time; only meaningful frames pass to L2
Target: 60%+ of frames filtered before any encode/API call
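The IMU and blur gates above are cheap enough to sketch directly. A minimal illustration, assuming the IMU delivers a residual acceleration magnitude per frame; the class name and threshold defaults are placeholders, not the SDK's tuned values:

import cv2
import numpy as np


class L1GateSketch:
    """Illustrative IMU + blur gates. Defaults are example values."""

    def __init__(self, motion_threshold: float = 0.15,
                 blur_threshold: float = 100.0):
        self.motion_threshold = motion_threshold  # residual accel, m/s^2
        self.blur_threshold = blur_threshold      # Laplacian variance floor

    def imu_is_still(self, accel_magnitude: float) -> bool:
        # Camera still → scene likely static → skip before any pixel work
        return accel_magnitude < self.motion_threshold

    def is_too_blurry(self, frame: np.ndarray) -> bool:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Low variance of the Laplacian = few sharp edges = motion blur
        return cv2.Laplacian(gray, cv2.CV_64F).var() < self.blur_threshold

The IMU check runs before any image processing, so a still wearer costs almost nothing per frame.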
Accepted frame (from L1)
│
├─ Scene Classifier
│ ├─ "text-heavy" → MSER text region detection → crop to text bounding box
│ ├─ "face/object" → spectral saliency map → crop to salient region
│ └─ "general" → center-weighted crop (safe fallback)
│
├─ Hybrid ROI Crop
│ └─ expand bounding box by context_margin (avoid cutting off context)
│ └─ clamp to frame boundary
│
├─ Adaptive Encoder
│ ├─ WiFi detected → JPEG quality 85, max 1024px
│ ├─ LTE detected → JPEG quality 70, max 768px
│ └─ low-battery mode → JPEG quality 55, max 512px
│
└─ Output: compressed JPEG bytes (typically 40-120 KB vs 2+ MB raw)
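The crop-and-encode path can be sketched as follows, using the presets from the diagram above; crop_and_encode, the context_margin default, and the mode keys are assumed names for illustration:

import cv2
import numpy as np

# JPEG quality / max-long-edge presets from the adaptive encoder diagram
ENCODE_PRESETS = {
    "wifi":        (85, 1024),
    "lte":         (70, 768),
    "low_battery": (55, 512),
}


def crop_and_encode(frame: np.ndarray, box: tuple[int, int, int, int],
                    mode: str, context_margin: int = 32) -> bytes:
    """Expand the ROI by a context margin, clamp to frame bounds,
    downscale to the preset's max edge, and JPEG-encode."""
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = box
    # Expand, then clamp with np.clip so cropping never indexes out of bounds
    x0, x1 = np.clip([x0 - context_margin, x1 + context_margin], 0, w)
    y0, y1 = np.clip([y0 - context_margin, y1 + context_margin], 0, h)
    roi = frame[y0:y1, x0:x1]

    quality, max_edge = ENCODE_PRESETS[mode]
    scale = max_edge / max(roi.shape[:2])
    if scale < 1.0:  # only ever downscale
        roi = cv2.resize(roi, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_AREA)

    ok, buf = cv2.imencode(".jpg", roi, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return buf.tobytes() if ok else b""

Clamping with np.clip is what backs the "never raises IndexError" guarantee noted under robustness below.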
"""
Returns (should_process, ssim_score).
Rejects frames too similar to the last accepted frame.
"""
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# First frame always accepted
if self._last_frame is None:
self._last_frame = gray
return True, 0.0
score, _ = ssim(self._last_frame, gray, full=True)
if score >= self.similarity_threshold: # e.g. 0.92
return False, score # scene unchanged → skip
self._last_frame = gray # update reference
return True, score80+ pytest tests, 99% coverage on L1 + L2. SSIM gate, blur detector, ROI crop, and adaptive encoder each have isolated unit tests with synthetic frames.
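One of the synthetic-frame tests might look like this; the SSIMGate import path and constructor are assumed names, not the repo's actual layout:

import numpy as np

from middleware.gates import SSIMGate  # hypothetical module path


def test_unchanged_scene_is_gated_out():
    gate = SSIMGate(similarity_threshold=0.92)
    rng = np.random.default_rng(42)
    frame = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
    assert gate.should_process_frame(frame)[0]     # first frame passes
    accepted, score = gate.should_process_frame(frame.copy())
    assert not accepted and score >= 0.92          # identical → SSIM ~1.0


def test_scene_change_passes():
    gate = SSIMGate(similarity_threshold=0.92)
    gate.should_process_frame(np.zeros((240, 320, 3), dtype=np.uint8))
    accepted, score = gate.should_process_frame(
        np.full((240, 320, 3), 255, dtype=np.uint8))
    assert accepted and score < 0.92               # black → white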
A webcam demo with a real-time HUD overlays accepted/rejected frame counts, SSIM scores, and payload sizes on the live feed. Validated against three scene types: static whiteboard, hand movement, and walking.
Robustness and integration:
- On a bad frame or gating error, the pipeline returns (False, 0.0): the frame is dropped, not crashed on.
- Crop coordinates are clamped to the frame boundary with np.clip, so ROI extraction never raises IndexError.
- Drop-in API: call pipeline.process(frame) and receive (should_call_vlm, payload_bytes), with zero changes to existing VLM integration code.
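As a usage sketch (package and class names are assumptions; the process(frame) → (should_call_vlm, payload_bytes) contract is as described above):

import cv2

from vlm_middleware import Pipeline  # hypothetical import


def send_to_vlm(payload: bytes) -> None:
    """Stand-in for the caller's existing VLM API client."""
    print(f"would send {len(payload)} bytes")


pipeline = Pipeline()  # gate thresholds default to the values above

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    should_call_vlm, payload_bytes = pipeline.process(frame)
    if should_call_vlm:
        # Only gated, cropped, compressed frames ever leave the device
        send_to_vlm(payload_bytes)
cap.release()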