Signia’s sign recognition feature lets deaf or non-speaking users communicate by signing in front of their webcam and receiving an instant text translation. The system captures video frames in the browser, extracts hand landmark coordinates using MediaPipe, and classifies the gesture sequence with a RandomForest model trained on sign videos uploaded through the admin panel. The result is a low-latency, server-side prediction that runs without any GPU requirement.
How It Works
1. User opens /reconocimientos/camara/
The camara view renders the recognition interface (usuarios/reconocimiento.html), which activates the browser’s webcam. Users with discapacidad='sordo' or 'mudo' are automatically redirected here after login.
2. Browser captures video frames
JavaScript captures frames from the <video> element at regular intervals. Frames are either encoded as base64 data URLs and sent to the server or, in the faster landmark pipeline, processed locally first by MediaPipe JS.
3. MediaPipe JS extracts hand landmarks client-side
The preferred path uses MediaPipe’s JavaScript HandLandmarker running directly in the browser (via WASM). It detects up to 2 hands and extracts 21 3D landmarks per hand, producing a flat array of up to 126 floats ([x0,y0,z0, ..., x20,y20,z20] × 2 hands). This offloads the most expensive CV work to the client.
4. Landmarks sent to /reconocimientos/predecir_landmarks/
The browser POSTs the accumulated landmark sequence (a list of up to 30 frames, each 126 floats) as JSON to the predecir_landmarks endpoint. This avoids sending raw image data over the network and skips server-side MediaPipe processing entirely.
5. Server normalizes and runs RandomForest prediction
The server normalizes each frame to its hand centroid (translation-invariant features), resamples the sequence to exactly 30 frames, builds a feature vector of flattened positions + frame deltas + magnitudes, and calls modelo.predict().
Prediction Endpoints
The recognition app exposes two complementary endpoints depending on where landmark extraction happens:
- predecir_landmarks (recommended)
- predecir (server-side MediaPipe)
URL: POST /reconocimientos/predecir_landmarks/
The client sends pre-computed landmarks from MediaPipe JS, bypassing server-side computer vision entirely. This is the primary recognition path: faster and more scalable.
Each row of the submitted sequence contains 126 floats: 21 landmarks × 3 coordinates × 2 hands. If only one hand is present, the second 63 values are 0.0. The server normalizes each row to its centroid, resamples to 30 frames, and runs inference.
In some cases the endpoint responds with {"seña": "", "confianza": 0} without running inference.
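The row layout described above can be sketched in Python as follows. This is only an illustration of the padding rule (63 floats per hand, zeros for a missing second hand, at most 30 rows); the helper names `rellenar_frame` and `construir_secuencia` are hypothetical, and keeping the most recent 30 frames is an assumption. The JSON field name that wraps the rows is not given in the text, so the sketch returns the bare list of rows.

```python
FLOATS_POR_MANO = 63     # 21 landmarks x 3 coordinates
FLOATS_POR_FRAME = 126   # two hands per frame
MAX_FRAMES = 30          # upper bound on the sequence length stated in the text


def rellenar_frame(manos):
    """Flatten 0-2 hands (each a list of 63 floats) into one 126-float row.

    A missing hand is represented by 0.0 values, as described in the text.
    """
    if len(manos) > 2:
        raise ValueError("at most two hands per frame")
    fila = []
    for mano in manos:
        if len(mano) != FLOATS_POR_MANO:
            raise ValueError("each hand must contribute exactly 63 floats")
        fila.extend(float(v) for v in mano)
    # Pad with zeros so every row has the same width.
    fila.extend([0.0] * (FLOATS_POR_FRAME - len(fila)))
    return fila


def construir_secuencia(frames):
    """Build the list of rows to submit (assumption: keep the latest 30 frames)."""
    return [rellenar_frame(manos) for manos in frames[-MAX_FRAMES:]]
```

A frame with a single detected hand therefore yields a row whose last 63 values are all 0.0.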
MediaPipe HandLandmarker
The server-side HandLandmarker is configured to detect up to two hands. Each hand yields 21 landmarks (x, y, z), giving a maximum of 42 points (126 floats) per frame.
Thread Safety
HandLandmarker is not thread-safe. Sharing a single instance across Django worker threads causes deadlocks and dropped frames. Signia solves this with threading.local(): each worker thread gets its own HandLandmarker instance, created on first use and reused for the lifetime of that thread.
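A minimal sketch of the per-thread pattern, using only the standard library. The function name `obtener_detector` and the `crear_detector` factory argument are hypothetical; in the real application the factory would presumably build a MediaPipe HandLandmarker, but here it can be any callable so the pattern is visible on its own.

```python
import threading

# One storage slot per thread; attributes set here are invisible to other threads.
_local = threading.local()


def obtener_detector(crear_detector):
    """Return the calling thread's detector, creating it on first use.

    `crear_detector` is a hypothetical zero-argument factory.
    """
    detector = getattr(_local, "detector", None)
    if detector is None:
        detector = crear_detector()
        _local.detector = detector
    return detector
```

Repeated calls from the same thread return the same object, while a second thread triggers a second call to the factory and gets an independent instance.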
RandomForest Classifier
The classifier is a sklearn.ensemble.RandomForestClassifier trained on the augmented landmark sequences described below.
Feature Engineering
Prediction uses a three-part feature vector constructed by construir_features():
| Component | Shape | Description |
|---|---|---|
| Positions | 30 × 126 flattened | Centroid-normalized landmark coordinates across all 30 frames |
| Deltas | 29 × 126 flattened | Frame-to-frame differences — captures motion direction |
| Magnitudes | 29 values | L2 norm of each delta row — captures movement speed |
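As a rough illustration of the table above, the three components can be computed with NumPy. The function name `construir_features_sketch` and the concatenation order (positions, then deltas, then magnitudes) are assumptions; the real construir_features() may differ. The input is assumed to be already centroid-normalized and resampled to 30 frames.

```python
import numpy as np


def construir_features_sketch(secuencia):
    """Build a 1-D feature vector from a (30, 126) landmark sequence."""
    sec = np.asarray(secuencia, dtype=np.float64)
    if sec.shape != (30, 126):
        raise ValueError("expected a (30, 126) sequence")
    deltas = np.diff(sec, axis=0)                # (29, 126) frame-to-frame differences
    magnitudes = np.linalg.norm(deltas, axis=1)  # (29,) L2 norm of each delta row
    # Assumed order: positions, deltas, magnitudes.
    return np.concatenate([sec.ravel(), deltas.ravel(), magnitudes])
```

Under these assumptions the vector has 30 × 126 + 29 × 126 + 29 = 7463 entries.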
Centroid Normalization
Before building features, each frame’s landmarks are shifted by the hand’s centroid, making predictions invariant to screen position.
Sequence Normalization
Incoming sequences of any length are resampled to exactly 30 frames using linear interpolation (np.interp), so the classifier always receives a fixed-size input regardless of how fast the user signed.
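The two normalization steps can be sketched as below. The helper names `normalizar_centroide` and `remuestrear` are hypothetical, and two details are assumptions not stated in the text: the centroid is computed per hand over its 21 landmarks, and an all-zero 63-float block is treated as an absent hand and left untouched.

```python
import numpy as np

N_FRAMES = 30          # fixed sequence length stated in the text
FLOATS_POR_MANO = 63   # 21 landmarks x 3 coordinates


def normalizar_centroide(frame):
    """Shift each hand's landmarks so that hand's centroid sits at the origin."""
    frame = np.asarray(frame, dtype=np.float64)
    salida = np.zeros_like(frame)
    for inicio in (0, FLOATS_POR_MANO):
        mano = frame[inicio:inicio + FLOATS_POR_MANO].reshape(21, 3)
        if np.any(mano):  # assumption: an all-zero block means "hand absent"
            centrada = mano - mano.mean(axis=0)
            salida[inicio:inicio + FLOATS_POR_MANO] = centrada.ravel()
    return salida


def remuestrear(secuencia, n_frames=N_FRAMES):
    """Linearly interpolate a (T, 126) sequence (T >= 1) to (n_frames, 126)."""
    sec = np.asarray(secuencia, dtype=np.float64)
    t_orig = np.linspace(0.0, 1.0, num=len(sec))
    t_nuevo = np.linspace(0.0, 1.0, num=n_frames)
    # np.interp is one-dimensional, so each of the 126 columns is resampled separately.
    columnas = [np.interp(t_nuevo, t_orig, sec[:, j]) for j in range(sec.shape[1])]
    return np.stack(columnas, axis=1)
```

Because the centroid is subtracted, moving the hand rigidly within the camera frame leaves the normalized row unchanged, and resampling preserves the first and last frames exactly.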
Training Data Augmentation
During training, each uploaded sign video is augmented into up to 8 variations per sample to improve generalization:
Gaussian Noise
Small (σ=0.008) and larger (σ=0.018) noise levels simulate natural hand tremor.
Scale Variation
Uniform scaling between 0.93–1.07× simulates varying camera distances.
Speed Variation
Sequences are resampled to a random length (20–45 frames) to handle different signing speeds.
Horizontal Mirror
X-coordinates are flipped (x → 1.0 - x) to handle left- and right-handed signers.
Translation
Random X/Y offsets (±0.08) simulate the signer not being centered in the frame.
Temporal Reversal
The frame sequence is reversed to improve robustness to symmetrical gestures.
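The augmentations listed above could look roughly like this in NumPy. The function name `aumentar`, the dictionary keys, and the choice to apply each transform independently to raw (not centroid-normalized) coordinates are assumptions; only the numeric ranges come from the text. The sketch also ignores the special handling a zero-padded absent hand would need when mirroring or translating.

```python
import numpy as np


def _remuestrear(sec, n):
    """Linearly interpolate a (T, D) sequence to (n, D)."""
    t_orig = np.linspace(0.0, 1.0, num=len(sec))
    t_nuevo = np.linspace(0.0, 1.0, num=n)
    return np.stack(
        [np.interp(t_nuevo, t_orig, sec[:, j]) for j in range(sec.shape[1])], axis=1
    )


def aumentar(secuencia, rng):
    """Return a dict of augmented copies of a (T, 126) landmark sequence."""
    sec = np.asarray(secuencia, dtype=np.float64)
    xyz = sec.reshape(len(sec), -1, 3)  # (T, 42, 3) view of the coordinates

    espejo = xyz.copy()
    espejo[..., 0] = 1.0 - espejo[..., 0]                 # x -> 1.0 - x

    trasladada = xyz.copy()
    trasladada[..., :2] += rng.uniform(-0.08, 0.08, 2)    # one random X/Y offset

    return {
        "ruido_leve": sec + rng.normal(0.0, 0.008, sec.shape),
        "ruido_fuerte": sec + rng.normal(0.0, 0.018, sec.shape),
        "escala": sec * rng.uniform(0.93, 1.07),
        "velocidad": _remuestrear(sec, int(rng.integers(20, 46))),  # 20-45 frames
        "espejo": espejo.reshape(sec.shape),
        "traslacion": trasladada.reshape(sec.shape),
        "reversa": sec[::-1].copy(),
    }
```

Passing a seeded generator such as `np.random.default_rng(0)` makes the augmented set reproducible between training runs.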
Request Throttling
To prevent a single active session from saturating the server at 30+ requests per second, the detectar_mano endpoint enforces a 120 ms minimum interval per session key. A request that arrives before the interval has elapsed receives {"hay_mano": false, "throttled": true} immediately. Stale entries older than 60 seconds are pruned when the per-session timestamp dictionary grows beyond 500 keys.
Hand Presence Detection
The detectar_mano endpoint (POST /reconocimientos/detectar_mano/) is a lightweight presence check that runs at up to 8 fps, consistent with the 120 ms throttle interval. It decodes a single base64 frame, runs the HandLandmarker, and returns a boolean hay_mano field.
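The decoding step mentioned above can be illustrated with the standard library alone. The helper name `decodificar_data_url` is hypothetical, and the sketch stops at raw image bytes; how the real endpoint turns those bytes into an image for the HandLandmarker is not described in the text.

```python
import base64


def decodificar_data_url(data_url):
    """Return the raw bytes carried by a base64 data URL.

    Expects the usual "data:<mime>;base64,<payload>" shape and raises
    ValueError for anything else.
    """
    encabezado, separador, carga = data_url.partition(",")
    if (
        not separador
        or not encabezado.startswith("data:")
        or not encabezado.endswith(";base64")
    ):
        raise ValueError("expected a base64 data URL")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(carga, validate=True)
```

Rejecting malformed input early keeps the more expensive landmark detection from running on requests that cannot contain a valid frame.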
Disability Routing
The redirigir_por_discapacidad() function in usuarios/views.py automatically routes users to the recognition interface based on their profile.
Users with discapacidad='sordo' (deaf) are routed to /reconocimiento/ because they communicate by signing; the recognition feature translates their signs into text that hearing users can read. Users marked 'mudo' (non-speaking) follow the same path for the same reason. This routing is applied on login, post-OTP verification, and after OAuth disability selection.
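The routing rule can be reduced to a small lookup, shown here without any Django dependencies. The function name `ruta_por_discapacidad` and the `None` fallback are hypothetical; the text does not say where users with other profile values are sent, so the sketch leaves that decision to the caller.

```python
RUTA_RECONOCIMIENTO = "/reconocimiento/"      # path named in the text
DISCAPACIDADES_CON_SENAS = {"sordo", "mudo"}  # profiles routed to recognition


def ruta_por_discapacidad(discapacidad):
    """Return the recognition path for signing users, or None otherwise.

    None means "no special routing"; the real fallback is not documented here.
    """
    if discapacidad in DISCAPACIDADES_CON_SENAS:
        return RUTA_RECONOCIMIENTO
    return None
```

Keeping the rule in one function lets the login, OTP, and OAuth flows share the same decision.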