Neural Vault’s entire pipeline is built on top of cortical surface predictions produced by TRIBEv2, Meta’s brain-encoding model (facebook/tribev2). TRIBEv2 takes a video stimulus, segments it into neural events, and returns a (T, 20484) float32 matrix where each row is one predicted fMRI timestep and each column is one vertex on the fsaverage5 cortical surface. The data.py script calls the TRIBEv2 API for each of your five videos, flattens the predictions into long-form rows, and writes the concatenated result to 5classpreds.csv. That single file is the only data input consumed by main.py and model.py.
Generating the CSV
The data.py script downloads the TRIBEv2 checkpoint from Hugging Face, iterates over five MP4 video files, and writes every vertex activation at every timestep to a CSV:
- TribeModel.from_pretrained("facebook/tribev2") loads the brain-encoding checkpoint from the Hugging Face Hub into ./cache.
- model.get_events_dataframe(video_path=filename) segments the video into stimulus events and returns a timing DataFrame that TRIBEv2 uses to align fMRI predictions.
- model.predict(events=df) runs the encoder and returns preds with shape (n_timesteps, n_vertices), where n_vertices = 20484.
- The triple-nested loop unrolls preds into one row per (video, timestep, vertex) combination and appends it to all_rows.
- The final DataFrame is written to 5classpreds.csv.
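The unrolling step can be illustrated without downloading the checkpoint. The following is a minimal sketch, not the data.py source: the helper name unroll_predictions, the shortened file names, and the small random arrays standing in for model.predict output are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def unroll_predictions(video_name, preds):
    """Flatten a (n_timesteps, n_vertices) array into long-form row dicts."""
    rows = []
    n_timesteps, n_vertices = preds.shape
    for t in range(n_timesteps):
        for v in range(n_vertices):
            rows.append({
                "video": video_name,
                "timestep": t,
                "prediction": float(preds[t, v]),
            })
    return rows


# ASSUMPTION: tiny random arrays replace the real (T, 20484) predictions
# so the sketch runs in milliseconds and needs no model download.
rng = np.random.default_rng(0)
fake_preds = {
    "v1.mp4": rng.standard_normal((3, 4)).astype(np.float32),
    "v2.mp4": rng.standard_normal((2, 4)).astype(np.float32),
}

all_rows = []
for name, preds in fake_preds.items():
    all_rows.extend(unroll_predictions(name, preds))

# One row per (video, timestep, vertex) combination.
df = pd.DataFrame(all_rows, columns=["video", "timestep", "prediction"])
print(df.shape)
```

At the real width each timestep contributes 20,484 rows, so the CSV grows quickly with video length.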
The input videos are named v1 (online-video-cutter.com).mp4 through v5 (online-video-cutter.com).mp4 and must exist in the working directory before you run the script. Missing files are skipped with a printed warning rather than raising an exception, so make sure all five are present to produce valid class labels downstream.
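Because a skipped video silently removes a class, it can help to check for the files before running the script. This is a hedged pre-flight sketch, not part of data.py; the helper name find_missing is hypothetical, and the file names follow the pattern given above.

```python
import os

# File names follow the v1..v5 pattern described in the text.
VIDEO_FILES = [f"v{i} (online-video-cutter.com).mp4" for i in range(1, 6)]


def find_missing(paths, directory="."):
    """Return the entries of paths that are not regular files in directory."""
    return [p for p in paths if not os.path.isfile(os.path.join(directory, p))]


missing = find_missing(VIDEO_FILES)
if missing:
    print(f"Missing {len(missing)} of {len(VIDEO_FILES)} videos: {missing}")
else:
    print("All five videos are present.")
```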
What the pipeline expects
5classpreds.csv must contain exactly three columns:
| Column | Type | Description |
|---|---|---|
| video | string | The video filename, used as the class label: each unique filename becomes one class (0–4). |
| timestep | int | Zero-indexed temporal frame number within that video. |
| prediction | float32 | The predicted cortical vertex activation at this timestep, from TRIBEv2. |
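The column layout above can be checked before training. The sketch below is an illustration only: validate_predictions is a hypothetical helper that does not exist in main.py or model.py, and it checks just the names and basic dtypes listed in the table.

```python
import pandas as pd

EXPECTED_COLUMNS = {"video", "timestep", "prediction"}


def validate_predictions(df):
    """Raise ValueError on a layout mismatch; return the number of classes."""
    if set(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"expected columns {sorted(EXPECTED_COLUMNS)}, "
                         f"got {sorted(df.columns)}")
    if not pd.api.types.is_integer_dtype(df["timestep"]):
        raise ValueError("timestep must be an integer column")
    if not pd.api.types.is_float_dtype(df["prediction"]):
        raise ValueError("prediction must be a float column")
    if (df["timestep"] < 0).any():
        raise ValueError("timestep is zero-indexed and cannot be negative")
    return df["video"].nunique()


# Tiny illustrative frame; real data would come from pd.read_csv.
example = pd.DataFrame({
    "video": ["a.mp4", "a.mp4", "b.mp4"],
    "timestep": [0, 1, 0],
    "prediction": [0.12, -0.40, 0.05],
})
n_classes = validate_predictions(example)
print(n_classes)
```

For the real file, a result other than 5 would suggest that one or more videos were skipped during generation.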
model.py handles the missing file gracefully. If 5classpreds.csv is not found, it prints a warning and substitutes a synthetic stand-in.
Preprocessing steps
Once 5classpreds.csv is on disk, the data passes through three sequential transformations before reaching the Transformer encoder.
Step 1 — Pivot and scale (load_and_prepare_data in main.py)
main.py loads the long-form CSV and pivots it into a (N_samples, N_features) matrix:
- pd.pivot_table reshapes the long-form rows into a wide matrix where each row is one (video, timestep) pair and each column is one of the 20,484 vertex feature indices.
- label_map assigns an integer class label 0–4 to each video filename in sorted alphabetical order.
- np.nan_to_num replaces any NaN, +inf, or −inf values left by TRIBEv2 edge cases.
- StandardScaler().fit_transform(X) z-scores each feature column using statistics fit on the entire dataset. This differs from model.py’s normalization, which is fit only on the user’s own data; see Step 2 below.
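The four operations above can be sketched on a toy frame. This is not the load_and_prepare_data source. Two assumptions are made: the vertex index is recovered from row order within each (video, timestep) group, since the CSV has no vertex column, and the scaling is written out in NumPy instead of calling StandardScaler so the example needs no scikit-learn install. The column name feature_idx is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy long-form data: 2 videos x 2 timesteps x 3 vertices, with one NaN.
long_df = pd.DataFrame({
    "video": ["a.mp4"] * 6 + ["b.mp4"] * 6,
    "timestep": [0, 0, 0, 1, 1, 1] * 2,
    "prediction": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
                   7.0, 8.0, 9.0, 10.0, np.nan, 12.0],
})

# ASSUMPTION: rows inside each (video, timestep) group are in vertex order,
# so a running count within the group recovers the vertex index.
long_df["feature_idx"] = long_df.groupby(["video", "timestep"]).cumcount()

wide = pd.pivot_table(long_df, index=["video", "timestep"],
                      columns="feature_idx", values="prediction")

# Replace NaN and infinities with finite values.
X = np.nan_to_num(wide.to_numpy(dtype=np.float64))

# Integer labels in sorted filename order.
videos = wide.index.get_level_values("video")
label_map = {name: i for i, name in enumerate(sorted(videos.unique()))}
y = np.array([label_map[v] for v in videos])

# Column-wise z-score with population standard deviation, the same statistic
# StandardScaler uses; zero-variance columns are left unscaled.
mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0] = 1.0
X_scaled = (X - mean) / std

print(X_scaled.shape, y)
```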
Step 2 — Per-feature z-score normalization (model.py)
model.py applies per-feature normalization whose mean and standard deviation are fit exclusively on the user’s own data, never on impostors.
The + 1e-8 epsilon prevents division by zero on constant-valued vertices. The mean and standard deviation are stored as feat_mean and feat_std and reused verbatim at inference time inside get_embedding(), ensuring that new authentication samples are normalized relative to the enrollment distribution.
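The fit-once, reuse-at-inference pattern can be sketched as follows. This is an illustration under assumptions, not the model.py code: the function names fit_normalizer and normalize are hypothetical, and the epsilon is assumed to be added to the standard deviation in the denominator.

```python
import numpy as np

EPS = 1e-8  # epsilon value named in the text


def fit_normalizer(user_X):
    """Compute per-feature statistics from enrollment (user-only) data."""
    feat_mean = user_X.mean(axis=0)
    feat_std = user_X.std(axis=0)
    return feat_mean, feat_std


def normalize(X, feat_mean, feat_std):
    """Apply the stored enrollment statistics; EPS guards constant features."""
    return (X - feat_mean) / (feat_std + EPS)


# Enrollment data: the second feature is constant, so its std is exactly 0.
enroll = np.array([[1.0, 5.0],
                   [3.0, 5.0]])
feat_mean, feat_std = fit_normalizer(enroll)

# A later authentication sample reuses the stored statistics unchanged.
new_sample = np.array([[4.0, 5.0]])
z = normalize(new_sample, feat_mean, feat_std)
print(z)
```

Refitting the statistics on the authentication sample instead would shift the reference distribution, which is why the stored values are reused.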
Step 3 — Sequence chunking (build_sequences in model.py)
The normalized (N, 20484) matrix is chunked into non-overlapping temporal windows before being fed to the Transformer encoder.
With SEQ_CHUNK = 5, every five consecutive timestep rows are stacked into a single sequence tensor. The output shape is (B, 5, 20484), where B = ⌊N / 5⌋. Any trailing rows that do not fill a complete window are silently dropped. This tensor is passed directly to the Transformer encoder’s forward() method, which projects the 20,484-dimensional frame vectors to the model width D_MODEL before applying positional encoding and multi-head self-attention.
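The windowing arithmetic can be sketched in NumPy. This is a minimal illustration, not the build_sequences source: the function name chunk_sequences is hypothetical, and a narrow 8-column array stands in for the 20,484 features.

```python
import numpy as np

SEQ_CHUNK = 5  # window length stated in the text


def chunk_sequences(X, seq_chunk=SEQ_CHUNK):
    """Stack consecutive rows into non-overlapping windows.

    Returns an array of shape (B, seq_chunk, n_features) with
    B = N // seq_chunk; trailing rows that do not fill a window are dropped.
    """
    n_windows = X.shape[0] // seq_chunk
    trimmed = X[: n_windows * seq_chunk]
    return trimmed.reshape(n_windows, seq_chunk, X.shape[1])


# 13 timesteps give 2 full windows; the last 3 rows are discarded.
X = np.arange(13 * 8, dtype=np.float32).reshape(13, 8)
seqs = chunk_sequences(X)
print(seqs.shape)
```

With fewer than five rows the result has zero windows, so very short videos contribute nothing downstream.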
The resulting embedding is then passed to derive_key() for HKDF-SHA256 key derivation.
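HKDF-SHA256 is specified in RFC 5869 as an extract step followed by an expand step. The sketch below implements that construction with the standard library only. It is not the derive_key() source: the function name hkdf_sha256, the example input bytes, and the info label are assumptions, and how the project converts an embedding to input bytes is not shown here.

```python
import hashlib
import hmac


def hkdf_sha256(ikm, salt=b"", info=b"", length=32):
    """HKDF (RFC 5869) with SHA-256: extract, then expand to length bytes."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")

    # Extract: an empty salt is replaced by hash_len zero bytes.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()

    # Expand: chain HMAC blocks T(1), T(2), ... until enough bytes exist.
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# ASSUMPTION: placeholder bytes stand in for whatever derive_key() consumes.
key = hkdf_sha256(b"example-embedding-bytes", info=b"neural-vault-demo")
print(len(key))
```

The same input bytes always yield the same key, so any quantization applied to the embedding before this step determines whether two sessions derive matching keys.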