
Before a visual feature can contribute to the EKF update — whether as a short-lived MSCKF feature or a persistent SLAM landmark — its 3D position must be estimated from a set of 2D bearing observations across multiple camera frames. This process, called feature initialization or triangulation, establishes the linearization point at which all subsequent Jacobians for that feature will be evaluated. A poor linearization point produces biased Jacobians that degrade filter consistency. OpenVINS therefore invests in a careful two-stage pipeline: a fast linear least-squares for an initial estimate, followed by a nonlinear Gauss-Newton refinement in inverse-depth parameterization.

Why Feature Initialization Matters

EKF linearization of the camera measurement model requires evaluating $\frac{\partial\mathbf{h}}{\partial{}^G\mathbf{p}_f}$ at a fixed feature estimate $\hat{\mathbf{p}}_f$. When First-Estimate Jacobians (FEJ) are enabled, this estimate is frozen at initialization time and never updated. A poor triangulation therefore permanently degrades the Jacobians for the entire lifetime of the feature. Conversely, a good initialization with sufficient baseline and view diversity leads to accurate, consistent updates.
OpenVINS requires a minimum of two views to triangulate a feature and recommends at least five views for robust performance. The linear system’s condition number is checked to reject geometrically degenerate configurations (e.g., near-parallel viewing rays).

Feature Representations

OpenVINS supports six distinct parameterizations of a 3D point feature, selectable at runtime. Each trades off between singularity robustness, state dimension, and Jacobian complexity.
| Representation | Parameters $\boldsymbol{\lambda}$ | Dim | In State? | Notes |
|---|---|---|---|---|
| Global XYZ | $[{}^Gx,\ {}^Gy,\ {}^Gz]^\top$ | 3 | Yes (SLAM) | Canonical; full ${}^G\mathbf{p}_f$ in global frame |
| Global Inverse Depth | $[\theta,\ \phi,\ \rho]^\top$ | 3 | Yes (SLAM) | Spherical coordinates; singularity when $z \to 0$ |
| Anchored XYZ | $[{}^{C_a}x,\ {}^{C_a}y,\ {}^{C_a}z]^\top$ | 3 | Yes (SLAM) | Position in anchor camera frame $\{C_a\}$ |
| Anchored Inverse Depth | $[\theta,\ \phi,\ \rho]^\top$ | 3 | Yes (SLAM) | Spherical coords in anchor frame; no $z \to 0$ singularity |
| Anchored MSCKF Inv. Depth | $[\alpha,\ \beta,\ \rho]^\top$ | 3 | No (MSCKF) | Planar bearing $[\alpha,\beta,1]^\top/\rho$; default MSCKF rep. |
| Anchored Single Inv. Depth | $[\rho]$ | 1 | Yes (SLAM, 1-DoF) | Bearing fixed at init; only depth in state |
Global XYZ simply stores $\boldsymbol{\lambda} = {}^G\mathbf{p}_f$ with identity Jacobian $\frac{\partial\mathbf{f}}{\partial\boldsymbol{\lambda}} = \mathbf{I}_{3\times3}$.

Global Inverse Depth uses spherical coordinates:

$${}^G\mathbf{p}_f = \frac{1}{\rho}\begin{bmatrix}\cos\theta\sin\phi \\ \sin\theta\sin\phi \\ \cos\phi\end{bmatrix}$$

where $\rho = 1/\|{}^G\mathbf{p}_f\|$ is the inverse distance. This avoids the near-linear degeneracy of XYZ for far-away features, but has a singularity when the global $z$-coordinate approaches zero.

Anchored representations express the feature relative to an “anchor” camera pose $\{C_a\}$, typically the first frame in which the feature was detected. This prevents the global $z$-singularity and is preferred in practice. The global position is recovered via:

$${}^G\mathbf{p}_f = {}^{I_a}_G\mathbf{R}^\top\,{}^C_I\mathbf{R}^\top(\boldsymbol{\lambda}' - {}^C\mathbf{p}_I) + {}^G\mathbf{p}_{I_a}$$

where $\boldsymbol{\lambda}'$ is the feature expressed in the anchor camera frame.

Anchored MSCKF Inverse Depth ($[\alpha,\beta,\rho]$) is the representation from the original MSCKF paper. The feature position in the anchor frame is the bearing $[\alpha,\ \beta,\ 1]^\top$ scaled by the depth, i.e. $[{\alpha}/{\rho},\ {\beta}/{\rho},\ {1}/{\rho}]^\top$. This is the default MSCKF representation and is numerically well-conditioned as long as features are observed from the camera frame they were anchored in (depth always positive).

Anchored Single Inverse Depth keeps only the scalar depth $\rho$ in the state after marginalizing the bearing at initialization time. This is useful for applications where reducing state dimensionality is critical.
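To make the anchored-to-global conversion concrete, here is a minimal Eigen sketch that applies the recovery formula above to the Anchored MSCKF Inverse Depth parameters. It is not the OpenVINS API; the function and argument names are illustrative, and the rotation conventions follow the notation used on this page.

```cpp
#include <Eigen/Dense>

// Sketch only (not the OpenVINS API): recover the global position of a feature
// stored as Anchored MSCKF Inverse Depth [alpha, beta, rho] in anchor camera {C_a}.
Eigen::Vector3d anchored_invdepth_to_global(
    const Eigen::Vector3d &lambda,    // [alpha, beta, rho]
    const Eigen::Matrix3d &R_GtoIa,   // {}^{I_a}_G R : global -> anchor IMU
    const Eigen::Vector3d &p_IainG,   // {}^G p_{I_a} : anchor IMU position in global
    const Eigen::Matrix3d &R_ItoC,    // {}^C_I R : IMU -> camera extrinsic rotation
    const Eigen::Vector3d &p_IinC) {  // {}^C p_I : IMU position in the camera frame
  const double alpha = lambda(0), beta = lambda(1), rho = lambda(2);
  // Feature in the anchor camera frame: bearing [alpha, beta, 1] scaled by depth 1/rho
  Eigen::Vector3d p_FinCa(alpha / rho, beta / rho, 1.0 / rho);
  // {}^G p_f = {}^{I_a}_G R^T {}^C_I R^T (lambda' - {}^C p_I) + {}^G p_{I_a}
  return R_GtoIa.transpose() * R_ItoC.transpose() * (p_FinCa - p_IinC) + p_IainG;
}
```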

Triangulation Pipeline

1

Select anchor frame

Choose an anchor camera pose $\{A\}$, typically the earliest camera frame that observed the feature. All subsequent computation is done relative to this frame.
2

Build bearing vectors

For each observing camera pose $C_i$ ($i = 1\ldots m$), compute the normalized image bearing in the anchor frame:

$${}^A\mathbf{b}_{C_i\to f} = {}^{C_i}_A\mathbf{R}^\top\begin{bmatrix}u_{n,i} \\ v_{n,i} \\ 1\end{bmatrix}$$

where $(u_{n,i}, v_{n,i})$ are the undistorted normalized image coordinates in frame $C_i$.
3

Form linear system (3D triangulation)

For each frame, the depth along the bearing is an unknown. To eliminate it, pre-multiply by the skew-symmetric (cross-product) matrix of the bearing:

$${}^A\mathbf{N}_i = \lfloor{}^A\mathbf{b}_{C_i\to f}\times\rfloor$$

This gives two independent constraints per frame. Stacking all $m$ frames yields $\mathbf{A}\,{}^A\mathbf{p}_f = \mathbf{b}$, which is solved in the least-squares sense via the $3\times3$ normal equations:

$$\Bigl(\sum_i {}^A\mathbf{N}_i^\top{}^A\mathbf{N}_i\Bigr)\,{}^A\mathbf{p}_f = \sum_i {}^A\mathbf{N}_i^\top{}^A\mathbf{N}_i\,{}^A\mathbf{p}_{C_i}$$

Because the normal equations are always $3\times3$, the solve cost is fixed regardless of the number of observations $m$.
4

Validate the linear estimate

Check two conditions before proceeding to refinement:
  • Positive depth: the triangulated feature must lie in front of all observing cameras (${}^{C_i}z_f > 0$).
  • Condition number: the condition number of $\mathbf{A}^\top\mathbf{A}$ must be below a threshold. A large condition number indicates near-parallel rays (insufficient baseline) and an unreliable estimate.
Features failing either check are discarded.
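The sketch below illustrates steps 2–4 for observations already expressed in the anchor frame: it accumulates the $3\times3$ normal equations, solves them, and applies the condition-number and positive-depth checks. It is an illustration of the math above, not the OpenVINS implementation; the function name and threshold argument are assumptions.

```cpp
#include <Eigen/Dense>
#include <vector>

// Sketch only (not the OpenVINS implementation).
// bearings_inA[i] is {}^A b_{C_i -> f} and p_CinA[i] is {}^A p_{C_i}.
bool triangulate_linear(const std::vector<Eigen::Vector3d> &bearings_inA,
                        const std::vector<Eigen::Vector3d> &p_CinA,
                        double max_cond_number,
                        Eigen::Vector3d &p_FinA) {
  Eigen::Matrix3d AtA = Eigen::Matrix3d::Zero();
  Eigen::Vector3d Atb = Eigen::Vector3d::Zero();
  for (size_t i = 0; i < bearings_inA.size(); i++) {
    // N_i = skew({}^A b_{C_i -> f}) eliminates the unknown depth along the ray
    const Eigen::Vector3d &b = bearings_inA[i];
    Eigen::Matrix3d N;
    N << 0, -b.z(), b.y(),
         b.z(), 0, -b.x(),
         -b.y(), b.x(), 0;
    AtA += N.transpose() * N;
    Atb += N.transpose() * N * p_CinA[i];
  }
  // Solve the 3x3 normal equations for the feature position in the anchor frame
  p_FinA = AtA.ldlt().solve(Atb);
  // Reject degenerate geometry (near-parallel rays) via the condition number of A^T A
  Eigen::JacobiSVD<Eigen::Matrix3d> svd(AtA);
  double cond = svd.singularValues()(0) / svd.singularValues()(2);
  if (!p_FinA.allFinite() || cond > max_cond_number)
    return false;
  // Positive depth: the feature must lie in front of every observing camera,
  // i.e. the depth along each bearing ray must be positive
  for (size_t i = 0; i < bearings_inA.size(); i++) {
    if (bearings_inA[i].dot(p_FinA - p_CinA[i]) <= 0.0)
      return false;
  }
  return true;
}
```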
5

Gauss-Newton refinement in inverse depth

The linear estimate is refined using Gauss-Newton optimization in the inverse-depth parameterization $(u_A, v_A, \rho_A)$, where $u_A = {}^Ax_f/{}^Az_f$, $v_A = {}^Ay_f/{}^Az_f$, and $\rho_A = 1/{}^Az_f$.

For each observing frame, the predicted bearing is:

$$\mathbf{h}(u_A, v_A, \rho_A) = {}^{C_i}_A\mathbf{R}\begin{bmatrix}u_A \\ v_A \\ 1\end{bmatrix} - \rho_A\,{}^{C_i}_A\mathbf{R}\,{}^A\mathbf{p}_{C_i}$$

and the projected measurement is $\hat{\mathbf{z}} = [h_1/h_3,\ h_2/h_3]^\top$. The total least-squares cost is:

$$\min_{u_A,v_A,\rho_A}\sum_i\|\mathbf{z}_i - \hat{\mathbf{z}}_i(u_A,v_A,\rho_A)\|^2$$

The Jacobian is derived via the chain rule and the update step is applied iteratively. Convergence typically requires 2–3 Gauss-Newton iterations in indoor environments.
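A minimal Gauss-Newton sketch of this refinement is given below, assuming Eigen and illustrative names; it is not the OpenVINS code, and a production implementation would typically add damping and configurable convergence thresholds. Each iteration accumulates the per-observation $2\times3$ Jacobians and residuals and solves the resulting $3\times3$ normal equations for the update.

```cpp
#include <Eigen/Dense>
#include <vector>

// Sketch only: Gauss-Newton over x = [u_A, v_A, rho_A].
// R_AtoC[i] is {}^{C_i}_A R, p_CinA[i] is {}^A p_{C_i}, and uv[i] is the
// undistorted normalized measurement z_i observed in frame C_i.
Eigen::Vector3d refine_inverse_depth(Eigen::Vector3d x,
                                     const std::vector<Eigen::Matrix3d> &R_AtoC,
                                     const std::vector<Eigen::Vector3d> &p_CinA,
                                     const std::vector<Eigen::Vector2d> &uv,
                                     int max_iters = 5) {
  for (int iter = 0; iter < max_iters; iter++) {
    Eigen::Matrix3d H = Eigen::Matrix3d::Zero();  // sum_i J_i^T J_i
    Eigen::Vector3d g = Eigen::Vector3d::Zero();  // sum_i J_i^T r_i
    for (size_t i = 0; i < uv.size(); i++) {
      // h = {}^{C_i}_A R [u_A, v_A, 1]^T - rho_A {}^{C_i}_A R {}^A p_{C_i}
      Eigen::Vector3d h =
          R_AtoC[i] * Eigen::Vector3d(x(0), x(1), 1.0) - x(2) * R_AtoC[i] * p_CinA[i];
      Eigen::Vector2d zhat(h(0) / h(2), h(1) / h(2));
      Eigen::Vector2d r = uv[i] - zhat;  // residual z_i - zhat_i
      // Chain rule: d zhat / dh (2x3) times dh / dx (3x3)
      Eigen::Matrix<double, 2, 3> dz_dh;
      dz_dh << 1.0 / h(2), 0.0, -h(0) / (h(2) * h(2)),
               0.0, 1.0 / h(2), -h(1) / (h(2) * h(2));
      Eigen::Matrix3d dh_dx;
      dh_dx.col(0) = R_AtoC[i].col(0);        // dh / du_A
      dh_dx.col(1) = R_AtoC[i].col(1);        // dh / dv_A
      dh_dx.col(2) = -R_AtoC[i] * p_CinA[i];  // dh / drho_A
      Eigen::Matrix<double, 2, 3> J = dz_dh * dh_dx;
      H += J.transpose() * J;
      g += J.transpose() * r;
    }
    // Gauss-Newton step for the cost sum_i ||z_i - zhat_i||^2
    Eigen::Vector3d dx = H.ldlt().solve(g);
    x += dx;
    if (dx.norm() < 1e-8)
      break;
  }
  return x;
}
```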
6

Final depth check

After refinement, verify that the feature is still in front of all cameras and within a reasonable depth range. The refined $(u_A, v_A, \rho_A)$ is then converted back to the desired feature representation (Global XYZ, Anchored Inverse Depth, etc.) for use in the EKF.

1D Depth Triangulation (Single-Depth Variant)

For the Anchored Single Inverse Depth representation, the bearing in the anchor frame ${}^A\mathbf{b}_f = [u_{n,A},\ v_{n,A},\ 1]^\top$ is known from the anchor observation. Only the scalar depth ${}^Az_f$ is unknown. The same cross-product trick yields a scalar equation:

$$\Bigl(\sum_i ({}^A\mathbf{N}_i\,{}^A\mathbf{b}_f)^\top({}^A\mathbf{N}_i\,{}^A\mathbf{b}_f)\Bigr)\,{}^Az_f = \sum_i ({}^A\mathbf{N}_i\,{}^A\mathbf{b}_f)^\top\,{}^A\mathbf{N}_i\,{}^A\mathbf{p}_{C_i}$$

This reduces to a single scalar division, making it extremely fast. The full feature position is then ${}^A\mathbf{p}_f = {}^Az_f\,{}^A\mathbf{b}_f$.
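As a sketch (illustrative names, not the OpenVINS code), the scalar solve amounts to one accumulation loop followed by a single division:

```cpp
#include <Eigen/Dense>
#include <vector>

// Sketch of the single-depth triangulation. b_A is the known anchor bearing
// [u_nA, v_nA, 1]^T; bearings_inA[i] and p_CinA[i] are {}^A b_{C_i -> f} and
// {}^A p_{C_i} as in the 3D case.
Eigen::Vector3d triangulate_1d(const Eigen::Vector3d &b_A,
                               const std::vector<Eigen::Vector3d> &bearings_inA,
                               const std::vector<Eigen::Vector3d> &p_CinA) {
  double num = 0.0, den = 0.0;
  for (size_t i = 0; i < bearings_inA.size(); i++) {
    const Eigen::Vector3d &b = bearings_inA[i];
    Eigen::Matrix3d N;
    N << 0, -b.z(), b.y(),
         b.z(), 0, -b.x(),
         -b.y(), b.x(), 0;
    Eigen::Vector3d Nb = N * b_A;  // N_i {}^A b_f
    den += Nb.dot(Nb);             // (N_i b_f)^T (N_i b_f)
    num += Nb.dot(N * p_CinA[i]);  // (N_i b_f)^T N_i {}^A p_{C_i}
  }
  double z_A = num / den;          // scalar least-squares depth {}^A z_f
  return z_A * b_A;                // {}^A p_f = {}^A z_f {}^A b_f
}
```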

Why Inverse Depth for Refinement?

Expressing depth as its reciprocal $\rho = 1/z$ has favorable numerical properties:
  • Features at large distances have large $z$ and correspondingly large, poorly scaled Cartesian coordinates. In inverse depth, they cluster near $\rho \approx 0$ and the optimization landscape is well-conditioned.
  • The inverse-depth parameterization is closer to the natural “uncertainty space” of bearing-only sensors: uncertainty in bearing translates to roughly uniform uncertainty in $\rho$ (rather than a skewed distribution in $z$).
  • Gauss-Newton converges faster and with fewer iterations compared to direct Cartesian optimization.
