The Large Movement Model — Technical Overview

1. Introduction

Human movement is one of the richest natural signals available. It encodes health status, task intent, emotional state, motor skill, and physical capability. Yet no unified AI framework has treated motion as a general-purpose sequence domain — one where the same foundational model can serve rehabilitation assessment, sports analytics, robotics control, and human–computer interaction.

The Large Movement Model (LMM) addresses this gap. It applies the same core insight that drives large language models — that sequence-learning architectures, given consistent tokenization and sufficient data, discover generalizable latent structure — to human pose sequences. Where language models learn grammar and semantics from words, LMM learns biomechanical structure, temporal coordination, and movement intent from the evolution of body-joint positions over time.

This document describes the LMM pipeline, model designs, and Phase I results. On motion forecasting, a diffusion-based generative model substantially outperforms a conventional transformer baseline, and a multi-resolution attention architecture reduces error at the longest forecast horizon. Together these results establish feasibility for a new class of embodied foundation models.

2. Data Pipeline

The pipeline transforms raw video into structured, normalized, machine-learning-ready motion tokens. It is designed to ingest heterogeneous sources — different cameras, frame rates, resolutions, skeleton formats, and recording conditions — and produce one unified representation.

2.1 Pose estimation

Video is processed through OpenPose to extract BODY_25 skeletal keypoints: 25 major joints with per-joint confidence. For datasets provided as pre-extracted skeletons (COCO-17, Kinect V2, Vicon), format converters map alternative schemas to BODY_25 through documented correspondence tables, synthesizing derived joints (neck, mid-hip) where needed. BODY_25 balances representational fidelity with efficiency at roughly 1/100th the cost of mesh-based representations like SMPL.
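
As an illustration of what such a converter might look like, the sketch below maps COCO-17 keypoints onto the BODY_25 layout and synthesizes the neck and mid-hip joints as midpoints. The joint orderings are the standard COCO-17 and OpenPose BODY_25 orderings; the function name, the confidence rule for derived joints, and the zero-filling of the foot keypoints are assumptions for this example, not the project's documented correspondence tables.

```python
import numpy as np

# BODY_25 index -> COCO-17 index for joints present in both schemas.
# COCO-17 order: nose, L/R eye, L/R ear, L/R shoulder, L/R elbow,
# L/R wrist, L/R hip, L/R knee, L/R ankle.
BODY25_FROM_COCO17 = {
    0: 0,                     # nose
    2: 6, 3: 8, 4: 10,        # right shoulder, elbow, wrist
    5: 5, 6: 7, 7: 9,         # left shoulder, elbow, wrist
    9: 12, 10: 14, 11: 16,    # right hip, knee, ankle
    12: 11, 13: 13, 14: 15,   # left hip, knee, ankle
    15: 2, 16: 1,             # right eye, left eye
    17: 4, 18: 3,             # right ear, left ear
}

def coco17_to_body25(coco):
    """Convert (T, 17, 3) COCO-17 keypoints [x, y, conf] to (T, 25, 3) BODY_25."""
    coco = np.asarray(coco, dtype=float)
    out = np.zeros((coco.shape[0], 25, 3))
    for b25, c17 in BODY25_FROM_COCO17.items():
        out[:, b25] = coco[:, c17]
    # Synthesized joints: neck (1) = shoulder midpoint, mid-hip (8) = hip midpoint.
    for target, (a, b) in {1: (5, 6), 8: (11, 12)}.items():
        out[:, target, :2] = 0.5 * (coco[:, a, :2] + coco[:, b, :2])
        # Assumption: a derived joint is only as reliable as its weaker parent.
        out[:, target, 2] = np.minimum(coco[:, a, 2], coco[:, b, 2])
    # BODY_25 foot keypoints (19-24) have no COCO-17 source; they stay at
    # zero with zero confidence so that quality control can flag them.
    return out
```

Leaving the missing foot joints at zero confidence, rather than guessing positions, keeps the conversion honest about what the source skeleton actually contained.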

2.2 Normalization

Translation invariance: every skeleton is centered on the hip midpoint.
Scale invariance: skeletons are scaled to a standardized torso length.
Temporal uniformity: sequences are resampled to 15 FPS regardless of source frame rate.
Derived features: per-joint velocity, acceleration, and jerk are computed alongside positions.
Quality control: each clip receives QC metadata — per-joint confidence statistics, detected artifacts, and a usability classification.
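
The normalization steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: torso length is taken as the median neck-to-mid-hip distance, resampling is linear interpolation, and derivatives are finite differences. The function name and these specific choices are not taken from the LMM codebase.

```python
import numpy as np

NECK, MID_HIP = 1, 8      # BODY_25 joint indices
TARGET_FPS = 15.0

def normalize_clip(xy, src_fps, eps=1e-8):
    """xy: (T, 25, 2) joint positions. Returns (T', 25, 2) arrays at 15 FPS."""
    xy = np.asarray(xy, dtype=float)
    # Translation invariance: centre every frame on the hip midpoint.
    xy = xy - xy[:, MID_HIP:MID_HIP + 1, :]
    # Scale invariance: divide by the clip's median neck-to-mid-hip distance
    # (an assumed definition of "torso length").
    torso = np.linalg.norm(xy[:, NECK] - xy[:, MID_HIP], axis=-1)
    xy = xy / max(float(np.median(torso)), eps)
    # Temporal uniformity: linear resampling onto a 15 FPS time grid.
    t_src = np.arange(xy.shape[0]) / src_fps
    t_dst = np.arange(0.0, t_src[-1] + 1e-9, 1.0 / TARGET_FPS)
    flat = xy.reshape(xy.shape[0], -1)
    pos = np.stack(
        [np.interp(t_dst, t_src, flat[:, k]) for k in range(flat.shape[1])],
        axis=1,
    ).reshape(len(t_dst), 25, 2)
    # Derived features: successive finite differences, in units per second.
    dt = 1.0 / TARGET_FPS
    vel = np.gradient(pos, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    jerk = np.gradient(acc, dt, axis=0)
    return {"position": pos, "velocity": vel, "acceleration": acc, "jerk": jerk}
```

Centering before scaling means a pure camera pan or a walking subject produces the same normalized pose sequence as a stationary one, which is the point of the translation-invariance step.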

2.3 Multi-resolution tokenization

Motion is structured into three simultaneous token views. All three reconstruct losslessly to the original — no information is discarded, only reorganized.

Token level	Shape	What it captures	Temporal resolution
Frame	(T, 75)	Full-body pose at each timestep	≈67 ms (1 frame at 15 FPS)
Window	(T/5, 5, 75)	Groups of 5 frames: movement phrases	≈333 ms (5 frames)
Body-part	(T, 5, 21)	5 anatomical regions: core, R/L arm, R/L leg	≈67 ms (spatial grouping)

The frame level gives the finest temporal grain. The window level captures meso-scale patterns (a step, an arm swing). The body-part level captures spatial coordination between regions. Together they give the model access to movement structure at multiple scales without committing to which scale matters most.
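
The three views can be produced from a single (T, 75) array by reshaping and index gathering. The sketch below shows the shapes from the table and checks that each view inverts back to the original frames. The joint-to-region assignment is hypothetical: the source gives only the shape (T, 5, 21), so the grouping here was chosen simply so that each region holds 7 joints × 3 values, with shared joints duplicated.

```python
import numpy as np

WINDOW = 5
# Hypothetical region -> BODY_25 joint indices (7 joints per region).
# Joints shared between regions (nose, neck, mid-hip, hips) are duplicated,
# which keeps the view invertible.
BODY_PARTS = {
    "core":      [0, 1, 8, 15, 16, 17, 18],
    "right_arm": [1, 2, 3, 4, 0, 8, 9],
    "left_arm":  [1, 5, 6, 7, 0, 8, 12],
    "right_leg": [8, 9, 10, 11, 22, 23, 24],
    "left_leg":  [8, 12, 13, 14, 19, 20, 21],
}
PART_INDEX = np.array(list(BODY_PARTS.values()))   # shape (5, 7)

def tokenize(frames):
    """frames: (T, 75) with T divisible by 5. Returns the three token views."""
    T = frames.shape[0]
    joints = frames.reshape(T, 25, 3)
    window = frames.reshape(T // WINDOW, WINDOW, 75)
    body_part = joints[:, PART_INDEX, :].reshape(T, 5, 21)
    return frames, window, body_part

def frames_from_body_parts(body_part):
    """Invert the body-part view back to (T, 75) frame tokens."""
    T = body_part.shape[0]
    joints = np.zeros((T, 25, 3))
    # Duplicated joints carry identical values, so overwrite order is irrelevant.
    joints[:, PART_INDEX.ravel(), :] = body_part.reshape(T, 35, 3)
    return joints.reshape(T, 75)
```

For a 120-frame context this yields window tokens of shape (24, 5, 75) and body-part tokens of shape (120, 5, 21), matching the encoder inputs described later.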

2.4 Pipeline performance

The pipeline has been validated on over 17,000 clips from five independent sources spanning four capture modalities: multi-camera studio video, clinical rehabilitation video, depth-sensor skeleton data, and optical motion capture. Heterogeneous inputs (60 FPS dance video, 30 FPS clinical footage, Kinect V2 depth skeletons, Vicon optical mocap) all converge to the same 15 FPS BODY_25 normalized representation with zero pipeline failures.

3. Model Architectures

Two architectures were developed and evaluated against a flat transformer baseline. All three models take 120 frames (8 seconds) of context and produce 30 predicted frames (2 seconds).

3.1 Flat transformer baseline

A standard encoder–decoder transformer on single-resolution frame tokens. Six encoder layers, six decoder layers, sinusoidal positional encodings based on physical time (seconds, not frame index). This is the controlled baseline.
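
A sinusoidal encoding indexed by physical time rather than frame number might look like the sketch below. The 256-dimensional width matches the embedding size listed in the architecture summary; the function name, the `max_period` constant, and the use of unscaled seconds as the argument are assumptions for illustration.

```python
import numpy as np

def time_positional_encoding(t_seconds, d_model=256, max_period=10000.0):
    """Sinusoidal encoding of timestamps given in seconds. Returns (T, d_model)."""
    t = np.asarray(t_seconds, dtype=float)[:, None]       # (T, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model / 2)
    freqs = 1.0 / (max_period ** (2 * i / d_model))       # geometric frequency ladder
    pe = np.zeros((t.shape[0], d_model))
    pe[:, 0::2] = np.sin(t * freqs)
    pe[:, 1::2] = np.cos(t * freqs)
    return pe

# The same physical instant gets the same code at any frame rate:
pe_15fps = time_positional_encoding(np.arange(120) / 15.0)
pe_30fps = time_positional_encoding(np.arange(240) / 30.0)
```

Here frame 15 of the 15 FPS sequence and frame 30 of the 30 FPS sequence both sit at t = 1.0 s and receive identical encodings, which a frame-index encoding would not provide.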

3.2 Hierarchical Temporal Transformer (HTT)

The HTT processes all three token levels simultaneously through three parallel encoder streams connected by bidirectional cross-attention:

Frame encoder (4 layers): fine-grained temporal dynamics, frame to frame.
Window encoder (3 layers): meso-scale patterns via a learned 1D convolution that compresses each 5-frame chunk — self-attention over 24 window tokens instead of 120 frame tokens.
Body-part encoder (3 layers): spatial coordination across 5 anatomical regions, with learned part-type embeddings.

            Input: 120 frames of context (8 seconds at 15 FPS)
                                  │
          ┌───────────────────────┼───────────────────────┐
          ▼                       ▼                       ▼
     Frame tokens           Window tokens          Body-part tokens
      (120, 75)              (24, 5, 75)             (120, 5, 21)
          │                       │                       │
      Frame enc              Window enc             Body-part enc
      (4 layers)             (3 layers)              (3 layers)
          │                       │                       │
          └───────────┬───────────┘                       │
               Cross-attn (2L)                            │
               frame ↔ window                             │
                      │                                   │
                      └─────────────────┬─────────────────┘
                                 Cross-attn (2L)
                                frame ↔ body-part
                                        │
                                   Gated fusion
                                        │
                                Decoder (4 layers)
                                        │
                                30 predicted frames
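
The window encoder's compression step can be pictured as a 1D convolution whose kernel size and stride both equal 5, so each 5-frame chunk becomes one token. The NumPy sketch below implements that operation directly; the output width of 256 and the absence of padding or a nonlinearity are assumptions, and the weights would be learned rather than random.

```python
import numpy as np

def window_compress(frame_tokens, kernel, bias):
    """Strided 1D convolution with kernel size = stride = 5.

    frame_tokens: (T, C_in); kernel: (5, C_in, C_out); bias: (C_out,).
    Returns (T // 5, C_out): one token per 5-frame chunk.
    """
    T, c_in = frame_tokens.shape
    chunks = frame_tokens.reshape(T // 5, 5, c_in)
    # Sum over the 5 frame positions (k) and input channels (c) of each chunk.
    return np.einsum("wkc,kcd->wd", chunks, kernel) + bias

rng = np.random.default_rng(0)
context = rng.standard_normal((120, 75))
kernel = 0.01 * rng.standard_normal((5, 75, 256))
window_tokens = window_compress(context, kernel, np.zeros(256))   # (24, 256)
```

With 24 tokens instead of 120, the number of pairwise attention scores per head falls from 120² = 14,400 to 24² = 576, a 25-fold reduction.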

A gated linear combination fuses the three encoder outputs, and a 4-layer decoder generates the forecast autoregressively. Scheduled sampling addresses autoregressive drift by gradually transitioning the decoder from ground-truth to self-predicted inputs during training (0% → 50% over 25 epochs).
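
The scheduled-sampling ramp can be written as a simple function of the epoch. The sketch below assumes a linear increase from 0% to 50% over 25 epochs, held constant afterwards; the source states the endpoints but not the ramp shape, and both function names are hypothetical.

```python
import random

def self_feed_probability(epoch, ramp_epochs=25, p_max=0.5):
    """Probability of feeding the decoder its own prediction (linear ramp assumed)."""
    return p_max * min(max(epoch, 0), ramp_epochs) / ramp_epochs

def choose_decoder_inputs(ground_truth, predictions, epoch, rng):
    """Per step, pick the model's own prediction with probability p, else ground truth."""
    p = self_feed_probability(epoch)
    return [pred if rng.random() < p else gt
            for gt, pred in zip(ground_truth, predictions)]

rng = random.Random(0)
inputs_epoch_0 = choose_decoder_inputs(["gt"] * 30, ["pred"] * 30, epoch=0, rng=rng)
inputs_epoch_25 = choose_decoder_inputs(["gt"] * 30, ["pred"] * 30, epoch=25, rng=rng)
```

At epoch 0 the decoder sees only ground truth (pure teacher forcing); by epoch 25 roughly half of its inputs are its own earlier outputs.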

3.3 Motion diffusion model

Instead of predicting frame by frame, the diffusion model generates the full 30-frame forecast in a single pass — iteratively denoising random noise into a plausible trajectory, conditioned on the context.

Context encoder: 6-layer transformer encoder producing a latent representation of the input.
Denoiser: 6-layer transformer encoder, cross-attending to context, predicting clean motion from noise.
Schedule: cosine noise schedule with T = 100 diffusion timesteps during training; 20-step DDIM sampling at inference.

Because diffusion generates all frames simultaneously, it does not suffer from compounding autoregressive error.
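
A minimal sketch of the sampling loop, assuming the widely used cosine schedule formula (offset s = 0.008) and deterministic DDIM updates for a denoiser that predicts the clean sample. The `denoise_x0` callable stands in for the context-conditioned transformer denoiser and is hypothetical; conditioning on the context is omitted for brevity.

```python
import numpy as np

def cosine_alpha_bar(T=100, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..T under a cosine schedule."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def ddim_sample(denoise_x0, shape, alpha_bar, n_steps=20, rng=None):
    """Deterministic DDIM sampling with an x0-predicting denoiser."""
    if rng is None:
        rng = np.random.default_rng()
    T = len(alpha_bar) - 1
    # 20 evenly spaced jumps from t = T down to t = 0.
    timesteps = np.linspace(T, 0, n_steps + 1).round().astype(int)
    x = rng.standard_normal(shape)            # start from pure noise
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        x0_hat = denoise_x0(x, t)             # predicted clean motion
        # Noise implied by the current sample and the clean prediction.
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Re-noise the prediction to the lower noise level t_prev.
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x

alpha_bar = cosine_alpha_bar()
target = np.random.default_rng(1).standard_normal((30, 75))
# Sanity check with an oracle denoiser that always returns the true motion.
sample = ddim_sample(lambda x, t: target, (30, 75), alpha_bar,
                     rng=np.random.default_rng(0))
```

Every update acts on the whole (30, 75) block at once, so no frame is ever conditioned on an earlier predicted frame; that is the structural reason there is no autoregressive feedback path.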

4. Experiments & Results

4.1 Setup

Models were trained on 12,483 clips (~2.5M frames) from a multi-camera dance corpus spanning 10 genres, 30 dancers, and 9 synchronized 60 FPS camera angles. Identical hyperparameters throughout: AdamW, lr 1e-4, batch 32, early stopping (patience 10), 90/10 split by clip. Task: 120 frames of context → predict 30 frames. HTT and baseline evaluated autoregressively; diffusion as the sample mean over 10 DDIM trajectories.
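
For concreteness, a clip-level split and window extraction could be organized as below. Splitting by clip ensures that windows cut from one clip never appear in both the training and validation sets. The window stride, random seed, and function names are hypothetical; the source specifies only the 120 + 30 frame window layout and the 90/10 split by clip.

```python
import random

CONTEXT, HORIZON = 120, 30

def split_by_clip(clip_ids, val_fraction=0.1, seed=0):
    """Shuffle clip IDs and hold out a fraction of whole clips for validation."""
    ids = sorted(clip_ids)
    random.Random(seed).shuffle(ids)
    n_val = round(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]          # (train, validation)

def window_starts(n_frames, stride=30):
    """Start indices of 150-frame (context + target) windows in one clip.

    The stride is a hypothetical choice, not taken from the source.
    """
    last = n_frames - (CONTEXT + HORIZON)
    return list(range(0, last + 1, stride)) if last >= 0 else []

train_ids, val_ids = split_by_clip(range(12483))
starts = window_starts(210)    # a 14-second clip at 15 FPS
```

Clips shorter than 150 frames at 15 FPS (10 seconds) yield no windows under this scheme.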

4.2 Results

Model	Params	Frame 1 (0.07s)	Frame 15 (1.0s)	Frame 30 (2.0s)	Overall
Flat baseline	11.1M	0.087	0.654	0.900	0.637
HTT	18.8M	0.126	0.646	0.785	0.658
Diffusion	11.4M	0.137	0.443	0.621	0.445

Table 1. Motion-forecasting MSE (lower is better). 12,483 clips, 53,903 training windows.

4.3 Analysis

Autoregressive drift is the central challenge. The flat baseline predicts the next frame well (0.087) but degrades to 0.900 by frame 30 — a 10× increase. Even with a large, diverse corpus, single-resolution attention cannot maintain multi-second coherence.

Multi-resolution attention helps most at the longest horizons. The HTT reduces frame-30 error by 13% (0.900 → 0.785), concentrated at the 1.5–2 second horizon where cross-region coherence matters most. At short horizons the simpler baseline is ahead: its frame-1 error is lower (0.087 vs 0.126), and its overall MSE is slightly lower as well (0.637 vs 0.658).

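Diffusion substantially outperforms both deterministic models beyond the first frames. Overall MSE is 30% lower than the baseline's (0.637 → 0.445), and frame-30 error (0.621) is 31% lower than the baseline's (0.900) and 21% lower than the HTT's (0.785). The exception is frame 1, where diffusion has the highest error of the three (0.137). Generating all 30 frames at once avoids the compounding error that is the dominant failure mode of autoregressive architectures.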
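
The relative reductions and drift ratios can be recomputed directly from Table 1. The short script below does only that arithmetic; the values are copied from the table and the helper names are illustrative.

```python
# MSE values copied from Table 1.
MSE = {
    "baseline":  {"frame1": 0.087, "frame15": 0.654, "frame30": 0.900, "overall": 0.637},
    "htt":       {"frame1": 0.126, "frame15": 0.646, "frame30": 0.785, "overall": 0.658},
    "diffusion": {"frame1": 0.137, "frame15": 0.443, "frame30": 0.621, "overall": 0.445},
}

def reduction_pct(reference, model, column):
    """Percentage by which `model` lowers the error of `reference` in one column."""
    ref, val = MSE[reference][column], MSE[model][column]
    return 100.0 * (ref - val) / ref

def drift_ratio(model):
    """Frame-30 error as a multiple of frame-1 error."""
    return MSE[model]["frame30"] / MSE[model]["frame1"]

for ref, model, col in [("baseline", "htt", "frame30"),
                        ("baseline", "diffusion", "overall"),
                        ("baseline", "diffusion", "frame30"),
                        ("htt", "diffusion", "frame30")]:
    print(f"{model} vs {ref} ({col}): {reduction_pct(ref, model, col):.1f}% lower")
for model in MSE:
    print(f"{model}: frame-30 error is {drift_ratio(model):.1f}x frame-1 error")
```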

A note on comparison fairness. The HTT's MSE comes from a single deterministic rollout; the diffusion model's is the mean of 10 samples, which reduces variance. The comparison is directionally valid, but the magnitude of improvement should be read with this methodological difference in mind.
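
The size of the variance-reduction effect can be seen in a toy simulation, unrelated to the real models. It assumes the diffusion score is the MSE of the trajectory obtained by averaging 10 samples, and it models a stochastic forecaster as a fixed bias plus unit-variance sampling noise; both are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_windows, n_samples = 2000, 10
target = np.zeros(n_windows)

# Toy stochastic forecaster: bias of 0.3 plus unit-variance sampling noise.
samples = 0.3 + rng.standard_normal((n_samples, n_windows))

# Scoring one rollout per window versus scoring the mean of 10 rollouts.
mse_single = float(np.mean((samples[0] - target) ** 2))
mse_of_mean = float(np.mean((samples.mean(axis=0) - target) ** 2))

print(f"single sample: {mse_single:.3f}   mean of 10: {mse_of_mean:.3f}")
```

Averaging K samples leaves the squared bias untouched (0.09 here) and divides the sampling variance by K, so in this toy setting most of the gap between the two scores comes from the averaging rather than from a better forecaster.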

Beyond the training horizon, the difference becomes decisive. The 2-second forecast is only the start of the story. Extend it — roll a model forward on its own output — and the autoregressive HTT collapses toward a static pose: per-frame motion decays to near-zero within about a second and never recovers, the classic regression to the mean. Short-horizon error hides this, because predicting "hold still" scores acceptably on MSE while being dynamically inert. The diffusion model, generating each block in a single pass with no autoregressive feedback, keeps producing coherent, in-range motion far past the training window — sustaining roughly an order of magnitude more per-frame movement across 10+ seconds of rollout.

Figure. A 12-second forecast rolled from an identical 8-second context. The autoregressive HTT collapses to a near-static pose; the diffusion model sustains motion across the full horizon.

That makes the diffusion model the usable forecaster. A deterministic rollout gives a sharp next frame and a serviceable first second, but the forecast cannot be extended; it coasts to a near-static pose and stays there. The diffusion model produces motion that stays alive across and well beyond the 2-second horizon — which is what any downstream use of a forecast actually requires. It is the architecture we build forward on.
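
One simple way to quantify "per-frame movement" in a rollout is the mean joint displacement between consecutive frames. The sketch below defines that metric and applies it to two synthetic sequences, one sustained and one exponentially decaying; the metric definition and the toy signals are assumptions for illustration, not the measurement used to produce the results above.

```python
import numpy as np

def per_frame_motion(seq):
    """seq: (T, J, D). Mean joint displacement between consecutive frames, shape (T-1,)."""
    return np.linalg.norm(np.diff(seq, axis=0), axis=-1).mean(axis=-1)

# Toy 12-second rollouts at 15 FPS for 25 joints in 2D.
t = np.arange(180) / 15.0
sustained = np.sin(2 * np.pi * t)[:, None, None] * np.ones((1, 25, 2))
collapsed = sustained * np.exp(-3.0 * t)[:, None, None]   # amplitude decays toward zero

late_sustained = per_frame_motion(sustained)[-15:].mean()  # last second of rollout
late_collapsed = per_frame_motion(collapsed)[-15:].mean()
```

A collapsed rollout can still score acceptably on position MSE while this motion measure falls to nearly zero, which is why the two should be reported together.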

What is the frozen pose hiding? The autoregressive forecast is not motionless — it is amplitude-collapsed. Scaling its output by a constant factor — a linear transform that changes each motion's magnitude but not its direction — reveals smooth, coherent movement beneath the stillness. The limitation is one of magnitude, not structure: minimizing squared error against an uncertain future rewards hedging toward the average pose, so the motion is predicted but shrunk. This is a diagnostic, not a fix — amplification makes the structure visible, it does not make the forecast more accurate.

Figure. The autoregressive forecast (left) and the same output linearly amplified ×6 (right), drawn at one shared scale. Amplification scales each motion's magnitude while preserving its direction, showing that the freeze is collapsed amplitude, not absent motion. This characterizes the model's behavior; it is not a more accurate forecast.
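
The amplification diagnostic amounts to scaling each pose's displacement from a fixed reference pose by a constant gain. The sketch below assumes the reference is the last context frame, which the source does not specify; the function name is hypothetical.

```python
import numpy as np

def amplify(forecast, anchor, gain=6.0):
    """Scale each frame's displacement from the anchor pose by a constant gain.

    forecast: (T, J, D) predicted poses; anchor: (J, D) reference pose
    (assumed here to be the last context frame).
    """
    return anchor + gain * (forecast - anchor)

rng = np.random.default_rng(0)
anchor = rng.random((25, 2))
forecast = anchor + 0.01 * rng.standard_normal((30, 25, 2))   # small-amplitude motion
amplified = amplify(forecast, anchor)
```

Because the transform is linear, every frame-to-frame displacement is multiplied by the same factor and keeps its direction; no new motion structure is introduced.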

5. What the Model Learns

Temporal coordination across timescales. The window encoder captures meso-scale patterns the frame encoder misses — stride cadence, the arc-and-return of reaching, weight-transfer timing. Cross-attention constrains fine-grained predictions within these broader patterns, which is what reduces drift.

Spatial coordination across body regions. The body-part encoder learns how regions move relative to each other — contralateral arm–leg coordination in gait, trunk stabilization during limb movement, bilateral symmetry. This is exactly the structure needed to detect compensatory patterns.

Biomechanical plausibility. Predicted sequences generally respect physical constraints — joints stay in plausible ranges, limb lengths stay approximately constant, center-of-mass trajectories stay continuous — and these emerge from the data distribution without explicit loss terms.
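
One of these constraints, approximate limb-length constancy, is easy to check on a predicted sequence. The sketch below measures how much selected bone lengths vary over time, using standard BODY_25 joint indices. The bone selection and the coefficient-of-variation measure are assumptions for illustration; with 2D poses some variation is expected from foreshortening alone.

```python
import numpy as np

# Selected BODY_25 bones as (joint_a, joint_b) index pairs.
BONES = {
    "r_upper_arm": (2, 3),  "r_forearm": (3, 4),
    "l_upper_arm": (5, 6),  "l_forearm": (6, 7),
    "r_thigh": (9, 10),     "r_shank": (10, 11),
    "l_thigh": (12, 13),    "l_shank": (13, 14),
}

def limb_length_variation(xy):
    """xy: (T, 25, 2). Coefficient of variation of each bone length over time."""
    out = {}
    for name, (a, b) in BONES.items():
        length = np.linalg.norm(xy[:, a] - xy[:, b], axis=-1)
        out[name] = float(length.std() / (length.mean() + 1e-8))
    return out

rng = np.random.default_rng(0)
pose = rng.random((25, 2))
rigid = pose[None] + np.linspace(0, 1, 30)[:, None, None]        # pure translation
stretched = pose[None] * np.linspace(1.0, 2.0, 30)[:, None, None]  # implausible growth
```

A rigidly translating skeleton scores near zero on every bone, while a skeleton that stretches over the sequence is flagged immediately.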

6. Cross-Domain Transfer

The pipeline and architectures are domain-agnostic by design — the skeleton, normalization, and tokenization make no assumption about what kind of movement is analyzed. But does the model actually transfer, or just memorize the training distribution?

6.1 Held-out generalization test

We evaluated trained models against a held-out corpus from a fundamentally different capture modality: optical motion capture of physical-therapy exercises (Vicon, 10 exercises, 10 subjects). None of the models saw this data in training, and its characteristics differ markedly from video-derived poses: no camera perspective, no detection jitter, no confidence variation.

Key finding: motion dynamics transfer across domains. Per-frame velocity MSE — whether the model predicts the direction and speed of joint movement correctly — was essentially identical across all tested models (0.011–0.014), regardless of whether they trained on dance video, clinical footage, or a multi-source combination. Absolute position errors varied by domain (coordinate systems and projection geometry differ), but the underlying motion structure transferred.
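
The contrast between position error and velocity error can be made concrete. In the sketch below, velocity is taken as the per-frame difference of joint positions, which is an assumed definition of the metric; a prediction that differs from the truth only by a constant coordinate offset then has a large position MSE but zero velocity MSE.

```python
import numpy as np

def position_mse(pred, true):
    """Mean squared error on joint positions."""
    return float(np.mean((pred - true) ** 2))

def velocity_mse(pred, true):
    """Mean squared error on per-frame joint displacements (finite differences)."""
    return float(np.mean((np.diff(pred, axis=0) - np.diff(true, axis=0)) ** 2))

rng = np.random.default_rng(0)
true = np.cumsum(0.05 * rng.standard_normal((30, 25, 2)), axis=0)
shifted = true + 0.5    # same motion, different coordinate origin

pos_err = position_mse(shifted, true)
vel_err = velocity_mse(shifted, true)
```

This is one reason a velocity-based metric is the more natural test of whether motion dynamics transfer: differences in coordinate origin between domains cancel out of it. Differences in scale or projection geometry would still affect it.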

6.2 Scheduled sampling as a transfer mechanism

An unexpected result: scheduled sampling — introduced to reduce autoregressive drift — turns out to be a powerful cross-domain transfer mechanism. Models trained with it showed 54% lower teacher-forced error on held-out data than identical models trained without it. The interpretation: scheduled sampling forces the model to predict from imperfect inputs, and cross-domain inputs are inherently imperfect; models that learn to handle imperfect inputs generalize better.

7. Current Limitations

2D projection. Phase I uses 2D pose estimation; depth-dependent movements are geometrically compressed. 3D pose or multi-view triangulation would improve coverage.
Training-data diversity. The primary corpus is 10 dance genres — fast, diverse, full-body, but not specifically clinical. Domain-specific fine-tuning on rehabilitation exercises is a Phase II objective.
Clinical alignment not yet validated. Engineering metrics are strong; correlation with expert-assigned movement-quality ratings has not yet been formally evaluated.
Autoregressive forecasts do not extend. Deterministic models predict a sharp next frame, but their motion decays toward a static pose within about a second, and rolling them forward only deepens the collapse — frame-30 error remains ~6–10× frame-1 error. This is an architectural property of autoregressive generation, not a tuning problem. The diffusion model avoids it by generating each forecast block in a single pass.
Evaluation methodology. HTT (single rollout) and diffusion (mean of 10) are not evaluated on identical terms; diffusion's advantage includes a variance-reduction benefit.

8. Architecture Summary

Component	HTT	Diffusion
Parameters	18.8M	11.4M
Encoder	3-stream: frame (4L), window (3L), body-part (3L)	Context encoder (6L)
Fusion	Bidirectional cross-attn (4L) + gated combination	Cross-attn in denoiser
Decoder / generator	4-layer autoregressive decoder	6-layer denoiser, DDIM sampling
Embedding	256-d, 8 heads, FFN 1024	256-d, 8 heads, FFN 1024
Output	30 frames (2s), sequential	30 frames (2s), single pass
Inference	Deterministic; short-horizon (motion decays when rolled)	Stochastic, single-pass; sustains long-horizon motion

9. Conclusion

The Large Movement Model demonstrates that human motion can be treated as a structured sequence domain analogous to text, and that both multi-resolution attention and diffusion-based generation learn meaningful motion structure from video-derived pose data. Trained on diverse full-body movement, the diffusion model reduces overall prediction error by 30% versus a standard transformer baseline, with a 31% improvement at the clinically relevant 2-second horizon — and where the autoregressive models freeze into a static pose past that horizon, the diffusion model keeps generating coherent motion, making it the practical basis for sustained forecasting.

Cross-domain evaluation on held-out data from a different capture modality confirms that the learned dynamics transfer — velocity prediction quality is essentially invariant to training corpus, even on domains the model never saw. These results establish feasibility for a new class of domain-agnostic motion foundation models — systems that learn movement structure transferable across capture modalities and movement types.
