Unitree G1: Voice-Driven Arbitrary Action Generation in Real Time

Unitree Robotics has published a new demo showing the G1 humanoid robot performing a range of physical actions generated in real time from external voice commands.

The video was recorded in a single take with on-site audio — no post-production edits, no pre-programmed sequences. An operator issues voice commands, and the G1’s AI pipeline translates them into motion on the fly. Unitree notes that because actions are generated autonomously in real time, some latency and reduced movement smoothness are to be expected.

This is a meaningful step beyond the company’s earlier demos, which focused primarily on imitation learning (watching a human perform an action and replicating it) or reinforcement-learned acrobatics like backflips and breakdancing. Voice-driven arbitrary action generation implies a different architecture: the system must interpret natural language, map it to a motor plan, and execute — all within a feedback loop fast enough to look continuous.
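Unitree has not published the architecture behind the demo, so the following is only a minimal sketch of the interpret, plan, execute loop described above. Every name in it (`MOTION_LIBRARY`, `interpret`, `plan`, `execute`, `handle_command`, the joint names, and the gain) is hypothetical. A keyword table stands in for what would presumably be a learned language-to-motion model, and a simple proportional update stands in for a real motor controller.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class JointTarget:
    joint: str        # hypothetical joint name, not taken from any Unitree SDK
    angle_rad: float  # desired joint angle in radians


# Assumption: a tiny lookup table replaces the learned model for illustration.
MOTION_LIBRARY: Dict[str, List[JointTarget]] = {
    "wave": [JointTarget("right_shoulder_pitch", -1.2),
             JointTarget("right_elbow", 0.8)],
    "squat": [JointTarget("left_knee", 1.0),
              JointTarget("right_knee", 1.0)],
}


def interpret(command: str) -> Optional[str]:
    """Stage 1: reduce a transcribed utterance to an action label."""
    words = re.findall(r"[a-z]+", command.lower())
    for label in MOTION_LIBRARY:
        if label in words:
            return label
    return None


def plan(label: str) -> List[JointTarget]:
    """Stage 2: map the action label to a motor plan (a list of joint targets)."""
    return list(MOTION_LIBRARY[label])


def execute(targets: List[JointTarget], state: Dict[str, float],
            steps: int = 50, gain: float = 0.2) -> Dict[str, float]:
    """Stage 3: move each joint a fraction of the remaining error per control tick."""
    for _ in range(steps):
        for target in targets:
            current = state.get(target.joint, 0.0)
            state[target.joint] = current + gain * (target.angle_rad - current)
    return state


def handle_command(command: str, state: Dict[str, float]) -> bool:
    """Run one utterance through all three stages; return False if not understood."""
    label = interpret(command)
    if label is None:
        return False
    execute(plan(label), state)
    return True
```

Calling `handle_command("Please wave at the camera", state)` with an empty `state` dictionary drives the two hypothetical arm joints toward their targets. In a real system each stage adds delay, which is consistent with the latency Unitree says to expect.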

The timing aligns with Unitree’s broader push toward language-model integration. In March 2026, the company open-sourced UnifoLM-VLA-0, a vision-language-action model built on Qwen2.5-VL-7B that gives the G1 a deployable manipulation baseline across 12 task categories. Firmware v3.2 and later also adds preliminary LLM support on the EDU model’s Jetson Orin.

For context: the G1 stands 1.32 m tall, weighs 35 kg, and offers up to 43 degrees of freedom in its top configuration. Unitree shipped over 5,500 humanoid units in 2025 — more than all US competitors combined — and is targeting 10,000–20,000 in 2026, with an A-share IPO on track for mid-year at a ~$580M valuation.