AVR: Active Vision-Driven Robotic Precision Manipulation with Viewpoint and Focal Length Optimization

1Tsinghua University 2National University of Singapore
3Shanghai Jiao Tong University 4The University of Hong Kong
*Equal contribution †Corresponding author

Overview

Overview of the AVR system.

Abstract

Robotic manipulation in dynamic environments presents challenges for precise control and adaptability. Traditional fixed-view camera systems struggle to adapt to changing viewpoints and scale variations, limiting perception accuracy and manipulation precision. To tackle these issues, we propose the Active Vision-driven Robotic (AVR) framework, which integrates dynamic viewpoint and zoom optimization to continuously center targets and maintain optimal scale. Using the RoboTwin platform with a real-time image processing plugin, AVR improves task success rates by 5%-16% on six manipulation tasks. Physical deployment on a dual-arm system demonstrates its effectiveness in collaborative tasks and achieves 36% precision in screwdriver insertion, outperforming baselines by over 25%. Experimental results confirm that AVR enhances environmental perception, end-effector repeatability (40% of trials with ≤1 cm error), and robustness in complex scenarios, paving the way for future robotic precision manipulation methods in the pursuit of human-level dexterity and precision.

System Architecture

The system has a 2-degree-of-freedom zoom camera capable of covering the entire workspace, while left and front cameras provide additional viewpoints. The system supports teleoperation via an ALOHA-based teaching pendant or a VR controller for intuitive manipulation. A minimal configuration sketch of this camera rig is given below.
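The following sketch describes the camera rig above as a simple configuration: one actuated 2-DOF (pan/tilt) zoom camera on top, plus static left and front cameras for extra viewpoints. All field names and numeric ranges here are illustrative assumptions, not the system's actual specification.

```python
# Hypothetical camera-rig configuration sketch; not the authors' released code.
from dataclasses import dataclass, field

@dataclass
class CameraSpec:
    name: str
    pan_tilt: bool = False            # True only for the actuated top camera
    zoom_range: tuple = (1.0, 1.0)    # optical zoom factor (min, max); assumed values

@dataclass
class AVRCameraRig:
    cameras: list = field(default_factory=lambda: [
        CameraSpec("top", pan_tilt=True, zoom_range=(1.0, 4.0)),  # assumed 4x zoom
        CameraSpec("left"),
        CameraSpec("front"),
    ])

rig = AVRCameraRig()
print([c.name for c in rig.cameras])  # ['top', 'left', 'front']
```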

System Architecture of AVR System.

Teleoperation with Dynamic Active Vision

The top camera enables 2D viewpoint adjustment and dynamic zoom. Users rotate the VR headset to center the target and adjust the zoom via keyboard or controller. Data-driven pipeline: the collected joint positions, image frames, and zoom levels are processed by a transformer-based model for action chunking and then deployed for manipulation.
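To make the pipeline concrete, here is a minimal sketch of how a per-timestep record could bundle the signals described above (joint positions, an image frame, and the zoom level) into one observation for a transformer-based action-chunking policy. The names (AVRStep, make_policy_input), the joint layout, and the extra pan/tilt field are assumptions for illustration, not the authors' data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AVRStep:
    joints: np.ndarray      # (14,) dual-arm joint positions; assumed layout
    image: np.ndarray       # (H, W, 3) RGB frame from the active top camera
    zoom_level: float       # normalized zoom / focal-length setting in [0, 1]
    pan_tilt: np.ndarray    # (2,) gimbal viewpoint angles; assumed extra field

def make_policy_input(step: AVRStep) -> dict:
    """Assemble one observation for an action-chunking policy.

    The zoom level and viewpoint angles are appended to proprioception so the
    policy can condition on the current camera state, not just the image.
    """
    proprio = np.concatenate([step.joints, step.pan_tilt, [step.zoom_level]])
    return {
        "image": step.image.astype(np.float32) / 255.0,  # simple normalization
        "proprio": proprio.astype(np.float32),
    }

# Example usage with dummy data
step = AVRStep(
    joints=np.zeros(14),
    image=np.zeros((480, 640, 3), dtype=np.uint8),
    zoom_level=0.3,
    pan_tilt=np.array([0.1, -0.05]),
)
obs = make_policy_input(step)
print(obs["proprio"].shape)  # (17,)
```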


RoboTwin-based Simulation

RoboTwin-based simulation setup and data processing flow.

We design a RoboTwin-based simulation environment to evaluate the performance of AVR on various manipulation tasks. The simulation setup uses a dynamic top camera. Data extraction and processing flow: the collected dataset includes image frames, arm joint angles, and end-effector poses. Viewpoint transformation is simulated via object detection with YOLOv8, followed by dynamic zoom; super-resolution reconstruction and dynamic crop-and-fill operations ensure pixel consistency with the original dataset, enabling simulation of focal-length variation. The processed data is then used for backend training.
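A minimal sketch of how such focal-length variation could be simulated on a fixed-view dataset, following the flow above: detect the target with YOLOv8, crop around it to emulate zooming in, and resize back to the original resolution so downstream pixel dimensions stay consistent. The bicubic upsampling here is only a stand-in for the super-resolution step, and the zoom factor and model weights are assumptions.

```python
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder weights; the paper's detector may differ

def simulate_zoom(frame: np.ndarray, zoom: float = 2.0) -> np.ndarray:
    """Crop around the detected target and upscale back to the original size."""
    h, w = frame.shape[:2]
    result = model(frame, verbose=False)[0]
    if len(result.boxes) == 0:
        return frame  # no detection: leave the frame unchanged
    # Center of the highest-confidence detection
    idx = int(result.boxes.conf.argmax())
    x1, y1, x2, y2 = result.boxes.xyxy[idx].tolist()
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Crop window emulating a longer focal length, clamped inside the image
    cw, ch = int(w / zoom), int(h / zoom)
    x0 = int(np.clip(cx - cw / 2, 0, w - cw))
    y0 = int(np.clip(cy - ch / 2, 0, h - ch))
    crop = frame[y0:y0 + ch, x0:x0 + cw]
    # Resize back to (w, h); bicubic interpolation stands in for super-resolution
    return cv2.resize(crop, (w, h), interpolation=cv2.INTER_CUBIC)
```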

Before/after comparison.

Our approach achieved a success rate improvement ranging from 5% to 16% across all five tasks. These results further validate the effectiveness of our method in various task scenarios.

Table showing the results of AVR experiments.

Experiments

We further design a series of experiments to evaluate the performance of AVR on various manipulation tasks across different scenarios: pick-place handover, folding clothes, wiping plates, stacking blocks, and inserting a screwdriver into a small hole. The results show that AVR significantly improves task success rates and precision in these scenarios, demonstrating the effectiveness of our system.

Screwdriver Insert


Pick-place Handover


Fold Clothes


Plate Scrub


Stack Blocks


Results

Table showing the results of AVR experiments.

"Dart Throwing Expert"

To evaluate the ability of our AVR system to learn stable, precise operations, we designed a "Dart Throwing Expert" experiment. A 0.5 mm ballpoint pen was mounted on the end effector of the robot gripper, and a target board composed of concentric circles with 1 cm spacing was mounted on the table. The final results showed an average score of 9.3, with more than 40% of the trials achieving a deviation of less than 1 cm.
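For illustration, a plausible scoring rule for such a board would decrease the score by one point per 1 cm ring outward from a 10-point center. The exact rule used in the experiment is not stated here, so the mapping below is an assumption.

```python
# Hypothetical scoring sketch for the "Dart Throwing Expert" board.
def dart_score(deviation_cm: float, max_score: int = 10) -> int:
    """Map the radial deviation of the pen tip (in cm) to a ring score."""
    ring = int(deviation_cm)          # assumed 1 cm ring spacing
    return max(max_score - ring, 0)   # 0 once outside the outermost ring

# Example: a 0.7 cm deviation lands inside the innermost ring
print(dart_score(0.7), dart_score(3.4))  # 10 6
```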


Conclusion and Future Work

We propose the AVR framework, which leverages dynamic viewpoint and focal length adjustments to enhance precise manipulation. Simulation and real-world experiments show a 5–16% improvement in task success rates and over 25% higher precision than conventional imitation learning. In a “dart-throwing” test (average score: 9.3), 40% of throws landed within 1 cm of the target, demonstrating AVR’s potential for high-precision robotic manipulation.
Future work includes refining ALOHA mode for more efficient data collection, enhancing viewpoint control with higher-precision gimbal motors and improved VR-to-camera mapping, and integrating new robotic platforms (e.g., wrist-mounted cameras, end-effector pose sensing) to further improve perception and adaptability in complex tasks.

BibTeX

@misc{liu2025avractivevisiondrivenrobotic,
      title={AVR: Active Vision-Driven Robotic Precision Manipulation with Viewpoint and Focal Length Optimization}, 
      author={Yushan Liu and Shilong Mu and Xintao Chao and Zizhen Li and Yao Mu and Tianxing Chen and Shoujie Li and Chuqiao Lyu and Xiao-ping Zhang and Wenbo Ding},
      year={2025},
      eprint={2503.01439},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2503.01439}, 
}