🐾 SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

Xuyi Hu¹* Jin Lyu²* Jiuming Liu¹ Yebin Liu³ Silvia Zuffi⁴

Liang An³† Stefan Goetz¹

¹University of Cambridge ²Southern University of Science and Technology

³Tsinghua University ⁴IMATI-CNR, Milan, Italy

(* Equal Contribution, † Corresponding Author)

We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single in-the-wild image. Built on the SMAL+ parametric animal model, our method jointly recovers up to 30 animals in one forward pass and accepts optional keypoint and mask prompts to disambiguate crowded and occluded scenes—without requiring per-instance bounding-box cropping.

Abstract

3D animal reconstruction in the wild remains challenging due to large species variation, frequent occlusions, and the prevalence of multi-animal scenes, while existing methods predominantly focus on single-animal settings. We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single image. Built on the SMAL+ parametric animal model, our method jointly reconstructs multiple instances and supports flexible prompts in the form of keypoints and masks, which enable more reliable disambiguation in crowded and occluded scenes. To train such a model, we further introduce Herd3D, a multi-animal 3D dataset containing over 5K images, designed to increase diversity in species, interactions, and occlusion patterns. Experiments on the Animal3D, APTv2, and Animal Kingdom datasets show that our framework achieves state-of-the-art results over both existing model-based and model-free methods, demonstrating a scalable and effective solution for prompt-driven animal 3D reconstruction in the wild.

Method Overview

Given a full image, SAM 3D Animal uses a ViT-Huge encoder to extract visual tokens and a SAM-style promptable Transformer decoder to predict SMAL+ shape, pose, camera, and bounding boxes for every animal instance. Unlike SAM 3D Body, which reconstructs one prompted human per forward pass, our model adopts a set-prediction paradigm with DETR-style bipartite matching and predicts up to P = 30 animals at once. Optional keypoint prompts provide skeletal alignment cues, while mask prompts sharpen silhouette discrimination when animals overlap.
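To make the set-prediction step concrete, the following is a minimal sketch of the one-to-one assignment that DETR-style training relies on: each ground-truth animal is matched to exactly one of the P prediction slots so that the total matching cost is minimal. It is not the authors' implementation. The function name `match_instances` and the contents of the cost matrix are assumptions for illustration; the page does not say which terms (for example box, keypoint, or parameter errors) enter the cost. The solver is a plain-Python Hungarian algorithm; in practice a library routine such as `scipy.optimize.linear_sum_assignment` is a common choice.

```python
from typing import List, Tuple

INF = float("inf")


def match_instances(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost one-to-one assignment of ground-truth animals to slots.

    cost[g][q] is a hypothetical matching cost between ground-truth animal g
    and prediction slot q. Requires len(cost) <= len(cost[0]), i.e. no more
    animals than slots. Returns (gt_index, slot_index) pairs sorted by gt_index.
    """
    if not cost:
        return []  # image without annotated animals: nothing to match
    n, m = len(cost), len(cost[0])
    assert n <= m, "more ground-truth animals than prediction slots"

    # Hungarian algorithm with potentials (1-based indices, index 0 is a sentinel).
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augmenting path found
                break
        while j0 != 0:       # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)
```

With P = 30 slots and, say, three annotated animals, the cost matrix has shape 3 × 30 and the function returns three pairs. In the original DETR formulation the 27 unmatched slots are supervised towards a "no object" label; whether SAM 3D Animal treats unmatched slots the same way is not stated here.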

Multi-Animal Reconstruction in the Wild

From a single uncropped photograph containing herds, packs, or interacting groups, SAM 3D Animal outputs articulated SMAL+ meshes for all visible animals. The model handles diverse species and remains robust under heavy mutual occlusion.

Single-animal reconstruction showcase: original image, overlay reveal, mesh travel, and 3D rotation.

Additional in-the-wild examples with diverse species, occlusions, and group interactions.

Promptable Reconstruction

SAM 3D Animal supports two complementary prompt modalities inspired by promptable human mesh recovery:

Keypoint prompts align skeletal structure and are especially effective when limbs are partially visible.
Mask prompts specify instance silhouettes and help separate animals in dense herds.

Even without prompts, the model already achieves competitive performance. With prompts, accuracy improves consistently across Animal3D, APTv2, and Animal Kingdom, including gains of up to 54% AP and 80% mAP on Animal Kingdom over the strongest baseline and a 5.2 PA-MPJPE improvement on Animal3D. Ablation studies show that keypoint prompts are the dominant contributor, with performance improving monotonically as more keypoints are provided.
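For readers unfamiliar with the metric, PA-MPJPE is commonly defined as the mean per-joint position error after Procrustes (similarity) alignment of the predicted joints to the ground truth; lower is better, so an improvement corresponds to a reduction. The sketch below states that standard definition. The notation (J joints, predictions \hat{x}_j, ground truth x_j, scale s, rotation R, translation t) is introduced here for illustration and is not taken from the paper, which may differ in details such as the joint set or units.

```latex
% Procrustes alignment: best similarity transform in the least-squares sense
(s^{*}, R^{*}, t^{*}) \;=\;
  \operatorname*{arg\,min}_{s > 0,\; R \in SO(3),\; t \in \mathbb{R}^{3}}
  \sum_{j=1}^{J} \bigl\lVert s R \hat{x}_j + t - x_j \bigr\rVert_2^{2}

% PA-MPJPE: mean Euclidean joint error after applying that transform
\text{PA-MPJPE} \;=\; \frac{1}{J} \sum_{j=1}^{J}
  \bigl\lVert s^{*} R^{*} \hat{x}_j + t^{*} - x_j \bigr\rVert_2
```

Because the alignment removes global scale, rotation, and translation, the metric isolates errors in articulated pose and shape rather than in camera or placement estimates.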

Performance under different visibility levels

Herd3D Dataset

2D-only annotations are insufficient for training multi-instance 3D reconstruction that must resolve inter-animal occlusions. We introduce Herd3D, a multi-animal 3D dataset with more than 5K images and per-instance ground-truth SMAL+ meshes. Building on GenZoo, we adapt the pipeline for multi-animal generation in three ways: up to 8 animals are placed on a shared ground plane with a controlled layout; pose diversity is expanded using Animal3D; and a two-stage Qwen3-VL-8B-Instruct prompting scheme first predicts per-animal facing directions from the RGB render and then composes a coherent final prompt for Qwen-Image-ControlNet-Union synthesis. Each 1024×1024 image comes with SMAL+ parameters, 2D/3D keypoints, and bounding boxes.
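As an illustration of what placing several animals on a shared ground plane could look like, the sketch below uses rejection sampling to draw up to 8 ground positions that keep a minimum distance from one another, plus a random facing direction per animal. Everything specific here is hypothetical: the function name `sample_ground_layout`, the square extent, the minimum gap, and the uniform yaw are assumptions made for the example, and the actual Herd3D layout rules are not described on this page.

```python
import math
import random
from typing import List, Tuple

MAX_ANIMALS = 8  # upper bound per scene stated for Herd3D


def sample_ground_layout(
    num_animals: int,
    extent: float = 10.0,    # hypothetical side length of the square ground patch
    min_gap: float = 1.5,    # hypothetical minimum centre-to-centre distance
    seed: int = 0,
    max_tries: int = 10_000,
) -> List[Tuple[float, float, float]]:
    """Return (x, z, yaw) poses on the ground plane with pairwise spacing >= min_gap."""
    if not 1 <= num_animals <= MAX_ANIMALS:
        raise ValueError(f"num_animals must be in [1, {MAX_ANIMALS}]")

    rng = random.Random(seed)  # seeded so a layout can be regenerated exactly
    placed: List[Tuple[float, float, float]] = []
    half = extent / 2.0

    for _attempt in range(max_tries):
        if len(placed) == num_animals:
            break
        x = rng.uniform(-half, half)
        z = rng.uniform(-half, half)
        # Reject candidates that crowd an animal that is already placed.
        if all(math.hypot(x - px, z - pz) >= min_gap for px, pz, _yaw in placed):
            yaw = rng.uniform(0.0, 2.0 * math.pi)  # facing direction about the up axis
            placed.append((x, z, yaw))

    if len(placed) < num_animals:
        raise RuntimeError("could not place all animals; relax min_gap or extent")
    return placed
```

Calling `sample_ground_layout(8, seed=0)` yields eight poses; shrinking `min_gap` packs animals closer together, which is one simple way to vary the amount of mutual occlusion in the rendered scene.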

Figure panels: Qwen3-VL prompting, Qwen-ControlNet, synthetic image.

Comparisons & Ablations

We compare against state-of-the-art model-based and model-free animal mesh recovery methods on Animal3D, APTv2, and Animal Kingdom. SAM 3D Animal consistently produces more accurate poses and shapes in both single-animal and multi-animal settings. Ablations confirm that Herd3D pre-training, keypoint prompting, and mask prompting each contribute to the final performance.

Qualitative comparisons on Animal3D, Animal Kingdom and APT-36K

BibTeX

@article{hu2026sam3danimal,
  title   = {SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild},
  author  = {Hu, Xuyi and Lyu, Jin and Liu, Jiuming and Liu, Yebin and Zuffi, Silvia and An, Liang and Goetz, Stefan},
  journal = {arXiv preprint arXiv:2605.07604},
  year    = {2026}
}