🐾 SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

Xuyi Hu1*      Jin Lyu2*      Jiuming Liu1      Yebin Liu3      Silvia Zuffi4
Liang An3†      Stefan Goetz1
1University of Cambridge     2Southern University of Science and Technology
3Tsinghua University     4IMATI-CNR, Milan, Italy
(* Equal Contribution, † Corresponding Author)

We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single in-the-wild image. Built on the SMAL+ parametric animal model, our method jointly recovers up to 30 animals in one forward pass and accepts optional keypoint and mask prompts to disambiguate crowded and occluded scenes—without requiring per-instance bounding-box cropping.

Abstract

3D animal reconstruction in the wild remains challenging due to large species variation, frequent occlusions, and the prevalence of multi-animal scenes, yet existing methods predominantly focus on single-animal settings. We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single image. Built on the SMAL+ parametric animal model, our method jointly reconstructs multiple instances and supports flexible prompts in the form of keypoints and masks, which enable more reliable disambiguation in crowded and occluded scenes. To train such a model, we further introduce Herd3D, a multi-animal 3D dataset containing over 5K images, designed to increase diversity in species, interactions, and occlusion patterns. Experiments on the Animal3D, APTv2, and Animal Kingdom datasets show that our framework achieves state-of-the-art results, outperforming existing model-based and model-free methods and demonstrating a scalable and effective solution for prompt-driven animal 3D reconstruction in the wild.

Method Overview

SAM 3D Animal model architecture

Given a full image, SAM 3D Animal uses a ViT-Huge encoder to extract visual tokens and a SAM-style promptable Transformer decoder to predict SMAL+ shape, pose, camera, and bounding boxes for every animal instance. Unlike SAM 3D Body, which reconstructs one prompted human per forward pass, our model adopts a set-prediction paradigm with DETR-style bipartite matching and predicts up to P = 30 animals at once. Optional keypoint prompts provide skeletal alignment cues, while mask prompts sharpen silhouette discrimination when animals overlap.
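The DETR-style bipartite matching mentioned above can be illustrated with a small, dependency-free sketch. It assigns each ground-truth animal to exactly one of the P predicted slots by minimizing a total cost with the Hungarian algorithm. The L1 box cost, the `(cx, cy, w, h)` box format, and the function names are assumptions made for illustration; the actual matching cost used by SAM 3D Animal is not specified on this page. In practice, `scipy.optimize.linear_sum_assignment` is a common drop-in for the `hungarian` function.

```python
from typing import List, Sequence, Tuple

# Illustrative box format (an assumption): normalized (cx, cy, w, h).
Box = Tuple[float, float, float, float]


def l1_box_cost(pred: Box, gt: Box) -> float:
    """Hypothetical matching cost: L1 distance between two boxes."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def hungarian(cost: Sequence[Sequence[float]]) -> List[int]:
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list whose i-th entry is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("expects no more rows than columns")
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-based)
    v = [0.0] * (m + 1)          # column potentials (1-based)
    col_to_row = [0] * (m + 1)   # row matched to each column, 0 if free
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        col_to_row[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = col_to_row[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[col_to_row[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if col_to_row[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while j0:
            j1 = way[j0]
            col_to_row[j0] = col_to_row[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if col_to_row[j]:
            assignment[col_to_row[j] - 1] = j - 1
    return assignment


def match(gt_boxes: Sequence[Box], pred_boxes: Sequence[Box]) -> List[Tuple[int, int]]:
    """Pair each ground-truth animal with one predicted slot (gt_idx, pred_idx)."""
    if not gt_boxes:
        return []
    cost = [[l1_box_cost(pred, gt) for pred in pred_boxes] for gt in gt_boxes]
    return list(enumerate(hungarian(cost)))
```

In a set-prediction setup of this kind, slots left unmatched are typically supervised as "no object", which is how a fixed budget of P = 30 slots can cover images with fewer animals.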

Multi-Animal Reconstruction in the Wild

From a single uncropped photograph containing herds, packs, or interacting groups, SAM 3D Animal outputs articulated SMAL+ meshes for all visible animals. The model handles diverse species—including horses, dogs, antelopes, wolves, cats, and sheep—and remains robust under heavy mutual occlusion.

Qualitative evaluation of SAM 3D Animal

Qualitative results on challenging multi-animal scenes with input images and overlay reconstructions.

Promptable Reconstruction

SAM 3D Animal supports two complementary prompt modalities inspired by promptable human mesh recovery:

  • Keypoint prompts align skeletal structure and are especially effective when limbs are partially visible.
  • Mask prompts specify instance silhouettes and help separate animals in dense herds.

Even without prompts, the model already achieves competitive performance. With prompts, accuracy improves consistently across Animal3D, APTv2, and Animal Kingdom, including gains of up to 54% AP and 80% mAP on Animal Kingdom over the strongest baseline and a 5.2 PA-MPJPE improvement on Animal3D. Ablation studies show that keypoint prompts are the dominant contributor, with performance scaling monotonically as more keypoints are provided.
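For readers unfamiliar with the metric, PA-MPJPE is commonly defined as the mean per-joint position error measured after a Procrustes (similarity) alignment of the predicted joints to the ground truth. The notation below is ours for illustration: J is the number of joints, \hat{x}_j are predicted 3D joints, and x_j are ground-truth 3D joints. The joint set and units used in the evaluation are not stated on this page.

```latex
(s^{*}, R^{*}, t^{*}) =
  \arg\min_{s > 0,\; R \in SO(3),\; t \in \mathbb{R}^{3}}
  \sum_{j=1}^{J} \left\| s R \hat{x}_j + t - x_j \right\|_2^{2}

\mathrm{PA\text{-}MPJPE} =
  \frac{1}{J} \sum_{j=1}^{J} \left\| s^{*} R^{*} \hat{x}_j + t^{*} - x_j \right\|_2
```

Because this is an error measure, lower values are better, so an improvement corresponds to a reduction in the reported number.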

Ablation on number of prompt keypoints

Performance under different visibility levels

Herd3D Dataset

2D-only annotations are insufficient for training multi-instance 3D reconstruction that can resolve inter-animal occlusions. We introduce Herd3D, a multi-animal 3D dataset with more than 5K images and per-instance ground-truth SMAL+ meshes. Building on GenZoo, we adapt the pipeline for multi-animal generation in three ways: up to 8 animals are placed on a shared ground plane with a controlled layout; pose diversity is expanded using Animal3D; and a two-stage Qwen3-VL-8B-Instruct prompting scheme first predicts per-animal facing directions from the RGB render and then composes a coherent final prompt for Qwen-Image-ControlNet-Union synthesis. Each 1024×1024 image comes with SMAL+ parameters, 2D/3D keypoints, and bounding boxes.
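One way to picture the ground-plane placement step is rejection sampling of positions with a minimum separation. The sketch below is only an illustration under assumed parameters: the square extent, the minimum distance, the uniform yaw, and the function name `sample_layout` are all hypothetical and are not taken from the Herd3D pipeline, whose layout rules are not detailed on this page.

```python
import math
import random
from typing import List, Tuple

# (x, z, yaw): position on the ground plane y = 0 and heading in radians.
Placement = Tuple[float, float, float]

MAX_ANIMALS = 8  # upper bound stated for Herd3D scenes


def sample_layout(
    num_animals: int,
    extent: float = 8.0,      # assumed side length of the square placement area
    min_dist: float = 1.5,    # assumed minimum spacing between animal roots
    seed: int = 0,
    max_tries: int = 10_000,
) -> List[Placement]:
    """Sample non-crowded placements on a shared ground plane (hypothetical scheme)."""
    if not 1 <= num_animals <= MAX_ANIMALS:
        raise ValueError(f"num_animals must be in [1, {MAX_ANIMALS}]")
    rng = random.Random(seed)
    placed: List[Placement] = []
    for _ in range(max_tries):
        if len(placed) == num_animals:
            break
        x = rng.uniform(-extent / 2, extent / 2)
        z = rng.uniform(-extent / 2, extent / 2)
        # Reject candidates that are too close to an already placed animal.
        if all(math.hypot(x - px, z - pz) >= min_dist for px, pz, _ in placed):
            placed.append((x, z, rng.uniform(0.0, 2.0 * math.pi)))
    if len(placed) < num_animals:
        raise RuntimeError("could not place all animals; relax min_dist or extent")
    return placed
```

Keeping every animal on the same plane means relative depth and occlusion order follow directly from the sampled positions and the camera, which is the property a multi-animal dataset needs for per-instance ground truth.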

Qwen-ControlNet synthesis pipeline: Qwen3-VL prompting, Qwen-ControlNet, synthetic image
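The two-stage prompting scheme described above can be sketched as two calls to a vision-language model. Here `vlm` is a hypothetical callable standing in for Qwen3-VL-8B-Instruct, taking an image path and an instruction and returning text; the instruction wording, the facing-direction label set, and all function names are assumptions for illustration rather than the prompts used to build Herd3D.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a vision-language model call:
# (image_path, instruction) -> reply text.
VLM = Callable[[str, str], str]

# Illustrative label set for facing directions (an assumption).
ALLOWED_FACINGS = ("left", "right", "front", "back")


def predict_facings(vlm: VLM, render_path: str, species: Sequence[str]) -> List[str]:
    """Stage 1: ask the model for one facing direction per animal in the render."""
    listing = ", ".join(f"{i + 1}: {name}" for i, name in enumerate(species))
    question = (
        "For each numbered animal, answer with one word per line, chosen from "
        + ", ".join(ALLOWED_FACINGS)
        + f", describing its facing direction. Animals: {listing}"
    )
    reply = vlm(render_path, question)
    facings = [line.strip().lower() for line in reply.splitlines() if line.strip()]
    if len(facings) != len(species) or any(f not in ALLOWED_FACINGS for f in facings):
        raise ValueError(f"unexpected stage-1 reply: {reply!r}")
    return facings


def compose_final_prompt(
    vlm: VLM, render_path: str, species: Sequence[str], facings: Sequence[str]
) -> str:
    """Stage 2: turn the per-animal facts into one coherent generation prompt."""
    facts = "; ".join(f"{s} facing {f}" for s, f in zip(species, facings))
    instruction = (
        "Write one coherent image-generation prompt for a scene containing: " + facts
    )
    return vlm(render_path, instruction).strip()


def two_stage_prompt(vlm: VLM, render_path: str, species: Sequence[str]) -> str:
    """Run both stages and return the text prompt for the image generator."""
    facings = predict_facings(vlm, render_path, species)
    return compose_final_prompt(vlm, render_path, species, facings)
```

Splitting the task this way lets the first reply be validated against a small closed vocabulary before any free-form text is generated, so an inconsistent facing direction can be caught before it reaches the synthesis stage.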


Herd3D dataset example

Herd3D multi-animal dataset samples

Comparisons & Ablations

We compare against state-of-the-art model-based and model-free animal mesh recovery methods on Animal3D, APTv2, and Animal Kingdom. SAM 3D Animal consistently produces more accurate poses and shapes in both single-animal and multi-animal settings. Ablations confirm that Herd3D pre-training, keypoint prompting, and mask prompting each contribute to the final performance.

Qualitative comparisons on Animal3D, Animal Kingdom, and APT-36K

Ablation studies

BibTeX

@article{hu2026sam3danimal,
  title   = {SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild},
  author  = {Hu, Xuyi and Lyu, Jin and Liu, Jiuming and Liu, Yebin and Zuffi, Silvia and An, Liang and Goetz, Stefan},
  journal = {arXiv preprint arXiv:2605.07604},
  year    = {2026}
}

Related Work

SAM 3D Body: Robust Full-Body Human Mesh Recovery.
AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer. CVPR 2025.
AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves. TPAMI 2025.
3D-Fauna: Learning the 3D Fauna of the Web. CVPR 2024.
MagicPony: Learning Articulated 3D Animals in the Wild. CVPR 2023.
Animal3D: A Comprehensive Dataset for Animal 3D Pose Estimation. ICCV 2023.
GenZoo: Generative Zoo. ICCV 2025.
AWOL: Analysis WithOut synthesis using Language. ECCV 2024.
PromptHMR: Promptable Human Mesh Recovery.