GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

Dongli Wu¹, Xiaobao Wei², Hao Wang², Qiaochu Dong², Ying Li²,³, Qingpo Wuwu², Ming Lu², Wufan Zhao¹

¹ The Hong Kong University of Science and Technology (Guangzhou)
² Peking University
³ The Hong Kong University of Science and Technology

GraspFoM turns partial RGB-D observations into a shared 3D object latent, then uses that latent to jointly reconstruct high-fidelity 3D assets and generate continuous 6-DoF grasp poses.

3D Foundation Priors · 6-DoF Grasping · Object Reconstruction · Multimodal Pose Generation
Overview of the GraspFoM framework
Training fuses SAM3D object priors with grasp supervision; inference performs anchor-based denoising, score-based selection, and asset reconstruction from the shared latent.

Abstract

Robotic grasping remains challenging under partial observations because reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware methods often use reconstruction as an intermediate prediction, leaving the geometry weakly coupled with grasp planning.

GraspFoM leverages 3D foundation priors from SAM3D to build a shared object latent for both reconstruction and grasp pose prediction. On top of this latent, it introduces anchor-initialized truncated diffusion for continuous multimodal 6-DoF grasp generation, together with a reconstruction-aware scorer and residual latent updater that connect grasp supervision back to object reconstruction.

Shared 3D object latent for reconstruction and grasping
6-DoF continuous anchor-initialized grasp generation
Mesh + 3DGS high-fidelity object asset reconstruction
State-of-the-art grasping and reconstruction performance on GraspNet-1B

Foundation-Prior Object Latent

SAM3D priors provide object-centric point maps and compact shape latents that encode local surface geometry and global object structure from partial observations.

Anchor-Initialized Diffusion

Learned grasp anchors initialize truncated denoising in normalized pose space, avoiding discrete candidate enumeration while preserving multimodal grasp hypotheses.
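The idea of starting denoising from anchors rather than from pure noise can be sketched as follows. This is a minimal illustration under stated assumptions, not the GraspFoM implementation: the linear beta schedule, the DDIM-style deterministic update, the start step of 50, the 7-dimensional pose vector, and the toy denoiser (which simply snaps to the nearest of two hypothetical grasp modes) are all placeholders chosen for the example.

```python
import numpy as np

def make_alpha_bar(num_steps=1000):
    # Linear beta schedule: a common default, assumed here (not from the source).
    betas = np.linspace(1e-4, 2e-2, num_steps)
    return np.cumprod(1.0 - betas)

def truncated_denoise(anchors, denoiser, alpha_bar, start_step=50, num_iters=5, seed=0):
    """Refine (K, D) pose anchors with a few deterministic DDIM-style steps.

    denoiser is a callable (x_t, t) -> predicted clean poses of shape (K, D).
    """
    rng = np.random.default_rng(seed)
    steps = np.linspace(start_step, 0, num_iters + 1).astype(int)
    a_start = alpha_bar[steps[0]]
    # Truncation: lightly noise the anchors instead of sampling a pure Gaussian.
    noise = rng.standard_normal(anchors.shape)
    x = np.sqrt(a_start) * anchors + np.sqrt(1.0 - a_start) * noise
    for t, t_next in zip(steps[:-1], steps[1:]):
        x0_hat = denoiser(x, t)
        if t_next == 0:
            x = x0_hat  # final step: keep the clean prediction
            break
        a_t, a_next = alpha_bar[t], alpha_bar[t_next]
        eps_hat = (x - np.sqrt(a_t) * x0_hat) / np.sqrt(1.0 - a_t)
        x = np.sqrt(a_next) * x0_hat + np.sqrt(1.0 - a_next) * eps_hat
    return x

def make_toy_denoiser(modes):
    # Hypothetical stand-in for a learned network: snap to the nearest grasp mode.
    def predict(x_t, t):
        dists = np.linalg.norm(x_t[:, None, :] - modes[None, :, :], axis=-1)
        return modes[dists.argmin(axis=1)]
    return predict

modes = np.stack([-np.ones(7), np.ones(7)])  # two distinct grasp hypotheses
anchors = np.stack([modes[0] + 0.2, modes[1] - 0.2, modes[0] - 0.2])
refined = truncated_denoise(anchors, make_toy_denoiser(modes), make_alpha_bar())
```

Because each anchor is only lightly perturbed, it stays in the basin of its own mode, so the refined set keeps both hypotheses instead of collapsing to one. Poses remain continuous vectors throughout, with no discrete candidate grid.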

Reconstruction-Aware Feedback

Point-wise scores aggregate grasp-relevant features for a lightweight residual updater, letting manipulation supervision refine the shared object latent without discarding geometry.
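A rough sketch of score-weighted aggregation followed by a residual update is given below. The softmax weighting, the single linear projection `W`, the `tanh` bound, and the `gate` scale are assumptions made for illustration; the page does not specify the updater's architecture.

```python
import numpy as np

def softmax(x):
    z = x - x.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def residual_latent_update(latent, point_feats, point_scores, W, gate=0.1):
    """Pool point features by score and add a small bounded residual to the latent.

    latent: (C,), point_feats: (N, F), point_scores: (N,), W: (F, C).
    """
    weights = softmax(point_scores)  # high-scoring points dominate the pool
    pooled = weights @ point_feats   # (F,) grasp-relevant summary feature
    delta = np.tanh(pooled @ W)      # (C,) residual bounded to [-1, 1]
    return latent + gate * delta     # original latent is kept, only nudged

rng = np.random.default_rng(0)
latent = rng.standard_normal(8)
point_feats = rng.standard_normal((100, 16))
point_scores = rng.standard_normal(100)
W = rng.standard_normal((16, 8))
updated = residual_latent_update(latent, point_feats, point_scores, W)
```

The residual form is what keeps geometry from being discarded: with a small gate, the updated latent never moves more than `gate` per dimension from the reconstruction latent, and a zero projection leaves it unchanged.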

Results

State-of-the-Art Grasping and Reconstruction on GraspNet-1B

GraspFoM is evaluated on the official GraspNet-1Billion benchmark for grasp pose prediction and object-level 3D reconstruction. It is trained only on the official GraspNet-1B training split, while several unified baselines rely on additional pretraining or multi-view inputs.

Seen AP: 78.87 (+6.44 absolute AP gain)
Similar AP: 73.34 (+7.89 absolute AP gain)
Novel AP: 54.32 (+25.83 absolute AP gain)
Reconstruction: 2.74 CD, 96.08 F1, and 87.18 NC
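The absolute gains quoted above match the difference between the GraspFoM and ZeroGrasp rows of the grasp pose table on this page. The short check below reproduces that arithmetic; reading the gains as "relative to ZeroGrasp" is an inference from the table values, not a statement made on the page.

```python
# AP values copied from the grasp pose prediction table on this page.
zerograsp_ap = {"Seen": 72.43, "Similar": 65.45, "Novel": 28.49}
graspfom_ap = {"Seen": 78.87, "Similar": 73.34, "Novel": 54.32}

# Absolute gain in AP points per split, rounded to the table's precision.
gains = {split: round(graspfom_ap[split] - zerograsp_ap[split], 2) for split in graspfom_ap}
print(gains)  # {'Seen': 6.44, 'Similar': 7.89, 'Novel': 25.83}
```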

Grasp Pose Prediction

AP, AP0.8, and AP0.4 are reported on Seen, Similar, and Novel splits. G and R indicate grasp and reconstruction outputs.

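Method | G | R | Seen AP | Seen AP0.8 | Seen AP0.4 | Similar AP | Similar AP0.8 | Similar AP0.4 | Novel AP | Novel AP0.8 | Novel AP0.4
GG-CNN | Yes | No | 15.48 | 21.84 | 10.25 | 13.26 | 18.37 | 4.62 | 5.52 | 5.93 | 1.86
MultiObject MultiGrasp | Yes | No | 15.97 | 23.66 | 10.80 | 15.41 | 20.21 | 7.06 | 7.64 | 8.69 | 2.52
CenterGrasp | Yes | Yes | 16.46 | 20.24 | 11.74 | 9.52 | 11.92 | 5.71 | 1.60 | 1.89 | 1.12
GPD | Yes | No | 22.87 | 28.53 | 12.84 | 21.33 | 27.83 | 9.64 | 8.24 | 8.89 | 2.67
PointNetGPD | Yes | No | 25.96 | 33.01 | 15.37 | 22.68 | 29.15 | 10.76 | 9.23 | 9.89 | 2.74
GraspNet | Yes | No | 27.56 | 33.43 | 16.59 | 26.11 | 34.18 | 14.23 | 10.55 | 11.25 | 3.98
GSNet | Yes | No | 67.12 | 78.46 | 60.90 | 54.81 | 66.72 | 46.17 | 24.31 | 30.52 | 14.23
Scale-Balanced Grasp | Yes | No | 63.83 | 74.25 | 58.66 | 58.46 | 70.05 | 51.32 | 24.63 | 31.05 | 12.85
HGGD | Yes | No | 64.45 | 72.81 | 61.16 | 53.59 | 64.12 | 45.91 | 24.59 | 30.46 | 15.58
EconomicGrasp | Yes | No | 68.21 | 79.60 | 63.54 | 61.19 | 73.60 | 53.77 | 25.48 | 31.46 | 13.85
MG-Grasp | Yes | Yes | 66.80 | - | - | 57.35 | - | - | 23.22 | - | -
ZeroGrasp | Yes | Yes | 72.43 | 83.12 | 65.57 | 65.45 | 78.32 | 55.48 | 28.49 | 34.21 | 15.80
GraspFoM | Yes | Yes | 78.87 | 89.60 | 71.13 | 73.34 | 87.84 | 64.12 | 54.32 | 64.01 | 29.36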

3D Reconstruction

Geometry quality on GraspNet-1B using CD, F1-Score@10mm, and Normal Consistency.

Method | Segmented | CD ↓ | F1 ↑ | NC ↑
Minkowski | Yes | 6.84 | 81.45 | 77.89
OCNN | Yes | 7.23 | 82.22 | 78.44
OctMAE | No | 7.57 | 78.38 | 75.19
ZeroGrasp | Yes | 6.05 | 84.08 | 78.46
GraspFoM | Yes | 2.74 | 96.08 | 87.18
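For readers unfamiliar with the three metrics, the sketch below shows one common way to compute Chamfer distance (CD), F1-Score at a 10 mm threshold, and Normal Consistency (NC) between two point clouds. It assumes points in millimetres, unit-length normals, unsquared distances, and symmetric averaging. Benchmark conventions differ on these details, so this is not necessarily the exact protocol behind the numbers above.

```python
import numpy as np

def nearest(src, dst):
    # Brute-force nearest neighbour: distance and index in dst for each src point.
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    idx = d.argmin(axis=1)
    return d[np.arange(len(src)), idx], idx

def reconstruction_metrics(pred, gt, pred_normals, gt_normals, tau=10.0):
    """Return (CD, F1 in %, NC in %) for (N, 3) predicted and (M, 3) ground-truth points."""
    d_pg, i_pg = nearest(pred, gt)  # accuracy direction
    d_gp, i_gp = nearest(gt, pred)  # completeness direction
    cd = 0.5 * (d_pg.mean() + d_gp.mean())
    precision = (d_pg < tau).mean()
    recall = (d_gp < tau).mean()
    total = precision + recall
    f1 = 0.0 if total == 0 else 2.0 * precision * recall / total
    # Absolute cosine between each normal and the normal of its nearest neighbour.
    nc_pg = np.abs((pred_normals * gt_normals[i_pg]).sum(axis=1)).mean()
    nc_gp = np.abs((gt_normals * pred_normals[i_gp]).sum(axis=1)).mean()
    return cd, 100.0 * f1, 100.0 * 0.5 * (nc_pg + nc_gp)

# Toy example: a flat 5x5 grid with 100 mm spacing, and a copy lifted by 20 mm.
xs, ys = np.meshgrid(np.arange(5) * 100.0, np.arange(5) * 100.0)
gt = np.stack([xs.ravel(), ys.ravel(), np.zeros(25)], axis=1)
normals = np.tile([0.0, 0.0, 1.0], (25, 1))
same = reconstruction_metrics(gt, gt, normals, normals)
shifted = reconstruction_metrics(gt + [0.0, 0.0, 20.0], gt, normals, normals)
```

In the toy case, the lifted copy keeps perfect normal agreement but has a CD of 20 mm and an F1 of 0, since every point misses the 10 mm threshold. This is why the three metrics are reported together.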

Qualitative Results

Visual Evidence Across Reconstruction, Scenes, and Pose Assets

GraspFoM predicts grasp poses together with object assets from the shared 3D latent. The examples below show reconstructed geometry, 6-DoF grasp visualizations, and scene-level outputs.

Single-Object Reconstruction Renders

Object-level assets are arranged as an interactive result wall.

Figure: single-object reconstruction renders (apple, can, an additional object, and a reconstruction shown with its grasp pose).

Scene-Level Composite Results

Scene examples are spread out for quick visual comparison across different object layouts.

Figure: scene-level composite results for scenes 0001, 0059, 0064, and 0196.

Citation

BibTeX

@article{wu2026graspfom,
  title={GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors},
  author={Wu, Dongli and Wei, Xiaobao and Wang, Hao and Dong, Qiaochu and Li, Ying and Wuwu, Qingpo and Lu, Ming and Zhao, Wufan},
  journal={arXiv preprint},
  year={2026}
}