Model Zoo of DocSAM

April 8, 2025

Model Zoo

As an all-in-one model, DocSAM was pre-trained on nearly 50 datasets. We provide pre-trained weights for two model variants: DocSAM-Base and DocSAM-Large, with their visual backbones based on Swin-Base and Swin-Large, respectively. Additionally, we offer several specialized models fine-tuned on specific datasets. Information and download links for these pre-trained models are listed in the table below.

| Name | Backbone | #Parameters | Batch Size | #Iterations | Patch Size | Download |
|---|---|---|---|---|---|---|
| docsam_base_all_dataset | swin-base | 208M | 32 | 360k | 640×640 | model |
| docsam_large_all_dataset | swin-large | 317M | 32 | 360k | 640×640 | model |
| docsam_large_all_dataset_keepsize | swin-large | 317M | 32 | 360k | 640×640 | model |
| docsam_large_m6doc | swin-large | 317M | 16 | 40k | 800×800 | model |
| docsam_large_doclaynet | swin-large | 317M | 16 | 40k | 800×800 | model |
| docsam_large_scut_cab | swin-large | 317M | 16 | 40k | 800×800 | model |
| docsam_large_ctw1500 | swin-large | 317M | 16 | 40k | 800×800 | model |
| docsam_large_totaltext | swin-large | 317M | 16 | 40k | 800×800 | model |

Notice

By default, during training, we randomly resize the input images so that their shorter sides range between 704 and 896 pixels. After resizing, we randomly crop the images into patches of size 640×640 pixels. During testing and inference, we uniformly resize the input images to have a shorter side of 800 pixels and perform predictions using a sliding window approach. The window size is set to 640×640 pixels, with strides of 320 pixels in both the horizontal and vertical directions.
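The test-time procedure can be sketched as follows. This is a minimal illustration, not the DocSAM implementation: the function names are hypothetical, and the extra border-aligned window is an assumption, since the text above only specifies the shorter side (800), the window size (640×640), and the stride (320).

```python
def resize_shorter_side(width, height, target=800):
    """Return (new_width, new_height) with the shorter side scaled to `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def window_starts(length, window=640, stride=320):
    """Start offsets of sliding windows along one axis of size `length`."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    # Assumption: add a last window flush with the border so no pixels are skipped.
    if starts[-1] + window < length:
        starts.append(length - window)
    return starts


def sliding_windows(width, height, window=640, stride=320):
    """Return (left, top, right, bottom) boxes for every window position."""
    return [
        (x, y, min(x + window, width), min(y + window, height))
        for y in window_starts(height, window, stride)
        for x in window_starts(width, window, stride)
    ]


# Hypothetical 1240x1754 page scan, resized and then tiled.
w, h = resize_shorter_side(1240, 1754)
boxes = sliding_windows(w, h)
print((w, h), len(boxes))
```

With a stride of half the window size, most interior pixels are covered by more than one window, which is why the per-patch predictions need to be merged afterwards.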

Additionally, we provide a model called docsam_large_all_dataset_keepsize, which uses a different resizing method. This model retains the original sizes of the input images as much as possible, ensuring that the shorter sides are within the range of 640 to 1280 pixels. Other settings are kept the same as docsam_large_all_dataset.
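One way to read the keepsize rule is as a clamp on the shorter side. The sketch below compares it with the default fixed-size rule; the function names are hypothetical, and the clamping behavior is an assumption based on the description above rather than a copy of the actual DocSAM code.

```python
def fixed_scale(width, height, target=800):
    """Default rule: scale so the shorter side equals `target`."""
    return target / min(width, height)


def keepsize_scale(width, height, min_side=640, max_side=1280):
    """Keepsize rule (assumed): leave the image unchanged when its shorter
    side is already within [min_side, max_side]; otherwise scale it to the
    nearest bound."""
    shorter = min(width, height)
    clamped = max(min_side, min(shorter, max_side))
    return clamped / shorter


# Hypothetical image sizes (width, height).
for size in [(500, 700), (1000, 1400), (2560, 3000)]:
    print(size, fixed_scale(*size), keepsize_scale(*size))
```

Under this reading, an image whose shorter side is 1000 pixels is downscaled by the default rule but left untouched by the keepsize rule.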

For fine-tuning on the M6Doc, DocLayNet, SCUT-CAB, CTW1500, and Total-Text datasets, we adopt a patch size of 800×800 pixels. This approach aims to prevent regions of interest from being fragmented across different patches.

The original DocLayNet dataset resizes all images to a fixed size of 1025×1025 pixels, which may distort the aspect ratios of the documents. To better preserve the original aspect ratios, we have resized the images to 1025×1449 pixels before sending them to our model.

Model Performance

The performance of the aforementioned models is shown in the following tables. For more details about the datasets listed in these tables, please refer to the original paper.

Please note that the pre-trained models provided here (docsam_base_all_dataset, docsam_large_all_dataset, and docsam_large_all_dataset_keepsize) were trained for more iterations (360k) than the models in our original paper. As a result, their performance is higher than what the paper reports.

Additionally, the dataset-specific models, including docsam_large_m6doc, docsam_large_doclaynet, docsam_large_scut_cab, docsam_large_ctw1500, and docsam_large_totaltext, are fine-tuned from the above docsam_large_all_dataset model. Consequently, their performance (DocSAM*) is also higher than what was reported in the original paper (DocSAM).

docsam_base_all_dataset

AP50, AP75, mAP, mAPb, and mAF are instance-level metrics; mIoU is the semantic metric.

| Task | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DLA | BaDLAD | 0.686 | 0.480 | 0.464 | 0.478 | 0.545 | 0.686 | CDLA | 0.935 | 0.871 | 0.774 | 0.772 | 0.776 | 0.832 |
| | D4LA | 0.654 | 0.585 | 0.515 | 0.509 | 0.552 | 0.486 | DocBank | 0.638 | 0.478 | 0.442 | 0.425 | 0.519 | 0.638 |
| | DocLayNet | 0.753 | 0.595 | 0.541 | 0.529 | 0.590 | 0.650 | ICDAR2017-POD | 0.892 | 0.846 | 0.803 | 0.788 | 0.796 | 0.920 |
| | IIIT-AR-13K | 0.828 | 0.647 | 0.599 | 0.602 | 0.655 | 0.631 | M6Doc | 0.570 | 0.462 | 0.413 | 0.407 | 0.438 | 0.328 |
| | PubLayNet | 0.951 | 0.900 | 0.848 | 0.840 | 0.884 | 0.918 | RanLayNet | 0.948 | 0.897 | 0.845 | 0.835 | 0.872 | 0.905 |
| AHDS | CASIA-AHCDB-style1 | 0.961 | 0.936 | 0.874 | 0.853 | 0.899 | 0.948 | CASIA-AHCDB-style2 | 0.940 | 0.922 | 0.837 | 0.825 | 0.880 | 0.924 |
| | CHDAC-2022 | 0.872 | 0.730 | 0.613 | 0.560 | 0.632 | 0.908 | ICDAR2019-HDRC | 0.951 | 0.820 | 0.767 | 0.722 | 0.822 | 0.912 |
| | SCUT-CAB-physical | 0.944 | 0.873 | 0.793 | 0.775 | 0.827 | 0.939 | SCUT-CAB-logical | 0.773 | 0.647 | 0.562 | 0.549 | 0.561 | 0.488 |
| | MTHv2 | 0.945 | 0.857 | 0.718 | 0.687 | 0.736 | 0.916 | HJDataset | 0.966 | 0.932 | 0.898 | 0.888 | 0.904 | 0.822 |
| | CASIA-HWDB | 0.971 | 0.901 | 0.845 | 0.772 | 0.876 | 0.951 | SCUT-HCCDoc | 0.845 | 0.657 | 0.548 | 0.552 | 0.597 | 0.854 |
| TSR | FinTabNet | 0.879 | 0.802 | 0.712 | 0.697 | 0.797 | 0.867 | PubTabNet | 0.981 | 0.903 | 0.829 | 0.821 | 0.869 | 0.927 |
| | ICDAR2013 | 0.948 | 0.525 | 0.602 | 0.525 | 0.550 | 0.846 | ICDAR2017-POD | 0.942 | 0.847 | 0.766 | 0.739 | 0.796 | 0.895 |
| | cTDaR-modern | 0.916 | 0.561 | 0.636 | 0.594 | 0.692 | 0.879 | cTDaR-archival | 0.925 | 0.770 | 0.713 | 0.659 | 0.733 | 0.959 |
| | NTable-cam | 0.851 | 0.749 | 0.673 | 0.680 | 0.718 | 0.865 | NTable-gen | 0.946 | 0.914 | 0.852 | 0.849 | 0.896 | 0.945 |
| | PubTables-1M-TD | 0.982 | 0.953 | 0.910 | 0.867 | 0.945 | 0.968 | PubTables-1M-TSR | 0.902 | 0.821 | 0.752 | 0.734 | 0.794 | 0.836 |
| | TableBank-latex | 0.968 | 0.950 | 0.923 | 0.906 | 0.939 | 0.952 | TableBank-word | 0.874 | 0.841 | 0.822 | 0.823 | 0.852 | 0.857 |
| | TNCR | 0.704 | 0.650 | 0.629 | 0.613 | 0.658 | 0.555 | STDW | 0.960 | 0.933 | 0.913 | 0.887 | 0.925 | 0.974 |
| | WTW | 0.948 | 0.905 | 0.803 | 0.786 | 0.812 | 0.974 | | | | | | | |
| STD | CASIA-10k | 0.638 | 0.411 | 0.383 | 0.380 | 0.413 | 0.799 | COCO-Text | 0.519 | 0.248 | 0.264 | 0.270 | 0.292 | 0.619 |
| | CTW1500 | 0.809 | 0.558 | 0.494 | 0.492 | 0.565 | 0.831 | CTW-Public | 0.369 | 0.124 | 0.167 | 0.139 | 0.199 | 0.540 |
| | HUST-TR400 | 0.811 | 0.680 | 0.600 | 0.605 | 0.643 | 0.859 | ICDAR2015 | 0.657 | 0.316 | 0.337 | 0.350 | 0.364 | 0.635 |
| | ICDAR2017-RCTW | 0.585 | 0.295 | 0.304 | 0.312 | 0.359 | 0.803 | ICDAR2017-MLT | 0.669 | 0.468 | 0.418 | 0.415 | 0.457 | 0.835 |
| | ICDAR2019-ArT | 0.733 | 0.471 | 0.428 | 0.447 | 0.466 | 0.802 | ICDAR2019-LSVT | 0.597 | 0.372 | 0.351 | 0.352 | 0.397 | 0.814 |
| | ICDAR2019-MLT | 0.702 | 0.498 | 0.445 | 0.443 | 0.480 | 0.846 | ICDAR2019-ReCTS | 0.739 | 0.553 | 0.490 | 0.485 | 0.523 | 0.845 |
| | ICDAR2023-HierText | 0.594 | 0.333 | 0.327 | 0.316 | 0.359 | 0.671 | ICDAR2023-ReST | 0.944 | 0.900 | 0.771 | 0.850 | 0.486 | 0.833 |
| | ICPR2018-MTWI | 0.644 | 0.394 | 0.379 | 0.383 | 0.424 | 0.838 | MSRA-TD500 | 0.825 | 0.628 | 0.542 | 0.581 | 0.558 | 0.767 |
| | ShopSign | 0.629 | 0.279 | 0.311 | 0.322 | 0.365 | 0.817 | Total-Text | 0.773 | 0.470 | 0.433 | 0.451 | 0.474 | 0.781 |
| | USTB-SV1K | 0.826 | 0.379 | 0.425 | 0.412 | 0.467 | 0.714 | | | | | | | |

docsam_large_all_dataset

AP50, AP75, mAP, mAPb, and mAF are instance-level metrics; mIoU is the semantic metric.

| Task | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DLA | BaDLAD | 0.713 | 0.504 | 0.488 | 0.497 | 0.562 | 0.694 | CDLA | 0.955 | 0.910 | 0.818 | 0.820 | 0.822 | 0.885 |
| | D4LA | 0.722 | 0.662 | 0.581 | 0.576 | 0.600 | 0.560 | DocBank | 0.692 | 0.553 | 0.512 | 0.501 | 0.585 | 0.702 |
| | DocLayNet | 0.826 | 0.697 | 0.628 | 0.621 | 0.675 | 0.725 | ICDAR2017-POD | 0.916 | 0.874 | 0.841 | 0.830 | 0.824 | 0.931 |
| | IIIT-AR-13K | 0.887 | 0.754 | 0.674 | 0.680 | 0.716 | 0.776 | M6Doc | 0.682 | 0.590 | 0.519 | 0.512 | 0.538 | 0.467 |
| | PubLayNet | 0.965 | 0.921 | 0.876 | 0.877 | 0.900 | 0.931 | RanLayNet | 0.950 | 0.925 | 0.873 | 0.869 | 0.863 | 0.882 |
| AHDS | CASIA-AHCDB-style1 | 0.977 | 0.965 | 0.914 | 0.899 | 0.934 | 0.955 | CASIA-AHCDB-style2 | 0.952 | 0.938 | 0.865 | 0.846 | 0.900 | 0.930 |
| | CHDAC-2022 | 0.917 | 0.812 | 0.673 | 0.637 | 0.697 | 0.916 | ICDAR2019-HDRC | 0.955 | 0.830 | 0.777 | 0.728 | 0.829 | 0.919 |
| | SCUT-CAB-physical | 0.943 | 0.888 | 0.812 | 0.792 | 0.849 | 0.947 | SCUT-CAB-logical | 0.786 | 0.675 | 0.583 | 0.567 | 0.595 | 0.548 |
| | MTHv2 | 0.956 | 0.878 | 0.731 | 0.706 | 0.756 | 0.919 | HJDataset | 0.986 | 0.945 | 0.916 | 0.908 | 0.918 | 0.823 |
| | CASIA-HWDB | 0.976 | 0.934 | 0.879 | 0.798 | 0.900 | 0.956 | SCUT-HCCDoc | 0.877 | 0.684 | 0.574 | 0.581 | 0.640 | 0.860 |
| TSR | FinTabNet | 0.887 | 0.818 | 0.734 | 0.716 | 0.817 | 0.877 | PubTabNet | 0.992 | 0.917 | 0.839 | 0.833 | 0.878 | 0.932 |
| | ICDAR2013 | 0.919 | 0.532 | 0.598 | 0.517 | 0.543 | 0.844 | ICDAR2017-POD | 0.944 | 0.855 | 0.774 | 0.750 | 0.797 | 0.899 |
| | cTDaR-modern | 0.932 | 0.593 | 0.662 | 0.617 | 0.713 | 0.886 | cTDaR-archival | 0.922 | 0.793 | 0.734 | 0.679 | 0.756 | 0.965 |
| | NTable-cam | 0.906 | 0.838 | 0.757 | 0.772 | 0.784 | 0.892 | NTable-gen | 0.961 | 0.933 | 0.905 | 0.906 | 0.941 | 0.964 |
| | PubTables-1M-TD | 0.985 | 0.935 | 0.896 | 0.879 | 0.939 | 0.977 | PubTables-1M-TSR | 0.922 | 0.856 | 0.793 | 0.781 | 0.825 | 0.859 |
| | TableBank-latex | 0.975 | 0.965 | 0.939 | 0.938 | 0.953 | 0.959 | TableBank-word | 0.893 | 0.859 | 0.858 | 0.856 | 0.868 | 0.875 |
| | TNCR | 0.719 | 0.677 | 0.656 | 0.649 | 0.693 | 0.642 | STDW | 0.969 | 0.945 | 0.930 | 0.921 | 0.942 | 0.972 |
| | WTW | 0.960 | 0.927 | 0.824 | 0.810 | 0.838 | 0.978 | | | | | | | |
| STD | CASIA-10k | 0.679 | 0.453 | 0.416 | 0.415 | 0.444 | 0.819 | COCO-Text | 0.577 | 0.282 | 0.297 | 0.306 | 0.324 | 0.655 |
| | CTW1500 | 0.809 | 0.581 | 0.503 | 0.505 | 0.586 | 0.843 | CTW-Public | 0.484 | 0.161 | 0.210 | 0.179 | 0.230 | 0.615 |
| | HUST-TR400 | 0.861 | 0.769 | 0.654 | 0.661 | 0.701 | 0.863 | ICDAR2015 | 0.710 | 0.356 | 0.370 | 0.385 | 0.400 | 0.658 |
| | ICDAR2017-RCTW | 0.624 | 0.315 | 0.328 | 0.339 | 0.383 | 0.811 | ICDAR2017-MLT | 0.705 | 0.506 | 0.451 | 0.450 | 0.492 | 0.857 |
| | ICDAR2019-ArT | 0.774 | 0.509 | 0.462 | 0.484 | 0.507 | 0.814 | ICDAR2019-LSVT | 0.641 | 0.409 | 0.381 | 0.386 | 0.430 | 0.827 |
| | ICDAR2019-MLT | 0.746 | 0.547 | 0.485 | 0.484 | 0.523 | 0.867 | ICDAR2019-ReCTS | 0.780 | 0.602 | 0.530 | 0.527 | 0.565 | 0.862 |
| | ICDAR2023-HierText | 0.630 | 0.365 | 0.354 | 0.342 | 0.390 | 0.694 | ICDAR2023-ReST | 0.956 | 0.930 | 0.831 | 0.906 | 0.727 | 0.860 |
| | ICPR2018-MTWI | 0.682 | 0.429 | 0.410 | 0.415 | 0.453 | 0.853 | MSRA-TD500 | 0.857 | 0.654 | 0.564 | 0.510 | 0.601 | 0.774 |
| | ShopSign | 0.660 | 0.321 | 0.342 | 0.357 | 0.411 | 0.832 | Total-Text | 0.788 | 0.509 | 0.456 | 0.475 | 0.508 | 0.798 |
| | USTB-SV1K | 0.838 | 0.416 | 0.438 | 0.426 | 0.493 | 0.722 | | | | | | | |

docsam_large_all_dataset_keepsize

AP50, AP75, mAP, mAPb, and mAF are instance-level metrics; mIoU is the semantic metric.

| Task | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DLA | BaDLAD | 0.696 | 0.494 | 0.477 | 0.487 | 0.547 | 0.700 | CDLA | 0.939 | 0.890 | 0.800 | 0.803 | 0.807 | 0.866 |
| | D4LA | 0.655 | 0.590 | 0.522 | 0.518 | 0.575 | 0.469 | DocBank | 0.704 | 0.534 | 0.498 | 0.489 | 0.573 | 0.687 |
| | DocLayNet | 0.748 | 0.581 | 0.530 | 0.528 | 0.605 | 0.613 | ICDAR2017-POD | 0.886 | 0.850 | 0.817 | 0.800 | 0.820 | 0.916 |
| | IIIT-AR-13K | 0.851 | 0.698 | 0.635 | 0.648 | 0.693 | 0.730 | M6Doc | 0.667 | 0.564 | 0.503 | 0.498 | 0.529 | 0.425 |
| | PubLayNet | 0.971 | 0.922 | 0.889 | 0.887 | 0.904 | 0.935 | RanLayNet | 0.936 | 0.895 | 0.847 | 0.843 | 0.841 | 0.879 |
| AHDS | CASIA-AHCDB-style1 | 0.968 | 0.955 | 0.926 | 0.915 | 0.943 | 0.956 | CASIA-AHCDB-style2 | 0.959 | 0.950 | 0.871 | 0.863 | 0.903 | 0.927 |
| | CHDAC-2022 | 0.881 | 0.752 | 0.631 | 0.599 | 0.649 | 0.910 | ICDAR2019-HDRC | 0.952 | 0.807 | 0.769 | 0.732 | 0.814 | 0.895 |
| | SCUT-CAB-physical | 0.925 | 0.862 | 0.791 | 0.773 | 0.850 | 0.929 | SCUT-CAB-logical | 0.737 | 0.576 | 0.517 | 0.508 | 0.559 | 0.523 |
| | MTHv2 | 0.934 | 0.839 | 0.695 | 0.674 | 0.741 | 0.913 | HJDataset | 0.941 | 0.891 | 0.859 | 0.853 | 0.825 | 0.813 |
| | CASIA-HWDB | 0.963 | 0.935 | 0.880 | 0.810 | 0.900 | 0.955 | SCUT-HCCDoc | 0.849 | 0.647 | 0.548 | 0.557 | 0.620 | 0.855 |
| TSR | FinTabNet | 0.886 | 0.819 | 0.749 | 0.724 | 0.825 | 0.878 | PubTabNet | 0.992 | 0.913 | 0.840 | 0.831 | 0.879 | 0.934 |
| | ICDAR2013 | 0.908 | 0.526 | 0.606 | 0.522 | 0.542 | 0.827 | ICDAR2017-POD | 0.935 | 0.861 | 0.774 | 0.752 | 0.794 | 0.896 |
| | cTDaR-modern | 0.880 | 0.591 | 0.646 | 0.613 | 0.648 | 0.825 | cTDaR-archival | 0.949 | 0.837 | 0.762 | 0.721 | 0.777 | 0.957 |
| | NTable-cam | 0.906 | 0.831 | 0.752 | 0.764 | 0.779 | 0.874 | NTable-gen | 0.951 | 0.930 | 0.899 | 0.900 | 0.940 | 0.944 |
| | PubTables-1M-TD | 0.988 | 0.956 | 0.943 | 0.912 | 0.951 | 0.955 | PubTables-1M-TSR | 0.936 | 0.892 | 0.833 | 0.826 | 0.856 | 0.873 |
| | TableBank-latex | 0.981 | 0.973 | 0.948 | 0.944 | 0.953 | 0.959 | TableBank-word | 0.922 | 0.890 | 0.893 | 0.891 | 0.868 | 0.872 |
| | TNCR | 0.717 | 0.676 | 0.664 | 0.662 | 0.753 | 0.592 | STDW | 0.948 | 0.932 | 0.921 | 0.915 | 0.935 | 0.968 |
| | WTW | 0.959 | 0.925 | 0.818 | 0.810 | 0.834 | 0.978 | | | | | | | |
| STD | CASIA-10k | 0.679 | 0.457 | 0.419 | 0.423 | 0.443 | 0.820 | COCO-Text | 0.542 | 0.260 | 0.276 | 0.281 | 0.316 | 0.643 |
| | CTW1500 | 0.848 | 0.638 | 0.544 | 0.550 | 0.610 | 0.846 | CTW-Public | 0.660 | 0.312 | 0.335 | 0.314 | 0.335 | 0.645 |
| | HUST-TR400 | 0.845 | 0.789 | 0.651 | 0.666 | 0.710 | 0.878 | ICDAR2015 | 0.700 | 0.338 | 0.362 | 0.375 | 0.401 | 0.643 |
| | ICDAR2017-RCTW | 0.638 | 0.311 | 0.331 | 0.346 | 0.389 | 0.810 | ICDAR2017-MLT | 0.714 | 0.514 | 0.457 | 0.460 | 0.491 | 0.840 |
| | ICDAR2019-ArT | 0.767 | 0.517 | 0.461 | 0.482 | 0.504 | 0.818 | ICDAR2019-LSVT | 0.635 | 0.404 | 0.378 | 0.383 | 0.422 | 0.827 |
| | ICDAR2019-MLT | 0.751 | 0.557 | 0.488 | 0.495 | 0.519 | 0.854 | ICDAR2019-ReCTS | 0.789 | 0.603 | 0.533 | 0.533 | 0.572 | 0.862 |
| | ICDAR2023-HierText | 0.651 | 0.387 | 0.371 | 0.370 | 0.396 | 0.683 | ICDAR2023-ReST | 0.970 | 0.917 | 0.843 | 0.913 | 0.746 | 0.873 |
| | ICPR2018-MTWI | 0.683 | 0.434 | 0.413 | 0.418 | 0.456 | 0.853 | MSRA-TD500 | 0.770 | 0.599 | 0.509 | 0.553 | 0.533 | 0.748 |
| | ShopSign | 0.604 | 0.254 | 0.292 | 0.301 | 0.357 | 0.820 | Total-Text | 0.796 | 0.502 | 0.462 | 0.486 | 0.504 | 0.807 |
| | USTB-SV1K | 0.840 | 0.390 | 0.430 | 0.417 | 0.484 | 0.719 | | | | | | | |

docsam_large_m6doc

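| Method | Object mAP | Object AP50 | Object AP75 | Instance mAP | Instance AP50 | Instance AP75 |
|---|---|---|---|---|---|---|
| Faster R-CNN | 0.490 | 0.678 | 0.572 | 0.478 | 0.678 | 0.552 |
| Mask R-CNN | 0.401 | 0.584 | 0.462 | 0.397 | 0.584 | 0.456 |
| Deformable DETR | 0.572 | 0.768 | 0.634 | 0.556 | 0.765 | 0.611 |
| ISTR | 0.627 | 0.808 | 0.708 | 0.620 | 0.807 | 0.702 |
| TransDLANet | 0.645 | 0.827 | 0.727 | 0.638 | 0.826 | 0.719 |
| DAT | 0.712 | -- | -- | 0.657 | -- | -- |
| DocSAM | 0.663 | 0.840 | 0.755 | 0.661 | 0.840 | 0.750 |
| DocSAM* | 0.664 | 0.833 | 0.754 | 0.667 | 0.835 | 0.757 |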

docsam_large_doclaynet

| Method | Caption | Footnote | Formula | List-item | Page-footer | Page-header | Picture | Section-header | Table | Text | Title | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 0.890 | 0.910 | 0.850 | 0.880 | 0.940 | 0.890 | 0.710 | 0.840 | 0.810 | 0.860 | 0.720 | 0.830 |
| Faster R-CNN | 0.701 | 0.737 | 0.635 | 0.810 | 0.589 | 0.720 | 0.720 | 0.684 | 0.822 | 0.854 | 0.799 | 0.734 |
| Mask R-CNN | 0.715 | 0.718 | 0.634 | 0.808 | 0.593 | 0.700 | 0.727 | 0.693 | 0.829 | 0.858 | 0.804 | 0.735 |
| YOLOv5 | 0.777 | 0.772 | 0.662 | 0.862 | 0.611 | 0.679 | 0.771 | 0.746 | 0.863 | 0.881 | 0.827 | 0.768 |
| Cascade Mask R-CNN | 0.732 | 0.753 | 0.669 | 0.839 | 0.617 | 0.713 | 0.750 | 0.701 | 0.859 | 0.871 | 0.815 | 0.756 |
| TransDLANet | 0.682 | 0.747 | 0.616 | 0.810 | 0.548 | 0.682 | 0.685 | 0.698 | 0.824 | 0.838 | 0.818 | 0.723 |
| SwinDocSegmenter | 0.836 | 0.648 | 0.623 | 0.823 | 0.651 | 0.664 | 0.847 | 0.665 | 0.874 | 0.882 | 0.633 | 0.769 |
| DINO | 0.718 | 0.788 | 0.727 | 0.856 | 0.630 | 0.766 | 0.741 | 0.721 | 0.873 | 0.876 | 0.851 | 0.777 |
| VSR | 0.726 | 0.721 | 0.738 | 0.862 | 0.818 | 0.813 | 0.631 | 0.825 | 0.794 | 0.884 | 0.807 | 0.784 |
| M2Doc (Cascade Mask R-CNN) | 0.860 | 0.836 | 0.871 | 0.928 | 0.867 | 0.856 | 0.763 | 0.891 | 0.864 | 0.927 | 0.878 | 0.867 |
| M2Doc (DINO) | 0.853 | 0.867 | 0.898 | 0.936 | 0.903 | 0.910 | 0.784 | 0.907 | 0.874 | 0.939 | 0.913 | 0.890 |
| DocSAM | 0.740 | 0.703 | 0.646 | 0.822 | 0.569 | 0.668 | 0.730 | 0.702 | 0.843 | 0.853 | 0.739 | 0.729 |

docsam_large_scut_cab

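| Layout | Method | Object mAP | Object AP50 | Object AP75 | Instance mAP | Instance AP50 | Instance AP75 |
|---|---|---|---|---|---|---|---|
| Physical | Faster R-CNN | 0.775 | 0.913 | 0.861 | 0.753 | 0.910 | 0.834 |
| Physical | Mask R-CNN | 0.791 | 0.921 | 0.877 | 0.795 | 0.917 | 0.872 |
| Physical | SCNet | 0.813 | 0.941 | 0.890 | 0.820 | 0.941 | 0.891 |
| Physical | Deformable DETR | 0.799 | 0.923 | 0.871 | 0.779 | 0.921 | 0.843 |
| Physical | VSR | 0.787 | 0.919 | 0.860 | 0.787 | 0.919 | 0.852 |
| Physical | DocSAM | 0.774 | 0.947 | 0.860 | 0.811 | 0.948 | 0.891 |
| Physical | DocSAM* | 0.829 | 0.944 | 0.902 | 0.837 | 0.944 | 0.908 |
| Logical | Faster R-CNN | 0.549 | 0.774 | 0.613 | 0.542 | 0.773 | 0.606 |
| Logical | Mask R-CNN | 0.551 | 0.785 | 0.619 | 0.553 | 0.777 | 0.631 |
| Logical | SCNet | 0.602 | 0.836 | 0.673 | 0.603 | 0.836 | 0.680 |
| Logical | Deformable DETR | 0.627 | 0.852 | 0.717 | 0.620 | 0.851 | 0.703 |
| Logical | VSR | 0.557 | 0.783 | 0.616 | 0.551 | 0.782 | 0.611 |
| Logical | DocSAM | 0.548 | 0.769 | 0.632 | 0.575 | 0.779 | 0.667 |
| Logical | DocSAM* | 0.588 | 0.780 | 0.660 | 0.592 | 0.781 | 0.671 |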

docsam_large_ctw1500 and docsam_large_totaltext

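| Method | CTW1500 P | CTW1500 R | CTW1500 F | Total-Text P | Total-Text R | Total-Text F |
|---|---|---|---|---|---|---|
| HierText | 0.846 | 0.874 | 0.860 | 0.855 | 0.905 | 0.879 |
| SIR | 0.874 | 0.837 | 0.855 | 0.909 | 0.856 | 0.882 |
| DPText-DETR | 0.917 | 0.862 | 0.888 | 0.918 | 0.864 | 0.890 |
| UNITS | -- | -- | -- | -- | -- | 0.898 |
| ESTextSpotter | 0.915 | 0.886 | 0.900 | 0.920 | 0.881 | 0.900 |
| DAT-DET | 0.893 | 0.893 | 0.893 | 0.940 | 0.882 | 0.910 |
| DAT-SEG | 0.925 | 0.909 | 0.917 | 0.950 | 0.892 | 0.920 |
| DocSAM | 0.805 | 0.881 | 0.842 | 0.721 | 0.826 | 0.770 |
| DocSAM* | 0.839 | 0.880 | 0.859 | 0.778 | 0.868 | 0.820 |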

Analysis and Findings

Due to GPU memory limitations, DocSAM relies on cropping during training, testing, and inference. We add a whole-image thumbnail to recover large objects that may be fragmented across patches, and we use a score re-weighting strategy to merge segmentation results from different patches. Even so, patch-based processing may still be inferior to processing entire images directly.
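To illustrate why merging matters, the sketch below combines per-pixel scores from overlapping patches by plain averaging. This is a simplified stand-in and not the score re-weighting strategy used by DocSAM, whose details are not given here; the function name and the data layout are assumptions made for the example.

```python
def merge_patch_scores(width, height, patches):
    """Average per-pixel scores from overlapping patches.

    `patches` is a list of (left, top, scores), where `scores` is a list of
    rows holding the patch's per-pixel scores. Pixels covered by no patch
    get a score of 0.0.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for left, top, scores in patches:
        for dy, row in enumerate(scores):
            for dx, value in enumerate(row):
                total[top + dy][left + dx] += value
                count[top + dy][left + dx] += 1
    return [
        [t / c if c else 0.0 for t, c in zip(total_row, count_row)]
        for total_row, count_row in zip(total, count)
    ]


# Two 1x2 patches overlapping on the middle pixel of a 1x3 image.
merged = merge_patch_scores(3, 1, [(0, 0, [[1.0, 1.0]]), (1, 0, [[0.0, 0.0]])])
print(merged)
```

In the overlap, the two patches disagree and the average lands halfway between them. A re-weighting scheme would instead decide how much to trust each patch at each location.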

Moreover, because of the use of cropping, different resizing strategies may significantly influence performance on different datasets. This can be observed in the results of docsam_large_all_dataset and docsam_large_all_dataset_keepsize. Compared to docsam_large_all_dataset, which resizes all images to a fixed shorter side of 800 pixels, docsam_large_all_dataset_keepsize attempts to maintain the native resolution of documents while ensuring the shorter side falls within the range of 640 to 1280 pixels.

Keeping the native resolution may be beneficial for high-resolution documents containing many small objects, as well as for low-resolution documents with large objects. However, in other scenarios, such as high-resolution documents containing large objects, it may be harmful because it could exacerbate the fragmentation problem.

Moreover, the current version of DocSAM is purely single-modal. Although it performs comparably to, or even better than, previous single-modal methods, it still performs significantly worse than multi-modal models, especially on tasks such as logical layout analysis, which require distinguishing between fine-grained text classes.

In future work, we plan to optimize the model's memory usage and reduce computational costs, design more reasonable resizing and cropping strategies, and extend DocSAM to a multi-modal version, thereby further improving its performance and efficiency.