Model Zoo of DocSAM
April 8, 2025
Model Zoo
As an all-in-one model, DocSAM was pre-trained on nearly 50 datasets. We provide pre-trained weights for two model variants: DocSAM-Base and DocSAM-Large, with their visual backbones based on Swin-Base and Swin-Large, respectively. Additionally, we offer several specialized models fine-tuned on specific datasets. Information and download links for these pre-trained models are listed in the table below.
| Name | Backbone | #Parameter | BatchSize | #Iterations | PatchSize | Download |
|---|---|---|---|---|---|---|
| docsam_base_all_dataset | swin-base | 208M | 32 | 360k | 640,640 | model |
| docsam_large_all_dataset | swin-large | 317M | 32 | 360k | 640,640 | model |
| docsam_large_all_dataset_keepsize | swin-large | 317M | 32 | 360k | 640,640 | model |
| docsam_large_m6doc | swin-large | 317M | 16 | 40k | 800,800 | model |
| docsam_large_doclaynet | swin-large | 317M | 16 | 40k | 800,800 | model |
| docsam_large_scut_cab | swin-large | 317M | 16 | 40k | 800,800 | model |
| docsam_large_ctw1500 | swin-large | 317M | 16 | 40k | 800,800 | model |
| docsam_large_totaltext | swin-large | 317M | 16 | 40k | 800,800 | model |
Notice
By default, during training, we randomly resize the input images so that their shorter sides range between 704 and 896 pixels. After resizing, we randomly crop the images into patches of size 640×640 pixels. During testing and inference, we uniformly resize the input images to have a shorter side of 800 pixels and perform predictions using a sliding window approach. The window size is set to 640×640 pixels, with strides of 320 pixels in both the horizontal and vertical directions.
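The test-time procedure above can be sketched in a few lines of plain Python. This is only an illustration, not the repository's code: the function names are hypothetical, and the handling of the last window (shifting it back so it ends at the image border) is an assumption, since the text does not say how borders are treated.

```python
def resize_shorter_side(width, height, target=800):
    """Scale (width, height) so the shorter side equals `target`, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def window_starts(length, window=640, stride=320):
    """Start offsets of the sliding windows along one axis.

    Assumption: if the regular stride does not reach the border, one extra
    window is shifted back so that it ends exactly at the border.
    """
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        starts.append(length - window)
    return starts


def sliding_windows(width, height, window=640, stride=320):
    """All (left, top, right, bottom) crops for an image of the given size."""
    return [
        (x, y, min(x + window, width), min(y + window, height))
        for y in window_starts(height, window, stride)
        for x in window_starts(width, window, stride)
    ]


# Hypothetical input: a 1700x2200 page (US Letter scanned at 200 dpi).
new_w, new_h = resize_shorter_side(1700, 2200)
boxes = sliding_windows(new_w, new_h)
print(new_w, new_h, len(boxes))  # 800 1035 6
```

With a stride of half the window size, every interior pixel is covered by more than one window, which is why the per-patch predictions need to be merged afterwards.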
Additionally, we provide a model called docsam_large_all_dataset_keepsize, which uses a different resizing method. This model retains the original sizes of the input images as much as possible, ensuring that the shorter sides are within the range of 640 to 1280 pixels. Other settings are kept the same as docsam_large_all_dataset.
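One plausible reading of this "keep size" rule is a clamp on the shorter side: leave the image untouched when its shorter side already lies in [640, 1280], and otherwise scale it to the nearest bound. The sketch below implements that reading; the function names are hypothetical and the exact rule used by the released model may differ.

```python
def keepsize_scale(width, height, min_side=640, max_side=1280):
    """Scale factor that clamps the shorter side into [min_side, max_side]."""
    shorter = min(width, height)
    target = min(max(shorter, min_side), max_side)
    return target / shorter


def keepsize_resize(width, height, min_side=640, max_side=1280):
    """Resized (width, height) under the clamping rule, aspect ratio preserved."""
    scale = keepsize_scale(width, height, min_side, max_side)
    return round(width * scale), round(height * scale)


print(keepsize_resize(1000, 1400))  # already in range, unchanged
print(keepsize_resize(400, 600))    # upscaled so the shorter side is 640
print(keepsize_resize(2480, 3508))  # downscaled so the shorter side is 1280
```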
For fine-tuning on the M6Doc, DocLayNet, SCUT-CAB, CTW1500, and Total-Text datasets, we adopt a patch size of 800×800 pixels. This approach aims to prevent regions of interest from being fragmented across different patches.
The original DocLayNet dataset resizes all images to a fixed size of 1025×1025 pixels, which may distort the aspect ratios of the documents. To better preserve the original aspect ratios, we have resized the images to 1025×1449 pixels before sending them to our model.
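If the annotations are expressed in the 1025×1025 frame, stretching the images to 1025×1449 means the box coordinates have to be rescaled as well. The snippet below is a minimal sketch of that arithmetic, assuming COCO-style `(x, y, w, h)` boxes; the function name is hypothetical and nothing here is taken from the DocSAM preprocessing code.

```python
SRC_SIZE = (1025, 1025)  # (width, height) of the distributed DocLayNet images
DST_SIZE = (1025, 1449)  # (width, height) after restoring a page-like aspect ratio


def rescale_box(box, src=SRC_SIZE, dst=DST_SIZE):
    """Map a COCO-style (x, y, w, h) box from the `src` frame to the `dst` frame."""
    sx = dst[0] / src[0]
    sy = dst[1] / src[1]
    x, y, w, h = box
    return (x * sx, y * sy, w * sx, h * sy)


# Widths are unchanged; vertical coordinates grow by 1449 / 1025 (about 1.414).
print(rescale_box((100, 205, 50, 410)))
```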
Model Performance
The performance of the aforementioned models is shown in the following tables. For more details about the datasets listed in these tables, please refer to the original paper.
Please note that the pretrained models we provide here, such as docsam_base_all_dataset, docsam_large_all_dataset, and docsam_large_all_dataset_keepsize, are trained with more iterations (360k). As a result, their performance is actually higher than what is reported in our original paper.
Additionally, the dataset-specific models, including docsam_large_m6doc, docsam_large_doclaynet, docsam_large_scut_cab, docsam_large_ctw1500, and docsam_large_totaltext, are fine-tuned from the above docsam_large_all_dataset model. Consequently, their performance (DocSAM*) is also higher than what was reported in the original paper (DocSAM).
docsam_base_all_dataset
For each dataset, the metric columns list the instance-level scores first, followed by the semantic-level scores.

| Task | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DLA | BaDLAD | 0.686 | 0.480 | 0.464 | 0.478 | 0.545 | 0.686 | CDLA | 0.935 | 0.871 | 0.774 | 0.772 | 0.776 | 0.832 |
| | D4LA | 0.654 | 0.585 | 0.515 | 0.509 | 0.552 | 0.486 | DocBank | 0.638 | 0.478 | 0.442 | 0.425 | 0.519 | 0.638 |
| | DocLayNet | 0.753 | 0.595 | 0.541 | 0.529 | 0.590 | 0.650 | ICDAR2017-POD | 0.892 | 0.846 | 0.803 | 0.788 | 0.796 | 0.920 |
| | IIIT-AR-13K | 0.828 | 0.647 | 0.599 | 0.602 | 0.655 | 0.631 | M6Doc | 0.570 | 0.462 | 0.413 | 0.407 | 0.438 | 0.328 |
| | PubLayNet | 0.951 | 0.900 | 0.848 | 0.840 | 0.884 | 0.918 | RanLayNet | 0.948 | 0.897 | 0.845 | 0.835 | 0.872 | 0.905 |
| AHDS | CASIA-AHCDB-style1 | 0.961 | 0.936 | 0.874 | 0.853 | 0.899 | 0.948 | CASIA-AHCDB-style2 | 0.940 | 0.922 | 0.837 | 0.825 | 0.880 | 0.924 |
| | CHDAC-2022 | 0.872 | 0.730 | 0.613 | 0.560 | 0.632 | 0.908 | ICDAR2019-HDRC | 0.951 | 0.820 | 0.767 | 0.722 | 0.822 | 0.912 |
| | SCUT-CAB-physical | 0.944 | 0.873 | 0.793 | 0.775 | 0.827 | 0.939 | SCUT-CAB-logical | 0.773 | 0.647 | 0.562 | 0.549 | 0.561 | 0.488 |
| | MTHv2 | 0.945 | 0.857 | 0.718 | 0.687 | 0.736 | 0.916 | HJDataset | 0.966 | 0.932 | 0.898 | 0.888 | 0.904 | 0.822 |
| | CASIA-HWDB | 0.971 | 0.901 | 0.845 | 0.772 | 0.876 | 0.951 | SCUT-HCCDoc | 0.845 | 0.657 | 0.548 | 0.552 | 0.597 | 0.854 |
| TSR | FinTabNet | 0.879 | 0.802 | 0.712 | 0.697 | 0.797 | 0.867 | PubTabNet | 0.981 | 0.903 | 0.829 | 0.821 | 0.869 | 0.927 |
| | ICDAR2013 | 0.948 | 0.525 | 0.602 | 0.525 | 0.550 | 0.846 | ICDAR2017-POD | 0.942 | 0.847 | 0.766 | 0.739 | 0.796 | 0.895 |
| | cTDaR-modern | 0.916 | 0.561 | 0.636 | 0.594 | 0.692 | 0.879 | cTDaR-archival | 0.925 | 0.770 | 0.713 | 0.659 | 0.733 | 0.959 |
| | NTable-cam | 0.851 | 0.749 | 0.673 | 0.680 | 0.718 | 0.865 | NTable-gen | 0.946 | 0.914 | 0.852 | 0.849 | 0.896 | 0.945 |
| | PubTables-1M-TD | 0.982 | 0.953 | 0.910 | 0.867 | 0.945 | 0.968 | PubTables-1M-TSR | 0.902 | 0.821 | 0.752 | 0.734 | 0.794 | 0.836 |
| | TableBank-latex | 0.968 | 0.950 | 0.923 | 0.906 | 0.939 | 0.952 | TableBank-word | 0.874 | 0.841 | 0.822 | 0.823 | 0.852 | 0.857 |
| | TNCR | 0.704 | 0.650 | 0.629 | 0.613 | 0.658 | 0.555 | STDW | 0.960 | 0.933 | 0.913 | 0.887 | 0.925 | 0.974 |
| | WTW | 0.948 | 0.905 | 0.803 | 0.786 | 0.812 | 0.974 | | | | | | | |
| STD | CASIA-10k | 0.638 | 0.411 | 0.383 | 0.380 | 0.413 | 0.799 | COCO-Text | 0.519 | 0.248 | 0.264 | 0.270 | 0.292 | 0.619 |
| | CTW1500 | 0.809 | 0.558 | 0.494 | 0.492 | 0.565 | 0.831 | CTW-Public | 0.369 | 0.124 | 0.167 | 0.139 | 0.199 | 0.540 |
| | HUST-TR400 | 0.811 | 0.680 | 0.600 | 0.605 | 0.643 | 0.859 | ICDAR2015 | 0.657 | 0.316 | 0.337 | 0.350 | 0.364 | 0.635 |
| | ICDAR2017-RCTW | 0.585 | 0.295 | 0.304 | 0.312 | 0.359 | 0.803 | ICDAR2017-MLT | 0.669 | 0.468 | 0.418 | 0.415 | 0.457 | 0.835 |
| | ICDAR2019-ArT | 0.733 | 0.471 | 0.428 | 0.447 | 0.466 | 0.802 | ICDAR2019-LSVT | 0.597 | 0.372 | 0.351 | 0.352 | 0.397 | 0.814 |
| | ICDAR2019-MLT | 0.702 | 0.498 | 0.445 | 0.443 | 0.480 | 0.846 | ICDAR2019-ReCTS | 0.739 | 0.553 | 0.490 | 0.485 | 0.523 | 0.845 |
| | ICDAR2023-HierText | 0.594 | 0.333 | 0.327 | 0.316 | 0.359 | 0.671 | ICDAR2023-ReST | 0.944 | 0.900 | 0.771 | 0.850 | 0.486 | 0.833 |
| | ICPR2018-MTWI | 0.644 | 0.394 | 0.379 | 0.383 | 0.424 | 0.838 | MSRA-TD500 | 0.825 | 0.628 | 0.542 | 0.581 | 0.558 | 0.767 |
| | ShopSign | 0.629 | 0.279 | 0.311 | 0.322 | 0.365 | 0.817 | Total-Text | 0.773 | 0.470 | 0.433 | 0.451 | 0.474 | 0.781 |
| | USTB-SV1K | 0.826 | 0.379 | 0.425 | 0.412 | 0.467 | 0.714 | | | | | | | |
docsam_large_all_dataset
For each dataset, the metric columns list the instance-level scores first, followed by the semantic-level scores.

| Task | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DLA | BaDLAD | 0.713 | 0.504 | 0.488 | 0.497 | 0.562 | 0.694 | CDLA | 0.955 | 0.910 | 0.818 | 0.820 | 0.822 | 0.885 |
| | D4LA | 0.722 | 0.662 | 0.581 | 0.576 | 0.600 | 0.560 | DocBank | 0.692 | 0.553 | 0.512 | 0.501 | 0.585 | 0.702 |
| | DocLayNet | 0.826 | 0.697 | 0.628 | 0.621 | 0.675 | 0.725 | ICDAR2017-POD | 0.916 | 0.874 | 0.841 | 0.830 | 0.824 | 0.931 |
| | IIIT-AR-13K | 0.887 | 0.754 | 0.674 | 0.680 | 0.716 | 0.776 | M6Doc | 0.682 | 0.590 | 0.519 | 0.512 | 0.538 | 0.467 |
| | PubLayNet | 0.965 | 0.921 | 0.876 | 0.877 | 0.900 | 0.931 | RanLayNet | 0.950 | 0.925 | 0.873 | 0.869 | 0.863 | 0.882 |
| AHDS | CASIA-AHCDB-style1 | 0.977 | 0.965 | 0.914 | 0.899 | 0.934 | 0.955 | CASIA-AHCDB-style2 | 0.952 | 0.938 | 0.865 | 0.846 | 0.900 | 0.930 |
| | CHDAC-2022 | 0.917 | 0.812 | 0.673 | 0.637 | 0.697 | 0.916 | ICDAR2019-HDRC | 0.955 | 0.830 | 0.777 | 0.728 | 0.829 | 0.919 |
| | SCUT-CAB-physical | 0.943 | 0.888 | 0.812 | 0.792 | 0.849 | 0.947 | SCUT-CAB-logical | 0.786 | 0.675 | 0.583 | 0.567 | 0.595 | 0.548 |
| | MTHv2 | 0.956 | 0.878 | 0.731 | 0.706 | 0.756 | 0.919 | HJDataset | 0.986 | 0.945 | 0.916 | 0.908 | 0.918 | 0.823 |
| | CASIA-HWDB | 0.976 | 0.934 | 0.879 | 0.798 | 0.900 | 0.956 | SCUT-HCCDoc | 0.877 | 0.684 | 0.574 | 0.581 | 0.640 | 0.860 |
| TSR | FinTabNet | 0.887 | 0.818 | 0.734 | 0.716 | 0.817 | 0.877 | PubTabNet | 0.992 | 0.917 | 0.839 | 0.833 | 0.878 | 0.932 |
| | ICDAR2013 | 0.919 | 0.532 | 0.598 | 0.517 | 0.543 | 0.844 | ICDAR2017-POD | 0.944 | 0.855 | 0.774 | 0.750 | 0.797 | 0.899 |
| | cTDaR-modern | 0.932 | 0.593 | 0.662 | 0.617 | 0.713 | 0.886 | cTDaR-archival | 0.922 | 0.793 | 0.734 | 0.679 | 0.756 | 0.965 |
| | NTable-cam | 0.906 | 0.838 | 0.757 | 0.772 | 0.784 | 0.892 | NTable-gen | 0.961 | 0.933 | 0.905 | 0.906 | 0.941 | 0.964 |
| | PubTables-1M-TD | 0.985 | 0.935 | 0.896 | 0.879 | 0.939 | 0.977 | PubTables-1M-TSR | 0.922 | 0.856 | 0.793 | 0.781 | 0.825 | 0.859 |
| | TableBank-latex | 0.975 | 0.965 | 0.939 | 0.938 | 0.953 | 0.959 | TableBank-word | 0.893 | 0.859 | 0.858 | 0.856 | 0.868 | 0.875 |
| | TNCR | 0.719 | 0.677 | 0.656 | 0.649 | 0.693 | 0.642 | STDW | 0.969 | 0.945 | 0.930 | 0.921 | 0.942 | 0.972 |
| | WTW | 0.960 | 0.927 | 0.824 | 0.810 | 0.838 | 0.978 | | | | | | | |
| STD | CASIA-10k | 0.679 | 0.453 | 0.416 | 0.415 | 0.444 | 0.819 | COCO-Text | 0.577 | 0.282 | 0.297 | 0.306 | 0.324 | 0.655 |
| | CTW1500 | 0.809 | 0.581 | 0.503 | 0.505 | 0.586 | 0.843 | CTW-Public | 0.484 | 0.161 | 0.210 | 0.179 | 0.230 | 0.615 |
| | HUST-TR400 | 0.861 | 0.769 | 0.654 | 0.661 | 0.701 | 0.863 | ICDAR2015 | 0.710 | 0.356 | 0.370 | 0.385 | 0.400 | 0.658 |
| | ICDAR2017-RCTW | 0.624 | 0.315 | 0.328 | 0.339 | 0.383 | 0.811 | ICDAR2017-MLT | 0.705 | 0.506 | 0.451 | 0.450 | 0.492 | 0.857 |
| | ICDAR2019-ArT | 0.774 | 0.509 | 0.462 | 0.484 | 0.507 | 0.814 | ICDAR2019-LSVT | 0.641 | 0.409 | 0.381 | 0.386 | 0.430 | 0.827 |
| | ICDAR2019-MLT | 0.746 | 0.547 | 0.485 | 0.484 | 0.523 | 0.867 | ICDAR2019-ReCTS | 0.780 | 0.602 | 0.530 | 0.527 | 0.565 | 0.862 |
| | ICDAR2023-HierText | 0.630 | 0.365 | 0.354 | 0.342 | 0.390 | 0.694 | ICDAR2023-ReST | 0.956 | 0.930 | 0.831 | 0.906 | 0.727 | 0.860 |
| | ICPR2018-MTWI | 0.682 | 0.429 | 0.410 | 0.415 | 0.453 | 0.853 | MSRA-TD500 | 0.857 | 0.654 | 0.564 | 0.510 | 0.601 | 0.774 |
| | ShopSign | 0.660 | 0.321 | 0.342 | 0.357 | 0.411 | 0.832 | Total-Text | 0.788 | 0.509 | 0.456 | 0.475 | 0.508 | 0.798 |
| | USTB-SV1K | 0.838 | 0.416 | 0.438 | 0.426 | 0.493 | 0.722 | | | | | | | |
docsam_large_all_dataset_keepsize
For each dataset, the metric columns list the instance-level scores first, followed by the semantic-level scores.

| Task | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU | Dataset | AP50 | AP75 | mAP | mAPb | mAF | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DLA | BaDLAD | 0.696 | 0.494 | 0.477 | 0.487 | 0.547 | 0.700 | CDLA | 0.939 | 0.890 | 0.800 | 0.803 | 0.807 | 0.866 |
| | D4LA | 0.655 | 0.590 | 0.522 | 0.518 | 0.575 | 0.469 | DocBank | 0.704 | 0.534 | 0.498 | 0.489 | 0.573 | 0.687 |
| | DocLayNet | 0.748 | 0.581 | 0.530 | 0.528 | 0.605 | 0.613 | ICDAR2017-POD | 0.886 | 0.850 | 0.817 | 0.800 | 0.820 | 0.916 |
| | IIIT-AR-13K | 0.851 | 0.698 | 0.635 | 0.648 | 0.693 | 0.730 | M6Doc | 0.667 | 0.564 | 0.503 | 0.498 | 0.529 | 0.425 |
| | PubLayNet | 0.971 | 0.922 | 0.889 | 0.887 | 0.904 | 0.935 | RanLayNet | 0.936 | 0.895 | 0.847 | 0.843 | 0.841 | 0.879 |
| AHDS | CASIA-AHCDB-style1 | 0.968 | 0.955 | 0.926 | 0.915 | 0.943 | 0.956 | CASIA-AHCDB-style2 | 0.959 | 0.950 | 0.871 | 0.863 | 0.903 | 0.927 |
| | CHDAC-2022 | 0.881 | 0.752 | 0.631 | 0.599 | 0.649 | 0.910 | ICDAR2019-HDRC | 0.952 | 0.807 | 0.769 | 0.732 | 0.814 | 0.895 |
| | SCUT-CAB-physical | 0.925 | 0.862 | 0.791 | 0.773 | 0.850 | 0.929 | SCUT-CAB-logical | 0.737 | 0.576 | 0.517 | 0.508 | 0.559 | 0.523 |
| | MTHv2 | 0.934 | 0.839 | 0.695 | 0.674 | 0.741 | 0.913 | HJDataset | 0.941 | 0.891 | 0.859 | 0.853 | 0.825 | 0.813 |
| | CASIA-HWDB | 0.963 | 0.935 | 0.880 | 0.810 | 0.900 | 0.955 | SCUT-HCCDoc | 0.849 | 0.647 | 0.548 | 0.557 | 0.620 | 0.855 |
| TSR | FinTabNet | 0.886 | 0.819 | 0.749 | 0.724 | 0.825 | 0.878 | PubTabNet | 0.992 | 0.913 | 0.840 | 0.831 | 0.879 | 0.934 |
| | ICDAR2013 | 0.908 | 0.526 | 0.606 | 0.522 | 0.542 | 0.827 | ICDAR2017-POD | 0.935 | 0.861 | 0.774 | 0.752 | 0.794 | 0.896 |
| | cTDaR-modern | 0.880 | 0.591 | 0.646 | 0.613 | 0.648 | 0.825 | cTDaR-archival | 0.949 | 0.837 | 0.762 | 0.721 | 0.777 | 0.957 |
| | NTable-cam | 0.906 | 0.831 | 0.752 | 0.764 | 0.779 | 0.874 | NTable-gen | 0.951 | 0.930 | 0.899 | 0.900 | 0.940 | 0.944 |
| | PubTables-1M-TD | 0.988 | 0.956 | 0.943 | 0.912 | 0.951 | 0.955 | PubTables-1M-TSR | 0.936 | 0.892 | 0.833 | 0.826 | 0.856 | 0.873 |
| | TableBank-latex | 0.981 | 0.973 | 0.948 | 0.944 | 0.953 | 0.959 | TableBank-word | 0.922 | 0.890 | 0.893 | 0.891 | 0.868 | 0.872 |
| | TNCR | 0.717 | 0.676 | 0.664 | 0.662 | 0.753 | 0.592 | STDW | 0.948 | 0.932 | 0.921 | 0.915 | 0.935 | 0.968 |
| | WTW | 0.959 | 0.925 | 0.818 | 0.810 | 0.834 | 0.978 | | | | | | | |
| STD | CASIA-10k | 0.679 | 0.457 | 0.419 | 0.423 | 0.443 | 0.820 | COCO-Text | 0.542 | 0.260 | 0.276 | 0.281 | 0.316 | 0.643 |
| | CTW1500 | 0.848 | 0.638 | 0.544 | 0.550 | 0.610 | 0.846 | CTW-Public | 0.660 | 0.312 | 0.335 | 0.314 | 0.335 | 0.645 |
| | HUST-TR400 | 0.845 | 0.789 | 0.651 | 0.666 | 0.710 | 0.878 | ICDAR2015 | 0.700 | 0.338 | 0.362 | 0.375 | 0.401 | 0.643 |
| | ICDAR2017-RCTW | 0.638 | 0.311 | 0.331 | 0.346 | 0.389 | 0.810 | ICDAR2017-MLT | 0.714 | 0.514 | 0.457 | 0.460 | 0.491 | 0.840 |
| | ICDAR2019-ArT | 0.767 | 0.517 | 0.461 | 0.482 | 0.504 | 0.818 | ICDAR2019-LSVT | 0.635 | 0.404 | 0.378 | 0.383 | 0.422 | 0.827 |
| | ICDAR2019-MLT | 0.751 | 0.557 | 0.488 | 0.495 | 0.519 | 0.854 | ICDAR2019-ReCTS | 0.789 | 0.603 | 0.533 | 0.533 | 0.572 | 0.862 |
| | ICDAR2023-HierText | 0.651 | 0.387 | 0.371 | 0.370 | 0.396 | 0.683 | ICDAR2023-ReST | 0.970 | 0.917 | 0.843 | 0.913 | 0.746 | 0.873 |
| | ICPR2018-MTWI | 0.683 | 0.434 | 0.413 | 0.418 | 0.456 | 0.853 | MSRA-TD500 | 0.770 | 0.599 | 0.509 | 0.553 | 0.533 | 0.748 |
| | ShopSign | 0.604 | 0.254 | 0.292 | 0.301 | 0.357 | 0.820 | Total-Text | 0.796 | 0.502 | 0.462 | 0.486 | 0.504 | 0.807 |
| | USTB-SV1K | 0.840 | 0.390 | 0.430 | 0.417 | 0.484 | 0.719 | | | | | | | |
docsam_large_m6doc
| Method | Object mAP | Object AP50 | Object AP75 | Instance mAP | Instance AP50 | Instance AP75 |
|---|---|---|---|---|---|---|
| Faster R-CNN | 0.490 | 0.678 | 0.572 | 0.478 | 0.678 | 0.552 |
| Mask R-CNN | 0.401 | 0.584 | 0.462 | 0.397 | 0.584 | 0.456 |
| Deformable DETR | 0.572 | 0.768 | 0.634 | 0.556 | 0.765 | 0.611 |
| ISTR | 0.627 | 0.808 | 0.708 | 0.620 | 0.807 | 0.702 |
| TransDLANet | 0.645 | 0.827 | 0.727 | 0.638 | 0.826 | 0.719 |
| DAT | 0.712 | -- | -- | 0.657 | -- | -- |
| DocSAM | 0.663 | 0.840 | 0.755 | 0.661 | 0.840 | 0.750 |
| DocSAM* | 0.664 | 0.833 | 0.754 | 0.667 | 0.835 | 0.757 |
docsam_large_doclaynet
| Method | Caption | Footnote | Formula | List-item | Page-footer | Page-header | Picture | Section-header | Table | Text | Title | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 0.890 | 0.910 | 0.850 | 0.880 | 0.940 | 0.890 | 0.710 | 0.840 | 0.810 | 0.860 | 0.720 | 0.830 |
| Faster R-CNN | 0.701 | 0.737 | 0.635 | 0.810 | 0.589 | 0.720 | 0.720 | 0.684 | 0.822 | 0.854 | 0.799 | 0.734 |
| Mask R-CNN | 0.715 | 0.718 | 0.634 | 0.808 | 0.593 | 0.700 | 0.727 | 0.693 | 0.829 | 0.858 | 0.804 | 0.735 |
| YOLOv5 | 0.777 | 0.772 | 0.662 | 0.862 | 0.611 | 0.679 | 0.771 | 0.746 | 0.863 | 0.881 | 0.827 | 0.768 |
| Cascade Mask R-CNN | 0.732 | 0.753 | 0.669 | 0.839 | 0.617 | 0.713 | 0.750 | 0.701 | 0.859 | 0.871 | 0.815 | 0.756 |
| TransDLANet | 0.682 | 0.747 | 0.616 | 0.810 | 0.548 | 0.682 | 0.685 | 0.698 | 0.824 | 0.838 | 0.818 | 0.723 |
| SwinDocSegmenter | 0.836 | 0.648 | 0.623 | 0.823 | 0.651 | 0.664 | 0.847 | 0.665 | 0.874 | 0.882 | 0.633 | 0.769 |
| DINO | 0.718 | 0.788 | 0.727 | 0.856 | 0.630 | 0.766 | 0.741 | 0.721 | 0.873 | 0.876 | 0.851 | 0.777 |
| VSR | 0.726 | 0.721 | 0.738 | 0.862 | 0.818 | 0.813 | 0.631 | 0.825 | 0.794 | 0.884 | 0.807 | 0.784 |
| M2Doc (Cascade Mask R-CNN) | 0.860 | 0.836 | 0.871 | 0.928 | 0.867 | 0.856 | 0.763 | 0.891 | 0.864 | 0.927 | 0.878 | 0.867 |
| M2Doc (DINO) | 0.853 | 0.867 | 0.898 | 0.936 | 0.903 | 0.910 | 0.784 | 0.907 | 0.874 | 0.939 | 0.913 | 0.890 |
| DocSAM | 0.740 | 0.703 | 0.646 | 0.822 | 0.569 | 0.668 | 0.730 | 0.702 | 0.843 | 0.853 | 0.739 | 0.729 |
docsam_large_scut_cab
| Method | Object mAP | Object AP50 | Object AP75 | Instance mAP | Instance AP50 | Instance AP75 |
|---|---|---|---|---|---|---|
| Physical | | | | | | |
| Faster R-CNN | 0.775 | 0.913 | 0.861 | 0.753 | 0.910 | 0.834 |
| Mask R-CNN | 0.791 | 0.921 | 0.877 | 0.795 | 0.917 | 0.872 |
| SCNet | 0.813 | 0.941 | 0.890 | 0.820 | 0.941 | 0.891 |
| Deformable DETR | 0.799 | 0.923 | 0.871 | 0.779 | 0.921 | 0.843 |
| VSR | 0.787 | 0.919 | 0.860 | 0.787 | 0.919 | 0.852 |
| DocSAM | 0.774 | 0.947 | 0.860 | 0.811 | 0.948 | 0.891 |
| DocSAM* | 0.829 | 0.944 | 0.902 | 0.837 | 0.944 | 0.908 |
| Logical | | | | | | |
| Faster R-CNN | 0.549 | 0.774 | 0.613 | 0.542 | 0.773 | 0.606 |
| Mask R-CNN | 0.551 | 0.785 | 0.619 | 0.553 | 0.777 | 0.631 |
| SCNet | 0.602 | 0.836 | 0.673 | 0.603 | 0.836 | 0.680 |
| Deformable DETR | 0.627 | 0.852 | 0.717 | 0.620 | 0.851 | 0.703 |
| VSR | 0.557 | 0.783 | 0.616 | 0.551 | 0.782 | 0.611 |
| DocSAM | 0.548 | 0.769 | 0.632 | 0.575 | 0.779 | 0.667 |
| DocSAM* | 0.588 | 0.780 | 0.660 | 0.592 | 0.781 | 0.671 |
docsam_large_ctw1500 and docsam_large_totaltext
| Method | CTW1500 P | CTW1500 R | CTW1500 F | Total-Text P | Total-Text R | Total-Text F |
|---|---|---|---|---|---|---|
| HierText | 0.846 | 0.874 | 0.860 | 0.855 | 0.905 | 0.879 |
| SIR | 0.874 | 0.837 | 0.855 | 0.909 | 0.856 | 0.882 |
| DPText-DETR | 0.917 | 0.862 | 0.888 | 0.918 | 0.864 | 0.890 |
| UNITS | -- | -- | -- | -- | -- | 0.898 |
| ESTextSpotter | 0.915 | 0.886 | 0.900 | 0.920 | 0.881 | 0.900 |
| DAT-DET | 0.893 | 0.893 | 0.893 | 0.940 | 0.882 | 0.910 |
| DAT-SEG | 0.925 | 0.909 | 0.917 | 0.950 | 0.892 | 0.920 |
| DocSAM | 0.805 | 0.881 | 0.842 | 0.721 | 0.826 | 0.770 |
| DocSAM* | 0.839 | 0.880 | 0.859 | 0.778 | 0.868 | 0.820 |
Analysis and Findings
Due to GPU memory limitations, DocSAM relies on cropping during training, testing, and inference. Although we add a whole-image thumbnail to recover large objects that may be fragmented across patches, and use a score re-weighting strategy to merge segmentation results from different patches, this patch-based pipeline may still be inferior to processing entire images directly.
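To make the merging step concrete, the sketch below combines overlapping per-patch score maps by plain uniform averaging. This is deliberately the simplest possible scheme and is not DocSAM's score re-weighting strategy, whose details are not given here; the whole-image thumbnail is also left out. The function name and the input layout are hypothetical.

```python
def merge_patch_scores(height, width, patches):
    """Average overlapping patch score maps into one full-size map.

    `patches` is a list of (top, left, scores), where `scores` is a 2D list of
    per-pixel scores for the patch whose top-left corner sits at (top, left).
    Uniform averaging is an illustrative stand-in, not the authors' method.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top, left, scores in patches:
        for dy, row in enumerate(scores):
            for dx, value in enumerate(row):
                total[top + dy][left + dx] += value
                count[top + dy][left + dx] += 1
    return [
        [total[y][x] / count[y][x] if count[y][x] else 0.0 for x in range(width)]
        for y in range(height)
    ]


# Two 1x4 patches on a 1x6 strip, overlapping on columns 2 and 3.
merged = merge_patch_scores(1, 6, [(0, 0, [[1.0] * 4]), (0, 2, [[0.5] * 4])])
print(merged)  # [[1.0, 1.0, 0.75, 0.75, 0.5, 0.5]]
```

A re-weighting scheme would replace the constant weight of 1 per patch with a per-pixel or per-instance weight, for example to trust predictions near a patch border less than those near its center.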
Moreover, because of the use of cropping, different resizing strategies may significantly influence performance on different datasets. This can be observed in the results of docsam_large_all_dataset and docsam_large_all_dataset_keepsize. Compared to docsam_large_all_dataset, which resizes all images to a fixed shorter side of 800 pixels, docsam_large_all_dataset_keepsize attempts to maintain the native resolution of documents while ensuring the shorter side falls within the range of 640 to 1280 pixels.
This approach may be beneficial for high-resolution documents containing many small objects, as well as for low-resolution documents with large objects. However, in other scenarios, such as high-resolution documents containing large objects, it may be harmful because it could exacerbate the fragmentation problem.
Moreover, the current version of DocSAM is purely single-modal. Although it performs comparably to or even better than previous single-modal methods, it still performs significantly worse than multi-modal models, especially on tasks such as logical layout analysis, which require distinguishing between fine-grained text classes.
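A small numeric check illustrates the trade-off. The sketch compares the fixed 800-pixel resizing with a clamped "keep size" resizing (an assumed interpretation of the rule, with hypothetical function names) and asks whether an object still fits inside a single 640×640 window after scaling. The page and object sizes are made-up examples.

```python
def fixed_scale(width, height, target=800):
    """Scale factor that brings the shorter side to `target` pixels."""
    return target / min(width, height)


def keepsize_scale(width, height, min_side=640, max_side=1280):
    """Scale factor that clamps the shorter side into [min_side, max_side] (assumed rule)."""
    shorter = min(width, height)
    return min(max(shorter, min_side), max_side) / shorter


def fits_in_window(obj_w, obj_h, scale, window=640):
    """True if the scaled object could be contained in one sliding window."""
    return max(obj_w, obj_h) * scale <= window


page = (2480, 3508)    # hypothetical A4 page scanned at 300 dpi
figure = (1500, 1500)  # hypothetical large figure on that page

for name, scale in [("fixed-800", fixed_scale(*page)), ("keepsize", keepsize_scale(*page))]:
    print(name, round(figure[0] * scale), fits_in_window(*figure, scale))
# fixed-800 484 True
# keepsize 774 False
```

Under the fixed resizing the figure shrinks to about 484 pixels and fits in one window; under the clamped resizing it stays at about 774 pixels and must be split across windows, which is the fragmentation effect described above.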
In future work, we plan to optimize the model's memory usage and reduce computational costs, design more reasonable resizing and cropping strategies, and extend DocSAM to a multi-modal version, thereby further improving its performance and efficiency.