Model Zoo of DocSAM

April 8, 2025

Model Zoo

As an all-in-one model, DocSAM was pre-trained on nearly 50 datasets. We provide pre-trained weights for two model variants: DocSAM-Base and DocSAM-Large, with their visual backbones based on Swin-Base and Swin-Large, respectively. Additionally, we offer several specialized models fine-tuned on specific datasets. Information and download links for these pre-trained models are listed in the table below.

Name	Backbone	#Parameter	BatchSize	#Iterations	PatchSize	Download
docsam_base_all_dataset	swin-base	208M	32	360k	640,640	model
docsam_large_all_dataset	swin-large	317M	32	360k	640,640	model
docsam_large_all_dataset_keepsize	swin-large	317M	32	360k	640,640	model
docsam_large_m6doc	swin-large	317M	16	40k	800,800	model
docsam_large_doclaynet	swin-large	317M	16	40k	800,800	model
docsam_large_scut_cab	swin-large	317M	16	40k	800,800	model
docsam_large_ctw1500	swin-large	317M	16	40k	800,800	model
docsam_large_totaltext	swin-large	317M	16	40k	800,800	model

Notice

By default, during training, we randomly resize the input images so that their shorter sides range between 704 and 896 pixels. After resizing, we randomly crop the images into patches of size 640×640 pixels. During testing and inference, we uniformly resize the input images to have a shorter side of 800 pixels and perform predictions using a sliding window approach. The window size is set to 640×640 pixels, with strides of 320 pixels in both the horizontal and vertical directions.
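
The test-time procedure described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, and the handling of the image border (adding one last window flush with the edge) is an assumption, since the text only specifies the window size and stride.

```python
def resize_shorter_side(width, height, target=800):
    """Scale (width, height) so the shorter side equals `target`, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def window_starts(length, window=640, stride=320):
    """Start offsets of the windows along one axis."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    # Assumption: if the regular stride leaves a gap at the border,
    # add one final window that ends exactly at the image edge.
    if starts[-1] + window < length:
        starts.append(length - window)
    return starts


def sliding_windows(width, height, window=640, stride=320):
    """Return (left, top, right, bottom) boxes for every window, row by row."""
    return [
        (x, y, min(x + window, width), min(y + window, height))
        for y in window_starts(height, window, stride)
        for x in window_starts(width, window, stride)
    ]


# Example: a hypothetical page scanned at 1240x1754 pixels.
w, h = resize_shorter_side(1240, 1754)  # (800, 1132)
boxes = sliding_windows(w, h)           # 2 columns x 3 rows = 6 windows
```

With these settings, neighbouring windows overlap by at least half their width or height, so most pixels are predicted more than once and the per-window results have to be merged afterwards.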

Additionally, we provide a model called docsam_large_all_dataset_keepsize, which uses a different resizing method. This model retains the original sizes of the input images as much as possible, ensuring that the shorter sides are within the range of 640 to 1280 pixels. Other settings are kept the same as docsam_large_all_dataset.
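
One way to read the keepsize rule is as a clamp on the shorter side. The sketch below assumes exactly that interpretation ("as much as possible" meaning no resizing unless the shorter side leaves the 640 to 1280 range); the function name and rounding are illustrative choices, not taken from the repository.

```python
def keepsize_resize(width, height, min_side=640, max_side=1280):
    """Keep the native size unless the shorter side falls outside [min_side, max_side]."""
    shorter = min(width, height)
    clamped = min(max(shorter, min_side), max_side)
    scale = clamped / shorter  # 1.0 when the image is already in range
    return round(width * scale), round(height * scale)


in_range = keepsize_resize(1000, 1400)   # unchanged: (1000, 1400)
too_small = keepsize_resize(500, 700)    # upscaled:  (640, 896)
too_large = keepsize_resize(2480, 3508)  # downscaled: (1280, 1811)
```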

For fine-tuning on the M⁶Doc, DocLayNet, SCUT-CAB, CTW1500, and Total-Text datasets, we adopt a patch size of 800×800 pixels. This approach aims to prevent regions of interest from being fragmented across different patches.

The original DocLayNet dataset resizes all images to a fixed size of 1025×1025 pixels, which may distort the aspect ratios of the documents. To better preserve the original aspect ratios, we have resized the images to 1025×1449 pixels before sending them to our model.
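
To make the distortion argument concrete, the following sketch compares how unevenly the two target sizes scale the axes of a portrait page. The source page size of 1240×1754 pixels is a hypothetical example, not a property of DocLayNet, and `anisotropy` is an illustrative helper.

```python
def anisotropy(src_w, src_h, dst_w, dst_h):
    """Ratio of the larger to the smaller axis scale; 1.0 means no distortion."""
    sx, sy = dst_w / src_w, dst_h / src_h
    return max(sx, sy) / min(sx, sy)


page = (1240, 1754)                      # hypothetical portrait scan
square = anisotropy(*page, 1025, 1025)   # about 1.41: height squashed relative to width
tall = anisotropy(*page, 1025, 1449)     # about 1.00: aspect ratio nearly preserved
```

For a page of this shape, the square target compresses the vertical axis by roughly 41% relative to the horizontal one, while the 1025×1449 target leaves the proportions almost untouched.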

Model Performance

The performance of the aforementioned models is shown in the following tables. For more details about the datasets listed in these tables, please refer to the original paper.

Please note that the pre-trained models provided here (docsam_base_all_dataset, docsam_large_all_dataset, and docsam_large_all_dataset_keepsize) were trained for more iterations (360k) than those evaluated in our original paper. As a result, their performance is higher than what the paper reports.

Additionally, the dataset-specific models, including docsam_large_m6doc, docsam_large_doclaynet, docsam_large_scut_cab, docsam_large_ctw1500, and docsam_large_totaltext, are fine-tuned from the above docsam_large_all_dataset model. Consequently, their performance (DocSAM^*) is also higher than what was reported in the original paper (DocSAM).

docsam_base_all_dataset

Task	Dataset	Instance					Semantic	Dataset	Instance					Semantic
Task	Dataset	AP50	AP75	mAP	mAP_b	mAF	mIoU	Dataset	AP50	AP75	mAP	mAP_b	mAF	mIoU
DLA	BaDLAD	0.686	0.480	0.464	0.478	0.545	0.686	CDLA	0.935	0.871	0.774	0.772	0.776	0.832
	D⁴LA	0.654	0.585	0.515	0.509	0.552	0.486	DocBank	0.638	0.478	0.442	0.425	0.519	0.638
	DocLayNet	0.753	0.595	0.541	0.529	0.590	0.650	ICDAR2017-POD	0.892	0.846	0.803	0.788	0.796	0.920
	IIIT-AR-13K	0.828	0.647	0.599	0.602	0.655	0.631	M⁶Doc	0.570	0.462	0.413	0.407	0.438	0.328
	PubLayNet	0.951	0.900	0.848	0.840	0.884	0.918	RanLayNet	0.948	0.897	0.845	0.835	0.872	0.905
AHDS	CASIA-AHCDB-style1	0.961	0.936	0.874	0.853	0.899	0.948	CASIA-AHCDB-style2	0.940	0.922	0.837	0.825	0.880	0.924
	CHDAC-2022	0.872	0.730	0.613	0.560	0.632	0.908	ICDAR2019-HDRC	0.951	0.820	0.767	0.722	0.822	0.912
	SCUT-CAB-physical	0.944	0.873	0.793	0.775	0.827	0.939	SCUT-CAB-logical	0.773	0.647	0.562	0.549	0.561	0.488
	MTHv2	0.945	0.857	0.718	0.687	0.736	0.916	HJDataset	0.966	0.932	0.898	0.888	0.904	0.822
	CASIA-HWDB	0.971	0.901	0.845	0.772	0.876	0.951	SCUT-HCCDoc	0.845	0.657	0.548	0.552	0.597	0.854
TSR	FinTabNet	0.879	0.802	0.712	0.697	0.797	0.867	PubTabNet	0.981	0.903	0.829	0.821	0.869	0.927
	ICDAR2013	0.948	0.525	0.602	0.525	0.550	0.846	ICDAR2017-POD	0.942	0.847	0.766	0.739	0.796	0.895
	cTDaR-modern	0.916	0.561	0.636	0.594	0.692	0.879	cTDaR-archival	0.925	0.770	0.713	0.659	0.733	0.959
	NTable-cam	0.851	0.749	0.673	0.680	0.718	0.865	NTable-gen	0.946	0.914	0.852	0.849	0.896	0.945
	PubTables-1M-TD	0.982	0.953	0.910	0.867	0.945	0.968	PubTables-1M-TSR	0.902	0.821	0.752	0.734	0.794	0.836
	TableBank-latex	0.968	0.950	0.923	0.906	0.939	0.952	TableBank-word	0.874	0.841	0.822	0.823	0.852	0.857
	TNCR	0.704	0.650	0.629	0.613	0.658	0.555	STDW	0.960	0.933	0.913	0.887	0.925	0.974
	WTW	0.948	0.905	0.803	0.786	0.812	0.974
STD	CASIA-10k	0.638	0.411	0.383	0.380	0.413	0.799	COCO-Text	0.519	0.248	0.264	0.270	0.292	0.619
	CTW1500	0.809	0.558	0.494	0.492	0.565	0.831	CTW-Public	0.369	0.124	0.167	0.139	0.199	0.540
	HUST-TR400	0.811	0.680	0.600	0.605	0.643	0.859	ICDAR2015	0.657	0.316	0.337	0.350	0.364	0.635
	ICDAR2017-RCTW	0.585	0.295	0.304	0.312	0.359	0.803	ICDAR2017-MLT	0.669	0.468	0.418	0.415	0.457	0.835
	ICDAR2019-ArT	0.733	0.471	0.428	0.447	0.466	0.802	ICDAR2019-LSVT	0.597	0.372	0.351	0.352	0.397	0.814
	ICDAR2019-MLT	0.702	0.498	0.445	0.443	0.480	0.846	ICDAR2019-ReCTS	0.739	0.553	0.490	0.485	0.523	0.845
	ICDAR2023-HierText	0.594	0.333	0.327	0.316	0.359	0.671	ICDAR2023-ReST	0.944	0.900	0.771	0.850	0.486	0.833
	ICPR2018-MTWI	0.644	0.394	0.379	0.383	0.424	0.838	MSRA-TD500	0.825	0.628	0.542	0.581	0.558	0.767
	ShopSign	0.629	0.279	0.311	0.322	0.365	0.817	Total-Text	0.773	0.470	0.433	0.451	0.474	0.781
	USTB-SV1K	0.826	0.379	0.425	0.412	0.467	0.714

docsam_large_all_dataset

Task	Dataset	Instance					Semantic	Dataset	Instance					Semantic
Task	Dataset	AP50	AP75	mAP	mAP_b	mAF	mIoU	Dataset	AP50	AP75	mAP	mAP_b	mAF	mIoU
DLA	BaDLAD	0.713	0.504	0.488	0.497	0.562	0.694	CDLA	0.955	0.910	0.818	0.820	0.822	0.885
	D⁴LA	0.722	0.662	0.581	0.576	0.600	0.560	DocBank	0.692	0.553	0.512	0.501	0.585	0.702
	DocLayNet	0.826	0.697	0.628	0.621	0.675	0.725	ICDAR2017-POD	0.916	0.874	0.841	0.830	0.824	0.931
	IIIT-AR-13K	0.887	0.754	0.674	0.680	0.716	0.776	M⁶Doc	0.682	0.590	0.519	0.512	0.538	0.467
	PubLayNet	0.965	0.921	0.876	0.877	0.900	0.931	RanLayNet	0.950	0.925	0.873	0.869	0.863	0.882
AHDS	CASIA-AHCDB-style1	0.977	0.965	0.914	0.899	0.934	0.955	CASIA-AHCDB-style2	0.952	0.938	0.865	0.846	0.900	0.930
	CHDAC-2022	0.917	0.812	0.673	0.637	0.697	0.916	ICDAR2019-HDRC	0.955	0.830	0.777	0.728	0.829	0.919
	SCUT-CAB-physical	0.943	0.888	0.812	0.792	0.849	0.947	SCUT-CAB-logical	0.786	0.675	0.583	0.567	0.595	0.548
	MTHv2	0.956	0.878	0.731	0.706	0.756	0.919	HJDataset	0.986	0.945	0.916	0.908	0.918	0.823
	CASIA-HWDB	0.976	0.934	0.879	0.798	0.900	0.956	SCUT-HCCDoc	0.877	0.684	0.574	0.581	0.640	0.860
TSR	FinTabNet	0.887	0.818	0.734	0.716	0.817	0.877	PubTabNet	0.992	0.917	0.839	0.833	0.878	0.932
	ICDAR2013	0.919	0.532	0.598	0.517	0.543	0.844	ICDAR2017-POD	0.944	0.855	0.774	0.750	0.797	0.899
	cTDaR-modern	0.932	0.593	0.662	0.617	0.713	0.886	cTDaR-archival	0.922	0.793	0.734	0.679	0.756	0.965
	NTable-cam	0.906	0.838	0.757	0.772	0.784	0.892	NTable-gen	0.961	0.933	0.905	0.906	0.941	0.964
	PubTables-1M-TD	0.985	0.935	0.896	0.879	0.939	0.977	PubTables-1M-TSR	0.922	0.856	0.793	0.781	0.825	0.859
	TableBank-latex	0.975	0.965	0.939	0.938	0.953	0.959	TableBank-word	0.893	0.859	0.858	0.856	0.868	0.875
	TNCR	0.719	0.677	0.656	0.649	0.693	0.642	STDW	0.969	0.945	0.930	0.921	0.942	0.972
	WTW	0.960	0.927	0.824	0.810	0.838	0.978
STD	CASIA-10k	0.679	0.453	0.416	0.415	0.444	0.819	COCO-Text	0.577	0.282	0.297	0.306	0.324	0.655
	CTW1500	0.809	0.581	0.503	0.505	0.586	0.843	CTW-Public	0.484	0.161	0.210	0.179	0.230	0.615
	HUST-TR400	0.861	0.769	0.654	0.661	0.701	0.863	ICDAR2015	0.710	0.356	0.370	0.385	0.400	0.658
	ICDAR2017-RCTW	0.624	0.315	0.328	0.339	0.383	0.811	ICDAR2017-MLT	0.705	0.506	0.451	0.450	0.492	0.857
	ICDAR2019-ArT	0.774	0.509	0.462	0.484	0.507	0.814	ICDAR2019-LSVT	0.641	0.409	0.381	0.386	0.430	0.827
	ICDAR2019-MLT	0.746	0.547	0.485	0.484	0.523	0.867	ICDAR2019-ReCTS	0.780	0.602	0.530	0.527	0.565	0.862
	ICDAR2023-HierText	0.630	0.365	0.354	0.342	0.390	0.694	ICDAR2023-ReST	0.956	0.930	0.831	0.906	0.727	0.860
	ICPR2018-MTWI	0.682	0.429	0.410	0.415	0.453	0.853	MSRA-TD500	0.857	0.654	0.564	0.510	0.601	0.774
	ShopSign	0.660	0.321	0.342	0.357	0.411	0.832	Total-Text	0.788	0.509	0.456	0.475	0.508	0.798
	USTB-SV1K	0.838	0.416	0.438	0.426	0.493	0.722

docsam_large_all_dataset_keepsize

Task	Dataset	Instance					Semantic	Dataset	Instance					Semantic
Task	Dataset	AP50	AP75	mAP	mAP_b	mAF	mIoU	Dataset	AP50	AP75	mAP	mAP_b	mAF	mIoU
DLA	BaDLAD	0.696	0.494	0.477	0.487	0.547	0.700	CDLA	0.939	0.890	0.800	0.803	0.807	0.866
	D⁴LA	0.655	0.590	0.522	0.518	0.575	0.469	DocBank	0.704	0.534	0.498	0.489	0.573	0.687
	DocLayNet	0.748	0.581	0.530	0.528	0.605	0.613	ICDAR2017-POD	0.886	0.850	0.817	0.800	0.820	0.916
	IIIT-AR-13K	0.851	0.698	0.635	0.648	0.693	0.730	M⁶Doc	0.667	0.564	0.503	0.498	0.529	0.425
	PubLayNet	0.971	0.922	0.889	0.887	0.904	0.935	RanLayNet	0.936	0.895	0.847	0.843	0.841	0.879
AHDS	CASIA-AHCDB-style1	0.968	0.955	0.926	0.915	0.943	0.956	CASIA-AHCDB-style2	0.959	0.950	0.871	0.863	0.903	0.927
	CHDAC-2022	0.881	0.752	0.631	0.599	0.649	0.910	ICDAR2019-HDRC	0.952	0.807	0.769	0.732	0.814	0.895
	SCUT-CAB-physical	0.925	0.862	0.791	0.773	0.850	0.929	SCUT-CAB-logical	0.737	0.576	0.517	0.508	0.559	0.523
	MTHv2	0.934	0.839	0.695	0.674	0.741	0.913	HJDataset	0.941	0.891	0.859	0.853	0.825	0.813
	CASIA-HWDB	0.963	0.935	0.880	0.810	0.900	0.955	SCUT-HCCDoc	0.849	0.647	0.548	0.557	0.620	0.855
TSR	FinTabNet	0.886	0.819	0.749	0.724	0.825	0.878	PubTabNet	0.992	0.913	0.840	0.831	0.879	0.934
	ICDAR2013	0.908	0.526	0.606	0.522	0.542	0.827	ICDAR2017-POD	0.935	0.861	0.774	0.752	0.794	0.896
	cTDaR-modern	0.880	0.591	0.646	0.613	0.648	0.825	cTDaR-archival	0.949	0.837	0.762	0.721	0.777	0.957
	NTable-cam	0.906	0.831	0.752	0.764	0.779	0.874	NTable-gen	0.951	0.930	0.899	0.900	0.940	0.944
	PubTables-1M-TD	0.988	0.956	0.943	0.912	0.951	0.955	PubTables-1M-TSR	0.936	0.892	0.833	0.826	0.856	0.873
	TableBank-latex	0.981	0.973	0.948	0.944	0.953	0.959	TableBank-word	0.922	0.890	0.893	0.891	0.868	0.872
	TNCR	0.717	0.676	0.664	0.662	0.753	0.592	STDW	0.948	0.932	0.921	0.915	0.935	0.968
	WTW	0.959	0.925	0.818	0.810	0.834	0.978
STD	CASIA-10k	0.679	0.457	0.419	0.423	0.443	0.820	COCO-Text	0.542	0.260	0.276	0.281	0.316	0.643
	CTW1500	0.848	0.638	0.544	0.550	0.610	0.846	CTW-Public	0.660	0.312	0.335	0.314	0.335	0.645
	HUST-TR400	0.845	0.789	0.651	0.666	0.710	0.878	ICDAR2015	0.700	0.338	0.362	0.375	0.401	0.643
	ICDAR2017-RCTW	0.638	0.311	0.331	0.346	0.389	0.810	ICDAR2017-MLT	0.714	0.514	0.457	0.460	0.491	0.840
	ICDAR2019-ArT	0.767	0.517	0.461	0.482	0.504	0.818	ICDAR2019-LSVT	0.635	0.404	0.378	0.383	0.422	0.827
	ICDAR2019-MLT	0.751	0.557	0.488	0.495	0.519	0.854	ICDAR2019-ReCTS	0.789	0.603	0.533	0.533	0.572	0.862
	ICDAR2023-HierText	0.651	0.387	0.371	0.370	0.396	0.683	ICDAR2023-ReST	0.970	0.917	0.843	0.913	0.746	0.873
	ICPR2018-MTWI	0.683	0.434	0.413	0.418	0.456	0.853	MSRA-TD500	0.770	0.599	0.509	0.553	0.533	0.748
	ShopSign	0.604	0.254	0.292	0.301	0.357	0.820	Total-Text	0.796	0.502	0.462	0.486	0.504	0.807
	USTB-SV1K	0.840	0.390	0.430	0.417	0.484	0.719

docsam_large_m6doc

Method	Object			Instance
Method	mAP	AP50	AP75	mAP	AP50	AP75
Faster R-CNN	0.490	0.678	0.572	0.478	0.678	0.552
Mask R-CNN	0.401	0.584	0.462	0.397	0.584	0.456
Deformable DETR	0.572	0.768	0.634	0.556	0.765	0.611
ISTR	0.627	0.808	0.708	0.620	0.807	0.702
TransDLANet	0.645	0.827	0.727	0.638	0.826	0.719
DAT	0.712	--	--	0.657	--	--
DocSAM	0.663	0.840	0.755	0.661	0.840	0.750
DocSAM^*	0.664	0.833	0.754	0.667	0.835	0.757

docsam_large_doclaynet

Method	Caption	Footnote	Formula	List-item	Page-footer	Page-header	Picture	Section-header	Table	Text	Title	mAP
Human	0.890	0.910	0.850	0.880	0.940	0.890	0.710	0.840	0.810	0.860	0.720	0.830
Faster R-CNN	0.701	0.737	0.635	0.810	0.589	0.720	0.720	0.684	0.822	0.854	0.799	0.734
Mask R-CNN	0.715	0.718	0.634	0.808	0.593	0.700	0.727	0.693	0.829	0.858	0.804	0.735
YOLOv5	0.777	0.772	0.662	0.862	0.611	0.679	0.771	0.746	0.863	0.881	0.827	0.768
Cascade Mask R-CNN	0.732	0.753	0.669	0.839	0.617	0.713	0.750	0.701	0.859	0.871	0.815	0.756
TransDLANet	0.682	0.747	0.616	0.810	0.548	0.682	0.685	0.698	0.824	0.838	0.818	0.723
SwinDocSegmenter	0.836	0.648	0.623	0.823	0.651	0.664	0.847	0.665	0.874	0.882	0.633	0.769
DINO	0.718	0.788	0.727	0.856	0.630	0.766	0.741	0.721	0.873	0.876	0.851	0.777
VSR	0.726	0.721	0.738	0.862	0.818	0.813	0.631	0.825	0.794	0.884	0.807	0.784
M2Doc (Cascade Mask R-CNN)	0.860	0.836	0.871	0.928	0.867	0.856	0.763	0.891	0.864	0.927	0.878	0.867
M2Doc (DINO)	0.853	0.867	0.898	0.936	0.903	0.910	0.784	0.907	0.874	0.939	0.913	0.890
DocSAM	0.740	0.703	0.646	0.822	0.569	0.668	0.730	0.702	0.843	0.853	0.739	0.729

docsam_large_scut_cab

Method	Object			Instance
Method	mAP	AP50	AP75	mAP	AP50	AP75
Physical
Faster R-CNN	0.775	0.913	0.861	0.753	0.910	0.834
Mask R-CNN	0.791	0.921	0.877	0.795	0.917	0.872
SCNet	0.813	0.941	0.890	0.820	0.941	0.891
Deformable DETR	0.799	0.923	0.871	0.779	0.921	0.843
VSR	0.787	0.919	0.860	0.787	0.919	0.852
DocSAM	0.774	0.947	0.860	0.811	0.948	0.891
DocSAM^*	0.829	0.944	0.902	0.837	0.944	0.908
Logical
Faster R-CNN	0.549	0.774	0.613	0.542	0.773	0.606
Mask R-CNN	0.551	0.785	0.619	0.553	0.777	0.631
SCNet	0.602	0.836	0.673	0.603	0.836	0.680
Deformable DETR	0.627	0.852	0.717	0.620	0.851	0.703
VSR	0.557	0.783	0.616	0.551	0.782	0.611
DocSAM	0.548	0.769	0.632	0.575	0.779	0.667
DocSAM^*	0.588	0.780	0.660	0.592	0.781	0.671

docsam_large_ctw1500 and docsam_large_totaltext

Method	CTW1500			Total-Text
Method	P	R	F	P	R	F
HierText	0.846	0.874	0.860	0.855	0.905	0.879
SIR	0.874	0.837	0.855	0.909	0.856	0.882
DPText-DETR	0.917	0.862	0.888	0.918	0.864	0.890
UNITS	--	--	--	--	--	0.898
ESTextSpotter	0.915	0.886	0.900	0.920	0.881	0.900
DAT-DET	0.893	0.893	0.893	0.940	0.882	0.910
DAT-SEG	0.925	0.909	0.917	0.950	0.892	0.920
DocSAM	0.805	0.881	0.842	0.721	0.826	0.770
DocSAM^*	0.839	0.880	0.859	0.778	0.868	0.820
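
The F column in the table above is consistent with the usual harmonic mean of precision (P) and recall (R). The sketch below assumes that definition, which the text does not state explicitly, and checks it against the DocSAM rows. A small tolerance is needed because the published P and R values are themselves rounded to three decimals.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the DocSAM rows of the table.
rows = {
    "DocSAM / CTW1500": (0.805, 0.881, 0.842),
    "DocSAM^* / CTW1500": (0.839, 0.880, 0.859),
    "DocSAM / Total-Text": (0.721, 0.826, 0.770),
    "DocSAM^* / Total-Text": (0.778, 0.868, 0.820),
}

# Largest gap between the recomputed and the reported F value.
max_gap = max(abs(f_measure(p, r) - f) for p, r, f in rows.values())
```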

Analysis and Findings

Due to GPU memory limitations, DocSAM relies on cropping during training, testing, and inference. We add a whole-image thumbnail to recover large objects that may be fragmented across patches, and we use a score re-weighting strategy to merge segmentation results from different patches. Even so, this patch-based pipeline may still be inferior to processing entire images directly.
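
The details of DocSAM's score re-weighting are not given here, so the following is only a generic illustration of the underlying idea: when overlapping patches disagree, trust each patch more near its centre than near its border. The triangular weight and the per-pixel averaging are assumptions chosen for simplicity, and all names are hypothetical. This is not the authors' strategy.

```python
def center_weight(i, size):
    """Triangular weight: highest at the patch centre, lowest (but positive) at the border."""
    centre = (size - 1) / 2
    return 1.0 - abs(i - centre) / (centre + 1)


def merge_patches(patches, width, height):
    """Blend per-pixel scores from overlapping patches by centre-weighted averaging.

    patches: list of (left, top, scores), where scores is a 2D list (rows of floats).
    """
    total = [[0.0] * width for _ in range(height)]
    weight = [[0.0] * width for _ in range(height)]
    for left, top, scores in patches:
        ph, pw = len(scores), len(scores[0])
        for y in range(ph):
            wy = center_weight(y, ph)
            for x in range(pw):
                w = wy * center_weight(x, pw)
                total[top + y][left + x] += w * scores[y][x]
                weight[top + y][left + x] += w
    # Pixels not covered by any patch keep a score of 0.0.
    return [
        [t / w if w else 0.0 for t, w in zip(t_row, w_row)]
        for t_row, w_row in zip(total, weight)
    ]


# Two 1x4 patches that overlap on columns 2 and 3 of a 1x6 strip.
patch_a = (0, 0, [[1.0, 1.0, 1.0, 1.0]])
patch_b = (2, 0, [[0.0, 0.0, 0.0, 0.0]])
merged = merge_patches([patch_a, patch_b], width=6, height=1)
```

In the overlap, the merged score leans towards the patch whose centre is closer: column 2 stays nearer to patch_a's value, column 3 nearer to patch_b's.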

Moreover, because of the use of cropping, different resizing strategies may significantly influence performance on different datasets. This can be observed in the results of docsam_large_all_dataset and docsam_large_all_dataset_keepsize. Compared to docsam_large_all_dataset, which resizes all images to a fixed shorter side of 800 pixels, docsam_large_all_dataset_keepsize attempts to maintain the native resolution of documents while ensuring the shorter side falls within the range of 640 to 1280 pixels.

This approach may be beneficial for high-resolution documents containing many small objects, as well as for low-resolution documents with large objects. In other scenarios, such as high-resolution documents containing large objects, it may be harmful because it can exacerbate the fragmentation problem.

In addition, the current version of DocSAM is purely single-modal. Although it performs comparably to, or even better than, previous single-modal methods, it still performs significantly worse than multi-modal models, especially on tasks such as logical layout analysis that require distinguishing between fine-grained text classes.

In future work, we plan to optimize the model's memory usage and reduce computational costs, design more reasonable resizing and cropping strategies, and extend DocSAM to a multi-modal version, thereby further improving its performance and efficiency.