Complexity

Research Article

Features to Text: A Comprehensive Survey of Deep Learning on Semantic Segmentation and Image Captioning

Table 2

Class pixel label distribution in the CamVid dataset.


Dataset	Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	CIDEr

MS COCO	LSTM-A-2 [179]	0.734	0.567	0.430	0.326	0.254	1.00
	Att-Reg [180]	0.740	0.560	0.420	0.310	0.260	—
	Attend-tell [156]	0.707	0.492	0.344	0.243	0.239	—
	SGC [181]	67.1	48.8	34.3	23.9	21.8	73.3
	phi-LSTM [182]	66.6	48.9	35.5	25.8	23.1	82.1
	COMIC [183]	70.6	53.4	39.5	29.2	23.7	88.1
	TBVA [184]	69.5	52.1	38.6	28.7	24.1	91.9
	SCN [185]	0.741	0.578	0.444	0.341	0.261	1.041
	CLGRU [186]	0.720	0.550	0.410	0.300	0.240	0.960
	A-Penalty [187]	72.1	55.1	41.5	31.4	24.7	95.6
	VD-SAN [188]	73.4	56.6	42.8	32.2	25.4	99.9
	ATT-CNN [189]	73.9	57.1	43.3	33	26	101.6
	RTAN [190]	73.5	56.9	43.3	32.9	25.4	103.3
	Adaptive [191]	0.742	0.580	0.439	0.332	0.266	1.085
	Full-SL [192]	0.713	0.539	0.403	0.304	0.251	0.937

Flickr30K	hLSTMat [193]	73.8	55.1	40.3	29.4	23	66.6
	SGC [181]	61.5	42.1	28.6	19.3	18.2	39.9
	RA + SF [194]	0.649	0.462	0.324	0.224	0.194	0.472
	gLSTM [195]	0.646	0.446	0.305	0.206	0.179	—
	Multi-Mod [196]	0.600	0.380	0.254	0.171	0.169	—
	TBVA [184]	66.6	48.4	34.6	24.7	20.2	52.4
	Attend-tell [156]	0.669	0.439	0.296	0.199	0.185	—
	ATT-FCN [158]	0.647	0.460	0.324	0.230	0.189	—
	VQA [197]	0.730	0.550	0.400	0.280	—	—
	Align-Mod [144]	0.573	0.369	0.240	0.157	—	—
	m-RNN [198]	0.600	0.410	0.280	0.190	—	—
	LRCN [112]	0.587	0.391	0.251	0.165	—	—
	NIC [141]	0.670	0.450	0.300	—	—	—
	RTAN [190]	67.1	48.7	34.9	23.9	20.1	53.3
	3-gated [199]	69.4	45.7	33.2	22.6	23	—
	VD-SAN [188]	65.2	47.1	33.6	23.9	19.9	—
	ATT-CNN [189]	66.1	47.2	33.4	23.2	19.4	—
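
The rows above mix two reporting conventions: some methods list scores on a 0-1 scale (for example, SCN [185]) while others list percentages (for example, SGC [181]), so values are not directly comparable column by column. The following is a minimal sketch of one way to bring a row onto the 0-1 scale before comparing methods. The helper name `to_fraction_scale` and the `Row` alias are hypothetical, and the detection rule is an assumption: BLEU-1 cannot exceed 1 on the fractional scale, so a BLEU-1 value above 1 is taken to mean the whole row is in percent. CIDEr is not used for detection because it exceeds 1 on the fractional scale in several rows of the table.

```python
from typing import List, Optional

# One table row: BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, CIDEr.
# None stands for a missing entry (shown as a dash in the table).
Row = List[Optional[float]]


def to_fraction_scale(row: Row) -> Row:
    """Return the row on the 0-1 scale (hypothetical helper, not from the paper).

    Assumption: a BLEU-1 value above 1 means the whole row is in percent.
    """
    bleu1 = row[0]
    if bleu1 is None:
        raise ValueError("BLEU-1 is required to detect the row's scale")
    if bleu1 <= 1.0:
        # Already on the 0-1 scale; return an unchanged copy.
        return list(row)
    # Percent-scale row: divide every reported value by 100, keep gaps as None.
    return [None if v is None else round(v / 100.0, 3) for v in row]


# Rows copied from the MS COCO block of the table.
sgc = to_fraction_scale([67.1, 48.8, 34.3, 23.9, 21.8, 73.3])
scn = to_fraction_scale([0.741, 0.578, 0.444, 0.341, 0.261, 1.041])
```

Under this rule the SGC row is rescaled and the SCN row is left untouched, even though its CIDEr value is above 1. Rows with missing entries keep their gaps rather than having them filled in.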