Research Article
Deep Visual Semantic Embedding with Text Data Augmentation and Word Embedding Initialization
Table 3
Experimental results with text data augmentation on Flickr8k.
| Model | Feature(s) | Image count R@1 | Image count R@5 | Image count R@10 | Image count Med r | Image retrieval R@1 | Image retrieval R@5 | Image retrieval R@10 | Image retrieval Med r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeViSE [13] | Word2Vec | 4.8 | 16.5 | 27.3 | 28 | 5.9 | 20.1 | 29.6 | 29 |
| DeFrag [14] | R-CNN | 12.6 | 32.9 | 44.0 | 14 | 9.7 | 29.6 | 42.5 | 15 |
| VSA [16] | R-CNN + BRNN | 16.5 | 40.6 | 54.2 | 7.6 | 11.8 | 32.1 | 44.7 | 12.4 |
| UVSE [18] | ConvNet + LSTM | 13.5 | 36.2 | 45.7 | 13 | 10.4 | 31 | 43.7 | 14 |
| UVSE (VGG) [19] | VGG + LSTM | 18.0 | 40.9 | 55.0 | 8 | 12.5 | 37.0 | 51.5 | 10 |
| VSE++ [21] | VGG + GRU + HNM | 16.3 | 37.7 | 52.5 | 9 | 12 | 33.3 | 48.1 | 11 |
| Ours | Aug | 20.9 | 44.1 | 58.8 | 7 | 14.5 | 39 | 51.2 | 10 |
| Ours | Aug + Word2Vec | 21.5 | 49.1 | 62.3 | 6 | 15.1 | 38.9 | 53.1 | 9 |
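The column headings in Table 3 appear to follow the usual convention for cross-modal retrieval benchmarks. Assuming R@K is the percentage of queries whose correct match is ranked in the top K (higher is better) and Med r is the median rank of the first correct match (lower is better), the metrics can be sketched as below. The function names and the toy ranks are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Percentage of queries whose first correct match has rank <= k.

    `ranks` holds one 1-based rank per query (assumed convention).
    """
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def summarize(ranks):
    """Collect the four metrics reported per task in the table."""
    return {
        "R@1": recall_at_k(ranks, 1),
        "R@5": recall_at_k(ranks, 5),
        "R@10": recall_at_k(ranks, 10),
        "Med r": median(ranks),
    }


# Hypothetical ranks for five queries, not data from the paper.
toy_ranks = [1, 3, 7, 12, 40]
print(summarize(toy_ranks))
```

Under this reading, the two groups of four columns would each be obtained by running such a summary once per retrieval direction.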