Table 1: Summary statistics from representative real-world collections that we used as templates for our synthetic data sets.

Collection    No. docs.    Avg. Doc. Len.    Avg. Uniq. Terms
Aquaint       1,033,461               437                 169
USPTO         1,406,200             1,718                 353
EPO             989,507             3,863                 705