Abstract

Several rules governing gene recognition and ORF organization in the Saccharomyces cerevisiae genome are demonstrated by statistical analysis of sequence data. The study covers three results. (a) The random frame rule: the six reading frames W1, W2, W3, C1, C2 and C3 of the double-stranded genome are randomly occupied by ORFs (related phenomena of ORF overlapping are also discussed). (b) The inhomogeneity rule: coding and non-coding ORFs differ in the inhomogeneity of base composition across the three codon positions. Using the inhomogeneity index (IHI), one can distinguish coding (IHI > 14) from non-coding (IHI ≤ 14) ORFs with 95% accuracy. We find that ‘spurious’ ORFs (those with IHI ≤ 14) fall mainly into three classes of ORFs, namely, those with ‘similarity to unknown proteins’, those with ‘no similarity’, and ‘questionable ORFs’. The total number of spurious ORFs, which are unlikely to be coding, is estimated to be 470. (c) An evaluation of the ORF length distribution shows that, below 200 amino acids, the occurrence of ATG-initiated ORFs is close to random.
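
To make the six-frame picture concrete, the following is a minimal Python sketch of how ATG-initiated ORFs could be enumerated in the frames W1 to W3 and C1 to C3. It is an illustration only, not the procedure used in this study. Several points are assumptions: the function names (`reverse_complement`, `orfs_in_frame`, `six_frame_orfs`) and the `min_codons` parameter are hypothetical, frame Wk is taken to start at offset k-1 on the given strand and Ck at offset k-1 on its reverse complement, the stop codons are those of the standard genetic code, and an ORF runs from the first ATG in a frame to the next in-frame stop.

```python
# Illustrative sketch only: names, frame numbering and ORF definition are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons=1):
    """Yield (start, end) for each ATG...stop ORF in one reading frame.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    codons from ATG up to, but not including, the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon == "ATG":
                start = i                    # first ATG opens the ORF
        elif codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3)
            start = None                     # look for the next ATG


def six_frame_orfs(seq, min_codons=1):
    """Map frame labels W1..W3, C1..C3 to lists of ORF coordinates.

    Coordinates for C frames refer to the reverse-complement strand.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"W{k + 1}"] = list(orfs_in_frame(seq, k, min_codons))
        frames[f"C{k + 1}"] = list(orfs_in_frame(rc, k, min_codons))
    return frames


if __name__ == "__main__":
    print(six_frame_orfs("ATGAAATAA"))
```

Tallying the number of ORFs found per frame label over a whole chromosome would give the kind of frame-occupancy counts that a random frame rule could be tested against. Setting `min_codons` to a larger value, for example 200, would restrict the tally to ORFs above a chosen length.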