Abstract

This paper describes a system that generates variations on theme music fitting story scenes represented by texts and/or pictures. Inputs to the present system are original theme music and numerical information on the given story scenes. The present system varies the melodies, tempos, tones, tonalities, and accompaniments of the given theme music based on impressions of the story scenes. Genetic algorithms (GAs) using modular neural network (MNN) models as fitness functions are applied to music generation in order to reflect users' feelings about music and stories. The present system adjusts the MNN models for each user online. This paper also describes evaluation experiments conducted to confirm whether the generated variations on theme music appropriately reflect the impressions of the story scenes.

1. Introduction

Music, pictures, and/or text information are combined into multimedia content through interaction among them [1]. The effectiveness of multimodal communication that combines media of different modalities has been analyzed in the field of cognitive psychology [1]. It is expected that multimodal communication will become part of everyday life owing to the development of information technology [2]. However, interaction among media of different modalities does not necessarily arise from their simple and random combination. The features and impressions of each medium must be considered carefully in order to create effective multimedia content. Therefore, the creation of multimodal content costs more time and labor than that of single-modal content. Support systems for the creation of multimodal content, or for the flexible combination of media of different modalities, have therefore attracted interest [3, 4].

The authors are studying the construction of a system that generates variations on theme music fitting each story scene represented by texts and/or pictures [5]. This system varies the melodies, tempos, tones, tonalities, and accompaniments of given theme music based on impressions of the story scenes. The system has two sections representing (a) relations between story scenes and musical images and (b) relations between features of variations and musical impressions. Since feelings about stories and music differ among people [6], and this difference is important in multimedia content creation, it is necessary to model the above relations for each user. Whereas in [5] these relations were obtained from questionnaire data, that is, offline, the present paper proposes a method that adjusts the relations for each user online. In this paper, the transformation of theme music is defined as follows: tunes, tones, musical performances, rhythms, and tempos are varied according to story scenes [7].

2. Outline of Present System

2.1. Inputs and Outputs

Inputs to the present system are original theme music and numerical information on the given story scenes. Outputs are MIDI files of variations on the original theme music generated according to each story scene. This paper deals with the generation of variations on theme music fitting stories obtained by the system in [8], which generates story-like linguistic expressions from four given pictures. In this paper, a scene is defined as each picture used for story generation. Information on a picture scene, for example, the character's emotion, the kind of character, impressions of the character's behavior in the scene (e.g., violent behavior), and the picture sequence, as listed in Table 1, is acquired from each picture [8]. This information forms the input to the present system.
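As an illustration only (not the authors' implementation), the per-picture scene information of Table 1 could be held in a simple container such as the following Python sketch; the field names and value encodings are assumptions.

from dataclasses import dataclass

# Hypothetical container for the per-picture scene information of Table 1.
# Field names and value encodings are illustrative assumptions.
@dataclass
class SceneInfo:
    character_emotion: float     # e.g., normalized emotion score of the character
    character_kind: int          # category index for the kind of character
    behavior_impression: float   # e.g., degree of "violent behavior" in the scene
    picture_sequence: int        # position of the picture in the four-picture story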

2.2. System Structure

The present system consists of two sections, a musical image acquisition (MIA) section and a theme music transformation (TMT) section, as shown in Figure 1. The MIA section converts the information on story scenes listed in Table 1 into transformation image parameters (TIPs) using modular neural network (MNN) models [9]. The TMT section transforms the inputted original theme music based on the values of the TIPs and generates a set of MIDI-formatted candidate variations on the theme music for each story scene. The TMT section applies genetic algorithms (GAs), which use MNN models as fitness functions, to the generation of variation candidates. The MNN models consist of three neural network models: an average model network (AMN), an individual variation model network (IVMN), and a gating network. The AMN is a hierarchical neural network model expressing users' average feelings about music and stories. The IVMN is a radial basis function network model expressing differences among users' feelings about music and stories. The gating network switches between the AMN and the IVMN. The present system adjusts the IVMNs and the gating networks for each user.
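A minimal sketch of how the two sections cooperate is given below; the method names are hypothetical and only indicate the data flow of Figure 1.

# Hypothetical data flow of Figure 1: scene information -> TIPs -> GA search.
def generate_variations_for_scene(scene_info, theme_music, mia_section, tmt_section):
    tips = mia_section.estimate_tips(scene_info)      # MIA section (Section 3), MNN models
    return tmt_section.transform(theme_music, tips)   # TMT section (Section 4), GA search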

3. Musical Image Acquisition (MIA) Section

The MIA section is constructed from MNN models. The inputs to the MNN models are shown in Table 1. The MNN models estimate the values of the TIPs, which represent the musical image used to transform the original theme music. In this paper, the TIPs consist of pairs of adjectives selected with reference to a study that retrieves musical works of many genres using pairs of adjectives representing musical images [10]. These pairs are happy-sad, heavy-light, hard-soft, stable-unstable, clear-muddy, calm-violent, smooth-rough, and thick-thin. Pre-experiments were performed in order to determine which pairs of adjectives are necessary as TIPs. The procedure of the pre-experiments is as follows. (1) Fixing the musical instruments, tempos, tonalities, tones, chords in the melody part, and accompaniment-part patterns at random, 125 variations are generated. (2) Several subjects, none of whom has more than three years of experience playing a musical instrument, listen to the variations and express their impressions of them with the eight pairs of adjectives. (3) If the subjects feel that it is difficult to evaluate the differences among the variations with some pairs of adjectives, they report those pairs. The results of the pre-experiments show that it is difficult to evaluate the differences among the variations using the adjective pairs hard-soft, stable-unstable, smooth-rough, and thick-thin. Therefore, these four pairs of adjectives are not used in this paper. That is, four pairs of adjectives, which are parameters on the degree of change from the original theme music shown in Table 2, are used. Each parameter value is a real number in [0.0, 1.0].
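For illustration, the TIPs can be thought of as a mapping from the four retained adjective pairs to values in [0.0, 1.0], as in the following sketch (the dictionary layout is an assumption, not the paper's data format).

# The four adjective pairs retained as TIPs (Table 2); values lie in [0.0, 1.0].
TIP_PAIRS = ("happy-sad", "heavy-light", "clear-muddy", "calm-violent")

def clamp_tips(raw_values):
    """Clip estimated TIP values into the valid range [0.0, 1.0]."""
    return {pair: min(max(value, 0.0), 1.0)
            for pair, value in zip(TIP_PAIRS, raw_values)}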

The MIA section estimates the values of the TIPs from information on a picture scene. When generating variations on theme music fitting story scenes, the scene information necessary for estimating the TIP values depends on the media representing the story (for example, pictures, texts, or animations) and on the content of the story (for example, a serious story or a story for children), and is not determined uniquely. Therefore, the selection of picture-scene information for estimating the TIP values needs to be considered. However, since the input to the present system is limited in this paper to information on picture scenes, this point is not discussed here. In the future, the information should be changed according to the media representing the story or the form of the story.

4. Theme Music Transformation (TMT) Section

4.1. Procedure on Generation of Variations [5]

Inputs to the TMT section are the original theme music and the values of the TIPs obtained by the MIA section, and outputs are MIDI files of variations on the theme music. The MIDI files consist of a melody part and six accompaniment parts. The accompaniment parts are an obbligato part, backing parts 1 and 2, a bass part, a pad part, and a drums part. The TMT section modifies the impressions of the inputted original theme music by varying the following components of the MIDI files [5]: (1) the scores of the melody parts, (2) tempos, (3) tonalities, (4) the accompaniment patterns of the accompaniment parts, and (5) tones.

4.2. Structure of TMT Section

The TMT section transforms the given original theme music according to the inputted TIPs and outputs sets of MIDI-formatted candidate variations on the theme music, as shown in Figure 2. GAs are applied to transforming the given theme music to fit the TIPs, where a variation generated from the theme music is represented by a chromosome in the GA framework. In this paper, the following GA parameters are used.

(1) population size;
(2) maximum number of generations;
(3) the number of individuals generated randomly;
(4) the partial population size presented to the user;
(5) crossover probability;
(6) mutation probability.
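For reference, these parameters could be grouped into a configuration object as in the sketch below; the field names are assumptions, since the paper's own symbols are not reproduced here.

from dataclasses import dataclass

# Hypothetical container mirroring the GA parameter list above.
@dataclass
class GAConfig:
    population_size: int
    max_generations: int
    num_random_individuals: int   # individuals generated randomly each generation
    presented_subset_size: int    # partial population presented to the user
    crossover_prob: float
    mutation_prob: float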

Procedures in the TMT section are as follows.

(1) Variations are generated from the inputted theme music in the form of chromosomes.
(2) Fitness values of the chromosomes are calculated according to the inputted values of the TIPs and the melody of the original theme music.
(3) The GA operations of crossover and mutation are performed, and the next-generation population is generated. Go back to step (2).
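A schematic sketch of this loop is shown below; init_population, fitness, tournament_select, crossover, and mutate are hypothetical helpers standing in for the operations detailed in Sections 4.2.1-4.2.3.

import random

def run_tmt_ga(theme, tips, cfg, init_population, fitness,
               tournament_select, crossover, mutate):
    population = init_population(theme, cfg.population_size)         # step (1)
    for _ in range(cfg.max_generations):
        scores = [fitness(ind, tips, theme) for ind in population]   # step (2)
        parents = tournament_select(population, scores, cfg.population_size)
        next_gen = []
        while len(next_gen) < cfg.population_size:                   # step (3)
            a, b = random.sample(parents, 2)
            child = crossover(a, b) if random.random() < cfg.crossover_prob else a
            if random.random() < cfg.mutation_prob:
                child = mutate(child)
            next_gen.append(child)
        population = next_gen
    scores = [fitness(ind, tips, theme) for ind in population]
    return population[scores.index(min(scores))]   # smaller fitness is better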

4.2.1. Structure of Chromosome

Variations consist of three kinds of chromosomes: a melody chromosome, an accompaniment chromosome, and a status chromosome.

The melody chromosome holds the score information of the melody part, represented in the format shown in Figure 3. The given original theme music is used as the initial melody chromosome. The accompaniment chromosome holds accompaniment part information: the playing pattern number and the performance type of the obbligato part in the accompaniment are each represented with 1 byte, as shown in Figure 3. Initial accompaniment chromosomes take random values. The status chromosome holds information on tempo, tonality, and tone: the tempo, tonality, melody-part tone, and obbligato-part tone are represented in one chromosome, as shown in Figure 3. The tempo (60–200 [BPM]), tonality (major or minor scale), and tones are each represented with 1 byte. Initial status chromosomes also take random values.
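Illustrative containers for the three chromosome types are sketched below; the 1-byte value ranges follow the description above, but the exact layouts of Figure 3 are not reproduced.

from dataclasses import dataclass
import random

@dataclass
class MelodyChromosome:
    notes: list                 # melody-part score information (format of Figure 3)

@dataclass
class AccompanimentChromosome:
    pattern_number: int         # playing pattern number of the obbligato part (0-255)
    performance_type: int       # performance type (0-255)

@dataclass
class StatusChromosome:
    tempo_bpm: int              # 60-200 [BPM]
    tonality: int               # 0 = major scale, 1 = minor scale
    melody_tone: int            # tone of the melody part (0-255)
    obbligato_tone: int         # tone of the obbligato part (0-255)

def random_status_chromosome():
    """Initial status chromosomes take random values, as described above."""
    return StatusChromosome(tempo_bpm=random.randint(60, 200),
                            tonality=random.randint(0, 1),
                            melody_tone=random.randint(0, 255),
                            obbligato_tone=random.randint(0, 255))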

4.2.2. Calculation of Fitness Value [5]

Fitness values of chromosomes are calculated according to the inputted values of the TIPs and the melody of the original theme music [5]. Let i be the chromosome number, that is, the variation number, and let Fitness_i denote the fitness value of the i-th variation. Fitness_i is defined in terms of two components: F_m,i, the fitness value of the score information in the melody part of the i-th variation, computed with reference to [11], and F_p,i, the fitness value of the impressions of the i-th variation [5]. The impression values of variations are estimated by MNN models. These impression values are the degrees of the four pairs of adjectives used in the TIP estimation. The MNN models are obtained from the relation between the feature spaces of variations and the impression values.

The smaller the value of Fitness_i, the better the i-th variation. The procedure for calculating the fitness values is shown in Figure 4.
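A minimal sketch of this computation is given below. The squared-error comparison between MNN-estimated impressions and the target TIPs, and the simple sum of the two terms, are assumptions; they only illustrate that smaller values indicate better variations.

def impression_fitness(estimated_impressions, target_tips):
    """Distance between MNN-estimated impression values and the target TIP values."""
    return sum((estimated_impressions[k] - target_tips[k]) ** 2 for k in target_tips)

def fitness(variation, tips, theme, melody_fitness, estimate_impressions):
    # melody_fitness follows [11]; estimate_impressions wraps the MNN models.
    f_melody = melody_fitness(variation, theme)
    f_impression = impression_fitness(estimate_impressions(variation), tips)
    return f_melody + f_impression   # smaller values indicate better variations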

4.2.3. GA Operations

Parent candidates are selected by tournament selection according to the fitness values obtained in Section 4.2.2. Crossover with the crossover probability and mutation with the mutation probability are applied to the parent candidates, and a number of individuals are generated at random. Crossover and mutation are performed as follows.

Crossover
Uniform crossover is applied to the melody chromosomes group by group, where the groups are obtained by grouping structure analysis based on the generative theory of tonal music [12].

Mutation
Random values are assigned to the accompaniment chromosome and the status chromosome. The melody chromosomes are mutated by varying the score information of the melody part, as described in [5].
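Hedged sketches of these operators are given below. The grouping structure analysis itself is not reproduced; groups is assumed to be a list of (start, end) note-index ranges obtained from [12], and whole groups are exchanged at random. The parameters n_parents and k are illustrative.

import random

def tournament_select(population, scores, n_parents, k=2):
    """Tournament selection; smaller fitness values are better."""
    parents = []
    for _ in range(n_parents):
        contenders = random.sample(range(len(population)), k)
        best = min(contenders, key=lambda idx: scores[idx])
        parents.append(population[best])
    return parents

def group_uniform_crossover(melody_a, melody_b, groups):
    """Uniform crossover over melody chromosomes, applied group by group."""
    child = list(melody_a)
    for start, end in groups:
        if random.random() < 0.5:            # take this group from the second parent
            child[start:end] = melody_b[start:end]
    return child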

5. MNN Structure

The present system uses MNN models to represent (1) the relations between story scenes and the values of the TIPs in the MIA section, and (2) the relations between features of variations and musical impressions in the TMT section. The MNN models in the present system consist of the AMN, the IVMN, and the gating network, as shown in Figure 5. When the present system adjusts its MNN models for each user, the IVMN and the gating network are obtained by learning from the user's data on individual variation in feelings about music and stories.

The AMN is a hierarchical neural network model consisting of sigmoid neurons. It is constructed using questionnaire data on subjects' feelings about music and stories. The questionnaire data are obtained with reference to [6].

The IVMN is a neural network model consisting of radial basis function (RBF) neurons. An RBF responds to input values in a local area; therefore, an RBF network can be adjusted online quickly. When a user is not satisfied with the outputs of the MNN, learning data for the IVMN are generated and saved in the present system. The input values of the learning data are the input values of the MNN, and the output values of the learning data are the evaluation values given by the user.

The gating network is an RBF network that switches between the AMN and the IVMN. It judges whether the input values of the MNN are close to the area learned by the IVMN. When a user is not satisfied with the outputs of the MNN, learning data for the gating network are also generated and saved in the present system. The input values of these learning data are the input values of the MNN, while the output values of the IVMN's learning data are the evaluation values given by the user.

IVMN and the gating network are constructed by the method proposed in [13] using all data saved in the present system.

The outputs of the MNN models are defined as

    y_MNN = y_IVMN  if y_g >= θ,
    y_MNN = y_AMN   otherwise,

where y_g is the output value of the gating network, y_IVMN is the output value of the IVMN, y_AMN is the output value of the AMN, and θ is the threshold for switching between the AMN and the IVMN.
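A minimal sketch of this switching rule is given below; amn, ivmn, and gate are hypothetical callables returning the output values of the respective networks for the same MNN input.

def mnn_output(x, amn, ivmn, gate, threshold):
    """Switch between the IVMN and the AMN according to the gating network output."""
    g = gate(x)                 # closeness of x to the region learned by the IVMN
    return ivmn(x) if g >= threshold else amn(x)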

6. Experiments

Experiments are performed to evaluate the present system with 8 undergraduate and graduate students. In the experiments, the GA parameters described in Section 4.2 are set to fixed values, and the threshold θ for switching between the AMN and the IVMN by the gating network is set to 0.75. Musical works chosen at random from seventeen prepared MIDI files of classical or folk tunes are used as the theme music of the stories.

6.1. Construction of IVMN and Gating Network

The IVMN and the gating network for each subject are constructed by the following procedure.

(1) Story scenes and theme music are inputted to the present system. The present system generates variations according to each story scene and outputs them.
(2) When the subject is satisfied with one of the outputted variations, go to (8). When the subject is not satisfied with any variation, go to (3).
(3) The subject looks at the values of the TIPs estimated by the present system. The values of the TIPs are presented to the subject in the form shown in Figure 6.
(4) The present system adjusts the MNN models according to two cases, as shown in Figure 7: the subject feels that (a) the presented musical image is not suitable for the story scene, or (b) the generated variations differ from the presented musical image.
(a) When the subject feels that the presented musical image is not suitable for the story scene, the subject evaluates whether the values of the TIPs fit the story scene using the interface shown in Figure 6. Variations are generated according to the values of the TIPs evaluated by the subject. Go to (5).
(b) When the subject feels that the generated variations differ from the presented image, the subject chooses one of the variations and evaluates his/her impressions of it using the 7-point scale method shown in Figure 8, where the evaluation items are the same pairs of adjectives as those used in the TIP estimation. In this procedure, the interface shown in Figure 8 is used. Variations are then generated using the modified MNN models. Go to (6).
(5) The subject listens to the variations. When the subject is satisfied with one of them, go to (7). When the subject is not satisfied with any variation, go to (3).
(6) The subject listens to the variations. When the subject is satisfied with one of them, go to (8). When the subject is not satisfied with any variation, go to (3).
(7) The MNN models in the MIA section are adjusted according to the relation between the information on the story scene and the values of the TIPs evaluated by the subject.
(8) Go to the evaluation of the next scene.
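A compressed sketch of this adjustment loop is shown below; all user- and system-facing methods are hypothetical placeholders for the interfaces of Figures 6-8 and the MNN updates of Section 5, and the exact branch structure of steps (1)-(8) is simplified.

def adjust_for_scene(scene, theme, system, user):
    tips = system.estimate_tips(scene)
    variations = system.generate_variations(theme, tips)           # step (1)
    while not user.is_satisfied_with_any(variations):               # steps (2), (5), (6)
        if user.image_unsuitable_for_scene(tips):                   # case (a), Figure 6
            tips = user.correct_tips(tips)
            variations = system.generate_variations(theme, tips)
            if user.is_satisfied_with_any(variations):
                system.update_mia_models(scene, tips)               # step (7)
        else:                                                       # case (b), Figure 8
            chosen = user.pick_variation(variations)
            system.update_tmt_models(chosen, user.rate_impressions(chosen))
            variations = system.generate_variations(theme, tips)
    # step (8): proceed to the evaluation of the next scene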

6.2. Experiment 1

Three story scenes are inputted into the present system and variations are generated, where the story scenes are different from those used in Section 6.1 and the MNN models have been adjusted for each subject as in Section 6.1. This experiment confirms whether the present system generates variations on theme music that reflect the subject's feelings about music and stories.

Let the twelve story scenes be S_j (j = 1, ..., 12). A subject is asked to read S_j and to evaluate the musical image fitting S_j using the 7-point scale method (e.g., (7) very calm through (1) very violent), where the evaluation items are the same four pairs of adjectives as those used in the TIP estimation. Let the evaluation values of S_j by a subject be p_j = (p_j1, p_j2, p_j3, p_j4), and let the variations generated from the S_j by the present system be V_j (j = 1, ..., 12). The subject is asked to evaluate the impressions of V_j by the 7-point scale method, where the evaluation items are the same four pairs of adjectives as those used in the TIP estimation and the V_j are presented to the subject in random order. Let the impressions of V_j evaluated by the subject be q_j = (q_j1, q_j2, q_j3, q_j4), where p_j1 and q_j1, p_j2 and q_j2, p_j3 and q_j3, and p_j4 and q_j4 are the evaluation values of "violent-calm," "heavy-light," "clear-muddy," and "sad-happy," respectively. These variables take integer values in [1, 7] evaluated by the subject. In this experiment, the cosine correlation [14] between p_j and q_j is used to evaluate whether the generated variations reflect the subject's feelings about music and stories. The cosine correlation is defined as

    r_j = (p_j · q_j) / (|p_j| |q_j|).

When r_j is close to 1.0, the generated variations reflect the user's feelings about music and stories well.
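A small worked example of the cosine correlation used here is given below; the two vectors stand for the four 7-point ratings p_j and q_j defined above, and the numeric ratings are made up for illustration.

import math

def cosine_correlation(p, q):
    """Cosine correlation between two vectors of 7-point ratings."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

# Identical rating vectors give 1.0; similar but not identical vectors give slightly less.
print(round(cosine_correlation([6, 2, 5, 4], [6, 2, 5, 4]), 3))   # -> 1.0
print(round(cosine_correlation([6, 2, 5, 4], [4, 3, 5, 5]), 3))   # -> 0.962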

6.3. Result 1

The cosine correlations r_j are shown in Table 3. It is found that 80% of all the r_j values are 0.9 or more, so the present system is able to generate variations reflecting the subjects' feelings about music and stories.

6.4. Experiment 2

Another three story scenes are inputted into the present system and variations are generated, where the MNN models in the present system have been adjusted for each subject as in Section 6.1. A subject is asked to evaluate, with the 7-point scale method, whether the variations on the theme music fit the impressions of each presented story scene: (7) very suitable, (6) suitable, (5) a little suitable, (4) neutral, (3) a little unsuitable, (2) not suitable, (1) not suitable at all. Furthermore, to confirm the effectiveness of the IVMN, the variations generated by the present system are compared with those generated by an average system, where the average system is the system whose IVMNs in the MNN models are averaged among subjects.

6.5. Result 2

Experimental results are shown in Table 4. It is found that the present system receives an evaluation of (5), (6), or (7) for approximately 87.5% of all evaluation results.

The distributions of the evaluation values of the variations generated by the present system and by the average system are shown in Figure 9. It is found that the variations generated by the present system receive higher evaluations in a larger percentage of cases than the variations generated by the average system. The average and the variance of the evaluation values are also shown in Figure 9. The variance of the evaluation values of the variations generated by the present system is lower than that of the variations generated by the average system. It is therefore found that constructing the IVMN and the gating network for each user is effective.

Figure 10 shows examples of the variations on theme music for subjects A and B, where the original theme music is Twinkle Stars, as shown in Figure 11. From these figures it is found that, although the same scenes are given to the subjects, the theme tune is transformed differently for each subject by the present system.

The variations on theme music generated by the present system depend on the subjects' impressions of the story expressed by the pictures. Therefore, even if the same pictures are given, the generated variations differ among subjects. Nevertheless, the subjects themselves are satisfied with the generated variations. It is thus found that the present system generates variations on theme music that fit the subjects' impressions of the story well. However, subjects' impressions of a story usually change with time and with the environment in which the subjects are placed. The present system does not yet deal with variations depending on such factors as time and environment. This is left as future work.

7. Conclusions

This paper presents a system that transforms theme music to fit story scenes represented by texts and/or pictures and generates variations on the theme music. The present system varies the (1) melodies, (2) tempos, (3) tones, (4) tonalities, and (5) accompaniments of the given theme music based on impressions of the story scenes using neural network models and GAs. Differences in people's feelings about music and stories are important in multimedia content creation. This paper proposes a method that adjusts the models in the present system for each user. The results of the experiments show that the system transforms theme music in a way that reflects users' impressions of story scenes.