Image synthesis from natural language descriptions has become a research hotspot in edge computing and artificial intelligence. With the help of generative adversarial networks, the field has made great strides in high-resolution image synthesis. However, the authenticity of synthesized single-target images is still deficient: for example, abnormal results such as “multiple heads” and “multiple mouths” appear when synthesizing bird images. To address such problems, a text-to-image single-target model based on a self-attention mechanism, SA-AttnGAN, is proposed. SA-AttnGAN (Attentional Generative Adversarial Network) refines text features into word features and sentence features to improve the semantic alignment of text and images; in the initialization stage of AttnGAN, the self-attention mechanism is used to improve the stability of the text-to-image model; and multistage GANs are stacked to finally synthesize high-resolution images. Experimental data show that SA-AttnGAN outperforms other comparable models in terms of Inception Score and Frechet Inception Distance. Analysis of the synthesized images shows that the model can learn background and colour information and correctly capture the structural information of bird heads, mouths, and other components, improving on the incorrect images such as “multiple heads” and “multiple mouths” generated by the AttnGAN model. Furthermore, SA-AttnGAN is successfully applied to description-based clothing image synthesis and shows good generalization ability.

1. Introduction

Image synthesis based on text description (text to image, t2i) draws on technologies such as computer vision and natural language processing and is an interdisciplinary, cross-modal task [1]. Given a natural language description as input, the model should synthesize images that are consistent with the description and carry complete semantic information. The task requires the computer to understand the semantic information of the text and convert it into pixels to generate a high-resolution, high-fidelity image, which is very challenging. It has wide application potential and can be used in computer-aided design, criminal investigation portrait generation, and similar areas.

The rapid development of deep learning has brought significant advances in the theory and technology of computer vision and natural language processing and has pushed text-based image synthesis towards high resolution, high authenticity, and high controllability. Ref. [2] used generative adversarial networks (GANs) [3]: a character-level recurrent neural network extracts sentence features from the textual description, and these features are fed together with noise into a cGAN [3]. To reduce the difficulty of high-resolution image synthesis based on GANs, Ref. [4] proposed StackGAN, which consists of two generative adversarial networks: the first stage generates low-resolution images, and the second stage refines them and gradually synthesizes high-resolution images. To improve the quality of synthetic images, Xue et al. proposed StackGAN++ [5]. In addition to using multiple GANs to generate multiscale images, they added a colour-consistency regularization term to the loss, which keeps the colours of images at different scales consistent during training and reduces the instability of GAN training. Xu et al. introduced a global attention mechanism [6] in AttnGAN [7]. They proposed a deep attentional multimodal similarity model (DAMSM), which uses word-level and sentence-level text features as input to improve the matching between text and images.

However, GAN-INT-CLS, StackGAN, and StackGAN++, which are adversarial-network methods for producing high-quality images, use only sentence-level features as text features and therefore lose important image details during synthesis. AttnGAN embeds both sentence-level and word-level features as text features, which improves semantic alignment. In addition, although AttnGAN uses a global attention mechanism between text and images to increase the detail of the generated images, it often generates wrong images of birds that do not conform to natural laws, such as “two heads” and “two eyes.” To address the nonsemantic bird images generated by AttnGAN, a GAN-based t2i network model is proposed that uses a self-attention mechanism in the initial stage of the model. The mechanism helps the model learn important spatial and positional information when synthesizing low-resolution images and improves the accuracy of image generation in the initial stage, thereby improving the correctness of high-resolution image synthesis. The resulting single-target text-to-image model, SA-AttnGAN (attentional generative adversarial network), also refines text features into word features and sentence features to improve the semantic alignment of text and images.

The main contributions of this article are the following two points: (1) Based on the AttnGAN model, this article proposes adding a self-attention module in the initial stage to correct the original model's tendency to generate bird pictures that do not conform to the norm and to optimize the IS and FID scores on the CUB [8] dataset. The actual generation results show that the SA-AttnGAN network model proposed in this article can generate realistic and natural bird pictures. (2) A text-to-image clothing dataset is also produced, expanding the application field of t2i technology and laying a data foundation for other researchers.

1.1. Organization

This work is organized as follows. Section 1 gives the introduction, and Section 2 reviews the related work in this field. Section 3 describes the proposed model, and Section 4 presents the experiments and result analysis. The last section concludes the work.

2. Related Work

Early text-based image synthesis mainly combined retrieval and supervised learning [9]. First, the informative and “imageable” text units are determined by the correlation between keywords (or key phrases) and the image; then, based on the given conditions, each text unit retrieves the image regions most likely to be related to it; finally, an image layout is optimized to associate the textual description with the image content. However, due to limited training methods, this approach can only change the characteristics of specific images and cannot synthesize images with entirely new content based on textual descriptions. As research deepened, the Attribute2Image [10] method was proposed, which models each image as a combination of foreground and background. Attribute2Image learns from given attributes to generate images with different characteristics, such as gender, hair colour, and age. Although the above techniques can synthesize relatively realistic images, they are still restricted to a limited set of descriptive attributes. With the development of multimodal learning, a batch of image synthesis models based on generative adversarial networks and deep convolutional decoders has emerged [11, 12]. Generative adversarial networks (GANs), proposed in Ref. [13], mainly consist of a discriminator and a generator. The generator tries to generate synthetic images and thus “trick” the discriminator; the discriminator tries to distinguish between authentic images and synthetic images. Because of these characteristics, GANs can be used for image synthesis based on text description, with the goal of adversarial training defined accordingly: through the continuous “generation” and “discrimination” of real images and “fake images,” the relationship between image content and text description is gradually improved, and finally images are synthesized from the text description. Picture-generating technology may still have certain drawbacks, such as overcrowding, artificial visual mismatches, and uniformity or normalization of scanned images, although these concerns are rare.

Tang et al. pioneered the use of deep convolution-based GANs (DC-GANs) [14] for text-to-image synthesis [15]. DC-GANs use a character-level recurrent neural network to extract sentence feature vectors from text descriptions and use them together with noise as input to the GAN [16]. StackGAN [17] focuses on improving the quality of synthetic images and increases their resolution from 64 × 64 to 256 × 256 through two GANs based on word feature vectors. As a further extension, StackGAN++ [18] turned StackGAN into an end-to-end network, which reduced the instability of GAN training, and added a colour loss function that improved the colour expression of the synthesized image. It is a multistage generative adversarial network that builds on edge computing to limit the latency incurred when processing image synthesis data across multiple generative networks and works as a real-time processing model. Edge computing, however, requires greater storage space; because of the large datasets involved in image synthesis, it also poses a significant security risk and requires sophisticated infrastructure, which limits the use of this model. Given the successful application of attention mechanisms in various fields of deep learning, AttnGAN [19] first introduced a global attention mechanism into text-to-image synthesis. It uses a text encoder to extract text feature vectors at the sentence and word levels, calculates their similarity to global and local image features, and improves both the correlation between synthetic images and description texts and the image clarity through the proposed DAMSM pretraining method. As research has deepened, image synthesis based on text description has achieved unprecedented resolution, multi-object capability, and controllability: HD-GAN [20] uses a cascaded network structure to increase the resolution to 512 × 512.
An edge-assisted generative adversarial network has been proposed for photo-realistic image creation from semantic layouts [21], and an edge-preserving MRI image reconstruction technique based on intermittent multiscale feature engineering and a generative adversarial network (EP IMF-GAN) has also been presented [22]. Despite these improvements, the resolution of image sequences remains far from optimal because of largely unaddressed obstacles. Obj-GAN [23] can synthesize multitarget images with a complex layout, generating them step by step from layout and shape to content, and alleviates the model collapse problem in complex image synthesis. To solve the problem that changing specific text attributes (colour, target) resets the overall composition, ControlGAN [24], based on the AttnGAN structure, uses channel and spatial attention mechanisms to strengthen the matching between words and image region features, together with a perceptual loss and other constraints.

3. Network Model (SA-AttnGAN)

Like AttnGAN, the SA-AttnGAN network structure proposed in this article is divided into a pretraining network and a multistage generative adversarial network. The DAMSM module [25] from AttnGAN is used in the pretraining network. This module contains a text encoder and an image encoder to extract features and calculate the DAMSM loss, which forms part of the generator loss function. The generative adversarial network consists of three generator-discriminator pairs that process images at the 64 × 64, 128 × 128, and 256 × 256 stages, respectively.

Like most GAN-based t2i models, the self-attention-based text-to-image network (SA-AttnGAN) proposed in this article adopts a multistage high-resolution image synthesis strategy (Figure 1). The generators G0, G1, and G2 synthesize images with resolutions of 64 × 64, 128 × 128, and 256 × 256, respectively.

In the G0 stage, the conditioning augmentation module Fca (the CA [26] module in Figure 1) is first used to process the sentence feature vector ē to obtain a low-dimensional text conditional vector. This vector is then concatenated with the noise vector and fed into F0, which contains multiple upsampling blocks. The output is a hidden node that contains the image information generated in the initial stage.
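A minimal sketch of this input step, assuming that the CA module samples the condition vector from a Gaussian whose mean and variance are predicted from the sentence feature. The function name, the weight matrices `W_mu` and `W_logvar`, and the 100-dimensional sizes of the condition and noise vectors are illustrative assumptions and are not taken from this article; only the 256-dimensional sentence feature matches the settings reported in Section 4.

```python
import numpy as np

def conditioning_augmentation(e_bar, W_mu, W_logvar, rng):
    """Sample a low-dimensional condition vector from the sentence feature.

    Assumed form: c = mu + sigma * eps, with eps ~ N(0, I).
    """
    mu = W_mu @ e_bar                  # predicted mean
    logvar = W_logvar @ e_bar          # predicted log-variance
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
e_bar = rng.standard_normal(256)                   # sentence feature vector
W_mu = 0.01 * rng.standard_normal((100, 256))      # hypothetical weights
W_logvar = 0.01 * rng.standard_normal((100, 256))  # hypothetical weights

c = conditioning_augmentation(e_bar, W_mu, W_logvar, rng)
z = rng.standard_normal(100)           # noise vector
f0_input = np.concatenate([z, c])      # input to the upsampling blocks F0
print(f0_input.shape)
```

Sampling the condition vector, rather than using a fixed projection of the sentence feature, is one common way to smooth the conditioning space when few captions are available per image.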

Unlike AttnGAN, this article introduces a self-attention mechanism [27], which assigns additional weight information through autonomous learning between image feature maps. As a result, the final feature map contains more spatial and positional information.

As shown in Figure 2, the hidden node of the initial stage is first transformed into two feature spaces, f and a second feature space, each through its own perception layer (Wf for f).

The weight information βj,i is then calculated from these two feature spaces. βj,i represents the weight of the i-th position when synthesizing the j-th area of the image. Through the self-attention mechanism, the model learns the spatial and positional information in the feature map and assigns greater weight to the important details in the image, which helps generate more meaningful images in the initial stage. The hidden node is also transformed into a third feature space u, where Wu is the perception layer of the feature space u and is used to change the dimension size of the feature. Next, the weight map βj,i is multiplied by u to obtain the image feature matrix mj with the attention mechanism.

Finally, a 1 × 1 convolution (conv_1 × 1) converts the obtained image feature matrix mj back to the input feature space, so that the output image feature has the same size as the input image feature.

h0 denotes the output of Fsa (i.e., the SA module in Figure 1). With the attention mechanism, the image generated in the initial stage contains more meaningful positional and spatial information.
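The self-attention steps above can be sketched as follows. This is a NumPy illustration under stated assumptions rather than the implementation used in this article: the input feature map is called `x` and is flattened to shape (C, N) over N = H × W positions; the second feature space and its layer are called `g` and `W_g`; the final 1 × 1 convolution is represented by a matrix `W_v` applied at every position; and all channel sizes are arbitrary. The weights are computed as β_{j,i} = exp(s_{ij}) / Σ_i exp(s_{ij}) with s_{ij} = f(x_i)^T g(x_j), which is the usual form of this kind of self-attention.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_f, W_g, W_u, W_v):
    """x: (C, N) feature map; returns the (C, N) output and the (N, N) weights."""
    f = W_f @ x                    # feature space f, shape (C_k, N)
    g = W_g @ x                    # second feature space (assumed name g)
    s = f.T @ g                    # s[i, j] = f(x_i)^T g(x_j)
    beta = softmax(s, axis=0).T    # beta[j, i]: normalized over i for each j
    u = W_u @ x                    # third feature space u, shape (C_u, N)
    m = u @ beta.T                 # m[:, j] = sum_i beta[j, i] * u[:, i]
    return W_v @ m, beta           # 1 x 1 convolution back to C channels

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4                  # hypothetical sizes
x = rng.standard_normal((C, H * W))
W_f = rng.standard_normal((2, C))
W_g = rng.standard_normal((2, C))
W_u = rng.standard_normal((4, C))
W_v = rng.standard_normal((C, 4))

h0, beta = self_attention(x, W_f, W_g, W_u, W_v)
```

Each row of `beta` sums to one, so every output position is a weighted average of the features at all positions, which is what lets the initial stage relate distant parts of the bird such as the head and the beak.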

In the G1 and G2 stages, the hidden nodes hi of the different stages are used as input to generate images of different resolutions. Each stage combines the i-th stage global attention generation module [28] with Fi, the i-th stage network, which contains neural network layers such as upsampling blocks.
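Assuming the stage recursion follows the AttnGAN formulation on which this model builds, the three stages can be summarized as below. The symbols $z$ (noise vector), $e$ (word feature matrix), $F_i^{attn}$ (global attention module of stage $i$), and $\hat{x}_i$ (image generated at stage $i$) are notation assumed here; this is a sketch of the structure, not necessarily the exact equations of this article.

```latex
\begin{aligned}
h_0 &= F^{sa}\big(F_0(z, F^{ca}(\bar{e}))\big), \\
h_i &= F_i\big(h_{i-1}, F_i^{attn}(e, h_{i-1})\big), \quad i = 1, 2, \\
\hat{x}_i &= G_i(h_i), \quad i = 0, 1, 2.
\end{aligned}
```

In this reading, the only structural change relative to AttnGAN is the $F^{sa}$ wrapper in the first line.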

The generator loss function combines the adversarial generator loss with the DAMSM loss, which is derived using the pretrained network; λ is a hyperparameter that determines how much the DAMSM module affects the generator loss function [29].
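Assuming the same form as in AttnGAN, the overall generator objective can be written as below, where $L_{G_i}$ (the adversarial loss of the $i$-th generator) and $L_{DAMSM}$ are notation assumed here.

```latex
L = L_G + \lambda L_{DAMSM}, \qquad L_G = \sum_{i=0}^{2} L_{G_i}
```

With the setting λ = 5 reported in Section 4, the DAMSM term is weighted five times as strongly as it would be with λ = 1.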

As shown in Figure 1, the discriminators D0, D1, and D2 used in this article are computed in parallel, and their input image sizes are 64 × 64, 128 × 128, and 256 × 256, respectively. The loss of each discriminator Di, where i = 0, 1, 2, consists of two parts with different discriminative content: one part identifies whether the input image is real, and the other identifies whether the input image is semantically consistent with the text.
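A sketch of the two-part discriminator loss, assuming it takes the same form as in AttnGAN. The superscripts uncond and cond, and the symbols $x_i$ (real image at scale $i$) and $\hat{x}_i$ (generated image at scale $i$), are notation assumed here.

```latex
\begin{aligned}
L_{D_i} &= L_{D_i}^{\mathrm{uncond}} + L_{D_i}^{\mathrm{cond}}, \\
L_{D_i}^{\mathrm{uncond}} &= -\tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{\mathrm{data}_i}}\big[\log D_i(x_i)\big]
  - \tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log\big(1 - D_i(\hat{x}_i)\big)\big], \\
L_{D_i}^{\mathrm{cond}} &= -\tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{\mathrm{data}_i}}\big[\log D_i(x_i, \bar{e})\big]
  - \tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log\big(1 - D_i(\hat{x}_i, \bar{e})\big)\big].
\end{aligned}
```

The unconditional term judges realism only, while the conditional term also receives the sentence feature $\bar{e}$ and therefore penalizes realistic images that do not match the text.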

4. Experiment and Result Analysis

This article chooses AttnGAN as the comparison model. AttnGAN uses an attention mechanism in text-to-image generation and takes sentence-level and word-level text features as input to improve the clarity of synthesized images.

A single-layer Bi-LSTM [30] is adopted in the text encoder, the word embedding size is 300, and the dimension of text features is 256. The Inception-v3 [31] network is used in the image encoder to extract image features. The global image feature dimension is 2048, and the local image region feature contains 768 channels, each with dimension 289. These settings are the same as in the AttnGAN [32] network. Adam [33] is used as the optimizer in the training phase, and the learning rate is set to 0.0002. In the network loss function, the hyperparameter λ is set to 5, and the batch size is set to 10.

4.1. Dataset

In this article, the CUB dataset is selected for training the model. CUB is a public dataset in the t2i field produced by the California Institute of Technology and the University of California, San Diego [34], which contains more than 10,000 pictures of 200 species of birds. Among them, 8855 pictures are used for training, and 2933 are used for testing. Ten text descriptions accompany each image. The descriptions cover more than ten attributes, such as the bird's head, beak, breast, and crest.

4.2. Evaluation Parameter

To ensure the comparability of experimental results, this article selects the Inception Score (IS) [35] and the Frechet Inception Distance (FID) [36] for comparison. The IS evaluation for CUB was provided by StackGAN with a complete set of evaluation algorithms (https://github.com/hanzhanggit/StackGAN-inception-model) and is widely used in other t2i works. Both indices are used to evaluate the effect of adding the self-attention module to AttnGAN in the initial stage on the CUB dataset [37, 38].

The IS is calculated as follows:

$$\mathrm{IS} = \exp\left(\mathbb{E}_{x}\, D_{KL}\big(p(y \mid x)\,\|\,p(y)\big)\right)$$

where x represents the generated sample and y represents the label predicted by the algorithm. The score is based on the Kullback–Leibler divergence between the distributions p(y|x) and p(y); the larger the value, the better the model generation result. The higher the IS index, the clearer the generated image, the higher the diversity, and the better the model stability. FID is another commonly used evaluation index. It calculates the distance between the real samples and the generated samples in the feature space:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where μ_r represents the mean of the real image features, μ_g represents the mean of the generated image features, Σ_r represents the covariance matrix of the real image features, Σ_g represents the covariance matrix of the generated image features, and tr represents the trace of a matrix. The lower the FID value, the better the quality and diversity of the image.
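Both indices can be sketched in a few lines once the classifier outputs and image features are available. The NumPy code below is an illustration only: it assumes the class probabilities p(y|x) and the feature vectors have already been extracted by an Inception network, the function names are hypothetical, and it omits details of the reference evaluation code such as averaging the IS over several splits.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) array holding p(y|x) for N generated samples."""
    p_y = probs.mean(axis=0, keepdims=True)      # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))              # exp of the mean KL divergence

def frechet_distance(feat_r, feat_g):
    """feat_r, feat_g: (N, D) features of real and generated images."""
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    sigma_r = np.cov(feat_r, rowvar=False)
    sigma_g = np.cov(feat_g, rowvar=False)
    # tr((Sigma_r Sigma_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative for
    # positive semi-definite covariance matrices.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))             # stand-in for real features
shift = np.zeros(8)
shift[0] = 3.0
fid_same = frechet_distance(real, real)          # identical feature sets
fid_shifted = frechet_distance(real, real + shift)  # mean moved by 3 in one dim

confident = np.tile(np.eye(5), (4, 1))           # one-hot, evenly spread classes
uniform = np.full((20, 5), 0.2)                  # uninformative predictions
is_confident = inception_score(confident)
is_uniform = inception_score(uniform)
```

In this toy setup the IS is close to the number of classes when predictions are confident and evenly spread, and close to 1 when every prediction is uniform; the FID is close to 0 for identical feature sets and grows with the squared distance between the feature means.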

4.3. Result Analysis

The SA-AttnGAN model proposed in this article is trained for 600 epochs on an RTX TITAN V graphics card with 11 GB of video memory. About 30,000 images are generated from the test set for index comparison. The results are shown in Table 1 and Figures 3–5. As shown in Table 1, compared with other representative methods, the model in this article has the highest IS value, with a score of 4.52 ± 0.03, and the lowest FID value, with a score of 14.25. Compared with AttnGAN, the IS index is improved by 0.16, and the FID index is reduced by 0.13.

Figure 4 shows the change in the IS index over 600 epochs. The abscissa is the number of epochs, and the ordinate is the IS value. After 450 epochs, the IS value of our method is better than that of AttnGAN.

Figure 6 shows the change in the FID index over 600 epochs. The abscissa is the number of epochs, and the ordinate is the FID value. After 380 epochs, the FID value of the method in this article is better than that of AttnGAN.

The above charts show the benefit of using the self-attention mechanism in the initial stage, where the weight mask map is generated by autonomously learning the feature information within the image. The feature map finally produced in the initial stage integrates more spatial and positional information, which helps the model generate structure. This further improves the effect of high-resolution image synthesis and enhances the clarity and diversity of the synthesized images. In addition, sentence-level and word-level text features are used simultaneously to extract more textual information and improve the semantic consistency between text and images.

As shown in Table 2, the influence of different values of the hyperparameter λ on the two metrics is also evaluated. λ controls the weight of the DAMSM network module in the overall network. Among the values 0.1, 1, 5, and 10, λ = 5 gives the best results on both indices.

4.4. Comparison of Synthetic Effects

Figure 5 compares the experimental results of SA-AttnGAN and several common methods. Among them, HDGAN [10], StackGAN++ [6], and AttnGAN [7] are run with their officially implemented models, and the 2933 test set texts are tested in the same experimental environment. The HDGAN model is inspired by StackGAN [5], proposes an end-to-end model, and introduces hierarchical adversarial objectives, focusing on improving the resolution of image generation. However, it does not pay attention to the structural information of the generated images. As a result, some attributes of the birds generated by HDGAN look unnatural. For example, in the third group of experiments, the proportion of the bird eyes generated by HDGAN is inconsistent. The StackGAN++ [6] model is based on the StackGAN [5] model, changing it to an end-to-end model and adding a colour regularization loss, and focuses on improving the colour consistency of the images generated at multiple stages. However, this method also does not learn spatial and positional information well. As a result, in the third group of experiments, for example, the rendered birds blend into the background, and the overall generation fails. AttnGAN [7] uses an attention mechanism and sentence-level and word-level text features to enhance the semantic alignment of text and images. Still, the weights of essential attributes of birds cannot be learned well, and the model pays too much attention to some details. For example, in the tenth group of experiments, two mouths were generated, while in the third group of experiments detailed information such as the mouth was missing. The SA-AttnGAN method adds a self-attention mechanism module in the initial stage so that the model can learn the correct attribute weight distribution. For example, in the third group of experiments, the generated birds are complete and natural, indicating that this method improves the visual quality of text-generated single-target images.

Figure 6 shows the ablation experiment for the self-attention mechanism module. SA-AttnGAN indicates that the self-attention mechanism is used, and AttnGAN indicates that it is not used. Figure 6 is divided into four groups, each with six comparative experiments. The first three groups show that both the SA-AttnGAN and AttnGAN methods synthesize realistic and natural bird pictures. For example, the images synthesized for the first sentence of the third group conform to the text semantics, including details such as “brown bird” and “white belly.” The fourth group shows cases where generation partially fails. For example, in the fourth group AttnGAN synthesized a “multiheaded” bird for one sentence and failed to create a bird for the second sentence, whereas SA-AttnGAN synthesized correct bird pictures. These two cases are analysed and illustrated below. In addition, the fourth group of experiments also shows that both the SA-AttnGAN and AttnGAN methods failed to generate some images, such as those for the third, fourth, fifth, and sixth sentences. The analysis is that text descriptions such as “large bird” and “large wings” lead the model to synthesize flying birds. However, since there are few pictures of soaring postures in the dataset, the model does not thoroughly learn the distribution of such images, which ultimately affects the image generation results.

4.4.1. Analysis of Inappropriate Images such as “Multiple Heads” and “Multiple Mouths”

Figure 7 shows some well-generated synthetic pictures of birds, but the AttnGAN method also synthesizes inappropriate images during testing. Figure 8 shows six sets of high-resolution images generated by AttnGAN and by the model in this article. It can be seen that AttnGAN often generates abnormal bird pictures, with “multiple heads,” “multiple mouths,” and “multiple eyes.” For example, one image shows an extra bird head, 7-2-2 and 7-6-2 show two beaks, and 7-5-2 shows multiple eyes. The method in this article uses a self-attention mechanism in the initial stage so that the model can not only learn pixel information such as the background and colours but also capture the structural information of the target. It correctly generates the position and number of bird heads, beaks, and eyes and improves on the bird images synthesized by AttnGAN that do not match the text features.

4.4.2. Image Analysis of Bird’s Overall Generation Failure

The self-attention mechanism can learn important spatial and positional information, reduce errors such as “multiple heads” and “multiple mouths,” improve the stability of the t2i model, and generate more realistic bird pictures. As shown in Figures 9 and 10, some pictures generated by AttnGAN cannot be recognized as birds, and the generated shapes do not match the shape of real birds of that class. In comparison, the model proposed in this article can synthesize images that are strongly correlated with the textual feature information. Taking the third set of comparative experiments in Figure 10 as an example, the picture synthesized by this model correctly reflects text attributes such as “white and brown” and “multicoloured beak,” and its composition ensures that the content of the synthesized image is consistent with the text description and clearly distinguished from the background image features. In contrast, the images synthesized by the AttnGAN method are distorted and do not correctly reflect the textual semantic information.

5. Conclusion

Multistage generative adversarial networks are attracting considerable attention because they are used in AI techniques and computing models to explore anatomical visuals, images of somatic cells, and images of diverse human organs, together with fingerprint scanning. As a result, this strategy plays a key role in assessing images for diverse medical specialties and in criminology, for detecting fatal ailments and verifying new image clues for criminal justice. This article proposes a GAN-based t2i network model. By introducing the self-attention mechanism, the stability of the model is improved, and the IS and FID indices are optimized on the CUB dataset. The experimental results show that the network proposed in this article can generate clear, natural, realistic, and diverse single-target images and has a certain generalization ability. In addition, the Chinese t2i dataset is further enriched. Future research will focus on the controllability of text-generated clothing images and apply them in clothing generation and design.

Data Availability

The data shall be made available on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.