Abstract

Building on the basic theory of the anisotropic diffusion equation, this paper focuses on its application to image recognition in film and television production. To further improve the performance of the P-M (Perona-Malik) anisotropic diffusion model, an improved P-M model is proposed, and its application to ultrasonic image noise reduction is discussed. Experimental results show that the model effectively suppresses speckle noise while preserving the edge features of the image. Based on image recognition technology, an image frame testing system is designed and implemented. The diffusion-equation recognition method, combined with a multilevel artificial neural network, extracts and recognizes the multilayer feature points of the test object, which improves recognition accuracy and, to a certain extent, the audience rating of film and television works. For the problem of segmenting simple motion scenes, visual features of the film are used in the similarity calculation; at the same time, camera information is obtained, the change of camera motion is measured from the visual content of shot frames, and shot similarity is computed by weighting the diffusion-equation visual similarity with the motion information. By taking camera motion into account in image recognition, the method effectively alleviates the oversegmentation of action scenes such as fighting and chasing.

1. Introduction

With the advent of the Internet era, network image traffic is increasing day by day. Many users like to watch films or TV series with period characteristics, and such dramas account for a large share of viewing. Because the content of these dramas is set in the past, the actors' clothes and the scenes sometimes do not conform to the historical background, resulting in conspicuous mistakes. It is therefore particularly important to ensure that the characters and scenes in each film or TV series are consistent with the background of the time, in order to improve the viewing experience and reduce the burden on producers. Detecting the factors that do not conform to the historical setting through diffusion-equation-based image recognition greatly reduces the work of the director and producer and solves some of the clothing and scene problems in current film and television drama.

Among various forms of digital images, film and television images are the most accessible and indispensable in people's daily life. Like other digital images, film and television images are unstructured data in form, but in content they often have a strong plot structure, typically appearing as a combination of scenes connected or associated with each other through the plot. This provides a factual basis for automatic segmentation of the scene structure and scene recognition of film and television images. Generally, the description data of film and television image content takes the following forms: (1) Image metadata, which records information about the image (including title and type) and its production (director, cast, producer, distributor); the metadata of film and television images is generally provided by digital image distributors along with the digital media resources [1]. (2) Image structure data; image structure usually refers to the logical structure between image frames and image fragments, such as the connection, switching, and transfer between shots [2]. (3) Image semantic data, which describes the scene, action, story plot, theme, and other semantic content of the image itself [3]; it is often obtained by identifying features of image frames and audio data, as well as from subtitles and other auxiliary explanatory files.

Using image processing and recognition technology, this work aims to develop software that can identify what an actor is wearing and whether the scene is appropriate for the shoot. Using deep learning neural network technology on top of image recognition, a multilevel artificial neural network is designed that can extract, analyze, identify, and detect the feature points of the target and effectively find the factors in film and television plays that do not conform to the historical setting.

Image recognition has developed greatly with the help of deep learning technology and has been widely applied in various fields at home and abroad. Some scholars have applied image recognition to content detection in image frames. Addressing the scene- and clothing-retrieval problems in existing frameworks [4], and the recognition of missing clothing information for dress-design optimization and scene-design recognition, a new garment segmentation method and a dress-design recognition method based on cross-domain dictionary learning have been proposed. A new image information retrieval algorithm based on scale-invariant feature transform (SIFT) features [5, 6] has been applied to content-based image retrieval, improving on traditional image similarity methods to scan image content accurately and quickly. A classic deep network [7] has been used to extract clothing and scene features, with repeated training on specialized data to make the learned features more discriminative. Image recognition technology has penetrated ever more deeply into daily life. To improve audience ratings and make the scenes of costume dramas more accurate, image recognition provides new ideas for checking whether the scenes in films and TV dramas conform to their historical background [8].

Research on the structure analysis of film and television images can be divided, by temporal granularity, into two categories: shot-based segmentation and scene-based segmentation. In shot-based segmentation, the image is first detected and represented as a set of shots of fixed or variable length. A shot segmentation algorithm based on the image color histogram [9] has been proposed; this method and its variants are widely used in shot segmentation. Scene-based segmentation takes the background during the development of the plot as a reference and clusters temporal sequences with the same semantics to form a shot set representing the scene. Scene-based image segmentation is also called image scene boundary detection. In general, scene boundary detection algorithms can be divided into two categories: methods based on domain-related prior models and domain-independent methods based on image production principles [10]. The domain-related approach needs to establish corresponding prior models according to the image type and related domain knowledge [11]; before it can be applied to a new domain, a prior model depending on domain expertise must be built, which limits its applicability. The domain-independent approach, by contrast, is usually based on the principles of image production: it clusters shots according to visual similarity under temporal constraints, grouping shots that express similar scene semantics within a continuous time span into shot clusters, and then constructs scenes on this basis. This approach does not require professional knowledge of the related fields, so it is widely used in scene boundary detection of film and television images.

Some scholars proposed a scene transition graph model to describe the evolution of scenes [12]. Each vertex in the graph represents a shot; each edge represents the visual similarity between two shots in position and time; the complete-link method is used to calculate shot similarity, and hierarchical clustering is used to form subgraphs, each of which represents a scene. A similar graph clustering method is used to find story units in the image [13]. To detect scene changes with such visual algorithms, the similarity between every pair of shots must be calculated; when the image file contains a large number of shots (a feature film can contain thousands or more), this requires a great deal of time and computation and is therefore inefficient. At the same time, since shots belonging to the same scene must be continuous in time, a time-limited approach can be used for shot clustering. Literature [14] proposed a similarity-based shot clustering method within a fixed time window T; however, because plot and story rhythm differ between images, shot lengths also vary greatly, so a window of a fixed number of shots is more reasonable than a fixed time window. Literature [15] proposed a scene detection algorithm based on a sliding window of fixed shot count. Literature [16] analyzes the background pictures of shots in the same scene to complete scene boundary detection of film and television images. For scene recognition, the low-level features of the image are usually used, and machine learning methods complete the mapping from low-level features to scene categories.
According to the type of features used, scene recognition methods can in general be divided into those based on global features and those based on local features. A typical global feature is the GIST descriptor [17, 18], which uses the frequency-domain information of the image to describe the global shape of the scene; this method has achieved good results in outdoor scene recognition and classification, but its performance on indoor scenes is less satisfactory. Relative to global features, local features better describe the detailed information of the image and therefore outperform global features in many recognition tasks. Typical local-feature-based scene recognition methods include hierarchical-model classification, spatial pyramid matching, and methods based on the CENTRIST descriptor [19]. The above methods achieve certain results on specific scene recognition tasks but are often poor when applied directly to film and television images. Moreover, most current research on scene recognition targets classification within a particular image category, such as football or table tennis in sports images or advertisement recognition; scene classification of video images is comparatively complex, because for artistic reasons filming involves many changes of shooting angle and light intensity, as well as extensive camera movement, long takes, and other shooting techniques, all of which make scene recognition of film and television images more difficult. In literature [20, 21], action recognition and scene recognition are combined, and a variety of local features are extracted for joint action and scene training and recognition.
At the same time, traditional low-level features contain very limited semantic information, so representations built on them are often inadequate for semantically complex tasks [22, 23]; as a result, a given low-level feature often works well for one specific task but only moderately for others. Using intermediate semantics to represent images has therefore long been an important research direction. The Object Bank representation [24-26] achieves good results in scene recognition: it uses the objects contained in the image as intermediate semantic features and applies an SVM classifier for scene recognition.

However, all of the above methods have limitations. Some do not take into account the temporal locality of shots within the same scene and therefore require a great deal of computation; others consider only the color similarity of shots, so that the similarity between shots in motion is smaller than between static shots, which leads to oversegmentation. In this paper, the global similarity between shots is combined with the motion characteristics of the shots, and a sliding-window technique based on the number of shots is then used to detect image scenes.

3. Image Recognition Based on Anisotropic Diffusion Equation

3.1. Diffusion Model of Heat Conduction Equation

The intermediate score vector map of a set of object-aware filters is used as the image feature, and a simple linear classifier is used for the scene recognition task. The proposed image recognition method is shown in Figure 1. Image recognition using the diffusion equation can easily distinguish and classify different scenes [27]. At the same time, with the popularization of the Internet, it is becoming easier and easier to obtain large-scale digital images, and the corresponding object recognition model can be easily trained on annotated object images.

After discretization, where I denotes the image intensity and t the time, the classical P-M equation is
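The display equation appears to have been lost from the text. As a reference reconstruction from the standard literature (this is the classical model and its two original coefficients, not the improved coefficients proposed by this paper, which are not recoverable here):

```latex
\frac{\partial I}{\partial t} = \operatorname{div}\!\left( c\!\left(\lVert \nabla I \rVert\right) \nabla I \right),
\qquad
c_1(s) = \exp\!\left[-\left(\frac{s}{K}\right)^{2}\right],
\qquad
c_2(s) = \frac{1}{1 + \left(s/K\right)^{2}}
```

Here K is the gradient-amplitude threshold parameter.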

By analyzing the characteristics of the anisotropic diffusion function and following the design principle of the diffusion coefficient, where c denotes the diffusion coefficient, the following diffusion coefficients are constructed:

where K represents the threshold parameter of the edge amplitude. An automatic estimation method for the diffusion threshold is
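As a minimal numerical sketch of the discretized scheme (using the two classical Perona-Malik coefficients, since the paper's improved coefficient is not reproduced in the text; boundary handling and the parameter defaults below are assumptions):

```python
import numpy as np

def pm_diffuse(img, n_iter=20, K=15.0, dt=0.2, exp_coeff=True):
    """Classical Perona-Malik diffusion, explicit four-neighbour scheme.

    Boundaries are handled periodically via np.roll for brevity;
    dt <= 0.25 keeps the explicit update stable."""
    if exp_coeff:
        c = lambda g: np.exp(-(g / K) ** 2)   # c1: favours high-contrast edges
    else:
        c = lambda g: 1.0 / (1.0 + (g / K) ** 2)  # c2: favours wide regions
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # differences toward the four nearest neighbours
        dN = np.roll(u, 1, axis=0) - u
        dS = np.roll(u, -1, axis=0) - u
        dW = np.roll(u, 1, axis=1) - u
        dE = np.roll(u, -1, axis=1) - u
        u += dt * (c(np.abs(dN)) * dN + c(np.abs(dS)) * dS
                   + c(np.abs(dW)) * dW + c(np.abs(dE)) * dE)
    return u
```

Running this on a noisy image reduces pixel variance while attenuating diffusion across strong gradients, which is what preserves edges.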

3.2. Image Recognition Interaction of Diffusion Equation

For image matching based on feature points, the relevant elements of a scene picture similar to the image are obtained by locking onto one frame. Edge features are extracted from the feature points of the two images, the pixel gray levels of the corresponding feature-point regions are computed, and the correlation coefficient is used to determine whether the two images are from the same period. The feature-point-based image matching process is shown in Figure 2.
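The correlation-coefficient check on the gray levels of two matched feature-point regions can be sketched as follows (the patch-extraction step around each feature point is assumed to have been done already):

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two equally sized gray patches.

    Returns a value in [-1, 1]; values near 1 indicate the regions match."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

A threshold on this score (a design choice, e.g. 0.8) would then decide whether the two regions are declared the same.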

The deformable part model mainly solves the problem of object recognition under different viewing angles and deformations. The object is represented as a set of deformable parts in relative positions, each part model describing a local feature of the object; the relative positions of the parts are variable through spring-like connections. Discriminative learning is carried out using images annotated only with the bounding box of the whole object. The deformable part model has consistently achieved the best results in PASCAL VOC recognition tasks.

The deformable part model uses the gradient direction histogram feature and, as an alternative, a feature vector obtained by combining both contrast-sensitive and contrast-insensitive features. The detection scores of small and large objects in the image are calibrated. The gradient direction histogram is currently a commonly used image feature descriptor for object detection: it computes histograms of gradient direction over locally dense cells of uniform size and uses overlapping local contrast normalization to improve detection accuracy.

Let (x, y) be a pixel of the image, and let θ(x, y) and m(x, y) be the gradient direction and gradient amplitude of the image at position (x, y), respectively:
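Assuming central finite differences (the exact discretization is not specified in the text), the per-pixel gradient direction and amplitude used by HOG-style descriptors can be computed as:

```python
import numpy as np

def gradient_dir_mag(img):
    """Per-pixel gradient direction (radians) and magnitude.

    np.gradient uses central differences in the interior and one-sided
    differences at the borders."""
    gy, gx = np.gradient(img.astype(np.float64))
    theta = np.arctan2(gy, gx)   # gradient direction
    mag = np.hypot(gx, gy)       # gradient amplitude
    return theta, mag
```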

The wavelet denoising method, which is effective at removing Gaussian noise, is compared with the nonlinear total variation partial differential equation denoising method. In wavelet denoising, different thresholds are selected according to the decomposition scale:

where N is the signal length, j is the current decomposition level, and J is the maximum number of decomposition levels. The estimator for the wavelet coefficients is
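The scale-dependent threshold formula itself is not recoverable from the text; as a hedged sketch, the base Donoho-Johnstone universal threshold and the standard soft-thresholding estimator it is typically combined with look like this:

```python
import numpy as np

def universal_threshold(sigma, n):
    """Donoho-Johnstone universal threshold lambda = sigma * sqrt(2 ln n).

    The paper's level-dependent variant would rescale this per level j."""
    return sigma * np.sqrt(2.0 * np.log(n))

def soft_threshold(w, lam):
    """Soft-thresholding estimator: shrink coefficients toward zero,
    zeroing any whose magnitude falls below the threshold."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
```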

The measurement parameters calculated by using the image recognition method of diffusion equations (1) to (8) are listed in Table 1.

4. Image Scene Structure Analysis and Recognition Based on Film Diffusion Equation

With the rapid development of digital multimedia technology and the popularization of the Internet, digital image resources have increasingly become an important part of people's daily entertainment. At the same time, because film and television images are unstructured data, how to effectively organize and manage them and give users the ability to quickly locate content of interest is an important topic in the study of image content. The video image scene structure analysis and recognition system is aimed at segmenting the video image into scenes and obtaining structural units with the scene as the semantic unit, so as to realize scene-based semantic structured storage and management. At the same time, scene recognition is carried out on scene fragments of film and television images, and scene tags are automatically annotated to obtain the scene semantic content, providing scene-semantic-based retrieval for film and television images.

For an input image, the scene structure is analyzed first. Scene change detection is based on shots: first, shot segmentation is performed on image frames according to visual similarity, and each shot is represented by several key frames; then the visual and motion features of the shot key frames are used to cluster visually similar shots, and adjacent, interleaved similar shots are merged according to the pattern of scenario development, yielding the scene structure units of the image. Next, object recognition over a predefined object set is applied to the key frames of each scene clip, and the detection results are aggregated statistically to obtain a panoramic view of the scene; the max-pooling and average-pooling results over all key frames are used as the scene clip's features for scene training and recognition.

After the data set of key points is obtained, the landing times of the left and right feet are marked and other preprocessing is performed manually, which forms the basis for the subsequent model training. We expect to fit the original data to the marked data through machine learning and finally train a model that can determine the footfall from the coordinate relationships of the human body nodes in each frame, which can then be used for image recognition in actual work. In this step, two common machine learning algorithms, support vector machines (SVM) and the multilayer perceptron (MLP), were used. We observed their performance during the process, compared their results, and evaluated their differences.

For the acquisition of training samples, a GH5 camera was used to shoot individual motion image segments of several different states in 1080p/60fps format, covering two camera positions and four movement dimensions (Table 1), so as to ensure the diversity of character movement. Single-person rather than multiperson footage was used at the start of the experiment to control unnecessary variables as far as possible; in fact, the pose estimation accuracy of Openpose for multiple people in the same image is basically the same as for a single person, so if the trained model works on single-person footage, multiperson processing only multiplies the same simple calculations. The original reason for choosing 60 fps was consistency with the time base of minutes and seconds, reducing some errors in the manual corrections that might be needed for certain segments. However, the increased amount of test data caused unnecessary disturbance to the accuracy of the model in the experiment. After observation, it was found that a lower frame rate can solve both problems to some extent, not only improving accuracy but also reducing computation cost. Therefore, the frame rate and resolution of the footage were reduced to 720p/30fps. The data processing process is shown in Figure 3.

It is difficult for the MLP or SVM algorithm to extract features from the internal information of the picture. Each point is an absolute coordinate value, which means that when the character moves from left to right in the picture, the value of the point gradually increases, and when the character moves up and down, the value produces some noise. The absolute coordinate values of x and y can be converted to relative coordinate values by:
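A minimal sketch of the absolute-to-relative conversion, assuming one joint (the choice of reference joint, e.g. the neck, is an assumption) is used as the origin:

```python
import numpy as np

def to_relative(points, root_idx=0):
    """Convert absolute (x, y) keypoints to coordinates relative to a
    reference joint, so horizontal travel across the frame cancels out."""
    pts = np.asarray(points, dtype=np.float64)
    return pts - pts[root_idx]
```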

When a body part of the figure is occluded or outside the picture, Openpose cannot reliably infer its pose, and the value of the corresponding point will be missing. As a result, the values of this part become discrete outliers, which also affects the fitting of the model. Therefore, the missing frames are filled with the following method:
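One simple way to implement this filling step is linear interpolation over time; treating missing joints as zeros is an assumption about the pose estimator's output format:

```python
import numpy as np

def fill_missing(series):
    """Fill frames where a joint coordinate is missing (assumed to be
    reported as 0) by linear interpolation over valid neighbouring frames."""
    s = np.asarray(series, dtype=np.float64)
    valid = s != 0
    if valid.sum() == 0:
        return s
    idx = np.arange(len(s))
    return np.interp(idx, idx[valid], s[valid])
```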

On the other hand, the characters will also move in the depth direction of the picture, which means that within the same image, as a character moves in depth, the absolute distances between points change according to the rule that near objects appear large and far objects appear small, which is equivalent to adding nonnegligible noise on the time axis. Before data input, this noise is removed by the following normalization method:
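A sketch of this normalization, assuming the distance between two fixed reference joints (the indices here are hypothetical) is used as the per-frame scale so that depth motion no longer rescales the pose:

```python
import numpy as np

def depth_normalize(points, ref_a=0, ref_b=1):
    """Divide all keypoints by the distance between two reference joints,
    making the pose invariant to moving toward or away from the camera."""
    pts = np.asarray(points, dtype=np.float64)
    scale = np.linalg.norm(pts[ref_a] - pts[ref_b])
    return pts / scale if scale > 0 else pts
```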

In the training process, one frame was taken as a unit sample, aiming to determine from the joint coordinate information of each frame whether the movement period of the figure's footsteps was on the left or right foot. The movement modes of the characters are divided into the following categories for training:
(1) Fixed camera position, figure in a fixed position, stepping in four directions
(2) Fixed camera position, figure stepping toward the depth of the frame
(3) Fixed camera position, figure moving left-to-right or right-to-left across the frame
(4) Camera following the character, character moving forward head-on
(5) Camera following the character, character moving backward
(6) Camera following the character, character moving sideways

The purpose is to make the generated image and the target image consistent in a low-dimensional space and obtain a generated image with a clearer contour. The L1 loss between the generated image and the target image is divided into two parts: one part is the L1 loss between the network output for the damaged image and the target image; the other part is the L1 loss between the network output for the target image and the target image.
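The two-part reconstruction loss described above can be sketched as follows (the network outputs are passed in as arrays; no specific framework is assumed, and equal weighting of the two terms is an assumption):

```python
import numpy as np

def combined_l1(gen_from_damaged, gen_from_target, target):
    """Two-part L1 reconstruction loss: one term between the network
    output for the damaged image and the target, one between the network
    output for the target image and the target."""
    t = np.asarray(target, dtype=np.float64)
    l1_damaged = np.mean(np.abs(np.asarray(gen_from_damaged, dtype=np.float64) - t))
    l1_identity = np.mean(np.abs(np.asarray(gen_from_target, dtype=np.float64) - t))
    return l1_damaged + l1_identity
```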

Applications of the diffusion equation can be divided into two main categories. One is the basic iterative scheme, which updates the image over time until it gradually approaches the desired result, represented by Perona and Malik's equation and subsequent improvements on it. This method performs backward as well as forward diffusion, so it can both smooth the image and sharpen edges; however, because backward diffusion is an ill-posed problem, it is not stable in practical application. The other is based on the variational method, which smooths the image by defining an energy functional over it and minimizing that functional, such as the widely used total variation (TV) model. It is more stable than the first method and has a clear geometric interpretation; however, because it lacks backward diffusion, it does not sharpen edges. By adjusting the scale parameter according to the noise intensity at each iteration step, false edges can be avoided, and the denoising method can be matched to the noise. Among denoising methods for sonar images, those based on the diffusion equation have advantages that classical algorithms lack: they can remove noise while keeping the details of the image. However, because diffusion-equation denoising is an iterative operation, when the noise is heavy, better noise removal comes at some cost in running speed. After filtering, the noise is markedly reduced, the image is clearer, the SNR is high, and the edges of the image are well maintained, with high peak signal-to-noise ratio and edge-retention evaluation parameters.

5. Example Verification

Because different types of film and television images differ greatly in the lengths of shots and scene fragments, scene boundary detection algorithms perform very differently on different types of images. To evaluate accuracy and comprehensiveness, several movies of different types were selected for evaluation. Among them, for Hollywood's The Sixth Sense, all shots except the opening and closing credits were used, while for the other movies part of the footage in the middle was selected and its scenes were marked. In A River Runs through It, Forrest Gump, and Thanks To Family, the rhythm is relatively gentle and the changes between shots are relatively small. The Sixth Sense footage comes from the violent-image database of MediaEval Affect 2012; its plot is relatively compact, and some shots change greatly. Leo is an action movie containing many fighting and gunfight shots, with drastic shot changes. Using films of these different genres and styles ensures the reliability and comprehensiveness of the evaluation results. Table 2 shows information about the evaluated movies.

In the training process, the network often fails to converge, oscillating around a local optimum, because the weight updates are too large. To solve this problem, a smoothing term is added to the real/fake cross-entropy losses. Figure 4 compares the cross-entropy loss during training with and without the smoothing term, where the red solid line indicates that the smoothing term is added and the blue dotted line that it is not. The sigmoid function outputs values between 0 and 1; when the network's output value approaches 0 or 1 during training, the corresponding log value fluctuates greatly, which makes gradient propagation unstable. Adding a smoothing term restrains this fluctuation, so that the loss function with a smoothing term makes the network converge better and faster during training.

Figure 5 shows how data accuracy improves after each step of the data preprocessing method, ordered from left to right and from top to bottom. The ordinate of each chart is the value to which precision converges after that step, and the abscissa is time (frames). The curves of different colors in each figure represent the fluctuation ranges of different feature values.

Projection intervals of 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 were taken, and the curves of between-class distance versus projection interval for the three classes are shown in Figure 6. It can be seen that as the projection interval increases, the between-class distance gradually decreases to a certain level, while the computation time of the invariant moments increases. The larger the between-class distance, the higher the recognition rate; the shorter the computation time, the smaller the computational load. Both recognition time and recognition rate must therefore be weighed in choosing the interval.

Object Bank features were extracted from the training data and statistics were computed; the statistical information obtained is shown in Figure 7, whose panels represent the statistical results for the bedroom, dining room, living room, and street scenes, respectively. The statistical results show that the extracted features are consistent with people's cognition of the image scenes.

In view of the Gaussian noise in film and television production, this paper applies commonly used image denoising methods to film and television production. It is found that these methods destroy the details of the image while denoising, which affects subsequent recognition and detection work. The performance of wavelet-transform-based image denoising and of total variation denoising based on partial differential equations is analyzed; the experimental results show that both preserve the details of the image well, but the latter outperforms the former. Aiming at the speckle noise inherent in the imaging basis of film and television production, this paper applies the improved anisotropic diffusion equation to speckle suppression, adopting the anisotropic diffusion model to remove speckle noise. By modifying the diffusion coefficient, the diffusion equation can adjust the coefficient according to the noise of the image and be more sensitive to detail information such as image edges.

6. Conclusion

On the basis of scene boundary detection, scene recognition, and structure analysis, a film and television image scene recognition system is formed: the film and television image is structured into scene semantic units, and scene semantic annotation is automated, so as to realize scene-based semantic storage and indexing of images and provide a basis for content-based image retrieval. The improved anisotropic diffusion equation was applied to ultrasonic image noise reduction; the experimental results show that for ultrasonic images containing heavy speckle noise, the proposed method improves anisotropic stability and computation time, its number of iterations is much better than the classical P-M equation and the Lin Shi operator, and its edge-preserving filtering performance has a clear advantage. As deep learning enters digital image processing, and convolutional neural networks in particular have achieved satisfactory initial results in image recognition and related fields, image recognition methods based on deep learning have been proposed and developed in recent years, becoming a research hotspot awaiting further exploration.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.