Abstract

In order to improve the effect of key information extraction from digital archives, a key information extraction algorithm for different types of digital archives is designed. The digital archive information is first preprocessed, with parts of speech and marks taken as key information. A self-organizing feature mapping network is used to extract the key information features of digital archives, and semantic similarity is calculated by combining the feature extraction results. Word sets are then merged according to mutual information: the word with the highest mutual information value is taken as the center of each set, all keywords are traversed, and the center words are taken as the key information of the digital archives, completing the extraction of key information. Experiments show that the recall rate of the algorithm ranges from 96% to 99%, the extraction accuracy of key information of digital archives is between 96% and 98%, and the average extraction time of key information of digital archives is 0.63 s. The practical application effect is good.

1. Introduction

Generally speaking, there are no management problems in the spontaneous stage of cultural production. Cultural production and business activities are the product of the commodity economy [1, 2]. When material and cultural production develop to a certain extent, the social division of labor is further clarified, professionals and professional groups engaged in cultural production appear, and both the ruling class and the ruled class try to use cultural production to serve the interests of their own class; this is the conscious stage of cultural production [3]. Only at the conscious stage of cultural production can cultural operation and management be put on the agenda. In a modern capitalist society, cultural products are turned into commodities without exception and become special for-profit goods of capitalists. The vast majority of cultural activities are governed by the value rules of commodity production [4]. The commercialization of cultural production and cultural activities has become a common social phenomenon in the field of capitalist culture. The law of the market economy dominates cultural operation and management activities, and the quality of management is the key to the success or failure of specific cultural products in the free competition of the cultural market. In particular, with the rapid development of the social economy, different types of digital archives, whose main content is social culture, are gradually increasing. To manage these different types of digital archives better, a new key information extraction algorithm for different types of digital archives needs to be studied to enhance the management level of digital archives. Therefore, it is of great significance to study a key information extraction algorithm for different types of digital archives.

In digital archives, information extraction is an important research topic. Reference [5] proposed a fast extraction algorithm for massive digital book archive data. Based on the range characteristics of large-scale data attributes, the distribution samples of digital book archive data are divided into multiple subintervals to achieve data classification. By constructing a neuron model, the error terms of the output are determined according to the data output of the hidden layer and output layer, and the weights of each layer of the BP neural network are adjusted. This method builds a fast extraction model based on the BP neural network and realizes the fast extraction of massive archive data. Reference [6] proposed a key information extraction algorithm based on TextRank and cluster filtering. First, the key information is extracted and vectorized with Word2Vec. Then, TextRank is improved by constructing a graph model integrating word eigenvalues and edge weights, and the stable graphs obtained by iterative convergence are merged and clustered to form clusters. Next, a cluster quality evaluation formula is designed for cluster filtering, and TextRank is applied to form the final clustering. Finally, the information type of each cluster is annotated. For a test text, the distances between its key information vectors and the cluster-center vectors are compared together with the information types of the words, and the information type and key information are combined to obtain the key information of the text. Reference [7] proposed a key information extraction algorithm based on an improved hidden Markov model. The web document is converted into a DOM tree and preprocessed, the information items to be extracted are mapped to states, and the observation items are mapped to vocabulary. The improved hidden Markov model is then used to extract the key information of the text. Reference [8] proposed a key information extraction algorithm based on word vectors and location information.
A word vector representation model is used to learn the vector of each word in the target document; the word vectors, which reflect the latent semantic relations between words, are fused with location features into the PageRank scoring model, and a few top-ranked words or phrases are selected as the key information of the target document, completing the key information extraction of digital archives. Reference [9] proposed a key information extraction algorithm for unstructured text in a knowledge database. A six-tuple is used to optimize the hidden Markov probability model, and incomplete training samples are smoothed. Initialization and termination operations are carried out on the sequences of observation values emitted at different times to obtain the optimal state sequence. After decoding the observation sequence, the forward sequence and reverse sequence are compared to filter out the states without decoding ambiguity and complete the ambiguity elimination. According to the maximum-probability state sequence, the text key information to be extracted is defined and the key information is extracted.

However, when the above-mentioned key information extraction algorithms are applied to different types of digital archives, the effect is not ideal because the boundary of key information extraction is uncertain. Therefore, this paper designs a new key information extraction algorithm for different types of digital archives. The algorithm first divides the main categories of key information, takes parts of speech and marks as features, and introduces the self-organizing feature mapping neural network; it then traverses the word sets for their center words, thus realizing fast and accurate extraction of key information. The effectiveness of the algorithm is verified by experiments.

2. Materials and Methods

2.1. Digital Archive Processing
2.1.1. The Text Participle

In the process of cultural management, there are many types of digital archives. Before extracting the key information of different types of digital archives, it is necessary to preprocess the key information of digital archives. The preprocessing process includes word segmentation and marking. Word segmentation refers to classifying the words in the text and setting the marks according to the categories, which lays the foundation for the key information extraction of digital archives in the future [10].

The reverse maximum matching algorithm differs from the forward maximum matching algorithm in that its scan starts from the end of the string: each unsuccessful match removes the first character of the candidate until the match succeeds. The basic idea of the bidirectional maximum matching algorithm is as follows: when segmenting the information of different types of digital archives, a forward character-by-character maximum matching pass is first applied to the string to be processed, then a reverse pass, and the output results are combined to complete the word segmentation. Assuming bidirectional maximum matching segmentation is performed on a string S, the algorithm can be described as follows:
(1) First take out the first character w of S and search the dictionary for words with w as a prefix. If any exist, save w as a word mark [11].
(2) Take another character from S, append it to w, and match against the dictionary to determine whether there is a word with w as a prefix.
(3) If none exists, split w from string S, ending one word segmentation.
(4) If one exists, determine whether w itself is a word and count the number n of dictionary words headed by w.
(5) If n = 0, this round of segmentation ends [12].
(6) If n is not 0, take another character from S, append it to w, and match against the dictionary to determine whether there is a word prefixed with w.
(7) If one exists, go to step (6).
(8) If none exists, split w from string S, ending one word segmentation.
(9) Continue segmentation from the remainder of S and repeat the above steps until the forward segmentation of S is finished.
(10) Take out the last character w of S and match it in the dictionary to find whether there is a word with suffix w. If so, save it as a word mark [13].
(11) Take another character from S, prepend it to w, and match against the dictionary to judge whether there is a word with suffix w.
(12) If none exists, split w from string S, ending one word segmentation.
(13) If one exists, judge whether w is a word and count the number n of dictionary words ending with w.
(14) If n = 0, this round of segmentation ends.
(15) If n is not 0, take another character from S and match against the dictionary to determine whether there is a word with w as a suffix.
(16) If one exists, go to step (15).
(17) If none exists, w is cut out from string S and one word segmentation ends.
(18) Continue segmentation from the remaining part of S and repeat the above steps until the reverse segmentation of S is finished, then remove the stop words. The specific implementation process is shown in Figure 1.
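A minimal sketch of bidirectional maximum matching in Python, assuming a toy dictionary and a simple tie-breaking rule (prefer the pass with fewer tokens, then fewer single characters); the paper's dictionary, tie-breaking rule, and stop-word list are not specified:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, greedily taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in dictionary or j == 1:
                tokens.append(text[i:i + j])
                i += j
                break
    return tokens

def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, greedily taking the longest dictionary word."""
    tokens, i = [], len(text)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            if text[i - j:i] in dictionary or j == 1:
                tokens.append(text[i - j:i])
                i -= j
                break
    return list(reversed(tokens))

def bidirectional_max_match(text, dictionary, max_len=4):
    """Combine both passes: fewer tokens wins, then fewer single characters."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    singles = lambda toks: sum(len(t) == 1 for t in toks)
    return fwd if singles(fwd) <= singles(bwd) else bwd
```

For example, segmenting the concatenated string "digitalarchives" against a dictionary containing "digital", "archive", and "archives" yields the two-token segmentation in both passes.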

2.1.2. The Part of Speech Tagging

Part of speech is a grammatical attribute of vocabulary, which generally indicates the type of a word in the corpus. Part-of-speech tagging refers to the process and method of tagging the part of speech of each word. Some words have multiple parts of speech, and different parts of speech correspond to completely different ways of expression [14, 15]. In general, however, when a word has one or more parts of speech, the frequency of its most common part of speech is far greater than that of the others, so the accuracy of POS tagging can be ensured on the whole, and the POS tagging method can be applied to most application scenarios [16]. The conditional random field (CRF) algorithm was proposed by Lafferty et al. in 2001. It is an undirected graph model combining the characteristics of the maximum entropy model and the hidden Markov model. In recent years, it has achieved good results in sequence tagging tasks such as word segmentation, part-of-speech tagging, and named entity recognition [17]. The simplest conditional random field has a chain structure composed of several character marks. In a CRF model with only a first-order chain, the fully connected subgraph covers the set of the current marker and the one marker before it, as well as the maximum connected graph of any subset of the observation sequence. The chained conditional random field is shown in Figure 2, and the set of vertices can be regarded as a maximum connected subgraph.

In the sequence labeling task, the random variable X represents the observable sequence and the random variable Y represents the corresponding marker sequence of the observed sequence [18]. The chained conditional probability distribution of the random variable Y is:

P(Y | X) = (1/Z(X)) exp( Σ_i Σ_k λ_k t_k(y_{i-1}, y_i, X, i) + Σ_i Σ_l μ_l s_l(y_i, X, i) )

In the above formula, t_k is the transition feature function defined on each edge, which captures mark-transfer features between adjacent positions; each exponential term defines a non-negative factor for the corresponding node or edge. s_l is the state feature function defined on each node, which captures the feature of the current mark. λ_k and μ_l are the model parameters to be learned [19], namely the weights of the feature functions. Z(X) is a normalizing factor that depends only on the observation sequence. The specific calculation formula is as follows:

Z(X) = Σ_Y exp( Σ_i Σ_k λ_k t_k(y_{i-1}, y_i, X, i) + Σ_i Σ_l μ_l s_l(y_i, X, i) )

Conditional random field inference refers to finding the most probable marker sequence Y* corresponding to a given observation sequence X. In the distribution function of conditional random fields, the normalizing factor Z(X) is completely independent of the marker sequence [20]. Therefore, given the model parameters, the most likely marker sequence can be expressed as:

Y* = argmax_Y P(Y | X) = argmax_Y [ Σ_i Σ_k λ_k t_k(y_{i-1}, y_i, X, i) + Σ_i Σ_l μ_l s_l(y_i, X, i) ]

When the current sequence position is i and the current label is l, the Viterbi algorithm can be used to obtain the unnormalized probability value δ_i(l) of the optimal label sequence up to the current position. Its recursive form is:

δ_i(l) = max_{1≤j≤m} { δ_{i-1}(j) + Σ_k λ_k t_k(y_{i-1} = j, y_i = l, X, i) } + Σ_k μ_k s_k(y_i = l, X, i)
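The Viterbi recursion above can be sketched as follows, assuming toy score matrices in place of the trained feature weights (here `node[i, l]` stands for the summed state-feature scores at position i for label l, and `trans[j, l]` for the summed transition scores from label j to label l):

```python
import numpy as np

def viterbi(node, trans):
    """Decode the best label sequence from unnormalized CRF scores."""
    n, m = node.shape                     # n positions, m labels
    delta = np.zeros((n, m))              # best unnormalized score up to position i
    back = np.zeros((n, m), dtype=int)    # backpointers
    delta[0] = node[0]
    for i in range(1, n):
        # delta_i(l) = max_j [ delta_{i-1}(j) + trans[j, l] ] + node[i, l]
        scores = delta[i - 1][:, None] + trans
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + node[i]
    # trace the optimal label sequence backwards
    path = [int(delta[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return list(reversed(path))
```

Because the normalizing factor Z(X) is independent of the label sequence, decoding works directly on the unnormalized scores, as the formula above indicates.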

2.2. Key Information Feature Extraction of Digital Archives

The self-organizing feature mapping (SOM) neural network was proposed in 1981 by the neural network expert T. Kohonen in Finland [21]. This network simulates the self-organizing feature mapping function of the brain's nervous system. It is a kind of competitive learning network that can carry out self-organizing learning without supervision [22]. This paper uses this method to extract the key information features of different types of digital archives, which can improve the accuracy and efficiency of extracting key information from archives.

The structure of self-organizing feature mapping neural network is shown in Figure 3.

We set the number of neurons in the input layer to N and the number of neurons in the competition layer to M. The input layer and the competition layer form a two-dimensional planar array. The two layers are fully connected, and the neurons in the competition layer are sometimes also connected to each other by lateral inhibition [23]. There are two kinds of connection weights in the network: the connection weights with which neurons respond to external inputs, and the connection weights between neurons, whose size controls the strength of the interactions between the neurons [24, 25].

The connections of neurons at the competitive layer of each input neuron in the self-organizing feature mapping network structure shown in Figure 3 are extracted, as shown in Figure 4.

Set the input mode of the network as X = (x_1, x_2, …, x_N) and the output vector of the competition layer as Y = (y_1, y_2, …, y_M), where x_i is a continuous value and y_j is a binary value. The connection weight vector between neuron j of the competition layer and the neurons of the input layer is W_j = (w_{1j}, w_{2j}, …, w_{Nj}).

The self-organizing learning process of the self-organizing feature mapping network can be described as follows: for each input to the network, only part of the weights are adjusted, so that the weight vectors move closer to or further away from the input vector. This adjustment process is competitive learning. With continuous learning, the weight vectors separate from each other in the vector space and come to represent distinct classes of patterns in the input space; this is the automatic feature-recognition clustering function of the self-organizing feature mapping network. The learning and working rules of the network are as follows:
(1) Initialization. Assign random values in the interval [0, 1] to the network connection weights w_{ij}. Determine the initial value of the learning rate η(0), with 0 < η(0) < 1. Determine the initial value of the neighborhood N_g(0). The neighborhood N_g(t) is essentially a region centered on the winning neuron g that contains several neurons; this region is generally uniformly symmetrical, most typically a square or circular area, and its value represents the number of neurons in the neighborhood during the t-th learning step. Determine the total number of learning steps T.
(2) Provide one of the learning modes X^k = (x_1^k, x_2^k, …, x_N^k) to the input layer of the network and normalize it. The specific calculation formula is as follows:

X̄^k = X^k / ‖X^k‖

(3) Normalize the connection weight vector W_j and calculate the Euclidean distance d_j between X̄^k and W̄_j. W̄_j is calculated as follows:

W̄_j = W_j / ‖W_j‖

The Euclidean distance between X̄^k and W̄_j can then be calculated by the following formula:

d_j = ‖X̄^k − W̄_j‖ = [ Σ_{i=1}^{N} (x̄_i^k − w̄_{ij})^2 ]^{1/2},  j = 1, 2, …, M

(4) Find the minimum distance d_g = min_j d_j and determine the winning neuron g.
(5) Adjust the connection weights: modify the connection weights between all the neurons in the neighborhood N_g(t) of the competition layer and the neurons of the input layer. The specific formula is as follows:

w̄_{ij}(t + 1) = w̄_{ij}(t) + η(t)[x̄_i^k − w̄_{ij}(t)],  j ∈ N_g(t)

In the above formula, η(t) is the learning rate at moment t.
(6) Select another learning mode, provide it to the input layer of the network, and return to step (3), until all learning modes have been provided to the network.
(7) Update the learning rate η(t) and the neighborhood N_g(t):

η(t) = η(0)(1 − t/T)

In the above formula, η(0) is the initial learning rate, t is the current learning step, and T is the total number of learning steps. Assume that the coordinates of the winning neuron g in the two-dimensional array are (x_g, y_g); then the neighborhood N_g(t) is the square whose lower-left corner is the point (x_g − N_g(t), y_g − N_g(t)) and whose upper-right corner is the point (x_g + N_g(t), y_g + N_g(t)), and the modified neighborhood size is:

N_g(t + 1) = INT[N_g(0)(1 − t/T)]

In the above formula, INT[·] is the integer-part function.
(8) Let t = t + 1 and return to step (2), until t = T.
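The learning rules above can be sketched as a short training loop; the grid size, data, and decay schedules below are assumed toy values, and a one-dimensional competition layer is used for brevity (the paper uses a two-dimensional array):

```python
import numpy as np

def train_som(data, m=10, total_steps=200, eta0=0.5, n0=3, seed=0):
    """Minimal SOM training loop following the numbered rules above."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # (1) initialize weights randomly in [0, 1]
    weights = rng.random((m, n_features))
    for t in range(total_steps):
        eta = eta0 * (1 - t / total_steps)          # (7) decaying learning rate
        radius = int(n0 * (1 - t / total_steps))    # (7) shrinking neighborhood
        for x in data:                              # (2) present each pattern
            x = x / np.linalg.norm(x)               #     normalize the input
            w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
            d = np.linalg.norm(x - w, axis=1)       # (3) Euclidean distances
            g = int(d.argmin())                     # (4) winning neuron
            lo, hi = max(0, g - radius), min(m, g + radius + 1)
            # (5) pull the neighborhood weights toward the input
            weights[lo:hi] += eta * (x - weights[lo:hi])
    return weights
```

After training on two well-separated clusters, different competition-layer neurons win for inputs from different clusters, which is the clustering behavior the text describes.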

2.3. Key Information Extraction Algorithm of Digital Archives

In the process of extracting key information from digital archives, the individual words of an archive cannot be understood in isolation; words in the digital archives that are similar or correlated with each other must be combined into blocks so as to comprehensively understand the whole text content and the exact meaning of each word. Therefore, the semantic similarity between words is used as the clustering distance. All the sememes of a word form a tree-like hierarchical structure according to their upper and lower positional relations, and this tree is traversed; finally, the distance between words can be used to judge the similarity of word meanings. The formula for calculating the word distance is as follows:

Sim(p_1, p_2) = α / (d + α)

In the above formula, p_1 and p_2 represent two sememes, α is an adjustable parameter, and d represents the length of the path between the two sememes of a word in the sememe hierarchy. The sememes describing a concept are divided into four parts: the first basic sememe, the symbol sememe, the relational sememe, and the other independent sememes. The overall similarity between concepts is calculated by the following formula:

In the above formula, C_1 and C_2 represent two concepts, and F represents the result of feature extraction. If there are two words w_1 and w_2 in the set, where word w_1 has n concept descriptions C_{11}, C_{12}, …, C_{1n} and word w_2 has m concept descriptions C_{21}, C_{22}, …, C_{2m}, the maximum similarity between the concepts can be used as the semantic similarity of the two words, and the calculation formula is as follows:

Sim(w_1, w_2) = max_{i=1,…,n; j=1,…,m} Sim(C_{1i}, C_{2j})
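The path-based sememe distance and the maximum-similarity rule for words can be sketched as follows; the sememe tree and the value of the parameter α are assumed toy values, since the paper does not specify its sememe hierarchy:

```python
ALPHA = 1.6  # assumed adjustable parameter

# parent links of a toy sememe tree
PARENT = {"apple": "fruit", "pear": "fruit", "fruit": "food",
          "rice": "food", "food": "entity"}

def ancestors(s):
    """Path from a sememe up to the root of the tree."""
    path = [s]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_distance(s1, s2):
    """Length of the path between two sememes via their common ancestor."""
    p1, p2 = ancestors(s1), ancestors(s2)
    for i, node in enumerate(p1):
        if node in p2:
            return i + p2.index(node)
    return len(p1) + len(p2)  # no common ancestor in the toy tree

def sememe_similarity(s1, s2):
    """Sim(p1, p2) = alpha / (d + alpha), as in the formula above."""
    return ALPHA / (path_distance(s1, s2) + ALPHA)

def word_similarity(concepts1, concepts2):
    """Maximum pairwise concept similarity between two words."""
    return max(sememe_similarity(c1, c2)
               for c1 in concepts1 for c2 in concepts2)
```

In the toy tree, "apple" and "pear" share the parent "fruit", so their path length is 2 and their similarity is α / (2 + α).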

The process of key information extraction algorithm of digital archives is as follows:

Preprocessing: perform word segmentation on the digital archival text and filter out the stop words.
Step 1: Calculate all candidate words and the semantic similarities Sim(w_i, w_j) between words w_i and w_j in the digital archival text D.
(1) The TF-IDF value of each word is calculated, and the words whose value is greater than the threshold are selected as candidate key information. The calculation formula of the TF-IDF value is as follows:

TF-IDF(w) = tf_w × log(N / n_w)

In the above formula, tf_w is the number of occurrences of the word w in the current digitized archival text, N is the total number of digitized archival texts, and n_w is the number of digitized archives containing the word w in the database.
(2) During initialization, each word w_i among the candidate words forms its own cluster c_i, giving n clusters in total, and all of them are given unvisited marks.
(3) Among all unvisited word clusters, select the cluster pair (c_i, c_j) with the largest similarity, that is, the closest distance, by calculating the maximum value of Sim(c_i, c_j). If this maximum is less than the given threshold, go to (5); otherwise, merge clusters c_i and c_j into a new cluster c_k. Set c_k as the current cluster with an unvisited mark, and give c_i and c_j visited marks.
(4) Calculate the semantic similarity among all unvisited word clusters, and go to (3).
(5) After clustering, the first words with better quality are selected from each cluster as the final key information, so as to obtain the candidate word set.
Step 2: Treat each word in the text as a set, giving n sets in total (n is the number of words in the text).
Step 3: Select the two sets S_i and S_j with the greatest similarity from the sets, and combine them into a new set S_k.
Step 4: Select the center point of the current set: calculate the sum of mutual information between each word in the current set and the other words outside the set, and select the word with the largest mutual information sum as the center point of the current set. A large mutual information value between two words indicates a strong correlation between them; a small value indicates a weak correlation. The mutual information between w_i and w_j, that is, the public information shared by w_i and w_j, is calculated as follows:

MI(w_i, w_j) = log_2 [ p(w_i, w_j) / ( p(w_i) p(w_j) ) ]

In the above formula, p(w_i, w_j) is the common frequency of w_i and w_j, and p(w_i) and p(w_j) are the separate frequencies of w_i and w_j. According to the above formula, when MI(w_i, w_j) > 0, the greater the value, the more public information there is between w_i and w_j and the stronger the correlation; when MI(w_i, w_j) < 0, there is little public information between w_i and w_j and the correlation is weak; when MI(w_i, w_j) = 0, there is no correlation between w_i and w_j.
Step 5: Among the other words outside the set, select the word with the highest similarity to the center point of the set; if the similarity value is greater than the threshold, add it to the current set S_k. Calculate the mutual information between the center point of the current set and the words outside the set, and add the word with the largest mutual information value to the current set S_k.
Step 6: Go to Step 4 to update the current set center point until all words are visited. If the mutual information values between the center point of the set and all other words outside the set are less than 0, perform Step 3 on the remaining unvisited words until all the words are visited and divided.
Step 7: In the final cluster set, select the first k center words as the key information of the text. The key information extraction algorithm flow for different types of digital archives is shown in Figure 5.
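The TF-IDF filtering of Step 1 and the mutual-information scoring of Step 4 can be sketched as follows, with an assumed toy corpus in place of the archive database and document-level co-occurrence as the frequency estimate:

```python
import math

def tf_idf(word, doc, corpus):
    """tf_w * log(N / n_w) over a corpus of tokenized documents."""
    tf = doc.count(word)
    n_w = sum(1 for d in corpus if word in d)  # documents containing the word
    return tf * math.log(len(corpus) / n_w) if n_w else 0.0

def mutual_information(w1, w2, docs):
    """log2( p(w1, w2) / (p(w1) p(w2)) ) from document co-occurrence."""
    n = len(docs)
    p1 = sum(1 for d in docs if w1 in d) / n
    p2 = sum(1 for d in docs if w2 in d) / n
    p12 = sum(1 for d in docs if w1 in d and w2 in d) / n
    if p12 == 0 or p1 == 0 or p2 == 0:
        return float("-inf")  # never co-occur: treat as uncorrelated
    return math.log2(p12 / (p1 * p2))
```

Words that co-occur more often than chance score above 0, matching the correlation interpretation given after the formula.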

3. Results and Discussion

3.1. Experimental Scheme

In order to verify the effectiveness of the algorithm designed in this paper to extract archive information, we conducted simulation experiments. This experiment is a simulation experiment, so it is necessary to design the experimental parameters, consider various factors, compare various types of simulation software and computers, and complete the design of environmental parameters of the simulation experiment, as shown in Table 1.

During the experiment, 500 GB of digital archives were randomly selected from schools, enterprises, and relevant administrative units as the data set; 450 GB of them were randomly selected as the training set to train this method, and the remaining 50 GB were used as the test set to evaluate the key information extraction performance on different types of digital archives. In order to ensure the objectivity of the experiment, titles and core prompts were filtered out in the process of extracting key information. Recall rate and accuracy rate are often used as indicators of the key information extraction effect for different types of digital archives. The recall rate adopted in this experiment is defined as follows:

R = (M_1 / M_2) × 100%

In the above formula, M_1 represents the number of pieces of key information extracted, and M_2 represents the actual number of pieces of key information. The accuracy rate is defined as follows:

P = (M_3 / M_1) × 100%

In the above formula, M_3 represents the number of pieces of key information accurately extracted.

The time consumed in extracting the key information of different types of digital archives is calculated by the following formula:

T = Σ_{i=1}^{n} t_i

In the above formula, t_i represents the time taken by the i-th step of key information extraction from the digital archives, and n is the number of steps.
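The three evaluation measures can be sketched as follows; the symbol names M_1, M_2, M_3 and the example counts are assumptions made for illustration:

```python
def recall(extracted_total, actual_total):
    """R = M1 / M2 * 100%: extracted items relative to the actual total."""
    return extracted_total / actual_total * 100

def accuracy(correct, extracted_total):
    """P = M3 / M1 * 100%: correctly extracted items relative to extracted."""
    return correct / extracted_total * 100

def total_time(step_times):
    """T = sum of the per-step extraction times t_i."""
    return sum(step_times)
```

For instance, extracting 48 of 50 actual key information items gives a recall of 96%, matching the scale of the results reported below.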

3.2. Analysis and Discussion of Experimental Results

The recall rates of key information extraction of different types of digital archives of reference [5] algorithm, reference [6] algorithm, reference [7] algorithm, and algorithm of this paper are compared. The results are shown in Figure 6.

By analyzing the data in Figure 6, we can see that the recall rate of the algorithm in reference [5] varies in the range of 58%–85%, the recall rate of the algorithm in reference [6] varies in the range of 49%–79%, and the recall rate of the algorithm in reference [7] varies in the range of 50%–87%. By contrast, the recall rate of the algorithm of this paper varies in the range of 96%–99% and is always higher than that of the comparison algorithms, which shows that this algorithm can extract the key information of digital archives comprehensively and with higher integrity.

The key information extraction accuracy of different types of digital archives of reference [5] algorithm, reference [6] algorithm, reference [7] algorithm, and algorithm of this paper are compared. The results are shown in Figure 7.

By analyzing the data in Figure 7, we can see that the extraction accuracy of key information of digital archives is 49%–85% for the reference [5] algorithm, 54%–80% for the reference [6] algorithm, and 56%–80% for the reference [7] algorithm. Compared with these algorithms, the extraction accuracy of the algorithm of this paper is 96%–98%. On the whole, the key information extraction accuracy of this algorithm is relatively stable, without large fluctuations, which indicates that the algorithm is highly reliable in extracting key information and can achieve the ultimate goal of accurately extracting the key information of different digital archives.

The extraction time of key information of different types of digital archives of reference [5] algorithm, reference [6] algorithm, reference [7] algorithm, and algorithm of this paper are compared. The comparison results are shown in Table 2.

By analyzing the results in Table 2, it can be seen that the average key information extraction time is 1.41 s for the reference [5] algorithm, 1.39 s for the reference [6] algorithm, and 1.49 s for the reference [7] algorithm, the last being the highest among the four algorithms. Compared with these algorithms, the average extraction time of this algorithm is 0.63 s; it has a shorter extraction time and higher efficiency, and can realize the rapid extraction of key information from digital archives.

To sum up, the recall rate of this algorithm varies in the range of 96%–99%, the accuracy of key information extraction is 96%–98%, and the average extraction time is 0.63 s. The algorithm achieves the goal of rapid and accurate extraction of key information from digital archives, solves a variety of problems existing in traditional methods, and can be widely used in many fields.

4. Conclusions

With the continuous optimization of cultural operation and management strategies, the level of cultural operation and management has gradually improved, and digital archives management is an important part of it. Extracting the key information of different types of digital archives is therefore of great significance to cultural operation and management, and this paper designs a key information extraction algorithm for different types of digital archives in that context. The experimental results show that the recall rate of the algorithm is between 96% and 99%, the accuracy of key information extraction is 96%–98%, and the average extraction time is 0.63 s. The algorithm achieves the goal of rapid and accurate extraction of key information from digital archives and can be widely used in cultural operation and management, improving its quality to the greatest extent and promoting the further development of the cultural industry. However, the convergence of the algorithm was not tested during operation. In order to avoid falling into a local optimum, future research should further optimize the algorithm to avoid excessive iterations or large errors.

Data Availability

The dataset can be accessed upon request.

Conflicts of Interest

The authors declare no conflicts of interest.