Abstract

Information and communication technologies (ICT) are widely regarded as potential assets for socioeconomic development in developing countries. Studies have shown that improved telecommunication infrastructure has supported the development of underserved populations in various ways. Among the existing applications of ICT, digital library systems offer better solutions and respond to a variety of unmet needs of research institutions, scientific communities, and development efforts. With the development of digital library technology, the parallel database system has become the main tool for efficient information processing in the digital library system. On this basis, in the parallel environment of a computer cluster, a coordinator manages communication so that the coordinator, the collection machine, and the query processor can carry out data distribution, loading, and maintenance efficiently. This saves considerable time, helps the digital library meet user requirements and its own performance requirements for data, and resolves the key problem in the parallel algorithm. The experimental results show that this parallel technique delivers good performance and efficiency.

1. Introduction

Information and communication technologies (ICT) are widely regarded as potential assets for socioeconomic development in developing countries. Recent studies have shown that improved telecommunication infrastructure has supported the development of underserved populations in various ways [1]. Among the existing applications of ICT, digital library systems offer better solutions and respond to a variety of unmet needs of research institutions, scientific communities, and development efforts. With the development of information technology, more and more information needs to be stored and disseminated, and the types and forms of information are increasingly diverse. The mechanisms of the traditional library are clearly unable to meet these needs [2]. The idea of the digital library was put forward in response. The digital library is an electronic information store that can hold a large amount of information in various forms. Users can easily access this information through the network, and neither information storage nor user access is limited by geography [3]. The digital library integrates the acquisition, storage management, querying, and publication of all kinds of information, including multimedia, so that information can be spread over the Internet and used to the fullest [4]. Using multimedia database technology and hypermedia technology, and targeting the characteristics of the various media in the digital library, effective and feasible management and retrieval schemes have been proposed for image retrieval, video-on-demand, and literature [5, 6]. The digital library is an innovation of the information era built on the traditional library. It not only retains the traditional library functions, providing the corresponding services to the public, but also integrates some functions of other information resources (such as museums and archives), and some of its features provide comprehensive public information services [7]. In other words, the digital library will become the public information center and hub of the future society [8].

In the parallel environment of a computer cluster, communication is controlled by the coordinator, so that the coordinator, the collection machine, and the query processor can carry out data distribution, loading, and maintenance efficiently. This saves considerable time, helps the digital library meet user requirements and its own performance requirements for data, and resolves the key problem in the parallel algorithm [9]. This study considers the design and implementation of a human-computer interaction system in a parallel digital library system based on a neural network [10].

The paper is organized as follows. Section 2 reviews work related to the design and implementation of the human-computer interaction system in the parallel digital library system. Section 3 describes the methodology. Section 4 presents the analysis and discussion of the results. Section 5 concludes the paper.

2. Related Work

Domestic researchers have also done some work on the digital library, such as research on personalized active service systems in the digital library system and on query optimization algorithms. Although some work has been done in the field of digital libraries both domestically and internationally, it is relatively scattered and preliminary [11]. Research that treats the digital library as an independent and universal system tool has not been carried out, and a comprehensive understanding of the digital library is lacking [12]. Much of the research relies on a traditional database management system to implement the digital library system. Notably, there are no reports of research on the parallel digital library [13]. Data loading is currently a new research topic worldwide, and research on parallel text loading is still at an initial stage both domestically and internationally. Although some data-loading system prototypes already exist, the research focuses mainly on single-machine loading of two-dimensional relational data, and there is no parallel text-loading system. Moreover, the data loaded by these existing systems are mainly relational tables whose structure is already determined. No existing system can load fixed-structure data into tables whose structure changes [14].

3. Methodology

3.1. Algorithm of the Collection Machine When the Digital Library System Adds a Class

As the amount of data in the system increases, a new category needs to be added at some stage. The classification system is therefore dynamic. The processor to which the new category belongs is not predetermined, so it must be determined first [15]. Once it is determined, the data related to the new class is sent to that processor only, and no message is sent to the other processors. A special case must be handled when adding a class: the data of one minimum class may reside on several data acquisition machines. In this case, each query processor spawns one thread per data acquisition machine, and these threads receive the data collected from the different machines simultaneously. Because multiple threads cannot insert data into the same Oracle table at the same time, intermediate files are used. Each thread writes its data to its own intermediate file; the intermediate files are merged into one large file only after all threads have finished, and the merged data is then inserted into the Oracle table [16]. Similarly, if a file on the query processor is shared, for example when several data acquisition machines write to the same file on the query processor at the same time, the data is first written to separate temporary files, and these temporary files are finally merged to form the final file [17].
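The intermediate-file step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the machine names `acq1` and `acq2`, and the `.tmp` file naming are hypothetical, and the final bulk insert into the Oracle table is replaced by simply returning the merged records.

```python
import os
import tempfile
import threading


def receive_to_intermediate(records, path):
    """Write one acquisition machine's records to its own intermediate file."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(record + "\n")


def load_via_intermediate_files(batches, workdir):
    """Receive every batch in its own thread, then merge the files.

    `batches` maps a (hypothetical) acquisition-machine name to its records.
    Returns the merged records in machine-name order; a real system would
    insert the merged file into the database table at this point.
    """
    paths = {name: os.path.join(workdir, f"{name}.tmp") for name in batches}
    threads = [
        threading.Thread(target=receive_to_intermediate,
                         args=(batches[name], paths[name]))
        for name in batches
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # merge only after every thread has finished writing

    merged = []
    for name in sorted(paths):
        with open(paths[name], encoding="utf-8") as f:
            merged.extend(line.rstrip("\n") for line in f)
        os.remove(paths[name])
    return merged


def demo():
    batches = {"acq1": ["doc-a", "doc-b"], "acq2": ["doc-c"]}
    with tempfile.TemporaryDirectory() as workdir:
        return load_via_intermediate_files(batches, workdir)
```

Because each thread owns its file, no locking is needed during the receive step; the only synchronization point is the `join` before the merge.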

The idea of the algorithm is that the data acquisition machine does nothing if the added class is not a minimum class, that is, if it is not a leaf node in the classification pattern table. If the class is a leaf node, then along with the add-class command the acquisition machine also receives the query processor to which the new class belongs, which is decided by the front-end machine. After the data is extracted, it is sent to that query processor [18].
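The leaf-node rule can be illustrated with a small sketch. The `ClassNode` type, the `handle_add_class` function, and the `send` callback are hypothetical names introduced for illustration only; the actual classification pattern table and network layer are not described in enough detail to reproduce.

```python
from dataclasses import dataclass, field


@dataclass
class ClassNode:
    """One entry of a (hypothetical) classification tree."""
    code: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def handle_add_class(node, target_processor, local_texts, send):
    """Acquisition-machine side of an add-class command.

    Non-leaf classes are ignored. For a leaf class, every local text is
    sent to `target_processor`, which is assumed to have been chosen by
    the front-end machine. Returns the number of texts sent.
    """
    if not node.is_leaf():
        return 0
    for text in local_texts:
        send(target_processor, node.code, text)
    return len(local_texts)
```

Passing `send` as a parameter keeps the routing decision separate from the transport, so the same logic can be exercised without a network.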

The function of the algorithm is to add new classes from the acquisition machine to the query processor. The input is a string composed of the class code, the number of the processor to which the class belongs, and the number of texts in the class. The algorithm returns 0 on success. It first splits the parameters into the class code, the number of the processor to which the class belongs, and the number of texts in the class, and stores them in an array of structures [19].

Next, the network configuration file is read to find the processor to which the new class belongs, and a connection is opened to that processor's IP address. If the new class has only one file to add, the file name and the file contents are sent in order; if it has multiple files, the first file is located first. After the text is loaded, the detailed steps of the processor operation can be viewed [20].

For the complexity analysis, let n be the number of minimum classes in the system, x the average amount of data under each minimum class, c the number of data acquisition machines, and p the number of query processors. The worst-case time complexity of the algorithm is O(x), and the best-case time complexity is O(x/(cp)).

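Consider a random process $\{X_n, n \in T\}$ with state space $I$. If for any integer $n \in T$ and arbitrary states $i_0, i_1, \ldots, i_{n+1} \in I$, the conditional probability satisfies

$$P\{X_{n+1} = i_{n+1} \mid X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n\} = P\{X_{n+1} = i_{n+1} \mid X_n = i_n\},$$

then $\{X_n, n \in T\}$ is called a Markov chain. The transition probability $p_{ij} = P\{X_{n+1} = j \mid X_n = i\}$ is the probability that the system moves to state $j$ at time $n+1$ given that it is in state $i$ at time $n$. Arranging the $p_{ij}$ in order gives the following matrix:

$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots \\ p_{21} & p_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$

This matrix is called the transition probability matrix. The state space $I$ of any system can be decomposed into the following disjoint subsets:

$$I = D \cup C_1 \cup C_2 \cup \cdots$$

Here, $D$ is the set of all non-recurrent states, and each $C_m$ is an irreducible closed set composed of recurrent states. If $j$ is an aperiodic positive recurrent state, then

$$\lim_{n \to \infty} p_{jj}^{(n)} = \frac{1}{\mu_j},$$

where $p_{jj}^{(n)}$ is the $n$-step transition probability from state $j$ back to itself. The variable $\mu_j$ in this formula is the mean return time of the state $j$.

We call the probability distribution $\{\pi_j, j \in I\}$ a stationary distribution of the Markov chain, where $I$ is the state space, if it satisfies the following conditions:

$$\pi_j = \sum_{i \in I} \pi_i p_{ij}, \qquad \pi_j \ge 0, \qquad \sum_{j \in I} \pi_j = 1.$$

If $\{\pi_j, j \in I\}$ is the stationary distribution of the Markov chain, then

$$\pi_j = \sum_{i \in I} \pi_i p_{ij}^{(n)}.$$

For a data sequence, the state at the next time can therefore be predicted from the state at the current time alone. The data sequence is divided into several states, denoted $E_1, E_2, \ldots, E_m$. The probability that the data sequence moves from state $E_i$ to state $E_j$ in $k$ steps is estimated as

$$p_{ij}^{(k)} = \frac{M_{ij}^{(k)}}{M_i},$$

where $M_{ij}^{(k)}$ is the number of times the state $E_i$ transitions to the state $E_j$ after $k$ steps and $M_i$ is the number of occurrences of the state $E_i$. Because the state that follows the last observation of the data sequence is unknown, the last data point should be removed when calculating $M_i$.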
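The transition probability estimate described above, the number of observed k-step transitions from one state to another divided by the number of occurrences of the originating state, can be sketched as follows. The function name and the integer encoding of states are assumptions made for illustration. For k greater than 1, the sketch drops the last k observations from the occurrence count, which generalizes the one-step rule stated in the text and is also an assumption.

```python
from collections import Counter


def k_step_transition_matrix(states, n_states, k=1):
    """Estimate p_ij^(k) = M_ij^(k) / M_i from an observed state sequence.

    `states` is a list of state indices in 0..n_states-1. The last k
    observations are excluded from M_i because no k-step successor is
    observed for them.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    origins = states[:-k]
    pair_counts = Counter(zip(origins, states[k:]))   # M_ij^(k)
    origin_counts = Counter(origins)                  # M_i
    matrix = []
    for i in range(n_states):
        total = origin_counts[i]
        row = [pair_counts[(i, j)] / total if total else 0.0
               for j in range(n_states)]
        matrix.append(row)
    return matrix
```

For the sequence `[0, 1, 0, 1, 1, 0]` with two states and k = 1, state 0 is always followed by state 1, while state 1 is followed by state 0 in two of its three counted occurrences.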

3.2. The Maintenance of Data in the Parallel Digital Library System

The parallel data manipulation subsystem (PDOPS) provides basic query operations based on multiple data distribution strategies and parallel storage structures, such as one-dimensional data partitioning, multidimensional data partitioning, the compressed multidimensional array storage structure, and the attribute-partitioning storage structure [21]. The parallelism of all the algorithms is based on data parallelism, which is easy to implement in a parallel computer cluster. All the parallel algorithms are implemented in a cluster composed of several ordinary PCs connected by a high-speed network, with one machine serving as the front-end processor and coordinator. Any processor can serve as a back-end data acquisition machine, and any processor can serve as the system administrator's terminal [22]. The front-end machine does not store any data; it only receives operation requests from multiple users and coordinates the execution of commands by the threads on each back-end machine. The data is transferred from the acquisition machine and stored in the query processors on the back-end machines according to a given data distribution method. The back-end machines are responsible for the actual execution. When a parallel algorithm is executed, the scheduling module on the front-end machine coordinates the operation execution modules on all the back-end query processors and data acquisition machines so that they work in parallel. This parallelism is achieved by performing the same operation on different data items, so it is data parallelism [23].
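Data parallelism of this kind, the same operation applied to different data items, can be sketched with a thread pool standing in for the cluster. The names `run_data_parallel` and `count_matches`, and the keyword-counting operation itself, are hypothetical; in the system described, the partitions live on separate back-end machines rather than in one process.

```python
from concurrent.futures import ThreadPoolExecutor


def run_data_parallel(operation, partitions):
    """Front-end role: run the same operation on every partition in parallel.

    The front end holds no data of its own; it only dispatches the
    operation and collects one result per back-end partition, in order.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        return list(pool.map(operation, partitions))


def count_matches(partition, keyword="library"):
    """Back-end role: count the documents in one partition containing the keyword."""
    return sum(keyword in doc for doc in partition)
```

The front end can then combine the partial results, for example by summing the per-partition counts.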

The parallel data-loading algorithm is implemented in two phases: the data division phase and the operation execution phase. In the first phase, the scheduling module receives the execution information from the system administrator and calculates the data distribution strategy locally. The purpose of data distribution is to spread the data objects on a given data acquisition machine uniformly over multiple query processors so that the system can be fully parallelized during query processing. Data distribution is an important and active field in current research on parallel database systems, and several methods exist. One-dimensional data distribution is the simplest: by partitioning the domain of one attribute, the whole relation is partitioned into a set of subrelations, which are then distributed among the processors. At present, the one-dimensional data distribution methods mainly include Round-Robin, Hash, Range-Partition, and Hybrid-Range-Partition. These methods share a common problem: they cannot effectively support queries with selection predicates on nonpartition attributes. To solve this problem, several multidimensional distribution methods have been put forward, including the CMD method, the ECC data distribution method, the BM data distribution method, the FX data distribution method, the data distribution method based on the Hilbert curve, and the BERD multidimensional data distribution method. In the second phase, under the control of the data distribution strategy, threads of the operation execution modules are spawned on all the back-end query processors and data acquisition machines, and the operation execution information is broadcast to the threads on each back-end machine. After the thread of the operation execution module on a data acquisition machine receives the operation information, it transfers the local data to the query processors.
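Three of the one-dimensional distribution methods named above can be sketched briefly. This is an illustrative sketch under stated assumptions: the function names are hypothetical, CRC32 is chosen here only as a convenient deterministic hash, and range boundaries are treated as inclusive upper bounds of the lower partition's successor (a value equal to a boundary goes to the next partition).

```python
import bisect
import zlib


def round_robin(records, p):
    """Assign record number idx to processor idx mod p."""
    parts = [[] for _ in range(p)]
    for idx, rec in enumerate(records):
        parts[idx % p].append(rec)
    return parts


def hash_partition(records, p, key):
    """Assign each record by a deterministic hash of its partition attribute."""
    parts = [[] for _ in range(p)]
    for rec in records:
        digest = zlib.crc32(str(key(rec)).encode("utf-8"))
        parts[digest % p].append(rec)
    return parts


def range_partition(records, boundaries, key):
    """Assign each record by the range its partition attribute falls in.

    `boundaries` is a sorted list of split points, giving
    len(boundaries) + 1 partitions.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect.bisect_right(boundaries, key(rec))].append(rec)
    return parts
```

All three partition on at most one attribute, which is why a query that filters on some other attribute must still visit every processor.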

Metadata is an important part of the digital library, and its quality determines the quality of management of the whole digital library. Metadata is stored in the metadata table. It is important working data for digital library staff, and users can also query it to better understand the structure of the digital library and use it more effectively. The idea of the algorithm is that, when metadata is modified, the new values are stored in an Oracle table on the query processor of each back-end machine, and the new values in that table then replace the old values in the metadata table. Because both the new-value table and the metadata table exist on every query processor, the algorithm involves only the query processors, and it is exactly the same on each of them.

The algorithm modifies the metadata on the back-end machine. It takes no input and returns 0 on success. First, the name of the machine is obtained. The network configuration file is then opened and read starting from the second record. When the name of the machine matches the name of a machine recorded in the network configuration file, the logical name of the machine is obtained. A cursor then selects data from the table prepared for the metadata update, on the condition that the logical machine name in the row matches the logical name of this machine. For each row that the cursor returns, the record on this machine, including its title, is updated using the following items: document identifier, abstract, author, department, other information about the author, publisher, publication time, input time, page number, ISBN, and category code. Finally, the algorithm returns successfully.
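The cursor-driven update can be sketched as follows. This is a stand-in, not the paper's code: it uses Python's built-in `sqlite3` with an in-memory database instead of Oracle, and the table names `metadata` and `new_values`, the column names, and the logical machine names `qp1` and `qp2` are all hypothetical. Only two of the listed metadata items are modelled.

```python
import sqlite3


def apply_metadata_updates(conn, logical_name):
    """Replace old metadata values with the staged new values for one machine.

    Rows are selected from the staging table by logical machine name, and
    each one overwrites the matching metadata row. Returns the number of
    metadata rows updated.
    """
    cursor = conn.execute(
        "SELECT doc_id, title, author FROM new_values WHERE machine = ?",
        (logical_name,),
    )
    updated = 0
    for doc_id, title, author in cursor.fetchall():
        result = conn.execute(
            "UPDATE metadata SET title = ?, author = ? WHERE doc_id = ?",
            (title, author, doc_id),
        )
        updated += result.rowcount
    conn.commit()
    return updated


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metadata (doc_id TEXT PRIMARY KEY, title TEXT, author TEXT)")
    conn.execute("CREATE TABLE new_values (machine TEXT, doc_id TEXT, title TEXT, author TEXT)")
    conn.execute("INSERT INTO metadata VALUES ('d1', 'Old title', 'A. Author')")
    conn.execute("INSERT INTO metadata VALUES ('d2', 'Other', 'B. Author')")
    conn.execute("INSERT INTO new_values VALUES ('qp1', 'd1', 'New title', 'A. Author')")
    conn.execute("INSERT INTO new_values VALUES ('qp2', 'd2', 'Ignored', 'B. Author')")
    count = apply_metadata_updates(conn, "qp1")
    rows = conn.execute("SELECT doc_id, title FROM metadata ORDER BY doc_id").fetchall()
    conn.close()
    return count, rows
```

Because every query processor holds both tables, the same function could run unchanged on each of them, with only the logical name differing.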

4. Result Analysis and Discussion

4.1. Test One

The first experiment compares the multithread parallel text-loading algorithm with the serial algorithm when the number of query processors is fixed and the amount of data changes. The back end consists of two data acquisition machines and four query processors; each back-end machine has 1 GB of memory and a 70 GB hard disk. In this group of experiments, the amount of data varies while the number of back-end machines stays fixed. The purpose of the experiment is to compare the efficiency of the serial and parallel algorithms as the amount of data increases. The test data and the corresponding performance analysis are given below. The performance analysis table is shown in Table 1.

Table 1 gives the experimental data of the multithread parallel text-loading algorithm described in this article on 4 node machines. The table is indexed by data volume and by loading algorithm (parallel or serial), and each entry gives the running time of the algorithm. Comparison of parallelism and serialization when data quantity changes is shown in Figure 1.

The chart shows that, with the serial algorithm, although four query processors are used for data loading, they work in serial mode; that is, one query processor can start only when the previous one has finished. The total loading time is therefore the sum of the loading times of all the processors. Not only does the speed fail to improve, but the communication between the processors must also be taken into account. With the parallel algorithm, the four query processors work at the same time, so the total loading time is the loading time of the slowest processor. As the graph shows, when the number of processors is constant, the parallel algorithm is much faster than the serial algorithm as the data volume increases, and its total cost is about 1/3 that of the serial algorithm.
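The sum-versus-maximum argument can be written as a tiny cost model. This is an idealized sketch: the function names and the `comm_overhead` parameter are assumptions, and the model is not intended to reproduce the measured values in Table 1.

```python
def serial_load_time(per_processor_times, comm_overhead=0.0):
    """Serial mode: processors load one after another, so the times add up."""
    return sum(per_processor_times) + comm_overhead


def parallel_load_time(per_processor_times, comm_overhead=0.0):
    """Parallel mode: all processors load at once, so the slowest one dominates."""
    return max(per_processor_times) + comm_overhead
```

With hypothetical per-processor loading times of 3, 4, 5, and 4 time units and no communication overhead, the model gives 16 units for serial loading and 5 units for parallel loading. Real measurements include communication and contention costs, which is one reason an observed speedup on four processors can be lower than this ideal.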

4.2. Test Two

The second experiment compares the multithread parallel text-loading algorithm with the serial algorithm when the amount of data is fixed and the number of query processors changes. We use 19.7 GB of data, and the results are shown in the chart below. The table is indexed by the number of processors and by loading algorithm (parallel or serial), and each entry gives the running time of the algorithm. Algorithm running time is shown in Table 2. Comparison of multiple threads and single thread when number of processors changes is shown in Figure 2.

The experimental results show that, with the serial text-loading algorithm, the more query processors are used, the lower the efficiency of the algorithm. Under the serial algorithm, the query processors still load data one by one, and each query processor must wait until the previous one has finished. The overhead of the algorithm also includes the communication time between the processors. With the parallel algorithm, in contrast, the more query processors are used, the higher the efficiency, because all the query processors load data at the same time. The loading time saved outweighs the added communication time between processors, so the extra communication overhead is worthwhile. As the results show, the efficiency of the parallel algorithm improves significantly as the number of processors increases.

4.3. Test Three

The third experiment compares the multithreaded parallel text-loading algorithm with the single-thread loading algorithm when the number of query processors and the data volume are fixed and the number of data acquisition machines changes. We use 19.7 GB of data and 4 query processors. The experimental results are shown below. The table is indexed by the number of data acquisition machines and by whether each data acquisition machine and query processor spawns multiple threads or a single thread, and each entry gives the running time of the algorithm. The running time of the algorithm is shown in Table 3. Comparison of multiple threads and single thread when number of processors changes is shown in Figure 3.

The experimental results show that, for both the single-thread and the multithread parallel text-loading algorithms, the total execution time decreases as the number of data acquisition machines increases. With the single-thread loading algorithm, four query processors are used, and each query processor spawns only one thread to receive the data of one data acquisition machine. Each data acquisition machine likewise spawns one thread to send data to a query processor; only after one data acquisition machine has finished sending does the query processor receive data from the second. With the multithread loading algorithm, each query processor spawns M threads, where M is the number of data acquisition machines, and each data acquisition machine spawns N threads, where N is the number of query processors. Each data acquisition machine thus sends data to N query processors simultaneously with its N threads, and each query processor receives data from M data acquisition machines simultaneously with its M threads, so the multithread loading algorithm is much faster than the single-thread algorithm. The running time prediction of the algorithm is shown in Table 4. The comparison of the processor number prediction results is shown in Figure 4.
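The M-by-N thread layout can be sketched with in-process queues standing in for network connections. The function name `multithread_load`, the `(processor_index, record)` input format, and the use of `None` as an end-of-stream marker are assumptions made for illustration; records themselves must therefore not be `None`.

```python
import queue
import threading


def multithread_load(machine_data, n_processors):
    """Simulate M acquisition machines sending to N query processors.

    `machine_data[i]` is a list of (processor_index, record) pairs held by
    machine i. Each machine runs N sender threads and each processor runs
    M receiver threads, one per (machine, processor) channel.
    Returns the records received by each processor, grouped in machine order.
    """
    m = len(machine_data)
    channels = [[queue.Queue() for _ in range(n_processors)] for _ in range(m)]
    received = [[[] for _ in range(m)] for _ in range(n_processors)]

    def sender(i, j):
        for dest, record in machine_data[i]:
            if dest == j:
                channels[i][j].put(record)
        channels[i][j].put(None)  # end-of-stream marker

    def receiver(j, i):
        while True:
            record = channels[i][j].get()
            if record is None:
                break
            received[j][i].append(record)  # each list has a single writer

    threads = []
    for i in range(m):
        for j in range(n_processors):
            threads.append(threading.Thread(target=sender, args=(i, j)))
            threads.append(threading.Thread(target=receiver, args=(j, i)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [[rec for per_machine in received[j] for rec in per_machine]
            for j in range(n_processors)]
```

Giving every (machine, processor) pair its own channel and its own receive buffer means no two threads ever write to the same list, which mirrors the per-thread intermediate files used to avoid concurrent inserts.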

5. Conclusion

With the development of digital library technology, the parallel database system has become the key tool for efficient information processing in the digital library system. Data loading is a significant part of the digital library, and data-loading operations are well known to be time-consuming. Data loading, especially the loading of parallel text data, is a new field of research. In this paper, a novel data operation algorithm based on a new parallel digital library is proposed, and all the data operation algorithms are implemented in the prototype system. A multithread parallel text data-loading operation and maintenance algorithm is also proposed, which, to our knowledge, has not been studied before. A large number of experiments show that the algorithm proposed in this paper is more efficient than the existing algorithms. Considering the ratio of performance to price, the parallel algorithm has high practical value and benefit. In summary, this remains a large area with many open problems and much work still to be done, and we hope that experts and scholars will give it due attention.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Additional Points

This research involves human participants and/or animals.

Conflicts of Interest

The authors declare that they have no conflicts of interest.