Abstract

Distributed systems offer geographically distributed resources to serve the large-scale data requests of different users. In many cases, replicating vital data files and storing the replicas in multiple locations accessible to the requesting clients is essential for improving data availability, reliability, and security and for reducing execution time. It is important that real-time distributed databases maintain consistency constraints and also guarantee the time constraints required by client requests. However, as the size of a distributed system increases, the user access time also tends to increase, which in turn makes replica placement more critical. Thus, the primary issues that emerge are deciding upon an optimal number of replicas and identifying the best locations to store the replicated data. These open challenges are considered in this study, which develops a dynamic data replication algorithm for real-time distributed databases using a multiobjective glowworm swarm optimization (MGSO) strategy. The proposed algorithm adapts to the random patterns of the read-write requests and employs a dynamic window mechanism for replication. It also models the replica number and placement problem as a multiobjective optimization problem and utilizes MGSO to resolve it. Cost models are presented to ensure satisfaction of the time constraints in servicing user requests. The performance of the MGSO dynamic data replication algorithm has been studied using competitive analysis, and the results show the efficiency of the proposed algorithm for distributed databases.

1. Introduction

A distributed system is a group of independent computers that appears to the clients of the system as a single computer [1]. In these systems, data objects are placed on servers situated at geographically distant locations. In the recent past, distributed databases have become a vital part of business computing. Maintaining data authentically and offering precise and timely processing of database queries and updates over numerous sites have been critical factors in enabling organizations to use data in a range of locations on a worldwide scale [2]. Standardisation of query languages, and of the relational and object models, has helped in the coordination of various database systems to form systems of integrated data services [3]. Issues such as guaranteeing the integrity of data, ensuring that updates are timely, and giving clients a uniform rate of response regardless of their position in the system have become real difficulties for database vendors and clients [4].

Replication [5] is the procedure of sharing information so as to guarantee consistency between redundant resources, such as software or hardware components, in order to enhance dependability, fault tolerance, or accessibility. It can be either data replication, if the same data are stored on multiple storage devices, or computation replication, if the same task is executed many times. It is the procedure of automatically distributing copies of data and database objects among SQL server instances and keeping the distributed information synchronized [6, 7]. The data replication problem is an extension of the classical file allocation problem. The protected sharing of information in this sort of environment is a complicated issue, since the owners of the diverse data sources often have distinct policies on access and dissemination of the data that they hold [8]. There are two principal types of replication protocols: active replication, in which all replicas process all input messages simultaneously, and passive replication, in which only one of the replicas processes all input messages and periodically transmits its current state to the other replicas so as to maintain consistency [9]. The principal goal of data replication in distributed systems is to increase dependability and to enhance the performance of the entire system. The important concern in data replication is the number of replicas that ought to be made and their positioning in order to meet a specific performance goal, which is termed the total operation cost (TOC).

Data distribution and replication offer chances for enhancing performance through parallel query execution and load balancing, and also through increased availability of data [10]. In a distributed database system, data are frequently replicated to enhance reliability and availability, thus increasing the system's dependability. Likewise, data are additionally stored on the computers where they are most often required in order to minimize the cost of expensive remote access [11]. Decision-making involves handling or applying information and knowledge, and the appropriate information/knowledge mix relies upon the attributes of the decision-making setting [12, 13]. The replication or relocation of shared data at arbitrary locations on chip requires the use of directory- or broadcast-based mechanisms for lookup and coherence enforcement, as each piece of data is likely to have diverse placement requirements [14, 15]. Replicating stock and client-related data at these diverse locations is desirable since it gives quick access to the nearby replica and survives disaster cases where all machines of a physical location crash [16]. In any case, the current systems for replication of distributed databases do not give the anticipated outcomes due to failures in choosing the number of replicas and the best locations for replica placement [17]. These issues form the basic motivation for the proposed model. In order to tackle these challenges, a multiobjective glowworm swarm optimization- (MGSO-) based dynamic data replication algorithm for distributed database systems has been proposed in this study. The remainder of this article is organized as follows. Section 2 provides a literature review of related works. Section 3 describes the CAP theorem and its limitations. Section 4 explains the proposed replication algorithms, while Section 5 provides the simulation results of this model. Section 6 presents the conclusion.

2. Related Works

Replication is an important part of all distributed systems. The two widely used replication procedures are static replication [18] and dynamic replication [19], and various researchers have developed replication strategies based on these two procedures. In static replication, the number of replicas for each file is fixed manually and preselected, while in dynamic replication it is determined by an automatic decision system during execution. These decision modules decide on the replica number with consideration of the system infrastructure as well as the environment and clients' access strategies. In general cloud applications, dynamic strategies are much preferred as they suit the flexible requirements of the clients.

Khanli et al. [20] developed the predictive hierarchical fast spread (PHFS) model for dynamic replication, which decreases latency in multitier grid systems. This model is an enhanced version of the previously utilized fast spread method [21] and aims at flexibly utilizing the spatial positions [21, 22] of the hosts to serve user requirements. Though this model reduces delay and increases performance for locality-based access patterns, it does not consider storage constraints during priority evaluation. It also has low adaptability in determining suitable thresholds and time intervals for applications with varying features. Previously, Kunszt et al. [23] presented a file-based replication management system that effectively incorporates features required by application programmers and end users. However, similar to [20], this model is aimed at serving multilevel grids with much higher storage, and hence its suitability for storage-constrained systems is very low. Lin et al. [24] introduced eStor, a data center storage scheme for ensuring energy-efficient data replication. This model enabled underutilized storage servers to power off to limit energy wastage during the idle state. However, it did not include a predictability evaluation function, and hence future client requests may face extended delays. Chang et al. [25] developed a dynamic data replication mechanism named latest access largest weight (LALW) suitable for grid cluster systems. This mechanism selects a popular file from the database and replicates it to a suitable number of systems or grids to ensure effective data access. It also enhances the load balancing of replicas with a registration process that selects the weights based on the time of the requests. The major limitations of this model are the high job execution time when the time interval is short and the unsuitable selection of the exponential decay base, which can cause a single point of failure. Dong et al. [26] introduced a replication strategy over multiple data centers to limit power utilization in the backbone network. However, this work focuses on replication methodologies between data centers, not within them.

Ping et al. [27] developed an optimal data replication model that operates across data centers to limit client-side data access delay. This model utilized weighted k-means clustering of user locations to determine the optimal replicas. However, it does not adapt to varying applications and their storage constraints. Li et al. [28] developed a novel cost-effective dynamic data replication strategy based on the incremental replication method. This strategy reduces the storage cost while satisfying the data reliability requirements. However, it is efficient only when the data reliability requirement is low and the storage unit aging process is very critical; moreover, the trade-off between cost and performance is not satisfactory. Perez et al. [11] introduced a replication technique known as the Branch Replication Scheme (BRS) for heterogeneous grid systems. BRS provides optimized storage usage, increased data access performance, and the ability to modify replicas. These advantages enable operations like reading or updating a replica to be performed more efficiently and reduce the overhead of data access processing. However, BRS is suitable only for complex applications with dedicated storage space for each branch, which is highly unlikely in real-time grids. Similar to BRS, CDRM (cost-effective dynamic replication management) has been proposed for heterogeneous cloud systems in [29]. This model analyses the relationship between availability and replica number and then determines the minimal replica number for a given availability. It is also flexible in adjusting the replica number and locations based on varying workload conditions and node capacities. However, it suffers from delay in analysing the relationship function. Qu and Xiong [30] developed the resilient, fault-tolerant, and highly efficient (RFH) replication algorithm, resolving the flash crowd problem in distributed systems. However, this model has weak consistency maintenance, which reduces the reliability of its performance.

Lin et al. [31] proposed another approach utilizing a priority list-based concept of optimal replica placement. This strategy resolved the issue of users' limited authority in resource access, satisfying the spatial requirements during the placement stage. The model places the replicas in an optimal manner such that the workload is balanced, by determining the minimum replica number when the workload capacity is maximized. However, it leaves open the replica placement problem for general graphs and planar graphs. Similarly, Andronikou et al. [32] developed a dynamic QoS-aware replication strategy that utilizes the concept of data importance to select replica locations. This model analysed the complete lifecycle of replication, and then new replicas are generated to migrate the old replicas from their current locations. It utilized a set of interoperable novel file replication algorithms to analyse and determine the infrastructure constraints and to simplify the replication process based on data popularity. However, it does not consider all the QoS factors of both service provider and client-related requirements among the market-related constraints. Bonvin et al. [33] introduced a self-organized, fault-tolerant, and scalable replication scheme which operates based on multiple differentiated availability guarantees for the replicas of each application. This scheme employed a virtual economy as a game-theoretical model to determine the optimizer for data partitioning and to migrate the replicas to their respective locations. However, this model suffers from a limitation on data size, i.e., it is suitable only for smaller data applications. Boru et al. [34] developed an energy-efficient data replication strategy by considering both energy consumption and bandwidth consumption. This model reduces the energy bottleneck problem and also reduces network delays and bandwidth wastage. Its only shortcoming is the practical difficulty of implementing it in real-world applications. Mansouri et al. [35] presented a cost optimization model for ensuring dynamic replication and subsequent migration in the cloud. This model utilized an optimal offline algorithm based on dynamic and linear programming to analyse the workload of the system. Then, two online algorithms were introduced to minimize the trade-off between storage and migration costs, followed by dynamic selection of storage classes. This model reduces the time complexity of the offline algorithms, but its major drawback is the inability to tackle availability issues. Nagarajan and Mohamed [36] introduced a prediction-based dynamic replication strategy utilizing multiple parameters of the neighboring sites. The parameters considered are storage capacity, bandwidth, and communication cost for selecting and placing the replicas. This model also used a modified Apriori algorithm for predicting future needs in the grids, but it does not focus on effective scheduling strategies, which is a significant limitation. Pan et al. [37] proposed a dynamic replication management strategy based on the load balance condition of the distribution system. This approach improved replica management, although it considers only the load balance state while ignoring the scheduling of waiting jobs.
Though extensive research has produced numerous replication strategies, issues such as the optimal selection of the replica number and the locations to store the replicas continue to remain challenging. This study provides an approach that resolves these issues more effectively than most existing approaches.

3. CAP Theorem

The CAP theorem was presented in the context of web services as a trade-off between consistency, availability, and partition tolerance [38]. The CAP theorem is a fundamental part of the theory of distributed systems. It expresses that, in the presence of partitions (i.e., network failures), it is not possible for a system to be both consistent and available, and therefore it becomes important to select one of the two. It has truly changed the landscape of how distributed storage systems are architected. It can be expressed as follows: "In a network subject to communication failures, it is impossible for any web service to implement an atomic read/write shared memory that ensures a response to each request." The three fundamental properties of the CAP theorem are consistency, availability, and partition tolerance [39].

CAP offers guidance for the architecture of distributed systems. However, one cannot build a distributed database system that is continually available, sequentially consistent, and tolerant to partitions; it can be built with only two of these three properties. Furthermore, certain restrictions degrade the applicability of the CAP theorem [40, 41]. The limitations of CAP are given in [41, 42] as illustrated below:
(i) Binary existence is convenient for proof purposes but does not closely match the intuitive notion of availability. The traditional definition of the CAP theorem does not take into account a quantitative measure of network latency. According to the availability property, if the response has not arrived, there is still hope that the response will arrive, but there is no upper bound on latency.
(ii) Availability requires that only the nonfailed nodes respond. In a partitioned network, even if one node fails at a given time, the availability of the system is hampered. In order to ensure complete availability, one of the solutions proposed is to forcibly make all nodes unavailable [41]. But this trivial solution is unacceptable since it unnecessarily tampers with the remaining active nodes.
(iii) The CAP theorem fails to encompass problems like node failure, loss or delay of messages, and the restart time lapse of nodes, beyond partitions. A link possesses fair link loss if it has a nonzero probability of packet loss; on such a link, lost packets will be delivered within a limited number of repeated attempts, ensuring that packets reach the destination. Fair link loss is closely associated with mobile networks, which are integral to today's applications. Problems like node failures and restarts are no longer accidental as much as they occur due to attacks on a system. For instance, the denial-of-service attack is common and is one of the most notorious attacks on network operation [42].
(iv) It is also possible to define consistency as a quantitative metric rather than a safety property. However, these stochastic definitions of consistency are not the subject of CAP.
(v) Partitioning is inevitable in any distributed system. Even if we assume that each node has a 99.9% chance of not failing in a particular time period, a cluster of five such nodes has only a 99.5% chance that no node fails [40]; the computation is shown below. Thus, one cannot compromise on partition tolerance, and there is an inevitable choice between availability and consistency.
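To make the figure in item (v) concrete, suppose the five nodes fail independently, each surviving the period with probability 0.999. The probability that no node fails is then

P(no failure) = 0.999^5 ≈ 0.995,

i.e., even with highly reliable individual nodes, there is roughly a 0.5% chance that at least one node fails during the period, bringing with it the possibility of a partition that the system must tolerate.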

4. MGSO-Based Dynamic Replication Strategy

The limitations of the CAP theorem have been considered, and the proposed MGSO-based dynamic replication algorithm is presented in order to achieve better performance. Initially, the distributed database system under consideration is described. The scheme consists of n nodes, represented as p_1, p_2, …, p_n. Each node comprises a processor and a local memory. All the local memories are private and accessible only by the local processors. Internode communication is supported by sending data through the underlying network. All requests, each with its corresponding time deadline, are presumed to reach the processors. Requests arrive at a processor synchronously, and a concurrency control mechanism serializes them for handling. For each requested data object, it is assumed that there are at least t replicas in the database system, where t ≤ n and n is the number of network nodes. This limitation is customarily referred to as the t-availability constraint. Figure 1 shows the overall flow of the proposed dynamic replication strategy. The overall process begins with the initialization of the LAN clusters, each of which consists of one server with n clients. The data to be replicated are selected and loaded to the main server in the test environment. Then the dynamic window-based mechanism is utilized to access the data on each processor, where requests for the same data are stocked in a temporal window until the fixed time deadline of the user request. This window mechanism determines the concurrent execution of the arriving user requests or queries. Based on this temporal window, the processor receiving the largest number of requests is considered for storing a replica of the original data from the main server in order to make it available to a certain set of users. The MGSO is applied to identify the optimal number of processors that can be allowed to replicate based on the queries/requests and also to determine which processors are suitable for storing the replica data depending upon their load and other features. The output obtained from the MGSO model is employed for final replica storage, which becomes invalid if the publisher or authority of the original data modifies the data with updated contents. At that point, the abovementioned process is restarted to update the replica system.

In the proposed system, all requests reach the distributed processors in the network system. For evaluation purposes, we have utilized an example of storing replicas of a dataset that contains YouTube, Sensor, and Twitter contents. These data are filtered and preprocessed to obtain a complete individual set of data before storing them in the main server. Once the data are stored in the main server, the data requests from the users are analysed. When a processor p wants to read data d, if the latest version of data d is in its local memory, then data d is directly recovered from the local memory; else, since p knows the fixed set S_d of servers of data d, p will direct a read request to the nearest server s in S_d, incurring one unit of cost. As a reply, s will recover data d from its local memory and send it to p, incurring another unit of cost. Finally, in order to minimize the total servicing cost of forthcoming requests, p might decide to save data d into its local memory. Such a read request is described as a saving-read request. It should be noticed that only servers of data d can get requests from remote processors to read data d. Furthermore, each processor can issue a write request for any data in the distributed database without loss of generality. In this manner, the request arriving at a processor q can be any one of the following: a read request from processor q for data d in its local memory, a write request from processor q for any data in the system, or a read request from a remote processor for data d′ if processor q is a server of data d′.

4.1. Cost Model

This cost model [43] is utilized to compute the cost of servicing a read request or a write request arriving at a processor q. The cost of servicing a request REQ by an algorithm A is defined as cost_A(REQ). Consequently, in view of servicing a read request r_q(d), let A_d(q) be the allocation pattern of data d (the set of processors holding a copy of d) identified by processor q; then the cost is given by

cost_A(r_q(d)) = 0, if q ∈ A_d(q),
cost_A(r_q(d)) = 2, if q ∉ A_d(q) and r_q(d) is a nonsaving-read request,
cost_A(r_q(d)) = 3, if q ∉ A_d(q) and r_q(d) is a saving-read request,

where one unit of cost is charged for forwarding the request to the nearest server in A_d(q), one unit for returning the data, and one additional unit for saving the data into the local memory of q.

An important feature of this model is that, after getting the data d, when q keeps data d in its local memory (saving-read request), the servicing cost is one unit greater than when q does not save data d (nonsaving-read request). Moreover, whether a read request is a saving-read request or not is decided by processor q based on the dynamic request window mechanism. Once processor q decides that a request is a saving-read request, q and the servers of data d will update their corresponding allocation patterns.

Likewise, consider servicing a write request w_q(d), and let A_d(q) be the allocation pattern of data d recognized by processor q. Then, the cost of servicing this request is specified as follows:

cost_A(w_q(d)) = |A_d(q) \ {q}|,

where A_d(q) \ {q} represents the set of processors in A_d(q), excluding processor q. A write request generates a new version of the data. In order to sustain consistency among the replicas of the data, the new version should be moved to those processors which hold replicas of the data in their corresponding local memories. It must be noted that each relocation of the data to a corresponding processor incurs one unit of cost.
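A minimal Python sketch of this per-request cost accounting is given below, assuming the reconstructed unit-cost model (one unit per message, one extra unit for saving a fetched copy). The function names read_cost and write_cost and the representation of the allocation patterns A as a dictionary of processor sets are illustrative assumptions, not the authors' implementation.

def read_cost(q, d, A, saving):
    # A[d] is the allocation pattern of data d: the set of processors
    # currently holding a copy of d.
    if q in A[d]:
        return 0              # latest version is local; no messages needed
    cost = 2                  # one unit for the request, one for the reply
    if saving:
        cost += 1             # saving-read: store d into q's local memory
        A[d].add(q)           # q joins the allocation pattern of d
    return cost

def write_cost(q, d, A):
    # The new version of d must be sent to every other replica holder,
    # at one unit of cost per transfer.
    return len(A[d] - {q})

For example, with A = {"d": {1, 3}}, read_cost(2, "d", A, saving=True) returns 3 and adds processor 2 to the allocation pattern of d, so a later read_cost(2, "d", A, saving=False) returns 0.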

4.2. Window Mechanism

A concurrency control mechanism in each processor is expected to serialize the arriving requests so that it yields at most one request in δ time units, with δ being imperceptibly small. Without loss of generality, it is assumed that δ = 1. Each request landing at a processor creates an underlying request sequence. In order to process these requests, their individual time deadlines are noted at the discharge points of the concurrency control mechanism, and the window mechanism is invoked to respond to each request. Figure 2 shows the working process of the proposed window mechanism, which defines the addition of new write requests to the currently available read requests. It can be seen that the concurrency control is performed in time slots for each request.

Furthermore, the scheme of the window mechanism is dynamic in practice. Numerous dynamic request windows are produced in each processor, one for each requested object. A request window for data d in a processor q is symbolized as W_q(d). Each request window is a FIFO-type window with size τ that collects at most τ requests in τ time units for the same data. Furthermore, two counters, C_w(d) and C_d(d), are attached to each W_q(d). C_w(d) has an initial value τ, and its value is decremented by one per time unit till it reaches 0. C_d(d) tracks the earliest time deadline of the requests in W_q(d). The window mechanism is described in Algorithm 1.

Initialize the servers and subservers
For each time unit
 If there is a request REQ for data d with time deadline t_REQ
  If W_q(d) does not exist
   Generate W_q(d) and insert REQ into W_q(d);
   C_w(d) := τ; C_d(d) := t_REQ;
  Else if REQ is a read request
   Insert REQ into W_q(d); C_d(d) := min(C_d(d), t_REQ);
  Else
   Service the requests in W_q(d); Insert REQ into W_q(d); C_w(d) := τ; C_d(d) := t_REQ;
  End if
 End if
End for
For each presently existing request window W_q(d) in processor q
 C_w(d) := C_w(d) − 1; C_d(d) := C_d(d) − 1;
 If C_w(d) = 0
  Service the requests in W_q(d); Delete W_q(d);
  If (q is a server of data d)
   Invalidate the copies of d in the processors that belong to A_d(q);
   Empty A_d(q);
  End if
 Else if C_d(d) = 0
  Service the requests in W_q(d); Empty W_q(d); C_w(d) := τ;
 End if
End for

It can be witnessed that the handling of requests in a request window can be triggered by the deadline obligation of a specific request and is not solely constrained by the expiry of the request window. In fact, three conditions can activate immediate servicing of the requests in W_q(d). Any time C_w(d) of W_q(d) in processor q reaches 0, W_q(d) will be removed from q, and the window mechanism in q will be reset. Likewise, if q is a server of data d, then q will directly send invalidation control messages to the processors in A_d(q) to invalidate the replicas of data d and empty A_d(q). The succeeding request for data d received at processor q will be treated as the first request for data d, and W_q(d) will be restarted by the window mechanism. Thus, the request sequence introduced into W_q(d) is considered through its distinct lifetime, and it is obvious that, by using the window mechanism, a request sequence will be principally subdivided into numerous phases P(1), P(2), …, P(r). Each phase fits either into Type I or into Type II: a Type I phase comprises a number of read requests, while a Type II phase comprises a write request followed by numerous read requests. Whenever a phase comes into existence, the window mechanism will process the request sequence in that phase without any knowledge of the upcoming phases [44].

It must be noticed that the length of a request window, τ, is a key parameter in our system. Different values of τ bring about different system performance. The value of τ ought to be determined based on the node capability (for example, CPU power, memory capacity, and network bandwidth), the system request arrival rate, and the deadlines imposed by the application requests. It is clear that the value of τ ought not to be too small. If τ = 1, then each request will form its own request sequence, and each request window would be produced and erased in one time unit. This would quickly consume the vast majority of the computing capacity of the processor. It must also be noticed that if the deadlines imposed by the requests are too short for the system to process, certain requests might be dropped by the system. Without loss of generality, the deadline imposed by a request that can be effectively handled by the system is at least equal to 1. Such dropped requests, known as blocked requests, either leave the system or are resubmitted, depending on the underlying application. They can be managed by the admission control mechanism of this model.
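A compact Python sketch of this window mechanism is given below. It is a simplified rendering of the reconstructed Algorithm 1; the class names, the per-window counters cw and cd, and the service placeholder are illustrative assumptions rather than the authors' implementation.

from dataclasses import dataclass, field

@dataclass
class Request:
    kind: str        # "read" or "write"
    data: str        # identifier of the requested data object
    deadline: int    # relative time deadline, at least 1

@dataclass
class RequestWindow:
    requests: list = field(default_factory=list)
    cw: int = 0      # window countdown, initialized to tau
    cd: int = 0      # countdown to the earliest deadline in the window

class Processor:
    def __init__(self, tau):
        self.tau = tau
        self.windows = {}               # data object -> RequestWindow

    def service(self, w):
        # Placeholder: hand the batched phase over to the replication layer.
        w.requests.clear()

    def on_request(self, req):
        w = self.windows.get(req.data)
        if w is None:                   # first request for this data object
            w = RequestWindow([req], self.tau, req.deadline)
            self.windows[req.data] = w
        elif req.kind == "read":        # batch reads, keep tightest deadline
            w.requests.append(req)
            w.cd = min(w.cd, req.deadline)
        else:                           # a write closes the current phase
            self.service(w)
            w.requests.append(req)
            w.cw, w.cd = self.tau, req.deadline

    def tick(self):
        # Invoked once per time unit: age all windows, flush expired ones.
        for data, w in list(self.windows.items()):
            w.cw -= 1
            w.cd -= 1
            if w.cw <= 0 or w.cd <= 0:
                self.service(w)
                del self.windows[data]

Here tick corresponds to one time unit of Algorithm 1, and a write arriving in on_request closes the current phase, matching the Type I/Type II phase structure described above.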

4.3. Replica Selection and Placement Problem

The replica selection and placement problem in distributed systems can be defined by considering a distributed system containing K data objects that are to be replicated onto N servers. Let S_n and O_k be server n and data object k, respectively. Let c(S_n) denote the capacity of server n and v(O_k) the volume of data object k, where 1 ≤ n ≤ N and 1 ≤ k ≤ K. A link between two servers S_i and S_j (if it exists) has an associated integer d(i, j), which gives the communication cost for transmitting a data unit between servers S_i and S_j; it is presumed that d(i, i) = 0. Let r(n, k) and w(n, k) be the numbers of read and write demands issued from server n for data object k. Each data object O_k has a primary server P_k, which holds the main copy of O_k; the main copy of a data object cannot be deallocated. Figure 3 shows the replica placement strategy, in which the replicas are selected after determining the constraints and then the placement is carried out for practical use.

It is expected that each primary server P_k holds the replication scheme of the k-th data object, R_k, which comprises the list of servers where O_k is replicated. In order to fulfil a write request, P_k obtains the update request from the source server that requests to update the data object and broadcasts it to all servers in its replication scheme R_k. The main objective of the replica placement problem is allocating replicas over all the servers in order to reduce the total operation cost TOC(k), which is induced by two components, RC(k) and WC(k). Hence, TOC(k) is obtained as follows:

TOC(k) = RC(k) + WC(k),

where RC(k) is the total operation cost due to all servers' read requests for O_k and WC(k) is the total operation cost due to all servers' write requests for O_k.

RC(k) and WC(k) are given as follows:

RC(k) = Σ_{n=1}^{N} r(n, k) · v(O_k) · d(n, NN(n, k)),
WC(k) = Σ_{n=1}^{N} w(n, k) · v(O_k) · (d(n, P_k) + Σ_{j ∈ R_k, j ≠ P_k} d(P_k, j)),

where NN(n, k) is the nearest server to S_n that holds a replica of O_k.

Consequently, the overall TOC based on the total read and write requests' cost for all data objects is obtained as follows:

TOC = Σ_{k=1}^{K} TOC(k) = Σ_{k=1}^{K} (RC(k) + WC(k)).

Minimizing this equation is the solution to the replica selection and placement problem in distributed systems. This is achieved in the proposed model by using a multiobjective glowworm swarm optimization algorithm that optimally selects both the number of replicas and the server locations for placing them.
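As a concrete illustration of this cost model, the following Python sketch evaluates the TOC of a given replication scheme. It uses the reconstructed notation above; the parameter names (d, r, w, v, primary, R) and the nested-loop structure are assumptions for illustration, not the paper's code.

def toc(d, r, w, v, primary, R, N, K):
    # d[i][j]: per-unit transfer cost between servers i and j (d[i][i] == 0)
    # r[n][k], w[n][k]: read/write counts issued from server n for object k
    # v[k]: volume of object k; primary[k]: its primary server
    # R[k]: set of servers holding a replica of object k (includes primary[k])
    total = 0
    for k in range(K):
        p = primary[k]
        for n in range(N):
            # Read cost: fetch from the nearest server holding a replica.
            nearest = min(R[k], key=lambda j: d[n][j])
            total += r[n][k] * v[k] * d[n][nearest]
            # Write cost: send the update to the primary, which broadcasts
            # it to every other server in the replication scheme.
            total += w[n][k] * v[k] * (d[n][p] + sum(d[p][j] for j in R[k] if j != p))
    return total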

4.4. MGSO-Based Replica Selection and Placement

In the proposed model, each glowworm agent X_i is a matrix with Boolean components [45]. The replication problem is modelled into each glowworm, and it moves towards the brighter glowworm, i.e., towards determining which processor is better suited to store the replica data. As the glowworm agents are Boolean matrices, logical operations are employed: logical OR operations are performed by the processing unit to move the selected glowworm towards the better one in terms of brightness (i.e., better replica conditions). Let element X_i(n, k) = 1 if server S_n holds a replica of O_k, and X_i(n, k) = 0 otherwise. The fitness value of each agent is calculated in terms of the percentage of TOC that is saved by the replication scheme of the algorithm compared to the initial one, i.e., when only primary copies exist. This value indicates the solution quality of the replication scheme associated with the glowworm agent. The fitness function is computed as follows:

F(X_i) = ((TOC_o − TOC(X_i)) / TOC_o) × 100,

where TOC(X_i) is the TOC for the replication scheme of the i-th agent and TOC_o is the TOC achieved in the initial allocation, which is calculated when only the primary copies of the data objects exist and no replicas have been produced.

In addition to the Boolean logical operations, the update and other processes of the MGSO also need to be performed, and these are undertaken by the specified agents. The validity of each X_i is constrained by the servers' capacities. The foremost principle in a multiobjective GSO is that the solution is not an absolute optimal solution as in the common GSO, because only the most viable among the optimal solutions is chosen, and this cannot be considered 100% optimal. The population and step size are initialized randomly for the noninferior sorting process. The luciferin update is performed as follows:

ℓ_i(t + 1) = (1 − ρ) ℓ_i(t) + γ J(x_i(t + 1)),

where ℓ_i(t) represents the luciferin level associated with glowworm i at time t, ρ is the luciferin decay constant (0 < ρ < 1), γ is the luciferin enhancement constant, and J(x_i(t)) represents the value of the objective function at glowworm i's location at time t.

During the movement phase, each glowworm decides, using a probabilistic mechanism, to move toward a neighbor that has a luciferin value higher than its own. However, in order to increase population diversity and convergence speed, a disturbance term is added to the location-updating formula:

x_i(t + 1) = x_i(t) + s · (x_j(t) − x_i(t)) / ‖x_j(t) − x_i(t)‖ + λε,

where x_i(t) is the location of glowworm i at time t, s is the step size, λ controls the range of disturbance, and ε is a random number that follows the normal distribution. λ is an algorithm parameter set in advance, which is relevant to the problem. The range of λ is 0 to 1, but in general, the value of λ is set to 0.001.

When the glowworms are guided only by local information in selecting their movements, it is anticipated that the number of peaks captured would be a function of the radial sensor range. Consequently, GSO utilizes an adaptive neighbourhood range in order to detect the existence of multiple peaks in a multimodal utility landscape. It is represented as follows:

r_d^i(t + 1) = min{ r_s, max{ 0, r_d^i(t) + β (n_t − |N_i(t)|) } },

where r_d^i(t) is the neighbourhood range, r_s is the radial sensor range, β is a constant parameter, and n_t is a parameter used to control the number of neighbors |N_i(t)|. The complete MGSO process for replication in distributed systems is given in Algorithm 2.
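A one-line Python rendering of this adaptive range update, under the reconstructed symbols above (r_d, r_s, beta, n_t), may make the clamping behavior clearer; the function name is illustrative:

def update_range(r_d, r_s, beta, n_t, num_neighbors):
    # Grow the range when neighbors are scarce, shrink it when crowded,
    # clamping the result to [0, r_s] (the radial sensor range).
    return min(r_s, max(0.0, r_d + beta * (n_t - num_neighbors)))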

Initialize the population N, step size s
Set the servers (solutions) as glowworms X_1, X_2, …, X_N
A = ∅; all nondominated solutions will be placed in A
Let t = 0;
Repeat
 Compute the fitness F(X_i) of each glowworm X_i;
 Update the luciferin ℓ_i(t) of each glowworm;
 Determine the direction of movement of each X_i;
 Update the location of each X_i;
 Initiate the noninferior sorting process;
 Update A;
 t = t + 1;
Until t = t_max
Place replicas at the servers given by the best solution in A;
End

The most time-consuming parts of the algorithm are population initialization and the updating of the best positions achieved by all glowworms in each iteration. The time complexity of initializing N glowworms is O(N). In order to update the best neighbourhood locations in each iteration, the fitness values of all agents must be computed; the time complexity of the fitness value calculation is O(N). Thus, the time complexity of solving replica placement in distributed systems via MGSO is O(N · t_max).
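To tie Algorithm 2 to the cost and fitness definitions above, a condensed Python sketch of the MGSO loop is shown below. The Boolean-OR movement, the archive size, and the helper names (mgso, or_move, toc_of) are illustrative assumptions; toc_of stands for a TOC evaluator such as the one sketched in Section 4.3, and the server-capacity validity check is omitted for brevity.

import random

def fitness(X, toc_of, toc_initial):
    # Percentage of TOC saved relative to the primary-copy-only allocation.
    return 100.0 * (toc_initial - toc_of(X)) / toc_initial

def or_move(Xi, Xj):
    # Boolean "movement": Xi gains the replicas of the brighter agent Xj.
    return [[a | b for a, b in zip(ri, rj)] for ri, rj in zip(Xi, Xj)]

def mgso(n_agents, n_servers, k_objects, toc_of, toc_initial,
         rho=0.4, gamma=0.6, t_max=100, archive_size=10):
    agents = [[[random.randint(0, 1) for _ in range(k_objects)]
               for _ in range(n_servers)] for _ in range(n_agents)]
    luciferin = [0.0] * n_agents
    archive = []                               # best schemes found so far
    for t in range(t_max):
        fit = [fitness(X, toc_of, toc_initial) for X in agents]
        # Luciferin update: l_i <- (1 - rho) * l_i + gamma * J(x_i)
        luciferin = [(1 - rho) * l + gamma * f for l, f in zip(luciferin, fit)]
        for i in range(n_agents):
            brighter = [j for j in range(n_agents) if luciferin[j] > luciferin[i]]
            if brighter:                       # move toward a brighter neighbor
                agents[i] = or_move(agents[i], agents[random.choice(brighter)])
        # Stand-in for the noninferior sorting process: keep the best schemes.
        archive = sorted(archive + list(zip(fit, agents)),
                         key=lambda p: p[0], reverse=True)[:archive_size]
    return archive[0][1]                       # best replication scheme found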

5. Simulation Results

5.1. Validation of MGSO-Based Replication Algorithm in Hadoop Simulation

The proposed MGSO-based dynamic replication algorithm uses the strategies of MGSO with the dynamic window mechanism to tackle the limitations of the CAP theorem. The theoretical soundness of any algorithm is undermined when it is not practically viable. Hence, the validation of the MGSO-based algorithm is performed on Hadoop clusters. The Hadoop framework consists of the Hadoop distributed file system (HDFS) and the MapReduce framework, which are efficient for setting up multiple clusters of LAN servers and subservers for evaluation. HDFS can be utilized for storage without requiring the user to purchase cloud or other storage platforms. The cluster setup ensures that there is adequate data transmission between client systems in distant locations without much packet loss and also with minimal energy consumption. As stated above, a raw database consisting of data collected from three different domains is used for validation. The raw database includes environmental data collected by sensors and social data collected from Twitter and YouTube. Initially, these three types of data are unstructured; they are refined during the storage of the primary copy for replication using the window mechanism. The validation results help in justifying the performance of the proposed model.

The abovementioned database is taken for simulating the proposed algorithm. The initial Hadoop settings are configured as in the instruction manual, and then the main simulation project is executed. Figure 4 shows the initial processes involved in setting up the primary data copy in a distributed system for replication. These data are the main source that contains all three types of collected data, namely, YouTube, Sensor, and Twitter.

Figure 5 shows the effect of the window mechanism on the replication process. Figure 5(a) illustrates the practical settings required for initializing the window mechanism, while Figure 5(b) shows the query/request analysis based on the data categorization of those requests. Based on the requests, the demand for each datum is obtained, which is helpful in determining the number of replicas.

Figure 6 shows the backend server table after the admission control mechanism where the storage and processing capabilities of each server/processor are stored. This information will be utilized for selecting the processors where the data have to be replicated. Figure 7 shows the replica selection and placement process as performed based on the concept of Figure 3 utilizing the MGSO algorithm.

Figure 8 shows the replica selection using MGSO. This process is performed after the initialization of the window mechanism. When the raw data are categorized into three individual categories, the subserver most suitable for storing each specific category is chosen using MGSO.

Figure 9 shows the MGSO-based replica placement stage. In this stage, each replica, which is basically a single category of data, is placed in the relevant server in another cluster. These data can be fetched through a user query by the clients as and when required.

Figure 10 shows the data replication placement table, which is maintained at the servers in each cluster as a routing table for priority selection. Based on these tables, when a client raises a query, the servers in the clusters select the best server from which to fetch the data relevant to the query.

The query process is described in Figures 11(a) and 11(b). When a user issues a query, it is submitted to the client system, which in turn forwards it to the respective server. As mentioned before, the server checks the availability of the data for the query and, based on the priority of the stored server locations, chooses the server with minimum time consumption. The request REQ is then forwarded, and the data are obtained for the write requests, as shown in Figure 11(b).

5.2. Performance Evaluation

The simulations are performed using the Hadoop system, and the performance of the proposed MGSO-based model is compared with that of four other replication algorithms, namely, least frequently used (LFU), least recently used (LRU), the 3-level hierarchical algorithm (3LHA), and the bandwidth hierarchy-based replication algorithm (BHR). Execution time is how long a program takes to run, while time complexity is the asymptotic behavior of running time as the input size tends to infinity. In this evaluation, the mean execution time is used, which is the average, over 12 iterations, of the time taken by the algorithm to select the optimal replicas and perform the subsequent placement and the time to execute one read/write request. This parameter is utilized to evaluate the efficiency of the proposed model.

Figure 12 illustrates the mean execution time comparison of the replication algorithms with respect to storage size, while Figure 13 shows the comparison in terms of mean execution time with respect to file size. It can be observed that the MGSO-based dynamic replication algorithm has a lower execution time than the other algorithms. Similarly, even when larger files are employed for replication, the proposed model has a lower execution time, thus justifying the performance of the model. The reason for this advantage is that the window mechanism and the optimal replica selection concept significantly reduce the execution time.

Figure 14 shows the execution time for replica selection and placement, and Figure 15 shows the execution time of one read request for different file sizes. The execution times of the other algorithms are considerably higher than that of the proposed MGSO-based model, as the optimal number of replicas is selected within less time and the subsequent selection of processors is also optimal. For this reason, the MGSO algorithm performs better in the selection process. Similarly, the proposed model consumes less time in executing one read request with different file sizes; even with increased file size, the proposed model outperforms the other models with faster execution. The test system also hosts an auxiliary processor with programmable accelerators to perform various tasks more efficiently, attached to a number of peripherals enabling efficient performance. In this paper, the mean execution time is considered the main efficiency criterion for analysing data reusability and parallelization, and in this regard the alternative configurations are comparable. In order to determine the best implementation, several options were explored. The simplest is to perform the entire algorithm on the host processor; this approach was used as a benchmark to assess the improvements made by the different implementations. Even though the proposed approach incurs some negligible overhead due to replica operations, the improvement in execution time favours it. Thus, from the performance evaluation, it can be concluded that the proposed MGSO-based dynamic replication algorithm is the better algorithm.

6. Conclusion

Distributed systems are the highlight of most data-intensive computing applications; they predominate in most web services and geographic data processing systems. The issues of selecting the optimal replica number and the replica server locations still continue to pose challenges that prove to be roadblocks to the efficient performance of distributed systems' applications. This study focused on developing a dynamic data replication algorithm for real-time distributed databases using MGSO that resolves these open challenges to a considerable extent. The simulation results also illustrate the efficient performance of the proposed MGSO-based replication algorithm. In the future, the MGSO-based replication algorithm can be combined with a proper scheduling algorithm to improve the overall performance of the system. The model can also be extended with additional parameters such as fault tolerance and security.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.