Review Article

Research on Multifeature Data Routing Strategy in Deduplication

Algorithm 2

The DRMF routing algorithm.
Input: the disk storage utilization of each node. DRMF routing avoids sending file data to storage nodes with high load. Before routing, the master node sends the disk storage utilization of each node to the routing server, and the routing server calculates the average disk storage utilization μ over the nodes.
Output: Indexid.
(1) Candidate nodes are determined. A node is used as a candidate target node when its disk storage utilization does not exceed the average μ by more than a relative threshold σ. σ is usually set to 0.05; that is, a node whose disk storage utilization does not reach 1.05μ is used as a candidate node.
(2) Data routing in the initial state of the cluster. In the initial state, no storage node in the deduplication cluster stores any data. In this case, MCS is used for routing decisions.
(3) Multifeature-based data routing. The routing server communicates with the candidate nodes to obtain, for each one, the maximum similarity between the files stored on the node and the file to be routed, the disk storage utilization of the node, and the address of the most similar “box” on the node. The similarity and the node load feature are then fused by the benefit function, and the routing benefit value of each candidate node is calculated. The node with the maximum benefit value is selected as the final target node.
(4) File storage. The file is routed to the target node. If the node already holds a “box” whose stored file is the most similar to the routed file, the routed file is stored in that “box.” Otherwise, a new “box” is created, the routed file is saved in it and used as its representative file, and an index entry is added to the main index.
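
Steps (1) and (3) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of DRMF: the `NodeReport` fields, the function names, and the weight `alpha` are hypothetical. The benefit function is not defined in this section, so a simple weighted sum of similarity and free capacity stands in for it. The MCS fallback of step (2) and the “box” handling of step (4) are not covered.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class NodeReport:
    """Values a storage node returns to the routing server (hypothetical shape)."""
    node_id: int
    utilization: float      # disk storage utilization, in [0, 1]
    similarity: float       # maximum similarity to the file being routed, in [0, 1]
    box_id: Optional[int]   # address of the most similar "box", if any


def candidate_nodes(reports: List[NodeReport], sigma: float = 0.05) -> List[NodeReport]:
    """Step (1): keep nodes whose utilization is at most (1 + sigma) * mu."""
    mu = mean([r.utilization for r in reports])
    # Assumption: "<=" is used here; the text leaves the boundary case open.
    return [r for r in reports if r.utilization <= (1 + sigma) * mu]


def benefit(report: NodeReport, alpha: float = 0.5) -> float:
    """Placeholder benefit function (assumption, not the one used by DRMF).

    Rewards high similarity and low disk utilization with a weighted sum.
    """
    return alpha * report.similarity + (1 - alpha) * (1 - report.utilization)


def choose_target(reports: List[NodeReport], sigma: float = 0.05,
                  alpha: float = 0.5) -> NodeReport:
    """Step (3): pick the candidate node with the maximum benefit value."""
    candidates = candidate_nodes(reports, sigma)
    return max(candidates, key=lambda r: benefit(r, alpha))
```

With three nodes at utilizations 0.50, 0.40, and 0.60, the average μ is 0.50 and the cutoff is 0.525, so the third node is excluded from the candidates even if it holds the most similar file. This is the load-balancing effect that the threshold σ is meant to provide.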