Mathematical Problems in Engineering

Volume 2018, Article ID 3702808, 14 pages

https://doi.org/10.1155/2018/3702808

## On Detecting and Removing Superficial Redundancy in Vector Databases

^{1}Departamento de Matemáticas, Universidad de León, Campus de Vegazana, s/n, ES-24071 León, Spain

^{2}Research Institute on Applied Sciences in Cybersecurity, Universidad de León, Campus de Vegazana, s/n, ES-24071 León, Spain

Correspondence should be addressed to Noemí DeCastro-García; ncasg@unileon.es

Received 10 December 2017; Revised 31 March 2018; Accepted 12 April 2018; Published 24 May 2018

Academic Editor: Emilio Insfran Pelozo

Copyright © 2018 Noemí DeCastro-García et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A mathematical model is proposed to obtain an automated tool that removes unnecessary data from a vector database, computes its level of redundancy, and recovers the original and the filtered database at any point of the process. This type of database can be modeled as a directed graph, so the database is characterized by an adjacency matrix; a record is then no longer a row but a matrix. The problem of cleaning redundancies is addressed from a theoretical point of view: superficial redundancy is measured and filtered using the 1-norm of a matrix. Algorithms are implemented in Python and MapReduce, and a case study on a real cybersecurity database is performed.

#### 1. Introduction

Current systems of knowledge extraction are based on the creation of the best models to solve a specific problem with particular data. In addition, the computational algorithms used are implemented and applied through different data management and processing architectures, from the most rudimentary to the most advanced analytical platforms using Big Data (BD) in real time.

In many current cases, the creation of specific models capable of analyzing, categorizing, and predicting different situations, such as anticipating trends or reacting to certain events, requires Big Data analytics. These techniques give rise to different challenges such as data inconsistency, incompleteness, scalability, timeliness, or data security [1, 2]. But, before that, the data must be well constructed [3].

On the other hand, good quality data is required to obtain good quality knowledge; otherwise we fall into the well-known *garbage-in, garbage-out* scenario (see [4]). Hence, we should note that not every datum is useful; for instance, it is expected that only 35% of the data considered for analysis will be really useful by 2020 (see [5]).

In addition, motivated by the sharp increase in the number of incidents, sensors, and Internet of Things (IoT) devices, the rate of data acquisition grows exponentially; therefore, databases can reach a dangerous volume and data obesity can appear. Moreover, real-world databases are highly prone to being inconsistent, incomplete, and noisy. This fact is especially significant when several data sources need to be integrated: working in a multisource data acquisition system generates high overlapping, because new data are continuously included from different sources, which increases the probability of finding noise and dirty data. This situation is typical in data warehouses, federated database systems, critical infrastructures, etc. (see [3, 6]). Thus, an appropriate strategy to remove unnecessary (redundant) data in an automated process is needed.

Data cleaning deals with detecting and removing errors from data and with eliminating the noise produced by the data collection procedure itself [7]. Hence, in a first approach, we distinguish two types of redundancy, superficial and deep, which may appear at the instance or the variable (feature) level:

(1) *Superficial redundancy* refers to all variables that, from a natural point of view, we do not need to take into account in further analysis (empty, constant, and identical variables). Studying superficial redundancy allows us to filter the database without advanced statistical analysis or a previous transformation of the data into treatable variables. Moreover, this redundancy may be studied in any database.

(2) *Deep redundancy* collects all variables containing the same information encoded in different ways, as well as correlated variables, associated variables, or, in general, variables that are nondiscriminant for the fixed target. Note that in the first case of deep redundancy a simple frequency analysis could be enough to recognize the variables carrying the same information. However, detecting correlation between variables, or computing the relevant features for a specific target, requires more advanced statistical analysis.
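To make the notion of superficial redundancy concrete, the following minimal sketch flags empty, constant, and identical columns in a small column-oriented table. The table representation (a dict of column name to value list) and the helper name are illustrative, not the authors' implementation.

```python
# Sketch of superficial-redundancy detection on a column-oriented table.
# The container type and column names are hypothetical examples.

def superficial_redundancy(table):
    """Return the names of columns that are empty, constant, or duplicates."""
    redundant = set()
    seen = {}  # tuple of values -> first column carrying those values
    for name, values in table.items():
        non_empty = [v for v in values if v not in (None, "")]
        if not non_empty:                      # empty variable
            redundant.add(name)
            continue
        if len(set(values)) == 1:              # constant variable
            redundant.add(name)
            continue
        key = tuple(values)
        if key in seen:                        # identical to an earlier column
            redundant.add(name)
        else:
            seen[key] = name
    return redundant

table = {
    "ip":      ["1.2.3.4", "5.6.7.8", "1.2.3.4"],
    "ip_copy": ["1.2.3.4", "5.6.7.8", "1.2.3.4"],
    "status":  ["open", "open", "open"],
    "notes":   ["", None, ""],
}
print(sorted(superficial_redundancy(table)))  # ['ip_copy', 'notes', 'status']
```

Note that this filtering needs no statistical analysis or prior transformation of the data, which is exactly what distinguishes superficial from deep redundancy.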

Note that redundancy can be expected to appear in more than one type. Duplicated cases are a special type of redundancy when the database is built up from several data sources, and they can show up in both types described above. In fact, removing duplicate information is a very complex process in databases of cybersecurity reports, since identifying duplicates is a difficult task that requires expert knowledge (deep redundancy). For example, the same incident may be reported by different sources, at different times, and using different lexical language. Or the same case may be reported twice by the same source because of defects in the collection procedure, such as cases getting stuck during updates.
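Only the superficial part of this problem, byte-identical reports duplicated by a defective collection procedure, can be handled mechanically; the field names below are hypothetical, and matching semantically equal reports (same incident, different source or wording) would require the expert knowledge discussed above.

```python
# Sketch: drop exact duplicate reports, keeping the first occurrence.
# Report fields ("source", "incident", "time") are invented for illustration.

def drop_exact_duplicates(reports):
    """Return reports with byte-identical entries removed (order-preserving)."""
    seen, kept = set(), []
    for report in reports:
        key = tuple(sorted(report.items()))
        if key not in seen:
            seen.add(key)
            kept.append(report)
    return kept

reports = [
    {"source": "A", "incident": "botnet", "time": "10:00"},
    {"source": "A", "incident": "botnet", "time": "10:00"},  # collection defect
    {"source": "B", "incident": "botnet", "time": "10:05"},  # same incident, kept
]
print(len(drop_exact_duplicates(reports)))  # 2
```

The third report survives even though it describes the same incident: deciding that it is a duplicate is deep, not superficial, redundancy.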

Following [7], a data cleaning approach should satisfy several requirements:

(1) It should be able to remove all main errors and inconsistencies of data from individual and multiple sources.

(2) Manual inspection and programming effort should be limited.

(3) It should be flexible enough to integrate additional sources.

(4) Data cleaning should be integrated together with schema-related data transformations.

(5) Data transformations along the cleaning procedure should be specified in a declarative way and be reusable.

Several research works develop different approaches to data cleaning based on special data mining treatments, data transformations, or specific operators (see [8–14]). Some of them perform the cleaning on a separate cluster that later needs to be integrated into the data smart center. However, these works focus on studying duplicated cases, removing typographical errors, or detecting inconsistencies or discrepancies. Thus, one of the remaining challenges of data science is to design and propose efficient representations, access methods, and tools that let us preprocess and clean huge amounts of heterogeneous data before starting data analysis procedures [3]. Although quite a few commercial tools are available to support these tasks, the cleaning and transformation work still needs to be done manually or by low-level programs (see [7]).

Our goal in this paper is to give a mathematical model for detecting and removing superficial redundancies, at the instance and variable levels, in a single or multisource context, over a certain kind of databases (vector databases). Our proposal is a theoretically grounded model based on a directed graph. We then address the problem of cleaning redundancies using elementary algebraic tools: a matrix is attached to a given database, and redundancy-removing operations arise as standard transformations on that matrix. This lets us give a concrete expression of the level of redundancy using the 1-norm of a matrix, so we are able to report redundancy in order to clean reports. Note that we do not intend simply to delete all superficially redundant data but to know its level in order to guide further actions in the design of statistical analyses. We remark that redundancy is not bad in itself, because some measures, like the reputation of sources, might be computed using redundant reports.
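The redundancy measure above relies on the 1-norm of a matrix, that is, its maximum absolute column sum. How the adjacency matrix is built from the database is specific to the model developed in Section 2; the sketch below only illustrates the 1-norm itself on a small 0/1 matrix, with an example matrix invented for this purpose.

```python
# Sketch of the matrix 1-norm (maximum absolute column sum), the quantity
# used to express the level of redundancy. The matrix A is illustrative only.

def one_norm(matrix):
    """Matrix 1-norm: the maximum over columns of the sum of absolute entries."""
    cols = range(len(matrix[0]))
    return max(sum(abs(row[j]) for row in matrix) for j in cols)

A = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 0, 1],
]
print(one_norm(A))  # 3  (the third column sums to 3)
```

For a 0/1 adjacency matrix, a column sum counts how many rows point at the same node, which is why a norm of this kind is a natural proxy for repeated information.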

Moreover, in this work we present an open source tool that cleans the database in an automated way and computes its level of redundancy. It also allows recovering the original and the filtered database at any point of the process, as well as the level of redundancy and the associated graph. The aims and procedures could be fully applied to any reporting standard by means of formal language processing. The scripts mentioned above are available in a public GitHub repository (see [15, 16]).

In particular, this approach is applicable to cleaning up databases of cybersecurity reports (cyber databases). A cyber database contains a lot of unstructured information together with a high level of correlation, since reports are produced by human agents, that is, with expert knowledge. The structural variety of security report data is not unique (from machine-generated data to synthetic or artificial data). Moreover, the value of each feature can be structured, unstructured, or semistructured, and these typologies provide quantitative, (pseudo)qualitative, or string features. A security report is usually integrated, transformed, and combined by different data collection engines that provide only limited or no support for data cleaning, focusing instead on management transformations and schema integration. Since these engines receive information from different sources, in most cases we cannot modify the data acquisition process. Therefore, in order to extract knowledge from the data, the best chance of success is to optimize the different phases of data treatment and analysis, and the first step is cleaning the database in an automated way. We thus need to study redundancy levels in order to detect superfluous reporters or to optimize resources. But not every tool is useful: some tools may be ruled out by security constraints. In such cases it is not possible to use online or private-license software, because sharing the data is not allowed; the cleaning tool then needs to be integrated into an ecosystem with high levels of security.

In the final part of this paper, we also apply the developed tool to a real case of cleaning up a cyber database, obtaining 64% of superficial redundancy.

The paper is structured as follows. In Section 2 we present the model of a vector database from a matrix approach via graphs, together with our main results on computing the level of redundancy. Section 3 develops the experimental part: it includes the materials, the development of the tool we have created to clean up databases, a comparison with some existing tools, and the case study in which we apply our tool to compute the level of redundancy of a real fragment of a cyber database. Finally, our conclusions and references are given.

#### 2. A Graph Approach to the Redundancy of a Database

A graph database is a database that can be structured in graph form, so that the nodes of the graph contain the information and the edges contain properties and/or define relations between the information contained in the nodes. One of the main strengths of these kinds of databases is the capability to give answers in a short time to questions regarding relations (see [17]).

In this section, we will define a graph structure on a database that conceptually differs from the usual one described above; the motivation is the problem of detecting or cleaning redundancies in a database. In general, to decide whether two columns or variables of a database are redundant, in some sense, one looks at the information contained in these variables and then decides. Although this will eventually be our procedure, we will first cluster the set of variables according to the meaning they have and then apply the usual procedure. The point is that, once the clustering is done, the database and the clusters define a graph structure in a natural way, in which not all nodes contain information.

##### 2.1. A Graph Model for a Database

Observe that in the above discussion we started from a usual database and finished with the database plus a clustering of its variables. Before defining the graph structure, we will formalize this situation and use it as the starting point.

*Definition 1. *A vector database is a tuple of databases, each one of them coming with a label, which satisfies the following properties:

(1) All the databases have the same length; that is, all of them have the same number of rows.

(2) If a database has a unique column, the column name agrees with the database label.

(3) Two different databases must have different labels.

(4) The same column name is allowed to appear in different databases.

(5) The features can be of any type (strings, floats, integers, etc.).
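A minimal sketch of Definition 1 may help fix ideas. The container types below (a dict mapping each database label to a dict of column name to value list) and the example labels are invented for illustration; they are not the authors' data structures.

```python
# Illustrative check of the conditions of Definition 1 on a toy
# vector database modeled as {label: {column_name: list_of_values}}.

def is_vector_database(vdb):
    """Check conditions (1)-(3) of Definition 1 on the toy representation."""
    lengths = set()
    for label, columns in vdb.items():         # dict keys enforce condition (3)
        for values in columns.values():
            lengths.add(len(values))           # condition (1): equal row counts
        if len(columns) == 1:
            (col_name,) = columns
            if col_name != label:              # condition (2): one-column rule
                return False
    return len(lengths) <= 1

vdb = {
    "ip":     {"ip": ["1.2.3.4", "5.6.7.8"]},             # one-column database
    "threat": {"type": ["botnet", "spam"], "severity": [3, 1]},
}
print(is_vector_database(vdb))  # True
```

Condition (4) holds automatically in this representation, since column names only need to be unique inside each component database, and condition (5) is reflected by the mixed string and integer values.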

*Remark 2. *We fix some notation for the sake of clarity.

(1) For the databases consisting of a single column, we use the same symbol for the database label and for its unique column name, and we collect these labels in a dedicated set.

(2) A second set collects the labels of all the databases.

(3) For each single database we consider the set of its column names. From these sets we construct the collection of all the different column labels of all the databases, which we can reorder according to the ordering of the database labels.

(4) The i-th row, or report, of a vector database is the vector constructed from the i-th rows of each database.

With the notation described in Remark 2, a vector database has the form shown in (1), where each row can take one of two forms.

*Remark 3. *If we apply the ordering provided by the set of column labels, then we can understand a vector database as a unique table; see Table 1.
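The flattening described in Remark 3 can be sketched as follows, reusing the toy representation from above; the label-qualified column names and the sorting by label are illustrative choices, not the paper's construction.

```python
# Sketch of Remark 3: reorder the component databases by label and
# concatenate them into one flat table of qualified columns.
# The {label: {column: values}} representation is invented for illustration.

def flatten(vdb):
    """Merge a vector database into a single table with qualified column names."""
    table = {}
    for label in sorted(vdb):                  # fix an ordering of the labels
        for col, values in vdb[label].items():
            table[f"{label}.{col}"] = values   # qualification avoids name clashes
    return table

vdb = {
    "threat": {"type": ["botnet", "spam"]},
    "ip":     {"ip": ["1.2.3.4", "5.6.7.8"]},
}
print(list(flatten(vdb)))  # ['ip.ip', 'threat.type']
```

Qualifying each column name by its database label is one simple way to keep the table well defined even when, as Definition 1 allows, the same column name appears in different databases.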

So, any database in the form of (1) or Table 1 satisfies the conditions of Definition 1.