Abstract

This paper examines the term Big Data with respect to data representation and visualization. Big Data visualization poses several specific problems; we define these problems and describe a set of approaches to avoid them. We also review existing data visualization methods as applied to Big Data, taking the described problems into account. Summarizing the results, we provide a classification of visualization methods with respect to their applicability to Big Data.

1. Introduction

Customers increasingly need to process secondary data that is not directly connected to their core business, which has led to the phenomenon called Big Data. Below we provide a definition of the term.

Big Data, as defined by Gubarev Vasiliy Vasil’evich, is a phenomenon without clear borders that can manifest itself as unlimited or even infinite data accumulation. Moreover, the accumulated data can be presented in various formats, most of them unstructured data flows.

Usually, the term Big Data is understood to mean a large data set whose volume grows exponentially. Such a data set can be too large, too “raw”, or too unstructured for the classical processing methods used in relational database theory. Still, the main concern here is not the data volume but the field in which that data is applied [1].

Analytical literature sources commonly attribute the following properties to Big Data: large volume of data (Volume), multiformat data presentation (Variety), and high data processing speed (Velocity). It is generally held that if a data set satisfies at least two of these three properties, it can be assigned to the Big Data class [2, 3]. Therefore, the following Big Data classes can be distinguished: the “Volume-Velocity” class, the “Volume-Variety” class, the “Velocity-Variety” class, and the “Volume-Velocity-Variety” class.

Processing Big Data is not a trivial task at all and requires special methods and approaches. Graphical thinking is a simple and natural mode of information processing for a human being, so visual data representation is an effective method that eases data understanding and provides sufficient support for decision making. In the case of Big Data, however, most classical representation methods become less effective or even inapplicable to concrete tasks. Analyzing the applicability of methods to the concrete Big Data classes is a topical problem of the subject area, as no such case studies have been carried out before. The purpose of this paper is therefore to classify existing visualization methods by the criterion of their applicability to one of the described Big Data classes.

To assign a method to one of the described Big Data classes, it needs to be analyzed from the following points of view: applicability to large data volumes, ability to visualize data presented in different formats, and speed and performance of data presentation.

2. Big Data Visualization Problems

Given the described Big Data properties, we can identify the following problems that make visualization a nontrivial task.

2.1. Visual Noise

A straightforward rendering of the whole data array under study can turn into a total mess on the screen: we see only one big spot consisting of the points that represent each data row. The problem arises because most objects in the data set lie too close to one another, so the observer cannot distinguish them as separate objects on the screen. As a result, the analyst sometimes cannot extract even a bit of useful information from a whole-data visualization without preprocessing. It must be emphasized that noise in this context does not mean data damage or distortion; it should be understood as a phenomenon of data visibility loss.

2.2. Large Image Perception

A natural solution to the above problem is to distribute the data over a larger screen. Occasionally, however, this leads to another problem: large image perception. Human perception of any data visualization has a certain limit. Although this limit is much higher for graphical representations than for tabular ones, it still exists, and once it is reached the observer loses the ability to extract useful information from the overloaded view. All visualization methods are also limited by the resolution of the output device, so there is a cap on the number of points that can be shown per visualization. Of course, we can replace the device with a more modern one, or with a group of devices for partial data visualization, allowing a more detailed image with more data points; but even if we could repeat this process an infinite number of times, we would still hit the human perception limit. As the volume of data shown at once grows, a human being finds it increasingly difficult to understand and analyze it.

Therefore, data visualization methods are limited not only by the aspect ratio and resolution of the device but also by physical perception limits.

2.3. Information Loss

On the other hand, approaches that reduce the visible data set can be used. Although they solve the problems above, they lead to another problem: information loss. These approaches rely on data aggregation and filtration based on the relatedness of objects in the data set by one or more criteria. They can mislead the analyst, who may fail to notice interesting hidden objects; moreover, a complex aggregation process can consume a large amount of time and computing resources before accurate and complete information is obtained.

2.4. High Performance Requirements

Graphical analysis is not limited to static images, so the problems above become even more significant in dynamic visualization. There is also a further problem, hardly noticeable in static visualization because of its lower speed requirements: the high performance requirement.

In behavior analysis tasks, the analyst usually wants to access the whole data array, and this can consume a lot of time even when a frequent refresh rate is not required. It ends up either in a continuous increase of computing resources or in filtering out more and more data. Usually, neither approach can be widely applied in practice: the first because of its high organizational, support, and economic cost, the second because of the loss of useful information during the monitoring process. The second approach is also difficult to customize and adapt, because the analysis system or the analyst usually does not know the nature of the incoming data, so filtering can consist only of simple steps, such as excluding every second row or removing some factors from the data.

2.5. High Rate of Image Change

The last problem is the high rate of image change. It becomes most significant in monitoring tasks, when the observer simply cannot react to the number or intensity of data changes on the display. Simply decreasing the rate of change does not give the desired result either, since the human reaction speed depends directly on it.

To summarize this part of the article: Big Data visualization leads to a decrease in analysis quality, which underlines the topicality of this paper.

3. Big Data Visualization Approaches

There are many different graphical visualization methods, but multidimensional data visualization is still a little-known and topical subject of research.

Graphical visualization is already used in many aspects of human activity, but the effectiveness and even applicability of its methods become a real problem as data volumes and data production speed grow. The problem stems from the following points:
(1) the need to artificially prepare data slices for partial data visualization;
(2) the visual limit on the number of data factors that can be perceived.

We need to review existing data visualization methods and propose approaches that can solve these problems. Such approaches must provide more perceptible and informative data representations to help the analyst find hidden relations in Big Data.

Most data visualization methods do not appear from nothing; they are developments of earlier existing methods.

For the most part, the analyst's tools must meet the following requirements:
(1) the analyst should be able to use more than one data representation view at once;
(2) active interaction between the user and the analyzed view;
(3) dynamic change of the number of factors during work with a view.

Below we describe these requirements in more detail.

3.1. More Than One View per Representation Display

To reach a full understanding of the data, the analyst usually takes a simple approach: he places side by side several classical data views, each including only a limited set of factors, so that he can easily find relations between these views or within one concrete view [4, 5].

Although literally any data visualization method can be used here, we often see an approach in which the analyst uses similar or nearly similar graphical objects, for example, line or dot diagrams (Figure 1). Of course, the analyst might be interested in comparing totally different visualizations of the same data, but the whole process of visual analysis then becomes much harder: the researcher must compare dissimilar graphical objects, clearly distinguish different data, and make a decision based on different factors [6].

As a consequence, such an approach can guide the analyst to the desired location and provide enough support to make a decision at the very first stage of research. There are even cases when this stage becomes the final one in the current research, steering the analyst away from completely incorrect decisions.

Another key point of this approach is the ability to select the desired data areas across all related representations, as shown in Figure 2.

Analysts may wish to coordinate views in a variety of ways: selecting items in one view might highlight matching records in other views, or instead provide filtering criteria to remove information from the other displays. Linked navigation provides an additional form of coordination: scrolling or zooming one view can simultaneously manipulate other views [5, 7].
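As an illustration, the following minimal sketch of our own (not taken from the cited works) links two scatter views of the same records using matplotlib: brushing a rectangle in the left view highlights the matching records in both views. The data and all names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import RectangleSelector

# Two views of the same records, sharing one index (illustrative data).
rng = np.random.default_rng(0)
x1, y1 = rng.normal(size=500), rng.normal(size=500)
x2, y2 = rng.normal(size=500), rng.normal(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sc1 = ax1.scatter(x1, y1, s=10, color="grey")
sc2 = ax2.scatter(x2, y2, s=10, color="grey")

def on_select(eclick, erelease):
    # Records brushed in the left view are highlighted in the right
    # view as well, because both views share one record index.
    x_lo, x_hi = sorted((eclick.xdata, erelease.xdata))
    y_lo, y_hi = sorted((eclick.ydata, erelease.ydata))
    mask = (x1 >= x_lo) & (x1 <= x_hi) & (y1 >= y_lo) & (y1 <= y_hi)
    colors = np.where(mask, "crimson", "grey")
    sc1.set_color(colors)
    sc2.set_color(colors)
    fig.canvas.draw_idle()

selector = RectangleSelector(ax1, on_select, useblit=True)
plt.show()
```

The same callback could instead filter the second view (removing unmatched records) rather than highlight it, which covers both coordination styles mentioned above.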

3.2. Dynamical Changes in Number of Factors

Perhaps the most fundamental operation in visual analysis is the specification of the visualization itself: the analyst has to indicate which data should be shown and how it should be shown to ease information perception.

Any graphical visualization can be applied to absolutely any data, but there is always the topical question of whether the chosen method is applied to the data set correctly, that is, whether it yields useful information. Typically, with Big Data, the analyst cannot observe the whole data set, find anomalies in it, or spot relations at first glance [6]. Hence another topical approach: the dynamic change of the number of factors. Having chosen one factor, the analyst expects to see a classical histogram showing the distribution of the number of records by record type. In the example in Figure 3 (top histogram), we can see the dependency between the number of cash collector units currently in use by a payment system and the volume of each cash collector.

When the analyst chooses another factor, for example, support expense, the diagram type also changes, into a point diagram. The bottom part of Figure 3 shows the distribution of support expenses for each cash collector unit.

Continuing in this manner, we can vary the number of factors consecutively, lowering or increasing the number of visible factors, and the diagram changes accordingly. This process is iterative and can be repeated until the desired pattern has been found.
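A minimal sketch of this behavior, under our own assumptions (matplotlib, a hypothetical render helper, and invented cash-collector values loosely mirroring the Figure 3 example): the chart type switches from a histogram to a point diagram as the analyst adds a second factor.

```python
import matplotlib.pyplot as plt

def render(ax, data, factors):
    """Redraw the view whenever the analyst's factor selection changes:
    one factor gives a distribution histogram, two give a point diagram.
    `data` maps a factor name to a list of values (illustrative)."""
    ax.clear()
    if len(factors) == 1:
        ax.hist(data[factors[0]], bins=20)
        ax.set_xlabel(factors[0])
        ax.set_ylabel("number of records")
    elif len(factors) == 2:
        ax.scatter(data[factors[0]], data[factors[1]], s=10)
        ax.set_xlabel(factors[0])
        ax.set_ylabel(factors[1])
    ax.figure.canvas.draw_idle()

# Hypothetical cash-collector records (invented values).
data = {"volume": [120, 80, 80, 200, 120, 80],
        "support_expense": [10, 7, 8, 21, 12, 6]}
fig, ax = plt.subplots()
render(ax, data, ["volume"])                     # one factor: histogram
render(ax, data, ["volume", "support_expense"])  # two factors: point diagram
plt.show()
```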

3.3. Filtering

The issue of value discernibility has always been topical for visual analysis, and it becomes more important in the case of Big Data [6]. Even with only 60 unique values on one diagram, not to mention millions of them, it is very difficult to place a label for each one.

Moreover, a single data set can contain totally different value ranges, so some values are simply dominated by others with higher amplitude, and the perception of the whole diagram suffers. For example, organizations that work 24 hours per day can have a varying customer flow; when we show the dependency of customers by hour, we lose the ability to perceive the group of night hours, whose values are nearly equal and have a much lower amplitude than the day hours.

The analyst usually wants to see both the whole data representation and a partial, more detailed representation of the data lying in his area of interest. Moreover, this area of interest is not static and can change dynamically during the research process.

A filtering system combined with an overview map is used as an approach that solves these problems (Figure 4). The analyst can change the range on the overview map and see a detailed visualization of the data within that range.

Moreover, the detailed view does not have to be limited to one level, as shown in Figure 4; the level of detail can grow on each iteration.
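A minimal overview-plus-detail sketch, assuming matplotlib and an illustrative synthetic series (not the actual Figure 4 data): dragging a range on the overview map re-limits the detailed view to that region of interest.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import SpanSelector

rng = np.random.default_rng(1)
t = np.arange(5000)
y = np.cumsum(rng.normal(size=t.size))  # illustrative time series

fig, (overview, detail) = plt.subplots(2, 1, figsize=(8, 5))
overview.plot(t, y, lw=0.5)
detail.plot(t, y, lw=0.8)

def on_range(lo, hi):
    # The analyst drags a range on the overview map; the detailed
    # view is re-limited to that area of interest.
    detail.set_xlim(lo, hi)
    seg = y[(t >= lo) & (t <= hi)]
    if seg.size:
        detail.set_ylim(seg.min(), seg.max())
    fig.canvas.draw_idle()

span = SpanSelector(overview, on_range, "horizontal", useblit=True)
plt.show()
```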

For example, the analyst can select one of the values within his area of interest and get its distribution over a city map, or the highlighting of similar objects on the overview map (Figure 5).

Filtering data by different criteria is also a topical point in visual data analysis. A human being cannot properly perceive a large number of visible objects at once, so limiting the object quantity is a natural requirement. The main concepts used for data filtering are given in [8].

3.3.1. Dynamic Query Filters [8]

Certain behavior patterns recur in the analytic research process. The most popular of them can be grouped and bound to simple user interface components, giving the analyst direct access to them and easing the routine actions he needs to perform. The analyst then only has to press one of the interface elements to achieve the desired result, and that result may well be enough to make a decision or to narrow the area of search.
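A minimal sketch of a dynamic query filter, assuming matplotlib widgets and invented data: a single slider is bound to one routine action, hiding records below a threshold without re-querying the data set.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

rng = np.random.default_rng(2)
value = rng.uniform(0, 100, 300)  # illustrative filter attribute
x, y = rng.normal(size=300), rng.normal(size=300)

fig, ax = plt.subplots()
fig.subplots_adjust(bottom=0.2)
sc = ax.scatter(x, y, s=12)

slider_ax = fig.add_axes([0.15, 0.05, 0.7, 0.04])
slider = Slider(slider_ax, "min value", 0, 100, valinit=0)

def on_change(threshold):
    # One routine action bound to one control: records below the
    # threshold are hidden without rebuilding the whole plot.
    visible = value >= threshold
    sc.set_offsets(np.column_stack([x[visible], y[visible]]))
    fig.canvas.draw_idle()

slider.on_changed(on_change)
plt.show()
```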

3.3.2. Starfield Display [9]

This approach is based on the idea that the whole data set is always visible. At the first level some of the data is aggregated and the analyst sees only grouped information, but as he makes detailed requests, represented by zooming actions, each group collapses into more and more detailed data.
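One way to sketch this zoom-driven aggregation (our own interpretation, assuming matplotlib and synthetic points): the whole data set stays loaded, and every zoom re-bins the visible range so that coarse groups collapse into finer ones.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
points = np.sort(rng.uniform(0, 1000, 50_000))  # the whole data set

fig, ax = plt.subplots()
(line,) = ax.step([], [], where="mid")
ax.set_xlim(0, 1000)

def on_zoom(axes):
    # Re-aggregate only the visible range: zooming in collapses each
    # coarse group into finer, more detailed groups.
    lo, hi = axes.get_xlim()
    counts, edges = np.histogram(points, bins=40, range=(lo, hi))
    centers = (edges[:-1] + edges[1:]) / 2
    line.set_data(centers, counts)
    axes.set_ylim(0, max(counts.max(), 1) * 1.1)

ax.callbacks.connect("xlim_changed", on_zoom)
on_zoom(ax)  # initial, fully aggregated view
plt.show()
```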

3.3.3. Tight Coupling [9]

Some user interface elements can be directly linked to each other so that their coupling prevents the analyst from making data input mistakes or restricts him from moving the research in an obviously wrong direction.

The basic example of such coupling is a group of radio buttons: after one radio button in the group is pressed, the others lose the user's selection. Filters based on selection inversion are usually built with this approach.
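A minimal sketch, assuming matplotlib's RadioButtons widget and invented categories: the radio group is mutually exclusive by construction, so the analyst can never combine contradictory category filters.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import RadioButtons

rng = rng = np.random.default_rng(5)
category = rng.choice(["red", "green", "blue"], size=200)
x, y = rng.normal(size=200), rng.normal(size=200)

fig, ax = plt.subplots()
fig.subplots_adjust(left=0.3)
sc = ax.scatter(x, y, s=12)

radio_ax = fig.add_axes([0.02, 0.4, 0.2, 0.2])
radio = RadioButtons(radio_ax, ("all", "red", "green", "blue"))

def on_select(label):
    # Pressing one button deselects the others, so exactly one
    # category filter is ever active: tight coupling by design.
    mask = np.ones(x.size, bool) if label == "all" else category == label
    sc.set_offsets(np.column_stack([x[mask], y[mask]]))
    fig.canvas.draw_idle()

radio.on_clicked(on_select)
plt.show()
```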

4. Big Data Visualization Methods

This section describes Big Data visualization methods. Each description contains the arguments for classifying the method into one of the Big Data classes. We assume the following data criteria:
(i) large data volume;
(ii) data variety;
(iii) data dynamics.

4.1. TreeMap

This method is based on a space-filling visualization of hierarchical data. As follows from the definition, it places a strict requirement on the data: the objects have to be hierarchically linked. A treemap is represented by a root rectangle divided into groups, which are in turn represented by smaller rectangles corresponding to the data objects of the set [10].

Examples of applying this method are visualizing the free space on a hard drive or the profitability of different organizations and their affiliates.
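For illustration, here is a minimal treemap sketch of our own, using the basic slice-and-dice layout rather than the squarified variant, on a hypothetical hierarchy: each rectangle's area is proportional to its weight, and the split direction alternates per hierarchy level.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def slice_and_dice(items, x, y, w, h, ax, depth=0):
    """Minimal slice-and-dice treemap: each node's rectangle area is
    proportional to its weight; children subdivide the parent strip,
    alternating split direction per hierarchy level."""
    total = sum(weight for _, weight, _ in items)
    for label, weight, children in items:
        frac = weight / total
        dw, dh = (w * frac, h) if depth % 2 == 0 else (w, h * frac)
        ax.add_patch(patches.Rectangle((x, y), dw, dh, fill=False))
        if children:
            slice_and_dice(children, x, y, dw, dh, ax, depth + 1)
        else:
            ax.text(x + dw / 2, y + dh / 2, label, ha="center", va="center")
        if depth % 2 == 0:
            x += dw
        else:
            y += dh

# Illustrative hierarchy: (label, weight, children).
data = [("A", 6, [("A1", 4, []), ("A2", 2, [])]),
        ("B", 3, []),
        ("C", 1, [])]
fig, ax = plt.subplots()
slice_and_dice(data, 0, 0, 1, 1, ax)
ax.set_axis_off()
plt.show()
```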

The method can be applied to large data volumes by iteratively representing the data layer of each hierarchy level. If the device resolution is exceeded, the analyst can always move into the next block and continue the research with more detailed data on a lower hierarchy level. So the large data volume criterion is satisfied.

Because the method is based on estimating shape areas calculated from one or more data factors, every change in the data is followed by a total repaint of the whole image for the currently visible hierarchy level. Changes on higher levels do not require repainting, because the data they contain is not visible to the analyst.

A visualization produced by this method can show only two data factors: the first is used to calculate the shape area, and the second is a color used for grouping the shapes. Also, the factors used for area estimation must be presented by computable data types, so the data variety criterion is not met.

The last criterion cannot be satisfied either, because a treemap shows the data at only one moment in time.

Method advantages:
(i) hierarchical grouping clearly shows data relations;
(ii) extreme outliers are immediately visible thanks to a special color.

Method disadvantages:
(i) the data must be hierarchical; moreover, treemaps are better for analyzing data sets with at least one important quantitative dimension with wide variation;
(ii) not suitable for examining historical trends and time patterns;
(iii) the factor used for size calculation cannot have negative values [11].

4.2. Circle Packing

This method is a direct alternative to the treemap, except that its primitive shape is the circle, and circles can be nested into circles of a higher hierarchy level. The main benefit of this method is that we can potentially place and perceive a larger number of objects than with the classical treemap [12].

Because the circle packing method is based on the treemap method, it has the same properties, so we can assume that only the large data volume criterion is met by this method.

Still, there are some differences in the methods' merits and demerits.
Method advantages: a space-efficient visualization method compared to the treemap.
Method disadvantages: the same disadvantages as for the treemap method.

4.3. Sunburst

This method is also an alternative to the treemap: it uses the treemap visualization converted to a polar coordinate system. The main difference between the methods is that the variable parameters are not width and height but radius and arc length. This difference allows us not to repaint the whole diagram when the data changes, but only the sector containing the new data, by changing its radius. Thanks to this property, the method can be adapted to show data dynamics using animation.

Animation can add dynamics to the data by manipulating only the radii of the sunburst rays, so the data dynamics criterion is met.
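A minimal sunburst sketch, assuming matplotlib's polar projection and a hypothetical hierarchy: each node becomes an annular sector whose angular width encodes its weight and whose radial position encodes its hierarchy depth.

```python
import numpy as np
import matplotlib.pyplot as plt

def sunburst(ax, items, theta0=0.0, span=2 * np.pi, level=0):
    """Minimal sunburst: a treemap transferred to polar coordinates.
    Each node of (label, weight, children) is an annular sector;
    angular width encodes weight, radius encodes depth."""
    total = sum(w for _, w, _ in items)
    theta = theta0
    for label, weight, children in items:
        width = span * weight / total
        ax.bar(theta + width / 2, 1, width=width, bottom=level,
               edgecolor="white")
        ax.text(theta + width / 2, level + 0.5, label,
                ha="center", va="center")
        if children:
            sunburst(ax, children, theta, width, level + 1)
        theta += width

# Illustrative hierarchy.
data = [("A", 6, [("A1", 4, []), ("A2", 2, [])]), ("B", 4, [])]
fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
sunburst(ax, data)
ax.set_axis_off()
plt.show()
```

To animate a data change, only the affected bar's radius (height) would need updating, which is exactly the property discussed above.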

Like the two previous methods, the sunburst has largely the same advantages and disadvantages.
Method advantages: easily perceptible by most humans [13].
Method disadvantages: the same disadvantages as for the treemap method.

4.4. Circular Network Diagram

Data objects are placed around a circle and linked by curves based on the rate of their relatedness. Line width or color saturation is usually used as a measure of object relatedness. The method also usually provides interaction that makes unnecessary links invisible and highlights the selected one. Thus, this method underlines direct relations between multiple objects and shows how strong they are [14].

Typical use cases of this method include, for example, a diagram of product transfers between cities, relations between products bought in different shops, and so forth.

This method represents aggregated data as a set of arcs between the analyzed data objects, so that the analyst gets quantitative information about the relations between objects. It can be applied to large data volumes by placing the data objects along the circle and varying the arc area of each object. Additional information derived from other factors of the data objects can be shown near an arc. It should also be noted that there is no limitation to one factor per diagram: we can always plot different factors of the objects and draw relations between them. Such a picture can be difficult to perceive and understand, but in some cases this approach gives the analyst enough information to change the direction of his research or to make a final decision. This property of the circular diagram satisfies the data variety criterion.

The circular form encourages eye movement to proceed along curved lines, rather than in a zigzag fashion in a square or rectangular figure [15].

However, since the whole data set is represented at once, every single change in the data must be followed by a repaint of the diagram.
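A minimal sketch of such a diagram, with an invented relation matrix and plain matplotlib Bezier curves rather than a dedicated chord-diagram library: line width encodes how strongly two objects are related.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch

# Illustrative relation matrix: weights[i][j] is the strength of the
# link between objects i and j (e.g. product transfers between cities).
labels = ["A", "B", "C", "D", "E"]
weights = np.array([[0, 5, 1, 0, 2],
                    [5, 0, 3, 1, 0],
                    [1, 3, 0, 4, 0],
                    [0, 1, 4, 0, 2],
                    [2, 0, 0, 2, 0]])

n = len(labels)
angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
pos = np.column_stack([np.cos(angles), np.sin(angles)])

fig, ax = plt.subplots()
for i in range(n):
    ax.text(*pos[i] * 1.1, labels[i], ha="center", va="center")
    for j in range(i + 1, n):
        if weights[i, j] == 0:
            continue
        # Quadratic Bezier curve pulled toward the circle centre;
        # line width encodes how strongly the objects are related.
        path = Path([pos[i], (0, 0), pos[j]],
                    [Path.MOVETO, Path.CURVE3, Path.CURVE3])
        ax.add_patch(PathPatch(path, fill=False,
                               lw=weights[i, j], alpha=0.5))
ax.set_xlim(-1.3, 1.3); ax.set_ylim(-1.3, 1.3)
ax.set_aspect("equal"); ax.set_axis_off()
plt.show()
```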

Method advantages:
(i) produces a relational data representation that is easily perceived;
(ii) within the circle, the resolution varies linearly, increasing with radial position; this makes the center of the circle ideal for compactly displaying summary statistics or indicating points of interest.

Method disadvantages:
(i) the method may end in an imperceptible representation and may require regrouping the data objects on the screen;
(ii) objects with the smallest parameter weight can be suppressed by larger ones, turning the diagram into a total mess [16].

4.5. Parallel Coordinates

This method extends visual analysis to multiple data factors for different objects. All the data factors to be analyzed are placed along one axis, and the corresponding values of each data object, on a relative scale, are placed along the other. Each data object is represented by a series of linked transverse lines showing its place in the context of the other objects. Since a single thin polyline on the screen suffices to represent an individual data object, the method meets the first criterion, large data volumes [17].

One extension of standard 2D parallel coordinates is the multirelational 3D parallel coordinates. Here, the axes are placed, equally separated, on a circle with a focus axis in the centre. A data item is again displayed as a series of line segments intersecting all axes. This axis configuration has the advantage that all pairwise relationships between the focus variable in the centre and all outer variables can be investigated simultaneously [18].

This method can handle several factors for a large number of objects on a single screen, so it satisfies the data variety criterion. Because the method is based on relative values, it requires calculating the minimum and maximum of each factor. While values change within the minimum and maximum of a factor, there is no need to repaint the whole image; but when a value exceeds these limits, the image has to be repainted to remain adequate. This approach can be used for visualizing dynamic data. A second way to represent data in time is to use the three-dimensional extension of parallel coordinates described above [18].
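A minimal sketch using the parallel_coordinates helper that ships with pandas, on invented records: each factor is first normalized to its own minimum-maximum range, which is the per-factor scaling discussed above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Illustrative multi-factor records; each row is one data object and
# each axis is one factor, drawn on a shared relative scale.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["factor1", "factor2", "factor3", "factor4"])
df["group"] = rng.choice(["a", "b"], size=len(df))

# Normalize each factor to [0, 1] so axes with different value ranges
# do not dominate each other (the per-factor min/max mentioned above).
cols = df.columns[:-1]
df[cols] = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())

parallel_coordinates(df, "group", alpha=0.4)
plt.show()
```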

Method advantages:
(i) the ordering of factors does not influence the overall perception of the diagram;
(ii) the method allows us to analyze both the whole set of objects at once and individual data objects.

Method disadvantages:
(i) the method is limited in the number of factors that can be shown at once;
(ii) visualizing dynamic data may end up changing the whole data representation [18, 19].

4.6. Streamgraph

A streamgraph is a type of stacked area graph displaced around a central axis, which results in a flowing, organic shape. The method shows the trends of different sets of events: the quantity of their occurrences, their relative rates, and so on. Thus, a set of similar events can be shown along the timeline of the image [20].

The method has twin goals: to show many individual time series while also conveying their sum. Since the heights of the individual layers add up to the height of the overall graph, it is possible to satisfy both goals at once. At the same time, this involves certain trade-offs. There can be no spaces between the layers, since this would distort their sum. As a consequence, changes in a middle layer necessarily cause wiggles in all the surrounding layers, wiggles which have nothing to do with the underlying data of those affected time series [20].

This method works with only one data dimension, so it does not support the data variety criterion, but it can still be applied to large data sets.

After new data arrives in the analytical system, a diagram made by this method can be dynamically continued with the new values, so the data dynamics criterion is met. Still, there is one strict limitation, the number of factors: the method can be used only to represent quantity factors.
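A minimal streamgraph sketch, assuming matplotlib (whose stackplot supports a "wiggle" baseline that displaces the stack around a central axis) and smoothed random event counts as illustrative series.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative event-count series, e.g. occurrences of several genres
# over time; the layer heights always add up to the overall total.
rng = np.random.default_rng(4)
t = np.arange(120)
layers = [np.convolve(rng.uniform(0, 1, t.size), np.ones(10) / 10,
                      mode="same") for _ in range(5)]

fig, ax = plt.subplots(figsize=(9, 4))
# baseline="wiggle" displaces the stack around a central axis,
# which gives the streamgraph its flowing, organic shape.
ax.stackplot(t, layers, baseline="wiggle", alpha=0.8)
ax.set_xlabel("time")
plt.show()
```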

Examples: musical trends and cinema genre trends.
Method advantages: effective for trend visualization.

Method disadvantages:
(i) the data representation shows only one data factor;
(ii) the method depends on the sorting of the data layers (objects) [20, 21].

5. Results

As a result, we have compiled Table 1, which shows whether each method can process varied data and large data volumes and whether it handles data changing in time.

Analyzing Table 1, we can say that a method based on the treemap cannot be assigned to any of the Big Data classes if it meets only one criterion, since a minimum of two criteria must be satisfied.

According to Table 2, we can now clearly classify the visualization methods by Big Data classes.

6. Conclusion

In this paper, we have described the main problems of Big Data visualization and the approaches that help to avoid them. We have also provided a classification of Big Data visualization methods based on their applicability to one of the three Big Data classes.

Future work in this field can address the following areas: research on the applicability of visualization methods at different scales; decisions and recommendations on selecting visualization methods for concrete Big Data classes; and formalization of the requirements and restrictions for visualization methods applied to one or more Big Data classes.