Modern physics is based on both theoretical analysis and experimental validation. Complex scenarios such as subatomic dimensions, high energy, and low absolute temperature are frontiers for many theoretical models. Simulation with stable numerical methods is an excellent instrument for high-accuracy analysis, experimental validation, and visualization. High performance computing makes it possible to run simulations at large scale and in parallel, but the volume of data generated by these experiments creates a new challenge for Big Data Science. This paper presents existing computational methods for high energy physics (HEP), analyzed from two perspectives: numerical methods and high performance computing. The computational methods presented are Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. All of these methods produce data-intensive applications, which introduce new challenges and requirements for ICT systems architecture, programming paradigms, and storage capabilities.

1. Introduction

High Energy Physics (HEP) experiments are probably the main consumers of High Performance Computing (HPC) in the area of e-Science, considering numerical methods in real experiments and assisted analysis using complex simulation. From the discovery of quarks in the last century to that of the Higgs boson in 2012 [1], all HEP experiments have been modeled using numerical algorithms: numerical integration, interpolation, random number generation, eigenvalue computation, and so forth. Data collection from HEP experiments generates a huge volume of data, with high velocity, variety, and variability, and exceeds the usual thresholds for being considered Big Data. The numerical experiments using HPC for HEP represent a new challenge for Big Data Science.

Theoretical research in HEP concerns matter (fundamental particles and the Standard Model) and basic knowledge about the formation of the Universe. Beyond this, the practical research in HEP has led to the development of new analysis tools (synchrotron radiation, medical imaging or hybrid models [2], wavelets-computational aspects [3]), new processes (cancer therapy [4], food preservation, or nuclear waste treatment), or even the birth of a new industry (Internet) [5].

This paper analyzes two aspects: the computational methods used in HEP (Monte Carlo methods and simulations, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation, and Random Matrix Theory) and the challenges and requirements for ICT systems to deal with processing of Big Data generated by HEP experiments and simulations.

The motivation of using numerical methods in HEP simulations is based on special problems which can be formulated using integral or differential-integral equations (or systems of such equations), like quantum chromodynamics evolution of parton distributions inside a proton which can be described by the Gribov-Lipatov-Altarelli-Parisi (GLAP) equations [6], estimation of cross section for a typical HEP interaction (numerical integration problem), and data representation using histograms (numerical interpolation problem). Numerical methods used for solving differential equations or integrals are based on classical quadratures and Monte Carlo (MC) techniques. These allow generating events in terms of particle flavors and four-momenta, which is particularly useful for experimental applications. For example, MC techniques for solving the GLAP equations are based on simulated Markov chains (random walks), which have the advantage of filtering and smoothing the state vector for estimating parameters.

In practice, several MC event generators and simulation tools are used. For example, the HERWIG project (http://projects.hepforge.org/herwig/) considers an angular-ordered parton shower and cluster hadronization (the tool is implemented in Fortran), the PYTHIA project (http://www.thep.lu.se/torbjorn/Pythia.html) is oriented toward a dipole-type parton shower and string hadronization (the tool is implemented in Fortran and C++), and SHERPA (http://projects.hepforge.org/sherpa/) considers a dipole-type parton shower and cluster hadronization (the tool is implemented in C++). An important tool for MC simulations is GATE (GEANT4 Application for Tomographic Emission), a generic simulation platform based on GEANT4. GATE provides new features for nuclear imaging applications and includes modules developed to meet specific requirements encountered in SPECT (Single Photon Emission Computed Tomography) and PET (Positron Emission Tomography).

The main contributions of this paper are as follows:
(i) introduction and analysis of the most important modeling methods used in High Energy Physics;
(ii) identification and description of the computational numerical methods for High Energy Physics;
(iii) presentation of the main challenges for Big Data processing.

The paper is structured as follows. Section 2 introduces the computational methods used in HEP and describes the performance evaluation of parallel numerical algorithms. Section 3 discusses the new challenge for Big Data Science generated by HEP and HPC. Section 4 presents the conclusions and general open issues.

2. Computational Methods Used in High Energy Physics

Computational methods are used in HEP in parallel with physical experiments to generate particle interactions that are modeled using vectors of events. This section presents the general approach to event generation, simulation methods based on Monte Carlo algorithms, Markovian Monte Carlo chains, methods that describe unfolding processes in particle physics, Random Matrix Theory as support for particle spectrum analysis, and kernel estimation, which produces continuous estimates of the parent distribution from the empirical probability density function. The section ends with a performance analysis of parallel numerical algorithms used in HEP.

2.1. General Approach of Event Generation

The most important aspect in simulation for HEP experiments is event generation. This process can be split into multiple steps, according to physical models. For example, the structure of LHC (Large Hadron Collider) events is: hard process; parton shower; hadronization; underlying event. According to the official LHC website (http://home.web.cern.ch/about/computing): “approximately 600 million times per second, particles collide within the LHC [...] Experiments at CERN generate colossal amounts of data. The Data Centre stores it, and sends it around the world for analysis.” The analysis must produce valuable data and the simulation results must be correlated with physical experiments.

Figure 1 presents the general approach of event generation, detection, and reconstruction. The physical model is used to create a simulation process that produces different types of events, clustered in a vector of events (e.g., the four types of events in LHC experiments).

In parallel, the real experiments are performed. The detectors identify the most relevant events and, based on reconstruction techniques, a vector of events is created. The detectors can be real or simulated (software tools), and the reconstruction phase combines real events with events detected in simulation. At the end, the final result is compared with the simulation model (especially with the generated vectors of events). The model can be corrected for further experiments. The goal is to obtain high accuracy and precision of measured and processed data.

Software tools for event generation are based on random number generators. There are three types of random numbers: truly random numbers (from physical generators), pseudorandom numbers (from mathematical generators), and quasirandom numbers (special correlated sequences of numbers, used only for integration). For example, numerical integration using quasirandom numbers usually gives faster convergence than the standard integration methods based on quadratures. In event generation pseudorandom numbers are used most often.

The most popular HEP application uses the Poisson distribution combined with a basic normal distribution. The Poisson distribution can be formulated as $P(n; \mu) = \frac{\mu^{n} e^{-\mu}}{n!}$ with $n = 0, 1, 2, \ldots$ ($V(n) = \mu$ is the variance and $E(n) = \mu$ is the expectation value). Having a uniform random number generator called RND() (Random based on Normal Distribution), we can use the following two algorithms for event generation techniques.

The result of running Algorithms 1 and 2 to generate random numbers is presented in Figure 2. In general, the second algorithm gives better results for the Poisson distribution. The general recommendation for HEP experiments is to use popular random number generators such as TRNG (True Random Number Generators), RANMAR (Fast Uniform Random Number Generator used in CERN experiments), RANLUX (algorithm developed by Lüscher, used by Unix random number generators), and Mersenne Twister (the “industry standard”). Random number generators provided with compilers, operating systems, and programming language libraries can have serious problems because they are based on the system clock and suffer from a lack of uniformity of distribution for large amounts of generated numbers and from correlation of successive values.

(1)   procedure  RANDOM_GENERATOR_POISSON( )
(2)      ;
(3)      ;
(4)      ;
(5)     while     do
(6)        ;
(7)        ;
(8)        ;
(9)     end while
(10)    return number;
(11)  end procedure

(2)       ;
(3)       ;
(4)       ;
(5)       ;
(6)      while     do
(7)         ;
(8)         ;
(9)         ;
(10)     end while
(11)     return number;
(12)  end procedure
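The statement bodies of Algorithms 1 and 2 are not recoverable from the listings above, so the following is only a minimal sketch of two textbook ways to turn a uniform generator into a Poisson generator. It is an assumption that these resemble the authors' algorithms: `poisson_multiplicative` multiplies uniform numbers until the product falls below $e^{-\mu}$, and `poisson_additive` is the equivalent logarithmic form, which avoids the underflow of $e^{-\mu}$ for large $\mu$. The function names and the `rnd` parameter (standing in for RND()) are hypothetical.

```python
import math
import random


def poisson_multiplicative(mu, rnd=random.random):
    """Multiply uniforms until the product drops below exp(-mu) (assumed textbook method)."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    threshold = math.exp(-mu)
    number = -1
    product = 1.0
    while product > threshold:
        product *= rnd()
        number += 1
    return number


def poisson_additive(mu, rnd=random.random):
    """Same idea in logarithmic form: sum exponential waiting times until they reach mu."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    number = -1
    total = 0.0
    while total < mu:
        total += -math.log(1.0 - rnd())  # 1 - u lies in (0, 1], so log is defined
        number += 1
    return number
```

For a large sample, both the sample mean and the sample variance of the generated numbers should be close to $\mu$, which is a quick sanity check for any Poisson generator.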

The art of event generation is to use appropriate combinations of various random number generation methods in order to construct an efficient event generation algorithm that solves a given problem in HEP.

2.2. Monte Carlo Simulation and Markovian Monte Carlo Chains in HEP

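In general, a Monte Carlo (MC) method is any simulation technique that uses random numbers to solve a well-defined problem, $P$. If $F$ is a solution of the problem $P$ (e.g., $F \in \mathbb{R}^{n}$ or $F$ has a Boolean value), we define $\hat{F}$, an estimation of $F$, as $\hat{F} = f(\{r_1, r_2, \ldots, r_n\})$, where $\{r_i\}$ is a random variable that can take more than one value and for which any value that will be taken cannot be predicted in advance. If $\rho(r)$ is the probability density function, $\rho(r)\,dr = P[r < r' < r + dr]$, the cumulative distribution function is $C(r) = \int_{-\infty}^{r} \rho(x)\,dx$, so that $\rho(r) = dC(r)/dr$.

$C(r)$ is a monotonically nondecreasing function with all values in $[0, 1]$. The expectation value is $E(f) = \int f(r)\,dC(r) = \int f(r)\rho(r)\,dr$, and the variance is $V(f) = E\left[(f - E(f))^{2}\right] = E(f^{2}) - E^{2}(f)$.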

2.2.1. Monte Carlo Event Generation and Simulation

To define a MC estimator the “Law of Large Numbers (LLN)” is used. LLN can be described as follows: let one choose $n$ numbers $r_i$ randomly, with the probability density function uniform on a specific interval $(a, b)$, each $r_i$ being used to evaluate $f(r_i)$. For large $n$ (consistent estimator), $\frac{1}{n}\sum_{i=1}^{n} f(r_i) \to \frac{1}{b-a}\int_{a}^{b} f(r)\,dr$.

The properties of a MC estimator are as follows: it is normally distributed (with Gaussian density); the standard deviation is $\sigma = \sqrt{V(f)/n}$, where $V(f)$ is the variance of $f$; MC is unbiased for all $n$ (the expectation value is the real value of the integral); the estimator is consistent if $V(f) < \infty$ (the estimator converges to the true value of the integral for very large $n$); a sampling phase can be applied to compute the estimator if we do not know anything about the function $f$; it is just suitable for integration. The sampling phase can be expressed, in a stratified way, as $\int_{a}^{b} f(r)\,dr = \int_{a}^{r_1} f(r)\,dr + \int_{r_1}^{r_2} f(r)\,dr + \cdots + \int_{r_n}^{b} f(r)\,dr$.

MC estimations and MC event generators are necessary tools in most of HEP experiments being used at all their steps: experiments preparation, simulation running, and data analysis.
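As an illustration of the estimator and its standard deviation, the following minimal sketch integrates a one-dimensional function with uniformly distributed sample points. The function name `mc_integrate` and its parameters are hypothetical; the error estimate assumes a finite variance of the integrand.

```python
import math
import random


def mc_integrate(f, a, b, n, rng=random):
    """Crude MC estimate of the integral of f over (a, b), with its standard error."""
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        value = f(rng.uniform(a, b))
        total += value
        total_sq += value * value
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # sample estimate of V(f)
    estimate = (b - a) * mean
    std_error = (b - a) * math.sqrt(variance / n)    # decreases as 1 / sqrt(n)
    return estimate, std_error
```

For example, `mc_integrate(math.sin, 0.0, math.pi, 100000)` should return an estimate close to 2, with a standard error that shrinks by a factor of about 10 when `n` grows by a factor of 100.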

An example of MC estimation is the Lorentz invariant phase space (LIPS) that describes the cross section for a typical HEP process with $n$ particles in the final state.

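Consider $\sigma_n \sim \int |M|^{2}\, d\Phi_n(P; p_1, \ldots, p_n)$, where $M$ is the matrix describing the interaction between particles and $d\Phi_n$ is the element of LIPS. We have the following estimation: $d\Phi_n(P; p_1, \ldots, p_n) = \delta^{(4)}\left(P - \sum_{i=1}^{n} p_i\right) \prod_{i=1}^{n} d^{4}p_i\, \delta(p_i^{2} - m_i^{2})\, \Theta(p_i^{0})$, where $P$ is the total four-momentum of the $n$-particle system; $p_i$ and $m_i$ are the four-momenta and masses of the final state particles; $\delta^{(4)}\left(P - \sum_{i} p_i\right)$ is the total energy-momentum conservation; $\delta(p_i^{2} - m_i^{2})$ is the on-mass-shell condition for the final state system; and the step function $\Theta(p_i^{0})$ selects positive energies. Based on the integration formula $d^{4}p\, \delta(p^{2} - m^{2})\, \Theta(p^{0}) \to \frac{d^{3}p}{2p^{0}}$, with $p^{0} = \sqrt{|\vec{p}|^{2} + m^{2}}$, we obtain an iterative form for the cross section, which can be numerically integrated by using the recurrence relation. As a result, we can construct a general MC algorithm for particle collision processes.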

Example 1. Let us consider the interaction: where Higgs boson contribution is numerically negligible. Figure 3 describes this interaction ( is the azimuthal angle, the polar angle, and are the four-momenta for particles).
The cross section is where , (fine structure constant), is the center of mass energy squared, and and are constant functions. For pure processes we have and , and the total cross section becomes

We introduce the following notation: and let us consider an approximation of . Then . Now, we can compute where and is the estimation of based on . Here, the MC estimator is and the standard deviation is

The final numerical result based on MC estimator is

As shown above, the principle of a Monte Carlo estimator in physics is to simulate the cross section in interaction and radiation transport, knowing the probability distributions (or an approximation) governing each interaction of elementary particles.

Based on this result, the Monte Carlo algorithm used to generate events is as follows. It takes as input and in a main loop considers the following steps: generate peer from ; compute four-momenta ; compute . The loop can be stopped in the case of unweighted events, and we will stay in the loop for weighted events. As output, the algorithm returns four-momenta for particle for weighted events and four-momenta and an array of weights for unweighted events. The main issue is how to initialize the input of the algorithm. Based on formula (for and ), we can consider as input . Then .

In HEP, theoretical predictions used for modeling particle collision processes (as shown in the example above) should be provided in terms of Monte Carlo event generators, which directly simulate these processes and can provide unweighted (weight = 1) events. A good Monte Carlo algorithm should be used not only for numerical integration [7] (i.e., to provide weighted events) but also for efficient generation of unweighted events, which is a very important issue for HEP.
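The difference between weighted and unweighted events can be sketched with a simple hit-or-miss rejection step. The example below is an illustration under stated assumptions, not the generator described in this section: the angular weight $1 + \cos^{2}\theta$, the bound `W_MAX`, and all function names are hypothetical choices made only to keep the sketch self-contained.

```python
import random

W_MAX = 2.0  # upper bound of the assumed weight on cos(theta) in [-1, 1]


def weight(cos_theta):
    """Hypothetical event weight, proportional to 1 + cos^2(theta)."""
    return 1.0 + cos_theta * cos_theta


def generate_weighted(n, rng=random):
    """Weighted events: flat in cos(theta), each carrying its own weight."""
    events = []
    for _ in range(n):
        c = rng.uniform(-1.0, 1.0)
        events.append((c, weight(c)))
    return events


def unweight(events, rng=random):
    """Keep each event with probability w / W_MAX; the survivors all have weight 1."""
    return [c for c, w in events if rng.random() * W_MAX < w]
```

With this weight the expected acceptance efficiency is the mean weight divided by `W_MAX`, that is $(4/3)/2 = 2/3$. A weight function with sharp peaks would give a much lower efficiency, which is why efficient unweighting matters in practice.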

2.2.2. Markovian Monte-Carlo Chains

A classical Monte Carlo method estimates a function $F$ with $\hat{F}$ by using a random variable. The main problem with this approach is that we cannot predict any value in advance for a random variable. In HEP simulation experiments the systems are described in states [8]. Let us consider a system with a finite set of possible states $S_1, S_2, \ldots, S_m$, and $X_t$ the state at the moment $t$. The conditional probability is defined as $P(X_t = S_{i_t} \mid X_{t-1} = S_{i_{t-1}}, \ldots, X_1 = S_{i_1})$, where the mappings $t \mapsto i_t$ can be interpreted as the description of the system evolution in time by specifying a specific state for each moment of time.

The system is a Markov chain if the distribution of $X_t$ depends only on the immediate predecessor $X_{t-1}$ and is independent of all previous states, as follows: $P(X_t = S_{i_t} \mid X_{t-1} = S_{i_{t-1}}, \ldots, X_1 = S_{i_1}) = P(X_t = S_{i_t} \mid X_{t-1} = S_{i_{t-1}})$.

To generate the time steps we use the probability of a single forward Markovian step given by with the property and we define . The 1-dimensional Monte Carlo Markovian Algorithm used to generate the time steps is presented in Algorithm 3.

(1)   Generate according with
(2)   if     then               Generate the initial state.
(3)     ;          Compute the initial probability.
(4)    Retain   ;
(5)   end if
(6)   if     then          Discard all generated and computed data.
(7)     ; ;
(8)    Delete   ;
(9)    EXIT.                 The algorithm ends here.
(10)  end if
(11)    ;
(12)  while (1) do              Infinite loop until a successful EXIT.
(13)   Generate according with
(14)   if     then          Generate a new state and new probability.
(15)     ;
(16)    Retain   ;
(17)   end if
(18)   if     then           Discard all generated and computed data.
(19)     ; ;
(20)    Retain   ;  Delete   ;
(21)    EXIT.                    The algorithm ends here.
(22)   end if
(23)    ;
(24)  end while

The main result of Algorithm 3 is that follows a Poisson distribution:

We can consider the 1-dimensional Monte Carlo Markovian Algorithm as a method used to iteratively generate the systems’ states (codified as a Markov chain) in simulation experiments. According to the Ergodic Theorem for Markov chains, the chain defined has a unique stationary probability distribution [9, 10].
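The single-step probability used by Algorithm 3 is not preserved in the text above. The sketch below therefore assumes the simplest case, an exponential step density with unit rate, and generates forward steps until an upper limit is exceeded; the step that overshoots is discarded. The function name `generate_time_steps` and the parameter `t_max` are hypothetical. Under this assumption the number of retained steps is Poisson distributed with mean `t_max`.

```python
import math
import random


def generate_time_steps(t_max, rng=random):
    """Ordered times t_1 < t_2 < ... <= t_max built from independent exponential steps."""
    times = []
    t = 0.0
    while True:
        # Each step depends only on the previous time (Markov property).
        t += -math.log(1.0 - rng.random())  # exponential increment, unit rate (assumed)
        if t > t_max:
            return times  # the overshooting step is discarded
        times.append(t)
```

Repeating `len(generate_time_steps(3.0))` many times and histogramming the result gives a quick check of the Poisson behaviour of the step count.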

Figures 4 and 5 present the running of Algorithm 3. According to different values of parameter used to generate the next step, the results are very different, for 1000 iterations. Figure 4 for shows a profile of the type of noise. For profile looks like some of the information is filtered and lost. The best results are obtained for and and the generated values can be easily accepted for MC simulation in HEP experiments.

Figure 5 shows the acceptance rate of values generated with parameter used in the algorithm. And parameter values are correlated with Figure 4. Results in Figure 5 show that the acceptance rate decreases rapidly with increasing value of parameter . The conclusion is that values must be kept small to obtain meaningful data. A correlation with the normal distribution is evident, showing that a small value for the mean square deviation provides useful results.

2.2.3. Performance of Numerical Algorithms Used in MC Simulations

Numerical methods used to compute the MC estimator use numerical quadratures to approximate the value of the integral of a function $f$ on a specific domain $\Omega$ by a linear combination of function values and weights as follows: $\int_{\Omega} f(x)\,dx \approx \sum_{i=1}^{n} w_i f(x_i)$.

We can consider a consistent MC estimator as a classical numerical quadrature with all weights $w_i$ equal. The efficiency of integration methods for one dimension and for $d$ dimensions is presented in Table 1. We can conclude that quadrature methods are difficult to apply in many dimensions and for varied integration domains (regions), where the integral is not easy to estimate.

As a practical example, in a typical high-energy particle collision there can be many final-state particles (even hundreds). If we have $n$ final-state particles, we face a $d = 3n - 4$ dimensional phase space. As a numerical example, for we have dimensions, which is very difficult for classical numerical quadratures.
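The growth of the quadrature cost with the number of final-state particles can be made concrete with a few lines of arithmetic. The sketch below assumes the standard phase-space dimension $d = 3n - 4$ and a product grid with the same number of points on every axis; the function names are hypothetical. The MC sample count assumes an error of the form $\sigma/\sqrt{N}$, which does not depend on the dimension.

```python
import math


def phase_space_dimension(n_particles):
    """Dimension of the phase space for n final-state particles (assumed 3n - 4)."""
    return 3 * n_particles - 4


def grid_evaluations(n_particles, points_per_axis):
    """Function evaluations needed by a product quadrature rule on a regular grid."""
    return points_per_axis ** phase_space_dimension(n_particles)


def mc_samples_for_error(target_error, sigma=1.0):
    """Number of MC samples N such that sigma / sqrt(N) <= target_error."""
    return math.ceil((sigma / target_error) ** 2)
```

With only 10 points per axis, `grid_evaluations(4, 10)` already gives $10^{8}$ evaluations, while the MC sample count for a fixed target error stays the same for any number of particles.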

Full decomposition integration volume for one double number (10 Bytes) per volume unit is Bytes. For the example considered with and divisions for interval we have, for one numerical integration, Considering events per second, one integration per event, the data produced in one hour will be 3197.4  Bytes.

The previous assumption is only for multidimensional arrays. But due to the factorization assumption, , we obtain for one integration which means 2.62  Bytes of data produce for one hour of simulations.

2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP

In particle physics analysis we have two types of distributions: the true distribution (considered in theoretical models) and the measured distribution (considered in experimental models, which are affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a known true distribution and generates a measured distribution, corresponding to an experiment of a well-confirmed theory. An inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].

2.3.1. Unfolding Processes in Particle Physics

The theory of unfolding processes in particle physics is as follows [12]. For a physics variable $t$ we have a true distribution $f(t)$ mapped in an $n$-vector $x$ of unknowns and a measured distribution $g(s)$ (for a measured variable $s$) mapped in an $m$-vector $y$ of measured data. A response matrix $A$ (of size $m \times n$) encodes a Kernel function $K(s, t)$ describing the physical measurement process [12-15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind, for a specific domain $\Omega$: $\int_{\Omega} K(s, t) f(t)\,dt = g(s)$. In particle physics the Kernel function $K(s, t)$ is usually known from a Monte Carlo sample obtained from simulation. A numerical solution is obtained using the following linear equation: $Ax = y$. Vectors $x$ and $y$ are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice, the statistical properties of the measurements are also well known and often follow Poisson statistics [17]. To solve the linear systems we have different numerical methods.

The first method is based on the linear transformation $x = A^{\#} y$. If $m = n$ then $A^{\#} = A^{-1}$ and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on the Householder transformation, Givens methods, or the Gram-Schmidt algorithm). If $m > n$ (the most frequent scenario) we construct the matrix $A^{\#} = (A^{T}A)^{-1}A^{T}$ (called the Moore-Penrose pseudoinverse). In these cases the orthogonal methods offer very good and stable numerical solutions.

The second method considers the singular value decomposition: $A = U \Sigma V^{T} = \sum_{i=1}^{n} \sigma_i u_i v_i^{T}$, where $U$ (of size $m \times n$) and $V$ (of size $n \times n$) are matrices with orthonormal columns and the diagonal matrix is $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) = U^{T} A V$. The solution is $x = A^{\#} y = V \Sigma^{-1} (U^{T} y) = \sum_{i=1}^{n} \frac{1}{\sigma_i} (u_i^{T} y)\, v_i = \sum_{i=1}^{n} \frac{c_i}{\sigma_i} v_i$, where $c_i = u_i^{T} y$, $i = 1, \ldots, n$, are called Fourier coefficients.
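As a minimal sketch of the first method for an overdetermined system, the following code builds the normal equations and solves them by Gaussian elimination with partial pivoting. It assumes a small, well-conditioned response matrix with more rows than columns; the names `unfold_least_squares`, `solve`, `matmul`, and `transpose` are hypothetical helpers, not part of any library. For ill-conditioned response matrices the orthogonal or SVD-based methods are preferable, because forming the normal equations squares the condition number.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def unfold_least_squares(response, measured):
    """Least-squares solution of response * x = measured via the normal equations."""
    at = transpose(response)
    ata = matmul(at, response)
    aty = [sum(a * y for a, y in zip(row, measured)) for row in at]
    return solve(ata, aty)
```

For instance, with the hypothetical response matrix `[[0.8, 0.1], [0.2, 0.7], [0.0, 0.2]]` and the noise-free measurement `[85.0, 55.0, 10.0]`, the function recovers a true vector close to `[100, 50]`.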

2.3.2. Random Matrix Theory

The analysis of particle spectra (e.g., the neutrino spectrum) relies on Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT is the study of the statistical properties of the eigenvalues of very large matrices [18]. For an interaction matrix $A$ (with size $N \times N$), where $A_{ij}$ is an independent distributed random variable and $A^{H}$ is the complex conjugate and transpose matrix, we define $M = A + A^{H}$, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution $P(M)\,dM$: it is invariant under unitary transformation, $P(M)\,dM = P(M')\,dM'$, where $M' = U^{H} M U$ and $U$ is a unitary matrix ($U^{H}U = I$); the elements of matrix $M$ are statistically independent, $P(M) = \prod_{i \le j} P_{ij}(M_{ij})$; and the matrix $M$ can be diagonalized as $M = U D U^{H}$, where $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$, $\lambda_i$ is an eigenvalue of $M$, and $\lambda_i \le \lambda_j$ if $i < j$.

The numerical methods used for eigenvalue computation are the QR method and Power methods (direct and indirect). The QR method is a numerically stable algorithm and the Power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.
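The direct Power method can be sketched in a few lines. The code below is an illustration restricted to real symmetric matrices (an assumption made for simplicity; GUE matrices are complex Hermitian) and assumes the start vector is not orthogonal to the dominant eigenvector. The function name `power_method` and its parameters are hypothetical.

```python
import math


def power_method(matrix, iterations=200, tol=1e-12):
    """Dominant eigenvalue and eigenvector of a real symmetric matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("iteration vector collapsed to zero")
        v_next = [x / norm for x in w]
        # Rayleigh quotient v^T M v for the normalized vector.
        mv = [sum(a * x for a, x in zip(row, v_next)) for row in matrix]
        new_eigenvalue = sum(x * y for x, y in zip(v_next, mv))
        if abs(new_eigenvalue - eigenvalue) < tol:
            return new_eigenvalue, v_next
        eigenvalue, v = new_eigenvalue, v_next
    return eigenvalue, v
```

The convergence rate depends on the ratio between the two largest eigenvalues in absolute value, so closely spaced spectra need many iterations.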

2.3.3. Kernel Estimation in HEP

Kernel estimation is a very powerful and relevant method for HEP when it is necessary to combine data from heterogeneous sources, such as MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data $\{x_i\}$, $i = 1, \ldots, n$, with a constant bandwidth $h$ (the difference between two consecutive data values), called the smoothing parameter, we have the estimation $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$, where $K$ is an estimator. For example, a Gauss estimator with mean $\mu$ and standard deviation $\sigma$ is $K(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right)$ and has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined for an infinite support ($-\infty, \infty$). The kernel is a nonparametric method, which means that $h$ is independent of the dataset, and for a large amount of normally distributed data we can find a value for $h$ that minimizes the integrated squared error of $\hat{f}$. This value for the bandwidth is computed as $h^{*} = \left(\frac{4}{3n}\right)^{1/5} \sigma$.
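A fixed-bandwidth kernel estimate can be sketched as follows. The code assumes a standard Gaussian kernel (zero mean, unit standard deviation) and the normal reference rule $h = (4/(3n))^{1/5}\sigma$ for the bandwidth, with $\sigma$ taken as the sample standard deviation; the function names are hypothetical.

```python
import math


def normal_reference_bandwidth(data):
    """Bandwidth h = (4 / (3 n))^(1/5) * sigma, with sigma the sample standard deviation."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return (4.0 / (3.0 * n)) ** 0.2 * sigma


def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, h):
    """Fixed-bandwidth kernel estimate of the density at the point x."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)
```

A typical call is `h = normal_reference_bandwidth(data)` followed by `kde(x, data, h)` on a grid of `x` values. Because each kernel integrates to one, the estimate is itself a normalized density.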

The main problem in Kernel Estimation is that the set of data is not normally distributed and in real experiments the optimal bandwidth is not known. An improvement of the presented method is the adaptive Kernel Estimation proposed by Abramson [20], where and are considered global qualities for the dataset. The new form is and the local bandwidth value that minimizes the integrated squared error of is where is the normal estimator.

Kernel estimation is used for event selection to confidence level evaluation, for example, in Markovian Monte Carlo chains or in the selection of neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of Kernel estimation in HEP is searching for new particles by finding relevant data in a large dataset.

A method based on Kernel estimation is the graphical representation of datasets using the averaged shifted histogram (ASH) algorithm. This is a numerical interpolation for large datasets with the main aim of creating a set of histograms , with the same bin-width . Algorithm 4 presents the steps of histogram generation starting with a specific interval , a number of points in this interval, a number of bins, and a number of values used for kernel estimation, . Figure 6 shows the results of kernel estimation of function on and the graphical representation with different numbers of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.

(1)   procedure  
(2)     ; ;
(3)    for     do
(4)       ; ;
(5)    end for
(6)    for     do
(7)       ;
(8)      if     then
(9)        ;
(10)     end if
(11)   end for
(12)   for     do
(13)     if     then
(14)       ;
(15)     end if
(16)     for     do
(17)       ;
(18)     end for
(19)   end for
(20)   for     do
(21)      ;
(22)      ;
(23)   end for
(24)   return   , .
(25)  end procedure
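Since the statement bodies of Algorithm 4 are not preserved above, the following is only a sketch of the averaged shifted histogram in its usual textbook form, and it is an assumption that Algorithm 4 follows the same idea: data are counted in narrow bins of width `delta`, and each estimate is a triangular-weighted average over `m` neighbouring bins on each side. The function name and all parameter names are hypothetical.

```python
def averaged_shifted_histogram(data, a, b, nbins, m):
    """Bin centres and ASH density estimates on [a, b) with nbins narrow bins."""
    if nbins < 1 or m < 1:
        raise ValueError("nbins and m must be positive")
    n = len(data)
    delta = (b - a) / nbins  # narrow bin width
    h = m * delta            # effective smoothing width
    counts = [0] * nbins
    for x in data:
        if a <= x < b:
            index = min(int((x - a) / delta), nbins - 1)  # guard against rounding
            counts[index] += 1
    # Triangular weights 1 - |i| / m for shifts i = 1 - m, ..., m - 1.
    weights = {i: 1.0 - abs(i) / m for i in range(1 - m, m)}
    centres = [a + (k + 0.5) * delta for k in range(nbins)]
    density = []
    for k in range(nbins):
        s = 0.0
        for i, w in weights.items():
            if 0 <= k + i < nbins:
                s += w * counts[k + i]
        density.append(s / (n * h))
    return centres, density
```

When all data lie well inside the interval, the estimate integrates to one, because the triangular weights sum to `m` and `h = m * delta`.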

2.3.4. Performance of Numerical Algorithms Used in Particle Physics

All operations used in the presented methods for particle physics (Unfolding Processes, Random Matrix Theory, and Kernel Estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subprograms) in the C language, extended with additional precision, is described. This permits higher internal precision and mixed input/output types. The precision allows implementation of some algorithms that are simpler, more accurate, and sometimes faster than is possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check for the $Ax = b$ system, with $A$ a dense $n \times n$ matrix. The residual check is defined as $r_{\infty} = \frac{\|Ax - b\|_{\infty}}{n\,\varepsilon\,(\|A\|_{\infty}\|x\|_{\infty} + \|b\|_{\infty})} < 16$, where $\varepsilon$ is the relative machine precision for the IEEE representation standard; $\|x\|_{\infty}$ is the infinity norm of a vector: $\|x\|_{\infty} = \max_{1 \le i \le n} |x_i|$; and $\|A\|_{\infty}$ is the infinity norm of a matrix: $\|A\|_{\infty} = \max_{1 \le i \le n} \sum_{j=1}^{n} |A_{ij}|$.
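The residual check can be evaluated directly from its definition. The sketch below assumes the scaled form $\|Ax - b\|_{\infty} / (n\,\varepsilon\,(\|A\|_{\infty}\|x\|_{\infty} + \|b\|_{\infty}))$ with the threshold 16, and takes $\varepsilon$ from `sys.float_info.epsilon` (double precision); some conventions use half of that value. The function names are hypothetical.

```python
import sys


def inf_norm_vector(v):
    """Infinity norm of a vector: largest absolute component."""
    return max(abs(x) for x in v)


def inf_norm_matrix(a):
    """Infinity norm of a matrix: largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)


def scaled_residual(a, x, b, eps=sys.float_info.epsilon):
    """Scaled residual of a candidate solution x of the system a x = b."""
    n = len(b)
    residual = [sum(aij * xj for aij, xj in zip(row, x)) - bi
                for row, bi in zip(a, b)]
    scale = n * eps * (inf_norm_matrix(a) * inf_norm_vector(x) + inf_norm_vector(b))
    return inf_norm_vector(residual) / scale


def passes_residual_check(a, x, b, threshold=16.0):
    return scaled_residual(a, x, b) < threshold
```

A solution that is exact up to rounding gives a scaled residual of order one, while a perturbation far above machine precision fails the check by many orders of magnitude.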

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for single and double precision. For single precision, , for all the residual check is always lower than the imposed upper bound; similarly for double precision with , for all . If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for data volume in this model.

In a single-processor system, the complexity of algorithms depends only on the problem size, $n$. We can assume $T(n) = O(f(n))$, where $f(n)$ is a fundamental function. In parallel systems (multiprocessor systems, with $p$ processors) we have the serial processing time $T_s(n)$ and the parallel processing time $T_p(n)$. The performance of parallel algorithms can be analyzed using speed-up, efficiency, and isoefficiency metrics.
(i) The speed-up, $S$, represents how much faster a parallel algorithm is than the corresponding sequential algorithm. The speed-up is defined as $S = T_s(n)/T_p(n)$. There are special bounds for speed-up [23] in terms of the average parallelism (the average number of busy processors given an unbounded number of processors). Usually $S \le p$, but under special circumstances the speed-up can be $S > p$ [24]. Another upper bound is established by Amdahl's law: $S \le \frac{1}{f_s + (1 - f_s)/p} \le \frac{1}{f_s}$, where $f_s$ is the fraction of a program that is sequential. The upper bound $1/f_s$ corresponds to zero time for the parallel fraction.
(ii) The efficiency is the average utilization of the $p$ processors: $E = S/p$.
(iii) The isoefficiency is the growth rate of workload in terms of the number of processors needed to keep efficiency fixed. For any fixed efficiency this gives the workload as a function of the number of processors, which means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of $n$ numbers using $p$ processors, the problem size must grow as $n = \Theta(p \log p)$ to keep the efficiency fixed.
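The three metrics can be illustrated numerically. The sketch below assumes the usual definitions (speed-up as serial time over parallel time, efficiency as speed-up per processor, Amdahl's bound for a fixed sequential fraction) and a hypothetical cost model for the parallel sum: local additions followed by a reduction with $\log_2 p$ steps. All function names and the unit costs `t_add` and `t_comm` are assumptions made for illustration.

```python
import math


def speedup(t_serial, t_parallel):
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p


def amdahl_bound(serial_fraction, p):
    """Upper bound on the speed-up for a given sequential fraction and p processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def parallel_sum_time(n, p, t_add=1.0, t_comm=1.0):
    """Hypothetical cost model: n / p local additions, then log2(p) reduction steps."""
    return t_add * n / p + (t_add + t_comm) * math.log2(p)
```

With unit costs, choosing $n = p \log_2 p$ keeps the efficiency of the parallel sum at $1/3$ for every $p$, which is the isoefficiency relation in action; with a sequential fraction of 0.1 the Amdahl bound never exceeds 10, no matter how many processors are used.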

We assume that the numerical algorithms are implemented on a hypercube architecture. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture a simple model for intertask communication considers $T_{com} = t_s + L\,t_w$, where $t_s$ is the latency (the time needed by a message to cross through the network), $t_w$ is the time needed to send a word ($1/t_w$ is called bandwidth), and $L$ is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define $t_c$ as the processing time per word for a processor. We have the following results.
(i) External product  . The isoefficiency is written as Parallel processing time is . The optimality is computed using
(ii) Scalar product (internal product) . The isoefficiency is written as
(iii) Matrix-vector product . The isoefficiency is written as

Table 2 presents the amount of data that can be processed for a specific size. The cases that meet the upper bound are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits for the amount of data, which means that we also have an upper bound for Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (the work-load distribution between all used processors in a system, to be maximized); concurrency (the number/percentage of processors working simultaneously, to be maximized); and overhead (extra work introduced by parallel processing that does not appear in serial processing, to be minimized).

3. New Challenges for Big Data Science

Many applications generate Big Data, such as social networking profiles, social influence, SaaS & Cloud Apps, public web information, MapReduce scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications continuously produce increasing volumes of both unstructured and structured data. This growth requires reevaluating the approaches and solutions for data processing, transfer, and storage in order to better answer the users’ needs [25]. In this context, scheduling models and algorithms for data processing play an important role and become a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (generated using MC methods based on theoretical models). The processing phase is represented by modeling and reconstruction in order to find properties of observed particles (see Figure 8). Then, the data are analyzed and reduced to a simple statistical distribution. The comparison of the results obtained shows how realistic a simulation experiment is and validates it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes of resources’ availability, and so forth. We will pay special attention to Cloud Systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are maximization of system throughput, sites’ autonomy, scalability, fault-tolerance, and quality of services.

When discussing Big Data we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. Many organizations, companies, and researchers clearly need to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrives too fast for optimal storage and indexing [28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, Hadoop YARN: A next-generation framework for Hadoop data processing, 2013 (http://hortonworks.com/hadoop/yarn/)), Corona (Corona, Under the Hood: Scheduling MapReduce jobs more efficiently with Corona, 2012 (Facebook)), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges for Big Data Science on the modern and future Scientific Data Infrastructure (SDI) are presented in [31]. That paper introduces the Scientific Data Life-cycle Management (SDLM) model, which includes all the major stages and reflects specifics of data management in modern e-Science. It also proposes the SDI generic architecture model, which provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and the limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry standard servers rather than on expensive specialized data storage systems [32].
The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduler, and resource aware scheduling.
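
As an illustration of one of the algorithms listed above, the following is a minimal sketch of the idea behind delay scheduling: a job that has no data-local task for the free node is skipped for a bounded number of scheduling attempts before it is allowed to run a task remotely. The names `Job`, `assign`, and `max_skips` are hypothetical and chosen only for this example; this is not the Hadoop implementation, and the job list is assumed to be already sorted by fair-share priority.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    pending: list      # one set of node names per unlaunched task (where its input lives)
    skips: int = 0     # times this job was passed over for lack of a local task


def assign(node, jobs, max_skips):
    """Choose a task for the free `node`; return (job name, ran_locally) or None."""
    for job in jobs:                       # assumed sorted by fair-share priority
        if not job.pending:
            continue
        for i, replicas in enumerate(job.pending):
            if node in replicas:           # data-local task found
                job.pending.pop(i)
                job.skips = 0
                return job.name, True
        if job.skips >= max_skips:         # waited long enough: accept a remote task
            job.pending.pop(0)
            return job.name, False
        job.skips += 1                     # keep waiting, let the next job try
    return None
```

The threshold `max_skips` expresses the trade-off in this sketch: a larger value favors data locality, while a smaller value favors strict fairness in the queue order.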

Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge in the context of Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time in order to find a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and query, and finally Big Data analysis and mining. On the other side, business analytics, occupying the intersection of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data is too heterogeneous to fit into a rigid schema [38].
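
To make the point about transfer-aware mapping concrete, the sketch below assigns each task to the site with the smallest estimated completion time, where the estimate adds the data transfer time to the execution time. It is a simple greedy heuristic written for illustration, not a method taken from the cited works; the classes `Task` and `Site`, the `bandwidth` table (in GB/s between site pairs), and all units are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    work: float        # abstract compute units
    data_site: str     # site that currently holds the input data
    data_size: float   # input size in GB


@dataclass(frozen=True)
class Site:
    name: str
    speed: float       # compute units per second


def completion_time(task, site, bandwidth, ready_at):
    """Estimated finish time of `task` on `site`, transfer included."""
    if task.data_site == site.name:
        transfer = 0.0                     # data is already local
    else:
        transfer = task.data_size / bandwidth[(task.data_site, site.name)]
    return ready_at[site.name] + transfer + task.work / site.speed


def greedy_mapping(tasks, sites, bandwidth):
    """Map the largest tasks first, each to the site that finishes it earliest."""
    ready_at = {s.name: 0.0 for s in sites}
    mapping = {}
    for task in sorted(tasks, key=lambda t: t.work, reverse=True):
        best = min(sites, key=lambda s: completion_time(task, s, bandwidth, ready_at))
        ready_at[best.name] = completion_time(task, best, bandwidth, ready_at)
        mapping[task.name] = best.name
    return mapping, max(ready_at.values())
```

In this sketch a task with a large input tends to stay on the site that holds its data even when a faster site is available, whereas a task with a small input is moved to the faster site, which is the behavior the paragraph above argues for.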

Another challenge concerns the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, as well as the dynamic behavior of the system [39]. HEP experiments are usually performed in private Clouds, where dynamic scheduling with soft deadlines remains an open issue.
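
A policy that accounts for deadlines and budgets can be sketched as follows. The example admits only the requests whose estimated cost fits the user's budget, orders the admitted ones by earliest deadline, and reports lateness instead of rejecting late requests, since the deadlines are soft. The `Request` fields, the flat `price_per_hour` cost model, and the single-resource assumption are all hypothetical simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    runtime: float     # estimated running time in hours
    deadline: float    # soft deadline, in hours from now
    budget: float      # maximum amount the user is willing to pay


def order_requests(requests, price_per_hour):
    """Admit requests whose cost fits the budget, then sort by earliest deadline."""
    admitted = [r for r in requests if r.runtime * price_per_hour <= r.budget]
    return sorted(admitted, key=lambda r: r.deadline)


def lateness(ordered):
    """Lateness of each request when run back to back on a single resource."""
    clock, late = 0.0, {}
    for r in ordered:
        clock += r.runtime
        late[r.name] = max(0.0, clock - r.deadline)   # soft deadline: only reported
    return late
```

A dynamic scheduler would recompute this ordering whenever requests arrive or resource availability changes; the sketch covers only a single static decision.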

The optimization techniques for the scheduling process represent an important aspect because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architecture of Hadoop make it a solution for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to the public Cloud in order to achieve a cost-aware Cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely along several dimensions, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. Current research directions focus on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), on complex task dependencies for data workflows, and on new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of the particle spectrum. For each method, the appropriate numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper puts together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving systems of linear equations, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, we produce huge amounts of information, and we face the upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The research presented in this paper is supported by the following projects: “SideSTEP—Scheduling Methods for Dynamic Distributed Systems: a self-* approach”, (PN-II-CT-RO-FR-2012-1-0084); “ERRIC—Empowering Romanian Research on Intelligent Information Technologies,” FP7-REGPOT-2010-1, ID: 264207; CyberWater Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.