#### Abstract

The Indian Railways Reservation System (IRRS) is one of the world’s busiest railway ticket reservation systems. The COVID-19 pandemic severely affected the transportation services of the Indian Railways (IR), which forced the IR to alter its passenger reservation system. This research evaluates and analyses the factors that modify the IRRS. A rough set-based Data Mining Scaffolding (DMS) is proposed, in which the relevant preferential information related to the IRRS is managed through multi-criteria decision-making (MCDM), where a decision-maker (DM) can make a decision based on several decision rules. The effectiveness of the proposed DMS is demonstrated on real data for 26 trains that ran between railway stations of two metro cities of India during the COVID-19 pandemic period.

#### 1. Introduction

Since its inception, the Rough Set Theory (RST) [1, 2] has emerged as an important mathematical technique and has gradually drawn the attention of researchers and engineers as an alternative to fuzzy sets, vague sets, etc., for tackling vagueness and uncertainty. The RST has widespread applications in different domains including marketing, banking, data mining, engineering, medicine, and expert systems. The main aim of the RST is the approximation of a set by a pair of crisp sets called the lower and upper approximation sets. The lower approximation comprises those objects that definitely belong to the set, and the upper approximation contains those objects that possibly belong to the set. In the RST, data are represented in a tabular format known as an information table.

As a mathematical tool for data analysis, the RST has found applications in many practical fields such as data envelopment analysis (DEA) [3, 4], data mining [5], multi-criteria decision analysis [6, 7], medical diagnosis [8, 9], neural networks [10], signal processing [11], etc. The RST [12–14] is a recently developed efficient technique for managing uncertainty. It has been used quite successfully to explore data dependencies, evaluate the significance of attributes, discover data patterns, reduce redundancies, and recognise and classify objects. Further, the extraction of rules from databases can also be done using the RST. As far as the decision-making process under the paradigm of the rough set is concerned, Stević et al. [15] proposed the concept of rough numbers and employed the rough best-worst method to determine the weight values of the criteria. Subsequently, Ye et al. [16] proposed a multi-attribute decision-making approach based on fuzzy-rough sets. Moreover, they verified the feasibility of the proposed method by solving the building shape selection problem present in the UCI database. Furthermore, Hu et al. [17] incorporated different weights in the neighbourhood relation and proposed an innovative approach for attribute reduction of the rough set. The efficiency of the proposed method was analysed by employing it on benchmark machine learning and biomedical datasets. Besides, some recent studies related to railway transport and its safety [18–23] are also found in the literature.

The Indian Railways (IR) is a vast and busy network of trains spanning all corners of India. Every day, a large number of passengers reserve tickets to avail the services of the IR. Consequently, the Indian Railways Reservation System (IRRS) remains so congested that many passengers do not get their desired seat confirmation even after multiple attempts. The parameters that mostly influence this congestion are the Current Booking Status (CBS) of the trains, Travel Time (TT), Average Speed (AS), etc. These parameters are very dynamic and change continuously with time. Hence, uncertainty is always a major factor when dealing with these parameters. Rough sets can process and represent such uncertainty and vagueness effectively in this context.

In this paper, we have considered a case study on the IRRS, using rough sets as the data mining tool. The main contributions of this case study are as follows. First, a Data Mining Scaffolding (DMS) based on the RST is proposed, which can guide passengers efficiently in reserving train tickets for their journey. Second, the pertinent preferential information of ticket reservation, which is relevant to the IRRS, is processed effectively by a Multi-Criteria Decision-Making (MCDM) approach where a decision is taken based on several decision rules. These decision rules are defined by considering various crucial factors regarding ticket reservation in the IRRS.

In this case study, we have defined rules based on 12 different effective parameters using the RST. The generated rule base can guide passengers efficiently during the reservation of tickets and can also prove worthwhile to the IR, as it can suggest how the IRRS may be improved to serve the passengers.

The rest of the paper is organized as follows. The basic concepts and some related properties of the RST are discussed in Section “Preliminaries.” A rough set-based algorithm for the multi-criteria decision model is presented in Section “Proposed Algorithmic Approach.” The application of the proposed algorithm through a case study of the IRRS is discussed in Section “Case study.” Finally, the epilogue and some future scope of our study are stated in Section “Results, Discussion, and Conclusions.”

#### 2. Preliminaries

This section presents a brief introduction to the RST and its related properties.

##### 2.1. Data Table and the Indiscernibility Relation of the RST

RST is a mathematical technique to tackle vagueness and uncertainty. It is founded on the assumption that some information is associated with every object in the universe of discourse. A rough set is characterized by means of its lower approximation and upper approximation ([1], [24]). In the RST, information related to decision objects is often represented in the form of an information table. This information table is represented as a four-tuple information system $S = \langle U, A, V, f \rangle$, where $U$ is the finite set of objects, $A$ is the finite set of attributes, $V = \bigcup_{a \in A} V_a$ is the union of the attribute domains, and $f : U \times A \rightarrow V$ is the information function. For any set of attributes $B \subseteq A$, there exists an equivalence relation $IND(B)$ such that $IND(B) = \{(x, y) \in U \times U : a(x) = a(y) \ \text{for all } a \in B\}$, where $a(x)$ represents the value of the attribute $a$ for the element $x$. $IND(B)$ is known as the indiscernibility relation, and its equivalence classes are denoted by $[x]_B$. A set is known as an elementary set if it contains all members that are indiscernible with respect to the particular attributes.

Given a set of attributes $B \subseteq A$ and a set $X \subseteq U$, the lower and upper approximations of $X$ are defined as follows:

$$\underline{B}X = \{x \in U : [x]_B \subseteq X\}, \qquad \overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\}.$$

The boundary region of set *X* is described as follows:

$$BN_B(X) = \overline{B}X - \underline{B}X.$$

The set $\underline{B}X$ is the set of all elements that can be surely classified as members of $X$ using the knowledge $B$, whereas $\overline{B}X$ is the set of elements that can possibly be classified as members of $X$ using the knowledge $B$.

The boundary region $BN_B(X)$ is the set of objects that cannot be decisively classified into $X$ using the knowledge $B$. If the boundary region is empty, the set is exact and its lower and upper approximations are identical. Otherwise, if there exists a nonempty boundary region for the set, then the set is referred to as rough with respect to $B$. Figure 1 depicts a diagrammatic representation of a rough set.
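The definitions of the indiscernibility classes and of the lower approximation, upper approximation, and boundary region can be illustrated with a minimal Python sketch. The toy table, the attribute names `a` and `b`, and the function names below are hypothetical and are introduced only for illustration; they are not taken from Table 1.

```python
from collections import defaultdict


def partition(table, attrs):
    """Return the equivalence classes of IND(attrs) as a list of sets."""
    classes = defaultdict(set)
    for obj, row in table.items():
        # Objects with identical values on all attrs are indiscernible.
        key = tuple(row[a] for a in attrs)
        classes[key].add(obj)
    return list(classes.values())


def lower_approximation(classes, target):
    """Union of the classes that lie entirely inside the target set."""
    return {x for c in classes if c <= target for x in c}


def upper_approximation(classes, target):
    """Union of the classes that intersect the target set."""
    return {x for c in classes if c & target for x in c}


# Hypothetical toy information table (NOT the paper's Table 1).
toy = {
    "x1": {"a": 0, "b": 1},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 0},
    "x4": {"a": 1, "b": 1},
}
target = {"x1", "x3"}

classes = partition(toy, ["a", "b"])
low = lower_approximation(classes, target)
up = upper_approximation(classes, target)
boundary = up - low
print(low, up, boundary)
```

In this toy table, `x1` and `x2` are indiscernible, so `x1` cannot be surely classified into the target set and falls into the boundary region together with `x2`.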

*Example 1.* Table 1 represents the data table of house purchasing. Table 1 contains seven houses that are related by means of five attributes: every house is described using four condition attributes, $\beta_1, \beta_2, \beta_3, \beta_4$, and one decision attribute, $\beta_5$.

Table 1 is a sample of a simple information table of seven objects, in which $U = \{H_1, H_2, H_3, H_4, H_5, H_6, H_7\}$. The decision table can be written as the pair $(U, A)$, where $U$ is the universal set of houses and $A$ is the collection of all attributes. Let $C$ (condition attributes) and $D$ (decision attributes) be subsets of the attribute set $A$. The indiscernibility relations and equivalence classes of the decision table are obtained attribute by attribute. For the first indiscernibility relation, two objects share the value “high,” four objects share the value “medium,” and one object has the value “low”; the indiscernibility relations of the other condition attributes are computed similarly. An indiscernibility relation is an equivalence relation that splits the set of objects into equivalence classes, and every equivalence class consists of all objects that are similar with respect to the given set of attributes. In this example,

$$U/D = \{(H_1, H_2, H_4, H_6), (H_3, H_5, H_7)\}, \qquad U/C = \{(H_1, H_3), (H_2, H_4), (H_5), (H_6), (H_7)\}.$$

Using the lower and upper approximations defined above, we compute the approximations of the set of houses having “good” purchasing performance, $X_{\text{good}} = \{H_1, H_2, H_4, H_6\}$, and of the set of houses having “average” purchasing performance, $X_{\text{average}} = \{H_3, H_5, H_7\}$:

(1) lower approximation of class “good” purchasing performance = $\{H_2, H_4, H_6\}$;

(2) upper approximation of class “good” purchasing performance = $\{H_1, H_2, H_3, H_4, H_6\}$.

In terms of the RST, objects $H_2$, $H_4$, and $H_6$ come in the lower approximation of $X_{\text{good}}$, i.e., these three objects surely belong to the set of houses having “good” purchasing performance. $H_1$, $H_2$, $H_3$, $H_4$, and $H_6$ come in the upper approximation of $X_{\text{good}}$; these five objects possibly belong to the set of houses with “good” purchasing performance.

Analogously, we can obtain the approximations of the set of houses having purchasing performance “average,” i.e., $X_{\text{average}}$:

(1) lower approximation of class “average” purchasing performance = $\{H_5, H_7\}$;

(2) upper approximation of class “average” purchasing performance = $\{H_1, H_3, H_5, H_7\}$.
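The approximations of Example 1 can be checked with a short Python sketch that starts from the partitions $U/C$ and $U/D$ reported in Section 2.4. The variable and function names are assumptions made for illustration only, and the assignment of the decision class $\{H_1, H_2, H_4, H_6\}$ to “good” follows from the cardinalities stated in the example (three objects in the lower and five in the upper approximation).

```python
# Partition U/C induced by the condition attributes (from Section 2.4).
U_C = [{"H1", "H3"}, {"H2", "H4"}, {"H5"}, {"H6"}, {"H7"}]

# Decision classes U/D (from Section 2.4).
good = {"H1", "H2", "H4", "H6"}
average = {"H3", "H5", "H7"}


def lower(classes, target):
    """Objects whose whole equivalence class lies inside the target."""
    return {x for c in classes if c <= target for x in c}


def upper(classes, target):
    """Objects whose equivalence class shares an element with the target."""
    return {x for c in classes if c & target for x in c}


lower_good, upper_good = lower(U_C, good), upper(U_C, good)
lower_avg, upper_avg = lower(U_C, average), upper(U_C, average)

print(sorted(lower_good), sorted(upper_good))
print(sorted(lower_avg), sorted(upper_avg))
```

The class $(H_1, H_3)$ straddles both decision classes, which is why $H_1$ and $H_3$ appear only in the upper approximations.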

##### 2.2. Accuracy of Approximation and Quality of Approximation in the RST [24]

Inexactness of a category (set) can occur due to the existence of a boundary line region. As the boundary line region of a category increases, the accuracy of the set decreases.

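Numerically, we can define the accuracy of approximation in the RST by the following:

$$\alpha_B(X) = \frac{|\underline{B}X|}{|\overline{B}X|},$$

where $|\cdot|$ represents the cardinality of a set.

Clearly, $0 \le \alpha_B(X) \le 1$. If $\alpha_B(X) = 1$, $X$ is exact with respect to $B$; otherwise, $X$ is rough (ambiguous) with respect to $B$, when $\alpha_B(X) < 1$.

Let $F = \{X_1, X_2, \ldots, X_m\}$ be a partition of the universe $U$, where $X_j$, *j* = 1, 2, …, *m*, are the classes of $F$, and let $B \subseteq A$; then, the coefficient

$$\gamma_B(F) = \frac{\sum_{j=1}^{m} |\underline{B}X_j|}{|U|}$$

is known as the quality of approximation of the classification $F$ by the set of attributes $B$, where $|\cdot|$ represents the cardinality of a set. The quality of classification expresses the ratio of all correctly classified objects to all objects of $U$, employing the knowledge $B$.

If the quality of approximation $\gamma_B(F) = 1$, then $F$ entirely depends on $B$; otherwise, $F$ is partially dependent on $B$, when $\gamma_B(F) < 1$. If $R \subseteq B$ is a minimal subset such that $\gamma_R(F) = \gamma_B(F)$, it is called an $F$-reduct of $B$. Information systems may contain more than one $F$-reduct. The intersection of all reducts is known as the core of $B$.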

For example, let $X_{\text{average}}$ and $X_{\text{good}}$ denote the sets of houses in the classes of “average” and “good” purchasing performance, respectively. Their accuracies of approximation with respect to the condition attributes $C$ are $\alpha_C(X_{\text{average}}) = 2/4 = 0.5000$ and $\alpha_C(X_{\text{good}}) = 3/5 = 0.6000$, respectively. The quality of approximation can be computed using the quality-of-approximation coefficient (formula (4)) as 5/7 = 0.7143, since the total number of elements in the lower approximations is five and the total number of objects is seven.
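The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch: the lower and upper approximations are entered by hand from the partitions $U/C$ and $U/D$ given in Section 2.4, and the variable names are assumptions.

```python
from fractions import Fraction

# Approximations of the two decision classes, entered from Example 1.
lower_good = {"H2", "H4", "H6"}
upper_good = {"H1", "H2", "H3", "H4", "H6"}
lower_avg = {"H5", "H7"}
upper_avg = {"H1", "H3", "H5", "H7"}
n_objects = 7  # |U|

# Accuracy of approximation: |lower| / |upper| for each class.
acc_good = Fraction(len(lower_good), len(upper_good))
acc_avg = Fraction(len(lower_avg), len(upper_avg))

# Quality of approximation: total size of the lower approximations / |U|.
quality = Fraction(len(lower_good) + len(lower_avg), n_objects)

print(f"{float(acc_avg):.4f} {float(acc_good):.4f} {float(quality):.4f}")
```

Exact fractions are used so that the rounded values can be compared directly with the four-decimal figures quoted in the text.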

##### 2.3. Reduction of Knowledge

Attribute reduction removes unnecessary attributes from the dataset and yields a necessary attribute subset for the information system. Such an attribute subset is known as a reduct, and it is an essential part of the information system.

##### 2.4. Positive Region, Reduct, and Core

The positive region [24] is a crucial concept of the RST. The $C$-positive region of $D$ is the set of all objects of the universal set $U$ that can be assuredly categorized into the classes of $U/D$ by the attributes from $C$.

The $C$-positive region of $D$ is described as follows:

$$POS_C(D) = \bigcup_{X \in U/D} \underline{C}X.$$

Furthermore, an attribute *b* is called dispensable in $C$ with respect to $D$ if $POS_{C - \{b\}}(D) = POS_C(D)$; otherwise, the attribute *b* is an indispensable attribute in the information table. $R \subseteq C$ is known as a reduct of the attribute set $C$ with respect to the decision attribute $D$ if and only if $R$ consists only of attributes that are indispensable in $R$ and $POS_R(D) = POS_C(D)$.

The core is the intersection of all reducts of $C$; it consists of all indispensable attributes of the information table. The core therefore contains the most essential part of the information system, which cannot be removed.

The core can be denoted as follows:

$$CORE(C) = \bigcap RED(C),$$

where $RED(C)$ is the set of all reducts of $C$.

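Now, following Example 1, the positive region of the decision attribute $D$ with respect to $C$ can be calculated as

$$POS_C(D) = \bigcup_{X \in U/D} \underline{C}X,$$

where $C = (\beta_1, \beta_2, \beta_3, \beta_4)$ and $D = (\beta_5)$.

With $U/C = \{(H_1, H_3), (H_2, H_4), (H_5), (H_6), (H_7)\}$ and $U/D = \{(H_1, H_2, H_4, H_6), (H_3, H_5, H_7)\}$, we have $POS_C(D) = \underline{C}X_1 \cup \underline{C}X_2$, where $\underline{C}X_1 = \{H_2, H_4, H_6\}$ and $\underline{C}X_2 = \{H_5, H_7\}$ are the lower approximations of the decision classes $X_1 = \{H_1, H_2, H_4, H_6\}$ and $X_2 = \{H_3, H_5, H_7\}$.

Hence, $POS_C(D) = \{H_2, H_4, H_5, H_6, H_7\}$.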

For computing the reduct and the core, first, we calculate the indiscernibility relation for the sequence of attribute sets:

(1) $U/(C - \{\beta_1\}) = \{(H_1, H_3), (H_2, H_4, H_7), (H_5), (H_6)\}$;

(2) $U/(C - \{\beta_2\}) = \{(H_1, H_3), (H_2, H_4), (H_5), (H_6), (H_7)\}$;

(3) $U/(C - \{\beta_3\}) = \{(H_1, H_3), (H_2, H_4), (H_5), (H_6), (H_7)\}$;

(4) $U/(C - \{\beta_4\}) = \{(H_1, H_3), (H_2, H_4), (H_5), (H_6), (H_7)\}$.

Thus, the indiscernibility relations of the decision attribute *D* and of the condition attributes *C* are as follows:

(i) $U/D = \{(H_1, H_2, H_4, H_6), (H_3, H_5, H_7)\}$;

(ii) $U/C = \{(H_1, H_3), (H_2, H_4), (H_5), (H_6), (H_7)\}$.

Then, using the positive region concept, we find the set of all indispensable attributes:

(1) $POS_{C - \{\beta_1\}}(D) \neq POS_C(D)$, so the attribute $\beta_1$ is indispensable;

(2) $POS_{C - \{\beta_2\}}(D) = POS_C(D)$, so the attribute $\beta_2$ is dispensable;

(3) $POS_{C - \{\beta_3\}}(D) = POS_C(D)$, so the attribute $\beta_3$ is dispensable;

(4) $POS_{C - \{\beta_4\}}(D) = POS_C(D)$, so the attribute $\beta_4$ is dispensable.

Thus, $\{\beta_1\}$ is the core in this example, i.e., $\beta_1$ is the most important attribute of the dataset.

The core is the most crucial part of the condition attribute set $C$; thus, in the decision table, the attribute “price” is necessary for decision construction.
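The dispensability check of this section can be sketched in Python directly from the partitions listed above. The names `positive_region`, `reduced`, and the keys `beta1` to `beta4` are assumptions introduced for illustration; the partitions themselves are those reported for Example 1.

```python
def positive_region(cond_classes, decision_classes):
    """Objects whose condition class fits entirely inside one decision class."""
    return {
        x
        for c in cond_classes
        if any(c <= d for d in decision_classes)
        for x in c
    }


U_D = [{"H1", "H2", "H4", "H6"}, {"H3", "H5", "H7"}]
U_C = [{"H1", "H3"}, {"H2", "H4"}, {"H5"}, {"H6"}, {"H7"}]

# Partitions U/(C - {beta_i}) as reported in the text.
reduced = {
    "beta1": [{"H1", "H3"}, {"H2", "H4", "H7"}, {"H5"}, {"H6"}],
    "beta2": U_C,
    "beta3": U_C,
    "beta4": U_C,
}

pos_full = positive_region(U_C, U_D)

# An attribute is indispensable if dropping it changes the positive region.
core = {
    attr
    for attr, classes in reduced.items()
    if positive_region(classes, U_D) != pos_full
}
print(sorted(pos_full), core)
```

Dropping $\beta_1$ merges $H_7$ with $H_2$ and $H_4$, which belong to a different decision class, so the positive region shrinks to $\{H_5, H_6\}$ and $\beta_1$ is the only indispensable attribute.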

##### 2.5. Decision-Making Using the RST

The initial dataset can be reduced without excluding any necessary attributes, which are represented in the reduct set. The minimal rule set [24, 25] is created from the reduced information table, and the structure of the decision table is simplified, by the following steps:

(1) construct an information table of the dataset;

(2) calculate the lower and upper approximations of the dataset;

(3) compute the reducts of the condition attributes, which is equivalent to eliminating some columns of the decision table;

(4) remove superfluous attribute values;

(5) determine the core attributes of the attribute set *A* and find the minimal subset for all decision attributes.

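A decision rule in an information system can be expressed as a rule of the type “IF condition THEN decision”; thus,

IF $(c_1 = v_1) \wedge (c_2 = v_2) \wedge \cdots \wedge (c_k = v_k)$ THEN $(d = v_d)$,

where $c_1, \ldots, c_k$ are condition attributes with values $v_1, \ldots, v_k$, and $d$ is the decision attribute with value $v_d$.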
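As a rough illustration of how certain IF-THEN rules arise from a decision table, the following Python sketch emits one rule for every group of indiscernible objects that agrees on the decision. It covers only the rule-generation idea, not the reduct computation or the removal of superfluous attribute values; the toy table, the attribute names, and the function name `certain_rules` are hypothetical and do not come from the case study data.

```python
from collections import defaultdict


def certain_rules(table, cond_attrs, dec_attr):
    """Return (conditions, decision) pairs for consistent condition classes."""
    groups = defaultdict(list)
    for row in table:
        key = tuple(row[a] for a in cond_attrs)
        groups[key].append(row[dec_attr])
    rules = []
    for values, decisions in groups.items():
        # A class yields a certain rule only if all its objects agree.
        if len(set(decisions)) == 1:
            rules.append((dict(zip(cond_attrs, values)), decisions[0]))
    return rules


# Hypothetical toy decision table (NOT taken from Table 3).
toy = [
    {"fare": "medium", "speed": "average", "availability": "good"},
    {"fare": "medium", "speed": "average", "availability": "good"},
    {"fare": "high", "speed": "fast", "availability": "average"},
    {"fare": "high", "speed": "fast", "availability": "poor"},
]

rules = certain_rules(toy, ["fare", "speed"], "availability")
for conditions, decision in rules:
    body = " AND ".join(f"{k} = {v}" for k, v in conditions.items())
    print(f"IF {body} THEN availability = {decision}")
```

The last two rows share the same condition values but disagree on the decision, so they lie in a boundary region and produce no certain rule.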

#### 3. Proposed Algorithmic Approach

This study applies the RST to the reservation system of the IR in order to derive rules for the selection of train berths. We generate rules to determine the availability (decision attribute) of tickets (berths) in the IR depending on the other condition attributes. All required steps of this approach are described in the following subsections (Figure 2).

##### 3.1. Problem Description and Data Cluster

The two most critical steps required to extract useful information are developing an understanding of the particular problem and setting its objectives. The present study of the IR has been performed for train ticket availability and other condition attributes. All useful information has been collected from the official website of the IR (https://www.irctc.co.in, http://indiarailinfo.com) and from passengers’ feedback. The important factors of the IRRS were critically analysed before discussing all the attributes (condition and decision). Making efficient decisions requires knowledge of the reservation system before booking train tickets for a particular journey. The collected data have been processed carefully to ensure the high quality of the subsequent analysis.

##### 3.2. RST Examination and Determination

The current study considers various condition attributes and a decision attribute of the IRRS. Twelve factors of the IRRS have been considered as condition attributes, whereas train ticket availability has been considered as the single decision attribute. Furthermore, the accuracy of approximation, the reducts, and the decision rules have been calculated using rough set methodologies. Moreover, rules based on the RST have been described for passengers with a view towards selecting the most appropriate train for their journey. The accuracy of approximation of randomly generated data is analysed with RST tools such as ROSE2.

##### 3.3. Information Extraction

To verify the pragmatism and efficiency of the analysed result, the qualified rules must be examined and inspected by the mature judgement of domain experts. Any unexpected situations regarding the considered dataset need to be examined properly before the rules are applied to develop strategies for the availability of a train berth for a passenger at the time of reservation. The rules developed for an IRRS must be reviewed at regular intervals. In order to establish the validity of the derived rule base of a dynamic system like the IRRS, the rule base should be subjected to continuous surveillance and systematic examination by domain experts.

#### 4. Case Study

This section presents the application of the proposed approach using the data from the IR.

##### 4.1. Research Problem and Data Collection [26]

The IR is the premier transport organization of India, and it is Asia’s largest and the world’s second largest rail network under a single management. The IR is owned and operated by the Government of India through the Ministry of Railways. Railways were first introduced in India in 1853, between Mumbai and Thane. Thereafter, in 1951, the various constituent units of the railways were nationalised as a single unit, i.e., the IR. The current study covers the Delhi to Mumbai multi-gauge long-distance rail network, which is one of the oldest routes and carries the maximum number of passenger trains. For passenger amenities, the IR created the Indian Railway Catering and Tourism Corporation (IRCTC), which handles catering, tourism, and the IRRS. The present study is focused on the IRRS for 26 selected best passenger trains between the metro cities Delhi and Mumbai. From the passenger’s point of view, reserving a confirmed ticket on a suitable train running between the two metro cities during COVID-19 is critical. For the passengers, the possibility of getting a confirmed ticket is maximised by selecting certain related attributes of the IRRS. Therefore, the data related to the berth reservation of passenger trains have been considered one month before the date of journey. To understand the relationship between the IRRS factors and the different IRRS variables affecting the decision of the passengers, data have been collected from the IRCTC and the IR for the 26 best passenger trains on the route from Delhi to Mumbai.

The information for the present study has been collected with the help of various domain experts from the IR and the IRCTC, and also from passengers’ feedback. The passenger decision is controlled by the decision attribute, which is governed by the conditional attributes. The conditional attributes were further classified into IRRS factors, which have been derived from the IRRS variables. Hence, the driving force for decision-making in this study is the IRRS variables, which include (i) “Departure time of train,” “Travel time,” “Running days,” and “Punctuality of train” (based on the interview of the deputy chief controller/dispatcher); (ii) “Current booking status,” “Fare of ticket,” and “Ticket availability” (based on the information provided by the IR senior divisional commercial staff and zonal officer); (iii) “Distance from source to destination” and “Average Speed” (based on the Research Designs & Standards Organisation (RDSO) guideline and dispatcher); (iv) “Cleanliness,” “Food quality,” and “Railfanning” (based on a discussion with private contractors hired by the IRCTC); (v) “Safety of train” (based on the Railway Police Force (RPF) guidelines).

All considered attributes, factors, and variables with explanation can be seen in Table 2.

The collected data have been pre-processed to convert them into the most suitable format for analysis and for deriving meaningful information (Table 3).

##### 4.2. Analysis of the Case Study

For the rough set analysis, the knowledge is mapped from 13 important features (criteria) of the IR to the 26 trains arranged in Table 3. Out of these 13 attributes, 12 are considered as the condition attributes and the remaining one is recognised as the decision attribute (ticket availability). Ticket availability is divided into four decision classes: poor, average, good, and excellent. Table 4 shows the approximation of sets and the accuracy of approximation.

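The quality of approximation is 0.8462; by the definition in Section 2.2, this corresponds to 22 of the 26 trains belonging to the lower approximation of their decision class.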

##### 4.3. Decision Rules Using the RST

The decision rules of the initial dataset, together with the objects supporting each certain decision rule, are listed in Table 5.

Certain decision rules can be expressed in IF-THEN form. Here are some examples to illustrate the IF-THEN rules:

(1) IF the departure time is evening AND safety is average, THEN the ticket availability will be poor;

(2) IF the fare of ticket is medium AND the train’s speed is average, THEN the ticket availability will be good.

We can see from Table 5 that if the departure time is evening and the train’s safety is average, then ticket availability is poor, which means that there is very little chance of getting a ticket for that particular train. According to rule 4, if the running days of a train are biweekly, then ticket availability is good. According to rule 2, if the distance is far, train speed is fast, and punctuality of train is excellent, then ticket availability is average.
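To make the use of such a rule base concrete, the following Python sketch encodes the two illustrative IF-THEN rules given above and applies them to a train described by its attribute values. The dictionary keys, the sample train, and the function name `predict_availability` are assumptions chosen for illustration; Table 5 itself is not reproduced here.

```python
# The two illustrative rules from the text, as (conditions, decision) pairs.
RULES = [
    ({"departure_time": "evening", "safety": "average"}, "poor"),
    ({"fare": "medium", "speed": "average"}, "good"),
]


def predict_availability(train, rules):
    """Return the decision of the first rule whose conditions all hold."""
    for conditions, decision in rules:
        if all(train.get(attr) == value for attr, value in conditions.items()):
            return decision
    return None  # no rule covers this train


# Hypothetical train description (not a row of Table 3).
train = {
    "departure_time": "evening",
    "safety": "average",
    "fare": "high",
    "speed": "fast",
}

decision = predict_availability(train, RULES)
print(decision)
```

Returning `None` for uncovered trains keeps the sketch honest about the limits of a small rule base; a fuller implementation would need a policy for conflicting or missing rules.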

##### 4.4. Machine Learning Implementation

Using the attribute reduction technique of the rough set, we have observed that the attributes Punctuality of train (PT) and Railfanning (RF) are eliminated from the list of the 12 attributes. For comparison purposes, we have employed Recursive Feature Elimination with Cross-Validation (RFECV) as one of the machine learning (ML) techniques on the same dataset. We observed that the same attributes, PT and RF, are also eliminated from the dataset. The feature importance of the remaining 10 attributes is presented in Table 6. Here, we observe that Safety of train (SF) and Ticket availability (TA) are selected as the two most important attributes, with the maximum feature importance score as determined by the RFECV when applied on the classifiers. In this study, we have also considered two classifiers, the Random Forest Classifier (RFC) [27] and the Extra Trees Classifier (ETC) [28], and analysed their predictive capability with respect to the cross-validation score as well as six performance metrics: (i) accuracy, (ii) precision, (iii) recall, (iv) *f*1-score, (v) Hamming loss, and (vi) Matthews correlation coefficient. The cross-validation score is determined using the *StratifiedKFold* cross-validation technique, where *K* is set to 10. We have used *RandomizedSearchCV* to optimize the hyper-parameters (cf. Table 7), which are the same for RFC and ETC. For both classifiers, *random_state* is set to 9. For the experimental study, we have used *Jupyter* notebook server 6.4.2 and *Python* 3.9.6 to implement the ML techniques on our dataset.
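A minimal sketch of the feature-elimination step with scikit-learn is given below. Because the case study data are available only on request, the sketch uses synthetic data generated with `make_classification` (12 features, four classes) as a stand-in; the sample size, the number of trees, and all variable names are assumptions, and the selected features will therefore not correspond to the attributes reported in Table 6.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the 12 condition attributes and 4 decision classes.
X, y = make_classification(
    n_samples=200,
    n_features=12,
    n_informative=4,
    n_redundant=2,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=9,
)

# 10-fold stratified cross-validation, as described in the text.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=9)

# Recursive feature elimination driven by random forest importances.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=9),
    step=1,
    cv=cv,
    scoring="accuracy",
)
selector.fit(X, y)

kept = [i for i, keep in enumerate(selector.support_) if keep]
print("number of selected features:", selector.n_features_)
print("selected feature indices:", kept)
```

The same pattern applies with `ExtraTreesClassifier` as the estimator; the hyper-parameter search with `RandomizedSearchCV` is omitted here because the search space of Table 7 is not reproduced in this section.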

Subsequently, we split the dataset into training and testing datasets with 80% and 20% of the data samples, respectively. Once the classifiers are trained on the training dataset, we observe that all the data samples of the testing set are predicted correctly. This is confirmed by the six performance metrics reported in Table 8 and the confusion matrix depicted in Figure 3. For both classifiers, the values of all the performance metrics are the same and the confusion matrices are identical. However, when determining the cross-validation score with 10-fold stratified cross-validation, it is observed that ETC outperforms RFC with a better cross-validation score. This is also listed in Table 8, with the better cross-validation score highlighted in bold.
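To show why a test set that is predicted entirely correctly gives the best value of each metric, the following sketch computes accuracy, Hamming loss, and the multiclass Matthews correlation coefficient from a confusion matrix in plain Python. The two confusion matrices are hypothetical and are not the ones in Figure 3; for single-label classification, the Hamming loss equals one minus the accuracy. Precision, recall, and the *f*1-score are left out for brevity.

```python
from math import sqrt


def summary_metrics(cm):
    """Accuracy, Hamming loss, and multiclass MCC from a confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    true_totals = [sum(row) for row in cm]
    pred_totals = [sum(col) for col in zip(*cm)]

    accuracy = correct / n
    hamming = 1 - accuracy  # single-label case
    numerator = correct * n - sum(p * t for p, t in zip(pred_totals, true_totals))
    denominator = sqrt(
        (n * n - sum(p * p for p in pred_totals))
        * (n * n - sum(t * t for t in true_totals))
    )
    mcc = numerator / denominator if denominator else 0.0
    return accuracy, hamming, mcc


# Hypothetical confusion matrices (NOT those of Figure 3).
perfect = [[2, 0, 0], [0, 1, 0], [0, 0, 3]]
one_error = [[2, 0, 0], [0, 1, 0], [0, 1, 2]]

acc_p, ham_p, mcc_p = summary_metrics(perfect)
acc_e, ham_e, mcc_e = summary_metrics(one_error)
print(acc_p, ham_p, mcc_p)
print(acc_e, ham_e, mcc_e)
```

With so few test samples, a single misclassification changes every metric noticeably, which is one reason the cross-validation score is the more informative comparison between the two classifiers.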

#### 5. Results, Discussion, and Conclusions

Due to the COVID-19 lockdown, numerous passenger trains of the IR were suspended. This created a chaotic situation for passengers who needed to travel urgently. Hence, to ease the situation to some extent, the Government of India allowed some important trains to run on significant routes. Delhi to Mumbai is an important route, as Mumbai is the financial capital of India. Furthermore, the route is financially profitable for the IR. Accordingly, in this study, we have focused on the 26 best trains that were functional after the COVID-19 lockdown in the country.

The railway reservation system is dynamic in nature since the status of the train berth, i.e., CNF, RAC, WL, alters quite frequently. This dynamic nature of the data is basically dependent on imprecise attributes, which have been processed in this study by employing rough set approaches. We have analysed the data using the RST, which will actually guide the passenger while reserving a train berth.

In this paper, we have presented an MCDM problem based on the IRRS. In this problem, a DM (passenger) has to decide whether to reserve a train berth for a journey on a particular train between stations, based on 12 conditional attributes of the IRRS. The analysis has been performed on real datasets corresponding to the IR. According to passenger decisions, we have analysed a set of attributes that are critical as far as the reservation of tickets on a particular train is concerned. In this context, we have used the attribute reduction technique of the rough set. This approach is also essential to identify a suitable train for the journey. For comparative analysis, we have also used machine learning approaches by considering two machine learning estimators, RFC and ETC.

In contrast to the traditional technique, the proposed approach helps passengers select a convenient train that suits their journey.

As a future research interest, the extension of the proposed data mining technique to a decision support system could prove beneficial to its proprietor. Further, the natural language interpretation of the decision rules could be made easier to understand. Moreover, the development of suitable metrics to remove redundant and contradictory decision rules is also considered as our future research interest.

#### Data Availability

The data used to support the findings of this study are available from the first/corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Authors’ Contributions

Haresh Kumar Sharma and Saibal Majumder conceptualized the study. Haresh Kumar Sharma developed the methodology; Saibal Majumder and Arindam Biswas dealt with the software; Haresh Kumar Sharma and Saibal Majumder validated the results; Olegas Prentkovskis and Paulius Skačkauskas were responsible for the formal analysis; Haresh Kumar Sharma and Olegas Prentkovskis dealt with the investigation; Paulius Skačkauskas collected the resources; Haresh Kumar Sharma and Saibal Majumder curated the data; Haresh Kumar Sharma prepared and wrote the original draft; Haresh Kumar Sharma, Saibal Majumder, Olegas Prentkovskis, Paulius Skačkauskas, and Samarjit Kar reviewed and edited the manuscript; Haresh Kumar Sharma and Arindam Biswas were responsible for visualization; Olegas Prentkovskis and Paulius Skačkauskas supervised the study. All the authors have read and agreed to the published version of the manuscript.