Discovering Organizational Hierarchy through a Corporate Ranking Algorithm: The Enron Case

Creamer, Germán G.; Stolfo, Salvatore J.; Creamer, Mateo; Hershkop, Shlomo; Rowe, Ryan

doi:https://doi.org/10.1155/2022/8154476

Complexity

Special Issue

Tales of Two Societies: On the Complexity of the Coevolution between the Physical Space and the Cyber Space 2021

Research Article | Open Access

Volume 2022 | Article ID 8154476 | https://doi.org/10.1155/2022/8154476

Germán G. Creamer,¹ Salvatore J. Stolfo,² Mateo Creamer,³ Shlomo Hershkop,⁴ and Ryan Rowe⁵

Academic Editor: Ning Cai

Received 25 Aug 2021

Revised 07 Dec 2021

Accepted 05 Jan 2022

Published 21 Feb 2022

Abstract

This paper proposes the CorpRank algorithm to extract social hierarchies from electronic communication data. The algorithm computes a ranking score for each user as a weighted combination of the number of emails, the number of responses, average response time, clique scores, and several degree and centrality measures. The algorithm uses principal component analysis to calculate the weights of the features. This score ranks users according to their importance, and its output is used to reconstruct an organization chart. We illustrate the algorithm over real-world data using the Enron corporation’s e-mail archive. When compared with the actual corporate work chart, compensation lists, judicial proceedings, and an analysis of the major players involved, the results show promise.

1. Introduction

A significant challenge in any organization is identifying the underlying organizational structure that might be different from the official version. A simple way to approximate this structure and the corporate hierarchy is based on the pattern of communication between its members. Competitors or clients might use the disclosure of corporate conversations to reveal strategic information such as the development of new products or corporate finance decisions that may have a significant impact on the stock price. However, it is challenging to collect that information, especially considering corporate communications’ high level of confidentiality. This paper proposes a method to approximate this organizational structure and hierarchy using a corporate e-mail dataset in an automated fashion. Several bankruptcy scandals in publicly held U.S. companies such as Enron offer a collection of electronic communication records that could be used to identify its organizational structure.

The Enron Corporation’s e-mail collection, described in Section 2, is a publicly available set of private corporate data released during the judicial proceedings against the Enron Corporation. Several researchers have explored it mainly from a natural language processing (NLP) perspective [1–5]. Social network analysis (SNA), which examines structural features (Diesner and Carley [6]), has also been applied to extract properties of the Enron network and to detect the key players around the time of Enron’s crisis: Diesner et al. [7] studied the patterns of communication of Enron employees differentiated by their hierarchical level; Cotterill [8] investigated a set of stylistic language features to predict organizational hierarchy relationships; Danescu-Niculescu-Mizil [9] presented an analysis framework on linguistic coordination to analyze power relationships in static and situational forms; Chundi et al. [1] conducted a time series analysis of communication patterns from Enron’s email; Priebe et al. [10] developed a theory of scan statistics on graphs for anomaly detection using time series; Berry and Browne [11] applied a non-negative matrix factorization to identify and monitor semantic features and message clusters; Keila and Skillicorn [12] found, interestingly enough, that word use changed according to the functional position; Padmanabhan et al. [13] conducted a thread analysis to determine employees’ responsiveness; Elsayed and Oard [14] presented a method for identity resolution in the Enron e-mail dataset; Bar-Yossef et al. [15] applied a cluster ranking algorithm based on the strength of the clusters to this dataset; Zhang et al. [16] modeled interactions between groups using the semantic content of Enron’s email; Menges et al. [17] proposed an agent-based approach to model email-based social networks; Pathak and Srivastava [18] presented a technique to extract concealed relations from social networks; Chapanond et al. [19] used graph theoretical and spectral analysis to discover structures within Enron’s organizations; and Shetty and Adibi [20] used an entropy model to identify the most relevant people in the organization.

The problem of organization structure, networks’ properties, and hierarchical discovery has also been explored by several researchers without using Enron’s database: Hu and Liu [21] reconstructed a multiplex network from an unstructured personal e-mail corpus to analyze social status and roles; Maiya and Berger-Wolf [22] proposed a method to infer a maximum likelihood social hierarchy from weighted social networks; Memon et al. [23] developed an algorithm to capture hidden hierarchies in terrorist networks; Kemp and Tenenbaum [24] implemented a method to discover hierarchies that are optimal for a given dataset; Freeman [25] adapted a method, a canonical analysis of asymmetry, to study hierarchies in organizations; Aral and Van Alstyne [26] analyzed the tradeoff between network diversity and communication channel bandwidth in the case of information diffusion.

Our previous paper was published as follows: Creamer et al. [27] proposed a method to calculate a social score that approximates the organizational hierarchy of Enron using the e-mail Mining Toolkit (EMT) project [28, 29]. Several authors have incorporated our work ([27]) in the area of social hierarchy detection as a basis for their research as follows: Agarwal et al. and Maktoubian et al. [30, 31] incorporated SNA with a simple lower bound instead of an upper bound for social network-based systems; Palus et al. [32] applied the resulting social scores from this research to distinguish levels of hierarchy in the Enron Corpus; Kalia et al. [33] adopted the mentioned social features of degree and betweenness centrality to identify organizational hierarchy, adding an emphasis on content, patterns, and emotions of messages; Nguyen and Zheng [34] referenced this research to assume that a person’s influence score is positively correlated with his rank and predict influence spreads in Twitter communities. Others have studied the application of social networks to practical corporate improvement methods as follows: Hovelynck and Chidlovskii [35] adopted commonly used features of nodes to represent key properties of actors in response to this work, assigning a social score for each node to improve classification performance; Li and Somayaji [36] applied SNA to organizational access control; Michalski et al. [37] matched social network hierarchies in organizations with a stable corporate structure to improve company management; Rivera-Pelayo [38] considered the application of data mining and SNA for its program, ExpertSN, allowing for effective people search in a given work context; Ganjaliyev [39] proposed a new method to identify network communities to enhance social network analysis; and Wang et al. [40] used HumanRank, a method of ranking individuals based on importance using personal electronic interactions.

A possible concern with this line of research is that corporations’ emails or electronic communication records are difficult to obtain or that other electronic channels are substituting email systems. However, this line of research is becoming more important as several recent papers have extended our initial research [27] based on e-mail communications or enterprise social networks (ESN; i.e., intranet) to evaluate organizational structures using cluster validation [41]; to infer professional roles in an organization [42]; to extract the social structure and its hierarchy using the flow hierarchy derived from frequent interactions [43, 44]; to evaluate the impact of formal hierarchies on ESN [45, 46]; to classify power relations [47]; to classify texts incorporating social network information [48]; to detect relations and anomalies in text and speech [49]; to detect organizational structure based on machine learning and social network analysis [50]; to identify key actors in an organization [51, 52]; to detect communities [53]; to assign roles and community discovery in social networks [54]; and to visualize a social network with emails’ topics [55].

One of the major limitations of our previous research ([27]) is that the calculation of the social score gives equal weights to all the inputs without considering their relative importance to approximate the outcome. In this paper, we expand our previous work [27] and propose the CorpRank algorithm to rank the officers of an organization based on principal component analysis (PCA). We use PCA to weigh several social network and e-mail indicators to calculate each officer’s importance and compare our results with several community detection and clustering algorithms.

The rest of the paper is organized as follows: Section 2 describes the dataset; Section 3 presents the CorpRank algorithm; Section 4 describes the research design; Section 5 presents the results; Section 6 discusses the results; and Section 7 concludes the paper.

2. Enron Antecedents and Data

In this paper, we will use the well-known Enron e-mail dataset for the period 1998–2002 [6, 56–58]. As a part of Enron’s legal proceedings in 2002, the Federal Energy Regulatory Commission (FERC) built a public dataset of 619,449 emails without attachments from 158 Enron employees. The emails used came from about 92% of relevant Enron employees. We use a version of the dataset provided by Shetty and Adibi [58] that has deleted extraneous and unnecessary emails and has fixed some anomalies in the collection data having to do with empty or illegal user e-mail names and bounced email messages. Any duplicates or blank emails were not included. Our dataset, called the ENRON dataset, has 149 users after cleaning. We also included the position of each officer as provided by Shetty and Adibi [58]. According to this information, 38.5% of the users are classified as “employee” or “N/A.” “employee” only means that the user is working at Enron. We imputed the positions of several employees that were not well classified after reviewing their emails. We looked at signatures, content, internal traders’ list, and documents released during the bankruptcy proceedings [59]. After this review, we classified many “unknown” employees as traders or supporting traders.

We also used a FERC dataset that we call TRADER, with emails of 47 members of the North American West Power Trading division of Enron. We also had access to the organization chart of this division as provided by McCullough [60].

As a proxy of organizational importance, we used the executive compensation data provided by the United States Congress Joint Committee on Taxation [59], which included reports of outside consultants such as Towers Perrin [61].

3. The CorpRank Algorithm

We propose a ranking algorithm to evaluate the importance of the participants of an e-mail network. Even though the application of this algorithm could be used with any social group, we apply it to a corporate environment, as the ranking allows organizations to approximate the organizational hierarchy of a corporation.

The primary input is an e-mail corpus with a fixed number of accounts, each one of individual users.

The algorithm creates an undirected network where every account is a vertex. An edge is made between two vertices only when a minimum number of emails are transmitted between them. The weights of the edges are the number of emails exchanged between each pair of accounts.
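The graph construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation (which relied on R and EMT): the function names `build_email_graph` and `adjacency`, the `(sender, recipient)` input format, and the `min_emails` parameter name are assumptions introduced here.

```python
from collections import Counter


def build_email_graph(messages, min_emails=1):
    """Return {frozenset({a, b}): weight} for account pairs that exchanged
    at least `min_emails` emails (hypothetical parameter name).

    `messages` is an iterable of (sender, recipient) pairs. Direction is
    ignored because the network is undirected.
    """
    counts = Counter()
    for sender, recipient in messages:
        if sender != recipient:  # assumption: self-addressed mail is skipped
            counts[frozenset((sender, recipient))] += 1
    # Keep only the pairs that reach the minimum-exchange threshold;
    # the edge weight is the number of emails exchanged.
    return {pair: n for pair, n in counts.items() if n >= min_emails}


def adjacency(edges):
    """Convert the edge dictionary into a {vertex: set(neighbors)} map."""
    adj = {}
    for pair in edges:
        a, b = tuple(pair)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj
```

Accounts whose exchanges all fall below the threshold do not appear in the adjacency map; whether such accounts should be kept as isolated vertices is a design choice the text leaves open.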

The algorithm also finds all maximal complete cliques (subgraphs). It obtains several scores associated with the importance of the cliques according to their size and the average response time of the primary account. The assumption is that users who belong to more cliques, and to larger ones, are more important.
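To make the clique step concrete, the following sketch enumerates maximal cliques with a basic Bron-Kerbosch recursion (without pivoting) and accumulates a size-based score per account. Two points are assumptions rather than statements from the text: the per-clique score 2^(size − 1) is summed over all cliques that contain an account, and cliques with fewer than two members are ignored. The function names are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate the maximal cliques of a graph given as {vertex: set(neighbors)}.

    Plain Bron-Kerbosch recursion without pivoting.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: already-processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def raw_clique_scores(adj, min_size=2):
    """Sum 2**(size - 1) over every maximal clique containing each vertex.

    Summing over cliques and the `min_size` filter are assumptions of this sketch.
    """
    scores = {v: 0 for v in adj}
    for clique in maximal_cliques(adj):
        if len(clique) >= min_size:
            for v in clique:
                scores[v] += 2 ** (len(clique) - 1)
    return scores
```

The exponential weighting means that membership in a single large clique outweighs membership in several small ones, which matches the stated assumption that larger cliques signal greater importance.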

The final output of the algorithm is a score that ranks the users according to the mentioned clique scores, a set of social network indicators, the number of emails, the number of responses, and the average response time. This last feature is calculated as the time elapsed between the moment an e-mail is sent and the moment a response is received, counting only responses received within three business days. This limit avoids counting very long response times or unrelated answers that arrive after a long delay. Average response time and the number of emails are considered indicators of “importance” because the most relevant messages probably receive priority over the rest, and because a senior management position may require the supervision of a large unit, which may increase the volume of emails. Another essential set of features describes the nature of the network’s connections and the social network structure, such as different measures of centrality, average distance, and the importance of hubs. All these social network metrics are indicators of the “importance” of an account in the e-mail network.
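The response-time feature can be illustrated as follows. This is a simplified sketch under stated assumptions: business days are taken to be Monday through Friday with no holiday calendar, a reply is any later email from the same contact (no subject or thread matching), and the names `add_business_days` and `average_response_hours` are hypothetical.

```python
from datetime import datetime, timedelta


def add_business_days(start, days):
    """Advance `start` by `days` weekdays (Mon-Fri); holidays are ignored."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            days -= 1
    return current


def average_response_hours(sent, received, max_business_days=3):
    """Mean delay, in hours, between each sent email and the next email
    received from the same contact within the business-day window.

    `sent` and `received` are lists of (contact, timestamp) tuples for one
    account. Returns (mean_hours or None, number_of_responses).
    """
    delays = []
    for contact, t_sent in sent:
        deadline = add_business_days(t_sent, max_business_days)
        replies = [t for c, t in received
                   if c == contact and t_sent < t <= deadline]
        if replies:
            delays.append((min(replies) - t_sent).total_seconds() / 3600.0)
    if not delays:
        return None, 0
    return sum(delays) / len(delays), len(delays)
```

The second value returned corresponds to the number-of-responses feature, so both indicators can be derived from the same pass over the data.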

We assume that all these features are perfectly reasonable in an equation for “importance.” The weight of each feature may change by situation and organization, and therefore can be optimized with a method that emphasizes the role of each feature to the variations of the dataset such as PCA. For this reason, all calculated features are normalized and combined using PCA, each with an individual contribution to an overall score with which the users are ultimately ranked. The formal definition of the CorpRank algorithm is in Algorithm 1.

	Input: a set of corporate emails with $n$ individual accounts.
(1)	Build an undirected graph $G = (V, E)$, where $V$ is the set of vertices that represents the e-mail accounts, $E$ is the set of edges, and $e_{ij}$ is the edge between vertices $v_i$ and $v_j$ that have exchanged at least $N$ emails. The value of the edge $e_{ij}$ is the number of emails exchanged between $v_i$ and $v_j$.
(2)	Find all maximal complete cliques (subgraphs) using a recursive algorithm such as Algorithm 457 [62].
(3)	Calculate the adjacency matrix $A$ and geodesic distance matrix $D$ (the matrix of all shortest paths between every pair of vertices) for $G$. $a_{ij}$ and $d_{ij}$ are the elements of $A$ and $D$, respectively. The mean of all the distances $d_{ij}$ is $L$.
(4)	The following features are calculated for each vertex $v_i$:
Number of emails (e-mail): total number of emails sent and received.
Average response time (AvgTime): average amount of time elapsed between every email sent from $v_i$ to any other account $v_j$ and the next email received by $v_i$ from account $v_j$.
Number of responses (NResponse): sum of all the responses to emails sent by $v_i$ to any other accounts $v_j$.
Number of cliques (Clique): number of all cliques that $v_i$ is contained within.
Raw clique score (RCS): $R = 2^{m-1}$, where $m$ is the number of users in the clique.
Weighted clique score (WCS): raw score weighted by the “importance” of $v_i$ according to the average response time $t$: $W = t \cdot 2^{m-1}$.
Degree centrality (Degree): $\deg(v_i) = \sum_{j} a_{ij}$.
Betweenness centrality (Betweenness): $B(v_i) = \sum_{k} \sum_{j} g_{kij} / g_{kj}$, where $g_{kij}$ is the number of geodesic paths between vertices $k$ and $j$ that include vertex $i$, and $g_{kj}$ is the number of geodesic paths between $k$ and $j$ [63].
Clustering coefficient (CC): $CC_i = \frac{2\,|\{e_{jk}\}|}{\deg(v_i)\,(\deg(v_i) - 1)}$, where $v_j, v_k \in M_i$ and $e_{jk} \in E$. Each vertex $v_i$ has a neighborhood $M$ defined by its immediately connected neighbors: $M_i = \{v_j : e_{ij} \in E\}$.
Average distance (AvgDistance): mean of the shortest path length from a specific vertex $v_i$ to all vertices $v_j$ in the graph $G$: $L_i = \frac{1}{n-1} \sum_{j} d_{ij}$, where $v_j \in V$, $j \neq i$, and $n$ is the number of vertices in $G$.
“Hubs-and-authorities” importance (Hubs): calculated with a recursive algorithm as proposed by Kleinberg [64]. “Hub” refers to the vertex that points to many authorities, and “authority” is a vertex that points to many hubs.
(5)	Each feature $x$ is mapped to a [0, 100] scale and weighted with the following formula: $w_x \cdot C_{x,i} = w_x \cdot 100 \cdot \frac{x_i - \inf x}{\sup x - \inf x}$, where $x_i$ is the value of the feature $x$ for $v_i$, and $w_x$ is the weight for the feature $x$; the supremum and infimum are computed across all $v_i \in V$.
(6)	Run a principal component analysis (PCA) on all features and select the principal components that explain at least 80% of the variance of the dataset. The weight for each feature is its normalized contribution to the variations of the selected principal components as follows: $w_x = \frac{\sum_{k} c_{xk} \lambda_k}{\sum_{k} \lambda_k}$, where $c_{xk}$ is the contribution of the feature $x$ to explaining the variation of principal component $k$, and $\lambda_k$ is the eigenvalue of the principal component $k$.
(7)	The CorpRank score, a ranking score between 0 and 100, is obtained for $v_i$ as a weighted sum of the indicators: $\mathrm{CorpRank}_i = \frac{\sum_{x} w_x \cdot C_{x,i}}{\sum_{x} w_x}$.
	Output: CorpRank score for each account $v_i$.
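Steps (5) to (7) of Algorithm 1 can be sketched with NumPy as shown below. This is a hedged illustration rather than the authors' code: it assumes that PCA is run on standardized features, that a feature's contribution to a component is its squared loading (so the contributions to each component sum to one), and that the feature weight is the eigenvalue-weighted average of those contributions over the selected components. The function name `corprank_scores` and the parameter `min_variance` are hypothetical.

```python
import numpy as np


def corprank_scores(X, min_variance=0.80):
    """X: (accounts x features) array with at least two features.

    Returns (scores in [0, 100], feature weights summing to 1).
    """
    X = np.asarray(X, dtype=float)

    # Step 5: min-max scaling of every feature to the [0, 100] range.
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    C = 100.0 * (X - lo) / span

    # Step 6: PCA through the eigen-decomposition of the covariance of the
    # standardized features (proportional to the correlation matrix).
    std = C.std(axis=0)
    Z = (C - C.mean(axis=0)) / np.where(std > 0, std, 1.0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the smallest number of components explaining >= min_variance.
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, min_variance) + 1)

    # Assumption: contribution of a feature to a component = squared loading.
    contrib = eigvecs[:, :k] ** 2
    weights = contrib @ eigvals[:k] / eigvals[:k].sum()

    # Step 7: weighted average of the scaled features.
    scores = C @ weights / weights.sum()
    return scores, weights
```

Because the squared loadings of each unit-length eigenvector sum to one, the weights computed this way sum to one as well, which keeps the final score on the same 0 to 100 scale as the individual features.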

Once we obtain the CorpRank score, we can evaluate if the scores are consistent with the organization chart of the corporation under study. Any significant difference may indicate employees playing a central role even though the organization does not correctly recognize them.

4. Research Design

We classified the workers into nine organizational categories:
(1) CEOs-Presidents: includes CEOs, chairman, chief operating officers (COOs), and presidents of Enron’s subsidiaries.
(2) Executive V.P.s: includes executive vice presidents, functional chief officers (risk, staff, and general counsel), and an assistant to the president. This latter position may qualify as “employees;” however, an employee may communicate with the rest as their immediate subordinates.
(3) Attorneys-legal asst.: includes lawyers, legal specialists, and general counsel assistant.
(4) Managing directors
(5) Vice presidents
(6) Directors: includes directors and senior directors.
(7) Managers: includes managers, senior managers, senior specialists, specialists, associates, analysts, functional managers (risk), and contractors.
(8) Traders: includes analysts, senior specialists, specialists, and associates engaged in or supporting trading activities.
(9) Employees: includes staff members not well classified in the other categories or with an unknown position.

We ranked the employees of both datasets ENRON and TRADER using the CorpRank score. We separated both datasets and evaluated our results in three equal-sized clusters: high, medium, and low scores, respectively. Additionally, we grouped Enron workers as senior managers (CEOs-Presidents, Executive V.P.s, Managing directors, and Vice presidents), middle managers (directors and managers), and others. We expect a particular relationship between the cluster and the occupational category of each employee. For instance, senior managers, middle managers, and employees should be mainly in the first, second, and third segments.
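The split into three equal-sized clusters can be illustrated with a short helper. It is a sketch under assumptions: accounts are ranked by descending score, ties keep the order produced by the sort, and when the number of accounts is not divisible by three the segment sizes differ by at most one. The name `tercile_segments` is hypothetical.

```python
def tercile_segments(scores):
    """Map {account: score} to {account: segment}, where segment 1 holds the
    highest third of the scores, 2 the middle third, and 3 the lowest third.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    segments = {}
    for position, account in enumerate(ranked):
        # Integer arithmetic assigns positions to thirds of the ranking.
        segments[account] = min(3, position * 3 // n + 1)
    return segments
```

Cross-tabulating these segments against the occupational categories yields the kind of contingency table used in the validation.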

To validate our work, we built contingency tables for ENRON and TRADER with the three clusters of the CorpRank score and the occupational categories. In the case of TRADER, the dataset did not include employees of the first three organizational categories of the above list. Using the Chi-Square statistic, we tested the null hypothesis that the ENRON or TRADER contingency table is not different from the expected contingency table, that is, a table that distributes the same number of workers homogeneously among the three clusters. We also contrast our results with the clusters generated by the following algorithms, listed in increasing order of complexity. Except for the community detection algorithms, these algorithms generated three clusters to be consistent with our previous classification:
(1) Community detection algorithms: communities or clusters are detected using the e-mail network.
(a) Walktrap algorithm: finds an efficient community structure according to a measure of similarity between vertices based on random walks [65].
(b) Edge-betweenness algorithm: discovers community structures of a network by iteratively removing the edges with the highest betweenness, generating a hierarchical map of the different modules [66].
(2) K-means: partitions the data by finding a prespecified number of clusters that minimizes the within-cluster variation, using the features that generate the CorpRank score introduced in Algorithm 1.
(3) PCA hierarchical clustering (PCA HC): uses as inputs the principal components, and their coordinates, of the features of Algorithm 1. PCA HC initially treats every observation as a cluster. Iteratively, this method fuses the two clusters with the smallest distance, such as the Euclidean distance, and repeats the same process until all the observations belong to a single cluster.
(4) PCA K-means: conducts a K-means using the coordinates of the principal components of the features of Algorithm 1.
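The Chi-Square comparison against a homogeneous distribution can be sketched as follows. The sketch assumes that the expected count in every cell is the row total divided evenly among the clusters; the text does not spell out the exact construction or the degrees of freedom, so only the statistic is computed here and the significance lookup is left to a statistics package. The function name and the counts in the usage note are hypothetical.

```python
def chi_square_vs_uniform(table):
    """Chi-Square statistic of a contingency table against a table that
    spreads each category evenly across the clusters.

    `table` maps each occupational category to its list of counts per
    cluster, e.g. [n_high, n_medium, n_low].
    """
    statistic = 0.0
    for counts in table.values():
        total = sum(counts)
        if total == 0:
            continue  # empty categories contribute nothing
        expected = total / len(counts)
        statistic += sum((obs - expected) ** 2 / expected for obs in counts)
    return statistic
```

For instance, with made-up counts `{"senior": [6, 0, 0], "other": [2, 2, 2]}`, the first row is concentrated in one cluster and contributes the whole statistic, while the evenly spread second row contributes nothing.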

We calculated the average bonus of 2001 for each hierarchical category as an additional proxy of organizational importance. The bonus combines the regular performance bonus and a retention bonus paid by Enron in 2001 to maintain its crucial employees (this information is available in [59] and [61]). This bonus is a direct indicator of the importance that each employee acts as a leader or as a profit generator for the company. For instance, some star energy traders received bigger bonuses than some senior managers because of Enron’s urgency to generate significant revenues in a short time.

We used R and EMT [29], a Java-based e-mail analysis engine built on a database backend, for data processing and analysis.

5. Results

The first three principal components capture 78.4% and 93% of the variance for the ENRON and TRADER datasets, respectively (Figure 1). For both datasets, the features betweenness centrality and clustering coefficient, and for ENRON alone, the average response time and average distance, have a much smaller contribution to the first three principal components than the rest (Table 1). Average response time is associated with the importance of emails; however, this effect is already captured by the number of responses. The remaining features are all social network features, among which degree, hubs, and clique features are dominant.

The community detection algorithm Walktrap shows the highest modularity for both the ENRON and TRADER datasets (Table 2). Modularity is an indicator of the quality of a partition: it evaluates whether the number of links inside each community exceeds the number of links expected at random [66].
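As a reference for how modularity is usually computed, the sketch below implements the standard Newman formulation for an unweighted, undirected graph: for each community, the fraction of edges that fall inside it minus the squared fraction of edge endpoints attached to it. Treating the e-mail network as unweighted is a simplifying assumption of this sketch, and the function name is hypothetical.

```python
def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph.

    `edges` is an iterable of (u, v) pairs, each edge listed once;
    `community` maps every vertex to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # edges with both endpoints in the community
    degree_sum = {}  # total degree of the vertices in the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())
```

Placing all vertices in a single community gives a modularity of zero, so positive values indicate more within-community links than a random arrangement with the same degrees would produce.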

Hence, we use Walktrap as our primary benchmark to evaluate the quality of our algorithm. In this respect, the structure of clusters generated by the Walktrap algorithm is not significantly different than the expected value for ENRON (Table 3); while for TRADER, it is significant at the 95% confidence level (Table 4-E). In both datasets, most observations of most occupational categories are concentrated in one segment. Therefore, the algorithm cannot separate observations into clusters related to their organizational importance.

The three segments defined by the CorpRank score have aggregated Enron’s employees in a nonrandom pattern (Table 5). The Chi-Square test rejects the null hypothesis of a random assignment with a $p$ value smaller than 0.1% for the ENRON (Table 6-A) and TRADER (Table 4-A) datasets.

To evaluate if the aggregation generated by the CorpRank score also corresponds to the organizational hierarchy, we obtained the average bonus of 2001 for each organizational category (see Tables 7 and 8). The correlation coefficient between the average CorpRank score and the average bonus for each organizational category is 0.9 and 0.83 for ENRON and TRADER, respectively. Even though the number of observations (equal to the number of organizational categories) is small, a high correlation coefficient is a good indicator that the two features move in the same direction. A casual observation of Tables 7 and 8 shows that the average CorpRank score ranks the organizational categories according to their expected importance: senior management, middle management, and traders.

As the traders were the primary profit drivers at Enron, we analyze the organizational hierarchy of traders and managers independently in the next section. There are alternative communication systems for the traders (instantaneous messages, tweets, phones, Bloomberg, or trading terminals). They do not have many emails in our datasets, and they communicate primarily among themselves. Hence, they may have lower CorpRank scores than the rest of the workers. However, in the Enron North American West Power Trading division, the traders are more critical than employees, as this is a specialized trading division. They are mainly distributed between the first and third cluster.

Employees are mainly concentrated in the second cluster for TRADER and in the third cluster for ENRON. According to the emails, some of the employees have a lot of influence and could have been assistants of senior managers or directors. However, the emails studied did not indicate their professional position. For this reason, the average CorpRank score for this group is not the lowest one as expected.

6. Discussion

6.1. Analysis of Complete ENRON Dataset

Tables 5 and 6 show that most of the senior managers and the legal team (Attorneys-legal asst.) are concentrated in the first segment, which has the most significant CorpRank score. Even though 78% of CEOs and presidents are in this segment, this proportion decreases to 62.5% for the vice presidents. A particular case is the legal team because it does not include any senior managers (the general counsel and other vice presidents involved in legal aspects were classified as senior managers); however, it has about the same average CorpRank score as the executive VPs and managing directors (see Table 7). This can be explained considering the central role that this professional group played while Enron hid its losses using new financial vehicles and then filed for bankruptcy in 2001.

There is a critical jump between middle managers and senior managers. About 46% of the directors are in the second segment according to the CorpRank score (Table 5), while managers have about the same presence in the third segment. The distribution of these two groups reflects their role as middle managers where the directors are at the top of the group and regularly have several managers in their teams. The group of managers is very diverse and may have small team leaders and subject-matter experts with minimal contact with the rest of the organization.

The top-ranked employees according to the CorpRank score include Liz Taylor, assistant to the president and COO (proxy of Gregg Whalley and Jeff Skilling); Louise Kitchen, president of Enron Online, who had tremendous importance because she implemented the online trading capability; Sally Beck, COO; and Ken Lay, CEO (Table 9). Some of the most influential leaders, such as Gregg Whalley and Jeff Skilling, are not at the top of the scale (they are in the 18th and 35th places, respectively); however, their assistant, Liz Taylor, is (Gregg Whalley replaced Jeff Skilling as president and COO after the latter resigned on August 14, 2001). Her high ranking reflects the standing of Gregg Whalley and Jeff Skilling.

The clusters generated by PCA K-means and PCA hierarchical clustering split the occupational categories into one cluster for the senior managers and another for the rest. Even though most occupational categories are concentrated in two clusters, these clusters are still consistent with the CorpRank score’s clusters. A significant difference is in the structure generated by K-means, where most of the observations are concentrated in one cluster. As Figure 2(a) shows, K-means generates very well-differentiated clusters according to the most relevant features evaluated. At the same time, there is a certain overlap among the CorpRank score’s clusters (Figure 2(b)). A similar pattern is observed for the Enron North American West Power Trading division (although the figures are not included to save space).

This analysis reflects the flexibility of our approach, as the organizational importance of Enron’s employees may lead to situations where a senior manager and a manager have similar values across different features; however, they still belong to different clusters. Additionally, the PCA weighting helps identify the most relevant features, and the final ranking of employees is associated with their organizational importance.

6.2. Analysis of North American West Power Trading Division

Table 4 shows that the senior managers are concentrated in the first segment (high CorpRank score), middle managers and employees in the second segment, and traders in the third segment (low CorpRank score). Different than what we observed in the previous section, the other cluster algorithms included in Table 4 do not show a consistent pattern to separate the employees by occupational categories. Partially, this might be explained because TRADER has about a third of ENRON’s observations. Additionally, Table 8 demonstrates that the average CorpRank score and the average bonus follow the same trend.

As seen in Table 10, we can reproduce with high accuracy the very top of the hierarchy of the North American West Power Trading division. Figure 3 shows that Tim Belden, head of the trading operation, and his administrative assistants appear on the top of our list and the organization chart. Most of the directors and a critical number of managers and specialists are in the first fifteen positions. It is more difficult to differentiate between employees at the lower level of the organization chart as they may have similar communication patterns. However, it is possible to build a hierarchy from small groups up as the algorithm recognizes the two or three most essential individuals in every segment (interested researchers can find organization charts of public companies in their annual reports).

7. Conclusions and Future Work

This paper shows the CorpRank algorithm’s capacity to capture the organizational hierarchy in a corporation, which may differ from the formal organizational structure. However, this variation may indicate different degrees of leadership or communication that senior managers may use to recognize informal leaders and influencers in an organization. We tested this algorithm on the Enron dataset; however, it can be extended to other corporations or social groups. We also show that our method provides more information than different well-known clustering algorithms.

The next step in this research is to explore organizational changes in a dynamic framework using the Enron dataset. By varying the feature weights, it is possible to use the mentioned parameters to identify the most critical officers in an organization, cluster individuals by their social attributes, organizational characteristics, or compensation, and draw a chart of the actual organizational hierarchy in question.

We could also include some variations in our algorithm to improve our calculations. First, the average response time can be defined by the order of responses like in [67]. Another approach is to take into account the e-mail usage pattern for each officer and adjust the received time of email to the beginning of the next common e-mail usage time. Consequently, this may improve the average response time calculation as people have different work schedules. A third improvement is aggregating users by percentile or standard deviations of common distributions. Furthermore, rather than ignoring the clique connections, the graph edges could help group users according to their social attributes, organizational characteristics, or compensation.

Future work would extend our model using dynamic processes in social networks [22, 68] and analyze individual business units as we did in the case of the Enron North American West Power Trading division. Chung et al. [68] show that the temporal signatures of an e-mail thread by officers of a trading unit are consistent with their hierarchical relationship with the president of the division. These results are similar to the results presented in this paper, where we automatically extracted “informal” relationships among employees based on communication patterns. Another direction would be to independently evaluate the hierarchical subsets of senior managers, middle managers, traders, or Enron’s central business units (transportation, wholesale, energy, and broadband services) and integrate them into a single organizational dynamic process. Such efforts may require reorganizing the e-mail dataset and including new players that initially were eliminated because of quality concerns or ignorance of their roles in the organization. Kan et al. [69] paved the way for extending this research to querying evolving graphs, showing that spatiotemporal patterns can be used to find evolving subgraphs that can be related to known real-world events or hierarchies. Dynamic network analysis can be further expanded through the generation of stochastic-based simulated networks, a concept explored by Menges et al. [17] based on social network analysis.

Additionally, our analysis can be extended to other databases of electronic communications such as Twitter or electronic bulletin boards. In forensic analysis, this technology could play a central role because it can detect the importance and influence of different actors. In several criminal cases, such as that of Enron, senior managers claimed that they concentrated only on high-level decisions and policies and that they were not responsible for the implementation of these policies. However, forensic analysis may reveal their direct or indirect involvement with all parts of the organization.

Another critical area of application is forensic finance. According to the SEC’s report [70], the significant increase in the price and volatility of GameStop (GME), a meme stock, in January 2021 was driven by a large number of investors buying the stock rather than by a “short squeeze.” Many of these individual investors exchanged the information that led to this rally through WallStreetBets, a Reddit forum specializing in trading and finance. Using our methodology, it could be possible to identify the leaders who moved the stock’s price. For instance, Keith Gill played a significant role in GameStop’s retail-trading frenzy, as many investors followed him on Reddit’s WallStreetBets forum [71]. Analysis of Reddit’s messages may identify him and the main followers who influenced GME’s stock price.

The public availability of a significant amount of documentation collected during the judicial proceedings regarding the Enron managers or other cases offers new insights into the organizational structure previously reserved for insiders. This new information and the generation of specialized models for dynamic social networks could help uncover the relationships that are the backbone of any modern corporation. The automatic application of these models can protect corporations’ assets and aid corporations in the early detection of information abuse and misuse.

Data Availability

The Enron dataset used to support the findings of this study is available from the corresponding author upon request. We used a version of the Enron dataset provided by J. Shetty and J. Adibi (The Enron e-mail dataset database schema and brief statistical report, 2004).

Disclosure

This paper is an extended version of our paper published in the 9th International Workshop on Knowledge Discovery on the Web, WebKDD 2007 and 1st International Workshop on Social Networks Analysis, SNA-KDD 2007, San Jose, CA, 2007 [27].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank both the National Science Foundation and DARPA for funding this work, particularly through the NSF grant “E-mail Mining Toolkit Supporting Law Enforcement Forensic Analysis” from the Digital Government research program (No. 0429323). The authors would also like to thank Sinan Aral and the participants of the Eastern Economics Association meeting 2019, the 2nd Stevens Internal Workshop on Cognition-Centric Enterprises 2011, and the 9th International Workshop on Knowledge Discovery on the Web 2007.

References

P. Chundi, M. Subramaniam, and D. K. Vasireddy, “An approach for temporal analysis of email data based on segmentation,” Data & Knowledge Engineering, vol. 68, no. 11, pp. 1253–1270, 2009.
E. Gilbert, “Phrases that signal workplace hierarchy,” in Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work, pp. 1037–1046, ACM, Washington, DC, USA, February 2012.
B. Klimt and Y. Yang, “The Enron corpus: a new dataset for email classification research,” in Proceedings of the European Conference on Machine Learning, Pisa, Italy, September 2004.
A. McCallum, A. Corrada-Emmanuel, and X. Wang, “The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email,” in Proceedings of the NIPS’04 Workshop on ‘Structured Data and Representations in Probabilistic Models for Categorization’, Whistler, Canada, July 2004.
J.-Y. Yeh and A. Harnly, “Email thread reassembly using similarity matching,” in Proceedings of the Third Conference on Email and Anti-Spam, Mountain View, CA, USA, July 2006.
J. Diesner and K. M. Carley, “Exploration of communication networks from the Enron email corpus,” in Proceedings of the Workshop on Link Analysis, Counter Terrorism and Security, SIAM International Conference on Data Mining 2005, Newport Beach, CA, USA, April 2005.
J. Diesner, T. L. Frantz, and K. M. Carley, “Communication networks from the enron email corpus “It's Always about the people. Enron is no different”,” Computational and Mathematical Organization Theory, vol. 11, no. 3, pp. 201–228, 2005.
R. Cotterill, “Using stylistic features for social power modeling,” Computación Y Sistemas, vol. 17, no. 2, pp. 219–227, 2013.
C. Danescu-Niculescu-Mizil, “A computational approach to linguistic coordination,” Cornell University, New York, NY, USA, 2012, PhD Thesis.
C. E. Priebe, J. M. Conroy, D. J. Marchette, and Y. Park, “Scan statistics on enron graphs,” Computational and Mathematical Organization Theory, vol. 11, no. 3, pp. 229–247, 2005.
M. W. Berry and M. Browne, “Email surveillance using non-negative matrix factorization,” Computational and Mathematical Organization Theory, vol. 11, no. 3, pp. 249–264, 2005.
P. S. Keila and D. B. Skillicorn, “Structure in the Enron email dataset,” Computational and Mathematical Organization Theory, vol. 11, no. 3, pp. 183–199, 2005.
D. Padmanabhan, D. Garg, and V. Varshney, “Analysis of Enron email threads and quantification of employee responsiveness,” in Proceedings of the Text Mining and Link Analysis Workshop on International Joint Conference on Artificial Intelligence, Hyderabad, India, August 2007.
T. Elsayed and D. W. Oard, “Modeling identity in archival collections of email: a preliminary study,” in Proceedings of the Third Conference on Email and Anti-spam (CEAS), Mountain View, CA, July 2006.
Z. Bar-Yossef, I. Guy, R. Lempel, Y. S. Maarek, and V. Soroka, “Cluster ranking with an application to mining mailbox networks,” in Proceedings of the ICDM ’06: Proceedings of the Sixth International Conference on Data Mining, pp. 63–74, IEEE Computer Society, Washington, DC, USA, December 2006.
W. Zhang, A. Ahmed, J. Yang, V. Josifovski, and A. J. Smola, “Annotating Needles in the Haystack without looking: product information extraction from emails,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2257–2266, ACM, Sydney, Australia, August 2015.
F. Menges, B. Mishra, and G. Narzisi, “Modeling and simulation of e-mail social networks: a new stochastic agent-based approach,” in Proceedings of the 40th Conference on Winter Simulation, pp. 2792–2800, Miami, Florida, December 2008.
N. Pathak and J. Srivastava, “Automatic extraction of concealed relations from email logs,” in Proceedings of the International Workshop/School on Network Science, Bloomington, Indiana, May, 2006.
A. Chapanond, M. S. Krishnamoorthy, and B. Yener, “Graph theoretic and spectral analysis of enron email data,” Computational and Mathematical Organization Theory, vol. 11, no. 3, pp. 265–281, 2005.
J. Shetty and J. Adibi, “Discovering important nodes through graph entropy: the case of Enron email database,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL. USA, August 2005.
X. Hu and H. Liu, “Social status and role analysis of palin’s email network,” in Proceedings of the 21st international conference companion on World Wide Web, pp. 531-532, ACM, Lyon France, April 2012.
A. S. Maiya and T. Y. Berger-Wolf, “Inferring the maximum likelihood hierarchy in social networks,” in Proceedings of the IEEE CSE’09, 12th IEEE International Conference on Computational Science and Engineering, vol. 4, pp. 245–250, IEEE, Vancouver, Canada, August 2009.
N. Memon, H. L. Larsen, D. L. Hicks, and N. Harkiolakis, “Retracted: Detecting hidden hierarchy in terrorist networks: some case Studies,” in Proceedings of the PAISI, PACCF and SOCO ’08: Proceedings of the IEEE ISI 2008 PAISI, PACCF and SOCO international workshops on Intelligence and Security Informatics, pp. 477–489, Springer-Verlag, Berlin, Germany, June 2008.
C. Kemp and J. B. Tenenbaum, “The discovery of structural form,” Proceedings of the National Academy of Sciences, vol. 105, no. 31, pp. 10687–10692, 2008.
L. C. Freeman, “Uncovering organizational hierarchies,” Computational & Mathematical Organization Theory, vol. 3, no. 1, pp. 5–18, 1997.
S. Aral and M. W. Van Alstyne, “The diversity-bandwidth trade-off,” American Journal of Sociology, vol. 117, no. 1, SSRN eLibrary, 2011.
G. Creamer, R. Rowe, S. Hershkop, and S. Stolfo, “Segmentation and automated social hierarchy detection through email network analysis,” in Proceedings of the Advances in Web Mining and Web Usage Analysis - 9th WEBKDD and 1st SNA-KDD Workshop at KDD 2007, Lecture Notes in Computer Science, Springer-Verlag, San Jose, CA, USA, August 2007.
S. J. Stolfo, G. Creamer, and S. Hershkop, “A temporal based forensic discovery of electronic communication,” in Proceedings of the National Conference on Digital Government Research, San Diego, CA, USA, May 2006.
S. J. Stolfo, S. Hershkop, C.-W. Hu, W.-J. Li, O. Nimeskern, and K. Wang, “Behavior-based modeling and its application to email analysis,” ACM Transactions on Internet Technology, vol. 6, no. 2, pp. 187–221, 2006.
A. Agarwal, A. Omuya, A. Harnly, and O. Rambow, “A comprehensive gold standard for the enron organizational hierarchy,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL ’12, pp. 161–165, Association for Computational Linguistics, Stroudsburg, PA, USA, July 2012.
J. Maktoubian, M. Noori, M. Amini, and M. Ghasempour-Mouziraji, “The hierarchy structure in directed and undirected signed networks,” International Journal of Communications, Network and System Sciences, vol. 10, no. 10, pp. 209–222, 2017.
S. Palus, P. Bródka, and P. Kazienko, “How to analyze company using social network?” in Knowledge Management, Information Systems, E-Learning, and Sustainability Research, Springer, New York, NY, USA, 2010.
A. K. Kalia, N. Buchler, D. Ungvarsky, R. Govindan, and M. P. Singh, “Determining team hierarchy from broadcast communications,” in Social Informatics, Springer, New York, NY, USA, 2014.
H. Nguyen and R. Zheng, “A data-driven study of influences in Twitter communities,” in Proceedings of the Communications (ICC), 2014 IEEE International Conference on Communications, pp. 3938–3944, IEEE, Sydney, Australia, June 2014.
M. Hovelynck and B. Chidlovskii, “Multi-modality in one-class classification,” in Proceedings of the 19th international conference on World wide web, pp. 441–450, ACM, Raleigh Convention Center in Raleigh, NC, USA, April 2010.
Y. Li and A. Somayaji, Fine-grained Access Control Using Email Social Network, CA Labs Research, Chennai, India, 2012.
R. Michalski, S. Palus, and P. Kazienko, “Matching organizational structure and social network extracted from email communication,” in Business Information Systems, Springer, New York, NY, USA, 2011.
V. Rivera-Pelayo, S. Braun, U. V. Riss, H. F. Witschel, and B. Hu, Building Expert Recommenders from Email-Based Personal Social Networks, Springer, New York, NY, USA, 2013.
F. Ganjaliyev, “New method for community detection in social networks extracted from the Web,” in Proceedings of the Problems of Cybernetics and Informatics (PCI), 2012 IV International Conference, pp. 1-2, IEEE, Baku, Azerbaijan, September 2012.
Y. Wang, M. Iliofotou, M. Faloutsos, and B. Wu, “Analyzing interaction communication networks in enterprises and identifying hierarchies,” in Proceedings of the Network Science Workshop (NSW), pp. 17–24, IEEE, New York, NY, USA, June 2011.
V. Boeva, L. Lundberg, S. M. H. Kota, and L. Sköld, “Evaluation of organizational structure through cluster validation analysis of email communications,” Journal of Computational Social Science, vol. 1, no. 2, pp. 327–347, 2018.
D. Jin, M. Heimann, T. Safavi et al., “Smart roles: Inferring professional roles in email networks,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2923–2933, Anchorage, Alaska, August 2019.
T. M. G. Tennakoon, “Knowledge discovery from social networks using interaction frequency and user hierarchy,” Queensland University of Technology, 2020, PhD thesis.
T. M. G. Tennakoon and R. Nayak, “A concise social network representation with flow hierarchy using frequent interactions,” in Proceedings of the 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 631–638, Volos, Greece, November 2018.
S. Behrendt, J. Klier, M. Klier, A. Richter, and K. Wiesneth, “The impact of formal hierarchies on enterprise social networking behavior,” in Proceedings of the International Conference on Information Systems (ICIS) Proceedings, p. 12, Fort Worth, TX, USA, December 2015.
J. Klier, M. Klier, A. Richter, and K. Wiesneth, “Two sides of the same coin?-the effects of hierarchy inside and outside enterprise social networks,” in Proceedings of the International Conference on Information Systems (ICIS), Singapore, December 2017.
P. Raut, R. Chawhan, T. Joshi, and P. Kasle, “Classification of power relations based on email exchange,” in Proceedings of the 2020 IEEE International Conference on Computing, pp. 486–489, Power and Communication Technologies (GUCON), New Delhi, India, December 2020.
S. Alkhereyf, Text Classification: Exploiting the Social Network, Columbia University, New York, NY, USA, 2021.
O. Rambow, M. Diab, J. Hirschberg, K. McKeown, S. Muresan, and M. Ostendorf, Detecting relations and anomaly in Text and speech (drats), Columbia University, New York, NY, USA, 2018, Technical report.
M. Nurek and R. Michalski, “Combining machine learning and social network analysis to reveal the organizational structures,” Applied Sciences, vol. 10, no. 5, p. 1699, 2020.
D. Barbucha and P. Szyman, “Detecting communities in organizational social network based on e-mail communication,” in Intelligent Decision Technologies, Springer, New York, NY, USA, 2021.
D. Barbucha and P. Szyman, “Identifying key actors in organizational social network based on e-mail communication,” in Proceedings of the International Conference on Computational Collective Intelligence, pp. 3–14, Springer, Rhodes, Greece, October 2021.
T. Magelinski and K. M. Carley, “Community-based time segmentation from network snapshots,” Applied Network Science, vol. 4, no. 1, pp. 1–19, 2019.
G. Costa and R. Ortale, “Mining overlapping communities and inner role assignments through bayesian mixed-membership models of networks with context-dependent interactions,” ACM Transactions on Knowledge Discovery from Data, vol. 12, no. 2, 2018.
C. Kalinowski and M. Z. Kurdi, “A topic-based forensic analysis and visualization of an email network: application to the enron dataset,” The Islamic University Journal of Applied Sciences, vol. 1, no. 2, pp. 1–22, 2019.
W. Cohen, “Enron data set,” 2004, https://hdl.loc.gov/loc.gdc/gdcdatasets.2018487913.
B. Klimt and Y. Yang, in Introducing the Enron Corpus, CEAS, Cincinnati, Ohio, 2004.
J. Shetty and J. Adibi, The Enron email dataset database schema and brief statistical report, Citeseerx, Princeton, NJ, USA, 2004.
United States Congress Joint Committee On Taxation, Report of investigation of Enron Corporation and related entities regarding federal Tax and Compensation issues, and policy recommendations, 108th Cong 1st. sess. 3 vols. Appendix D, Washington, DC, USA, 2003.
R. McCullough, “Memorandum related to reading enron’s scheme accounting materials,” 2004, http://www.mresearch.com/pdfs/89.pdf.
United States Congress Joint Committee On Taxation, “Enron corp. Executive compensation stress test finding,” in Report of Investigation of Enron corporation and related entities regarding federal tax and compensation Issues, and policy recommendations. 108th Cong 1st. sess. 3 vols. Appendix D, T. Perrin, Ed., GPO, Washington, DC, USA, 2003.
C. Bron and J. Kerbosch, “Algorithm 457: finding all cliques of an undirected graph,” Communications of the ACM, vol. 16, no. 9, pp. 575–577, 1973.
L. Freeman, “Centrality in social networks conceptual clarification,” Social Networks, vol. 1, no. 3, pp. 215–39, 1979.
J. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM, vol. 46, 1999.
P. Pons and M. Latapy, “Computing communities in large networks using random walks,” Journal of Graph Algorithms and Applications, vol. 10, no. 2, pp. 191–218, 2006.
M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical review. E, Statistical, nonlinear, and soft matter physics, vol. 69, Article ID 026113, 2004.
S. Hershkop, “Behavior-based email analysis with application to Spam detection,” Columbia University, New York, NY, USA, 2006, PhD thesis.
W. Chung, R. Savell, J.-P. Schütt, and G. Cybenko, “Identifying and tracking dynamic processes in social networks,” in Proceedings of the SPIE Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Security and Homeland Defense V, pp. 1–12, Orlando, FL, USA, May 2006.
A. Kan, J. Chan, J. Bailey, and C. Leckie, “A query based approach for mining evolving graphs,” in Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 139–150, Australian Computer Society, Inc, Melbourne, Australia, December 2009.
U.S. Securities and Exchange Commission, Staff report on equity and options market structure Conditions in early 2021, 2021.
J.-A. Verlaine and G. Banerji, Keith Gill drove the GameStop Reddit mania. He talked to the journal, Wall Street Journal, New York, NY, USA, 2021.

Copyright

Copyright © 2022 Germán G. Creamer et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
