#### Abstract

Despite mounting evidence that human learners are sensitive to the community structure underpinning temporal sequences, this phenomenon has been studied using an extremely narrow set of network ensembles. The extent to which behavioral signatures of learning are robust to changes in community size and number is the focus of the present work. Here we present adult participants with a continuous stream of novel objects generated by a random walk along graphs of 1, 2, 3, 4, or 6 communities comprising *N* = 24, 12, 8, 6, and 4 nodes, respectively. Each node of the graph corresponds to a unique object, and each edge corresponds to the immediate succession of two objects in the stream. In short, we find that previously observed processing costs associated with community boundaries persist across an array of graph architectures. These results indicate that statistical learning mechanisms can flexibly accommodate variation in community structure during visual event segmentation.

#### 1. Introduction

Segmentation processes, such as those involved in extracting words from continuous speech, are the backbone of much of human learning. Tasks essential to the language learner, such as mapping meaning onto sound or combining words into phrases and sentences, first require some understanding of the constituent parts of language. The parsing of sensory input into discrete units is of equal importance in other domains; for example, the perception of event boundaries in visual sequences has been shown to play a key role in active memory [1, 2]. Foundational work by Saffran and colleagues demonstrated that segmentation in the absence of semantic or acoustic cues to word boundaries is driven by the transition probabilities between syllables [3, 4]. More specifically, they found that the successful extraction of structure was due to the relative *difference* in transition probabilities throughout streams of nonsense syllables, characterized by high probabilities within words and low probabilities between words. This simple statistic has since been linked to parsing behavior in both visual and motor learning tasks, suggesting that sensitivity to transition probabilities, or *statistical learning*, extends beyond a single cognitive domain [5–8].

However, while pairwise predictive relationships are clearly a powerful statistic relevant to learning, they represent only one source of statistical information available to learners. As discussed more thoroughly in [9], tasks that demonstrate sensitivity to the central tendency of a distribution, such as in discriminating segments from a phonetic continuum [10], are also considered examples of statistical learning mechanisms at work. An examination of the full scope of statistical information exploited by the learner is a valuable endeavor, particularly when considering learning effects that are not solely explained by transition probabilities or distributional regularities. It has been demonstrated, for instance, that segmentation processes can be stymied by changes to stimulus structure such as varying the length of units to be parsed from continuous input [11]. In sum, there is insight to be gained from considering whether and when learners tune into multiple sources (or levels) of statistical information.

Though not typically framed in this way, recent developments in the field of network science have effectively extended statistical learning to encompass sensitivity to more complex information such as the global architecture of the environment (for a recent review see [12]). Evidence indicates that when certain broad-scale topological patterns are present, segmentation effects can be elicited even when transition probabilities have been equated between all pairs of elements [13–15]. In the most commonly used experimental design, nodes of the graph represent individual images and edges of the graph represent transitions from one image to another in time. By design, neighbors of each node are richly interconnected with one another, ensuring that the graph displays community structure. However, because degree (number of incident edges) is identical for each node, transition probabilities are stable. Learners exposed to a continuous stream of images generated by a random walk along such a graph *still show* sensitivity to the boundaries between communities. Building on these findings, new evidence suggests that the presence of community structure within temporal sequences might be a particularly privileged type of regularity. For example, increases in processing speed have been observed for motor sequences generated by random walks along modular relative to lattice graphs with an identical number of nodes, edges, and degree distribution [16].

As we develop and test hypotheses about why community structure might be particularly important for learning, it is necessary to clarify the extent to which this specific sensitivity generalizes to variations in graph topology. While the presence of community structure has been repeatedly linked to changes in processing at event boundaries, this phenomenon is nearly always studied using a very narrow ensemble of graphs with nodes of degree *k* = 4 and communities of nodes and edges [13–16]. Here, we expand on the limited set of graph architectures previously used by systematically assessing how variations in the number and size of communities impact learning. We note that while we vary community structure, we are careful to hold constant local statistics commonly associated with learning, such as the degree distribution of nodes within a graph (i.e., variations in pairwise transition probabilities). We ask whether previously reported increases in processing time at community boundaries, indicative of learners’ expectations that sequences tend to stay within communities, are affected when properties of those communities change.

As a secondary goal, we probe segmentation effects using images of 3-dimensional, clearly manipulable objects as opposed to the more commonly employed fractals or glyphs (but see [17]). Thus, the present series of experiments aims to raise the ecological validity of standard approaches to studying learners’ sensitivity to community structure. In doing so, we offer greater insight into how this sensitivity might relate to real-world contexts. For example, a rich research tradition on event segmentation in natural scenes has focused on how the confluence of top-down and bottom-up processes enables perceivers to determine the boundaries of visually-presented activities [18], with a particular focus on how the segmentation processes relate to the encoding of information in memory [19]. Because statistical learning mechanisms are proposed to operate in information-rich contexts, such as natural scenes, it is essential to demonstrate that they can handle complex sensory input [20]. Making use of manipulable, natural-looking objects, the work presented here is a step toward strengthening links to learning outside the laboratory.

#### 2. Materials and Methods

##### 2.1. Participants

Data were collected from 100 unique participants: 20 in each of the 5 experimental conditions in a between-subjects design. We used Amazon Mechanical Turk, an online marketplace in which adult workers complete tasks in exchange for financial compensation. Participants were paid at a rate of $0.10 per minute. To ensure that participants were attending to the task, they also received a completion bonus of $1.00 as well as an additional $1.00 bonus if their performance on an orthogonal cover task exceeded 90% accuracy. Methods adhered to the guidelines and regulations of the Institutional Review Board (IRB) of the University of Pennsylvania, which approved all experimental protocols. Participants communicated informed consent prior to completing the experiment.

##### 2.2. Stimuli

Color images of objects used in this experiment were pulled from the edition of the Novel Object and Unusual Name (NOUN) Database [21]. Novel objects were employed to reduce the possibility that the degree to which an object was recognizable would influence participants’ processing times. To narrow down the full set of objects to the subset used here (Figure 1(a)), we selected from the database the 24 most distinct objects (highest mean distance scores based on the Spatial Arrangement Method [22]) that were considered familiar and nameable by 50% or fewer participants. Of the resulting list, we replaced three objects with slightly lower distance scores because their high degree of symmetry meant that participants would be unable to perform a rotation judgment task (see below).

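**Figure 1.** (a) The subset of novel objects used as stimuli. (b) The graph types used to generate the visual streams.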

Once the object images were selected, one unique continuous visual stream of 1400 trials was created for each participant. Streams were generated by first assigning each object to a node and then randomly walking along the edges of one of 5 graph types (Figure 1(b)). Object-to-node correspondence was randomized across subjects. All graphs consisted of an equal number of nodes (*N* = 24), and although the degree of each node differed by graph type (ranging from *k* = 23 in the fully connected graph to *k* = 3 in the graph consisting of 6 communities), their relative distribution was matched. In other words, within a single graph type, the degree was equated for all nodes, roughly fixing the transition probabilities. Thus, the crucial manipulation was not local variations in pairwise statistics, but rather the number of communities (1, 2, 3, 4, or 6) and the number of nodes within each of those communities (*N* = 24, 12, 8, 6, and 4, respectively). Because we aimed to maintain dense community structure while also ensuring uniform degree across nodes *within* a graph, the total number of edges differed by graph type, ranging from *E* = 276 in the fully connected graph to *E* = 36 in the graph consisting of 6 communities.
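The stream-generation procedure can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: it assumes one particular way of satisfying the stated constraints, in which each community is a complete subgraph with a single edge removed between two boundary nodes, and the boundary nodes link neighboring communities in a ring. The function names `build_modular_graph` and `random_walk` are hypothetical. Under this assumed construction every node has degree *k* = *N*/*C* − 1, where *C* is the number of communities.

```python
import random
from itertools import combinations


def build_modular_graph(n_communities, n_nodes=24):
    """Return an adjacency dict for a graph with uniform degree.

    Assumed construction (not taken from the paper): each community is a
    clique with one edge removed between its first and last node; the last
    node of each community is then linked to the first node of the next
    community, forming a ring. With one community the graph is complete.
    """
    size = n_nodes // n_communities
    adj = {v: set() for v in range(n_nodes)}

    def add_edge(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for c in range(n_communities):
        members = list(range(c * size, (c + 1) * size))
        for u, v in combinations(members, 2):
            add_edge(u, v)
        if n_communities > 1:
            first, last = members[0], members[-1]
            # Drop the edge between the two boundary nodes ...
            adj[first].discard(last)
            adj[last].discard(first)
            # ... and connect this community to the next one in the ring.
            next_first = ((c + 1) % n_communities) * size
            add_edge(last, next_first)
    return adj


def random_walk(adj, n_trials=1400, seed=None):
    """Generate a node sequence by walking uniformly at random along edges."""
    rng = random.Random(seed)
    node = rng.choice(sorted(adj))
    walk = [node]
    while len(walk) < n_trials:
        node = rng.choice(sorted(adj[node]))
        walk.append(node)
    return walk


if __name__ == "__main__":
    for c in (1, 2, 3, 4, 6):
        graph = build_modular_graph(c)
        degrees = {len(neighbors) for neighbors in graph.values()}
        n_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
        print(c, "communities: degrees", degrees, "edges", n_edges)
```

Because degree is uniform within each graph, every outgoing transition from a node has probability 1/*k*, and the edge count follows from *E* = *N* · *k* / 2 (for example, 24 · 23 / 2 = 276 for the fully connected graph).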

##### 2.3. Procedure

The experimental setup closely mirrored the procedures detailed in [14]; however, for clarity, we summarize our methods here. Participants were instructed to view a continuous stream of objects, and they were informed that over the course of the 35-minute stream, parts of it might become familiar to them. Prior to the initiation of the stream, they were trained to distinguish the canonical orientation of each object from a version that was rotated 90 degrees to the left, and they were tested on their knowledge before moving to the main phase of the experiment. Training trials were repeated until participants achieved an accuracy score of 100% (mean = 83.47 trials, SD = 21.19). The minimum possible number of training trials was 72 (3 trials per object). While viewing the full stream of objects, participants indicated whether each object appeared in its canonical orientation (by pressing 1 on their keyboard) or its rotated version (by pressing 2 on their keyboard). Thus, we were able to collect fine-grained measures of processing time for each object throughout the course of exposure to the stream. From the full set of objects, exactly 15% were rotated. Participants were instructed that they would hear a high-pitched tone if they responded incorrectly during the exposure phase and a low-pitched tone if they responded too slowly. Images of size 300x300 pixels were presented for 1.5 s with no interstimulus interval on a white background.

#### 3. Results

The dependent measure for this experiment was the reaction time (RT) for a canonical, non-rotated image in the stream. Before examining the influence of variation in community structure on this measure, the following steps were taken to clean the data: removal of incorrect or no response trials (7.4% data loss), removal of rotated trials (a further 12% data loss), removal of implausible reaction times (i.e., greater than 1500 ms or less than 100 ms, a further 0.2% data loss), and removal of outlier data points greater than 3 standard deviations from the average RT per subject (a further 1.7% data loss). These preprocessing steps were identical to those used in prior work [14], and we note that the pattern of significant results reported below holds without the removal of implausible and outlier data points. Next, we ran two regression models to answer the following questions: First, do previously reported increases in RTs at community boundaries vary by community size and number (**Model 1**)? Second, are general processing times, separate from the hypothesized cross-community RT increases, influenced by these same topological variations (**Model 2**)? The linear mixed effects modeling described below was performed with the *lmer()* function (library lme4, v. 1.1-19) in R v. 3.5.1.
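The four cleaning steps listed above can be sketched as a short filtering pipeline. The following Python is illustrative only: the trial representation (dictionaries with the hypothetical keys `subject`, `rt`, `correct`, and `rotated`) is an assumption, as is the use of the sample standard deviation computed after the first three steps.

```python
from collections import defaultdict
from statistics import mean, stdev


def clean_trials(trials, lo_ms=100, hi_ms=1500, n_sd=3):
    """Apply the cleaning steps in order and return the retained trials.

    Each trial is assumed to be a dict with hypothetical keys:
    'subject', 'rt' (ms, or None for no response), 'correct', 'rotated'.
    """
    # Step 1: drop incorrect or missing responses.
    kept = [t for t in trials if t["correct"] and t["rt"] is not None]
    # Step 2: drop rotated trials (only canonical images are analyzed).
    kept = [t for t in kept if not t["rotated"]]
    # Step 3: drop implausible RTs (< 100 ms or > 1500 ms).
    kept = [t for t in kept if lo_ms <= t["rt"] <= hi_ms]

    # Step 4: drop RTs more than n_sd standard deviations from the
    # subject's own mean (sample SD; an assumption of this sketch).
    rts_by_subject = defaultdict(list)
    for t in kept:
        rts_by_subject[t["subject"]].append(t["rt"])
    bounds = {}
    for subject, rts in rts_by_subject.items():
        center = mean(rts)
        spread = stdev(rts) if len(rts) > 1 else 0.0
        bounds[subject] = (center - n_sd * spread, center + n_sd * spread)

    return [
        t for t in kept
        if bounds[t["subject"]][0] <= t["rt"] <= bounds[t["subject"]][1]
    ]
```

Applying the steps sequentially matters for the bookkeeping: each reported percentage is a "further" loss relative to what survived the previous step.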

##### 3.1. Model 1: Cross-Community Processing Costs

Model 1 was run specifically on data points corresponding to boundary nodes, defined as the nodes directly preceding entry into a new community (“pretransition nodes”) and the nodes representative of that entry (“transition nodes”). Because the fully connected graph contained no boundary nodes, data from this condition were excluded from analysis. We focused specifically on boundary nodes for two reasons: (1) we could not rule out the possibility that learners might show a special sensitivity to boundary nodes regardless of whether they represented entry into a new community; and (2) this approach would ensure a relatively balanced dataset. For instances in which there was a forward and backward traversal of the same cross-community edge (e.g., 24-1-24), we counted only the first pre/transition node pair (24-1). RTs were regressed onto all main effects and interactions of Node Type (pretransition *versus* transition), Community (reverse Helmert coded to test the hypothesis that RTs would increase based on the number of communities) and Trial (continuous from 1–1400, centered to reduce multicollinearity). The model also included the fullest random effects structure that allowed the model to converge: a random intercept for each participant and by-participant random slopes for Trial, Node Type, and their interaction. Results are detailed in Table 1. We observe significant main effects of Node Type (*β* = 16.79, t = 8.46, and p < 0.001) and Trial (*β* = -27.35, t = -8.17, and p < 0.001). The magnitude of the correlation among fixed effects was less than *r* = 0.6.
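The selection of pretransition and transition trials can be illustrated with the sketch below. It is not the authors' implementation: the function name `boundary_pairs` and the `community_of` callback are hypothetical, and the handling of back-and-forth traversals reflects one reading of the rule stated above (a crossing is skipped when it immediately reverses a crossing that was itself counted).

```python
def boundary_pairs(walk, community_of):
    """Return (pretransition_index, transition_index) pairs for a walk.

    `walk` is a list of node labels; `community_of` maps a node label to
    its community. A crossing that immediately retraces the previously
    counted cross-community edge (e.g., 24-1-24) is not counted again.
    """
    pairs = []
    previous_step_counted = False
    for t in range(1, len(walk)):
        is_crossing = community_of(walk[t]) != community_of(walk[t - 1])
        is_reversal = t >= 2 and walk[t] == walk[t - 2]
        if is_crossing and not (is_reversal and previous_step_counted):
            pairs.append((t - 1, t))
            previous_step_counted = True
        else:
            previous_step_counted = False
    return pairs


if __name__ == "__main__":
    # Two communities of 12 nodes, labeled 0-11 and 12-23 (an assumption).
    walk = [5, 11, 12, 11, 3, 11, 12, 13]
    print(boundary_pairs(walk, lambda node: node // 12))
```

Returning index pairs rather than per-trial labels keeps each pretransition trial explicitly matched to its transition trial, which is what the Node Type contrast compares.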

To summarize, we find that images associated with transition nodes elicited significantly longer RTs than images occurring directly prior to that transition (Figure 2). As expected, we also find that RTs decreased significantly over time regardless of Node Type; that is, participants overall became faster at making orientation judgments. We observe no main effects of Community, and no interactions with this predictor, suggesting that previously reported cross-community RT increases are robust to fluctuations in community size and number. A subsequent simple effects analysis indicates the effect of Node Type for each level of the Community predictor. Significant effects of Node Type are revealed for the 2-community graph (*β* = 12.81, t = 2.31, and p = 0.021), the 3-community graph (*β* = 22.57, t = 5.58, and p < 0.001), the 4-community graph (*β* = 17.46, t = 5.51, and p < 0.001), and the 6-community graph (*β* = 14.34, t = 5.88, and p < 0.001). Numerically, the effect of Node Type is weakest for the graph consisting of 2 communities of 12 nodes, but we find no significant difference in cross-community RT increases for this graph relative to the others.
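For readers less familiar with the reverse Helmert coding applied to the Community predictor, the sketch below builds one common version of the contrast matrix, in which each level is compared with the mean of all preceding levels. The scaling is an assumption (the exact scaling used in the analysis is not stated), and the function name `reverse_helmert` is hypothetical.

```python
from fractions import Fraction


def reverse_helmert(n_levels):
    """Return an n_levels x (n_levels - 1) reverse Helmert contrast matrix.

    Column j (1-based) compares level j with the mean of levels 0..j-1.
    Weights are scaled so that each coefficient can be read directly as
    that difference in means (an assumed, commonly used scaling).
    """
    matrix = []
    for level in range(n_levels):
        row = []
        for j in range(1, n_levels):
            if level < j:
                row.append(Fraction(-1, j + 1))   # earlier levels
            elif level == j:
                row.append(Fraction(j, j + 1))    # the level being tested
            else:
                row.append(Fraction(0))           # later levels are ignored
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    # Model 1 has four Community levels: 2, 3, 4, and 6 communities.
    for label, row in zip((2, 3, 4, 6), reverse_helmert(4)):
        print(label, [str(weight) for weight in row])
```

With four levels, the three contrasts correspond to 3 versus 2 communities, 4 versus the mean of 2 and 3, and 6 versus the mean of 2, 3, and 4.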

###### 3.1.1. Repetition Priming

Because walks often sampled densely from within a community, there was a higher probability (relative to transition nodes) that a pretransition node would have been viewed in the recent past. To disentangle perceptual priming effects from the top-down expectation that sequences should stay within communities, we followed the approach taken by [14]. Model 1 was rerun with the addition of two confound predictors: Lag10 and Recency. These two predictors indicated the number of times each image was seen in the previous 10 trials and the number of trials elapsed since each image was seen, respectively. Results reveal significant main effects of Lag10 (*β* = -13.68, t = -8.06, and p < 0.001) and Recency (*β* = 15.31, t = 9.61, and p < 0.001); however, we maintain our significant main effect of Node Type (*β* = 5.53, t = 2.66, and p = 0.008). We then subset our data to include only the 30.7% of boundary nodes that had *not* been repeated within the previous 25 trials (any further constraint would have led to an extremely unbalanced dataset). Again, we maintain a significant main effect of Node Type (*β* = 11.00, t = 2.67, and p = 0.008).
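The two confound predictors can be computed directly from the node sequence, as in the following sketch. The function names are hypothetical, and the treatment of an image's first appearance (Recency left undefined as `None`, and counted as "not recently repeated") is an assumption of this example.

```python
def lag10_and_recency(walk, window=10):
    """Compute Lag10 and Recency for every trial in a walk.

    Lag10: how many times the current image appeared in the previous
    `window` trials. Recency: trials elapsed since the image was last
    seen (None on first appearance; an assumption of this sketch).
    """
    last_seen = {}
    lag10, recency = [], []
    for t, node in enumerate(walk):
        lag10.append(walk[max(0, t - window):t].count(node))
        recency.append(t - last_seen[node] if node in last_seen else None)
        last_seen[node] = t
    return lag10, recency


def not_recently_seen(recency, horizon=25):
    """Flag trials whose image was not shown within the last `horizon` trials."""
    return [r is None or r > horizon for r in recency]


if __name__ == "__main__":
    lag10, recency = lag10_and_recency([1, 2, 1, 3, 1])
    print(lag10, recency)
```

The `not_recently_seen` mask corresponds to the subsetting step, where only boundary trials whose image had not appeared in the preceding 25 trials are retained.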

##### 3.2. Model 2: General Processing Times Influenced by Community Size and Number

Procedures for Model 1 were also applied to Model 2. However, as we already confirmed from Model 1 the presence of a processing cost for transition nodes, here in Model 2 we probed RTs for all nodes *except* transition nodes analyzed in Model 1. Because we did not focus exclusively on boundary nodes, we were also able to include data from the fully connected graph. RTs were regressed onto all main effects and interactions of Community and Trial. We again included the fullest random effects structure that allowed the model to converge, which in this case consisted of a random intercept for each participant and by-participant random slopes for Trial. The magnitude of the correlation among fixed effects was less than *r* = 0.3. In addition to the expected main effect of Trial (*β* = -24.59, t = -11.90, p < 0.001), we also observe a significant main effect of Community for the graph containing 4 communities of *N* = 6 nodes relative to graphs containing 1, 2, and 3 communities (*β* = -8.83, t = -2.09, p = 0.040; Table 1). Phrased another way, the general processing times after excluding cross-community nodes were most facilitated when learners were presented with sequences generated by a random walk along a graph consisting of 4 communities (Figure 3). Importantly, these effects are observed even when specifically accounting for inter-individual variation in general RTs through the random effects structure of the model. To be clear, when directly comparing general processing times for graphs of 4 communities relative to graphs of 6 communities, we find no significant main effect (*β* = 9.27, t = 0.84, and p = 0.41). Therefore, it may not be that participants had a particular preference for graphs of 4 communities (of 6 nodes) but that their processing times were generally influenced when information was organized according to many small communities.

Finally, to make direct contact with Model 1 analyses, we also reran Model 2 with the inclusion of the Lag10 and Recency predictors described in Section 3.1.1. We note that the main effect of Community (4 v. 3, 2, 1) was marginal (*β* = -6.98, t = -1.65, p = 0.103). When subsetting to the 18.9% of nodes that had not been repeated within the previous 25 trials, the previously significant main effect of Community (4 v. 3, 2, 1) dropped to *β* = -4.22, t = -0.97, p = 0.336.
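Written out, the base Model 2 specification (before adding Lag10 and Recency) takes roughly the following form. This is a sketch rather than the authors' exact parameterization: the symbols $C_{c,i}$, $T_{ij}$, $u_{0i}$, and $u_{1i}$ are introduced here for illustration, and the distributional assumptions are those standard for linear mixed effects models, including an unstructured covariance between the random intercept and slope.

$$
\mathrm{RT}_{ij} = \beta_0 + \sum_{c=1}^{4} \beta_c\, C_{c,i} + \beta_5\, T_{ij} + \sum_{c=1}^{4} \beta_{5+c}\, C_{c,i}\, T_{ij} + u_{0i} + u_{1i}\, T_{ij} + \varepsilon_{ij},
\qquad
\begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, \Sigma),
\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $i$ indexes participants and $j$ indexes trials, $C_{1,i}, \dots, C_{4,i}$ are the four reverse Helmert contrasts encoding the five graph types (a between-subjects factor), and $T_{ij}$ is the centered trial number. In lme4 formula syntax, a model of this shape would be written along the lines of `RT ~ Community * Trial + (1 + Trial | participant)`, where the column names are hypothetical.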

#### 4. Discussion

The current study serves to broaden our understanding of the scope of community-driven learning. We have begun with a replication of prior work demonstrating a processing cost associated with transitioning from one community of objects to another in a continuous sequence [13–16]. In doing so, we have taken the critical step of showing that previously reported effects generalize to novel stimuli that more closely approximate the physical features of manipulable, complex objects found in our real-world environment. The observed increase in reaction time at community boundaries signals that learners are indeed highly sensitive to modular temporal networks; processing costs for transition nodes indicate a violation of the expectation that sequences tend to stay within a given community (for extensive discussion of this point see [14]). The present report also finds no evidence to support the hypothesis that cross-community RT differences are significantly modulated by changes in community size and number. Compellingly, a simple effects analysis pointed to a significant effect of Node Type whether examining 2 communities of *N* = 12 nodes with degree *k* = 11 or 6 communities of *N* = 4 nodes with degree *k* = 3.

While the present work substantiates the link between community structure and event segmentation (i.e., by focusing on boundary nodes), it is also useful to consider the impact of this property on *general* processing separately from the RT signatures associated with violating learners’ expectations that sequences stay within a given community [16]. There is a rich history in cognitive science devoted to uncovering “sweet spots” associated with various cognitive capacities (e.g., [23]). Miller (1956) famously pronounced the limits of verbal working memory as seven plus or minus two [24], and similar constraints are described in tasks of numerical cognition. For example, adults typically have a subitizing range of 5 items beyond which they are unable to automatically determine the number of items in a visual array without counting [25]. While not a constraint *per se*, it could be the case that communities of a certain size or number lead to the most efficient use of processing resources. An analysis of intra-community reaction times (i.e., excluding nodes representative of a transition to a new community) starts to offer an answer to this question. Specifically, results reveal the greatest facilitation of object processing for sequences generated by walks along a graph composed of 4 communities of *N* = 6 nodes (compared to graphs with 1, 2, or 3 communities). Given the present design, intended to evaluate segmentation effects while holding constant the total number of nodes and the within-graph degree distribution, it is not possible to disentangle whether the observed facilitation effects are due to the number or size of communities. Nonetheless, this pattern of results has intriguing connections to reports of visual working memory capacity at 4 items [26]. Might it be that community structure is most useful as a cue to underlying structure when the temporal environment is organized into 4 or more groupings? If, however, it is the size of community that affects processing times and not the total number of communities, then we would not find evidence in favor of that hypothesis. Clearly, additional work, using computational as well as behavioral approaches [27], is needed to pinpoint the nature of the relationship between constraints on cognitive capacities and the effects of community structure on sequential object processing.

The sum of these findings confirms that learners are strongly attuned to the presence of community structure and invites further study of the extent to which learning is robust to even more pronounced topological variation. Given that community structure pervades systems as diverse and noisy as linguistic, biological, and social networks [28–30] and that the human brain flexibly accommodates relational information in grid-like maps [17, 31], one would expect learning mechanisms to cope adequately with larger scale and/or sparser instantiations of this property. However, the extent to which learners exploit the full scope of network properties observed in natural systems (e.g., core-periphery structure, scale-free structure, and variations in community size and density) gives rise to empirical questions to be tested. Specifying the boundary conditions of learning is an especially important area of ongoing and future research. For example, evidence already suggests that local statistics (transition probabilities), when they are sufficiently strong at community boundaries, override typically observed RT increases at event transitions [14]. This tension between local and global regularities, particularly when those regularities are embedded in noisier systems, will be an essential avenue to investigate in greater detail. To follow up on our earlier suggestion that learning in the laboratory should more closely reflect learning outside the laboratory, the focus here on uniform communities and uniform transition probabilities (within-graph) could be considered a limitation given that learners are less likely to encounter such rigidly organized input.

#### 5. Conclusions

We argue here for extending the commonly used definition of statistical learning to encompass learners’ sensitivity to the broader topology of their environment. We offer support for this argument by demonstrating that processing times at event boundaries are influenced by temporal community structure across a variety of scales (i.e., the set of graphs tested here), even when object-to-object transition probabilities do not display meaningful variation. Finally, we show that community structure also affects *overall* processing times, with specific facilitative impact observed for sequences composed of 4 communities of *N* = 6 nodes relative to fewer communities composed of a greater number of nodes. The present work marks an important step forward in our understanding of the influence of higher-order architectural properties on learning and processing.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Disclosure

The content is solely the responsibility of the authors and does not necessarily represent the official views of any of the funding agencies.

#### Conflicts of Interest

The authors declare no conflicts of interest.

#### Acknowledgments

The authors wish to acknowledge Christopher Lynn for helpful comments on this manuscript. This work was supported by the National Science Foundation CAREER award to Danielle S. Bassett (PHY-1554488) and by the Army Research Laboratory through contract number W911NF-10-2-0022. The authors would also like to acknowledge support from the John D. and Catherine T. MacArthur Foundation, the Alfred P. Sloan Foundation, the Paul G. Allen Foundation, the Army Research Office through contract numbers W911NF-14-1-0679 and W911NF-16-1-0474, the National Institutes of Health (2-R01-DC-009209-11, 1R01HD086888-01, R01-MH107235, R01-MH107703, R01MH109520, 1R01NS099348, and R21-MH-106799), the Office of Naval Research, and the National Science Foundation (BCS-1441502, BCS-1631550, and CNS-1626008).