Abstract

Phasic activity of dopaminergic (DA) neurons in the ventral tegmental area or substantia nigra compacta (VTA/SNc) has been suggested to encode reward-prediction error signal for reinforcement learning. Recent studies have shown that the lateral habenula (LHb) neurons exhibit a similar response, but for nonrewarding or punishment signals. Hence, the transient signaling role of LHb neurons is opposite that of DA neurons and also that of several other brain nuclei such as the border region of the globus pallidus internal segment (GPb) and the rostral medial tegmentum (RMTg). Previous theoretical models have investigated the neural circuit mechanism underlying reward-based phasic activity of DA neurons, but the feasibility of a larger neural circuit model to account for the observed reward-based phasic activity in other brain nuclei such as the LHb has yet to be shown. Here, we propose a large-scale neural circuit model and show that parallel excitatory and inhibitory pathways underlie the learned neural responses across multiple brain regions. Specifically, the model can account for the phasic neural activity observed in the GPb, LHb, RMTg, and VTA/SNc. Based on sensitivity analysis, the model is found to be robust against changes in the overall neural connectivity strength. The model also predicts that striosomes play a key role in the phasic activity of VTA/SNc and LHb neurons by encoding previous and expected rewards. Taken together, our model identifies the important role of parallel neural circuit pathways in accounting for phasic activity across multiple brain areas during reward and punishment processing.

1. Introduction

The ability to adapt to uncertainty is critical for survival and key to wellbeing. To investigate the underlying neural correlates and mechanisms, many experimental and computational studies using stochastic scheduling of reward have been carried out [1–9]. Experimental studies have demonstrated that dopaminergic (DA) neurons in the ventral tegmental area or substantia nigra compacta (VTA/SNc) and neurons in the lateral habenula (LHb) play important roles in encoding uncertainty of reward and punishment [5, 8].

As illustrated schematically in Figure 1 (top row), given some unexpected reward (the presence of an unconditioned stimulus US such as food), DA (LHb) neurons exhibit a phasic peak (dip) upon the presence of the US [5, 8]. After several trials of learning in the presence of a cue/stimulus, conditioning takes place. The (expected) conditioned cue/stimulus (CS) becomes associated with reward, and the DA (LHb) neurons exhibit a phasic peak (dip) in activity upon the onset of the CS (Figure 1, second row) [5, 8]. Note that the DA and LHb neurons now do not respond to the unconditioned stimulus (US) with a rewarding outcome [5, 8]. One can view this as postreinforcement learning: the agent has learned to completely associate the cue/stimulus CS with the US (e.g., an auditory tone with food), and the latter is no longer needed for further learning. However, if there is an omission of reward (e.g., absence of food), there is an additional dip (peak) in activity for the DA (LHb) neurons (Figure 1, third row) [5, 8].

Instead of the unexpected rewarding outcome US, if we now replace it with an unexpected nonrewarding or aversive stimulus US (e.g., no food or mild electric shock), it has been observed that phasic dip (peak) in the DA (LHb) neurons occurs during the initial phase of the reinforcement learning [5, 8] (Figure 1, fourth row). After learning, this information is transferred to the CS, in which the DA (LHb) neurons exhibit a phasic dip (peak) activity upon CS presentation while staying at baseline activity level during US (Figure 1, fifth row). When there is a sudden unexpected omission of such US or when the US becomes rewarding, then there is a peak (dip) in activity of the DA (LHb) neurons [8, 10, 11] (Figure 1, bottom row). In summary, the phasic activities of DA and LHb neurons signal uncertainty in reward and punishment. Such signaling is also reflected in other brain regions such as the border region of the globus pallidus internal segment (GPb), the internal segment of the globus pallidus (GPi), and the rostral medial tegmentum (RMTg) [2, 3]. However, it is not clear how this information is transmitted within a larger neural circuit.

To understand the underlying computation, previous theoretical and computational studies have applied temporal difference learning [8, 12–15] and neural circuit modeling to understand the phasic activity of DA neurons [16–18] on the basis that the phasic activity of DA neurons acts as a form of reward-prediction error signal [8]. In particular, in the model by Brown et al. [16], there are parallel pathways: one pathway from the cortex through the striosome to VTA/SNc and the other pathway from the cortex through the ventral striatum (VS) to the pedunculopontine nucleus (PPTN) and VTA/SNc. These two pathways cooperatively control the activity of DA neurons (Figure 2). However, the phasic activity of LHb neurons has not yet been taken into consideration, even though the LHb has substantial projections to DA neurons in the VTA/SNc [5].
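The reward-prediction error interpretation mentioned above can be illustrated with a minimal temporal difference sketch. This is not the circuit model proposed in this paper and not the implementation of any cited study; it is a textbook TD(0) learner in which the state is the time elapsed since CS onset. The function names, the time steps (CS at step 5, US at step 13), the discount factor, and the learning rate are all assumptions chosen for illustration.

```python
def value(w, t, t_cs):
    """Predicted future reward at step t; zero before the CS appears."""
    return w[t - t_cs] if t >= t_cs else 0.0


def run_trial(w, reward, t_cs=5, t_us=13, n_steps=20,
              gamma=0.98, alpha=0.3, learn=True):
    """Run one trial of TD(0) and return the prediction error at each step.

    w[k] is the learned value k steps after CS onset (hypothetical
    time-since-cue code). The list w is updated in place when learn=True.
    """
    delta = [0.0] * n_steps
    for t in range(n_steps - 1):
        r_next = reward if t + 1 == t_us else 0.0
        # delta[t + 1] is the error signalled on arriving at step t + 1
        delta[t + 1] = r_next + gamma * value(w, t + 1, t_cs) - value(w, t, t_cs)
        if learn and t >= t_cs:
            w[t - t_cs] += alpha * delta[t + 1]
    return delta


weights = [0.0] * 20
first = run_trial(weights, reward=1.0)        # unexpected reward
for _ in range(500):                          # conditioning trials
    run_trial(weights, reward=1.0)
expected = run_trial(weights, reward=1.0, learn=False)  # learned, rewarded
omitted = run_trial(weights, reward=0.0, learn=False)   # learned, omitted
```

In this sketch the error is positive at the US on the first trial, moves to CS onset after conditioning, and becomes negative at the expected US time when the reward is omitted, mirroring the qualitative DA pattern in Figure 1.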

In this work, we propose a large-scale neural circuit model by extending Brown et al.’s [16] model to investigate the phasic activity of not only DA and LHb neurons, but also the extended parts of the network such as the GPb, GPi, and RMTg. In addition to the neural circuit pathways in Brown et al. [16] that control DA signaling (see above), our model also includes pathways from the striosome and the VS to the LHb and one pathway from the LHb to the VTA/SNc via RMTg. These additional pathways are necessary to account for the observed phasic activity of LHb neurons (Figure 2). Further, the pathway from LHb to VTA/SNc via RMTg provides inhibition to the DA neural activity when an expected reward is omitted or when there is an aversive outcome. This interareal connectivity is constrained by currently available knowledge from physiological studies (see below for supporting evidence).

Based on simulation results, our model can account for various experimental observations of phasic activation with rewarding or nonrewarding CS, together with or without reward outcomes. Specifically, the model can account for a shift of VTA/SNc and LHb neuron responses from outcome to CS, which agrees with experiments. In addition, the model can also account for the phasic activity of GPb and RMTg neurons, whose responses are similar to those of LHb neurons. Our model sheds light on the mechanism of VTA/SNc and LHb phasic activity at the neural circuit level, with important roles for the parallel excitatory and inhibitory pathways in the learned responses; namely, (i) the VS-PPTN-VTA/SNc pathway excites DA, while the striosome-VTA/SNc pathway inhibits DA; (ii) the VS-VP-GPb-LHb pathway inhibits LHb, while the striosome-GPi-GPb-LHb pathway excites LHb; and (iii) the LHb-RMTg-VTA/SNc pathway magnifies the phasic activity of VTA/SNc. The model is also rather resilient to overall changes in the interregional connections. Finally, our model predicts that the striosome is important since it may remember the timing of the previous reward and provide the comparison signal with the present reward.

2. Materials and Methods

2.1. Model Architecture

Our proposed neural circuit model is schematically shown in Figure 2, which is an extended version of the model proposed by Brown et al. [16]. Namely, we included the GPb, LHb, and RMTg neural populations into the model based on more recent experimental findings [2, 3, 19, 20]. The details of each part of our model are described as follows.

2.1.1. LHb Inhibits SNc/VTA via RMTg

Most LHb neurons are glutamatergic [22], but experiments showed that LHb inhibits DA neurons. Firstly, in vivo recordings demonstrate that most LHb neurons are excited by a nonreward-predicting cue and are inhibited by a reward-predicting cue when rhesus monkeys perform a visually guided saccade task [5]. The phasic activity of LHb neurons is opposite that of DA neurons in terms of responding to outcome valence; LHb (DA) neurons are excited (inhibited) by nonreward/punishment outcome/cue and inhibited (excited) by reward outcome/cue [5, 8]. Secondly, LHb neurons respond to cues earlier than DA neurons in unrewarded trials [5]. Thirdly, stimulating LHb neurons inhibits DA neurons [21]. The inhibition of DA neurons by the LHb may arise from the direct projection from LHb neurons to inhibitory interneurons in the VTA/SNc [23] or indirectly through some inhibitory nucleus. In fact, experiments have revealed a path from the LHb to DA neurons through the RMTg, and neurons in the RMTg seem to encode aversive stimuli [19, 20]. At the same time, the RMTg transforms the negative reward-prediction error signals of LHb neurons into the positive reward-prediction error signals of DA neurons [3]. For simplicity, we only include the indirect path from LHb to DA neurons via the GABAergic RMTg.

2.1.2. GPb Excites LHb

Low intensity electrical stimulation in GPb can evoke a short latency excitatory response in LHb neurons [21]. The excitation of LHb neurons by GPb neurons may be mediated by acetylcholine or glutamate [2], or by disinhibition through intra-LHb interneurons considering the complex microcircuitry within the GP [2, 24]. In addition, glutamatergic projections to the LHb from entopeduncular neurons in rats and from GPb neurons in nonhuman primates have been observed experimentally [25, 26]. In brief, there are excitatory projections from GPb to LHb which form a pathway from GPb to VTA/SNc via LHb and RMTg [19].

2.1.3. Conjectured Inputs to GPb from GPi

It has been demonstrated that GPb neurons receive input from the striatum, presumably from the striosome [27]. Hong and Hikosaka [21] have observed that typical neurons in the external and internal segments of the globus pallidus (GPe and GPi) are first inhibited by striatal stimulation but GPb neurons are often (but not always) excited or disinhibited by striatal stimulation. They proposed that signals to GPb should be mediated through inhibitory axon collaterals within the striatum [28] or GPe [24]. Based on these observations, we conjecture that the striosome projects to the GPb through the GPi.

2.1.4. VP Inputs to GPb

In Brown et al.’s [16] model, VP neurons are inhibited by the expectation of reward. However, recent experiments observe that the majority of VP neurons are excited by the expectation of a large reward [21]. Thus, VP-LHb connections could possibly be inhibitory [21]. Therefore, we assume that reward-related signals are transmitted to the LHb through excitatory connections from the GPb and inhibitory connections from the VP.

2.1.5. Excitatory Inputs from VS to VP and PPTN

Although VS neurons are usually identified as GABAergic and inhibit downstream neurons, Hong and Hikosaka [21] showed that the striatal (GABAergic) neurons excite PPTN and VP neurons. The excitation by VS neurons can be mediated by substance P [29, 30]. Thus, we assume that VS directly excites PPTN and VP.

2.2. Dynamical Equations, Input-Output Functions, and Numerical Method

We assume neuronal homogeneity within each brain region, such that the firing-rate activity of each neural population within a brain region or nucleus can be described by ordinary differential equations, typically with a decay term plus a term with an input-output function, that is, a firing-rate type model (Wilson and Cowan, 1976; see Mathematics and Equations). Specifically, the neural population firing rate (output) is normalized, ranging from zero to one. The input includes a constant background input to generate the spontaneous baseline firing activity for each neural population (and brain region) and synaptic terms in the form of coupling strengths to provide the interaction between different neural populations (see Mathematics and Equations). Some of the coupling strengths are plastic, that is, they change depending on the presence of reward (see Figure 2). Further modeling details can be obtained from the original model of Brown et al. [16]. The model variables are summarized in Table 1. Parameters are adjusted to fit the observed responses of neurons. Parameter values used for simulations are given in Table 2. In all simulations, numerical integration of the ordinary differential equations was performed with the fourth-order Runge-Kutta method [31] using custom Python code. The code is available upon request.
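To make the numerical scheme concrete, the following is a minimal sketch of one firing-rate unit integrated with the classical fourth-order Runge-Kutta method. It is not the authors' code. The specific form of the equation (a leak term plus a sigmoidal input-output function), the time constant, the gain, the threshold, and all function names are assumptions for illustration; the actual equations are those listed in Mathematics and Equations.

```python
import math


def sigmoid(u, gain=4.0, threshold=0.5):
    """Assumed input-output function, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-gain * (u - threshold)))


def rate_derivative(x, drive, tau=0.1):
    """Assumed rate equation: tau * dx/dt = -x + f(drive)."""
    return (-x + sigmoid(drive)) / tau


def rk4_step(x, t, dt, drive_fn):
    """Advance the rate x from time t to t + dt with classical RK4."""
    k1 = rate_derivative(x, drive_fn(t))
    k2 = rate_derivative(x + 0.5 * dt * k1, drive_fn(t + 0.5 * dt))
    k3 = rate_derivative(x + 0.5 * dt * k2, drive_fn(t + 0.5 * dt))
    k4 = rate_derivative(x + dt * k3, drive_fn(t + dt))
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def simulate(drive_fn, t_end=10.0, dt=0.01, x0=0.0):
    """Integrate one 10 s trial and return the list of rates."""
    n_steps = int(round(t_end / dt))
    rates = [x0]
    for i in range(n_steps):
        rates.append(rk4_step(rates[-1], i * dt, dt, drive_fn))
    return rates


# Constant background drive only: the rate settles at the baseline f(drive).
baseline_trace = simulate(lambda t: 0.2)
```

Because the assumed input-output function is bounded between zero and one, the rate in this sketch stays within the normalized range when started inside it.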

2.3. Simulation Protocol

We simulate 200 trials in one block (Figure 3(a)). Every trial lasts for 10 simulated seconds (Figures 3(b)–3(e)). In each trial, we apply different inputs to simulate different conditions as follows. First, we simulate the first to the 99th trial with rewarding CS and rewarding US: learning trials. The network can associate the rewarding CS with the rewarding US. The 100th trial is a “test” trial and the network receives rewarding CS and nonrewarding US. We then simulate the unexpected reward condition, that is, nonrewarding CS and rewarding US. From the 101st trial to the 199th trial, the network receives nonrewarding CS and nonrewarding US. The network associates the nonrewarding CS with the nonrewarding US. At the 200th trial, the network receives nonrewarding CS but rewarding US. See Figure 3(a) for a summary of the learning protocol.
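The protocol above can be written out as an explicit trial schedule. The sketch below only encodes the trial numbers and conditions stated in the text; the function name and the string labels are hypothetical.

```python
def build_schedule():
    """Return a list of (cs_type, us_type) pairs for trials 1 to 200."""
    schedule = []
    for trial in range(1, 201):
        if trial <= 99:
            condition = ("rewarding", "rewarding")        # learning trials
        elif trial == 100:
            condition = ("rewarding", "nonrewarding")     # reward omission test
        elif trial <= 199:
            condition = ("nonrewarding", "nonrewarding")  # relearning trials
        else:
            condition = ("nonrewarding", "rewarding")     # unexpected reward test
        schedule.append(condition)
    return schedule


schedule = build_schedule()
```

Trial n corresponds to `schedule[n - 1]`, so the two test trials are `schedule[99]` and `schedule[199]`.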

We implement different inputs from the cortex to the VS and striosome based on four conditions: reward CS, nonreward CS, reward US, and nonreward US. The rewarding/nonrewarding CS and US are shown in Figure 3 and their mathematical expressions are given in the Mathematics and Equations. Note that the inputs from the cortex are always larger than zero (firing-rate activity cannot be negative in value).

The motivation for such an implementation is based on some observed lines of evidence. First, neurons in the orbitofrontal cortex fire most strongly for cues that predict large reward (with small penalty) and least strongly for cues that predict large penalty (with small reward) relative to neutral conditions (small reward and small penalty) [32, 33]. Second, cortical neurons, including the frontal cortex, are known to exhibit flexibility and mixed response properties; that is, different cortical neurons could have different responses to identical stimuli [34, 35]. For instance, an identical tone could result in different responses from different cortical neurons which could in turn separately transmit information to the same neural “downstream” (e.g., in the midbrain). Third, the expectation values of cue signaling are stored in the cortex but not in the basal ganglia or LHb [36, 37]. The phasic activity of DA neurons can result in plasticity in the cortex and change the representation of cue signaling [38]. In fact, the activity profiles in Figures 3(d) and 3(e) look similar to that of DA release or nonrelease (as measured, e.g., in voltammetry [39]). Also, the sustained or persistent activity in Figure 3(b) could represent (working) memory of the cue, a commonly observed phenomenon in the frontal cortical neurons [36, 37, 40], while the suppressed activity in Figure 3(c) can be thought of as some inhibitory effect with respect to the response in Figure 3(b).
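The four input conditions can be sketched as a single piecewise function of time. This is only a schematic stand-in for the expressions given in Mathematics and Equations: the CS and US onset times (2 s and 3.6 s) are those used in the simulated trials, but the amplitudes, the pulse duration, the assumption that cue-related activity lasts until US onset, and the function and parameter names are all hypothetical.

```python
T_CS, T_US = 2.0, 3.6  # CS and US onset times in seconds


def cortical_input(t, cs_rewarding, us_rewarding,
                   baseline=0.1, cs_high=0.5, cs_low=0.05,
                   us_pulse=0.5, us_duration=0.2):
    """Hypothetical cortical drive at time t (seconds); always positive."""
    drive = baseline
    if T_CS <= t < T_US:
        # Sustained elevation for a rewarding cue, partial suppression otherwise
        drive = cs_high if cs_rewarding else cs_low
    if us_rewarding and T_US <= t < T_US + us_duration:
        # Brief pulse only when the outcome is rewarding
        drive = baseline + us_pulse
    return drive


# Sample one 10 s trial with rewarding CS and rewarding US at 10 ms resolution.
trace = [cortical_input(i * 0.01, True, True) for i in range(1000)]
```

The design choice illustrated here is that a nonrewarding cue lowers the drive below baseline without reaching zero, so the input remains a valid firing rate.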

3. Results

3.1. Shift of Phasic Response from US to CS

Many experimental and theoretical studies have reported the shift of DA neurons response from US to CS [41–43]. As discussed previously, in the initial phase of learning, DA neurons are phasically activated from the baseline upon the presentation of an unpredicted reward. An accompanying cue is associated with the rewarding outcome through a learning process. After learning, the phasic activity at reward outcome subsequently decreases to baseline, while a phasic activity now appears upon cue onset (Figure 1).

Our simulation can replicate this trend (Figure 4). When the network receives the rewarding CS and rewarding US (during the first 99 trials), DA neurons exhibit phasic activity upon the US in the first trial (Figure 4(a)). In the second and the subsequent trials, the peak appears upon the CS onset and the previous peak activity upon US onset disappears (Figures 4(b) and 4(c)).

The parallel pathways in our model can account for the shift in neural response from US to CS. At the beginning of the learning phase, CS-to-VS synaptic weights and CS input-to-striosomal synaptic weights are very small or near zero. Thus, the activity of the striosome is maintained at baseline level but the activity of VS has a peak upon US onset. The peak activity of VS then propagates to the LHb through the VS-VP-GPb-LHb pathway, which results in a dip of the LHb activity upon US. Meanwhile, a phasic input to DA neurons through the VS-VP-GPb-LHb-RMTg-VTA/SNc pathway and VS-PPTN-VTA/SNc pathway leads to a phasic activity of DA neurons upon reward US. The phasic activity of DA neurons upon reward US in turn enhances the positive reinforcement-learning signal N+ (see (7)), which leads to stronger synaptic strengths of afferent inputs to VS and striosome from the cortex: the increased synaptic strengths WiS and Zij will enhance CS signal pathways from VS to DA via the PPTN (VS-PPTN-VTA/SNc) and VP (VS-VP-GPb-LHb-RMTg-VTA/SNc), the pathway from striosome to DA (striosome-VTA/SNc), and the pathway from striosome to DA via GPb (striosome-GPi-GPb-LHb-RMTg-VTA/SNc).

The striosome in the model has an adaptive timing spectrum, encoding the timing and the amount of reward associated with the CS [16, 44, 45] (see (10)–(14)). Therefore, through the VS-PPTN-VTA/SNc pathway, rewarding CS can trigger phasic activity of DA neurons (Figures 4(a)-4(c)), while nonrewarding CS can trigger a dip in activity (Figures 5(c)-5(d)). The signal of rewarding US through the striosome inhibits DA neurons at the time when the rewarding US is expected to be present, but the excitation of reward US through the VS to VTA/SNc pathway via PPTN cancels the inhibition of the CS, leading to a baseline activity of DA neurons to reward US (Figures 4(c) and 5(a)). On the contrary, nonrewarding US cannot trigger enough excitation to cancel the inhibition caused by CS in DA neurons, leading to a dip in activity upon nonrewarding US onset (Figure 5(b)).

Experimental studies have shown that the phasic activity of LHb is opposite that of DA neurons in terms of response to reward valence, but with a similar shift in activity to DA phasic activity. In our model, LHb neurons are inhibited and show a dip in their activity upon rewarding US onset (Figure 4(d)). The dip of LHb neural activity shifts from US to rewarding CS in the following and subsequent trials (Figures 4(e)-4(f)). As mentioned previously, unexpected rewarding US can switch on the pathways striosome-GPi-GPb-LHb and VS-VP-GPb-LHb. However, before they are switched on, the rewarding US will inhibit LHb neurons through the VS-VP-GPb-LHb pathway (Figure 4(d)). Once the striosome-LHb and VS-LHb pathways are switched on, the reward CS will effectively inhibit LHb neurons through the VS-VP-GPb-LHb pathway, leading to a dip at the time of the rewarding CS. But the inhibition caused by the rewarding US will be canceled by excitation from the striosome-GPi-GPb-LHb pathway leading to a baseline activity of LHb neurons at the time of the rewarding US (Figure 4(f)).

3.2. Neural Pathways Underlying Learned Phasic Activity of DA Neurons

The phasic activity of DA neurons has been suggested to encode reward-prediction error and to play a pivotal role in reinforcement learning [8, 46, 47]. DA neural activity in our model shows reward-prediction error that is consistent with experimental observations (Figure 5(f)). For instance, after 99 trials of training, the network already can associate the rewarding CS with the rewarding US. The DA neurons show a phasic activity upon CS onset (at time 2 s in Figure 5(a)). But at the 100th trial, we simulate the condition where the expected reward is omitted. DA neurons are excited right after CS onset (2 s) and inhibited at US presentation (3.6 s) (Figure 5(b)). The network now reassociates the CS with the nonrewarding US after the training from the 101st to 199th trials. The activity of DA neurons then shows a dip at the time when nonrewarding CS is presented at 2 s and shows baseline activity when the nonrewarding US is presented at 3.6 s (Figure 5(c)). Finally, at the 200th trial, we present both the nonrewarding CS and rewarding US to simulate an unexpected reward condition. DA neurons are inhibited upon CS presentation (2 s) but excited at the time when rewarding US is presented once again (3.6 s) (Figure 5(d)). The overall activity profile of DA neurons is summarized in Figure 5(e), which is consistent with experimental observations (Figure 5(f)).

The above phasic responses of DA neural activity associated with the learned stimuli can be understood based on the two parallel pathways in the circuit: the VS-PPTN-VTA/SNc and the striosome-VTA/SNc pathways. It should be noted that, after the 1st trial, the synaptic strengths WiS and Zij are not zero, so VS responds to both rewarding CS and rewarding US. Then, the DA neurons are excited by the rewarding CS through the VS-PPTN-VTA/SNc pathway. When rewarding US is presented, the signal of rewarding CS triggers the activity of striosomal neurons and directly inhibits DA neurons. However, this inhibition is canceled out by the excitation from rewarding US through the VS-PPTN-VTA/SNc pathway. Thus, the activity of DA neurons is effectively maintained at baseline (Figure 5(a)). By the 99th trial, the network has already associated the rewarding CS with rewarding US.

Now, if the rewarding US is omitted (at the 100th trial), no excitation counterbalances the direct inhibition from the striosome, leading to a dip in the activity of DA neurons (Figure 5(b)). This continues until the 199th trial. When the network is presented with a nonrewarding CS followed by nonrewarding US, the direct inhibitory pathway from striosome to DA neurons has been turned off, DA neurons show a phasic dip upon nonrewarding CS onset, and the activity of DA neurons is maintained at baseline at the time of nonrewarding US (Figure 5(c)). With a subsequently unexpected rewarding US in trial 200, DA neurons are now excited through the VS-PPTN-VTA/SNc pathway while the nonrewarding CS still causes a dip in the activity (Figure 5(d)).
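The cancellation argument above can be reduced to a toy calculation of the DA rate at the time of the US. This is a deliberately simplified sketch, not the model equations: it treats each pathway as either on or off, and the baseline, the gain, and the function and parameter names are hypothetical.

```python
def da_rate_at_us(us_rewarding, reward_expected, baseline=0.3, gain=0.2):
    """Toy DA rate at US time from two opposing pathways (hypothetical values)."""
    # Excitation arrives through the VS-PPTN-VTA/SNc pathway when the US is rewarding
    excitation = 1.0 if us_rewarding else 0.0
    # Inhibition arrives through the striosome-VTA/SNc pathway when reward is expected
    inhibition = 1.0 if reward_expected else 0.0
    return baseline + gain * (excitation - inhibition)


expected_reward = da_rate_at_us(us_rewarding=True, reward_expected=True)
omitted_reward = da_rate_at_us(us_rewarding=False, reward_expected=True)
unexpected_reward = da_rate_at_us(us_rewarding=True, reward_expected=False)
expected_no_reward = da_rate_at_us(us_rewarding=False, reward_expected=False)
```

With equal pathway strengths, the expected reward and the expected nonreward both leave the rate at baseline, the omitted reward produces a dip, and the unexpected reward produces a peak, which is the qualitative pattern in Figures 5(a)-5(d).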

3.3. Neural Pathways Underlying Learned Phasic Activity of LHb Neurons

Experimental studies have shown that phasic activity of LHb behaves in an opposite way to that of DA neurons [5]. Hence, it has been suggested that LHb neurons play a key role in the coding of the aversive/negative signals [48, 49]. Experiments have been carried out to investigate the activity of several brain nuclei, such as GPb [2] and RMTg [3], to explore the possible functional relationship with these brain regions.

Here, we simulate the activity of these nuclei and the results are consistent with the experimental observations. Our simulations show that the phasic responses of LHb neurons shift from US to CS. LHb neurons show a phasic dip when the unexpected rewarding US is presented in the first trial (Figure 4(d)). In the following trials, the dip shifts to the time when the rewarding CS is presented (Figures 4(e)-4(f)), baseline activity is observed with the rewarding US (Figure 6(a)), and a small phasic activity appears upon nonrewarding US (Figure 6(b)). After the training of nonrewarding CS from the 101st to the 199th trials, LHb neurons show a phasic activity upon nonrewarding CS (2 s) while maintaining a baseline level at the time of the nonrewarding US (Figure 6(c)). At the 200th trial, LHb neurons show a peak activity with the nonrewarding CS but a big dip in activity given an unexpected rewarding US (Figure 6(d)). The overall activity profile of LHb neurons (Figure 6(e)) agrees with the experimental observations (Figure 6(f)).

The above-mentioned learned phasic activity of LHb neurons can be explained with two parallel pathways: the striosome-to-LHb pathway via GPi and GPb and the VS-to-LHb pathway via VP and GPb. For instance, at the 99th trial, the synaptic strengths WiS and Zij are not zero, which means that the network has already completely associated the rewarding CS with rewarding US. The rewarding CS can inhibit LHb neurons through the inhibitory striatum-VP-GPb-LHb pathway. When the rewarding US appears, the inhibition through the striatum-VP-GPb-LHb pathway will be canceled out by the excitation from the striosome-GPi-GPb-LHb pathway, resulting in a baseline level of LHb neural activity upon reward delivery. At the 100th trial, LHb neurons show a dip in the presence of the rewarding CS. But the omission of reward implies that the excitation through striosome-GPb-LHb pathway cannot be canceled out, which leads to a small phasic activity of LHb neurons upon reward omission. At the same time, the synaptic strength Zij from the cortex to the striosome decreases to zero. When next the nonrewarding CS is paired with a nonrewarding US (from the 101st to the 199th trial), LHb neurons show a phasic activity at the time of the nonrewarding CS onset because of the inhibition through the striatum-VP-GPb-LHb pathway. In the 200th trial, unexpected rewarding signal switches on the inhibitory pathway striosome-GPb-LHb, which leads to a dip in activity of the LHb neurons.

3.4. Learned Phasic Activity of GPb and RMTg

Experiments have shown that the GPb and RMTg neurons display phasic responses to CS and US. In our model, the interaction between striosome-GPi-GPb pathway and VS-VP-GPb pathway leads to the phasic activity of GPb neurons upon CS and US presentation. In particular, the GPb, LHb, and RMTg are also connected with effectively excitatory synapses (Figure 2), and hence their phasic activities should be correlated with that of the LHb, with the same explanations of activity profiles as for the LHb (Figures 7 and 8). Moreover, the LHb-RMTg-VTA/SNc pathway only magnifies the phasic activity of DA neurons and does not qualitatively change the activity profile of DA neurons.

3.5. Robustness Analysis of Two Parallel Pathways’ Model

Having shown the important role of the parallel circuit pathways in reproducing the phasic activities observed in experiments, we next further investigate the robustness of the phasic activities in our model with respect to connectivity strength variation. Specifically, we increase or decrease all synaptic weights by 10% and monitor how the phasic activities change.

First, we found that the phasic activities of DA and LHb neurons did not change substantially when we increased or decreased the following synaptic weights by 10%: , , , , , , and (data not shown). Second, weights of synapses on the pathway VP-GPb-LHb-RMTg-VTA/SNc were found to influence the tonic baseline activity of DA neurons, which we define as . Hence, we change while maintaining the phasic activity of DA and LHb neurons when we increase or decrease the weights of the synapses along this pathway (see Table 3). In Figures 9 and 10, we show the activity of DA neurons and LHb neurons given three different sets of synaptic weights from VP to GPb and corresponding baseline activities . We can see that DA and LHb neurons continue to demonstrate their characteristic phasic activity profiles. In brief, our neural circuit model is robust to the variation of synaptic weights.
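A sensitivity scan of this kind can be organized by generating one perturbed parameter set per weight and direction and rerunning the simulation on each. The helper below is a sketch of that bookkeeping only; it does not reproduce the analysis or its results. The function name and the weight labels in the example are hypothetical and do not correspond to the symbols in Table 2.

```python
def perturbed_weight_sets(weights, fraction=0.10):
    """Yield (name, factor, new_weights) with one weight scaled down or up.

    The input dictionary is left unchanged; each yielded dictionary differs
    from it in exactly one entry.
    """
    for name in weights:
        for factor in (1.0 - fraction, 1.0 + fraction):
            changed = dict(weights)
            changed[name] = weights[name] * factor
            yield name, factor, changed


# Hypothetical labels for two connections, used only to show the call pattern.
example_weights = {"w_vs_to_pptn": 1.0, "w_striosome_to_da": 2.0}
variants = list(perturbed_weight_sets(example_weights))
```

Each element of `variants` would then be passed to the circuit simulation, and the resulting DA and LHb traces compared with the unperturbed ones.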

4. Discussion

We extended a previous neural circuit model [16] by incorporating the nuclei GPb, LHb, and RMTg, and the model could account for various experimental data from separate works. Specifically, the model could exhibit the shift of DA and LHb neural responses from US to CS presentation times. Our simulations also replicated the phasic activity of DA, LHb, GPb, and RMTg neurons observed in experiments. The DA (LHb) neurons exhibited a phasic peak (dip) upon reward CS and maintenance of baseline activity in response to a rewarding outcome but a phasic dip (peak) if the reward is omitted. By contrast, the DA (LHb) neurons exhibited a phasic dip (peak) in response to a nonrewarding CS or punishment CS and maintenance of baseline activity in response to the nonrewarding US, but a phasic peak (dip) if a reward occurs or the aversive US is omitted. The acquired responses of GPb and RMTg neurons are similar to that of LHb neurons. These acquired responses are consistent with experimental data [2, 3, 5, 8] and behavioral experiments [50].

Our model provides insights into the neural circuit mechanism of DA and LHb phasic activity. In particular, parallel excitatory and inhibitory pathways underlie the learned responses: striatum-to-PPTN-to-VTA/SNc pathway excites DA, while striosome-VTA/SNc pathway inhibits DA; striatum-to-VP-to-GPb-to-LHb pathway inhibits LHb, while striosome-to-GPb-to-LHb pathway excites LHb; LHb-to-RMTg-to-VTA/SNc pathway magnifies the phasic activity of DA. Under different task conditions, we apply different CS and US inputs to the model. The model has a feedback loop in which DA can modulate the corticostriatal synapses and the corticostriosome synapses. This will in turn affect the DA responses, closing the loop. After learning, the weights of these synapses stabilize and remain unchanged. This led to the emergent phasic activity profiles of the nuclei in the circuit, with the parallel pathways balancing out one another. In addition, we found striosome to be a key brain nucleus which remembers the timing of previous rewards and encodes the predicted rewards. In fact, there are recent experimental works [51] that support our model prediction.

In our model, we predict neurons in the striosome to encode expected reward, but there are alternative theories. For example, Cohen et al. [52] found that there were three types of VTA neurons and VTA GABAergic neurons may signal expected reward, which could be a key variable for dopaminergic neurons to calculate reward-prediction error. Recent works [53–55] highlight the importance of VTA GABAergic neurons. Averbeck and Costa [56] proposed that the amygdala can learn and represent expected values like the striatum, and they predicted that the amygdala may play a central role in reinforcement learning and the ventral striatum may play less of a primary role. Wagner et al. [57] suggested that the cerebellar granule cells may encode the expectation of reward. Luo et al. [58], Li et al. [59], and Hayashi et al. [60] found that serotonin neurons in the dorsal raphe nucleus can encode reward signals. Some physiological and theoretical works [17, 18, 61–63] focus on D1 and D2 receptors in the ventral striatum and suggested that they play an important role in computing reward-prediction error. Future neural circuit modeling effort would need to incorporate such findings.

To obtain results consistent with experiments, we have adopted several assumptions. First, we assumed that the striatal neurons excite the PPTN and ventral pallidum. Striatal neurons are usually identified as GABAergic and inhibitory, but they may excite downstream neurons through a disinhibitory effect or through substance P released by striatal neurons [29, 30]. In fact, it has been demonstrated that substance P mediates the excitatory interaction between striatal neurons and VP neurons [29] and among striatal projection neurons [30]. Second, we hypothesized that the striosome projects to the GPi which in turn projects to the GPb. Although we have no direct evidence, Hong and Hikosaka [21] have observed that typical GPe and GPi neurons are first inhibited by striatal stimulation and GPb neurons are often (but not always) excited by striatal stimulation. They proposed that inputs to GPb were mediated through inhibitory axon collaterals within the striatum [28] or GPe [24].

While developing the model, we have tried to add minimal features to the previous model of Brown et al. [16]. Hence, it is worth noting that we have ignored several factors to simplify the model. Specifically, we ignored the connections between some brain nuclei, such as the cortex-to-GPb [2], VP-to-RMTg [3], LHb-to-LHb, cortex-to-LHb [48], and DA-to-striatum [64] pathways. We also did not consider the direct LHb-to-VTA [65] and VTA-to-LHb [66] connections in our simulation, but we mimicked the overall inhibition of LHb on VTA. We have also ignored the different types of activity of many brain nuclei. For instance, studies have suggested three types of GPb neurons: reward-positive type, reward-negative type, and direction selective type [2]. Our model only considers the reward-negative type since the majority of GPb neurons are of the reward-negative type and this type of neuron may play a key role in reward-related information transmission.

Despite these assumptions, our neural circuit model can still implement the computation for reward-based phasic signaling and reinforcement learning, as observed in a variety of experiments. The phasic activities in multiple brain regions represent prediction error signals, which not only associate the cue with the outcome but also memorize the specific time interval between the two. This requires the neural system to hold the information predicted by the cue, compare that information with the outcome, and report the result of the comparison. In our model, the timing spectrum of the striosome and the parallel excitatory and inhibitory pathways provide the platform for such computation. The peak activities of DA and LHb neurons play complementary roles, encoding reward and nonreward/punishment information separately and alleviating any flooring (limiting) effect of the dip in activity of either neuron type. Our novel neural circuit model with parallel pathways provides an instantiation of such complex neural computation.

5. Mathematics and Equations

This section lists the mathematical equations of the model (Figure 2). We give the model circuit different inputs to simulate different conditions. We use differential equations to simulate the firing rates (or the activity levels) of the neurons in different brain nuclei. The model variables are summarized in Table 1, the fixed parameters are summarized in Table 2, and the mathematical expressions are below.

(i) Different Inputs in Each Trial (Figure 2). The cortex, especially the orbitofrontal cortex (OFC), encodes the expectation of future outcomes, and its response reflects the value conveyed by the combination of reward and punishment of the cue [36, 37]. Furthermore, OFC neurons fire most strongly for cues that predict large reward or small penalty and least strongly for cues that predict large penalty or small reward, relative to neutral conditions [32, 33]. Therefore, we set a larger value for the rewarding cue and a smaller but positive value for the nonrewarding cue, as follows.

Reward CS input is as follows:

We set backgroundIC = 0.30 and .

When the network receives a reward CS, the inputs from the cortex increase abruptly and last until the time when the expected reward should be given. Then, the inputs decay exponentially to baseline activity level.
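The time course described above can be illustrated with a short Python sketch. It is not the model's own equation: the additive form, the function name `cs_input`, the amplitude, and the decay time constant `tau` are assumptions made for illustration, and only the background level of 0.30 is taken from the text.

```python
import math

def cs_input(t, cs_onset, reward_time, amplitude, background=0.30, tau=0.1):
    """Cortical input at time t (in seconds) for a single CS presentation.

    Hypothetical shape: baseline before the cue, a plateau from cue onset
    until the expected reward time, then an exponential return to baseline.
    `amplitude` and `tau` are illustrative values, not the paper's parameters.
    """
    if t < cs_onset:
        return background                      # before the cue: baseline only
    if t <= reward_time:
        return background + amplitude          # abrupt rise, held until reward time
    # after the expected reward time: exponential decay back to baseline
    return background + amplitude * math.exp(-(t - reward_time) / tau)

# Sample the input every 0.5 s for a cue at 0.5 s and an expected reward at 2.0 s.
samples = [round(cs_input(0.5 * k, 0.5, 2.0, 1.0), 3) for k in range(7)]
```

Under the same assumptions, a nonreward CS could be represented by passing a smaller positive `amplitude`, in line with the choice of a smaller but positive value for the nonrewarding cue.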

Nonreward CS input is as follows:

Reward US input is as follows:

We set .

When the network receives a reward US, the inputs from the lateral hypothalamus increase abruptly and last for a very short duration. Then, the inputs decay exponentially to baseline activity level.

Nonreward US input is as follows:

If the network does not get a reward or gets a nonreward (aversion or punishment), we assume that the inputs in this trial do not change and remain at baseline level.

(ii) Differential Equations. First, the changes in the activation level of ventral striatal cells are governed by [16]

The activity level of striatal cells changes through passive decay and excitation from CS inputs and US inputs. The weight is fixed while the weight can be changed.

The weight is governed by [17, 18]

The synaptic weight changes are induced by the phasic dopamine burst or dip signals (defined in (7) and (8)). Learning is gated by the delayed release of a second messenger and by a calcium signal, governed by (9) and (11), at a rate .
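As a hedged illustration of this learning rule, the Python sketch below splits a dopamine signal into its burst and dip components (the parts above and below baseline) and changes a weight only while the synapse is eligible. The bounded update form, the learning rates, and all function and parameter names are assumptions made for illustration; they do not reproduce the model's equations.

```python
def dopamine_signals(da, baseline):
    """Split dopamine activity into rectified burst and dip components."""
    burst = max(da - baseline, 0.0)   # phasic fluctuation above baseline
    dip = max(baseline - da, 0.0)     # phasic fluctuation below baseline
    return burst, dip

def update_weight(w, eligible, burst, dip,
                  lr_up=0.5, lr_down=0.5, w_max=1.0, dt=0.001):
    """One Euler step of a hypothetical gated weight update.

    The weight moves only while `eligible` is True (standing in for the
    calcium-gated eligibility window). Bursts push the weight towards
    `w_max`; dips push it towards zero. All rates are illustrative.
    """
    if not eligible:
        return w
    dw = lr_up * burst * (w_max - w) - lr_down * dip * w
    return w + dt * dw

# A burst during the eligibility window strengthens the weight slightly.
burst, dip = dopamine_signals(da=0.5, baseline=0.3)
w_new = update_weight(0.4, True, burst, dip)
```

Keeping the burst and dip as separate non-negative quantities lets potentiation and depression use different rates, which is one simple way to express complementary positive and negative reinforcement signals.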

The positive reinforcement-learning signal derives from excitatory phasic fluctuations of the dopamine signal above the baseline:

The complementary negative reinforcement-learning signal derives from inhibitory phasic fluctuations of the dopamine signal below baseline:

Second, striosomes play an important role in the phasic activities of DA neurons and LHb neurons because of their timing spectrum mechanism: a spectrum of striosomal MSPN second messenger activities responds to the th input at rates :

where the second messenger buildup rates are given by

The activities induce intracellular calcium dynamics within a given spine at delays determined by . The intracellular calcium spike is represented by the quantity , where

In (11), is a step function:

In the brief interval when the calcium concentration at a particular spine exceeds a threshold activity , the CS-striosomal weight at that particular spine becomes eligible for change that may be induced by dopaminergic bursts or dips .

Third, the changes in the activity level of the PPTN are described by the following differential equations:

where and can be regarded as the effects of substance P and GABA on the PPTN. Ventral striatal neurons can secrete substance P and GABA. Substance P excites the downstream neurons, while GABA inhibits them; denotes the net effect of substance P and GABA. We believe that this explanation is more realistic, but more physiological experiments are needed to verify it. The changes in the activity level of PPTN neurons depend on the background inputs, its decay, and the net effect from the striatum.

Fourth, the changes in the activity level of the ventral pallidum (VP) are described by the following differential equations:

where

The explanation is similar to that of (15)–(18). The changes in the activity level of VP neurons result from the background inputs, its decay, and the net effect from the striatum.

Fifth, changes in the activity level of GPb neurons are described by the following differential equation:

The changes in the activity level of GPb neurons are determined by the background inputs, its decay, the inhibitory effect from VP neurons, and the disinhibitory input from striosomes.

Sixth, changes in the level of LHb neural activity are described by the following differential equation:

The changes in the activity level of LHb neurons result from the background inputs, its decay, and the excitatory input from the GPb.

Seventh, changes in the activity level of RMTg neurons are described by the following differential equation:

The changes in the activity level of RMTg neurons depend on the background inputs, its decay, and the excitatory input from the LHb.

Eighth, changes in the activity level of dopaminergic neurons are described by the following differential equation:

The changes in the activity level of dopaminergic neurons depend on the background inputs, its decay, the inhibitory effect from RMTg neurons and striosomes, and the excitatory input from the PPTN.
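The qualitative behaviour of the GPb-to-LHb-to-RMTg-to-DA chain described above can be sketched with generic leaky firing-rate units integrated by the forward Euler method. This is a minimal sketch under stated assumptions, not the model's equations: the linear form of each unit, the decay rate, background drive, connection weights, constant PPTN drive, and GPb pulse amplitude are all hypothetical values chosen only to show how a transient GPb excitation produces an LHb and RMTg increase followed by a dopamine dip.

```python
def simulate_chain(t_end=2.0, dt=0.001, pulse=(1.0, 1.2)):
    """Euler simulation of a hypothetical GPb -> LHb -> RMTg -| DA chain.

    Each unit follows dx/dt = -decay * x + background + net input.
    All parameter values are illustrative and not taken from the paper.
    """
    decay, background = 10.0, 3.0   # passive decay rate and tonic drive
    w_exc, w_inh = 5.0, 5.0         # excitatory and inhibitory weights
    pptn = 0.3                      # constant excitatory PPTN drive to DA
    striosome = 0.0                 # no striosomal inhibition in this sketch
    lhb = rmtg = da = 0.0
    trace = {"t": [], "gpb": [], "lhb": [], "rmtg": [], "da": []}
    for k in range(int(round(t_end / dt))):
        t = k * dt
        # GPb is prescribed: baseline activity with a transient excitation
        gpb = 0.8 if pulse[0] <= t < pulse[1] else 0.3
        d_lhb = -decay * lhb + background + w_exc * gpb
        d_rmtg = -decay * rmtg + background + w_exc * lhb
        d_da = (-decay * da + background + w_exc * pptn
                - w_inh * (rmtg + striosome))
        # forward Euler step, with firing rates kept non-negative
        lhb = max(lhb + dt * d_lhb, 0.0)
        rmtg = max(rmtg + dt * d_rmtg, 0.0)
        da = max(da + dt * d_da, 0.0)
        for name, value in (("t", t), ("gpb", gpb), ("lhb", lhb),
                            ("rmtg", rmtg), ("da", da)):
            trace[name].append(value)
    return trace

trace = simulate_chain()
baseline_da = trace["da"][990]          # activity just before the GPb pulse
dip_depth = baseline_da - min(trace["da"][1000:1500])
```

In this sketch `dip_depth` is positive, meaning that dopamine activity falls below its pre-pulse level after the GPb excitation, while LHb and RMTg activity rise. Raising `striosome` above zero would add the direct striosomal inhibition of dopaminergic neurons mentioned in the text.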

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this work.

Acknowledgments

Da-Hui Wang was supported by NSFC under Grants nos. 31271169 and 31671077, the Fundamental Research Funds for the Central Universities, and BMSTC (Beijing Municipal Science and Technology Commission) under Grant no. Z171100000117007. KongFatt Wong-Lin was supported by BBSRC (BB/P003427/1), COST Action Open Multiscale Systems Medicine (OpenMultiMed) supported by COST (European Cooperation in Science and Technology), and Northern Ireland Functional Brain Mapping Facility (1303/101154803) funded by Invest NI and the University of Ulster. Da-Hui Wang and KongFatt Wong-Lin were also supported by the Royal Society-NSFC International Exchanges Scheme-Cost Share Programme (31511130066, IE141307).