Abstract

We investigate the effect of a memory parameter on the performance of adaptive decision making using a tug-of-war method with the chaotic oscillatory dynamics of a semiconductor laser. We experimentally generate chaotic temporal waveforms of the semiconductor laser with optical feedback and apply them to adaptive decision making for solving a multiarmed bandit problem, which aims at maximizing the total reward from slot machines whose hit probabilities are dynamically switched. We examine how the rate of correct decisions depends on the value of the memory parameter. The adaptivity is found to be enhanced for a smaller memory parameter, whereas the convergence to the correct decision is better for a larger memory parameter. The relations among the adaptivity, the environmental changes, and the difficulty of the problem are also discussed in view of how much of the past decisions is required. This examination of ultrafast adaptive decision making highlights the importance of memorizing past events and paves the way for future photonic intelligence.

1. Introduction

Artificial intelligence based on deep learning, a type of supervised learning, has been rapidly deployed in society. Reinforcement learning is another branch of machine learning, in which an agent learns through trial and error to act in an unknown environment [1, 2]. Its areas of application include, but are not limited to, robotics [3] and computer gaming [4]. The multiarmed bandit (MAB) problem is a fundamental problem in reinforcement learning in which the total reward (e.g., the number of coins) obtained from multiple slot machines with unknown hit probabilities must be maximized [1, 2, 5]. To solve the MAB problem, it is important both to estimate which slot machine has the highest hit probability by playing a series of slot machines (called exploration) and to use this estimate to gain more rewards (called exploitation). A certain amount of exploration is necessary, because insufficient exploration leads to a failure to identify the best slot machine. Conversely, overexploration reduces the opportunity for exploitation and lowers the total reward. This trade-off is known as the exploration-exploitation dilemma [1, 2] in the MAB problem.

A variety of techniques have been proposed to solve the MAB problem, such as the ε-greedy [2], soft-max [2, 6], and upper confidence bound [7] algorithms. One promising approach, the tug-of-war (TOW) method, was proposed by Kim et al. [8, 9]. The TOW method was inspired by the behavior of the unicellular amoeba of the true slime mold [10]. The volume of the amoeba’s body remains constant while the amoeba oscillates its branches to collect environmental information, which realizes nonlocal correlation under fluctuation [9]. These characteristics can be exploited for exploration and exploitation in solving the MAB problem. It has been shown that the TOW method is superior to the soft-max algorithm for adaptive decision making when the hit probabilities of multiple slot machines are changed [8].

Physical implementations of the TOW method have been demonstrated in photonic systems with quantum dots at the nanoscale [11, 12] and with single photons [13]. The nonlocality and fluctuation required for the TOW method result from the physical attributes of the wave nature of an exciton-polariton and a single photon. However, the speed of the fluctuations is limited by the experimental measurement systems to the order of hertz. Recently, ultrafast adaptive decision making based on the TOW method has been proposed using fast chaotic laser outputs in the gigahertz range [14, 15]. Decision making with zero prior knowledge has been achieved at a 1-GHz rate, and sampling temporal waveforms with negative correlation improves the performance of adaptive decision making [14].

For adaptive decision making in which the hit probabilities of multiple slot machines change over time, it is important to determine how much of the knowledge accumulated through past exploration should be incorporated into the present decision. For this purpose, the memory parameter (also known as the forgetting parameter [8]) has been introduced into the TOW method to incorporate past exploration results. Tuning the memory parameter is crucial for adaptive decision making because, for example, past exploration results may become useless after the hit probabilities change. However, no systematic investigation of the effect of the memory parameter on adaptive decision making with the TOW method has been reported.

In this study, we investigate the effect of the memory parameter on the performance of adaptive decision making using the TOW method with the chaotic output of a semiconductor laser. We experimentally generate chaotic temporal waveforms of the semiconductor laser with optical feedback, and we apply them to the TOW method for adaptive decision making. We investigate the performance of decision making for different values of the memory parameter.

2. Tug-of-War Method Using a Chaotic Semiconductor Laser

We consider an MAB problem under the following conditions as a simple case. We assume two slot machines (referred to as slot machines 1 and 2) with unknown hit probabilities (referred to as P1 and P2). A player selects one of the slot machines at each trial, and the player earns or loses the reward (e.g., coins) depending on whether the result of the selected slot machine is “hit” (or WIN) or “miss” (or LOSE), respectively. The player intends to maximize the total reward through a good strategy for selecting between the two slot machines. We assume that the total number of trials for playing the slot machines is fixed. In addition, the sum of the hit probabilities of the two slot machines is assumed to be fixed at 1 (P1 + P2 = 1) as prior knowledge [14].
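To make the setting concrete, the slot-machine environment described above can be sketched in a few lines of Python. The class name and interface are our own for illustration, and pseudorandom draws stand in for the machines' outcomes (the machines are indexed 0 and 1 in the code):

```python
import random

class TwoArmedBandit:
    """Two slot machines whose hit probabilities sum to 1 (prior knowledge)."""

    def __init__(self, p1=0.4, seed=None):
        self.rng = random.Random(seed)
        self.p = [p1, 1.0 - p1]  # P1 + P2 = 1

    def play(self, machine):
        """Play machine 0 or 1; return True for a hit, False for a miss."""
        return self.rng.random() < self.p[machine]

    def flip(self):
        """Swap the two hit probabilities (the environmental change)."""
        self.p.reverse()
```

For instance, with p1 = 0.4, repeated calls to play(1) return a hit on roughly 60% of trials until flip() is called, after which machine 0 becomes the better choice.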

We use chaotic temporal waveforms generated experimentally from a semiconductor laser with optical feedback for decision making. Fast chaotic laser outputs have been used for applications in fast physical random number generation [16–18], secure key distribution [19, 20], and reservoir computing [21, 22]. Figure 1 shows the decision-making scheme based on the TOW method using chaotic temporal waveforms of the semiconductor laser. First, we use a semiconductor laser with an external mirror to generate chaotic laser outputs. Chaotic temporal waveforms of the laser output are measured with a digital oscilloscope, and the waveform is sampled by an analog-to-digital converter (ADC) with 8-bit vertical resolution. Meanwhile, we prepare a “threshold” value to be compared with the sampled data.

We introduce the following rule for selecting one of the two slot machines based on the sampled data of the chaotic temporal waveform: we select slot machine 1 if the sampled data is larger than the threshold, and we select slot machine 2 if the sampled data is smaller than the threshold.

We update the threshold value based on the betting results. For example, if we select slot machine 1 and the result is “hit,” the threshold value decreases so that the probability of selecting slot machine 1 (i.e., the range above the threshold on the chaotic waveform) increases in the next step. On the contrary, the threshold value increases if we select slot machine 1 and the result is “miss,” so that the probability of selecting slot machine 1 decreases. The threshold is shifted in the opposite direction if we select slot machine 2; that is, the threshold increases or decreases if the result is “hit” or “miss,” respectively. After repeating this procedure of selecting one of the slot machines and changing the threshold value, the threshold value saturates at the bottom or the top of the amplitude range of the chaotic temporal waveform, which corresponds to an equilibrium of always selecting slot machine 1 or 2, respectively.

The threshold value, denoted by TH(t), is changed with respect to the chaotic temporal waveform as follows:

TH(t) = Ω k(t),  k(t) = min{N, max{−N, round(TA(t))}},  (1)

where Ω is the width of the threshold step and k(t) is the threshold level at the t-th trial. The threshold level ranges from −N to +N (i.e., the total number of threshold levels is 2N + 1). These two parameters Ω and N limit the range of the threshold adjustment; in this study, they are set so that the 2N + 1 threshold levels cover the amplitude range of the 8-bit chaotic data (2^8 = 256 levels). TA(t) is the threshold adjuster variable, which is determined by the results of the past decisions. The threshold adjuster variable is defined as

TA(t) = D(t) + α TA(t − 1),  (2)

where D(t) is the variable determined by the result of “hit” or “miss” from the selected slot machine, as defined in Table 1. Here, α is the memory (or forgetting) parameter for weighting past threshold adjuster variables, and it ranges from 0 to 1. TA(t) is determined only by the present value of D(t) for α = 0, whereas TA(t) depends on all of the past values of D(t) if α = 1. We investigate the influence of α on the performance of decision making in this study.
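A single decision step of this scheme can be sketched as follows. Table 1 is not reproduced here, so the sign convention for D(t) (a hit on slot machine 1, or a miss on slot machine 2, pushes the threshold down and enlarges the sample range that selects machine 1), the rounding and clipping of TA(t) into a threshold level, and the values Ω = 8 and N = 15 are illustrative assumptions rather than the experimental settings:

```python
def select_machine(sample, threshold):
    """Select slot machine 1 if the chaotic sample exceeds the threshold, else 2."""
    return 1 if sample > threshold else 2

def update_ta(ta, machine, hit, alpha):
    """One step of eq. (2): TA(t) = D(t) + alpha * TA(t-1).

    Assumed sign convention for D(t): -1 for a hit on machine 1 or a miss on
    machine 2 (reinforcing machine 1), +1 otherwise (reinforcing machine 2).
    """
    d = -1.0 if (machine == 1) == hit else 1.0
    return d + alpha * ta

def threshold_from_ta(ta, omega=8, n=15):
    """Quantize TA into one of the 2N + 1 levels of width omega, as in eq. (1);
    the clipping keeps the threshold within the waveform's amplitude range."""
    level = max(-n, min(n, round(ta)))
    return omega * level
```

One trial then consists of calling select_machine on a fresh chaotic sample, playing the selected machine, and feeding the outcome back through update_ta and threshold_from_ta.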

3. Experimental Setup for the Measurement of Chaotic Temporal Waveforms

We conducted a laser experiment to acquire chaotic temporal waveforms of the semiconductor laser output for decision making, as shown in Figure 2. We used a commercial semiconductor laser (NTT Electronics, KELD1C5GAAA, optical wavelength of ~1548 nm) for optical communication. The injection current of the semiconductor laser was set to 58.5 mA (5.0 I_th, where I_th = 11.7 mA is the lasing threshold current). The wavelength of the laser was set to 1547.782 nm. The output light of the semiconductor laser was reflected by an external mirror (Reflector) and fed back to the laser to generate the chaotic fluctuation of the laser output. The feedback light power was set to 210 μW. The chaotic laser output was injected into a photodetector (PD, New Focus 1474A, 35 GHz bandwidth) and converted into an electric signal. The chaotic temporal waveforms were acquired by a high-speed digital oscilloscope (Tektronix, DPO73304D, 33 GHz bandwidth, 100 GigaSamples/s, 8-bit vertical resolution). The temporal waveforms were stored in the memory of the oscilloscope and used for decision making.

Figure 3 shows a temporal waveform of the laser output and the histogram of the amplitude of the temporal waveform. From Figure 3(a), we observe that a fast chaotic irregular oscillation is obtained, with a dominant oscillation period of 0.15 ns. The corresponding dominant frequency of the chaotic oscillation is 6.6 GHz. In Figure 3(b), the histogram shows an asymmetric distribution with a long tail toward negative amplitude values at 8-bit resolution (from −128 to 127). Chaotic temporal waveform data of 5 M points are taken and used for decision making.

We set a sampling time of 10 ps for each decision to demonstrate adaptive decision making at the fastest rate. In our previous work [14], we found that a sampling time of 50 ps shows the best performance; however, this holds only in the case of zero prior knowledge (i.e., when the sum of the two hit probabilities is unknown). Under the assumption of prior knowledge in this work (i.e., P1 + P2 = 1), the effect of the sampling time is not significant, at least for the problem investigated in this study.

4. Decision Making

4.1. Evaluation of Correct Decision Rate

We emulate the TOW method in numerical calculations with the experimentally obtained chaotic temporal waveforms of the laser output (Figure 3). The parameter values used for the computation are summarized in Table 2. The hit probabilities of slot machines 1 and 2 are set to P1 = 0.4 and P2 = 0.6, respectively. These two probabilities are switched every 1000 trials to examine the adaptive characteristics of decision making (flip interval, FI = 1000). For example, P1 = 0.4 and P2 = 0.6 are used for the first 1000 trials, P1 = 0.6 and P2 = 0.4 are used for the second 1000 trials, and P1 = 0.4 and P2 = 0.6 are used again for the third 1000 trials. We define 5000 trials of selecting the slot machines as one cycle, and we run 1000 cycles (C = 1000) to evaluate the average performance of decision making.

We evaluate the correct decision rate (CDR), defined as the ratio of the number of selections of the slot machine with the highest hit probability (best slot machine selection) to the total number of selections. It is given by

CDR(t) = (1/C) Σ_{c=1}^{C} B(c, t),  (3)

where B(c, t) = 1 if the slot machine with the highest hit probability is selected at the t-th trial of the c-th cycle and B(c, t) = 0 otherwise, and C is the number of cycles (C = 1000).
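The evaluation of (3) can be sketched with a self-contained Monte Carlo toy model. Uniform random samples stand in for the chaotic laser waveform, the machines are indexed 0 and 1, and the D(t) sign convention and the threshold normalization are our own assumptions, so the resulting numbers are illustrative rather than a reproduction of the experiment:

```python
import random

def simulate_cdr(alpha, p1=0.4, flip_interval=1000, trials=3000,
                 cycles=200, n_levels=15, seed=0):
    """Return CDR(t): the fraction of cycles in which the machine with the
    higher hit probability was selected at trial t (eq. (3))."""
    rng = random.Random(seed)
    correct = [0] * trials
    for _ in range(cycles):
        p = [p1, 1.0 - p1]              # P1 + P2 = 1 (prior knowledge)
        ta = 0.0
        for t in range(trials):
            if t > 0 and t % flip_interval == 0:
                p.reverse()              # environmental change
            # TA clipped to [-N, N] and normalized to the waveform range [-1, 1]
            th = max(-n_levels, min(n_levels, ta)) / n_levels
            sample = rng.uniform(-1.0, 1.0)   # stand-in for a chaotic sample
            machine = 0 if sample > th else 1
            hit = rng.random() < p[machine]
            # a hit on machine 0 (or a miss on machine 1) lowers the threshold
            d = -1.0 if (machine == 0) == hit else 1.0
            ta = d + alpha * ta          # eq. (2)
            if p[machine] == max(p):
                correct[t] += 1
    return [c / cycles for c in correct]
```

With alpha = 0.99, this toy model reproduces the qualitative behavior discussed below: the CDR climbs close to 1 within each flip interval and drops sharply right after each switching of the probabilities.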

Figure 4(a) shows the temporal evolution of the CDR for two different memory parameters (α = 0.990 and 0.999). With the smaller memory parameter α = 0.990, the value of the CDR quickly approaches 1.0 after the hit probabilities are switched (around t = 1000 and 2000); decisions are made adaptively to the swapping of the hit probabilities P1 and P2 (adaptation to environmental change). In contrast, with the larger memory parameter α = 0.999, the recovery of the CDR value is slow after the switching of the hit probabilities; that is, the adaptivity of decision making is inferior to the former case.

Figure 4(b) shows an enlarged view of Figure 4(a) to investigate the convergence of correct decision making. We find that the value of the CDR fluctuates around 1 for the small memory parameter (α = 0.990), whereas the CDR converges to 1 without fluctuations for the large memory parameter (α = 0.999). This result indicates that correct decisions are made in a stable manner for a larger memory parameter after the transient that follows the change of the hit probabilities.

From Figures 4(a) and 4(b), we observe a trade-off between adaptivity and convergence in the CDR values. For a smaller α, a decision can be made promptly after environmental changes, but it suffers from a certain degree of instability. On the contrary, for a larger α, the adaptation to environmental changes is slow, while correct decisions are made stably once the adaptation is complete. Therefore, adequate tuning of α is important to accomplish the demanded decision-making performance.

Figure 5 shows the time evolution of the threshold adjuster (TA) variable (defined by (2)) for α = 0.990 and α = 0.999. The red lines indicate the TA values of the 1000 cycles at the t-th trial, and the black line indicates the TA value averaged over the 1000 cycles. As can be seen in Figure 5(a) for the small α, the average TA (black line) reacts quickly and saturates soon after the switching of the hit probabilities, leading to fast adaptation. However, the TA values of the individual cycles (red lines) show large deviations, and some TA values exhibit the opposite sign to the average (below or above 0), corresponding to wrong selections of the slot machine. This deviation of the TA values results in the fluctuation of the CDR value after the change of the hit probabilities (Figure 4(b)). In contrast, for the large α in Figure 5(b), the average TA (black line) increases rapidly and monotonically until the swapping of the hit probabilities occurs. As a result, it takes a certain duration to change the sign of the TA value, leading to slow adaptation of decision making. At the same time, it is noteworthy that all of the TA values (red lines) of the 1000 cycles remain either positive or negative, leading to the stable decision making observed in Figure 4(b). Therefore, we understand that the adaptivity and the convergence of decision making are determined by the evolution of the TA values for different α.

To obtain further insight into the effect of α, we derive the following equation from (2):

TA(t) = Σ_{k=0}^{t−1} α^k D(t − k),  (4)

which indicates that the TA value results from the past decisions weighted exponentially by the memory parameter α. Figure 6 shows the time evolution of the TA value described by (4), assuming D(t) = 1 for all t. For α = 0.990, the TA value saturates before 1000 trials; that is, the effect of the past decision results on the present TA value decays rapidly within 1000 trials. On the contrary, for α = 0.999, the TA value increases monotonically over the 1000 trials, and most of the past decision results are included in the present TA value. The memory parameter thus indicates how much of the past results is taken into account in updating the TA value, and (4) (also Figure 6) provides a measure for determining an appropriate value of the memory parameter.
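The role of α in (4) is easy to check numerically. With D(t) = 1 for all t, the sum in (4) is geometric and equals (1 − α^t)/(1 − α), which saturates at 1/(1 − α); the sketch below compares this closed form with the recursion (2) for the two memory parameters used above:

```python
def ta_closed_form(alpha, trials):
    """Eq. (4) with D(t) = 1 for all t: TA(t) = (1 - alpha**t) / (1 - alpha)."""
    return (1.0 - alpha ** trials) / (1.0 - alpha)

def ta_recursion(alpha, trials):
    """The same quantity built step by step from eq. (2): TA(t) = 1 + alpha * TA(t-1)."""
    ta = 0.0
    for _ in range(trials):
        ta = 1.0 + alpha * ta
    return ta
```

For alpha = 0.990, the TA value has essentially reached its saturation level 1/(1 − alpha) = 100 by t = 1000, whereas for alpha = 0.999 it is still far below its limit of 1000 at t = 1000, matching the saturating and monotonically increasing behaviors described above.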

4.2. Evaluation of Average Hit Rate

Next, we investigate the memory parameter that maximizes the total reward of decision making. We introduce the average hit rate (AHR), defined as follows:

AHR = (1/(CT)) Σ_{c=1}^{C} Σ_{t=1}^{T} H(c, t),  (5)

where H(c, t) = 1 if the selected slot machine gives a “hit” and H(c, t) = 0 if it gives a “miss” at the t-th trial of the c-th cycle, and T is the number of trials per cycle (T = 5000). The AHR represents the total reward acquisition rate, that is, the ratio of the number of “hits” (e.g., the number of acquired coins) to the total number of trials and cycles.
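Equation (5) amounts to a plain average of the hit/miss record over all cycles and trials; a minimal sketch (the variable names are our own) is:

```python
def average_hit_rate(hits):
    """AHR = (1/(C*T)) * sum over c and t of H(c, t), where hits is a
    C-by-T nested list with 1 for a hit and 0 for a miss."""
    total = sum(sum(row) for row in hits)   # number of hits over all cycles
    count = sum(len(row) for row in hits)   # C * T
    return total / count
```

For example, two cycles of three trials with outcomes [1, 0, 1] and [0, 0, 1] give an AHR of 3/6 = 0.5.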

Figure 7 shows the AHR as the memory parameter α is continuously changed when P1 = 0.4 and P2 = 0.6 are switched every 1000 trials. The upper limit of the AHR is 0.6 because this is the maximum of the hit probabilities. From Figure 7, we find that there exists an optimal value of the memory parameter α_opt that maximizes the AHR. The maximum AHR is 0.589 at the optimal memory parameter α_opt. For α larger than α_opt, a longer transient appears after the switching of P1 and P2; hence, the AHR is reduced. For α smaller than α_opt, the transient is shortened, but incorrect decisions appear after the transient owing to the insufficient inclusion of past decision results, thereby reducing the AHR.

We investigate the optimal memory parameter α_opt that maximizes the AHR under different conditions of decision making. We first change the flip interval (FI) and investigate its impact on the maximum AHR and the corresponding optimal memory parameter α_opt. Figure 8 shows the maximum AHR and α_opt for different flip intervals. From Figure 8(a), the maximum AHR increases monotonically as the flip interval increases, since the number of “hits” increases during a longer flip interval. From Figure 8(b), the optimal memory parameter α_opt for maximizing the AHR also increases as the flip interval increases. For a longer flip interval, a longer memory is useful for determining the correct decision, because sufficient time remains after the transient that follows the switching of the hit probabilities. Therefore, a larger α_opt is obtained as the flip interval increases.

Next, we change the hit probabilities of the two slot machines. Figure 9 shows the AHRs for three different combinations of the hit probabilities: (i) P1 = 0.8 and P2 = 0.2, (ii) P1 = 0.7 and P2 = 0.3, and (iii) P1 = 0.6 and P2 = 0.4. That is, the difficulty of the given decision-making problem is varied. The maximum hit rates are, by definition, 0.8, 0.7, and 0.6, respectively. We find that an optimal memory parameter α_opt exists for all of these cases; however, the values of α_opt differ. A larger α_opt is observed for case (iii) (α_opt = 0.988), where the difference between the hit probabilities is small. This is reasonable because an increased difficulty of the given problem (i.e., a smaller difference between the hit probabilities) requires more memory of the past results to maximize the reward. In contrast, a smaller α_opt is obtained for case (i) (α_opt = 0.910), where the problem is easier (a large difference between the hit probabilities), because a smaller number of trials yields good decisions. When the hit probabilities of the slot machines are switched, it is important to estimate the correct slot machine quickly and change the decision rather than sticking to the old one. When the difference between the hit probabilities is large, the player often suffers a “miss” just after the switching; hence, the player can rather easily recognize the change of the slot machines, meaning that memorizing past events is less important. In contrast, when the difference between the hit probabilities is small, it is difficult to immediately recognize the change of the slot machines, and a larger memory parameter is needed to figure out the correct decision.

These results show that incorporating past results into decision making via the memory parameter is crucial for improving the decision-making performance. Past decisions and their results are necessary for accurate decision making; however, they should not be weighted too heavily in the presence of environmental changes (e.g., the switching of the hit probabilities). Considering the trade-off between the adaptation speed and the correctness of decision making, configuring the memory parameter appropriately is important to maximize the reward.

5. Conclusions

We investigated adaptive decision making based on the TOW method using the temporal waveforms of a chaotic semiconductor laser. We experimentally generated chaotic temporal waveforms of the semiconductor laser with optical feedback and applied them to adaptive decision making in the MAB problem, aiming at maximizing the total reward from slot machines. We highlighted the requirements of memorizing the past in solving the MAB problem. We examined decision making in an uncertain environment, namely, the problem of choosing one of two slot machines whose hit probabilities are dynamically switched, and evaluated the effect of the memory parameter on the performance of adaptive decision making. Fast adaptation to the change in the hit probabilities can be obtained with a small memory parameter; however, the correct decision rate does not fully converge. In contrast, the correct decision rate converges perfectly for a large memory parameter, but the adaptation is slow. Thus, a trade-off exists between the adaptation speed and the convergence of the correct decision rate. An optimal memory parameter is found to maximize the average hit rate. We found that a larger memory parameter is needed for a longer flip interval (i.e., slower environmental changes) and for a smaller difference between the hit probabilities of the slot machines (i.e., a more difficult decision-making problem).

Decision making using fast chaotic temporal waveforms generated from a semiconductor laser can be used for applications requiring the arbitration of resources in data centers [23] and resource allocation in wireless communications, among others, where decision making at the millisecond level is required [24]. The use of chaotic lasers for decision making leads to a new research field of photonic intelligence.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by Grants-in-Aid for Scientific Research (JSPS KAKENHI Grant Nos. JP16H03878 and JP17H01277) and the Core-to-Core Program, A. Advanced Research Networks, from the Japan Society for the Promotion of Science, and by JST CREST Grant No. JPMJCR17N2, Japan.