This paper discusses adaptation policies for information systems
that are subject to dynamic and stochastic contexts such as mobile
access to multimedia web sites. In our approach, adaptation agents
apply sequential decisional policies under uncertainty. We focus on
the modeling of such decisional processes depending on whether the
context is fully or partially observable. Our case study is a movie
browsing service in a mobile environment that we model by using
Markov decision processes (MDPs) and partially observable MDPs
(POMDPs). We derive adaptation policies for this service that take
into account limited resources such as the network bandwidth. We
further refine these policies according to the partially observable
users' interest level estimated from implicit feedback. Our
theoretical models are validated through numerous simulations.
1. Introduction
Access alternatives to computer services continue to progress,
facilitating our interaction with family, friends, or workplace. These new
access alternatives encompass such a wide range of mobile and distributed devices
that our technological environment becomes truly pervasive. The execution
contexts in which these devices operate are naturally heterogeneous. The
resources offered by wireless networks vary with the number and the position of
connected users. The available memory and the processing power also fluctuate
dynamically. Last but not least, the needs and expectations of users can change
at any instant. As a consequence, there are numerous research projects that aim
to provide modern information systems with adaptation capabilities according to
context variability.
In order to handle highly dynamic contexts, the
approach that we propose in this paper is based on an adaptation agent. The
agent perceives the successive states of the context, thanks to observations,
and carries out adaptation actions. Often, the adaptation approaches proposed
in the literature assume that the contextual data is easy to perceive or at least
that the state of the current context can be identified without
ambiguity. This is called an observable context. In this work, we relax
this hypothesis and therefore deal with partially observable contexts.
Our case study is an information system for browsing
multimedia descriptions of movies on mobile devices. The key idea is to show
how a given adaptation strategy can be refined according to the estimation of
user interest. User interest is clearly not directly observable by the system.
We build upon research on “implicit
feedback” in order to allow the adaptation agent to estimate the user
interest level while interacting with the context [1, 2]. The next section of this
paper reviews important elements of the state of the art and details our
adaptation approach. Next, we introduce the two formalisms used by our model:
Markov decision processes (MDPs) and partially observable MDPs (POMDPs).
The following section presents our case study and establishes the operational
principles of this information system. Thanks to an MDP, we formalize an
adaptation policy for our information system seen as an observable context.
Then we show how to refine this policy according to user interest using a POMDP
(itself refined from an MDP). Various experiments validate this approach and
give a practical view of the behavior of an adaptation agent. We conclude this
paper with some perspectives on this work.
2. Related Work
This section reviews the literature on adaptation to dynamic execution
contexts in order to position our adaptation approach. Adaptive systems
can be categorized according to what they adapt to: the
available resources, the user preferences, or, more generally, the context.
2.1. Resource-Based Adaptation
Given the heterogeneous nature of modern networks and
mobile devices, there is an obvious need for adaptation to limited resources.
Networks' QoS parameters vary in terms of available bandwidth, loss rate, or
latency. The capabilities of the terminal are also very heterogeneous in terms
of memory size, processing power, and display area.
To manage these limitations, one can adapt the content
to be displayed or the access/distribution modalities. When considering content
adaptation, several authors propose classifications [3] in which either the elementary
components of the content (e.g., a medium) or the entire document's
structure is transformed. A medium can thus be transcoded [4], converted into another
modality [5], or summarized [6]. The distribution or the access can also be adapted,
for example, by optimizing the streaming [7] or by
modifying the degree of interactivity of the service.
2.2. User-Aware Adaptation
In addition to adapting to the
available resources, one should also consider adapting an application
according to human factors, which are a matter of user preferences and
satisfaction. Below, we describe three main research directions found
in the literature.
The first research direction consists of selecting the
adaptation mechanisms that maximize the quality of service perceived by
the user. A typical scenario is the choice of the transcoding strategy of a
stream (e.g., a video stream) in order to maximize the perceptual quality given
a limited bandwidth [8]. What is the best parameter to adapt: the size of the
video, its chromatic resolution, or the frame rate? Models have been proposed [9, 10] to assess quality variation
from both technical and user perspectives. They are organized on three distinct
levels: network, media, and content levels. For this line of research, the key
factor for consideration is how variation in objective multimedia quality
affects user perception.
A second active direction is related to user modeling.
Here, the idea is to customize an application by modeling user profiles in
order to recognize them later. For example, adaptive hypermedia contents or
services [11] provide
a user with navigation support for “easier/better learning using an
on-line educational service” or support for “more efficient selling
on an e-commerce site” according to the user profile. Very often, these
systems use data mining techniques to analyze access patterns and discover
interesting relations in usage data [12]. Such knowledge may be useful to recognize profiles and
select the most appropriate modifications to improve content effectiveness.
The third research direction finds its motivation in
the first two. In order to learn a user model or to evaluate the perceptual
impact of a content adaptation solution, it is necessary to either explicitly
ask users for evaluations or to obtain implicit feedback information. Research
on evaluating “implicit feedback” (IF) is attracting growing
interest, since it avoids gathering large collections of explicit
feedback (which is intrusive and expensive) [1]. These IF methods are used in particular to decode
user reactions in information search systems [2]. The idea is to measure the
user interest for a list of query results, in order to adapt the search
function. Among the studied implicit feedback signals one can consider: the
total browsing time, the number of clicks, the scrolling interactions, and some
characteristic sequences of interactions. In our work, we estimate user
interest using IF by interpreting interaction sequences [2, 13]. Moreover, from a metadata
perspective, IF can provide implicit descriptors, such as a user interest descriptor,
as shown in [14].
2.3. Mixing Resources and User-Aware Adaptation
More general adaptation mechanisms can be obtained by
combining resource-based with user-based adaptation. The characteristics of
users and resources are mixed to design an adaptation strategy for a given context.
For example, the streaming of heavy media content can be adapted by prefetching
while considering both user characteristics and resource constraints [15].
For mobile and pervasive systems, the link between
resources and users starts by taking into account the geolocation of the
user, which can be tracked over time and even predicted [16].
In the MPEG-21 digital item adaptation (DIA) standard,
the context descriptors group the network’s and the terminal’s capabilities together with the user’s preferences and the authors’ recommendations to adapt multimedia
productions. Given this complexity, the normative work only proposes
tools for describing the running context as a set of carefully chosen
and extensible descriptors [17]. This metadata-based approach leaves the
design of adaptation components open while enabling a high level of
interoperability [18].
Naturally, the elements of the context vary in time.
Therefore, one speaks of a dynamic context and, by extension, of a dynamic
adaptation. It is important to note that static adaptation to static context
elements is possible as well: the favorite language of a user can be negotiated
once and for all, and always in the same manner, at the moment of access to a
multilingual service. Conversely, the adaptation algorithm itself and/or
its parameters can be dynamically changed according to the context state
[19]. Our adaptation
approach is in line with the latter case.
An important element of research in context adaptation
is also the distinction between the adaptation decision and its effective
implementation [18].
In a pervasive system, one can decide that a document must be transcoded into another
format, but some questions still need to be answered. Is a transcoding
component available? Where can it be found? Should one compose the
transcoding service? To answer these questions, many
authors propose to use machine learning techniques to select the right
decision and/or the appropriate implementation of adaptation mechanisms (see
[20] for a review). In
this case, a description of the running context is given as input to a
decision-making agent that predicts the best adaptation actions according to
what it has previously learned. We extend this idea in line with a
reinforcement learning principle.
We model the context dynamics by a Markov decision
process whose states are completely or partially observable. This approach
provides means to find the optimal decision (adaptation action) according to
the current context. The next section introduces our MDP-based adaptation
approach.
3. Markov Decision Processes: Our Formal Approach
Figure 1 summarizes our adaptation approach that has
been introduced in [21] and is further refined in this article. In this
paper, an adaptation strategy for dynamic contexts is applied by an adaptation
agent. This agent perceives sequentially, over a discrete temporal axis, the
variations of the context through observations.
Figure 1: Context-based adaptation agent.
From its observations, the agent computes the
context state in order to apply an adaptation policy. Such a policy is simply a
function that maps context states to adaptation decisions. The agent therefore
acts on the context when deciding on an adaptation action: it consumes bandwidth,
influences the user's future interactions, and increases or reduces the user's
interest. It is therefore useful to measure its effect by associating a reward
(immediate or delayed) with the adaptation action decided in a given context
state. The agent can thus learn from its interaction with the context and
perform a “trial-and-error” learning called reinforcement learning
[22]. It attempts to
reinforce the actions resulting in a good accumulation of rewards and,
conversely, avoids repeating fruitless decisions. This process represents a
continuous improvement of its “decision policy.”
This dynamic adaptation approach is common to
frameworks of sequential decisional
policies under uncertainty. In these frameworks, the uncertainty comes from
two sources. On the one hand, the dynamics of the context can be random as a
consequence of the variability of available resources (e.g., the bandwidth); on the
other hand, the effect of an agent's decision can itself be random. For
example, if an adaptation action aims to anticipate user interactions, the
prediction quality is obviously uncertain and subject to the user's behavior
variations.
In this situation, by adopting a Markov definition of
the context state, the agent's dynamics can be modeled as a Markov decision
process (MDP). This section introduces this formalism.
We initially assume that the context state variables are
observable by the agent, which is sufficient to identify the
decision state without any ambiguity. This paper takes a step forward by
refining adaptation policies according to user interest. We sequentially
estimate this hidden information through user behavior, as suggested by
research on the evaluation of “implicit feedback.” Therefore, the new
decision-making state contains both observable variables and
a hidden element associated with user interest.
We then move on from an MDP to a partially observable
Markov decision process (POMDP). To the best of our knowledge, the application
of the POMDP to the adaptation problem in partially observable contexts has not
been studied before. To give concrete expression to this original idea, a case
study will be presented in Section 4.
3.1. MDP Definition
An MDP is a stochastic controlled process that assigns
rewards to transitions between states [23]. It is defined as a quintuple $(S, A, T, p, r)$, where $S$ is the state space, $A$ is the action space, $T$ is the discrete temporal axis of instants when
actions are taken, $p(\cdot)$ are the probability distributions of the
transitions between states, and $r(\cdot)$ is a reward function on the transitions. We
find here, in a formal way, the ingredients necessary to understand Figure 1: at
each instant $t \in T$, the agent observes its state $s \in S$,
applies the action $a \in A$ that brings the system (randomly, according to $p(s' \mid s, a)$) to a new state $s'$,
and receives a reward $r(s, a)$.
As previously mentioned, we are looking for the best
policy with respect to the accumulated rewards. A policy $\pi$ is a function that associates an action $a \in A$ with each state $s \in S$.
Our aim is to find the best one: $\pi^{*}$.
The MDP theoretical framework assigns a value
function $V^{\pi}$ to each policy $\pi$.
This value function associates each state $s \in S$ with a global reward $V^{\pi}(s)$,
obtained by applying $\pi$ beginning with $s$.
Such a value function makes it possible to compare policies. A policy $\pi$ outperforms another policy $\pi'$ if
$$\forall s \in S, \quad V^{\pi}(s) \ge V^{\pi'}(s).$$
The expected sum of rewards obtained by applying $\pi$ starting from $s$ is weighted by a parameter $\gamma$ in order to limit the influence of infinitely
distant rewards,
$$V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s, \pi\right], \quad 0 \le \gamma < 1.$$
In brief, for each state, this value function gives
the expected sum of future rewards that can be obtained if the policy $\pi$ is applied from this state on. This value
function makes it possible to formalize the search for the optimal policy $\pi^{*}$, which is the one associated with the best
value function $V^{*} = V^{\pi^{*}}$.
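As a quick worked illustration (not taken from the original experiments), write $\gamma$ for the weighting parameter and $V^{\pi}(s)$ for the value of state $s$ under policy $\pi$, and assume a hypothetical policy that earns the same reward $\bar r$ at every step. The weighted sum is then a geometric series:

```latex
V^{\pi}(s) \;=\; \sum_{t=0}^{\infty} \gamma^{t}\,\bar r \;=\; \frac{\bar r}{1-\gamma},
\qquad 0 \le \gamma < 1 .
```

With $\bar r = 1$ and $\gamma = 0.9$ this gives $V^{\pi}(s) = 10$, which shows how the weighting keeps the sum finite and makes distant rewards count less than immediate ones.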
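Bellman's optimality equations characterize the optimal value function $V^{*}$ and an optimal policy $\pi^{*}$ that can be obtained from it. In the case of
the $\gamma$-weighted criterion and stationary rewards,
they can be written as follows:
$$V^{*}(s) = \max_{a \in A} \Big( r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^{*}(s') \Big),$$
$$\pi^{*}(s) = \arg\max_{a \in A} \Big( r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^{*}(s') \Big).$$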
3.2. Resolution and Reinforcement Learning
When solving
an MDP, we can distinguish between two cases, according to
whether the model is known or unknown. When the model (probabilities $p$) and the rewards $r$ are known, a dynamic
programming solution can be found.
The operator $L$ verifying $V_{k+1} = L V_k$ according to the Bellman equation is a contraction. The Bellman equation in $V^{*}$ can be solved by using a fixed point iterative
method: choose $V_0$ arbitrarily,
then repeatedly apply the operator $L$, which improves the current policy associated with $V_k$.
If the rewards are bounded, the sequence $V_k$ converges to $V^{*}$ and makes it possible to compute $\pi^{*}$.
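The fixed point iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `value_iteration`, the dictionary-based model layout, and the two-state toy model with its numbers are all hypothetical.

```python
def value_iteration(states, actions, p, r, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Fixed point iteration on the Bellman optimality equation.

    p[s][a] is a dict {next_state: probability}; r[s][a] is the expected reward.
    (This data layout is an assumption made for the sketch.)
    """
    def q_value(s, a, V):
        return r[s][a] + gamma * sum(prob * V[s2] for s2, prob in p[s][a].items())

    V = {s: 0.0 for s in states}            # arbitrary starting point V_0
    for _ in range(max_iter):
        V_new = {s: max(q_value(s, a, V) for a in actions) for s in states}
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:                     # successive iterates are close enough
            break
    # Greedy policy with respect to the final value function
    policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}
    return V, policy


# Hypothetical two-state toy model: a reward of 1 is earned for ending in "high".
states = ["low", "high"]
actions = ["stay", "switch"]
p = {
    "low":  {"stay": {"low": 1.0},  "switch": {"high": 1.0}},
    "high": {"stay": {"high": 1.0}, "switch": {"low": 1.0}},
}
r = {
    "low":  {"stay": 0.0, "switch": 1.0},
    "high": {"stay": 1.0, "switch": 0.0},
}
V, policy = value_iteration(states, actions, p, r)
```

In this toy model the greedy policy moves to "high" and stays there, and both state values approach $1/(1-\gamma)$.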
If the model is unknown, we can solve the MDP using a
reinforcement learning algorithm [22]. The reinforcement learning approach aims to find an
optimal policy through iterative estimations of the optimal value function. The Q-learning algorithm is a reinforcement learning method that is able to
solve the Bellman equations for the $\gamma$-weighted criterion. It uses simulations to
iteratively estimate the value function $V^{*}$,
based on the observations of instantaneous transitions and their associated
reward. For this purpose, Puterman [23] introduced a function $Q$
that carries a significance similar to that of $V$ but makes it easier to extract the associated
policy because it no longer needs the transition probabilities. We can
express the “Q-value” as a function of a given policy $\pi$ and its value function,
$$Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^{\pi}(s').$$
Therefore, it is easy to see that, in spite of the
lack of transition probabilities, we can trace back to the optimal
policy,
$$V^{*}(s) = \max_{a \in A} Q^{*}(s, a), \qquad \pi^{*}(s) = \arg\max_{a \in A} Q^{*}(s, a).$$
The principle of the Q-learning algorithm (Algorithm 1) is that, after each observed transition $(s_t, a_t, s_{t+1}, r_t)$, the current value function $Q_t$ for the couple $(s_t, a_t)$ is updated, where $s_t$ represents the current state, $a_t$ the chosen and applied action, $s_{t+1}$ the resulting state, and $r_t$ the immediate reward.
Algorithm 1: The Q-learning algorithm.
In this algorithm, $N$ is an initial parameter that represents the
number of iterations. The learning rate $\alpha_t(s, a)$ is particular to each state-action
pair and decreases toward 0 at each iteration. A simulation function
returns a new state and its associated reward according to
the dynamics of the system. The choice of the current state and of the action
to execute is made by two dedicated selection functions. An initialization function is used to
initialize the values of $Q$ to $Q_0$.
The convergence of this algorithm has been thoroughly
studied and is now well established. We assume the
following.
(i) $S$ and $A$ are finite, and $0 \le \gamma < 1$.
(ii) Each pair $(s, a)$ is visited an infinite number of times.
(iii) $\sum_{t} \alpha_t(s, a) = \infty$ and $\sum_{t} \alpha_t^{2}(s, a) < \infty$.
Under these hypotheses, the function $Q_t$ converges almost surely to $Q^{*}$.
Let us recall that almost-sure convergence means that, for all $(s, a)$, the sequence $Q_t(s, a)$ converges to $Q^{*}(s, a)$ with a probability equal to 1. In practice,
the sequence $\alpha_t(s, a)$ is often defined as follows:
$$\alpha_t(s, a) = \frac{1}{n_t(s, a)},$$
where $n_t(s, a)$ represents the number of times the state $s$ was visited and the decision $a$ was made.
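A tabular Q-learning loop with the visit-count learning rate can be sketched as below. This is a hedged illustration rather than a transcription of Algorithm 1: the names `q_learning` and `simulate`, the uniform random choice of state and action, and the two-state toy dynamics are assumptions introduced here.

```python
import random


def q_learning(states, actions, simulate, n_iter=20_000, gamma=0.9, seed=0):
    """Tabular Q-learning with learning rate 1 / (number of visits of the pair)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}      # initial values
    visits = {(s, a): 0 for s in states for a in actions}
    for _ in range(n_iter):
        # Assumption: state and action are drawn uniformly, so that every pair
        # keeps being visited.
        s = rng.choice(states)
        a = rng.choice(actions)
        s_next, reward = simulate(s, a, rng)                 # one simulated transition
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]
        target = reward + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy


def simulate(s, a, rng):
    """Hypothetical toy dynamics: deterministic moves, reward 1 for ending in 'high'."""
    if a == "stay":
        s_next = s
    else:
        s_next = "high" if s == "low" else "low"
    return s_next, (1.0 if s_next == "high" else 0.0)


Q, policy = q_learning(["low", "high"], ["stay", "switch"], simulate)
```

Note that the loop never reads transition probabilities: only sampled transitions and rewards are used, which is the point of working with $Q$ rather than $V$.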
3.3. Partial Observation and POMDP Definition
In many cases, the observations that a decision agent
is able to capture (see Figure 1) are only partial and do not allow the
identification of the context state without ambiguity. Therefore, a new class
of problems needs to be solved: partially observable Markov decision processes.
The states of the underlying MDP are hidden and only the observation process
helps to recover the current state of the process.
A partially observable Markov decision process
is defined by:
(i) $(S, A, T, p, r)$, the underlying MDP;
(ii) $\Omega$, a set of observations;
(iii) $O$, an observation function that maps every state $s$ to a probability distribution on the
observations' space. The probability of observing $o \in \Omega$ knowing the agent's state $s$ will be referred to as follows: $O(s, o) = P(o_t = o \mid s_t = s)$.
Non-Markovian Behavior
It is worth noting that, in this new model, we lose
a property widely used for the resolution of MDPs, namely that the
observation process is Markovian. The probability of the next observation $o_{t+1}$ may depend not only on the current observation
and action taken, but also on previous observations and
actions,
$$P(o_{t+1} \mid o_t, a_t) \ne P(o_{t+1} \mid o_t, a_t, o_{t-1}, a_{t-1}, \ldots).$$
Stochastic Policy
It has been proved that the convergence results obtained for $V$ and $Q$ using MDP resolution algorithms
are no longer applicable. POMDPs require the use of stochastic policies
rather than the deterministic ones used for MDPs [24].
3.4. Resolution
The classic POMDP methods attempt to reduce the
resolution problem to that of the underlying MDP. Two situations are possible. If the
MDP model is known, one cannot determine the exact state of the system but only a
probability distribution on the set of possible states (a belief state).
In the second situation, without knowing the model parameters, the agent
attempts to construct the MDP model relying only on the history of observations.
Our experimental test bed uses the resolution software
package provided by Cassandra et al. [25] that works in the potentially infinite space of
belief states using linear programming methods.
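When the model is known, a belief state can be maintained with a standard Bayesian update: predict with the transition probabilities, then reweight by the probability of the received observation. The sketch below illustrates that generic update only; it is not the solver of [25], and the function name `update_belief`, the data layout, and the toy numbers are hypothetical.

```python
def update_belief(belief, action, observation, p, obs_prob):
    """One Bayesian belief update for a POMDP with state-dependent observations.

    belief:   {state: probability}
    p[s][a]:  {next_state: probability}          (transition model)
    obs_prob[s]: {observation: probability}      (observation function)
    """
    # Prediction step: push the belief through the transition model.
    predicted = {}
    for s, b in belief.items():
        for s2, prob in p[s][action].items():
            predicted[s2] = predicted.get(s2, 0.0) + b * prob
    # Correction step: weight each state by the likelihood of the observation.
    weighted = {s2: obs_prob[s2].get(observation, 0.0) * w
                for s2, w in predicted.items()}
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s2: w / total for s2, w in weighted.items()}


# Hypothetical two-state model with a single action and two observations.
p = {
    "low":  {"show": {"low": 0.8, "high": 0.2}},
    "high": {"show": {"low": 0.3, "high": 0.7}},
}
obs_prob = {
    "low":  {"stop": 0.7, "play": 0.3},
    "high": {"stop": 0.2, "play": 0.8},
}
new_belief = update_belief({"low": 0.5, "high": 0.5}, "show", "play", p, obs_prob)
```

Starting from a uniform belief, observing "play" shifts the probability mass toward the "high" state, which is the kind of evidence accumulation a belief-state method relies on.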
4. Case Study: A Movie Presentation System for Mobile Terminals
We introduce here a system for browsing movie
descriptions on mobile devices. For this system, our strategy aims to adapt the
presentation of a multimedia content (i.e., movie description) and not to
transform the media itself. This case study is intended to be both simple and
pedagogical, while integrating a degree of realistic interactivity.
4.1. Interactive Access to a Movie Database
Figure 2 introduces an information system accessible
from mobile terminals such as PDAs. A keyword search allows the user to obtain
an ordered list of links to various movie descriptions. Within this list, the
user can follow a link toward an interesting movie (the associated interaction
will be referred to as clickMovie);
then, he or she can consult details regarding the movie in question. This
consultation will call on a full screen interactive presentation and a
navigation scenario detailed below. Having browsed the details for one movie,
the user is able to come back to the list of query results (interaction back in Figure 2). It is then possible to
access the description of a second interesting film. The index of the accessed
movie description will be referred to as $n$.
Figure 2: Information system of movie descriptions.
To simplify the context modeling, we choose to
consider the browsing sequence indexed by $n$.
Our problem then becomes that of adapting the content (movie descriptions)
presented during this sequence. Our execution environment is dynamic because of
the variability of the bandwidth ($bw$), a very frequent problem in
mobile networks. For the sake of simplicity, we do not take into account other
important parameters of mobile terminals such as signal strength, user
mobility, and power constraints.
As we consider the browsing session at a high level,
we do not need to specify the final goal of the
service, which can be renting/buying a DVD, downloading a medium, and so forth.
Properly managing the download or the streaming of the whole media is a
separate problem and is not considered here.
4.2. From the Simplest to the Richest Descriptions
To present the details of a movie, three forms of
descriptions are possible (see Figure 3). The poor “textual” version
(referred to as $T$) groups together a small poster image, a
short text description, and links pointing to more production photos as well as
a link to the video trailer. The intermediary version ($I$) provides a slideshow of still photos and a
link to the trailer. The richest version ($V$) includes, in addition, the video trailer.
Figure 3: Basic (T), intermediary (I), and rich (V) versions of movie details.
As the available bandwidth ($bw$) is variable, the usage of the three versions
is not equivalent. The bandwidth required to download the content increases
with the complexity of the versions ($T < I < V$). In other words, for a given bandwidth, the
latencies perceived by the user during the download of the different versions
grow proportionally with the size of the content.
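The proportionality between content size and perceived latency can be made concrete with a small calculation. The sizes below are purely hypothetical placeholders (the paper does not give them), and the function name `download_latency` is introduced only for this sketch, which assumes a constant bandwidth during the download.

```python
# Hypothetical content sizes in kilobytes for the three versions (assumed values).
SIZES_KB = {"T": 50, "I": 500, "V": 5000}


def download_latency(version, bandwidth_kbps):
    """Seconds needed to download a version at a constant bandwidth in kilobits/s."""
    return SIZES_KB[version] * 8 / bandwidth_kbps


# At a fixed bandwidth, latency scales with the size of the chosen version.
latencies = {v: download_latency(v, 400) for v in ("T", "I", "V")}
```

Under these assumed sizes, a tenfold increase in content size translates into a tenfold increase in waiting time at the same bandwidth.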
More precisely, we now point out two problems
caused by the absence of dynamic adaptation of the content when the
available bandwidth varies. The adaptation strategy could systematically select
only one of the three possible alternatives mentioned above. If it always
selects the richest version ($V$), this impacts the behavior of a user who
experiences bad network conditions (low bandwidth). Although strong latencies
could be tolerated while browsing the first query results (small index $n$), they quickly become unacceptable as $n$ grows. If the adaptation strategy systematically selects
the simplest version ($T$), this would also have a harmful impact on
the behavior of the user. Despite the links toward the other resources, ($I$)mages and ($V$)ideo, the lack of these visual components,
which normally stimulate interest, will not encourage further browsing. An
important and legitimate question is therefore what can be called an
“appropriate” adaptation policy.
4.3. Properties of Appropriate Adaptation Policies
The two aforementioned examples of policies (one
“too ambitious,” the other “too modest”) show how
complex the relationship is among the versions, the number of browsed films,
the time spent on the service, the quality of service, the available bandwidth,
and the user interest. An in-depth analysis of these relationships can
represent a research project in itself. We do not claim to deliver such an
analysis in this paper, but we simply want to show how a policy and an
adaptation agent can be generated automatically from a model where the context
state is observable or partially observable.
Three properties of a good adaptation policy can be
identified as follows.
(1) The version chosen for presenting the content
must be simplified if the available bandwidth $bw$ decreases ($T$ is simpler than $I$,
itself simpler than $V$).
(2) The version must be simplified if $n$ increases: it is straightforward to choose
rich versions for the first browsed movie descriptions, which are normally the
most pertinent ones (as we have already mentioned,
we should avoid large latencies for big values of $n$ and small $bw$).
(3) The version must be enriched if the user shows
a high interest for the query results. The simple underlying idea is that a
very interested user is more likely to be patient and to tolerate more easily
large downloading latencies.
The first two properties are related to the variation
of the context parameters that we consider observable ($n$ and $bw$), while the third one is related to a hidden
element, namely, user interest. At this stage, given these three properties, an
adaptation policy for our case study can be expressed as the selection of the
version ($T$, $I$, or $V$) knowing $n$ and $bw$ and having a way to estimate the interest.
4.4. On Navigation Scenarios
This section introduces, by way of examples, some possible
navigation scenarios. Figure 4 illustrates different possible steps during
navigation and introduces different events that are tracked. In this figure,
the user chooses a film (event clickMovie), and the presentation in version T
is downloaded (event pageLoad)
without the user interrupting this download. Interested in this film, the user
requests the production photos, following the link toward the pictures (event linkI). In one case, the downloading
seems too long and the user interrupts it (event stopDwl, short for stopDownload), then returns
to the movie list (event back).
In the other case, the user waits for the downloading of the pictures to
finish, then starts viewing the slideshow (event startSlide). Either this slideshow is
shown completely and then an event EI (short for EndImages) is raised, or the visualization is incomplete, leading to
the event stopSlide (not represented
in the figure). Next, the link to the trailer can be followed (event linkV); here again an impatient user can
interrupt the downloading (stopDwl) or start playing the video (play). Then the video can be watched completely (event EV for EndVideo) or stopped (stopVideo),
before a return (event back).
Figure 4: Example of navigations and interactions.
Obviously, this example does not introduce all the
possibilities, especially if the video is not downloaded but streamed.
Streaming scenarios introduce different challenges and require a playout buffer
that enriches the set of possible interactions
(e.g., stopBuffering). Moreover,
the user may choose not to interact with the proposed media: we introduce the
sequence of events pageLoad, noInt (no interaction), back. Similarly, a back is possible just after a pageLoad, a stopDwl may occur immediately after the
event clickMovie, and watching the
video before the pictures is also possible.
5. Problem Statement
5.1. Rewards for Well-Chosen Adaptation Policies
From the previous example and the definitions of the
associated interactions, it is possible to propose a simple mechanism aiming at
rewarding a pertinent adaptation policy. A version ($T$, $I$,
or $V$) is considered well chosen in a given
context if it is not questioned by the user. The reassessment of a version $T$ as being too simple is suggested, for example,
by the full consumption of the pictures. In the same way, the reassessment of a
version $V$ as being too rich is indicated by a partial
consumption of the downloaded video. Four simple principles guide our
rewarding system.
(i) We reward the event EI for versions $I$ and $V$.
(ii) We reward the event EV if the chosen version was $V$.
(iii) We penalize the arrival of interruption
events (“stops”).
(iv) We favor the simpler versions for no or little
interaction.
Thus, a version $T$ is sufficient if the user does not request (or
at least does not completely consume) the pictures.
A version $I$ is preferable if the user is interested enough
and has access to enough resources to download and view the set of pictures
(reward EI). Similarly, a
version $I$ is adopted if the user views all the pictures
(reward EI) and, trying to
download the video, is forced to interrupt it because of limited bandwidth.
Finally, a rich version $V$ is adopted if the user is in good condition to
consume the video completely (reward EV). The following decision-making models
formalize these principles.
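The four rewarding principles can be written down as a small reward function. This is only a sketch: the numeric magnitudes, the choice to attach principle (iv) to the noInt event, and the name `reward` are assumptions made for illustration; the paper's actual reward values are not given in this section.

```python
STOP_EVENTS = {"stopDwl", "stopSlide", "stopVideo"}

# Hypothetical bonus favoring simpler versions when the user does not interact.
SIMPLICITY_BONUS = {"T": 2.0, "I": 1.0, "V": 0.0}


def reward(version, event):
    """Reward for observing `event` while version T, I, or V is presented."""
    if event == "EI" and version in ("I", "V"):
        return 5.0                          # principle (i): pictures fully viewed
    if event == "EV" and version == "V":
        return 5.0                          # principle (ii): video fully watched
    if event in STOP_EVENTS:
        return -5.0                         # principle (iii): interruption
    if event == "noInt":
        return SIMPLICITY_BONUS[version]    # principle (iv): little interaction
    return 0.0                              # other events are neutral in this sketch
```

For instance, a fully viewed slideshow earns nothing when the presented version was T, since in that case the event suggests that T was too simple.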
5.2. Toward an Implicit Measure of the Interest
The previously introduced navigations and interactions
make it possible to estimate the interest of the user. We proceed by evaluating
“implicit feedback” and use the sequences of events to estimate the
user's interest level. Our approach is inspired by [26] and is based on the two
following ideas.
The first idea is to identify two types of
interactions according to what they suggest: either an increasing interest (linkI, linkV, startSlide, play, EI, EV) or a decreasing interest (stopSlide, stopVideo, stopDwl, noInt). Therefore, the event distribution
(seen as the probability of occurrence) depends on the user's
interest in the browsed movie.
The second idea is to consider not only a single
running event to update the estimation of user interest but also to regard an
entire sequence of events as being more significant. In fact, it has been
recently established that the user actions on a response page to a search
(e.g., on Google) depend not only on the relevance of the current response but
also on the global relevance of the set of the query results [2].
Following the work of [26], it is natural to model the
sequences of events or observations as produced by a hidden Markov model (HMM),
whose definition we do not detail here (e.g., see [27]). One can simply translate
the two previous ideas by using an HMM with several (hidden) states of
interest. The three states of interest shown in Figure 5 are referred to as S, M,
and B, respectively, for a small, medium, or big interest. The three
distributions of observable events in every state are different as stressed in
the first idea mentioned above. These differences explain the occurrences of
different sequences of observations in terms of sequential interest evolutions
(second idea). These evolutions are encoded by transition probabilities
(dotted) between hidden states of interest. Given a sequence of observations,
an HMM can thus provide the most likely underlying sequence of hidden states or
the most likely running hidden state. At this point, the characteristics of our
information system are rich enough to define an adaptation agent applying
decision policies under uncertainty. These policies can be formalized in the
framework presented in Section 3.1.
Figure 5: A hidden Markov model.
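The estimation of the most likely current interest state from a sequence of events can be sketched with HMM forward filtering. Everything numeric here is hypothetical: the transition matrix, the per-state probability of an "increasing interest" event, and the simplification that groups events into only two classes are assumptions for illustration, not the parameters of the model in Figure 5.

```python
INCREASING = {"linkI", "linkV", "startSlide", "play", "EI", "EV"}
DECREASING = {"stopSlide", "stopVideo", "stopDwl", "noInt"}
STATES = ("S", "M", "B")    # small, medium, big interest

# Hypothetical transition probabilities between hidden interest states.
TRANSITION = {
    "S": {"S": 0.8, "M": 0.2, "B": 0.0},
    "M": {"S": 0.1, "M": 0.8, "B": 0.1},
    "B": {"S": 0.0, "M": 0.2, "B": 0.8},
}
# Hypothetical probability that an informative event is of the "increasing" kind.
P_INCREASING = {"S": 0.2, "M": 0.5, "B": 0.8}


def estimate_interest(events):
    """Return the filtered distribution over interest states after `events`."""
    belief = {s: 1.0 / len(STATES) for s in STATES}
    for event in events:
        if event not in INCREASING and event not in DECREASING:
            continue    # e.g., clickMovie, pageLoad, back: ignored in this sketch
        # Prediction: interest may evolve between two informative events.
        belief = {j: sum(belief[i] * TRANSITION[i][j] for i in STATES) for j in STATES}
        # Correction: weight by the likelihood of the observed event class.
        for j in STATES:
            likelihood = P_INCREASING[j] if event in INCREASING else 1.0 - P_INCREASING[j]
            belief[j] *= likelihood
        total = sum(belief.values())
        belief = {j: v / total for j, v in belief.items()}
    return belief


interested = estimate_interest(["clickMovie", "pageLoad", "linkI", "startSlide", "EI"])
bored = estimate_interest(["clickMovie", "stopDwl", "noInt", "stopSlide"])
```

The estimate depends on the whole sequence rather than on the last event alone, which mirrors the second idea above.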
6. Modeling Content Delivery Policies
In this section, we model the dynamic context of our
browsing system (Section 4) in order to obtain the appropriate adaptation
agents. Our goal is to characterize the adaptation policies in terms of Markov
decision processes (MDPs or POMDPs).
6.1. MDP Modeling
Firstly, an observable context is considered. Let us
introduce the proposed MDP that models it. The aim is to characterize
adaptation policies that satisfy properties 1 and 2 described in Section 4.3:
the presented movie description must be simplified if the available bandwidth $bw$ decreases or if $n$ increases.
A state (observable) $s$ of the context is a tuple $s = (n, bw, v, evt)$, with $n$ being the rank of the film consulted, $bw$ the available bandwidth, $v$ the version proposed by the adaptation agent,
and $evt$ the running event (see Figure 6), where $n$ and $bw$ take their values in bounded intervals, $v \in \{T, I, V\}$,
and $evt \in E$ (where $E = \{$clickMovie, stopDwl, pageLoad, noInt, linkI, startSlide, stopSlide, EI, linkV, play, stopVideo, EV, back$\}$).
Figure 6: MDP dynamics illustration.
To obtain a finite and reasonable number of such
states (limiting thus the MDP size), we will quantize the variables according
to our needs. Thus the rank (resp., the bandwidth) can be quantized on three levels, meaning begin, middle, and end (resp., low, average, and high), by segmenting
its range of values into three regions.
The temporal axis of MDP is naturally represented by the sequence of events, every event
implying a change of state.
The dynamics of our MDP is constrained by the dynamics of the context, especially by the
user navigation. Thus, a transition from a movie index to is not possible. Similarly, every is followed by an event .
The bandwidth's own dynamics will also have an impact (according to the quantized
levels) on the dynamics between the states of the MDP.
The choice of the movie description version (T, I,
or V) proposed by the adaptation agent is made
when the user follows the link to the film. This is encoded in the model by the
event clickMovie.
The states of the MDP can be classified into:
(i) decision states, in which the agent executes a real action
(it effectively chooses among T, I, or V);
(ii) nondecision or intermediary states, where the agent does not execute any
action.
In an MDP framework, the agent decides an action in
every single state. Therefore, the model needs to be enriched with an
artificial action (a nonaction) as well as an absorbing state with a strong
penalty. Thus, any valid action chosen in an intermediary state brings the
agent to the absorbing state, where it will be strongly penalized. Similarly,
the agent will avoid choosing the artificial action in a decision-making state, where a valid
action is desired. Thus, the valid actions mark out the visits of the decision
states, while the dynamics of the context (subject to user navigation and
bandwidth variability) are captured by the transitions between intermediary
states, for which the artificial action (the nonaction) is carried out. These properties
are clearly illustrated in Figure 6.
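The role of the artificial action and of the absorbing penalty state can be sketched on a deliberately tiny MDP solved by value iteration. This is only an illustration under stated assumptions: the three states, the action name "noop", the discount factor, the reward values, and the transition probabilities are all hypothetical and much simpler than the model of this section.

```python
# Toy MDP: one decision state, one intermediary state, one absorbing
# penalty state. Every number and name here is an illustrative assumption.
VALID_ACTIONS = ["T", "I", "V"]
NOOP = "noop"  # hypothetical name for the artificial (non)action
ACTIONS = VALID_ACTIONS + [NOOP]
STATES = ["decision", "intermediary", "penalty"]
GAMMA = 0.9
PENALTY = -100.0
INITIAL_REWARD = {"T": 3.0, "I": 2.0, "V": 1.0}  # simpler version, bigger reward


def model(state, action):
    """Return (immediate reward, {next state: probability})."""
    if state == "penalty":
        return 0.0, {"penalty": 1.0}  # absorbing: no way out
    if state == "decision":
        if action in VALID_ACTIONS:
            return INITIAL_REWARD[action], {"intermediary": 1.0}
        return PENALTY, {"penalty": 1.0}  # nonaction where a choice is due
    # Intermediary state: only the nonaction is legitimate.
    if action == NOOP:
        return 0.0, {"intermediary": 0.5, "decision": 0.5}
    return PENALTY, {"penalty": 1.0}


def q_value(values, state, action):
    """One-step lookahead value of taking `action` in `state`."""
    reward, successors = model(state, action)
    return reward + GAMMA * sum(p * values[s] for s, p in successors.items())


def value_iteration(sweeps=200):
    """Run synchronous value iteration and extract a greedy policy."""
    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        values = {s: max(q_value(values, s, a) for a in ACTIONS) for s in STATES}
    policy = {s: max(ACTIONS, key=lambda a: q_value(values, s, a)) for s in STATES}
    return values, policy
```

In this toy setting the greedy policy picks a valid version in the decision state and the nonaction in the intermediary state, because every other choice leads to the penalized absorbing state.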
In other words, there is no change of version during
the transitions between intermediary states. The action (representing the proposed version) chosen in
a decision-making state is therefore memorized in all the following intermediary states,
until the next decision state. Thus, the MDP captures the variation of the
context dynamics according to the chosen version. It will therefore be
able to identify the good choices of versions (to be reproduced later in
similar conditions), provided that it is rewarded for them.
The rewards are associated with the decision states according to the chosen action.
Intermediary states corresponding to the occurrences of the events EI and EV are rewarded as well, according to
Section 5.1. The rewards (other formulations are
possible as well, including, e.g., negative rewards for interruption
events) are
defined as follows:
To favor simpler versions for users who do not
interact with the content and do not view any media (cf.
Section 5.1), let us choose initial rewards that are larger for simpler versions.
To summarize, the model behaves in the following manner: the agent starts in
a decision state,
where it decides a valid action for which it receives an “initial”
reward;
the simpler the version, the bigger the reward. According to the transition
probabilities based on the context dynamics, the model goes through intermediary
states where it can receive new rewards at the occurrences of EI or EV, if the chosen version makes these events possible. As these occurrences are more frequent for
a small rank and a high bandwidth,
while the absence of interactions is more likely if the rank is big and the bandwidth low, the MDP
(i) will favor the richest version for a small rank and a high bandwidth;
(ii) will favor the simplest version for a big rank and a low bandwidth;
(iii) will establish a tradeoff (optimum according
to the rewards) for all the other cases.
The best policy given by the model is obviously
related to the values chosen for the rewards.
In order to control this choice in the experimental section, a simplified
version of the MDP will be defined.
A simplified MDP can be obtained by memorizing the occurrences of the events EI and EV during the navigation between two successive decision states.
Thus, we can delay the corresponding rewards.
This simplified model does not contain nondecision states if two
booleans, one per event, are added to the state structure (Figure 7).
Each boolean passes to 1 if its event (EI or EV, respectively) is observed between two decision states. The
simplified MDP is defined by its states, its actions,
the temporal axis given by the sequence of decision events,
and the rewards redefined as follows:
This ends the presentation of our observable model, and
we continue by integrating user interest in a richer POMDP model.
Figure 7: Simplified MDP.
6.2. POMDP Modeling
The new partially observable model adds a hidden
variable (It) to the state. The value of It represents the user's interest
quantized on three levels (Small, Average, Big). To be able to estimate user interest, we
follow the principles described in Section 5.2 and Figure 5. The events
(interactions) are taken out from the previous MDP state to become observations
in the POMDP model. These observations are distributed according to It (the interest level). A sequence of
observations provides an implicit measure of It, following the same principle as described
for the HMM in Figure 5. Therefore, it becomes possible for the adaptation
agent to refine its decisions according to the probability of each level of the current
user interest: small, average, big.
In other words, this refinement is done according to a belief state. The
principle of this POMDP is illustrated in Figure 8.
Figure 8: POMDP dynamics between hidden states.
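The belief state mentioned above is usually maintained with the standard POMDP belief update, recalled here in generic textbook form. The symbols are generic notation assumed for this reminder, not notation taken from the model of this section: $b(s)$ is the current belief over hidden states, $T(s' \mid s, a)$ the transition probability under action $a$, and $O(o \mid s', a)$ the probability of observing $o$ in the new state $s'$.

```latex
b'(s') = \frac{O(o \mid s', a) \, \sum_{s} T(s' \mid s, a) \, b(s)}
              {\sum_{s''} O(o \mid s'', a) \, \sum_{s} T(s'' \mid s, a) \, b(s)}
```

In the present setting the hidden part of the state is the interest level, so each observed event shifts the belief towards the interest levels under which that event is more likely.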
A hidden state of our POMDP becomes a tuple .
The notations are unchanged including the booleans and .
The temporal axis and the actions are unchanged.
The dynamics of the model are as follows. When a decision event occurs, the adaptation agent is in a decision
state.
It chooses a valid action and moves, according to the model's random
transitions, to an intermediary state where both booleans are equal to 0. The version proposed by the
agent is memorized in the intermediary states during the browsing of the current film. The
booleans become 1 if the events EI or, respectively, EV are observed, and preserve this value until the
next decision state.
During the browsing of the current film, the rank and the version remain constant while the other factors (the bandwidth, the interest,
and the booleans) can change.
The observations are the occurred events: .
They are distributed according to the states. In Figure 8, the event can be observed in and (probability 1.0) and cannot be observed
elsewhere ( and ).
In every intermediary state, the event distribution
characterizes the value of the interest. Thus, just like the HMM of Figure 5, the
POMDP will know how to evaluate, from the sequence of events, the current
belief state. The most likely interest value will therefore evolve along with
the events occurred; increase if , , , decrease in case of .
To preserve the interest level throughout the decision states, the interest of
the current receives the value corresponding to the last (Figure 8).
The rewards associated with the actions taken in a decision-making state are collected in the following decision-making
state where we have all necessary information: , ,
and ;
7. Experimental Results
Simulations are used in order to experimentally
validate the models. The developed software simulates navigations such as the
one illustrated in Figure 4. Every transition probability between two
successive states of navigation is a stochastic function of three parameters: the bandwidth, the user interest,
and the movie version.
The bandwidth is simulated as a random variable uniformly
distributed in an interval compatible with today's mobile networks. The user interest is represented by a family of random variables whose
expectation decreases with the rank of the film.
The third parameter is the movie version proposed to the user.
Other experimental setups involving different distribution laws
(e.g., normal distribution) for the bandwidth dynamics or the user's interest lead
to similar results.
7.1. MDP Validation for Observable Contexts
To validate the MDP model of Section 6.1, let us
choose a problem with and .
Initially, the intervals of and are quantized on 2 granularity levels: and .
Rather than arbitrarily choosing the values that define the rewards, we can look for the
ones leading to the optimal policy shown in Table 1. In fact, this policy respects the principles formulated in Section
4.3 and could be proposed beforehand by an expert (Table 1 gives only for the pairs since .)
Table 1: Policy stated for two-level granularity.
The value functions corresponding
to the simplified MDP, estimated over a horizon of length 1 (between two
decision-making states and ) can be written as follows: because, for all does not depend on action . where and represent the probabilities to observe the
events ,
respectively, ,
knowing the version .
For every pair we have computed, based on simulations, the
probabilities , , .
The policy is respected if and only if
Writing these inequalities for the 4 pairs from Table 1 and using the estimations for ,
we obtain a system of 12 linear inequalities in the variables , , , , .
Two solutions of the system, among infinitely many, are as follows:
Starting from these values, we can experimentally
check the correct behavior of our MDP model. Table 2 shows the policy obtained
automatically by dynamic programming or a Q-learning algorithm, with 4 granularity
levels for and and the rewards .
This table refines the previous coarse-grained policy; this is not a simple
copy of actions (e.g., see the pairs :
change from to , :
change from to ,
etc.). This new policy is optimal with respect to the rewards ,
for this finer granularity level.
Table 2: Policy refinement for rewards.
Solving the MDP for the second set of rewards () gives a different refinement (Table 3) that
shows richer versions (underlined) compared to .
The explanation lies in the increase of the rewards associated with the events , which induces the
choice of more complex versions for a long time ( lasts for 3 classes of ,
when ).
Table 3: Policy refinement for rewards.
7.2. POMDP Validation: Interest-Refined Policies
Once MDPs are calibrated and return appropriate
adaptation policies, their rewards can be reused to solve the POMDP models. The
goal is to refine the MDP policies for the observable case by estimating user
interest.
Two experimental steps are necessary. The first step
consists of learning the POMDP model and the second in solving the
decision-making problem.
For the learning process, the simplest method consists
of empirically estimating the transition and observation probabilities from
the simulator's traces. Starting from these traces, the probabilities are
obtained by computing frequencies.
Once the POMDP model is available, solving it is the next step.
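The frequency-based estimation described above can be sketched as follows. The trace format, a list of (state, action, next state) triples per simulated navigation, and the function name are assumptions made for this example; the actual format of the simulator's traces is not specified here. Observation probabilities can be estimated in the same way by counting (state, observation) pairs.

```python
from collections import Counter, defaultdict


def estimate_transitions(traces):
    """Estimate P(next_state | state, action) by relative frequencies.

    `traces` is assumed to be an iterable of navigations, each navigation
    being a list of (state, action, next_state) triples.
    """
    counts = defaultdict(Counter)
    for trace in traces:
        for state, action, next_state in trace:
            counts[(state, action)][next_state] += 1

    probabilities = {}
    for key, successors in counts.items():
        total = sum(successors.values())
        # Normalize the counts of each (state, action) pair into a distribution.
        probabilities[key] = {s: c / total for s, c in successors.items()}
    return probabilities
```

Pairs that never occur in the traces simply get no entry, so a practical implementation would typically add smoothing or fall back to a prior for them.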
Solving a POMDP is notoriously delicate and computationally intensive (e.g.,
see the tutorial proposed at www.pomdp.org). We used the software package pomdp-solve 5.3 in combination with CPLEX (with the more recent strategy
called finite grid).
The result returned by pomdp-solve is an automaton
that implements a “near optimal” deterministic policy, represented by
a decision-making graph (policy graph). The nodes of the graph contain
the actions () while the transitions are done according to
the observations. Only the transitions made possible by the navigation process
are to be exploited.
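To make the use of such a policy graph concrete, the following sketch walks a small automaton: each node stores an action, and each observation selects the next node. The three-node graph, its edges, and the rule that an observation without an outgoing edge leaves the node unchanged are hypothetical choices for this example; they do not reproduce the output format of pomdp-solve or the automaton of Figure 9.

```python
# Hypothetical policy graph: node id -> (action, {observation: next node id}).
POLICY_GRAPH = {
    0: ("T", {"linkI": 1, "noInt": 0}),
    1: ("I", {"linkV": 2, "stopSlide": 0, "EI": 1}),
    2: ("V", {"stopVideo": 1, "EV": 2}),
}


def run_policy_graph(graph, start_node, observations):
    """Follow the observations through the graph; return (action, node)."""
    node = start_node
    for observation in observations:
        _, edges = graph[node]
        # Assumption: an observation with no outgoing edge keeps the node.
        node = edges.get(observation, node)
    action, _ = graph[node]
    return action, node
```

For instance, with this assumed graph, the sequence linkI, EI, linkV starting from node 0 ends in node 2, whose action is V.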
To illustrate this form of result, let us show one of
the automata that is small enough to be displayed on an A4 page (Figure 9). We
choose a single granularity level for the rank and the bandwidth and three levels for the interest.
Additionally, we consider that the consumption of the slideshow precedes the
consumption of the video. The obtained adaptation policy therefore takes into
account only the variation of the estimated user interest (the rank and the bandwidth do not play any role).
Figure 9: Decision-making automaton (policy graph),
POMDP solution. Please note the different stopDwl, stopDwl(Img), and stopDwl(Video).
Figure 9 shows that the POMDP agent learns to react
in a coherent way. For example, starting from a version ,
and observing pageLoad, linkI,
startSlide, EI, noInt, back the following version decided by the POMDP
agent is ,
which translates the sequence into an interest rise. This rise is even stronger
if, after the event EI, the user
follows the link linkV. This is
enough to make the agent select the version further.
Conversely, starting from version ,
an important decrease in interest can be observed on the sequence startSlide, stopSlide, play, stopVideo, back,
so the system decides .
A smaller decrease in interest can be associated with the sequence startSlide, stopSlide, play, EV, back, the
next version selected being .
These examples show that there exists a natural correlation between the richness
of the selected versions and the implicit user interest. For this problem,
where the rank and the bandwidth are not involved, the version given by the policy graph translates the estimation of
the current interest (growing with ). For each movie, the choice of version is
therefore based only on the events observed while browsing the previous movies.
Other sequences cause the decisions to be less
intuitive or harder to interpret. For example, the sequence pageLoad, linkI, startSlide, stopSlide, noInt,
back leaving leads to the decision .
In this sequence, a compromise between interest rise (suggested by linkI, startSlide) and decrease (suggested
by stopSlide, noInt) must be
established. Thus, a decision would not be illegitimate. The POMDP trades
off this decision according to its dynamics and its rewards. To obtain a
modified graph leading to a decision for this sequence, it would be sufficient that
the product decreases, where represents the probability to observe EI in the version ,
for a medium interest. In this case, stopSlide, instead of provoking a loopback
on the node 5, would bring the agent to the node 1. Then the agent would decide since the expectation of the gains associated
to would be smaller.
In general, the decision-making automaton depends on and .
When , ,
and vary, the automaton becomes too complex to be
displayed. The results of the POMDP require a different presentation.
Henceforth, working with 3 granularity levels on ,
2 on ,
3 on and the set of rewards leads to a policy graph of more than
100 nodes. We apply it during numerous sequences of simulated navigations.
Table 4 gives the statistics on the decisions that have been taken. For every
triplet (, , ), the decisions—the agent not knowing —are counted and translated into
percentages.
Table 4: Actions' distribution for the POMDP solution policy.
We notice that the proposed content becomes
statistically richer when the interest increases, proving again that the
interest estimation from the previous observations is as expected. Let us take
an example and consider the bottom-right part of Table 4 (corresponding to and ). The probability of the policy proposing
version increases with the interest: from 0% (small
interest) to 2% (average interest) then 10% (big interest).
Moreover, when and/or increase, the interest trend is correct. For
example, for a given set of and ( and ), the proposed version becomes richer with
the bandwidth's increase from (1%T, 99%I, 0%V) to (0%T, 51%I, 49%V).
The POMDP capacity to refine adaptation policies
according to the user interest is thus validated. Once the POMDP model is
solved (offline resolution), the obtained automaton is easily put into
practice online by encoding it into an adaptation agent.
8. Conclusion
This paper has shown that sequential decision
processes under uncertainty are well suited for defining adaptation mechanisms
for dynamic contexts. According to the type of the context state (observable or
partially observable), we have shown how to characterize adaptation policies by
solving Markov decision processes (MDPs) or partially observable MDPs (POMDPs).
These ideas have been applied to adapt a movie browsing service. In particular,
we have proposed a method for refining a given adaptation policy according to
user interest. The perspectives of this work are manifold. Our approach can be
applied to cases where rewards are explicitly related to the service (e.g., to
maximize the number of rented DVDs). It will also be interesting to extend our
model by coupling it with functionalities from recommendation systems and/or
from multimedia search systems. In the latter case, we would benefit a lot from
a collection of real data, that is, navigation logs. These are the research
directions that will guide our future work.