International Journal of Digital Multimedia Broadcasting

Volume 2008, Article ID 171385, 13 pages

http://dx.doi.org/10.1155/2008/171385

## Adapting Content Delivery to Limited Resources and Inferred User Interest

^{1}Computer Science Department, Military Technical Academy, 050141 Bucharest, Romania
^{2}Computer Science Department, National Polytechnic Institute of Toulouse, 31071 Toulouse, France

Received 3 March 2008; Accepted 1 July 2008

Academic Editor: Harald Kosch

Copyright © 2008 Cezar Plesca et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper discusses adaptation policies for information systems that are subject to dynamic and stochastic contexts such as mobile access to multimedia web sites. In our approach, adaptation agents apply sequential decisional policies under uncertainty. We focus on the modeling of such decisional processes depending on whether the context is fully or partially observable. Our case study is a movie browsing service in a mobile environment that we model by using Markov decision processes (MDPs) and partially observable MDPs (POMDPs). We derive adaptation policies for this service that take into account limited resources such as the network bandwidth. We further refine these policies according to the partially observable users' interest level estimated from implicit feedback. Our theoretical models are validated through numerous simulations.

#### 1. Introduction

Access alternatives to computer services continue to progress, facilitating our interaction with family, friends, or workplace. These new access alternatives encompass such a wide range of mobile and distributed devices that our technological environment becomes truly pervasive. The execution contexts in which these devices operate are naturally heterogeneous. The resources offered by wireless networks vary with the number and the position of connected users. The available memory and the processing power also fluctuate dynamically. Last but not least, the needs and expectations of users can change at any instant. As a consequence, numerous research projects aim to provide modern information systems with adaptation capabilities according to context variability.

In order to handle highly dynamic contexts, the approach that we propose in this paper is based on an adaptation agent. The agent perceives the successive states of the context through observations and carries out adaptation actions. Often, the adaptation approaches proposed in the literature suppose that the contextual data is easy to perceive or at least that there is no possible ambiguity in identifying the state of the current context. One calls this an *observable context*. In this work, we relax this hypothesis and therefore deal with *partially observable contexts*.

Our case study is an information system for browsing multimedia descriptions of movies on mobile devices. The key idea is to show how a given adaptation strategy can be refined according to the estimation of user interest. User interest is clearly not directly observable by the system.

We build upon research on “implicit feedback” in order to allow the adaptation agent to estimate the user interest level while interacting with the context [1, 2]. The first section of this paper reviews important elements of the state of the art and details our adaptation approach. Next, we introduce the two formalisms used by our model: the Markov decision processes (MDPs) and the partially observable MDP (POMDP). The following section presents our case study and establishes the operational principles of this information system. Thanks to an MDP, we formalize an adaptation policy for our information system seen as an observable context. Then we show how to refine this policy according to user interest using a POMDP (refined itself from an MDP). Various experiments validate this approach and give a practical view of the behavior of an adaptation agent. We conclude this paper with some perspectives on this work.

#### 2. Related Work

This section reviews current literature in the field of adaptation to dynamic execution contexts, which helps to position our adaptation approach. Adaptive systems can be categorized according to what they adapt to: available resources, user preferences, or, more generally, the context.

##### 2.1. Resource-Based Adaptation

Given the heterogeneous nature of modern networks and mobile devices, there is an obvious need for adaptation to limited resources. Networks' QoS parameters vary in terms of available bandwidth, loss rate, or latency. The capabilities of the terminal are also very heterogeneous in terms of memory size, processing power, and display area.

To manage these limitations, one can adapt the content to be displayed or the access/distribution modalities. When considering content adaptation, several authors propose classifications [3] where the elementary components of the content (a media, e.g.) or the entire document's structure is to be transformed. A media can thus be transcoded [4], converted into another modality [5], or summarized [6]. The distribution or the access can also be adapted, for example, by optimizing the streaming [7] or by modifying the degree of interactivity of the service.

##### 2.2. User-Aware Adaptation

In addition to adaptation capabilities to the available resources, one should also consider an application's adaptation according to human factors which are a matter of user preferences and satisfaction. Henceforth, we describe three main research directions as given by the literature.

The first research direction consists of switching the adaptation mechanisms for maximizing the quality of the service perceived by the user. A typical scenario is the choice of the transcoding strategy of a stream (e.g., a video stream) in order to maximize the perceptual quality given a limited bandwidth [8]. What is the best parameter to adapt: the size of the video, its chromatic resolution, or the frame rate? Models have been proposed [9, 10] to assess quality variation both from technical and user perspectives. They are organized on three distinct levels: network, media, and content levels. For this line of research, the key factor for consideration is how variation in objective multimedia quality affects user perception.

A second active direction is related to user modeling. Here, the idea is to customize an application by modeling user profiles in order to recognize them later. For example, adaptive hypermedia contents or services [11] provide a user with navigation support for “easier/better learning using an on-line educational service” or support for “more efficient selling on an e-commerce site” according to the user profile. Very often, these systems use data mining techniques to analyze access patterns and discover interesting relations in usage data [12]. Such knowledge may be useful to recognize profiles and select the most appropriate modifications to improve content effectiveness.

The third research direction finds its motivation in the first two. In order to learn a user model or to evaluate the perceptual impact of a content adaptation solution, it is necessary to either explicitly ask users for evaluations or to obtain implicit feedback information. Research aiming to evaluate “implicit feedback” (IF) is experiencing a growing interest, since it avoids bringing together significant collections of explicit returns (which is intrusive and expensive) [1]. These IF methods are used in particular to decode user reactions in information search systems [2]. The idea is to measure the user interest for a list of query results, in order to adapt the search function. Among the studied implicit feedback signals one can consider: the total browsing time, the number of clicks, the scrolling interactions, and some characteristic sequences of interactions. In our work, we estimate user interest using IF by interpreting interaction sequences [2, 13]. Moreover, from a metadata perspective, IF can provide implicit descriptors like user interest descriptor as shown in [14].

##### 2.3. Mixing Resources and User-Aware Adaptation

More general adaptation mechanisms can be obtained by combining resource-based with user-based adaptation. The characteristics of users and resources are mixed to design an adaptation strategy for a given *context*. For example, streaming of a heavy media content can be adapted by prefetching while considering both user characteristics and resource constraints [15].

For mobile and pervasive systems, the link between resources and users starts by taking into account the geolocation of the user, which can be traced over time and even predicted [16].

In the MPEG-21 digital item adaptation (DIA) standard, the context descriptors group the network's and the terminal's capabilities together with the user's preferences and the authors' recommendations to adapt multimedia productions. Given this complexity, the normative works only propose tools for describing the running context as a set of carefully chosen and extensible descriptors [17]. This metadata-based approach leaves the design of adaptation components open while allowing a high level of interoperability [18].

Naturally, the elements of the context vary in time. Therefore, one speaks of a dynamic context and, by extension, of a dynamic adaptation. It is important to note that static adaptation to static context elements is possible as well: one can negotiate once and for all, and always in the same manner, the favorite language of a user at the moment of access to a multilingual service. In contrast, the adaptation algorithm itself and/or its parameters can be dynamically changed according to the context state [19]. Our adaptation approach is in line with the latter case.

An important element of research in context adaptation is also the distinction between the adaptation decision and its effective implementation [18]. In a pervasive system, one can decide that a document must be transcoded into another format, but some questions still need to be answered. Is a transcoding component available? Where can it be found? Should one compose the transcoding service? In order to find solutions to these questions, many authors propose to use artificial learning techniques to select the right decision and/or the appropriate implementation of adaptation mechanisms (see [20] for a review). In this case, a description of the running context is given as input to a decision-making agent that predicts the best adaptation actions according to what it has previously learned. We extend this idea in line with a reinforcement learning principle.

We model the context dynamics by a Markov decision process whose states are completely or partially observable. This approach provides means to find the optimal decision (adaptation action) according to the current context. The next section introduces our MDP-based adaptation approach.

#### 3. Markov Decision Processes: Our Formal Approach

Figure 1 summarizes our adaptation approach that has been introduced in [21] and is further refined in this article. In this paper, an adaptation strategy for dynamic contexts is applied by an adaptation agent. This agent perceives sequentially, over a discrete temporal axis, the variations of the context through observations.

From its observations, the agent computes the context state in order to apply an adaptation policy. Such a policy is simply a function that maps context states to adaptation decisions. Therefore, the agent acts on the context when deciding an adaptation action: it consumes bandwidth, influences the user's future interactions, and increases or reduces the user's interest. It is therefore useful to measure its effect by associating a reward (immediate or delayed) with the adaptation action decided in a given context state. The agent can thus learn from its interaction with the context and perform a “trial-and-error” learning called reinforcement learning [22]. It attempts to reinforce the actions resulting in a good accumulation of rewards and, conversely, avoids repeating fruitless decisions. This process represents a continuous improvement of its “decision policy.”

This dynamic adaptation approach is common to frameworks of *sequential decisional policies under uncertainty*. In these frameworks, the uncertainty comes from two sources. On the one hand, the dynamics of the context can be random as a consequence of the variability of available resources (e.g., the bandwidth); on the other hand, the effect of an agent's decision can itself be random. For example, if an adaptation action aims to anticipate user interactions, the prediction quality is obviously uncertain and subject to variations in the user's behavior.

In this situation, by adopting a Markov definition of the context state, the agent's dynamics can be modeled as a Markov decision process (MDP). This section introduces this formalism.

We initially assume that the context state variables are observable by the agent, which is sufficient to identify the decision state without any ambiguity. This paper takes a step forward by refining adaptation policies according to user interest. We sequentially estimate this hidden information through user behavior, as suggested by research on the evaluation of “implicit feedback.” Therefore, the new decision-making state contains observable variables as well as a hidden element associated with user interest.

We then move on from an MDP to a partially observable Markov decision process (POMDP). To the best of our knowledge, the application of the POMDP to the adaptation problem in partially observable contexts has not been studied before. To give concrete expression to this original idea, a case study will be presented in Section 4.

##### 3.1. MDP Definition

An MDP is a stochastic controlled process that assigns rewards to transitions between states [23]. It is defined as a quintuple $(S, A, T, p, r)$, where $S$ is the state space, $A$ is the action space, $T$ is the discrete temporal axis of instants when actions are taken, $p(\cdot)$ are the probability distributions of the transitions between states, and $r(\cdot)$ is a reward function on the transitions. This formalizes the ingredients necessary to understand Figure 1: at each instant $t \in T$, the agent observes its state $s_t \in S$, applies the action $a_t \in A$ that brings the system (randomly, according to $p(s_{t+1} \mid s_t, a_t)$) to a new state $s_{t+1}$, and receives a reward $r_t$.

As previously mentioned, we are looking for the best policy with respect to the accumulated rewards. A policy $\pi$ is a function that associates an action $a = \pi(s) \in A$ with each state $s \in S$. Our aim is to find the best one: $\pi^*$.

The MDP theoretical framework assigns a *value function* $V^{\pi}$ to each policy $\pi$. This value function associates each state $s \in S$ with a global reward $V^{\pi}(s)$, obtained by applying $\pi$ beginning with $s$. Such a value function allows policies to be compared. A policy $\pi$ outperforms another policy $\pi'$ if

$$\forall s \in S, \quad V^{\pi}(s) \geq V^{\pi'}(s).$$

The expected sum of rewards obtained by applying $\pi$ starting from $s$ is weighted by a parameter $\gamma$ in order to limit the influence of infinitely distant rewards,

$$V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; s_0 = s, \pi\right].$$

In brief, for each state, this value function gives the expected sum of future rewards that can be obtained if the policy $\pi$ is applied from this state on. This value function allows us to formalize the search for the optimal policy $\pi^*$, which is the one associated with the best value function $V^* = V^{\pi^*}$.

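*Bellman's optimality equations* characterize the optimal value function $V^*$ and an optimal policy $\pi^*$ that can be obtained from it. In the case of the $\gamma$-weighted criterion and stationary rewards, they can be written as follows:

$$V^*(s) = \max_{a \in A} \sum_{s' \in S} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V^*(s')\bigr],$$

$$\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V^*(s')\bigr].$$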

##### 3.2. Resolution and Reinforcement Learning

When solving an MDP, we can distinguish between two cases, according to whether the model is known or unknown. When the model (probabilities $p$) and the rewards are known, a dynamic programming solution can be found.

The operator $L$ defined by $V_{n+1} = L V_n$ according to Bellman's optimality equation is a contraction. The Bellman equation in $V^*$ can therefore be solved by a fixed-point iterative method: choose $V_0$ arbitrarily, then repeatedly apply the operator $L$, which improves the current policy associated with $V_n$. If the rewards are bounded, the sequence $V_n$ converges to $V^*$, from which $\pi^*$ can be computed.
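The fixed-point iteration described above is commonly known as value iteration. The following is a minimal sketch of it, not the authors' implementation: the array layout (`p[a, s, s2]` for transition probabilities, `r[a, s, s2]` for transition rewards), the stopping tolerance, and the two-state toy model are all assumptions made for illustration.

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate V <- max_a sum_s2 p * (r + gamma * V) until it stabilizes.

    p[a, s, s2]: probability of reaching s2 from s under action a (assumed layout).
    r[a, s, s2]: reward attached to that transition (assumed layout).
    """
    n_states = p.shape[1]
    v = np.zeros(n_states)  # arbitrary starting point V_0
    for _ in range(max_iter):
        # q[a, s] = sum over s2 of p[a, s, s2] * (r[a, s, s2] + gamma * v[s2])
        q = np.einsum("ast,ast->as", p, r + gamma * v[None, None, :])
        v_new = q.max(axis=0)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Greedy policy extracted from the final value function.
    q = np.einsum("ast,ast->as", p, r + gamma * v[None, None, :])
    return v, q.argmax(axis=0)

# Hypothetical toy model: action 0 stays in place, action 1 switches state,
# and landing in state 1 yields a reward of 1.
p = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.zeros((2, 2, 2))
r[:, :, 1] = 1.0

v_star, policy = value_iteration(p, r, gamma=0.9)
print(v_star, policy)
```

In this toy model the greedy policy switches out of state 0 and stays in state 1, which is what one would expect when only state 1 is rewarding.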

If the model is unknown, we can solve the MDP using a reinforcement learning algorithm [22]. The reinforcement learning approach aims to find an optimal policy through iterative estimations of the optimal value function. The *Q-learning* algorithm is a reinforcement learning method that is able to solve the Bellman equations for the $\gamma$-weighted criterion. It uses simulations to iteratively estimate the value function $V^*$, based on the observations of instantaneous transitions and their associated reward. For this purpose, Puterman [23] introduced a function $Q$ that carries a significance similar to that of $V$ but makes it easier to extract the associated policy because it no longer needs transition probabilities. We can express the “*Q*-value” as a function of a given policy $\pi$ and its value function,

$$Q^{\pi}(s, a) = \sum_{s' \in S} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V^{\pi}(s')\bigr].$$

Therefore, it is easy to see that, in spite of the lack of transition probabilities, we can trace back to the optimal policy,

$$V^*(s) = \max_{a \in A} Q^*(s, a), \qquad \pi^*(s) = \arg\max_{a \in A} Q^*(s, a).$$

The principle of the *Q-learning* algorithm (Algorithm 1) is that after each observed transition $(s_n, a_n, s_{n+1}, r_n)$ the current value function $Q_n$ for the couple $(s_n, a_n)$ is updated, where $s_n$ represents the current state, $a_n$ the chosen and applied action, $s_{n+1}$ the resulting state, and $r_n$ the immediate reward.

In this algorithm, $N$ is an initial parameter that represents the number of iterations. The *learning rate* $\alpha_n(s, a)$ is specific to each state-action pair and decreases toward 0 at each iteration. A simulation function returns a new state and its associated reward according to the dynamics of the system. The choice of the current state and of the action to execute is made by two selection functions, and an initialization function sets the $Q$ values to $Q_0$.
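The loop described above can be sketched as follows. This is an illustrative version, not a transcription of Algorithm 1: the function names, the uniform random choice of state and action, the learning rate equal to the inverse of the visit count, and the toy `simulate` function are all assumptions.

```python
import random
from collections import defaultdict

def q_learning(states, actions, simulate, n_iterations, gamma=0.9, q0=0.0, seed=0):
    """Tabular Q-learning driven by a simulator of the system dynamics."""
    rng = random.Random(seed)
    q = defaultdict(lambda: q0)   # Q values initialized to q0
    visits = defaultdict(int)     # visit counts per (state, action) pair
    for _ in range(n_iterations):
        s = rng.choice(states)    # assumed: uniform choice of the current state
        a = rng.choice(actions)   # assumed: uniform choice of the action
        s_next, reward = simulate(s, a, rng)
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]  # learning rate decreasing toward 0
        target = reward + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])
    return q

def greedy_policy(q, states, actions):
    """Extract the policy that picks the action with the largest Q value."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}

# Hypothetical toy dynamics: two states, landing in state 1 is rewarded.
def simulate(s, a, rng):
    s_next = s if a == "stay" else 1 - s
    return s_next, float(s_next == 1)

states, actions = [0, 1], ["stay", "switch"]
q = q_learning(states, actions, simulate, n_iterations=5000, gamma=0.5)
policy = greedy_policy(q, states, actions)
print(policy)
```

Choosing states and actions uniformly at random is the simplest way to visit every pair repeatedly; in practice an exploration strategy that favors promising actions is often preferred.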

The convergence of this algorithm has been thoroughly studied and is now well established. We assume the following.

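(i) $S$ and $A$ are finite, and $0 \leq \gamma < 1$.

(ii) Each pair $(s, a)$ is visited an infinite number of times.

(iii) $\sum_{n} \alpha_n(s, a) = \infty$ and $\sum_{n} \alpha_n^2(s, a) < \infty$.

Under these hypotheses, the function $Q_n$ converges almost surely to $Q^*$. Let us recall that almost-sure convergence means that for all $(s, a)$ the sequence $Q_n(s, a)$ converges to $Q^*(s, a)$ with a probability equal to 1. In practice, the sequence $\alpha_n(s, a)$ is often defined as follows:

$$\alpha_n(s, a) = \frac{1}{n_{s,a}},$$

where $n_{s,a}$ represents the number of times the state $s$ was visited and the decision $a$ was made.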

##### 3.3. Partial Observation and POMDP Definition

In many cases, the observations that a decision agent is able to capture (see Figure 1) are only partial and do not allow the identification of the context state without ambiguity. Therefore, a new class of problems needs to be solved: partially observable Markov decision processes. The states of the underlying MDP are hidden and only the observation process will help to rediscover the running state of the process.

A partially observable Markov decision process is defined by:

(i) $(S, A, T, p, r)$, the underlying MDP;

(ii) $\Omega$, a set of observations;

(iii) $O$, an observation function that maps every state $s$ to a probability distribution on the observation space. The probability of observing $o \in \Omega$ knowing the agent's state $s$ will be referred to as $O(o \mid s)$.

*Non-Markovian Behavior*

It is worth noting that, in this new model, we lose a property widely used for the resolution of MDPs, namely that the observation process is Markovian. The probability of the next observation $o_{t+1}$ may depend not only on the current observation and action taken, but also on previous observations and actions,

$$P(o_{t+1} \mid o_t, a_t) \neq P(o_{t+1} \mid o_t, a_t, o_{t-1}, a_{t-1}, \ldots).$$

*Stochastic Policy*

It has been proved that the convergence results obtained for $V$ and $Q$ using MDP resolution algorithms are no longer applicable. POMDPs require the use of stochastic policies rather than deterministic ones, as for MDPs [24].

##### 3.4. Resolution

The classic POMDP methods attempt to reduce the resolution problem to that of the underlying MDP. Two situations are possible. If the MDP model is known, one cannot determine the exact state of the system but only a probability distribution over the set of possible states (a *belief state*). In the second situation, without knowing the model parameters, the agent attempts to construct the MDP model relying only on the history of observations.

Our experimental test bed uses the resolution software package provided by Cassandra et al. [25] that works in the potentially infinite space of belief states using linear programming methods.
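To make the notion of belief state concrete, the sketch below shows the standard Bayesian update of a belief after one action and one observation. It only illustrates the belief state itself and is not the solver of [25]; the function name, the array layout, and the numbers in the example are assumptions.

```python
import numpy as np

def belief_update(belief, p_a, obs_likelihood):
    """Bayes update of a belief state after an action and an observation.

    belief[s]          : current probability of being in hidden state s.
    p_a[s, s2]         : transition probability from s to s2 under the chosen action.
    obs_likelihood[s2] : probability of the received observation in state s2.
    """
    predicted = belief @ p_a                  # prediction step through the dynamics
    unnormalized = obs_likelihood * predicted # correction step with the observation
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total

# Hypothetical example: two hidden states, an action that leaves the state
# unchanged, and an observation four times more likely in state 0.
belief = np.array([0.5, 0.5])
p_a = np.eye(2)
obs_likelihood = np.array([0.8, 0.2])
new_belief = belief_update(belief, p_a, obs_likelihood)
print(new_belief)
```

Because the belief is a vector of probabilities, the set of reachable beliefs is continuous, which is why solvers have to work in a potentially infinite space.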

#### 4. Case Study: a Movie Presentation System for Mobile Terminals

We introduce here a system for browsing movie descriptions on mobile devices. For this system, our strategy aims to adapt the presentation of a multimedia content (i.e., movie description) and not to transform the media itself. This case study is intended to be both simple and pedagogical, while integrating a degree of realistic interactivity.

##### 4.1. Interactive Access to a Movie Database

Figure 2 introduces an information system accessible
from mobile terminals such as PDAs. A keyword search allows the user to obtain
an ordered list of links to various movie descriptions. Within this list, the
user can follow a link toward an interesting movie (the associated interaction
will be referred to as *clickMovie*);
then, he or she can consult details regarding the movie in question. This
consultation will call on a full screen interactive presentation and a
navigation scenario detailed below. Having browsed the details for one movie,
the user is able to come back to the list of query results (interaction *back* in Figure 2). It is then possible to
access the description of a second interesting film. The index of the accessed
movie description will be referred to as $n$.

To simplify the context modeling, we choose to consider the browsing sequence indexed by $n$. Our problem becomes one of adapting the content (movie descriptions) presented during this sequence. Our execution environment is dynamic because of the variability of the bandwidth ($bw$), a very frequent problem in mobile networks. For simplicity, we do not take into account other important parameters of mobile terminals such as signal strength, user mobility, and power constraints.

As we consider the browsing session at a high level, we do not need to provide special specifications for the final goal of the service that can be renting/buying a DVD, downloading a media, and so forth. Properly managing the download or the streaming of the whole media is a separate problem and is not considered here.

##### 4.2. From the Simplest to the Richest Descriptions

To present the details of a movie, three forms of descriptions are possible (see Figure 3). The poor “textual” version (referred to as $T$) groups together a small poster image, a short text description, and links pointing to more production photos as well as a link to the video trailer. The intermediary version ($I$) provides a slideshow of still photos and a link to the trailer. The richest version ($V$) includes, in addition, the video trailer.

As the available bandwidth ($bw$) is variable, the usage of the three versions is not equivalent. The bandwidth required to download the content increases with the complexity of the versions ($T < I < V$). In other words, for a given bandwidth, the latencies perceived by the user during the download of the different versions grow proportionally with the size of the content.

More precisely, we now point out two problems caused by the absence of dynamic adaptation of the content when the available bandwidth varies. The adaptation strategy could systematically select only one of the three possible alternatives mentioned above. If it always selects the richest version ($V$), this impacts the behavior of a user who experiences bad network conditions (low bandwidth). Although strong latencies could be tolerated while browsing the first query results (small index $n$), they quickly become unacceptable as $n$ grows. If the adaptation strategy systematically selects the simplest version ($T$), this would also have a harmful impact on the behavior of the user. Despite the links toward the other resources, (I)mages and (V)ideo, the lack of these visual components, which normally stimulate interest, will not encourage further browsing. An important and legitimate question is therefore what can be called an “appropriate” adaptation policy.

##### 4.3. Properties of Appropriate Adaptation Policies

The two aforementioned examples of policies (one “too ambitious,” the other “too modest”) show how complex the relationship is among the versions, the number of browsed films, the time spent on the service, the quality of service, the available bandwidth, and the user interest. An in-depth analysis of these relationships could be a research project in itself. We do not claim to deliver such an analysis in this paper; we simply want to show how a policy and an adaptation agent can be generated automatically from a model where the context state is observable or partially observable.

Three properties of a good adaptation policy can be identified as follows.

(1) The version chosen for presenting the content must be simplified if the available bandwidth $bw$ decreases ($T$ is simpler than $I$, itself simpler than $V$).

(2) The version must be simplified if $n$ increases: it is straightforward to choose rich versions for the first browsed movie descriptions, which are normally the most pertinent ones (as we have already mentioned, we should avoid large latencies for big values of $n$ and small $bw$).

(3) The version must be enriched if the user shows a high interest in the query results. The simple underlying idea is that a very interested user is more likely to be patient and to tolerate large downloading latencies more easily.

The first two properties are related to the variation of the context parameters that we consider observable ($n$ and $bw$), while the third one is related to a hidden element, namely, user interest. At this stage, given these three properties, an adaptation policy for our case study can be expressed as the selection of the version (*T*, *I*, or *V*) knowing $n$ and $bw$ and having a way to estimate the interest.

##### 4.4. On Navigation Scenarios

This paragraph introduces by examples some possible
navigation scenarios. Figure 4 illustrates different possible steps during
navigation and introduces different events that are tracked. In this figure,
the user chooses a film (event *clickMovie*), the presentation in version *T*
is downloaded (event *pageLoad*)
without the user interrupting this download. Interested in this film, the user
requests the production photos, following the link toward the pictures (event *linkI*). In one case, the downloading
seems too long and the user interrupts it (event *stopDwl*, short for stopDownload), then returns
to the movie list (event *back*).
In the other case, the user waits for the downloading of the pictures to
finish, then starts viewing the slideshow (event *startSlide*). Either this slideshow is
shown completely and then an event *EI* (short for EndImages) is raised, or the visualization is incomplete, leading to
the event *stopSlide* (not represented
in the figure). Next, the link to the trailer can be followed (event *linkV*); here again an impatient user can
interrupt the downloading (*stopDwl*) or start playing the video (*play*). Then the video can be watched completely (event *EV* for EndVideo) or stopped (*stopVideo*),
before a return (event *back*).

Obviously, this example does not introduce all the
possibilities, especially if the video is not downloaded but streamed.
Streaming scenarios introduce different challenges and require a playout buffer
that enriches the set of possible interactions
(e.g., *stopBuffering*). Meanwhile,
the user may choose not to interact with the proposed media: we introduce a
sequence of events *pageLoad*, *noInt* (no interaction), *back*. Similarly, a *back* is possible just after a *pageLoad*, a *stopDwl* may occur immediately after the
event *clickMovie*, and watching the
video before the pictures is also possible.

#### 5. Problem Statement

##### 5.1. Rewards for Well-Chosen Adaptation Policies

From the previous example and the definitions of associated interactions, it is possible to propose a simple mechanism aiming at rewarding a pertinent adaptation policy. A version ($T$, $I$, or $V$) is considered well chosen in a given context if it is not questioned by the user. The reassessment of a version as being too simple is suggested, for example, by the full consumption of the pictures. In the same way, the reassessment of a version as being too rich is indicated by a partial consumption of the downloaded video. Four simple principles that guide our rewarding system are as follows.

(i) We reward the event *EI* for versions $I$ and $V$.

(ii) We reward the event *EV* if the chosen version was $V$.

(iii) We penalize upon arrival of interruption events (“stops”).

(iv) We favor the simpler versions for no or little interaction.

Thus, a version $T$ is sufficient if the user does not request (or at least does not completely consume) the pictures. A version $I$ is preferable if the user is interested enough and has access to enough resources to download and view the set of pictures (reward *EI*). Similarly, a version $I$ is adopted if the user views all the pictures (reward *EI*) and, trying to download the video, is forced to interrupt it because of limited bandwidth. Finally, a rich version $V$ is adopted if the user is in good condition to consume the video completely (reward *EV*). The following decision-making models formalize these principles.
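As an illustration, the four principles can be turned into a small reward function. This is only a sketch under stated assumptions: the numeric magnitudes are hypothetical, the version labels `"T"`, `"I"`, and `"V"` follow the three descriptions of Section 4.2, and *EI* is assumed to be rewarded whenever the pictures were part of the chosen version.

```python
# Interruption events named in the navigation scenarios.
STOP_EVENTS = {"stopDwl", "stopSlide", "stopVideo"}

def reward(version, event, r_ei=1.0, r_ev=2.0, r_stop=-1.0, r_simple=0.5):
    """Immediate reward for an observed event given the chosen version.

    All magnitudes are hypothetical placeholders, not values from the paper.
    """
    if event == "EI" and version in ("I", "V"):
        return r_ei    # pictures fully viewed and they were part of the version
    if event == "EV" and version == "V":
        return r_ev    # video fully viewed and it was part of the version
    if event in STOP_EVENTS:
        return r_stop  # interruptions are penalized
    if event == "noInt":
        # With no interaction, the simpler the version, the better.
        return {"T": r_simple, "I": 0.0, "V": -r_simple}[version]
    return 0.0         # all other events are neutral in this sketch

print(reward("V", "EV"), reward("T", "EI"), reward("I", "stopDwl"))
```

Keeping the magnitudes as parameters makes it easy to explore how the learned policy reacts when, for example, interruptions are penalized more heavily.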

##### 5.2. Toward an Implicit Measure of the Interest

The previously introduced navigations and interactions make it possible to estimate the interest of the user. We proceed by evaluating “implicit feedback” and use the sequences of events to estimate the user's interest level. Our approach is inspired by [26] and is based on the two following ideas.

The first idea is to identify two types of
interactions according to what they suggest: either an increasing interest (*linkI, linkV, startSlide, play, EI, EV*) or a decreasing interest (*stopSlide, stopVideo, stopDwl, noInt*). Therefore, the event distribution
(seen as the probability of occurrence) depends on the user's
interest in the browsed movie.

The second idea is to consider not only *a single
running event* to update the estimation of user interest but also to regard *an
entire sequence of events* as being more significant. In fact, it has been
recently established that the user actions on a response page to a search
(e.g., on Google) depend not only on the relevance of the current response but
also on the global relevance of the set of the query results [2].

Following the work of [26], it is natural to model the
sequences of events or observations as produced by a hidden Markov model (HMM),
whose definition we do not detail here (see, e.g., [27]). One can simply translate
the two previous ideas by using an HMM with several (hidden) states of
interest. The three states of interest shown in Figure 5 are referred to as *S*, *M*,
and *B*, respectively, for a small, medium, or big interest. The three
distributions of observable events in every state are different as stressed in
the first idea mentioned above. These differences explain the occurrences of
different sequences of observations in terms of sequential interest evolutions
(second idea). These evolutions are encoded thanks to transition probabilities
(stippled) between hidden states of interest. Given a sequence of observations,
an HMM can thus provide the most likely underlying sequence of hidden states or
the most likely running hidden state. At this point, the characteristics of our
information system are rich enough to define an adaptation agent applying
decision policies under uncertainty. These policies can be formalized in the
framework presented in Section 3.1.
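The HMM-based interest estimation described above can be sketched with a forward-filtering pass. This is a minimal illustration, not the authors' implementation: the transition matrix `A`, the emission matrix `EMIT`, and the function name `filter_interest` are assumptions chosen only so that "rising" events are more likely under a big interest and "falling" events under a small one.

```python
import numpy as np

STATES = ["S", "M", "B"]  # hidden interest levels: small, medium, big
EVENTS = ["linkI", "linkV", "startSlide", "play", "EI", "EV",   # suggest rising interest
          "stopSlide", "stopVideo", "stopDwl", "noInt"]          # suggest falling interest

# Hypothetical transition matrix: A[i, j] = P(next state j | current state i).
A = np.array([[0.70, 0.25, 0.05],
              [0.20, 0.60, 0.20],
              [0.05, 0.25, 0.70]])

# Hypothetical emissions: share of probability mass given to "rising" events.
rising_mass = {"S": 0.2, "M": 0.5, "B": 0.8}
EMIT = np.zeros((len(STATES), len(EVENTS)))
for i, s in enumerate(STATES):
    EMIT[i, :6] = rising_mass[s] / 6        # spread over the 6 rising events
    EMIT[i, 6:] = (1 - rising_mass[s]) / 4  # spread over the 4 falling events

def filter_interest(events, prior=None):
    """Return P(current interest | events) using the HMM forward recursion.

    `prior` is the belief held before the first event (uniform by default).
    """
    belief = np.full(3, 1 / 3) if prior is None else np.asarray(prior, dtype=float)
    for e in events:
        k = EVENTS.index(e)
        belief = (belief @ A) * EMIT[:, k]  # predict one step, then weight by the observation
        belief /= belief.sum()              # renormalize to a probability vector
    return belief

belief = filter_interest(["linkI", "startSlide", "EI", "linkV", "play", "EV"])
most_likely = STATES[int(belief.argmax())]
```

With these assumed parameters, a run of interactions that suggest a rising interest shifts the belief toward *B*, while a run of interruptions shifts it toward *S*.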

#### 6. Modeling Content Delivery Policies

In this section, we model the dynamic context of our browsing system (Section 4) in order to obtain the appropriate adaptation agents. Our goal is to characterize the adaptation policies in terms of Markov decision processes (MDPs or POMDP).

##### 6.1. MDP Modeling

We first consider an observable context and introduce the MDP that models it. The aim is to characterize adaptation policies that satisfy properties 1 and 2 described in Section 4.3: the presented movie description must be simplified if the available bandwidth decreases or if the rank of the consulted movie increases.

*A state* (observable) of the context is a tuple whose components are the rank of the film consulted, the available bandwidth, the version proposed by the adaptation agent,
and the running event (see Figure 6). The possible events are
*clickMovie, stopDwl, pageLoad, noInt, linkI, startSlide, stopSlide, EI, linkV, play, stopVideo, EV, back*.

To obtain a finite and reasonable number of such states (thus limiting the MDP size), we quantize the variables according to our needs. Thus the rank (resp., the bandwidth) can be quantized on three levels meaning begin, middle, and end (resp., low, average, and high) by segmenting its interval into three regions.

*The temporal axis* of MDP is naturally represented by the sequence of events, every event
implying a change of state.

*The dynamics* of our MDP is constrained by the dynamics of the context, especially by the
user navigation. Thus, a transition from a movie index to is not possible. Similarly, every is followed by an event .
The bandwidth's own dynamics will also have an impact (according to the quantized
levels) on the dynamics between the states of the MDP.

The choice of the movie description version (*T*, *I*, or *V*) proposed by the adaptation agent is made when the user follows the link to the film. This is encoded in the model by the event *clickMovie*. The states of the MDP can be classified into:

(i) decision states, in which the agent executes a real action
(it effectively chooses among *T*, *I*,
or *V*);

(ii) nondecision or intermediary states, where the agent does not execute any
action.

In an MDP framework, the agent decides on an action in
every single state. Therefore, the model needs to be enriched with an
artificial action (a nonaction) as well as an absorbing state with a strong
penalty. Thus, any valid action chosen in an intermediary state brings the
agent to the absorbing state, where it is strongly penalized. Similarly,
the agent will avoid choosing the nonaction in a decision-making state, where a valid
action is desired. Thus, the valid actions mark out the visits to the decision
states, while the dynamics of the context (subject to user navigation and
bandwidth variability) are captured by the transitions between intermediary
states, for which the nonaction is carried out. These properties
are illustrated in Figure 6.
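The role of the artificial action and of the penalizing absorbing state can be sketched as a small guard around the transition function. The names `NOOP`, `ABSORBING`, `PENALTY`, and `guard`, as well as the penalty value, are hypothetical and serve only to illustrate the mechanism.

```python
VALID_ACTIONS = ("T", "I", "V")  # versions the agent may propose
NOOP = "noop"                    # hypothetical name for the artificial action
ABSORBING = "absorbing"          # hypothetical label for the penalizing sink state
PENALTY = -1000.0                # hypothetical value for the strong penalty

def is_legal(is_decision_state, action):
    """A real action is expected in decision states, the nonaction elsewhere."""
    if is_decision_state:
        return action in VALID_ACTIONS
    return action == NOOP

def guard(is_decision_state, action, next_state, reward):
    """Pass a legal transition through; redirect an illegal one to the sink."""
    if is_legal(is_decision_state, action):
        return next_state, reward
    return ABSORBING, PENALTY
```

Because every illegal choice ends in the sink with a large negative reward, an optimal policy only selects a version in decision states and the nonaction in intermediary states.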

In other words, there is no change of version during
the transitions between intermediary states. The action (representing the proposed version) chosen in
a decision-making state is therefore memorized in all the following intermediary states,
until the next decision state. Thus, the MDP captures the variation of the
context dynamics *according to the chosen version*. Therefore, it will be
able to identify the good choices of versions (to reproduce later in
similar conditions), if it is rewarded for them.

*The rewards* are associated with the decision states according to the chosen action.
Intermediary states corresponding to the occurrences of the events *EI* and *EV* are rewarded as well, according to
Section 5.1. Other formulations are possible as well, including, for example, negative rewards for interruption
events. To favor simpler versions for users who do not
interact with the content and do not view any media (cf.
Section 5.1), the initial reward is chosen to be larger for simpler versions.

To summarize, the model behaves in the following manner: the agent starts in
a decision state,
where it decides on a valid action for which it receives an “initial”
reward;
the simpler the version, the bigger the reward. According to the transition
probabilities based on the context dynamics, the model goes through intermediary
states where it can receive new rewards at the occurrences of *EI* (resp., *EV*), if the chosen action was *I* or *V*
(resp., *V*). As these occurrences are more frequent for
a small rank and a high bandwidth,
while the absence of interactions is more likely if the rank is big and the bandwidth low, the MDP

(i) will favor the richest version for a small rank and a high bandwidth;

(ii) will favor the simplest version for a big rank and a low bandwidth;

(iii) will establish a tradeoff (optimal according to the rewards) for all the other cases.

The best policy given by the model is obviously related to the values chosen for the rewards. In order to control this choice in the experimental section, a simplified version of the MDP will be defined.

*A simplified MDP* can be obtained by memorizing the occurrence of the events *EI* and *EV* during the navigation between two decision events.
Thus, we can delay the corresponding rewards.
This simplified model does not contain nondecision states if two
booleans, one per event, are added to the state structure (Figure 7).
Each boolean passes to 1 if the corresponding event (*EI* or *EV*) is observed between two decision states. The
simplified MDP is defined by its states, its actions,
the temporal axis given by the sequence of decision events,
and redefined rewards. This ends the presentation of our observable model;
we continue by integrating user interest in a richer POMDP model.
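The memorization used by the simplified MDP can be illustrated by folding an event stream into one pair of booleans per consulted movie. This sketch assumes that the decision event is *clickMovie*; the function name `fold_events` is hypothetical.

```python
def fold_events(events, decision_event="clickMovie"):
    """Return one (saw_EI, saw_EV) pair per movie visit.

    A visit starts at each decision event; the two booleans record whether
    EI or EV occurred before the next decision event.
    """
    visits = []
    flags = None  # no visit has started yet
    for e in events:
        if e == decision_event:
            if flags is not None:
                visits.append(tuple(flags))  # close the previous visit
            flags = [False, False]           # both booleans restart at 0
        elif flags is not None:
            if e == "EI":
                flags[0] = True
            elif e == "EV":
                flags[1] = True
    if flags is not None:
        visits.append(tuple(flags))          # close the last visit
    return visits

trace = ["clickMovie", "pageLoad", "linkI", "startSlide", "EI", "back",
         "clickMovie", "pageLoad", "noInt", "back"]
summary = fold_events(trace)
```

Each pair is then all that the next decision state needs in order to collect the delayed rewards, which is why the intermediary states can be dropped.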

##### 6.2. POMDP Modeling

The new partially observable model adds a hidden
variable (*It*) to the state. The value of *It* represents the user's interest
quantized on three levels (small, average, big). To be able to estimate user interest, we
follow the principles described in Section 5.2 and Figure 5. The events
(interactions) are taken out of the previous MDP state to become observations
in the POMDP model. These observations are distributed according to *It* (the interest level). A sequence of
observations provides an implicit measure of *It*, following the same principle described
for the HMM in Figure 5. Therefore, it becomes possible for the adaptation
agent to refine its decisions according to the probability of each level of the current
user interest: *small, average, big*.
In other words, this refinement is done according to a belief state. The
principle of this POMDP is illustrated in Figure 8.

*A hidden state* of our POMDP becomes a tuple that now includes *It*.
The notations are unchanged, including the two booleans.

*The temporal axis and the actions* are unchanged.

*The dynamics of the model*. When a decision event occurs, the adaptation agent is in a decision
state.
It chooses a valid action and moves, according to the model's random
transitions, to an intermediary state where both booleans are equal to 0. The version proposed by the
agent is memorized in the intermediary states during the browsing of the current film. The
booleans become 1 if the events *EI* or, respectively, *EV* are observed, and preserve this value until the
next decision state.
During the browsing of the current film, the rank and the version remain constant while the other factors (the bandwidth, the interest,
and the booleans) can change.

*The observations* are the occurred events: .
They are distributed according to the states. In Figure 8, the event can be observed in and (probability 1.0) and cannot be observed
elsewhere ( and ).

In every intermediary state, the event distribution characterizes the value of the interest. Thus, just like the HMM of Figure 5, the POMDP will know how to evaluate the current belief state from the sequence of events. The most likely interest value will therefore evolve along with the events that occur: it increases on events that suggest a rising interest and decreases on events that suggest a falling one. To preserve the interest level throughout the decision states, the interest of the current receives the value corresponding to the last (Figure 8).

*The rewards* associated with the actions taken in a decision-making state are collected in the following decision-making
state, where all the necessary information is available.

#### 7. Experimental Results

Simulations are used in order to experimentally validate the models. The developed software simulates navigations such as the one illustrated in Figure 4. Every transition probability between two successive states of navigation is a stochastic function of three parameters: the bandwidth, the user interest, and the movie version. The bandwidth is simulated as a random variable uniformly distributed in an interval compatible with today's mobile networks. The user interest is represented by a family of random variables whose expectation decreases with the movie rank. The version parameter is the movie version proposed to the user. Other experimental setups involving different distribution laws (e.g., a normal distribution) for the bandwidth dynamics or the user interest lead to similar results.

##### 7.1. MDP Validation for Observable Contexts

To validate the MDP model of Section 6.1, let us choose a problem with and . Initially, the intervals of and are quantized on 2 granularity levels: and . Rather than making an arbitrary choice of the values that define the rewards, we can look for the ones leading to the optimal policy shown in Table 1. In fact, this policy respects the principles formulated in Section 4.3 and could be proposed beforehand by an expert (Table 1 gives only for the pairs since .)

The value functions corresponding to the simplified MDP, estimated on a horizon of length 1 (between two decision-making states and ), can be written as follows: because, for all does not depend on action . where and represent the probabilities of observing the events *EI* and *EV*, respectively, knowing the version .

For every pair we have computed, based on simulations, the
probabilities of observing these events.
The policy is respected if and only if, for every pair, the value of the action prescribed by Table 1 is at least as large as the value of every other action. Writing these inequalities for the 4 pairs from Table 1 and using the estimated probabilities,
we obtain a system of 12 linear inequalities in the reward variables.
Two solutions of the system, among infinitely many, are used in the following.

Starting from these values, we can experimentally
check the correct behavior of our MDP model. Table 2 shows the policy obtained
automatically by dynamic programming or by the *Q*-learning algorithm, with 4 granularity
levels for the rank and the bandwidth and the first set of rewards.
This table refines the previous coarse-grained policy; it is not a simple
copy of actions (e.g., see the pairs :
change from to , :
change from to ,
etc.). This new policy is optimal with respect to the rewards ,
for this finer granularity level.
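The calibration step can be illustrated by computing one-step values for the three versions and checking which one a given reward set favors. The formula below is an assumption derived from the verbal description of the rewards (an initial reward per version, plus a reward at *EI* for versions *I* and *V* and at *EV* for version *V*); it is not the article's exact equation, and all numeric values and function names are hypothetical.

```python
def one_step_values(r_T, r_I, r_V, r_EI, r_EV, p_EI, p_EV):
    """Expected reward collected between two decisions for each version.

    p_EI[v] and p_EV[v] are the probabilities of observing EI and EV when
    version v is proposed (estimated by simulation in the article).
    """
    return {
        "T": r_T,                                          # text only: no media event possible
        "I": r_I + p_EI["I"] * r_EI,                       # slideshow may be viewed to the end
        "V": r_V + p_EI["V"] * r_EI + p_EV["V"] * r_EV,    # slideshow and video may be viewed
    }

def best_version(values):
    """Greedy choice: the version with the largest one-step value."""
    return max(values, key=values.get)

# Hypothetical context where media are often viewed (small rank, high bandwidth).
engaged = one_step_values(3, 2, 1, 4, 6, {"I": 0.5, "V": 0.5}, {"V": 0.5})
# Hypothetical context where media are rarely viewed (big rank, low bandwidth).
passive = one_step_values(3, 2, 1, 4, 6, {"I": 0.1, "V": 0.1}, {"V": 0.05})
```

Requiring `best_version` to return the action prescribed by the expert policy for each context yields linear constraints on the reward values, which is the idea behind the system of inequalities mentioned above.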

Solving the MDP for the second set of rewards gives a different refinement (Table 3) that shows richer versions (underlined) compared to the first one. The explanation lies in the increase of the rewards associated with the events *EI* and *EV*, which induces the choice of more complex versions over a wider range ( lasts for 3 classes of , when ).

##### 7.2. POMDP Validation: Interest-Refined Policies

Once MDPs are calibrated and return appropriate adaptation policies, their rewards can be reused to solve the POMDP models. The goal is to refine the MDP policies for the observable case by estimating user interest.

Two experimental steps are necessary. The first step consists of learning the POMDP model and the second in solving the decision-making problem.

For the learning process, the simplest method consists
of empirically estimating the transition and observation probabilities from
the simulator's traces. Starting from these traces, the probabilities are
obtained by computing frequencies.

Once a POMDP model is available, the next step is to solve it.
Solving a POMDP is notoriously delicate and computationally intensive (e.g.,
see the tutorial proposed at www.pomdp.org). We used the software package *pomdp-solve 5.3* in combination with *CPLEX* (with the more recent strategy
called finite grid).
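The frequency-based estimation of the transition probabilities can be sketched as follows. The trace format (a list of `(state, action, next_state)` triples, with the hidden interest readable because the traces come from a simulator) and the function name are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def estimate_transitions(traces):
    """Estimate P(next_state | state, action) by relative frequencies.

    `traces` is an iterable of (state, action, next_state) triples.
    """
    counts = defaultdict(Counter)
    for state, action, next_state in traces:
        counts[(state, action)][next_state] += 1
    return {
        key: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for key, ctr in counts.items()
    }

# Hypothetical trace: from state "s0" under action "V", "s1" was reached 3 times out of 4.
traces = [("s0", "V", "s1")] * 3 + [("s0", "V", "s2")]
model = estimate_transitions(traces)
```

The observation probabilities can be estimated in the same way by counting `(state, observation)` pairs instead of state transitions.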

The result returned by pomdp-solve is an automaton
that implements a “near optimal” deterministic policy, represented by
a decision-making graph (*policy graph*). The nodes of the graph contain
the actions, while the transitions are made according to
the observations. Only the transitions made possible by the navigation process
are to be exploited.
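Executing such a policy graph online amounts to running a finite-state controller. The sketch below uses a hypothetical three-node graph, not one produced by pomdp-solve, and it assumes that an observation without an outgoing edge leaves the current node unchanged.

```python
class PolicyGraph:
    """Finite-state controller: each node holds an action, edges follow observations."""

    def __init__(self, actions, edges, start):
        self.actions = actions  # node -> action ("T", "I", or "V")
        self.edges = edges      # (node, observation) -> next node
        self.node = start

    def act(self):
        """Action prescribed by the current node."""
        return self.actions[self.node]

    def observe(self, observation):
        """Follow the edge for this observation; stay put if none exists (assumption)."""
        self.node = self.edges.get((self.node, observation), self.node)

# Hypothetical toy graph: richer versions are reached after media are viewed to the end.
graph = PolicyGraph(
    actions={0: "T", 1: "I", 2: "V"},
    edges={(0, "EI"): 1, (1, "EV"): 2, (1, "noInt"): 0, (2, "stopVideo"): 1},
    start=0,
)
for event in ["pageLoad", "linkI", "startSlide", "EI"]:
    graph.observe(event)
next_version = graph.act()
```

The agent never manipulates the belief state explicitly at run time: the current node of the graph summarizes the history of observations.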

To illustrate this form of result, let us show one of the automata that is small enough to be displayed on an A4 page (Figure 9). We choose a single granularity level for the rank and the bandwidth and three levels for the interest. Additionally, we consider that the consumption of the slideshow precedes the consumption of the video. The obtained adaptation policy therefore takes into account only the variation of the estimated user interest (the rank and the bandwidth do not play any role).

Figure 9 shows that the POMDP agent learns to react
in a coherent way. For example, starting from a version ,
and observing *pageLoad, linkI,
startSlide, EI, noInt, back* the following version decided by the POMDP
agent is ,
which translates the sequence into an interest rise. This rise is even stronger
if, after the event *EI*, the user
follows the link *linkV*. This is
enough to make the agent select the version further.

Conversely, starting from version ,
an important decrease in interest can be observed on the sequence *startSlide, stopSlide, play, stopVideo, back*,
so the system decides .
A smaller decrease in interest can be associated with the sequence *startSlide, stopSlide, play, EV, back*, the
next version selected being .
These examples show that there exists a natural correlation between the richness
of the selected versions and the implicit user interest. For this problem,
where the rank and the bandwidth are not involved, the version given by the *policy graph* translates the estimation of
the running interest (growing with ). For each movie, the choice of version is
therefore based only on the events observed while browsing the previous movies.

Other sequences cause the decisions to be less
intuitive or harder to interpret. For example, the sequence *pageLoad, linkI, startSlide, stopSlide, noInt,
back* leaving leads to the decision .
In this sequence, a compromise between interest rise (suggested by *linkI, startSlide*) and decrease (suggested
by *stopSlide, noInt*) must be
established. Thus, a decision would not be illegitimate. The POMDP trades
off this decision according to its dynamics and its rewards. To obtain a
modified graph leading to a decision for this sequence, it would be sufficient that
the product decreases, where represents the probability to observe *EI* in the version ,
for a medium interest. In this case, *stopSlide*, instead of provoking a loopback
on the node 5, would bring the agent to the node 1. Then the agent would decide since the expectation of the gains associated
to would be smaller.

In general, the decision-making automaton depends on and .
When , ,
and vary, the automaton becomes too complex to be
displayed. The results of the POMDP then require a different presentation.
For instance, working with 3 granularity levels on ,
2 on ,
3 on and the set of rewards leads to a *policy graph* of more than
100 nodes. We apply it during numerous sequences of simulated navigations.
Table 4 gives the statistics on the decisions that have been taken. For every
triplet (rank, bandwidth, interest), the decisions (taken by the agent without knowing the interest) are counted and translated into
percentages.

We notice that the proposed content becomes statistically richer when the interest increases, which again shows that the interest estimated from the previous observations behaves as expected. Let us take an example and consider the bottom-right part of Table 4 (corresponding to and ). The probability of the policy proposing version increases with the interest: from 0% (small interest) to 2% (average interest) and then 10% (big interest).

Moreover, when and/or increase, the interest trend is correct. For
example, for a given set of and ( and ), the proposed version becomes richer with
the bandwidth's increase from (1%*T*, 99%*I*, 0%*V*) to (0%*T*, 51%*I*, 49%*V*).

The POMDP capacity to refine adaptation policies
according to the user interest is thus validated. Once the POMDP model is
solved (*offline* resolution), the obtained automaton is easily put into
practice *online* by encoding it into an adaptation agent.

#### 8. Conclusion

This paper has shown that sequential decision processes under uncertainty are well suited for defining adaptation mechanisms for dynamic contexts. According to the type of the context state (observable or partially observable), we have shown how to characterize adaptation policies by solving Markov decision processes (MDPs) or partially observable MDP (POMDP). These ideas have been applied to adapt a movie browsing service. In particular, we have proposed a method for refining a given adaptation policy according to user interest. The perspectives of this work are manifold. Our approach can be applied to cases where rewards are explicitly related to the service (e.g., to maximize the number of rented DVDs). It will also be interesting to extend our model by coupling it with functionalities from recommendation systems and/or from multimedia search systems. In the latter case, we would benefit a lot from a collection of real data, that is, navigation logs. These are the research directions that will guide our future work.

#### References

- D. Kelly and J. Teevan, “Implicit feedback for inferring user preference: a bibliography,” *ACM SIGIR Forum*, vol. 37, no. 2, pp. 18–28, 2003.
- T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, “Accurately interpreting clickthrough data as implicit feedback,” in *Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05)*, pp. 154–161, Salvador, Brazil, August 2005.
- T. Lemlouma and N. Layaïda, “Media resources adaptation for limited devices,” in *Proceedings of the 7th International Conference on Electronic Publishing (ICCC/IFIP '03)*, pp. 209–218, Minho, Portugal, June 2003.
- M. Margaritidis and G. C. Polyzos, “Adaptation techniques for ubiquitous internet multimedia,” *Wireless Communications and Mobile Computing*, vol. 1, no. 2, pp. 141–163, 2001.
- T. C. Thang, Y. J. Jung, and Y. M. Ro, “Dynamic programming based adaptation of multimedia contents in UMA,” in *Proceedings of the 5th Pacific Rim Conference on Advances in Multimedia Information Processing (PCM '04)*, vol. 3332 of *Lecture Notes in Computer Science*, pp. 347–355, Springer, Tokyo, Japan, November-December 2004.
- A. Divakaran, K. A. Peker, R. Radhakrishnan, Z. Xiong, and R. Cabasson, *Video Summarization Using MPEG-7 Motion Activity and Audio Descriptors in Video Mining*, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2003.
- B. Girod, M. Kalman, Y. J. Liang, and R. Zhang, “Advances in channel-adaptive video streaming,” in *Proceedings of IEEE International Conference on Image Processing (ICIP '02)*, vol. 1, pp. 9–12, Rochester, NY, USA, September 2002.
- G. Ghinea and G. Magoulas, “Quality of service for perceptual considerations: an integrated perspective,” in *Proceedings of IEEE International Conference on Multimedia and Expo (ICME '01)*, pp. 571–574, Tokyo, Japan, August 2001.
- S. R. Gulliver, T. Serif, and G. Ghinea, “Pervasive and standalone computing: the perceptual effects of variable multimedia quality,” *International Journal of Human Computer Studies*, vol. 60, no. 5-6, pp. 640–665, 2004.
- S. R. Gulliver and G. Ghinea, “Defining user perception of distributed multimedia quality,” *ACM Transactions on Multimedia Computing, Communications and Applications*, vol. 2, no. 4, pp. 241–257, 2006.
- P. Brusilovsky and E. Millán, “User models for adaptive hypermedia and adaptive educational systems,” in *The Adaptive Web: Methods and Strategies of Web Personalization*, vol. 4321 of *Lecture Notes in Computer Science*, pp. 3–53, Springer, Berlin, Germany, 2007.
- C. Romero, S. Ventura, and P. De Bra, “Knowledge discovery with genetic programming for providing feedback to courseware authors,” *User Modelling and User-Adapted Interaction*, vol. 14, no. 5, pp. 425–464, 2004.
- T. Syeda-Mahmood and D. Ponceleon, “Learning video browsing behavior and its application in the generation of video previews,” in *Proceedings of the ACM International Multimedia Conference and Exhibition (Multimedia '01)*, vol. 9, pp. 119–128, Ottawa, Canada, September-October 2001.
- C. Pleşca, V. Charvillat, and R. Grigoras, “User-aware adaptation by subjective metadata and inferred implicit descriptors,” in *Multimedia Semantics—The Role of Metadata*, vol. 101 of *Studies in Computational Intelligence*, pp. 127–147, Springer, Berlin, Germany, 2008.
- R. Grigoras, V. Charvillat, and M. Douze, “Optimizing hypervideo navigation using a Markov decision process approach,” in *Proceedings of the 10th ACM International Conference on Multimedia*, pp. 39–48, Juan-les-Pins, France, December 2002.
- G. Yavaş, D. Katsaros, Ö. Ulusoy, and Y. Manolopoulos, “A data mining approach for location prediction in mobile environments,” *Data and Knowledge Engineering*, vol. 54, no. 2, pp. 121–146, 2005.
- H. Kosch, L. Böszörményi, M. Döller, M. Libsie, P. Schojer, and A. Kofler, “The life cycle of multimedia metadata,” *IEEE Multimedia*, vol. 12, no. 1, pp. 80–86, 2005.
- C. Timmerer and H. Hellwagner, “Interoperable adaptive multimedia communication,” *IEEE Multimedia*, vol. 12, no. 1, pp. 74–79, 2005.
- O. Layaïda, S. B. Atallah, and D. Hagimont, “A framework for dynamically configurable and reconfigurable network-based multimedia adaptations,” *Journal of Internet Technology*, vol. 5, no. 4, pp. 363–372, 2004.
- P. M. Ruiz, J. A. Botía, and A. Gómez-Skarmeta, “Providing QoS through machine-learning-driven adaptive multimedia applications,” *IEEE Transactions on Systems, Man, and Cybernetics B*, vol. 34, no. 3, pp. 1398–1411, 2004.
- V. Charvillat and R. Grigoras, “Reinforcement learning for dynamic multimedia adaptation,” *Journal of Network and Computer Applications*, vol. 30, no. 3, pp. 1034–1058, 2007.
- R. S. Sutton and A. G. Barto, *Reinforcement Learning: An Introduction*, MIT Press, Cambridge, Mass, USA, 1998.
- M. Puterman, *Markov Decision Processes: Discrete Stochastic Dynamic Programming*, Wiley-Interscience, New York, NY, USA, 1994.
- S. P. Singh, T. Jaakkola, and M. I. Jordan, “Learning without state-estimation in partially observable markovian decision processes,” in *Proceedings of the 11th International Conference on Machine Learning (ICML '94)*, pp. 284–292, New Brunswick, NJ, USA, July 1994.
- A. R. Cassandra, L. P. Kaelbling, and M. L. Littman, “Acting optimally in partially observable stochastic domains,” in *Proceedings of the 12th National Conference on Artificial Intelligence (AAAI '94)*, vol. 2, pp. 1023–1028, Seattle, Wash, USA, July-August 1994.
- T. Syeda-Mahmood, “Learning and tracking browsing behavior of users using hidden markov models,” in *Proceedings of IBM Make It Easy Conference*, San Jose, Calif, USA, June 2001.
- R. O. Duda, P. E. Hart, and D. G. Stork, *Pattern Classification*, Wiley-Interscience, New York, NY, USA, 2nd edition, 2000.