Reinforcement Learning

SOCIAL HOUSEKEEPING THROUGH INTERCOMMUNICATING APPLIANCES AND SHARED RECIPES MERGING IN A PERVASIVE WEB-SERVICES INFRASTRUCTURE
D5.3 Reinforcement Learning
3rd review meeting
Cartif-Brussels 17/12/2014
Contents

The SandS system
◦ The intended role of Reinforcement Learning

The actor-critic model

Some results and conclusions
The SandS system

The role of the NI is to provide recipe recommendations for new tasks proposed by the eahouker

The role of Reinforcement Learning is to tune recipes to the eahouker's preferences
◦ Recipe individualization
◦ Learning implies (many) repetitions of
• recipe application
• eahouker feedback
The SandS System
[Figure: SandS system architecture]
The SandS system (explanation)

The eahouker forwards a new task to NI (through DI and ESN)

NI generates a new recipe for the task, sending it to the appliance and to RL for future processing

The eahouker receives the result and forwards satisfaction feedback to RL for future processing

Sometime in the future, the eahouker needs to perform the same task; NI detects this and forwards it to RL

RL generates a refined recipe according to past satisfaction, sending it to NI, which forwards it to the appliance

The eahouker again generates satisfaction feedback for RL, closing the loop (see the sketch below)
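
A minimal sketch of this loop in Python; the component interfaces (ni.generate_recipe, rl.refine_recipe, appliance.execute, eahouker.rate) are hypothetical stand-ins for exposition, not the actual SandS APIs:

    def handle_task(task, ni, rl, appliance, eahouker, history):
        """One round of the recipe loop for a single task (illustrative)."""
        if task not in history:
            recipe = ni.generate_recipe(task)                # new task: NI recommends
        else:
            recipe = rl.refine_recipe(task, history[task])   # repeat task: RL refines
        result = appliance.execute(recipe)                   # appliance runs the recipe
        feedback = eahouker.rate(result)                     # satisfaction feedback, e.g. 1..5
        history.setdefault(task, []).append((recipe, feedback))
        return feedback

Each call records the (recipe, feedback) pair, so RL accumulates exactly the repetition history the loop above relies on.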
Reinforcement Learning

Elements (see the code sketch after this list)
◦ Action selection: recipe generation
◦ State of the system: appliance and task
◦ Reward: eahouker feedback
◦ Learning goal: optimal action selection policies
• Evaluation of policy quality: critic
• Generation of optimal actions: actor
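
These elements map naturally onto data types; a minimal sketch in Python, where the field names are illustrative assumptions rather than the SandS schema:

    from dataclasses import dataclass

    @dataclass
    class State:
        appliance: str    # which appliance, e.g. "washing_machine"
        task: str         # the task requested by the eahouker

    @dataclass
    class Action:
        recipe: dict      # appliance settings, e.g. {"temp_C": 40, "spin_rpm": 1200}

    Reward = int          # eahouker satisfaction feedback, e.g. 1..5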
The actor-critic model

Policies and value functions are modelled by Radial Basis Functions

The actor updates the action-generation parameters (first update below)

The critic updates the valuation function (second update below)
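
One standard way to write these updates, as a sketch (assuming Gaussian RBF features \varphi_i, learning rates \alpha_a and \alpha_c, and discount \gamma; the exact SandS formulation is not reproduced here):

    V(s) = \sum_i v_i \,\varphi_i(s), \qquad
    \mu(s) = \sum_i w_i \,\varphi_i(s), \qquad
    a_t = \mu(s_t) + \varepsilon_t

    \delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)                          % TD error
    v_i \leftarrow v_i + \alpha_c \,\delta_t \,\varphi_i(s_t)              % critic update
    w_i \leftarrow w_i + \alpha_a \,\delta_t \,(a_t - \mu(s_t))\,\varphi_i(s_t)   % actor update

Here \delta_t > 0 means the tried recipe was better than the critic expected, so the actor's mean moves toward it; the critic then tracks the new expectation.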
Some results

We need to resort to simulation

Case study: washing machine (abstract model)

We generate hidden target task-recipe pairs
◦ An initial random recipe is provided
◦ RL is applied to tune the initial recipe
◦ Evolution is measured by the difference between the hidden recipe and the one generated by the actor (see the sketch below)
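
An end-to-end sketch of this simulation in Python with NumPy, assuming Gaussian RBF features, one-step episodes (so the discount plays no role), and a quadratic satisfaction model; every constant (dimension, learning rates, noise level) is an illustrative assumption, not a SandS parameter:

    import numpy as np

    rng = np.random.default_rng(0)

    # Abstract washing-machine recipe: a vector of normalised settings.
    DIM = 3
    target = rng.uniform(0.2, 0.8, DIM)          # hidden target task-recipe pair

    def feedback(recipe):
        """Simulated eahouker satisfaction: higher when closer to the target."""
        return -np.sum((recipe - target) ** 2)

    # Gaussian RBF features over a 1-D task descriptor.
    centers = np.linspace(0.0, 1.0, 10)
    WIDTH = 0.1

    def phi(x):
        f = np.exp(-((x - centers) ** 2) / (2 * WIDTH ** 2))
        return f / f.sum()

    W = rng.uniform(0.0, 1.0, (DIM, len(centers)))   # actor weights (initial random recipe)
    v = np.zeros(len(centers))                       # critic weights
    ALPHA_A, ALPHA_C, SIGMA = 0.5, 0.5, 0.1

    x = 0.5                                          # fixed task descriptor
    for trial in range(3000):
        f = phi(x)
        mu = W @ f                                   # actor's current recipe
        a = mu + SIGMA * rng.standard_normal(DIM)    # exploratory recipe
        delta = feedback(a) - v @ f                  # TD error (one-step episode)
        v += ALPHA_C * delta * f                     # critic update
        W += ALPHA_A * delta * np.outer(a - mu, f)   # actor update

    # The distance should shrink toward zero as learning proceeds.
    print("distance to hidden recipe:", np.linalg.norm(W @ phi(x) - target))

The printed distance is the evolution measure described in the last bullet.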
[Figures: evolution of the difference between the hidden recipe and the actor's recipe]
Some results

Increase in satisfaction across 6 experiments (5 is the maximum)
Conclusions

Reinforcement learning can be applied to recipe tuning

It requires repetitive experimentation by the eahouker

It shows good convergence, meaning that a few trials already yield a significant improvement in the eahouker experience

Real-life application may be achieved once the system is fully deployed
◦ The software is available and modular