Synthesis of Provably Correct Autonomy Protocols

for Shared Control

Murat Cubuktepe, Nils Jansen, Mohammed Alsiekh, Ufuk Topcu

Abstract—We develop algorithms to synthesize shared control

protocols subject to probabilistic temporal logic specifications.

More specifically, we develop a framework in which a human

and an autonomy protocol can issue commands to carry out

a certain task. We blend these commands into a joint input

to the robot. We model the interaction between the human

and the robot as a Markov decision process that represents

the shared control scenario. Additionally, we use randomized

strategies in a Markov decision process to incorporate potential

randomness in human’s decisions, which is caused by factors such

as complexity of the task specifications and imperfect interfaces.

Using inverse reinforcement learning, we obtain an abstraction

of the human’s behavior as a so-called randomized strategy. We

design the autonomy protocol to ensure that the robot behavior—

resulting from the blending of the commands—satisfies given

safety and performance specifications in probabilistic temporal

logic. Additionally, the resulting strategies generate behavior as

similar to the behavior induced by the human’s commands as

possible. We formulate the underlying problem as a quasiconvex

optimization problem, which is solved by checking feasibility

of a number of linear programming problems. We show the

applicability of the approach through case studies involving

autonomous wheelchair navigation and unmanned aerial vehicle

mission planning.

I. INTRODUCTION

We study the problem of shared control, where a robot

is to execute a task according to the goals of a human

operator while adhering to additional safety and performance

requirements. Applications of such human-robot interaction

include remotely operated semi-autonomous wheelchairs 13,

robotic teleoperation 16, and human-in-the-loop unmanned

aerial vehicle mission planning 9. Basically, the human

operator issues a command through an input interface, which

maps the command directly to an action for the robot. The

problem is that a sequence of such actions may fail to

accomplish the task at hand, due to limitations of the interface

or failure of the human in comprehending the complexity of

the problem. Therefore, a so-called autonomy protocol provides

assistance for the human in order to complete the task according

to the given requirements.

At the heart of the shared control problem is the design of

an autonomy protocol. In the literature, there are two main

directions, based on either switching the control authority

M. Cubuktepe and U. Topcu are with the Department of Aerospace Engi-

neering and Engineering Mechanics, University of Texas at Austin, 201 E 24th

St, Austin, TX 78712, USA. Nils Jansen is with the Department of Software

Science, Radboud University Nijmegen, Comeniuslaan 4, 6525 HP Nijmegen,

Netherlands. Mohammed Alsiekh is with the Institute for Computational

Engineering and Sciences, University of Texas at Austin, 201 E 24th St,

Austin, TX 78712, USA. email:({mcubuktepe,malsiekh,utopcu}@utexas.edu,

[email protected]).

between human and autonomy protocol 26, or on blending

their commands towards joined inputs for the robot 7, 15.

One approach to switching the authority first determines

the desired goal of the human operator with high confidence,

and then assists towards exactly this goal 8, 18. In 12,

switching the control authority between the human and au-

tonomy protocol ensures satisfaction of specifications that are

formally expressed in temporal logic. In general, the switching

of authority may cause a decrease in human’s satisfaction, who

usually prefers to retain as much control as possible 17.

The second direction is to provide another command in

addition to the one of the human operator. To introduce a

more flexible trade-off between the human’s control authority

and the level of autonomous assistance, both commands are

then blended to form a joined input for the robot. A blending

function determines the emphasis that is put on the autonomy

protocol in the blending, that is, regulating the amount of

assistance provided to the human. Switching of authority can

be seen as a special case of blending, as the blending function

may assign full control to the autonomy protocol or to the

human. In general, putting more emphasis on the autonomy

protocol in blending leads to greater accuracy in accomplishing

the task 6, 7, 20. However, as before, humans may prefer

to retain control of the robot 16, 17. None of the existing

blending approaches provide formal correctness guarantees

that go beyond statistical confidence bounds. Correctness here

refers to ensuring safety and optimizing performance according

to the given requirements. Our goal is to design an autonomy

protocol that admits formal correctness while rendering the

robot behavior as close to the human’s inputs as possible.

We model the behavior of the robot as a Markov decision

process (MDP) 23, which captures the robot’s actions inside a

potentially stochastic environment. Problem formulations with

MDPs typically focus on maximizing an expected reward (or,

minimizing the expected cost). However, such formulations may

not be sufficient to ensure safety or performance guarantees in

a task that includes a human operator. Therefore, we design

the autonomy protocol such that the resulting robot behavior

is formally verified with respect to probabilistic temporal logic

specifications. Such verification problems have been extensively

studied for MDPs 2.

Take as an example a scenario involving a semi-autonomous

wheelchair 13 whose navigation has to account for a randomly

moving autonomous vacuum cleaner, see Fig. 1. The wheelchair

needs to navigate to the exit of a room, and the vacuum

cleaner moves in the room according to a probabilistic transition

function. The task of the wheelchair is to reach the exit gate

while not crashing into the vacuum cleaner. The human may not

(a) Autonomy perspective

0.4

0.4

0.2

0.2

(b) Human perspective

Fig. 1. A wheelchair in a shared control setting.

fully perceive the motion of the vacuum cleaner. Note that the

human’s commands, depicted with the solid red line in Fig 1(a),

may cause the wheelchair to crash into the vacuum cleaner.

The autonomy protocol provides another set of commands,

which is indicated by the solid red line in Fig 1(b), to carry

out the task safely without crashing. However, the autonomy

protocol’s commands deviate highly from the commands of

the human. The two sets of commands are then blended into

a new set of commands, depicted using the dashed red line

in Fig 1(b). The blended commands perform the task safely

while generating behavior as similar to the behavior induced

by the human’s commands as possible.

In 15, we formulated the problem of designing the

autonomy protocol as a nonlinear programming problem. How-

ever, solving nonlinear programs is generally intractable 3.

Therefore, we proposed a greedy algorithm that iteratively

repairs the human strategy such that the specifications are

satisfied without guaranteeing optimality, based on 22. Here,

we propose an alternative approach for the blending of the two

strategies. We follow the approach of repairing the strategy

of the human to compute an autonomy protocol. We ensure

that the resulting behavior induced by the repaired strategy (1)

deviates minimally from the human strategy and (2) satisfies

given temporal logic specifications after blending. We formally

define the problem as a quasiconvex optimization problem,

which can be solved efficiently by checking feasibility of a

number of convex optimization problems 4.

A human may be uncertain about which command to issue

in order to accomplish a task. Moreover, a typical interface

used to parse human’s commands, such as a brain-computer

interface, is inherently imperfect. To capture such uncertainties

and imperfections in the human’s decisions, we introduce

randomness to the commands issued by humans. It may not

be possible to blend two different deterministic commands. If

the human’s command is “up” and the autonomy protocol’s

command is “right”, we cannot blend these two commands

to obtain another deterministic command. By introducing

randomness to the commands of the human and the autonomy

protocol, we therefore ensure that the blending is always well-

defined. In what follows, we call a formal interpretation of a

sequence of the human’s commands the human strategy, and

the sequence of commands issued by the autonomy protocol

the autonomy strategy.

The question remains how to obtain the human strategy in

the first place. It may be unrealistic that a human can provide

the strategy for an MDP that models a realistic scenario. To this

end, we create a virtual simulation environment that captures

the behavior of the MDP. We ask humans to participate in two

case studies to collect data about typical human behavior. We

use inverse reinforcement learning to get a formal interpretation

as a strategy based on human’s inputs 1, 28. We model

a typical shared control scenario based on an autonomous

wheelchair navigation 13 in our first case study. In our second

case study, we consider an unmanned aerial vehicle mission

planning scenario, where the human operator is to patrol certain

regions while keeping away from enemy aerial vehicles.

In summary, the main contribution this paper is to synthesize

the autonomy protocol such that the resulting blended or

repaired strategy meets all given specifications while only

minimally deviating from the human strategy. We introduce all

formal foundations that we need in Section II. We provide an

overview of the general shared control concept in Section III.

We present the shared control synthesis problem and provide

a solution based on linear programming in Section IV. We

indicate the applicability and scalability of our approach on

experiments in Section V and draw a conclusion and critique

of our approach in Section VI.

II. PRELIMINARIES

In this section, we introduce the required formal models and

specifications that we use to synthesize the autonomy protocol,

and we give a short example illustrating the main concepts.

1) Markov Decision Processes: A probability distribution

over a finite set X is a function µ : X ? 0, 1 ? R with?

x?X µ(x) = µ(X) = 1. The set X of all distributions is

Distr(X).

Definition 1 (MDP). A Markov decision process (MDP)M =

(S, sI ,A,P) has a finite set S of states, an initial state sI ? S,

a finite set A of actions, and a transition probability function

P : S ×A? Distr(S).

MDPs have nondeterministic choices of actions at the

states; the successors are determined probabilistically via the

associated probability distribution. We assume that all actions

are available at each state and that the MDP contains no

deadlock states. A cost function C : S ×A? R?0 associates

cost to transitions. If there one single action available at each

state, the MDP reduces to a discrete-time Markov chain (MC).

We use strategies resolve the choices of actions in order to

define a probability measure and expected cost on MDPs.

Definition 2 (Strategy). A randomized strategy for an MDPM

is a function ? : S ? Distr(A). If ?(s, ?) = 1 for ? ? A and

?(s, ?) = 0 for all ? ? A {?}, the strategy is deterministic.

The set of all strategies over M is StrM.

Resolving all the nondeterminism for an MDP M with a

strategy ? ? StrM yields an induced Markov chain M? .

s0 s1 s2

s3 s4

a

b

c

d

0.6

0.4

0.4

0.6

0.6

0.4

0.4

0.6

11

1

(a) MDP M

s0 s1 s2

s3 s4

0.5

0.5

0.5

0.5

11

1

(b) Induced MC M?1

Fig. 2. MDP M with target state s2 and induced MC for strategy ?unif

Definition 3 (Induced MC). For an MDP M = (S, sI ,A,P)

and strategy ? ? StrM, the MC induced by M and ? is

M? = (S, sI ,A,P?) where

P?(s, s?) =

?

??A(s)

?(s, ?) · P(s, ?, s?) for all s, s? ? S .

The occupancy measure of a strategy ? gives the expected

number of taking an action at a state in an MDP. In our

formulation, we use the occupancy measure of a strategy to

compute an autonomy protocol.

Definition 4 (Occupancy Measure). The occupancy measure

x? of a strategy ? is defined as

x?(s, ?) = E

??

t=0

P (?t = ?|st = s)

(1)

where st and ?t denote the state and action in an MDP at

time step t. The occupancy measure x?(s, ?) is the expected

number of taking the action ? at state s with the strategy ?.

2) Specifications: A probabilistic reachability specification

P??(?T ) with the threshold ? ? 0, 1 ? Q and the set of

target states T ? S puts an upper bound ? on the probability

to reach T from sI in M. Similarly, expected cost properties

E??(?G) restrict the expected cost to reach the set G ? S

of goal states by an upper bound ? ? Q. We also use until

properties of the form Pr??(¬T U G), which assert that the

probability of reaching the set of states G while not reaching

the set of target states T beforehand is at least ?.

The synthesis problem is to find one particular strategy ?

for an MDP M such that given specifications ?1, . . . , ?n are

satisfied in the induced MC M? , written ? |= ?1, . . . , ?n.

Example 1. Fig. 2(a) depicts an MDP M with initial state

s0. In state s0, the avaliable actions are a and b. Similarly

for state s1, the two avaliable actions are c and d. If action a

is selected in state s0, the agent transitions to s1 and s3 with

probabilities 0.6 and 0.4. For states s2, s3 and s4 we omit

actions, because of the self loops.

For a safety specification ? = P?0.21(?s2), the deterministic

strategy ?1 ? StrM with ?1(s0, a) = 1 and ?1(s1, c) =

1 induces the probability 0.36 of reaching s2. Therefore,

the specification is not satisfied, see the induced MC in

Fig. 2(b). Likewise, the randomized strategy ?unif ? StrM

with ?unif(s0, a) = ?unif(s0, b) = 0.5 and ?unif(s1, c) =

Human

Strategy

Autonomy

Strategy

Blended

Strategy

Robot

execution

command

command

blended command

Blending

function b

Formal

model Mr

Specifications

?1, . . . , ?n

Human

strategy

Fig. 3. Shared control architecture.

?unif(s1, d) = 0.5 violates the specification, as the probability

of reaching s2 is 0.25. However, the deterministic strategy

?safe ? StrM with ?safe(s0, b) = 1 and ?safe(s1, d) = 1 induces

the probability 0.16, thus ?safe |= ?.

III. CONCEPTUAL DESCRIPTION OF SHARED CONTROL

We now detail the general shared control concept adopted in

this paper. Consider the setting in Fig. 3. As inputs, we have a

set of task specifications, a model Mr for the robot behavior,

and a blending function b. The given robot task is described by

certain performance and safety specifications ?1, . . . , ?n. For

example, it may not be safe to take the shortest route because

there may be too many obstacles in that route. In order to

satisfy performance considerations, the robot should prefer to

take the shortest route possible while not violating the safety

specifications. We model the behavior of the robot inside a

stochastic environment as an MDP Mr.

In out setting, a human issues a set of commands for the

robot to execute. It may be unrealistic that a human can

grasp an MDP that models a realistic shared control scenario.

Indeed, a human will likely have difficulties interpreting a

large number of possibilities and the associated probability

of paths and payoffs 11, and it may be impractical for the

human to provide the human strategy to the autonomy protocol,

due to possibly large state space of the MDP. Therefore,

we compute a human strategy ?h as an abstraction of a

sequence of human’s commands, which we obtain using inverse

reinforcement learning 1, 28.

We design an autonomy protocol that provides another

strategy ?a, which we call the autonomy strategy. Then, we

blend the two strategies according to the blending function b

into the blended strategy ?ha. The blending function reflects

preference over the human strategy or the autonomy strategy.

We ensure that the blended strategy deviates minimally from

the human strategy.

At runtime, we can then blend commands of the human with

commands of the autonomy strategy. The resulting “blended”

commands will induce same behavior as with the blended

strategy ?ha, and the specifications are satisfied. This procedure

requires to check feasibility of a number of linear programming

problems, which can be expensive, but it is very efficient to

implement at run time.

The shared control synthesis problem is then the synthesis

of the repaired strategy such that, for the repaired strategy ?ha,

it holds ?ha |= ?1, . . . , ?n while deviating minimally from ?h.

The deviation between the human strategy ?h and the repaired

strategy ?ha is measured by the maximal difference between

the two strategies.

IV. SYNTHESIS OF THE AUTONOMY PROTOCOL

A. Strategy blending

Given the human strategy ?h ? StrMr and the autonomy

strategy ?a ? StrMr , a blending function computes a weighted

composition of the two strategies by favoring one or the other

strategy in each state of the MDP 16, 6, 7.

7 argues that the weight of blending shows the confidence

in how well the autonomy protocol can assist to perform the

human’s task. Put differently, the blending function should

assign a low confidence to the actions of the human, if the

actions may lead to a violation of the specifications. Recall

Fig. 1 and the example in the introduction. In the cells of the

gridworld where some actions may result in a collusion with

the vacuum cleaner with high probability, it makes sense to

assign a high weight in the autonomy strategy.

We pick the blending function as a state-dependent function

that weighs the confidence in both the human strategy and the

autonomy strategy at each state of an MDP Mr 16, 6, 7.

Definition 5 (Linear blending). Given an MDP Mr =

(S, sI ,A,P), two strategies ?h, ?a ? StrMr , and a blending

function b : S ? 0, 1, the blended strategy ?ha ? StrMr for

all states s ? S, and actions ? ? A is

?ha(s, ?) = b(s) · ?h(s, ?) + (1? b(s)) · ?a(s, ?) .

For each s ? S, the value of b(s) represents the “weight” of

?h at s, meaning how much emphasis the blending function

puts on the human strategy at state s. For example, referring

back to Fig. 1, the critical cells of the gridworld correspond

to certain states of the MDP Mr. At these states, we assign

a very low confidence in the human strategy. For instance at

such a state s ? S, we might have b(s) = 0.1, meaning the

blended strategy in state s puts more emphasis on the autonomy

strategy ?a.

B. Strategy repair

In this section, we describe our approach to the shared control

synthesis problem. We formulate the specific problem of repair-

ing the human strategy based on quasiconvex programming,

which can be solved by checking feasibility of a number of

convex (here even linear) programming problems. We show

that the repaired strategy satisfies the task specifications while

deviating minimally from the human strategy. Finally, we

compute the autonomy strategy ?a that may then be subject

to blending.

We start by formalizing the concept of strategy perturbation.

Then, we introduce the dual linear programming (LP) formula-

tion to synthesize a strategy in a MDP. Finally, we present an

LP formulation of the strategy repair problem and show that

the solution indeed satisfies the specifications while deviating

from the human’s strategy minimally.

1) Perturbation of strategies: As mentioned in the introduc-

tion, the blended strategy should deviate minimally from the

human strategy. To measure the quantity of such a deviation,

we introduce the concept of perturbation, which was used

in 5. To modify a (randomized) strategy, we employ additive

perturbation by increasing or decreasing the probabilities of

action choices in each state. We also ensure that for each state,

the resulting strategy is a well-defined distribution over the

actions.

Definition 6 (Strategy perturbation). Given an MDP Mr and

a strategy ? ? StrMr , an (additive) perturbation ? is a function

? : S ×A? ?1, 1 with?

??A

?(s, ?) = 0 ?s ? S.

The perturbation value at state s for action ? is ?(s, ?).

Overloading the notation, the perturbed strategy ?(?) is given

by

?(?)(s, ?) = ?(s, ?) + ?(s, ?) ?s ? S.? ? A. (2)

2) Dual linear programming formulation for MDPs: In this

section, we refer to the LP formulation to compute a policy

that maximizes the probability of satisfying a reachability

specification P??(?T ) in a MDP 24, 10.

The LP formulation builds on the set Var of variables

including the following:

• x(s, ?) ? 0,?) for each s ? S T and ? ? A defines

the occupancy measure of a state-action pair.

maximize

?

s?ST

?

??A

?(s, ?)x(s, ?) (3)

subject to

?s ? S T.?

??A

x(s, ?) =

?

s??ST

?

??A

P(s?, ?, s)x(s?, ?) + ?s (4)?

s?ST

?

??A

?(s, ?)x(s, ?) ? ? (5)

where ?(s, ?) =

?

s??T

P(s, ?, s?), and ?s = 1 if s = sI and

?s = 0 if s 6= sI .

For any optimal solution x to the LP in (3)–(5), then

?(s, ?) =

x(s, ?)?

??A

x(s, ?)

(6)

is an optimal strategy, and x is the occupancy measure of ?,

see 24 and 10 for details.

After finding an optimal strategy to the LP in (3)–(5), we

can compute the probability of satisfying a specification by

?

s?ST

?

??A

?(s, ?)x(s, ?). (7)

3) Strategy repair using quasiconvex programming: Given

the human strategy, ?h ? StrMr , the aim of the autonomy

protocol is to compute the blended strategy, or the repaired

strategy ?ha that induces a similar behavior to the human

strategy while satisfying the specifications. We compute the

repaired strategy by perturbing the human strategy, which is

introduced in Definition 6. As in Definition 6, we perturb the

human strategy ?h to the repaired strategy ?ha by

?s ? S.? ? A. ?ha(s, ?) = ?h(s, ?) + ?(s, ?). (8)

We cannot add this constraint to the LP in (3)–(5) because

the variables in the LP are for the occupancy measures for

the strategy. To perturb the strategy, we use the definition of

occupancy measure to translate the constraint in (8) into the

constraint

?s ? S.? ? A. xha(s, ?)?

??A

xha(s, ?)

= ?h(s, ?) + ?(s, ?) (9)

or equivalently to the constraint

?s ? S.? ? A.

xha(s, ?) =

?

??A

xha(s, ?) (?h(s, ?) + ?(s, ?)) . (10)

Note that the constraint in (10) is not a linear constraint

because of multiplication of the perturbation variables ? and

occupancy measure variables x. Since we are interested in

minimizing the maximal deviation, we can assign a common

variable ?? ? 0, 1 for all state-action pairs to put an upper

bound on the deviation by

?s ? S.? ? A.

|xha(s, ?)?

?

??A

xha(s, ?)?h(s, ?)| ? ??

?

??A

xha(s, ?).

(11)

The constraint in (11) is also not linear constraint. In fact, it

is a quadratic constraint due to multiplication of ?? and xha.

However, the feasible set of policies scales monotonically with

??, i. e., more policies are feasible with increasing ??. Note that

for a fixed ??, the above constraint is linear, and therefore it is

a convex constraint.

We show that the formulation for the shared control synthesis

problem is given by a quasiconvex optimization problem (QCP).

The QCP formulation incorporates the occupancy variables

for the repaired strategy ?ha, and an upper bound for the

perturbation ?? of the human strategy. We show the formulation

for a single reachability specification ? = P??(?T ). Then,

we discuss how to add additional safety and expected cost

specifications in the QCP formulation.

The QCP formulation builds on the set Var of variables

including the following:

• xha(s, ?) ? 0,?) for each s ? S T and ? ? A

defines the occupancy measure of a state-action pair for

the repaired strategy.

• ?? ? 0, 1 gives the upper bound of the perturbation of

the strategy ?h.

The shared control synthesis problem can be formulated as the

following QCP:

minimize ?? (12)

subject to

?s ? S T.?

??A

xha(s, ?) =

?

s??ST

?

??A

P(s?, ?, s)xha(s?, ?) + ?s

(13)

?s ? S.? ? A.?

s?ST

?

??A

?(s, ?)xha(s, ?) ? ? (14)

|xha(s, ?)?

?

??A

xha(s, ?)?h(s, ?)| ? ??

?

??A

xha(s, ?).

(15)

Let us walk through the optimization (12)–(15). We minimize

the infinity norm of the perturbation in (12). The constraints

in (13) ensures that the resulting strategy ?ha from the

occupancy variables xha is well-defined. The constraint in (14)

constrains the probability of reaching the set of target states

T is smaller than or equal to ? to satisfy the specification

P??(?T ). We perturb the human strategy ?h to the repaired

strategy ?ha as in Definition 6 in (15) by putting an upper

bound on the maximal perturbation.

The problem in (12)–(15) not a convex optimization problem

because of the nonconvex constraint in (15). However, for a

fixed ??, the problem in (12)–(15) is a LP and the set of feasible

policies scales monotonically in ??. Therefore, the problem

in (12)–(15) is a quasiconvex optimization problem. We can

solve the QCP in (12)–(15) by employing a bisection method

on ??. A method to solve quasiconvex optimization problems is

given in 4, Algorithm 4.1. We adopt the Algorithm 4.1 in 4

as follows:

Algorithm 1: Bisection method for quasiconvex optimiza-

tion problem.

given l = 0, u = 1, tolerance ? ; 0.

repeat

1. Set ?? = (l + u)/2.

2. Solve the convex feasibility problem in (13)–(15).

3. if the problem in (13)–(15) is feasible, u := ??; else

l := ??.

until u? l ? ?.

The above algorithm requires exactly

?

log2(

1

?

)

?

iterations,

and we can compute a policy that induces a minimal deviation

up to any given accuracy of ?.

Theorem 1. The strategy obtained from the QCP in (12)–(15)

satisfies the task specifications and it is optimal with respect

to the maximal deviation.

Proof. From a satisfying assignment to the constraints in (12)–

(15), we can compute a policy that satisfies the specification

using (6). Using Algorithm (1), we can compute the repaired

policy ?ha that deviates minimally from the human strategy

?h up to ? accuracy in

?

log2(

1

?

)

?

iterations. Therefore, by

solving the QCP (12)–(15), we can compute an optimal policy

for the shared control synthesis problem.

We note that the solution given by the QCP in (12)–(15)

computes the minimally deviating repaired strategy ?ha using

strategy repair. On the one hand, reference 15 considers

repairing the strategy with a greedy approach. Their approach

requires solving possibly an unbounded number of LPs to

compute a feasible strategy that is not necessarily optimal. On

the other hand, we only need to check feasibility of a bounded

number of LPs to compute an optimal strategy. Note that

we do not compute the autonomy strategy ?a with the QCP

in (12)–(15) directly. After computing the repaired strategy

?ha, we compute the autonomy strategy ?a according to the

Definition 5.

4) Additional specifications: The QCP in (12)–(15) com-

putes an optimal policy for a single reachability specification.

Suppose that we are given another reachability specification

?? = P???(?T ?) with T ? 6= T in addition to the reachability

specification ? = P??(?T ). We can handle this specification

by the adding ?

s?ST ?

?

??A

??(s, ?)x(s, ?) ? ? (16)

to the QCP in (12)–(15), where ??(s, ?) =

?

s??T ?

P(s, ?, s?).

The constraint in (16) ensures that the probability of reaching

T ? is less than ??.

We handle an expected cost specification E??(?G) for G ?

S, by adding ?

s?SG

?

??A

C(s, ?)x(s, ?) ? ? (17)

to the QCP in (12)–(15).

Similar to the probability computation for reachability

specifications, the expected cost for reaching G is given by?

s?SG

?

??A

C(s, ?)x(s, ?). (18)

V. CASE STUDY AND EXPERIMENTS

We require an abstract representation of the human’s commands

as a strategy to use our synthesis approach in a shared control

scenario, We now discuss how such strategies may be obtained

using inverse reinforcement learning and report on case study

results.

A. Experimental setting

We consider two scenarios, the first of which is the wheelchair

scenario from Fig. 1. We model the wheelchair scenario inside

an interactive Python environment.

In the second scenario, we use a tool called AMASE1, which

is used to simulate multi-unmanned aircraft vehicles (UAV)

missions. Its graphical user interfaces allow humans to send

commands to the one or multiple vehicles at run time. It

includes three main programs: a simulator, a data playback

tool, and a scenario setup tool.

We use the model checker PRISM 19 to verify if the

computed strategies satisfy the specification. We use the LP

solver Gurobi 14 to check the feasibility of the LP problems

that is given in Section IV. We also implemented the greedy

approach for strategy repair in 15.

B. Data collection

We asked five participants to accomplish tasks in the wheelchair

scenario. The goal is moving the wheelchair to a target cell in

gridworld while never occupying the same cell as the moving

obstacle. Similarly, three participants performed the surveillance

task in the AMASE environment.

From the data obtained from each participant, we compute

an individual randomized human strategy ?h via maximum-

entropy inverse reinforcement learning (MEIRL) 28. Refer-

ence 16 uses inverse reinforcement learning to reason about

the human’s commands in a shared control scenario from

human’s demonstrations. However, they lack formal guarantees

on the robot’s execution. In 25, inverse reinforcement learning

is used to distinguish human intents with respect to different

tasks. We show the work flow of the case study in Figure 4.

In our setting, we denote each sample as one particular

command of the participant, and we assume that the participant

issues the command to satisfy the specification. Under this

assumption, we can bound the probability of a possible

deviation from the actual intent with respect to the number of

samples using Hoeffding’s inequality for the resulting strategy,

see 27 for details. Using these bounds, we can determine

the required number of command to get an approximation

of a typical human behavior. The probability of a possible

deviation from the human behavior is given by O(exp(?n?2)),

where n is the number of commands from the human and ?

is the upper bound on the deviation between the probability

of satisfying the specification with the true human strategy

and the probability obtained by the strategy that is computed

by inverse reinforcement learning. For example, to ensure an

upper bound ? = 0.05 on the deviation of the probability

1https://github.com/afrl-rq/OpenAMASE

Fig. 4. The setting of the case study for the shared control simulation.

of satisfying the specification with a probability of 0.99, we

require 1060 demonstrations from the human.

We design the blending function by assigning a lower weight

in the human strategy at states where the human strategy

yields a lower probability of reaching the target set. Using

this function, we create the autonomy strategy ?a and pass

(together with the blending function) back to the environment.

Note that the blended strategy ?ha satisfies the specification,

by Theorem 1.

C. Gridworld

The size of the gridworld in Fig. 1 is variable, and we

generate a number of randomly moving (e.g., the vacuum

cleaner) and stationary obstacles. An agent (e.g., the wheelchair)

moves in the gridworld according to the commands from a

human. For the gridworld scenario, we construct an MDP and

the states of the MDP represents the positions of the agent and

the obstacles. The actions in the MDP induce changes in the

position of the agent.

The safety specification specifies that the agent has to reach

a target cell while not crashing into an obstacle with a certain

probability ? ? 0, 1, formally P??(¬crash U target).

First, we report results for one particular participant in a

gridworld scenario with a 8× 8 grid and one moving obstacle.

The resulting MDP has 2304 states. We compute the human

strategy using MEIRL with the features, e. g., the components

of the cost function of the human, giving the distance to the

obstacle and the goal state.

We instantiate the safety specification with ? = 0.7, which

means the target should be reached with at least a probability

of 0.7. The human strategy ?h induces a probability of 0.546

to satisfy the specification. That is, it does not satisfy the

specification.

We compute the repaired strategy ?ha using the greedy and

the QCP approach, and both strategies satisfies the specification

by inducing a probability of satisfying the specification larger

than ?. On the one hand, the maximum deviation between

the human strategy ?h and ?ha is 0.15 with the LP approach,

which implies that the strategy of the human and the autonomy

protocol deviates at most 15% for all states and actions. On

the other hand, the maximum deviation between the human

strategy ?h and the blended strategy ?ha is 0.03 with the QCP

approach. The results shows that the QCP approach computes

a blended strategy that induces more similar strategy to the

human strategy compared to the LP approach.

(a) Strategy ?h (b) Strategy ?ah (c) Strategy ?a

Fig. 5. Graphical representation of the obtained human, blended, and autonomy

strategy in the grid.

TABLE I

SCALABILITY RESULTS FOR THE GRIDWORLD EXAMPLE.

grid states trans. LP synth. ?LP QCP synth. ?QCP

8× 8 2, 304 36, 864 14.12 0.15 31.49 0.03

10× 10 3, 600 57, 600 23.80 0.24 44.61 0.04

12× 12 14, 400 230, 400 250.78 0.33 452.27 0.05

To finally assess the scalability of our approach, consider

Table I. We generated MDPs for different gridworlds with

different number of states and number of obstacles. We list

the number of states in the MDP (labeled as “states”) and

the number of transitions (labeled as “trans”). We report on

the time that the synthesis process took with the LP approach

and QCP approach (labeled as “LP synth.” and “QCP synth”),

which includes the time of solving the LP or QCPs measured

in seconds. It also includes the model checking times using

PRISM for the LP approach. To represent the optimality of the

synthesis, we list the maximal deviation between the repaired

strategy and the human strategy for the LP and QCP approach

(labeled as “?LP” and “?QCP”). In all of the examples, we

observe that the strategies with the QCP approach yields

autonomy strategies with less deviation to the human strategy

while having similar computation time with the LP approach.

D. UAV mission planning

Similar to the gridworld scenario, we generate an MDP, in

which that of the states MDP denotes the position of the agents

and the obstacles in a AMASE scenario. Consider an AMASE

scenario in Fig. 6. In this scenario, the specification or the

mission of the agent (blue UAV) is to keep surveilling the

green regions (labeled as w1, w2, w3) while avoiding restricted

operating zones (labeled as “ROZ1, ROZ2”) and enemy agents

(purple and green UAVs). As we consider reachability problems,

we asked the participants to visit the regions in a sequence,

i.e., visiting the first region, then second, and then the third

region. After visiting the third region, the task is to visit the

first region again to perform the surveillance.

For example, if the last visited region is w3, then the

safety specification in this scenario is P??((¬crash ?

¬ROZ) U target), where ROZ is to visit the ROZ areas

and target is visiting w1.

We synthesize the autonomy protocol on the AMASE

scenario with two enemy agents that induces an MDP with

15625 states. We use the same blending function and same

(a) An AMASE simulator. (b) The GUI of AMASE.

Fig. 6. An example of an AMASE scenario.

threshold ? = 0.7 as in the gridworld example. The features

to compute the human strategy with MEIRL are given by

the distance to the closest ROZ, enemy agents and the target

region.

The human strategy ?h induces a probability of 0.163 to

satisfy the specification, and it violates the specification like

in the gridworld example. Similar to the gridworld example,

we compute the repaired strategy ?ha with the LP and the

QCP approach, and both strategies satisfy the specification. On

the one hand, the maximum deviation between ?h and ?ha is

0.42 with the LP approach, which means the strategies of the

human and the autonomy protocol are significantly different in

some states of the MDP. On the other hand, the QCP approach

yields a repaired strategy ?ha that is more similar to the human

strategy ?h with the maximum deviation being 0.06. The time

of the synthesis procedure with the LP approach is 481.31

seconds and the computation time with the QCP approach

is 749.18 seconds, showing the trade-offs between the LP

approach and the QCP approach. We see that, the LP approach

can compute a feasible solution slightly faster, however the

resulting blended strategy may be less similar to the human

strategy compared to the QCP approach.

VI. CONCLUSION AND CRITIQUE

We introduced a formal approach to synthesize an autonomy

protocol in a shared control setting subject to probabilistic

temporal logic specifications. The proposed approach utilizes

inverse reinforcement learning to compute an abstraction of

a human’s behavior as a randomized strategy in a Markov

decision process. We designed an autonomy protocol such

that the robot behavior satisfies safety and performance

specifications given in probabilistic temporal logic. We also

ensured that the resulting robot behavior is as similar to the

behavior induced by the human’s commands as possible. We

synthesized the robot behavior using quasiconvex programming.

We showed the practical usability of our approach through

case studies involving autonomous wheelchair navigation and

unmanned aerial vehicle planning.

There is a number of limitations and also possible extensions

of the proposed approach. First of all, we computed an

globally optimal strategy by bisection, which requires checking

feasibility of a number of linear programming problems. A

convex formulation of the shared control synthesis problem

would make computing the globally optimal strategy more

efficient.

We assumed that the human’s commands are consistent

through the whole execution, i. e., the human issues each

command to satisfy the specification. Also, this assumption

implies the human does not consider assistance from the robot

while providing commands – and in particular, the human does

not adapt the strategy to the assistance. It may be possible

to extend the approach to handle non-consistent commands

by utilizing additional side information, such as the task

specifications.

Finally, in order to generalize the proposed approach to

other task domains, it is worth to explore transfer learning 21

techniques. Such techniques will allow us to handle different

scenarios without requiring to relearn the human strategy from

the human’s commands.

REFERENCES

1 Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse

reinforcement learning. In ICML, page 1. ACM, 2004.

2 Christel Baier and Joost-Pieter Katoen. Principles of Model Checking.

The MIT Press, 2008.

3 Mihir Bellare and Phillip Rogaway. The complexity of approximating

a nonlinear program. In Complexity in numerical optimization, pages

16–32. World Scientific, 1993.

4 Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cam-

bridge University Press, New York, NY, USA, 2004.

5 Taolue Chen, Yuan Feng, David S. Rosenblum, and Guoxin Su.

Perturbation analysis in verification of discrete-time Markov chains.

In CONCUR, volume 8704 of LNCS, pages 218–233. Springer, 2014.

6 Anca D. Dragan and Siddhartha S. Srinivasa. Formalizing assistive

teleoperation. In Robotics: Science and Systems, 2012.

7 Anca D. Dragan and Siddhartha S. Srinivasa. A policy-blending

formalism for shared control. I. J. Robotic Res., 32(7):790–805, 2013.

8 Andrew Fagg, Michael Rosenstein, Robert Platt, and Roderic Grupen.

Extracting user intent in mixed initiative teleoperator control. In

Intelligent Systems Technical Conference, page 6309, 2004.

9 Lu Feng, Clemens Wiltsche, Laura Humphrey, and Ufuk Topcu. Synthesis

of human-in-the-loop control protocols for autonomous systems. IEEE

Transactions on Automation Science and Engineering, 13(2):450–462,

2016.

10 Vojte?ch Forejt, Marta Kwiatkowska, Gethin Norman, David Parker, and

Hongyang Qu. Quantitative multi-objective verification for probabilistic

systems. In TACAS, pages 112–127. Springer, 2011.

11 Roland Fryer and Matthew O Jackson. A categorical model of cognition

and biased decision making. The BE Journal of Theoretical Economics,

8(1).

12 Jie Fu and Ufuk Topcu. Synthesis of shared autonomy policies with

temporal logic specifications. IEEE Transactions on Automation Science

and Engineering, 13(1):7–17, 2016.

13 F. Gala?n, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips,

and J. del R. Milla?n. A brain-actuated wheelchair: Asynchronous and

non-invasive brain-computer interfaces for continuous control of robots.

Clinical Neurophysiology, 119(9):2159–2169, 2016/05/28.

14 Gurobi Optimization, Inc. Gurobi optimizer reference manual.

url=http://www.gurobi.com, 2013.

15 Nils Jansen, Murat Cubuktepe, and Ufuk Topcu. Synthesis of shared

control protocols with provable safety and performance guarantees. In

ACC, pages 1866–1873. IEEE, 2017.

16 Shervin Javdani, J Andrew Bagnell, and Siddhartha Srinivasa. Shared

autonomy via hindsight optimization. In Robotics: Science and Systems,

2015.

17 Dae-Jin Kim, Rebekah Hazlett-Knudsen, Heather Culver-Godfrey, Greta

Rucks, Tara Cunningham, David Portee, John Bricout, Zhao Wang, and

Aman Behal. How autonomy impacts performance and satisfaction:

Results from a study with spinal cord injured subjects using an assistive

robot. IEEE Transactions on Systems, Man, and Cybernetics-Part A:

Systems and Humans, 42(1):2–14, 2012.

18 Jonathan Kofman, Xianghai Wu, Timothy J Luu, and Siddharth Verma.

Teleoperation of a robot manipulator using a vision-based human-robot

interface. IEEE transactions on industrial electronics, 52(5):1206–1219,

2005.

19 Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0:

Verification of probabilistic real-time systems. In CAV, volume 6806 of

LNCS, pages 585–591. Springer, 2011.

20 Adam Leeper, Kaijen Hsiao, Matei Ciocarlie, Leila Takayama, and David

Gossow. Strategies for human-in-the-loop robotic grasping. In HRI, pages

1–8. IEEE, 2012.

21 Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE

Transactions on knowledge and data engineering, 22(10):1345–1359,

2010.

22 Shashank Pathak, Erika A?braha?m, Nils Jansen, Armando Tacchella,

and Joost-Pieter Katoen. A greedy approach for the efficient repair of

stochastic models. In NFM, volume 9058 of LNCS, pages 295–309.

Springer, 2015.

23 Martin L. Puterman. Markov Decision Processes: Discrete Stochastic

Dynamic Programming. John Wiley and Sons, 1994.

24 Martin L Puterman. Markov decision processes: discrete stochastic

dynamic programming. John Wiley & Sons, 2014.

25 Constantin A Rothkopf and Dana H Ballard. Modular inverse rein-

forcement learning for visuomotor behavior. Biological cybernetics,

107(4):477–490, 2013.

26 Jian Shen, Javier Ibanez-Guzman, Teck Chew Ng, and Boon Seng

Chew. A collaborative-shared control system with safe obstacle avoidance

capability. In Robotics, Automation and Mechatronics, volume 1, pages

119–123. IEEE, 2004.

27 Brian D Ziebart. Modeling purposeful adaptive behavior with the principle

of maximum causal entropy. 2010.

28 Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey.

Maximum entropy inverse reinforcement learning. 2008.