Robin Hanson says
“In few months, China is likely to be a basket case, having crashed their
economy in failed attempt to stop COVID-19 spreading.” Quantifying the
forecast, he says China’s economy (or growth?) will be “a factor of two to ten
down” and seems to expect dramatic results in 6 months.
Q:
Say we’re training an autonomous car by running a bunch of practice trips and
letting the model learn from experience. For example, to teach safe driving we
might input a reward if it makes a trip without running anyone over and input a
penalty otherwise. What’s the flaw in this approach, and how serious is this
issue in AI systems present and future?
Two big flaws, if we use traditional model-free reinforcement learning
algorithms (Deep Q learning, policy gradient):
The RL agent won’t learn to avoid running over the human until it
actually runs over the human and receives the penalty a large number of
times.
The RL agent will suffer “The Sisyphean Curse of RL”. Once it learns
to avoid running over humans, it will keep having new experiences where it
doesn’t run over humans. Eventually, it will forget that running over
humans is bad, and will occasionally need to run over humans a few times
and get penalized in order to remember. This will repeat as long as the
agent is being trained.
So, the training process can lead to an arbitrary number of humans being
run over. (In practice, of course, you’d stop after the first one, if not
sooner.)
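To make the first flaw concrete, here is a toy sketch (the setup is illustrative, not from any real system): a bandit-style Q-learner with one safe and one catastrophic action. It can only learn the penalty by experiencing it, and \(\epsilon\)-greedy exploration keeps re-triggering the catastrophe for as long as training runs.

```python
import random

# Toy illustration (hypothetical setup): one state, two actions.
# Action 0 = complete the trip safely (reward +1).
# Action 1 = run someone over (reward -100).
Q = [0.0, 0.0]            # value estimate for each action
alpha, epsilon = 0.1, 0.1
catastrophes = 0

for step in range(10_000):
    # epsilon-greedy: mostly exploit, sometimes explore at random
    if random.random() < epsilon:
        a = random.randrange(2)
    else:
        a = 0 if Q[0] >= Q[1] else 1
    r = 1.0 if a == 0 else -100.0
    catastrophes += (a == 1)
    Q[a] += alpha * (r - Q[a])  # bandit-style update (no next state)

# The penalty is only learned by experiencing the catastrophe, and
# exploration keeps causing more of them for as long as training continues.
print(f"catastrophes during training: {catastrophes}")
```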
Q:
Your proposal, called Human Intervention Reinforcement Learning (HIRL),
involves using humans to prevent unwitting AIs from taking dangerous
actions.
How does it work?
A human watches the training process. Whenever the RL agent is about to
do something catastrophic, the human intervenes, changing the RL agent’s
action to avoid the catastrophe and giving the RL agent a penalty.
We record every instance where the human intervenes, and train a
supervised learning algorithm (“the blocker”) to predict when the human
would intervene.
Once the blocker can reliably predict when the human would intervene, we
replace the human with the blocker and continue training. Now the blocker
is consulted for every new action the agent takes, and decides whether to
intervene and penalize the agent.
Eventually, the RL agent should learn a policy that performs well on the
task and avoids proposing the blocked actions, which should then be safe
for deployment.
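A minimal sketch of this loop, assuming a generic RL setup (all names here, oversee, blocker, safe_action, and so on, are hypothetical placeholders, not the authors’ implementation):

```python
# Hypothetical sketch of the HIRL scheme described above.
intervention_data = []   # (state, action, blocked) examples for the blocker

def oversee(state, action, human_available, blocker, human_says_block,
            safe_action, penalty=-100.0):
    """Return the possibly-overridden action and any penalty applied."""
    if human_available:
        blocked = human_says_block(state, action)   # human watches training
        intervention_data.append((state, action, blocked))
    else:
        blocked = blocker(state, action)            # imitates the human
    if blocked:
        return safe_action(state), penalty          # block and penalize
    return action, 0.0

# Phase 1: train the RL agent with human_available=True, logging data.
# Phase 2: fit a classifier (the blocker) on intervention_data.
# Phase 3: once it predicts the human's decisions well on held-out data,
#          continue training with human_available=False.
```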
Q:
What’s a practical example where HIRL might be useful?
One example might be a chatbot that occasionally proposes an offensive
reply in a conversation (e.g. Microsoft Tay). A human could review
statements proposed by the chatbot and block offensive ones from being sent
to end users.
Q:
Is there a use case for HIRL in simulated learning environments?
In simulated environments, one can simply allow the catastrophic action to
happen and intervene after the fact. But depending on the simulation, it
might be more efficient for learning if catastrophic actions are blocked
(if they would end the simulation early, or cause the simulation to run for
a long time in a failed state).
Q:
In what situations would human intervention be too slow or expensive?
Even for self-driving cars, it can be difficult for a safety driver to
detect when something is going wrong and intervene in time. Other robotics
tasks might be similar.
In many domains, it might not be possible to fully hand things over to the
blocker. If the agent doesn’t try some kinds of actions or encounter some
kinds of situations until later in the training process, you either need to
have the human watch the whole time, or be able to detect when new
situations occur and bring the human back in.
Q:
How does the applicability of HIRL change (if at all) if the human is part of
the environment?
HIRL could still apply if the intervening human is part of the
environment, as long as the human supervisor is able to block any
catastrophic action that harms or manipulates the human supervisor, or the
human supervisor’s communication channel.
Q:
Theoretically the idea here is to extract, with an accuracy/cost tradeoff, a
human’s beliefs and/or preferences so an AI can make use of them. At a high
level, how big a role do you think direct human intervention will play in this
process on the road to superintelligent AI?
Ideally, you would want techniques that don’t require the human to be
watching and able to intervene effectively: it would be better if the
blocker could be trained before the agent’s training begins, or if the AI
could detect when it was in a novel situation and only ask for feedback
then. I think HIRL does illustrate that in many situations it’s easier to
check whether an action is safe than to specify the optimal action to
perform, and this principle might end up being used in other techniques as
well.
Tall buildings by city: look out for Toronto.
The current top cities in North America for skyscrapers are unambiguously
New York, Chicago, and Toronto, in that order.
However, counting proposed buildings and buildings under construction,
Chicago has 18 buildings at least 150 m tall (the dataset is only complete
for buildings at least 150 m) and Toronto has 90.
SAA = simultaneous (multiple products at once) ascending auction (same as SMRA)
SMRA = simultaneous multiple round ascending (same as SAA)
CCA = Combinatorial clock auction (not the same as SAA/SMRA)
Schelling point = a way for independent parties to intentionally coordinate
on one choice among many
Value bidding = selecting a package to maximize value of package minus cost
Notation
Products = \(\{1, 2, 3, \ldots, n \}\)
Quantities = \(\{q_1, \ldots, q_n\}\)
Bidders = \(\{1, 2, 3, \ldots, m \}\)
Package: \(x = (x(1), \ldots, x(n))\), where \(x(i)\) is the quantity of product \(i\)
Valuation of bidder \(i\) of package \(x\): \(v_i(x)\)
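With this notation, value bidding at prices \(p_1, \ldots, p_n\) means bidder \(i\) selects
\[x^* \in \arg\max_{x} \Big( v_i(x) - \sum_{j=1}^{n} p_j \, x(j) \Big), \qquad 0 \le x(j) \le q_j.\]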
Nutshell
Generally SMRA auctions have a cooperative phase and then a competitive phase.
In the cooperative phase, bidders reduce demand (relative to value bidding) in
order to allocate products without bidding prices up.
Bidders must agree on this allocation without communicating.
Typically this implicit allocation is chosen because it’s fair, natural,
symmetric, or otherwise “makes sense” given the context of the auction.
Keys to the game
(More details below.)
Demand reduction negotiation
Bidders try to indirectly find an agreeable allocation
Selecting quantities: Schelling points based on available info
Auction-based, industry-based info
E.g. split product units 50/50 if two bidders are expected to be
interested
Negotiation by sending signals within the auction
Presumably cheap talk in this context, but it happens
Much noise, little signal in auctions where bids are constrained
or hidden
If there is an activity rule, once you’ve submitted low demand, there is
no way to increase it without decreasing somewhere else (see the sketch
after this list)
Competition
Basically value bidding
Usually happens if negotiation fails
Complementarity/exposure:
value bidding fails and “cooperation” is inefficient and less likely.
Bids for a quantity \(q\) can turn into bids for quantities \(< q\),
so be careful how high you bid if there are complementarities.
See literature review below for more discussion.
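To make the activity-rule point above concrete, here is a minimal sketch assuming the simplest possible rule, where activity is just total quantity demanded (real auctions typically weight products by eligibility points):

```python
# Simplified activity rule: a bid is legal only if total demanded quantity
# does not exceed the previous round's activity.
def activity(demand: dict) -> int:
    return sum(demand.values())

def is_legal(prev_demand: dict, new_demand: dict) -> bool:
    return activity(new_demand) <= activity(prev_demand)

# Shifting quantity between products is allowed...
assert is_legal({"A": 2, "B": 0}, {"A": 1, "B": 1})
# ...but once total demand is reduced, it can never be raised again.
assert not is_legal({"A": 1, "B": 0}, {"A": 1, "B": 1})
```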
Price raising
Only do this in lots where you’re not going to win anything
Start early (to maintain activity)
Demand reduction
Value bidding is no longer a dominant strategy, as it is in VCG/CCA.
Say there is a single product with quantity 2, and Bidder 1 bids on
quantity 1 at prices \(1, 2, \ldots, 10\).
Assume \(v_2(1)=9, v_2(2)=10\).
Bidder 2 (B2), strategy 1: bid on \(q=2\) for \(p=1,\ldots,5\), then bid on \(q=1\)
for \(p=6\).
B2 strategy 2: bid on \(q=1\) for \(p=1\).
B2 results:
CCA, strategy 1: wins \(q=1\) @ \(p=0\)
CCA, strategy 2: wins \(q=1\) @ \(p=0\)
SMRA, strategy 1: wins \(q=1\) @ \(p=6\)
SMRA, strategy 2: wins \(q=1\) @ \(p=1\)
Thus reducing demand (strategy 2) pays in the SMRA format where it didn’t in
the CCA (a small simulation at the end of this section checks these numbers).
When both bidders reduce demand, it’s called “cooperation” aka “tacit
collusion”.
See the literature review below for more examples.
However, with the activity rule, there can be a risk to reducing too much
at the beginning if there is uncertainty about the cooperative outcome, so
a somewhat gradual reduction may be wise.
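As a sanity check on the example above, here is a minimal clock-style simulation of the SMRA dynamics (a simplification, assuming integer price increments and that the auction ends as soon as excess demand disappears):

```python
# Single product with supply 2; Bidder 1 demands 1 unit up to price 10.
def b1(p):
    return 1 if p <= 10 else 0

def b2_strategy1(p):        # demand 2 until p=5, then reduce to 1
    return 2 if p <= 5 else 1

def b2_strategy2(p):        # reduce demand immediately
    return 1

def smra_clock(b2, supply=2):
    p = 1
    while b1(p) + b2(p) > supply:   # price rises while there is excess demand
        p += 1
    return p, b2(p)                 # closing price, B2's winning quantity

for name, strat in [("strategy 1", b2_strategy1), ("strategy 2", b2_strategy2)]:
    p, q = smra_clock(strat)
    print(f"{name}: B2 wins q={q} at p={p}")
# strategy 1: B2 wins q=1 at p=6  (surplus 9 - 6 = 3)
# strategy 2: B2 wins q=1 at p=1  (surplus 9 - 1 = 8)
```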
Price raising
Raising prices for other bidders is a realistic motive.
In the SMRA format it’s relatively simple because raising auction price is the
same as raising price paid.
You don’t have to work backwards from Vickrey price calculations to see what
action would cause an increase in price.
Instead, you simply have to create
excess demand on one or more products where there otherwise would not be.
But, it’s risky because your bids might end up being winning bids.
The ideal scenario is as follows:
Two rivals of yours neatly split supply 50-50, and price doesn’t increase.
Then you come in and place a bid for \(q=1\) (no point using higher \(q\) unless
you need the activity) for a few rounds and then get out before
they decrease their bids.
So this can work for disrupting demand reduction, but only for products you
don’t actually want to win (or you’d be raising your own price too).
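A toy version of this scenario, reusing the clock sketch from the demand reduction section (illustrative assumptions: supply 2, two rivals who each demand 1 unit throughout, and a price raiser who exits at a preset price):

```python
def rival(p):
    return 1                        # each rival demands 1 unit at any price

def make_raiser(exit_price):
    return lambda p: 1 if p < exit_price else 0

def clock(bidders, supply=2):
    p = 1
    while sum(b(p) for b in bidders) > supply:
        p += 1
    return p

print(clock([rival, rival]))                  # 1: rivals split at the floor
print(clock([rival, rival, make_raiser(4)]))  # 4: same split, higher price
# The risk: had a rival reduced demand before the raiser exited, the raiser
# could have been left holding a winning (and unwanted) bid.
```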
Demystifying strategies through experimentation
Try the following mini scenarios one or more times to better understand
the tactics.
People are assigned to bidders
Bidders’ valuations may be random (independently among bidders)
Other bidders know the possible valuations but not which one was selected
People gain points according to their valuations and lose points to pay
for won products
Possible bonus points for raising rivals’ prices
The goal is not to get more points than your opponent, but to get more points than others playing the same role in a different round of the same game
In an actual auction, the other bidders may not be rational
After gaining familiarity with the mini scenarios, full scale mock auctions may
also be helpful.
Scenario: Dealing with uncertainty
1 product, \(q_1=2\), 2 bidders
\[v_1(1) = 2, v_1(2) = 3\]
With probability \(1/2\):
\(v_2(1) = 1, v_2(2) = 2\)
With probability \(1/2\):
\(v_2(1) = 0, v_2(2) = 2\)
Scenario: Are they price raising?
2 products, \(q_1=q_2=2\), 2 bidders
\(v_1(x, 1) = 2x + 2\),
\(v_1(x, 2) = 2x + 3\)
With probability \(1/3\):
\(v_2(1, y) = 2\), \(v_2(2, y) = 3\)
Bonus points for bidder 2, awarded only if its score is positive: the
price paid by bidder 1 for product 2
With probability \(2/3\):
\[v_2(x, 1) = v_2(x, 2) = x + 2\]
Scenario: Cooperating without an obvious Schelling point
Literature review
Synopsis:
Increasing the ratio of bidders to products decreases cooperation.
Complementarities among products decrease cooperation.
The optimal strategy is to attempt to cooperate and to value bid if that fails.
The exposure problem (EP) is not part of the model.
Synopsis:
In the German 1999 auction, products were split 50-50 between two major
players at relatively low prices.
A simple game is defined.
Assume there are \(m\) bidders and \(n=mk\) products, each with quantity \(1\);
bidders have equal valuations with strictly decreasing marginal values.
The optimal strategy is to bid on \(k\) products each. If someone competes with
you for your \(k\), value bid.
Synopsis:
Analysis of exposure problem.
For example:
The big bidder has strongly complementary (convex, e.g. \(x^2\)) values. A number
of small bidders have strongly substitutable (concave, e.g. \(\sqrt{x}\)) values.
Due to the lack of package bids, the big bidder may decide not to bid at all.
However, in spectrum auctions I’m not sure if this is a big factor.
(Not an issue in CCA/VCG.)
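An illustrative numeric case (mine, not from the paper): with supply 3 and \(v_{\text{big}}(x) = x^2\), the big bidder’s full package is worth \(v(3) = 9\), i.e. 3 per unit, but a single unit is worth only \(v(1) = 1\). If it bids up to 3 per unit on all three units and ends up winning just one at a price above 1, it takes a loss; anticipating this exposure, it may prefer not to bid at all.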
Synopsis:
Lab experiments were conducted on spontaneous cooperation in auctions.
Results (p15): Players cooperate more if they get to play the game many
times.
As the number of bidders per product increases, cooperation decreases.
With complementary products, there was little cooperation.
Synopsis:
Analysis of the German auction in 2015, which featured cooperation, competition,
and signaling. The auction had high transparency and a wide range of possible
actions (e.g. submitting bids higher than the clock price). For example, TEF
bids on product A, which VOD was bidding on, to send the message that VOD
should reduce demand on product B, where TEF and VOD are negotiating demand
reduction.
Synopsis:
Analysis of the German auction in 2010, which was competitive due to a lack of
Schelling points (or too many of them).
Specifically, there were different ways to divide up the blocks that might have
made sense depending on factors such as future mergers or network sharing
agreements, and bidders worked toward conflicting outcomes.