Statistical Machine Learning Project


Multi-armed bandits (MABs) are a simple yet powerful framework for sequential decision-making under
uncertainty. One of the many applications of MABs was pioneered by Yahoo! News, in deciding which news
items to recommend to users based on article content, user profiles, and users' historical engagement with
articles. Since decision making in this setting is sequential (what do we show next?) and feedback is only
available for articles actually shown, Yahoo! researchers recognised a natural formulation for MABs like those
(epsilon-Greedy and UCB) discussed in Lecture 17. These researchers realised that incorporating some element
of user-article state requires contextual bandits: articles are arms; the context in each round incorporates
information about both the user and the article (arm); and {0, 1}-valued rewards represent clicks. The
cumulative per-round reward therefore represents the click-through rate, which is exactly what online services
want to maximise to drive user engagement and advertising revenue. In this project, you will work individually
to implement several MAB algorithms.

These are a little more advanced than what was covered in class, coming from real research papers that you will
have to read and understand yourself. By the end of the project you should have developed:

ILO1. A deeper understanding of the MAB setting and common MAB approaches;

ILO2. An appreciation of how MABs are applied;

ILO3. Demonstrable ability to implement ML approaches in code; and

ILO4. Ability to read recent ML papers, and to understand their focus, contributions, and algorithms well
enough to implement and apply them (while ignoring details not needed for your task).

You have three tasks.

1. Implement LinUCB contextual MAB (Li et al., 2010)
2. Implement MLinUCB (Bouneffouf et al., 2020)
3. Implement SquareCB (Foster & Rakhlin, 2020)

In the first task, you will implement a MAB learner as a Python class. You are to read up to and including
Section 3.1 of this WWW'2010 paper to understand, and then implement, the LinUCB learner with disjoint (not
shared) linear models. LinUCB is described in Algorithm 1 of that paper. This is a contextual bandit, likely the
first you've seen, but its workings are a direct mashup of UCB and ridge regression, both of which we cover in
lectures.

Lihong Li, Wei Chu, John Langford, Robert E. Schapire, 'A Contextual-Bandit Approach to
Personalized News Article Recommendation', in Proceedings of the Nineteenth International Conference
on World Wide Web (WWW'2010), Raleigh, NC, USA, 2010.
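To make the "UCB plus ridge regression" mashup concrete, here is a minimal sketch of the per-arm score in the spirit of Algorithm 1. It assumes each arm a maintains a matrix A_a (identity plus accumulated outer products of contexts) and a vector b_a (accumulated reward-weighted contexts), as in the paper; the function name and toy values below are illustrative only, not part of the skeleton code.

```python
import numpy as np

def linucb_score(A_a, b_a, x, alpha):
    """Ridge-regression estimate plus a UCB-style exploration bonus for one arm.

    A_a: d x d matrix (identity plus accumulated x x^T for this arm)
    b_a: length-d vector (accumulated reward * x for this arm)
    x:   length-d context vector for the current round
    """
    A_inv = np.linalg.inv(A_a)
    theta = A_inv @ b_a                                 # ridge-regression coefficients
    return float(x @ theta + alpha * np.sqrt(x @ A_inv @ x))

# Toy check: with A_a = I and b_a = 0, the score reduces to alpha * ||x||.
x = np.array([3.0, 4.0])
print(linucb_score(np.eye(2), np.zeros(2), x, alpha=1.0))  # -> 5.0
```

In the full algorithm this score is computed for every arm each round, and the arm with the highest score is played.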

Your implementation of WWW'2010 Algorithm 1 should be done within the provided skeleton code for
the LinUCB class. You will need to implement the __init__, play, and update methods. The parameters and
return values for each method are specified in docstrings in proj2.ipynb. Note that tie-breaking in play should
be done uniformly at random among value-maximising arms.
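One common pitfall is using np.argmax, which always returns the first maximiser. A minimal sketch of uniform tie-breaking, assuming a numpy Generator like the rng passed to the class (the helper name is illustrative):

```python
import numpy as np

def argmax_random_tiebreak(values, rng):
    """Return the index of a maximal value, chosen uniformly at random among ties."""
    values = np.asarray(values, dtype=float)
    best = np.flatnonzero(values == values.max())   # indices of all maximisers
    return int(rng.choice(best))

rng = np.random.default_rng(0)
# Over repeated calls, both maximisers (indices 1 and 2) should be selected.
picks = {argmax_random_tiebreak([1.0, 3.0, 3.0, 2.0], rng) for _ in range(100)}
print(picks)
```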

While understanding LinUCB well enough to implement it correctly may seem daunting, the WWW
paper is written for a non-ML audience and is complete in its description; the pseudo-code is detailed. There
is one unfortunate typo, however: pg. 3, column 2, line 3 of the arXiv version linked above should read
c_a rather than b_a. The pseudo-code uses the latter (correctly) as shorthand for the former times the contexts.
Another hint: for students who haven't seen the term "design matrix" before, it simply means a feature matrix.
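For instance, stacking each observed context as a row of a design matrix X with responses y, the ridge estimate that LinUCB maintains incrementally can be written in closed form. A small illustration (the matrices here are toy values, not project data):

```python
import numpy as np

# Design matrix: one row per observation, one column per feature.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # 3 observations, 2 features
y = np.array([1.0, 2.0, 3.0])      # responses (rewards)

# Ridge-regression solution theta = (X^T X + I)^{-1} X^T y; note that
# X^T X + I and X^T y are exactly the A and b quantities LinUCB accumulates.
theta = np.linalg.solve(X.T @ X + np.eye(2), X.T @ y)
print(theta)  # -> [0.875 1.375]
```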

Experiments. Once your class is implemented, it is time to perform some basic experimentation.

(a) Include and run an evaluation on the given dataset, where column 1 forms the arms, column 2 the rewards
(and the missing-indicator column 3 is ignored), and columns 4-103 form the contexts:

mab = LinUCB(10, 10, 1.0, rng)
LinUCB_rewards, _ = offline_eval(mab, arms, rewards, contexts, 800)
print("LinUCB average reward ", np.mean(LinUCB_rewards))
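To build intuition for what offline_eval does, here is a hedged sketch of a replay-style offline evaluator in the spirit of Li et al.: a logged round only counts when the learner would have played the same arm the logging policy played. The play/update signatures and the dummy learner below are assumptions for illustration; the authoritative interface is the docstrings in proj2.ipynb.

```python
import numpy as np

def offline_eval_sketch(mab, arms, rewards, contexts, n_rounds):
    """Replay logged rounds; keep only those where the learner agrees with the log."""
    history_rewards = []
    for arm, reward, context in zip(arms, rewards, contexts):
        chosen = mab.play(context)
        if chosen == arm:                       # matched round: reveal reward, learn
            mab.update(arm, reward, context)
            history_rewards.append(reward)
            if len(history_rewards) == n_rounds:
                break
    return np.array(history_rewards)

# Toy check with a trivial learner that always plays arm 0.
class AlwaysArm0:
    def play(self, context):
        return 0
    def update(self, arm, reward, context):
        pass

out = offline_eval_sketch(AlwaysArm0(), [0, 1, 0, 0], [1.0, 0.0, 0.0, 1.0], [None] * 4, 10)
print(out)  # only the three rounds where the logged arm was 0 are kept
```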