pip will not install the command line tool as far as I know. https://vowpalwabbit.org/start.html has info about how to get the C++/command line tool by building from source (or brew on MacOS). Please feel free to reach out to me if you have any more questions!
Hi all, I'm a newbie to contextual bandits and learning to use VW.
Could anyone help me understand whether I'm using it correctly?
Problem: I have a few hundred thousand historical data points and I want to use them to learn a warm-start model. I saw some tutorials in the wiki showing how to do this with the CLI, but I wonder if I can use the Python version in this way, assuming the data has already been formatted:
vw = pyvw.vw("--cb 20 -q UA --cb_type ips")
for example in historical_data:
    vw.learn(example)
My questions are:
1) Is this the correct way to warm start the model?
2) If so, what probability should I use for each training instance? If it is deterministic, I guess it would be 1.0?
3) For exploitation/exploration after having this initial model, can I save the policy and then apply --cb_explore 20 -q UA --cb_type ips --epsilon 0.2 -i cb.model to continue the learning?
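Concretely, something like this is what I have in mind (the toy data and the model filename are just placeholders):

import vowpalwabbit.pyvw as pyvw

# toy stand-ins for the formatted historical data ("action:cost:probability | features")
historical_data = [
    "1:-1:1.0 |U user_a |A article_1",
    "2:0:1.0 |U user_b |A article_2",
]

# warm start an offline CB policy on the logged data
warm = pyvw.vw("--cb 20 -q UA --cb_type ips")
for example in historical_data:
    warm.learn(example)
warm.save("cb.model")  # placeholder filename
warm.finish()

# resume with exploration on top of the warm-started weights
online = pyvw.vw("--cb_explore 20 -q UA --cb_type ips --epsilon 0.2 -i cb.model")
print(online.predict("|U user_a |A article_1"))  # probability distribution over the 20 actions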
Thanks for the help in advance!
Hi Guys, I am working on a project similar to a News Recommendation Engine, which predicts the most relevant articles given a user feature vector. I wanted to use VW's contextual bandits for this.
I have tried using VW, but it seems that VW only outputs a single action per trial. Instead, I want some sort of ranking mechanism so that I can get the top k articles per trial.
Is there any way to use VW for such a use case?
I have asked this question in stackoverflow as well. (https://stackoverflow.com/questions/63635815/how-to-learn-to-rank-using-vowpal-wabbits-contextual-bandit )
Thanks in Advance.
Hi! Thanks to VW authors for the CCB support, finding it very useful!
Quick question: how is offline policy evaluation handled for CCBs in VW? IPS, DM, something else? Was wondering if there is a paper I can read about this. Was looking into https://arxiv.org/abs/1605.04812 but wasn't sure this estimator is the one VW uses specifically for CCBs.
@pmineiro excellent, thanks for the response.
A second question: let's say I have collected bandit data from several policies deployed to production one after the other, i.e. thought of as a whole, it is nonstationary.
Can I use all of the logged data to train a new policy, even though the logged data is generated by X different policies? If so, are ips/dm/dr all acceptable choices or do they break against nonstationary logged data?
How about offline evaluation of a policy? This paper https://arxiv.org/pdf/1210.4862.pdf suggests that IPS can't be used; is explore_eval the right option?
What I'm looking for is the "correct" way for a data scientist to offline test & learn new policies, possibly with different exploration strategies, using as much data as possible from N previous deployments with N different policies. The same question also applies to automatic retraining of policies on new data as part of a production system; I'm unsure of the "proper" way to do it.
Nice, thanks! I've used the Personalizer service, just curious as to how it works under the hood. So with IPS & DM it's OK to train a model on logged dataset A -> deploy the model -> collect logged data B -> train on A+B -> repeat with an ever-growing dataset?
What is the purpose of explore_eval then?
Good day, Vowpal Community, @all
We wanted to switch our contextual bandit models from the epsilon-greedy approach to the online cover approach. However, when we ran this simple snippet of code (see below) to check how online cover would perform for us, the result was not as expected.
import vowpalwabbit.pyvw as pyvw

data_train = ["1:0:0.5 |features a b", "2:-1:0.5 |features a c", "2:0:0.5 |features b c",
              "1:-2:0.5 |features b d", "2:0:0.5 |features a d", "1:0:0.5 |features a c d",
              "1:-1:0.5 |features a c", "2:-1:0.5 |features a c"]
data_test = ["|features a b", "|features a b"]

model1 = pyvw.vw(cb_explore=2, cover=10, save_resume=True)
for data in data_train:
    model1.learn(data)
model1.save("saved_model.model")

model2 = pyvw.vw(i="saved_model.model")

for data in data_test:
    print(data)
    print(model1.predict(data))
    print(model2.predict(data))
for data in data_test:
    print(data)
    print(model1.predict(data))
    print(model2.predict(data))
Output for this snippet was like this:
|features a b
[0.75, 0.25]
[0.5, 0.5]
|features a b
[0.7642977237701416, 0.2357022762298584]
[0.5, 0.5]
|features a b
[0.7763931751251221, 0.22360679507255554]
[0.5, 0.5]
|features a b
[0.7867993116378784, 0.21320071816444397]
[0.5917516946792603, 0.40824827551841736]
For some reason, the newly initialized model2 does not seem to produce predictions influenced by the loaded weights (it starts with a uniform distribution over the two actions). Moreover, even though no learning happened for model1 or model2 on the test dataset, the predicted probabilities changed over time for both models. Is this expected behavior for the online cover approach? If so, could you please point me to any documentation/article that explains why this happens?
explore_eval is for estimating the online performance of a learning algorithm as it learns, but using an off-policy dataset. It's different from evaluating or learning a policy over an off-policy dataset, because you have to account for the change in information revealed to the algorithm as a result of making different decisions. As such, it is far less data efficient, but sometimes necessary. One use case is to evaluate exploration strategies offline, hence the name.
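As a rough sketch of what that looks like in Python (the logged data here is made up, and I'm assuming the usual pyvw accessors for the accumulated loss):

import vowpalwabbit.pyvw as pyvw

# hypothetical logged CB data in ADF format: one list per decision, with the
# chosen action's line carrying "0:cost:probability" from the logging policy
logged_data = [
    ["shared |User u1", "0:0.0:0.5 |Action a", "|Action b"],
    ["shared |User u2", "|Action a", "0:-1.0:0.5 |Action b"],
]

# simulate running epsilon-greedy at 10% on the logged data
evaluator = pyvw.vw("--explore_eval --cb_explore_adf --epsilon 0.1")
for example in logged_data:
    evaluator.learn(example)

# the accumulated loss estimates the online performance of that exploration setup
print("sum loss:", evaluator.get_sum_loss(), "weighted examples:", evaluator.get_weighted_examples())
evaluator.finish()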
@pmineiro thanks. So just to be clear, let's say I have logged bandit data and want to know whether an epsilon-greedy algorithm at 10% or 20% would be better. Do I:
As far as I can tell I should be using explore_eval, which is why I'm wondering what the use case for the second option is, i.e. comparing different exploration algorithms by simply comparing the losses of the respective --cb_explore experiments? Is there any situation where this is a valid approach?
@maxpagels_twitter: you definitely do not ever run --cb_explore (or --cb_explore_adf) on an offline CB dataset without --explore_eval. You only run --cb_explore either 1) online, i.e., acting in the real world, 2) offline with a supervised dataset and --cbify (to simulate #1), or 3) offline with --explore_eval and an offline CB dataset (to simulate #1). Nothing else is coherent.
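For example, option 2 might look roughly like this (toy multiclass data, just to illustrate the --cbify setup):

import vowpalwabbit.pyvw as pyvw

# toy supervised multiclass data: "label | features", 3 classes
supervised_data = ["1 | a b", "2 | b c", "3 | a c", "1 | a b"]

# --cbify turns the supervised problem into a simulated online CB problem,
# here explored with epsilon-greedy at 10%
sim = pyvw.vw("--cbify 3 --epsilon 0.1")
for example in supervised_data:
    sim.learn(example)
sim.finish()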
@pmineiro In this case, should we ignore reward signals after a window has expired, or should we still process them, trusting that the central limit theorem will help us achieve accuracy over time as we observe more events?
I'm not trying to be salty, but there's no CLT issue here. When you update VW, you are saying "for this context I observed this reward". If you do it again, you are saying "I happened to observe the exact same context again, but this time I got this other reward". So the best estimate after that is the average of the first and second reward, which is probably not what you want. With respect to the time limit, if you define reward as, e.g., "1 if a click within 30 minutes of presentation else 0", then what happens after 30 minutes is irrelevant.
@wmelton yeah, just to be clear:
If you have a bernoulli bandit, what some people do is that when an arm is pulled, they record +1 trials and update the posterior, and only when they get a reward for that pull do they update +1 successes. In a context-free setting this is sort of OK and will be kind of eventually consistent. I've done this before, primarily because it saves me from keeping track of pulls that get zero rewards and assigning those explicitly. It isn't "correct", however. In bandit settings you should learn when the reward is available, not do such a half-step. But it's a practical compromise.
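Roughly, that context-free half-step looks like this (a toy two-arm Thompson sampling sketch; the names are just illustrative):

import random

trials = [0, 0]      # pulls recorded per arm
successes = [0, 0]   # rewards recorded per arm

def pull():
    # Thompson sampling from Beta(successes + 1, failures + 1) posteriors
    samples = [random.betavariate(successes[a] + 1, trials[a] - successes[a] + 1)
               for a in range(2)]
    arm = max(range(2), key=lambda a: samples[a])
    trials[arm] += 1          # "+1 trials" at pull time
    return arm

def on_reward(arm):
    successes[arm] += 1       # "+1 successes" only when the reward shows up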
In contextual bandits, and in VW, doing this will fail because of the issue @pmineiro mentioned. The way to overcome this is to keep track of all predictions and their context in some DB or memory store and learn only when a reward arrives for a particular prediction/context, or a suitable amount of time has passed such that you can assume zero reward and learn on that.
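A minimal sketch of that join pattern with pyvw, assuming an in-memory dict as the store and a 30-minute reward window (all names here are made up):

import random
import time
import vowpalwabbit.pyvw as pyvw

vw = pyvw.vw("--cb_explore 3 --epsilon 0.1")
pending = {}                 # event_id -> (features, action, prob, decision_time)
REWARD_WINDOW = 30 * 60      # seconds; after this we assume zero reward

def decide(event_id, features):
    pmf = vw.predict(f"| {features}")                      # probability per action
    idx = random.choices(range(len(pmf)), weights=pmf)[0]  # sample an action
    pending[event_id] = (features, idx + 1, pmf[idx], time.time())
    return idx + 1                                         # VW CB actions are 1-based

def on_reward(event_id, reward):
    features, action, prob, _ = pending.pop(event_id)
    vw.learn(f"{action}:{-reward}:{prob} | {features}")    # cost = -reward

def flush_expired(now=None):
    now = time.time() if now is None else now
    expired = [k for k, v in pending.items() if now - v[3] > REWARD_WINDOW]
    for event_id in expired:
        features, action, prob, _ = pending.pop(event_id)
        vw.learn(f"{action}:0:{prob} | {features}")        # assume zero reward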
If anyone has any comments on this message I posted I'd be very grateful:
@pmineiro thanks for your patience answering all my questions. I did a quick sanity check: I'd expect explore_eval with 100% exploration, against a "world" that never changes and where exactly half of the actions are positive (-1 cost) and half negative (+1), to get an estimated average loss of 0, but that's not the case. I'm not sure if this is due to some systemic bias, because in this particular case --cb_explore_adf reports the loss I'd expect. I made an issue but I'm not sure if it's a bug or intended behaviour: VowpalWabbit/vowpal_wabbit#2621
@maxpagels_twitter I think Paul was referring to your question here
In contextual bandits, and in VW, doing this will fail because of the issue @pmineiro mentioned. The way to overcome this is to keep track of all predictions and their context in some DB or memory store and learn only when a reward arrives for a particular prediction/context, or a suitable amount of time has passed such that you can assume zero reward and learn on that.
This join operation is done for you by Azure Personalizer (https://azure.microsoft.com/en-us/services/cognitive-services/personalizer/). We've done presentations and workshops at AI NextConn conferences where we show the detailed dataflow diagram, maybe you can find one of those ... or you could just use APS.
More questions: why, in cb_explore_adf with epsilon set to 0.0, do I see probability distributions with values other than 0.0 or 1.0? This only happens at the start of a dataset:
maxpagels@MacBook-Pro:~$ vw --cb_explore_adf test --epsilon 0.0
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = test
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.666667 0.666667 1 1.0 known 0:0.333333... 6
0.833333 1.000000 2 2.0 known 1:0.5... 6
0.416667 0.000000 4 4.0 known 2:1... 6
0.208333 0.000000 8 8.0 known 2:1... 6
0.104167 0.000000 16 16.0 known 2:1... 6
0.052083 0.000000 32 32.0 known 2:1... 6
0.026042 0.000000 64 64.0 known 2:1... 6
0.013021 0.000000 128 128.0 known 2:1... 6
0.006510 0.000000 256 256.0 known 2:1... 6
finished run
number of examples = 486
weighted example sum = 486.000000
weighted label sum = 0.000000
average loss = 0.003429
total feature number = 4374
maxpagels@MacBook-Pro:~$
All examples have the same number of arms (3), and on different datasets I see the same thing at the start of the dataset. One large dataset I have takes some 20,000 examples before giving correct probabilities.
--first works as expected, but not --epsilon, which at 0.0 exploration should be greedy, i.e. the probability vector should have one value of 1.0 and the rest 0.0.