##### Activity
Bernardo Favoreto
Hello everyone!
Does anyone know where (and if) I can find the notebook for the Estimators library? I would like to use this lib, but there isn't much documentation or many examples on how to do so.
Also, does it make sense to use this lib for CCBs?
Thanks!
Jack Gerrits
@jackgerrits
@Favoreto_B_twitter that repo is very much still a work in progress, so docs and examples are not there yet, unfortunately. For CCB, the approach taken so far is to do CFE on the first slot only, so in this context I think it does make sense to use it. But it would need some adapting, and I am not positive here.
Max Pagels
Is there an estimate of when the PyPI version of pyvw will have CATS support? 8.9.0 doesn't support CATS labels.
Max Pagels
I'm fiddling around with CATS and have a simple setup with a fixed context. Each round, I ask for an action (range 0-100) and calculate a cost that is zero at 50 and otherwise the square of the distance from 50 in either direction. I've tried grid-searching a whole mess of bandwidth, epsilon and learning rate values, but the learning is all over the place. I would have expected the system to converge to an optimal prediction of 50.0 per round pretty easily, since the context is always fixed. Instead, it either bounces around or gets stuck on non-optimal values around 40. Any tips?
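
The cost described above can be written down as follows. This only restates the setup from the message; the rescaled variant is an added assumption for illustration (keeping costs in a small range such as [0, 1] is a common choice, not something confirmed in this thread as the cause of the unstable learning).

```python
def cats_cost(action: float, target: float = 50.0) -> float:
    """Zero at the target, otherwise the squared distance from it."""
    return (action - target) ** 2


def normalized_cats_cost(action: float, target: float = 50.0,
                         min_value: float = 0.0, max_value: float = 100.0) -> float:
    """Same cost divided by its worst case on [min_value, max_value], so it lies in [0, 1].

    Assumption: the rescaling is illustrative, not part of the original setup.
    """
    worst = max(target - min_value, max_value - target) ** 2
    return cats_cost(action, target) / worst


# 40 and 60 are equally bad; 0 and 100 are the worst possible actions.
for a in (50.0, 40.0, 60.0, 0.0, 100.0):
    print(a, cats_cost(a), normalized_cats_cost(a))
```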
olgavrou
@olgavrou
Hi @maxpagels_twitter, what parameter do you pass to --cats? Have you experimented with it at all? For CATS I would try different combinations of the number of discrete actions used by the algorithm (passed in to the --cats arg) and bandwidths (bandwidth being a property of the continuous range), e.g. a grid of num_actions [8, 16, 32, 64, 128, 256, 1024] and bandwidths [1, 2, 4, 6, 8, 10, 14, 20]. With different numbers of discrete actions you might need more data for CATS to converge to something sensible. CATS label support in pyvw should be available in the next release (coming soon-ish; we don't want to wait another year for the next VW release). Let me know whether you get better results from CATS :)
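
The suggested grid could be enumerated as below. This is a sketch, assuming the `--bandwidth`, `--min_value` and `--max_value` companion flags of `--cats` and the 0-100 action range from the earlier message; each argument string would be handed to a fresh VW instance, and the training and evaluation loop is left out.

```python
import itertools
from typing import Iterator

# Grids suggested in the reply above.
NUM_ACTIONS = [8, 16, 32, 64, 128, 256, 1024]
BANDWIDTHS = [1, 2, 4, 6, 8, 10, 14, 20]


def cats_arg_grid(min_value: float = 0, max_value: float = 100) -> Iterator[str]:
    """Yield one VW argument string per (num_actions, bandwidth) pair."""
    for n, bw in itertools.product(NUM_ACTIONS, BANDWIDTHS):
        yield (f"--cats {n} --bandwidth {bw} "
               f"--min_value {min_value} --max_value {max_value} --quiet")


args = list(cats_arg_grid())
print(len(args), "configurations, e.g.", args[0])
```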
Max Pagels
I tried grid-searching a whole mess of options, including a bunch of action counts, and can get relatively close to an optimum, but the hyperparameters seem to need to be just right or the learning is way off. I'll experiment further and report back.
Bernardo Favoreto
Hello guys!
Can someone help me understand why propensity scores are important when training a CB?
Let's take epsilon-greedy, for example. When we train a CB model with epsilon-greedy, the pmf output always has the same values (only the indexes change). This makes me think propensity scores aren't meant to teach the CB how to output probabilities. Moreover, I believe they are used for "importance weighting", i.e., prob(new_policy)/prob(logging_policy), but isn't that only when we use IPS? I think I'm missing something quite obvious here...
Also, when we offline train a new model using logged CB data, how is the new CB able to achieve better performance than the logging policy? I mean, it's an excellent thing, but I would like to understand how that is possible.
Thanks!
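
A minimal sketch of the two ideas in this question, on made-up toy logs: the epsilon-greedy pmf has the same shape every round (only the index of the greedy action moves), and the logged propensity is what lets an IPS estimate score a *different* policy offline, including one that beats the logging policy, as long as exploration occasionally tried its actions. `epsilon_greedy_pmf` and `ips_estimate` are hypothetical helpers, not VW functions.

```python
from typing import Callable, List, Sequence, Tuple

# (context, logged_action, cost, propensity of the logged action)
LoggedEvent = Tuple[str, int, float, float]


def epsilon_greedy_pmf(num_actions: int, greedy_action: int, epsilon: float) -> List[float]:
    """Every action gets epsilon / K; the greedy action also gets the remaining 1 - epsilon."""
    pmf = [epsilon / num_actions] * num_actions
    pmf[greedy_action] += 1.0 - epsilon
    return pmf


def ips_estimate(logged: Sequence[LoggedEvent], policy: Callable[[str], int]) -> float:
    """Inverse propensity score estimate of a policy's average cost from logged data.

    Events where the evaluated policy agrees with the logged action count with
    weight 1 / propensity; all other events contribute zero.
    """
    total = 0.0
    for context, action, cost, prob in logged:
        if policy(context) == action:
            total += cost / prob
    return total / len(logged)


# Toy logs from an epsilon-greedy logger (epsilon = 0.2, 2 actions, greedy action 0):
# action 0 always costs 0.0, action 1 always costs -1.0 (lower cost is better in VW).
pmf = epsilon_greedy_pmf(num_actions=2, greedy_action=0, epsilon=0.2)
logged = [("ctx", 0, 0.0, pmf[0])] * 9 + [("ctx", 1, -1.0, pmf[1])]

print(pmf)
print("logging policy's greedy choice:", ips_estimate(logged, lambda ctx: 0))
print("alternative policy:", ips_estimate(logged, lambda ctx: 1))
```

Here the single explored event for action 1 is divided by its 0.1 propensity, which is what lets the logs rate "always play action 1" (estimate -1.0) above the logging policy's usual choice (estimate 0.0).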
George Fei
@georgefei
Hi all, I have a few questions related to contextual bandit evaluation:
1. How do I compare the performance of different policies' decisions using --eval? Do I look at the average loss in the output? If the costs in the input data are all negative and a lower cost is better, does a lower average loss mean one policy is better? What does average loss represent?
2. How do I interpret the output of --explore_eval? More specifically, the update count, violation count, and final multiplier: which variables do they correspond to in the algorithm on slide 9 of https://pdfs.semanticscholar.org/presentation/f2c3/d41ef70df24b68884a5c826f0a4b48f17095.pdf? Do I also look at the average loss to compare different exploration algorithm + hyperparameter combinations?

3. In order to use --explore_eval, I have to convert my data from cb format to cb_adf format, since the cb format is not supported with --explore_eval. For the example data with two arms below, are the two representations equivalent?

```
2:10.02:0.5 | x0:0.47 x1:0.84 x2:0.29
1:8.90:0.5 | x0:0.51 x1:0.65 x2:0.67
```

```
shared | x0:0.47 x1:0.84 x2:0.29
| a1
0:10.02:0.5 | a2

shared | x0:0.51 x1:0.65 x2:0.67
0:8.90:0.5 | a1
| a2
```
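
The conversion in the question can be automated with a small helper. A sketch, assuming the same layout as the example (context features on a `shared` line, one indicator feature `a1` to `aK` per action line, and the `0:cost:prob` label on the logged action's line); `cb_to_adf` is a hypothetical helper, not part of VW.

```python
from typing import List


def cb_to_adf(line: str, num_actions: int) -> List[str]:
    """Convert a `--cb` line "action:cost:prob | features" to the ADF layout above.

    The context goes on a shared line, each action gets its own line with an
    indicator feature a1..aK, and only the logged action's line carries a label.
    The leading 0 in that label copies the example in the question.
    """
    label, features = line.split("|", 1)
    action_str, cost, prob = label.strip().split(":")
    logged_action = int(action_str)
    adf_lines = ["shared |" + features.rstrip()]
    for a in range(1, num_actions + 1):
        if a == logged_action:
            adf_lines.append(f"0:{cost}:{prob} | a{a}")
        else:
            adf_lines.append(f"| a{a}")
    return adf_lines


cb_lines = [
    "2:10.02:0.5 | x0:0.47 x1:0.84 x2:0.29",
    "1:8.90:0.5 | x0:0.51 x1:0.65 x2:0.67",
]
for cb_line in cb_lines:
    print("\n".join(cb_to_adf(cb_line, num_actions=2)))
    print()  # a blank line ends a multi-line ADF example
```

This reproduces only the line layout. Whether the two formats train the same model is a separate matter: in a linear ADF model, shared features that are not interacted with the action features (for example via `-q` on named namespaces) add the same amount to every action's score, so on their own they cannot change which action is preferred, whereas the `--cb` reduction, as I understand it, keeps separate weights per action.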

Wes
@wmelton
Hello all, I've been evaluating Microsoft Personalizer for our company, which I have largely assumed is VW under the hood with MS-specific tech/services written on top of it.
My question is this: within a given namespace, does the order of features or their names matter? I'm assuming yes, but the VW documentation out there doesn't make it very clear how to handle a situation where two documents contain the same keywords but, after tokenization, the keywords are not in the same order because of differences in the number of keywords found in each document. I'd appreciate guidance there.
Finally, I referenced Personalizer only because it sparked this train of thought: its documentation uses only the JSON input format and seems to give no guidance on variation in keyword order when your features are keywords extracted from a document. Thanks!
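
A toy illustration of why keyword order within a namespace should not matter for VW's default linear model: each feature is hashed to a weight and the weights are summed, and a sum does not depend on order. Feature names, by contrast, do matter, since they are what gets hashed. A sketch under assumptions: `zlib.crc32` stands in for VW's own hash (murmurhash), the weights are made up, and options that build features from sequences, such as `--ngram` or `--skips`, do make order matter.

```python
import zlib
from typing import Dict, Iterable

NUM_BUCKETS = 2 ** 18  # hypothetical table size, mirroring VW's default of -b 18


def bucket(namespace: str, feature: str) -> int:
    """Map a (namespace, feature) pair to a weight index; crc32 stands in for VW's hash."""
    return zlib.crc32(f"{namespace}^{feature}".encode("utf-8")) % NUM_BUCKETS


def linear_score(namespace: str, keywords: Iterable[str], weights: Dict[int, float]) -> float:
    """A plain linear model adds up one weight per feature, so order does not enter."""
    return sum(weights.get(bucket(namespace, kw), 0.0) for kw in keywords)


# Made-up weights (exact binary fractions, so the sums compare exactly).
weights = {bucket("kw", kw): w for kw, w in [("price", 0.5), ("sale", -0.25), ("shoes", 1.125)]}

doc_a = ["price", "sale", "shoes"]
doc_b = ["shoes", "price", "sale"]  # same keywords, different order

print(linear_score("kw", doc_a, weights), linear_score("kw", doc_b, weights))
```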
Max Pagels
@georgefei did you already get an answer to your questions? I'd be very interested in them, too
Particularly, I imagine lots of folks do evaluation by grid-searching cb_type ips/dm/dr and choosing the one with the best reported loss. Isn't that wrong, especially considering DM is biased? --eval throws an error if you use DM.
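
For reference, textbook forms of the three estimators, as a sketch on made-up logs rather than a copy of VW's internal code. It illustrates the bias concern: with a cost model that is wrong about action 1, DM reports 0.0 for "always play action 1", while IPS and DR both recover -1.0. Note that DR also uses the logged propensities, in its correction term.

```python
from typing import Callable, Sequence, Tuple

# (context, logged_action, cost, propensity of the logged action)
Event = Tuple[str, int, float, float]
Policy = Callable[[str], int]
CostModel = Callable[[str, int], float]


def dm_estimate(events: Sequence[Event], policy: Policy, cost_model: CostModel) -> float:
    """Direct method: trust the cost model's prediction for the policy's action."""
    return sum(cost_model(x, policy(x)) for x, _, _, _ in events) / len(events)


def ips_estimate(events: Sequence[Event], policy: Policy) -> float:
    """Inverse propensity scoring: reweight matching logged costs by 1 / propensity."""
    return sum(c / p for x, a, c, p in events if policy(x) == a) / len(events)


def dr_estimate(events: Sequence[Event], policy: Policy, cost_model: CostModel) -> float:
    """Doubly robust: DM prediction plus an IPS-weighted correction of the model's error."""
    total = 0.0
    for x, a, c, p in events:
        chosen = policy(x)
        total += cost_model(x, chosen)
        if chosen == a:
            total += (c - cost_model(x, a)) / p
    return total / len(events)


# Toy logs: action 0 costs 0.0, action 1 costs -1.0 and was logged with propensity 0.1.
events = [("ctx", 0, 0.0, 0.9)] * 9 + [("ctx", 1, -1.0, 0.1)]


def always_1(x: str) -> int:
    return 1


def wrong_model(x: str, a: int) -> float:
    return 0.0  # never learned that action 1 is better


def right_model(x: str, a: int) -> float:
    return -1.0 if a == 1 else 0.0


print("DM, wrong model:", dm_estimate(events, always_1, wrong_model))
print("IPS:", ips_estimate(events, always_1))
print("DR, wrong model:", dr_estimate(events, always_1, wrong_model))
print("DR, right model:", dr_estimate(events, always_1, right_model))
```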
George Fei
@georgefei
Related to my third question above, about whether the same data in cb and cb_adf formats yields the same result: I'm not sure about the VW implementation, but in the contextual bandit bake-off paper the reward estimation is formulated differently for each case:
Max Pagels
I think there is a very clear need for a policy evaluation tutorial on vowpalwabbit.org. I'd be happy to write one, assuming someone can help answer questions as they arise, since I have a couple of outstanding ones myself. Would folks find this valuable?
Max Pagels

@lalo @olgavrou et al., note that I will need expert advice on this. There is a checklist that needs to be confirmed with absolute certainty or, if untrue, commented on to give me the correct interpretation.

Max Pagels
@pmineiro would also be a very good additional reviewer.
olgavrou
@olgavrou
@maxpagels_twitter thanks for taking a stab at this, much appreciated! Will add reviewers to the PR
Max Pagels
@olgavrou no problem. I have a bunch more to come :)
Anyone here have experience using CBs in VW with a "no op" arm that by definition can't generate a reward?
For example, imagine a use case where we may contact a user automatically. The arms are "email user", "send text message to user", "send push notification to user" and "do nothing". The first three all have click-through as the reward, but "do nothing" of course has no such obvious reward. Is there a standard way to handle this? All pointers much appreciated.
Max Pagels
In my view, doing nothing is a valid action, but even doing nothing can be good or bad, and per the CB problem setup it needs some reward. Perhaps CTR isn't the best reward metric to use? Can you find another signal that applies to all actions?
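
One way to act on this suggestion, sketched with hypothetical names: pick an outcome that can be observed no matter which arm was played, and derive every arm's cost from it. `user_engaged` and `contact_penalty` are illustrative assumptions, not anything VW or this thread defines.

```python
ACTIONS = ["email", "sms", "push", "do_nothing"]


def cost_for(action: str, user_engaged: bool, contact_penalty: float = 0.0) -> float:
    """Hypothetical cost defined for every arm, including "do_nothing".

    user_engaged: a hypothetical signal observable whichever arm was played
        (unlike a click on a message that was never sent).
    contact_penalty: optional hypothetical cost for contacting the user at all.
    VW minimizes cost, so the good outcome gets the negative value.
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    cost = -1.0 if user_engaged else 0.0
    if action != "do_nothing":
        cost += contact_penalty
    return cost


for action in ACTIONS:
    print(action, cost_for(action, user_engaged=True, contact_penalty=0.1))
```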
Wenjuan Dou
@darlwen
@here For contextual bandit policies, VW currently provides ips, dr, dm and mtr. mtr uses a linear model to optimize the policy. I wonder whether we could use a tree-based or DNN-based model instead?
Crystal Wang
@cwang506
Hi everyone! I'm currently using VWClassifier to predict binary labels (-1, 1) on a dummy dataset where y = sigmoid(X@w) for some random X and w. I can use pyvw to fit the training dataset perfectly when I use VWClassifier without any regularization, but I'm noticing strange behavior once I add regularization. For example, when I add l1 regularization of 1e-3, all of my training and test predictions get pushed to 1, and the ROC AUC score between the predicted and actual labels is 0.5 for both training and test. With sklearn's SGDClassifier and LogisticRegression I get very different results: the predictions do not get pushed to 1, and the ROC AUC scores are all > 0.5 when I compare the predicted outcome to the actual outcome. Here is the code I'm running; any help would be greatly appreciated! Thanks :)
Crystal Wang
@cwang506
Дмитрий Корнильцев
@kornilcdima:matrix.org
Hi everyone! Does anyone have a working Python example of VW's new feature "CATS, CATS pdf for Continuous Actions"?
I'm really new to RL and this library. Can someone help me understand how to process data with this new bandit?
My task is the following: I have prices and need to find the optimal price (neither too high nor too low), and my reward is the click rate.
I'd appreciate any help :)
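
A sketch of turning one logged price decision into a CATS training example, under stated assumptions: the `ca <action>:<cost>:<pdf_value>` label layout is taken from VW's CATS examples, the feature names are made up, and the click is mapped to a cost because VW minimizes cost. The model itself would be created with `--cats`, `--bandwidth`, `--min_value` and `--max_value` covering the price range; the notebook linked in the reply below covers the full predict/learn loop.

```python
def cats_training_line(price: float, clicked: bool, pdf_value: float, features: str) -> str:
    """Format one logged price decision as a CATS text example.

    Assumes the "ca <action>:<cost>:<pdf_value> | <features>" label layout.
    pdf_value is the density the model reported for the chosen price at
    prediction time. A click becomes cost -1.0, no click cost 0.0.
    """
    cost = -1.0 if clicked else 0.0
    return f"ca {price}:{cost}:{pdf_value} | {features}"


# Hypothetical logged decision: price 19.99 was shown and clicked.
print(cats_training_line(19.99, True, 0.02, "segment=new device=mobile"))
```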
Jack Gerrits
@jackgerrits
Olga created a tutorial Jupyter notebook that has not been merged yet, but it is a great resource: VowpalWabbit/jupyter-notebooks#6
Jack Gerrits
@jackgerrits
What version of vw are you using?
Chang Liu
@changliu94
Hi everyone! Does anyone know if there is any command/implementation in vw that can tackle non-stationary environment? Thanks in advance!
Harsh Khilawala
@HarshKhilawala_gitlab
I am new here and want to get started contributing to VowpalWabbit. Can anyone please help me get started?
Mónika Farsang
@MoniFarsang
Hi, does someone know whether the RL open source fest results are already out?
Nishant Kumar
@nishantkr18
Yes they are now. Congratulations to everyone selected!
daraya123
@daraya123

Hi all, I ran into a problem using the SquareCB algorithm to train a contextual bandit model, specifically when saving the model and loading it again.
I trained and saved a SquareCB model as follows (using the simulation setup from https://vowpalwabbit.org/tutorials/cb_simulation.html):

```python
from vowpalwabbit import pyvw

# run_simulation, plot_ctr, users, times_of_day, actions and get_cost come from the cb_simulation tutorial
vw = pyvw.vw("--cb_explore_adf -q UA -f squarecb.model --save_resume --quiet --squarecb")
num_iterations = 5000
ctr = run_simulation(vw, num_iterations, users, times_of_day, actions, get_cost)
plot_ctr(num_iterations, ctr)
vw.finish()  # writes squarecb.model
```

and then loaded the model:

```python
vw_loaded = pyvw.vw('--cb_explore_adf -q UA -i squarecb.model')
num_iterations = 5000
ctr = run_simulation(vw_loaded, num_iterations, users, times_of_day, actions, get_cost, do_learn=False)

plot_ctr(num_iterations, ctr)
print(ctr[-1])
```

The loaded model seems to be exploring randomly.
Could anyone explain how to save and load this model correctly? Thanks in advance.

daraya123
@daraya123
George Fei
@georgefei
Hi all, if I use explore_eval to evaluate a reward estimation + exploration algorithm combination, is it fair to compare the average loss reported by explore_eval with the realized average loss of the training data?
Chang Liu
@changliu94
Can anyone here help me understand how the bagging algorithm does counterfactual learning from logged bandit data? From the bagging algorithm in the bake-off paper, we can see it reduces to an oracle where the probability of choosing action a is the proportion of the current policies that evaluate action a as optimal. So how is the probability in the logged data used? I am perplexed here.
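
The exploration side described in the question can be sketched as follows: each bagged policy votes for the action it considers best, and an action's probability is its share of the votes. `bagging_pmf` is a hypothetical helper; it does not show how VW's `--bag` trains the individual bags, which is where the logged probability enters (see John's reply further down).

```python
from collections import Counter
from typing import List, Sequence


def bagging_pmf(bag_votes: Sequence[int], num_actions: int) -> List[float]:
    """Probability of each action = share of bagged policies that pick it as best."""
    counts = Counter(bag_votes)
    return [counts[a] / len(bag_votes) for a in range(num_actions)]


# Four bags: two prefer action 0, one prefers action 1, one prefers action 2.
print(bagging_pmf([0, 0, 1, 2], num_actions=3))
```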
Bernardo Favoreto
Hey guys!
I was reading Microsoft's Personalizer docs and wondering why changing the Reward wait time triggers retraining. I understand that there might be delay-related bias, but I don't understand how retraining the model is any better than simply using the existing model with an updated reward wait time. In my mind, offline training doesn't suffer from delay-related bias (because all the data is there, ready for use).
Does this make sense? What am I missing?
Also, what happens if the user decides to add or modify attributes? Does that trigger retraining, or is it simply a matter of changing the Rank input?
Thanks!
Max Pagels
The key issue is this: say your wait time is one minute and the reward is a click. Now, let's say the click arrived 1.5 minutes after the prediction. It's thus not in your training data, and the event is assigned whatever default reward you specified.
Now, if you change the wait time to 2 minutes, the training data must be recreated so as to include your click, which means the model too must be retrained otherwise you are training on different definitions of reward (old stuff has rewards calculated with 1 minute cutoff, newer with 2 minute cutoff). This leads to general weirdness
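
The labeling rule described here can be sketched as follows: the same logged event gets a different reward depending on the wait time, so history labeled under the old wait time and data labeled under the new one would disagree unless the training data is rebuilt. `reward_for_event` and its parameters are hypothetical; the 1.0/0.0 rewards stand in for "click" and "whatever default reward you specified", and this is not Personalizer's actual join code.

```python
from typing import Optional


def reward_for_event(click_delay_s: Optional[float], wait_time_s: float,
                     click_reward: float = 1.0, default_reward: float = 0.0) -> float:
    """Use the click reward only if the click arrived within the wait time.

    click_delay_s is None when no click ever arrived.
    """
    if click_delay_s is not None and click_delay_s <= wait_time_s:
        return click_reward
    return default_reward


# The click from the example arrives 1.5 minutes (90 s) after the prediction.
print(reward_for_event(90, wait_time_s=60))   # 1-minute wait: 0.0 (default reward)
print(reward_for_event(90, wait_time_s=120))  # 2-minute wait: 1.0 (click reward)
```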
If you add features, at least in VW-land, you don't need to retrain; you just continue training on new data. So in Personalizer I think it's just a matter of changing the Rank input.
Of course, if you add informative features later on, they will only be recorded for events after the change, but in an online system that doesn't really matter, since it will correct itself over time.
Bernardo Favoreto
Awesome, Max! That's exactly what I thought about in both scenarios. Because Personalizer saves data even after the reward wait time, we'd be able to create new data considering the updated wait time.
Thanks!
John
@JohnLangford
@changliu94 the probability of an action is passed to the base algorithm that bagging reduces to where it is used in the update rule.
Marcos Passos
@marcospassos
Hey guys! Does anyone know if there is a way to get around the first-letter limitation on namespace names? We need to use arbitrary keys that may begin with the same letter.
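
One possible workaround, sketched under assumptions: VW options such as `-q` refer to a namespace by its first character, so giving every key its own leading letter (and keeping the original key after it for readability) keeps the keys distinguishable in those options; with the mapping below, `-q ac` would target `user_age` and `item_id`. `assign_namespaces` and `to_vw_features` are hypothetical helpers, and the 52-key limit is a property of this sketch.

```python
import string
from typing import Dict, Iterable, List


def assign_namespaces(keys: Iterable[str]) -> Dict[str, str]:
    """Give each distinct key a namespace name with a unique first letter.

    Keys must not contain spaces, '|' or ':', which have special meaning in
    VW's text format. At most 52 distinct keys (one per ASCII letter).
    """
    unique_keys = list(dict.fromkeys(keys))  # keep first-seen order, drop repeats
    if len(unique_keys) > len(string.ascii_letters):
        raise ValueError("more keys than available leading letters")
    return {key: letter + key for letter, key in zip(string.ascii_letters, unique_keys)}


def to_vw_features(features_by_key: Dict[str, List[str]], mapping: Dict[str, str]) -> str:
    """Render '|<namespace> feat feat ...' segments for one example."""
    return " ".join(f"|{mapping[k]} " + " ".join(feats) for k, feats in features_by_key.items())


mapping = assign_namespaces(["user_age", "user_country", "item_id"])
print(mapping)
print(to_vw_features({"user_age": ["age=30s"], "item_id": ["id=42"]}, mapping))
```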