Chang Liu
@changliu94
Thank you, Max! It is very much appreciated that we have a tutorial on off-policy evaluation!
Bernardo Favoreto
How are arrays interpreted in VW? According to (https://docs.microsoft.com/en-us/azure/cognitive-services/personalizer/concepts-features#categorize-features-with-namespaces), we can use an array as a feature value, as long as it's a numeric array.
I was wondering how this gets interpreted by VW. The docs show an example of a feature called "grams" whose value is an array (e.g., [150, 300, 450]), but it is still unclear to me what happens when we use arrays as feature values.
Thanks!
kornilcdima
@kornilcdima

Hey everybody. I've just started to use VW, and I'm solving a dynamic pricing problem where the price is a discrete action space (10 arms). Prices are split into buckets, and each bucket holds 10% of the prices. My cost is CTR. My probability is a constant 0.1, since I have 10 arms and each of them appears in 10% of cases. My goal is to find the optimal prices, i.e. those that increase CTR.
I know that CATs is better for my case but I prefer not using it as the first attempt.

I have the following questions:
1). What is the main difference between --cb and --cb_explore? As I understand it, --cb_explore outputs probabilities and --cb doesn't. I’ve also seen it mentioned that --cb doesn't do exploration and --cb_explore does. Am I right on this point?
2). VW requires the following format: action:cost:probability. And the probability here is nothing but the pmf. Would it be right in my case to just set 0.1 for all examples?
3). I do a kind of pretraining on logged data (an existing dataset) to learn a policy with the parameters --cb_explore 10 --cover 13. After that I use the pre-trained model with the flag -i. I get the output with probabilities and take the action with the highest probability as the predicted value. Will I be exploring in this case?
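
For reference, the action:cost:probability format from question 2 can be sketched in plain Python. This is only an illustration of the text format. The helper name `cb_line`, the feature names, and the two logged rounds are hypothetical. It assumes a click is logged as cost -1 (VW minimizes cost) and that every bucket was shown with probability 0.1.

```python
def cb_line(action, cost, probability, features):
    """Build one `--cb` example: action:cost:probability | features."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    feature_text = " ".join(f"{name}:{value}" for name, value in features.items())
    return f"{action}:{cost}:{probability} | {feature_text}"


# Hypothetical logged rounds: (price bucket shown, click observed, context).
logged = [
    (3, 1, {"hour": 14, "device_mobile": 1}),
    (7, 0, {"hour": 9, "device_mobile": 0}),
]

UNIFORM_PROBABILITY = 0.1  # 10 arms, each shown in 10% of cases

lines = [
    # A click (reward 1) is written as cost -1; no click stays at cost 0.
    cb_line(bucket, -click, UNIFORM_PROBABILITY, context)
    for bucket, click, context in logged
]
print("\n".join(lines))
```

Each printed line could then be written to a training file. Only the action that was actually shown carries a cost and a probability.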

Bernardo Favoreto
Hello everyone!
I have a question concerning the use of Slates + CB and CCB + CB.
I've come across the following presentation from Netflix (https://www.slideshare.net/FaisalZakariaSiddiqi/netflix-talk-at-ml-platform-meetup-sep-2019) and was wondering if they used Slates.
Apparently, they do. However, I don't understand how we can use a single slate first to pick a title for a slot and then, at the same prediction, choose a thumbnail. That's why I believe they instead use Slates for title recommendation on multiple slots and CB for thumbnail selection afterward. Would that make sense?
I believe that if the actions for other slots depend on the first slot's action (e.g., the option of thumbnails for a title depends on the title), Slates cannot be used.
For CCB + CB, an example could be using CCB to order topics in a list and then CB to pick the written text for each topic.
Is using Slates or CCB + CB reasonable? Is it very use-case-specific? I'm afraid I'm missing something here.
Thanks!
Max Pagels

So as far as I know, Netflix predefines the possible combos of e.g. title and image, and each combo forms a single arm. Of course the number of combos is massive, so I don't think they use all of them.

There has to be some prefiltering going on, since I suspect showing (title: crime dramas, show: top gear, picture: jurassic park) would lead to issues :). So I think that they aren't using slates as slates are defined in VW, merely a large action space where one action is one predefined combo of title, genre, picture and so on.

I may also be wrong here

Bernardo Favoreto
Hey guys, regarding CCB and Slates... what's the use of slot attributes? What sort of attributes can a slot have? Would love to hear some examples!
Thanks
kornilcdima
@kornilcdima

Hey guys,
Does anyone have an example of daemon-style code for CATs? Right now I’m using a Python wrapper which I took from Olga’s notebook example, and it works fine. However, I only have a vague idea of how to launch it in daemon style.
Is it something like this?
Pre-training the model on historical data:

vw --cats 6 --bandwidth 0.5 --min_value 0 --max_value 3 --epsilon 0.3 -d train.dat -f model.vw

vw --cats 6 --bandwidth 0.5 --min_value 0 --max_value 3 --epsilon 0.3 --save_resume --daemon --quiet --num_children 1 --port 8080 -i model.vw -f model.vw

Updating the model on new data:

vw --cats 6 --bandwidth 0.5 --min_value 0 --max_value 3 --epsilon 0.3 --save_resume -i model.vw -d train.dat -f model.vw
olgavrou
@olgavrou
hi @kornilcdima here is some documentation on how to use vw in daemon mode, and it should work fine if you start vw with the appropriate cats arguments: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format#on-demand-model-saving
you can also gather vw's predictions by passing in the cli argument: -p <predictions_file>
kornilcdima
@kornilcdima

@olgavrou, thanks for the answer. I already launched VW for a discrete action space, so, as I understand it, I should use the same syntax. Then the main question is: in what situation should I use --cats_pdf instead of --cats?

olgavrou
@olgavrou
@kornilcdima cats will call cats_pdf under the hood and then sample from the pdf for you. So you would use cats when you want vw to do the pdf sampling for you, and cats_pdf when you want the entire pdf and/or want to do the sampling yourself (see here: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/CATS,-CATS-pdf-for-Continuous-Actions)
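
To make "do the sampling yourself" concrete, here is a minimal sketch in plain Python. It assumes the pdf is a list of (left, right, pdf_value) segments, like the tuples printed later in this thread. The function name `sample_from_pdf` is hypothetical, and this is not VW's own sampling code.

```python
import random


def sample_from_pdf(segments, rng=random):
    """Draw one action from a piecewise-constant pdf.

    `segments` is a list of (left, right, pdf_value) tuples.
    Returns the sampled action and the pdf value at that action.
    """
    # Probability mass of a segment is its width times its density.
    masses = [(right - left) * pdf_value for left, right, pdf_value in segments]
    # Pick a segment in proportion to its mass, then a uniform point inside it.
    left, right, pdf_value = rng.choices(segments, weights=masses, k=1)[0]
    return rng.uniform(left, right), pdf_value


# Segment values rounded from the --cats_pdf output quoted in this thread.
segments = [
    (0.0, 17.924, 0.0428038),
    (17.924, 80.0, 0.00375),
]

total_mass = sum((right - left) * p for left, right, p in segments)
action, pdf_value = sample_from_pdf(segments, random.Random(0))
print(round(total_mass, 3), action, pdf_value)
```

The returned pair has the same shape as the (action, pdf_value) pair that is fed back when learning with --cats.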
Ryan Angi
@rangi513

@maxpagels_twitter Thanks a ton for your OPE tutorial and your really insightful questions above - I've found it extremely useful.
Currently I have a logged dataset generated from an online bandit policy, --cb_explore_adf with --epsilon 0.05 --cb_type dr. I want to determine whether I should be using dr or mtr (IWR) as the cb_type for my online bandit (assuming I restart my policy in the future). I can run --cb_adf over the logged dataset: vw --cb_adf -d train.dat -q AF --cb_type mtr. However, based on the OPE tutorial and the comments/questions above, I understand that I shouldn't compare the PV loss across different OPE estimators. Is there a method I should use to determine the best cb_type option for my online policy? (mtr shows a much lower loss than dr, but I understand this isn't really comparable.)

Please let me know if I'm thinking about this completely wrong and if I should continue to use doubly robust and spend my time fiddling with hyperparameters instead of focusing too much on the PE estimator.

George Fei
@georgefei
Hi everyone, I came across this thesis https://core.ac.uk/download/pdf/154670973.pdf while I was searching online for best practices for setting hyperparameters like learning rate, learning rate decay, etc. for vw. That thesis was written in 2015 and it concludes that "the performance of vw can seriously deteriorate over time in an online setting with nonstationary data", because the learning rate strictly decreases if we make it decay, and if we set the learning rate to be fixed, we risk underfitting/overfitting. Are the concerns raised in that thesis still relevant, and are there methods implemented now to address them?
Max Pagels
Anyone else have issues with PLT? Can't see any difference at all if I change kary_tree, lr
CLI, 8.10.1 (git commit: 3887696)
kornilcdima
@kornilcdima
Hey, I have 2 questions about CATs.
1. What value should I put for the pdf when I pre-train a model on historical data? It differs from a discrete action space, where I could set a pmf based on the prior distribution of arms; in the case of CATs the action space is continuous. I have 2 options in mind: (1) use a constant, or (2) use the pdf value from vw.predict.
2. I'd like to imitate Thompson Sampling (TS). Does it make sense to use --bootstrap to imitate TS? According to this article, it does: https://arxiv.org/abs/1706.04687
olgavrou
@olgavrou
@kornilcdima yes, cats will expect the value that the pdf had at the action predicted, which is the prob value that you see from vw.predict (the action predicted and the pdf_value at that action), so using that is the right way to go if you are training on historical data
George Fei
@georgefei
hey team, I have a very basic question regarding --save_resume, I noticed after setting a non-default learning rate, power_t, lambda and other hyperparameters in the first pass and setting --save_resume, when I load the model during the second pass the parameters displayed in the output went back to the default ones. this makes me wonder if the original hyperparameters are still being used in the subsequent passes?
kornilcdima
@kornilcdima

Hi everyone, could someone suggest what is better to do in my case.
I'm using CATs for predicting CTR (the balance is 4-10%), where the action space is the price to pay for a click. The price space is non-stationary and changes over time. The reward function is: -1 for a win, 0 for a loss. I also tried using the win probability (from LogReg) instead of -1.
I failed to get any good pre-trained policy on my historical data. I tried different exploration policies and different parameters, but I always get a very low cumulative reward rate (see the picture).
The distribution of predicted prices that I get from vw.predict is uniform and with a small bandwidth does not cover the whole range of prices.

Is it a good idea to do pre-training then, since I only get the uniform distribution of prices?

George Fei
@georgefei
Hi everyone, I noticed the average loss when using cb_explore is lower than that when using cb, on the same logged data with all the hyperparameters fixed. Is it because cb_explore uses the pmf to pick an action stochastically and cb always picks the action that is estimated to perform the best? If this is the case, when tuning the hyperparameters in backtesting using logged data, should I run in cb_explore mode since it's closer to the production setting?
kornilcdima
@kornilcdima

Hey, I have a question about --cats_pdf output
When I print the pdf value, it gives 2 tuples with the chosen action range and the exploit probability. The values inside are always constant. Is this normal behavior? I expected to see different values, i.e. different ranges and pdf_values. My version of Python VW is '8.10.0'.

If I use --cats_pdf, then pdf output is:

1 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)
2 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)
3 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)
4 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)
5 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)
6 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)
7 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)
8 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)
9 (0.0, 17.923999786376953, 0.04280378296971321) (17.923999786376953, 80.0, 0.003750000149011612)

When I use just --cats, I put values for learning in the following manner:

prediction, pdf_value = vw.predict(f'ca | {f_2} {f_3} {f_4} {f_5} {f_6}')
vw.learn(f'ca {price_pred}:{cost}:{pdf_value} |{f_2} {f_3} {f_4} {f_5} {f_6}')

Model's parameters:

min_value = 0
max_value = 80
bandwidth = 16

vw = pyvw.vw(f"--cats_pdf {num_actions} --bandwidth {bandwidth} \
--min_value {min_value} --max_value {max_value} \
--dsjson --chain_hash --{exploration}")
Max Pagels

I have a silly IPS question. Let's say I have a uniform random logging policy that chooses between two actions and always receives a reward of 1 regardless of context. Evidence would suggest that no matter what policy I deploy afterwards, I would continue to get a reward of 1 per round.

Now let's say I have a candidate policy P that happens to choose the same actions at each timestep, though not with any randomness/exploration.

Based on this data, per round, the IPS estimate is 1/0.5 = 2, and since both policies agree each round, the average IPS over the history is also 2, when you would expect it to be one given that regardless of context or exploit/explore, that's the reward the logging policy saw each round. The candidate policy, if deployed, won't get to a reward of 2 per round, but rather 1.

What assumption am I violating in this example? Is there some stationarity requirement? I thought the IPS estimator is a martingale.

Max Pagels
I'd add that with snips instead of ips, I get the expected 1.0.
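
The arithmetic in this example can be reproduced with a small sketch in plain Python (the function names are hypothetical; this is not VW code). The first scenario is the one described above, where the deterministic candidate matches the logged action on every round. The second is added only for comparison, under the assumption that a deterministic policy would match a uniform two-action logger on about half the rounds.

```python
def ips(rewards, logged_probs, matches):
    """IPS estimate of average reward; matches[i] is 1 if the candidate
    policy picks the logged action on round i, else 0."""
    weights = [m / p for m, p in zip(matches, logged_probs)]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips(rewards, logged_probs, matches):
    """Self-normalized IPS: divide by the sum of weights, not the round count."""
    weights = [m / p for m, p in zip(matches, logged_probs)]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


n = 1000
rewards = [1.0] * n        # reward is always 1
logged_probs = [0.5] * n   # uniform logging policy over two actions

all_agree = [1] * n              # scenario from the message above
half_agree = [1, 0] * (n // 2)   # comparison scenario (assumption)

ips_all = ips(rewards, logged_probs, all_agree)
snips_all = snips(rewards, logged_probs, all_agree)
ips_half = ips(rewards, logged_probs, half_agree)
print(ips_all, snips_all, ips_half)  # 2.0 1.0 1.0
```

In the first scenario the importance weights average 2 instead of 1, which is what SNIPS normalizes away.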
MochizukiShinichi
@MochizukiShinichi
Hello everyone, VW newbie here :) I'm trying to use the vw contextual bandits implementation to solve an optimization problem where the arm would be a product and the feedback would be click/dismiss. Unlike optimizing for CTR, where the cost is 0/1, I'm thinking of assigning a value representing the estimated monetary impact to each (arm, feedback) combo. For instance, a click on product X would yield a reward of $5. Are there any caveats I should be aware of when using non-binary rewards? Thanks in advance!
kornilcdima
@kornilcdima

@olgavrou is it possible to take a saved CATs model and change the flag --cats to --cats_pdf, in order to get not only a continuous prediction but also the ranges themselves?

From what I see, if I use a saved model and switch the flag, the model's output is still the prediction and pdf_value. But it'd be good to get the range buckets as well.

George Fei
@georgefei

could someone quickly confirm if my understanding is correct? I also had a typo in the original question; the average loss when using cb_explore is higher than that when using cb

Ryan Angi
@rangi513

@pmineiro I watched your presentation on Distributionally Robust Optimization from December at NeurIPS and it was really well done. One question I have has to do with your first point on why this works for --cb_adf (offline optimization) but not --cb_explore_adf (online optimization). I similarly see loss improvements using this offline with data collected online from Policy A to train another policy, Policy B (with a slight increase in the constant learning rate). However, I'm trying to rationalize why it would be a bad idea to add --cb_dro to my online policy, which is trained sequentially on minibatches with --save_resume and --cb_explore_adf (epsilon greedy).

Will this not work well for me because the (tau) exponentially weighted averages of the sufficient statistics will no longer be able to keep track of what time t it is at? Or is it some other reason?

Would the better way to think about using this feature be: use --cb_dro offline to discover the best hyperparameters to use, and then use those hyperparameters in the online setting? My hope is to use this almost as a regularization technique to improve online learning when I have a lot of features, but I would love some guidance if I have some fundamental misunderstanding of this feature and should just be using --l1 regularization online instead.

Max Pagels

A feature request (or actually two) came to mind and I'm wondering if there is a) a need for it and b) how technically challenging it is to implement.

The first is that the CLI average loss would be less confusing to newcomers if it stated if the loss is a PV or holdout loss (the little h might not be apparent) and, more importantly, what exactly the reported loss is (e.g 0/1, rmse, etc.)

The second is being able to use some --cb_type but report a different loss. E.g. train with DR but report IPS. I guess this is more tricky to implement but for consistency in policy evaluation, it would be nice.

Thoughts? Does anyone else think these might be improvements to the experience?

Jacob Alber
@lokitoth
Hey @maxpagels_twitter, it seems like this issue covers the request for using a different loss function in reporting. Does that match the second part of the request? VowpalWabbit/vowpal_wabbit#2222
André Monteiro
@drelum
Hi everyone. The issue VowpalWabbit/vowpal_wabbit#2943 is marked as fixed but the error is still present in version 8.10.2. Anyone else facing the same problem?
olgavrou
@olgavrou
@drelum 8.10.2 and 8.10.1 were patch releases and solved very specific things, they did not include the bug fix from that issue unfortunately. If you use one of the latest python wheels from master then the problem should not persist: see https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Python#bleeding-edge-latest-commit-on-master
Sean Pinkney
@spinkney

I'm struggling with how to exactly input this data into vowpal wabbit cb

This is one example. I have the machine-session as a name. Then I have a numeric feature which should be split up as a vector (vector 1231242), "clues" which have associated confidences between 0-1 associated with them, and the action (ie demos with K classes). In this case, there are 3 actions. Ideally, the algorithm would choose action 1 as there are 2 clues each with high confidence. The outcome should be only in 1 class.

    machine session  vector    clues        confidences        demos
1:       1       1 1231242    {1, 2, 5}    {1.0, 1.0, 0.8}    {1, 1, 2}

I don't have probabilities associated with the actions and I'm not sure how this would look. In one sense the https://vowpalwabbit.org/tutorials/cb_simulation.html shows something that I'd like to try as

shared | User machine=1 session=1
:0.0:|Action demo=1 |Vector :1 :2 :3 :1 :2 :4 :2
:0.0:|Action demo=1 |Vector :1 :2 :3 :1 :2 :4 :2
:0.2:|Action demo=2 |Vector :1 :2 :3 :1 :2 :4 :2

But maybe I should just use the multiclass format instead, although this leaves out that there are 2 clues for action 1:

1:0 2:0.2 |Vector :1 :2 :3 :1 :2 :4 :2
Ryan Angi
@rangi513
In contextual bandits we care about minimizing regret (maximizing reward) over time. Generally, OPE methods and progressive validation loss are helpful in determining the average performance of a policy offline. Do we ever care about measuring how accurate (RMSE or otherwise) the greedy linear model underneath the policy is at estimating cost against a test set? If I did care about measuring the performance of the cost-sensitive classifier underneath the policy, is that something I could extract from a VW cb_explore_adf policy, or do I need to train a new regression from scratch in VW with the same parameters?
kornilcdima
@kornilcdima
Is it possible somehow to change the posterior distribution for a chosen context? Because MABT selected one optimal arm many times, the variance of the posterior decreased, and now it is no longer able to choose another arm that has since become optimal for that context.
Jack Gerrits
@jackgerrits
Just a quick announcement/FYI, we're using issues as a way to communicate and discuss deprecations and removals for VW. Take a look at the 'deprecation' (there's just two) tag and if it is something you have an opinion on then please feel free to comment https://github.com/VowpalWabbit/vowpal_wabbit/issues?q=is%3Aissue+is%3Aopen+label%3ADeprecation We're hoping this is a reasonable way to communicate changes to allow us to make progress while not adversely affecting anyone
MochizukiShinichi
@MochizukiShinichi
Hey folks, could anyone please point me to some resources I can read on algorithm details of --cb_adf implementation in VowpalWabbit?
K Krishna Chaitanya
@kkchaitu27
Hi everyone, I have a doubt regarding action probabilities in the input format of Vowpal Wabbit for contextual bandits. In the wiki, it is said that the input format must be action:cost:probability | features. What is the probability here? Is it the probability for the action to get a reward/cost, or something else? I read somewhere that it is the probability of exploration for that action. What does that mean?

hello, I am trying to train vowpal model using C++ API using this piece of code:

    vw* vw = VW::initialize("-f train1.vw --progress 1");
    {
        ezexample ex(vw, false);

        ex.set_label("1");
        ex.train();
        ex.finish();
    }
    {
        ezexample ex1(vw, false);

        ex1.set_label("0");
        ex1.train();
        ex1.finish();
    }

    VW::finish(*vw);

this snippet generates the model, but the number of examples and number of features is 0, am I doing something wrong? I also tried to use example instead of ezexample and the result was the same and in either case, I did not see a progress log...

Max Pagels
@kkchaitu27 contextual bandits have exploration, i.e. there should always be a nonzero probability of choosing some action. The reason for this is to try out different actions to learn what works and what doesn’t. This probability is the one mentioned in the docs. Its value depends on the exploration algorithm: for epsilon-greedy with two actions and 10 percent exploration, the best action is chosen with prob .95 and the other with .05. If you use cb_explore when collecting data, vw calculates these probabilities for you.
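
The .95/.05 numbers above follow from spreading epsilon uniformly over all actions and giving the remaining mass to the best one. A minimal sketch in plain Python (the function name `epsilon_greedy_pmf` is hypothetical, not a VW API):

```python
def epsilon_greedy_pmf(num_actions, best_action, epsilon):
    """Probability of each action under epsilon-greedy exploration."""
    # Every action gets an equal share of the exploration mass...
    pmf = [epsilon / num_actions] * num_actions
    # ...and the greedy action also gets the remaining 1 - epsilon.
    pmf[best_action] += 1.0 - epsilon
    return pmf


pmf = epsilon_greedy_pmf(num_actions=2, best_action=0, epsilon=0.1)
print([round(p, 2) for p in pmf])  # [0.95, 0.05]
```

The probability logged with each example would be the pmf entry of whichever action was actually chosen.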
K Krishna Chaitanya
@kkchaitu27
@maxpagels_twitter Thanks for your response, how do I compute probability if I have historic data? Is it equal to number of times that action has been chosen/total number of times the context has appeared?
Ryan Angi
@rangi513

I'm happy to turn this into a github issue, but want to make sure I'm not attempting some unintended behavior first.

I am attempting to do multiple passes over a cb_adf dataset to hopefully improve the quality of my q function. I'm thinking of trying an offline bandit using the whole dataset and multiple passes instead of online with iterative updates. However, I get the following error after the first pass:

libc++abi.dylib: terminating with uncaught exception of type VW::vw_exception: cb_adf: badly formatted example, only one line can have a cost
[1]    90720 abort      vw --cb_adf --passes 2 -c -d train.dat

Here is my command and dataset for reproducibility:
vw --cb_adf --passes 2 -c -d train.dat

train.dat

shared | a:1 b:0.5
0:0.1:0.75 | a:0.5 b:1 c:2
| a:1 c:3

shared | s_1 s_2
0:1.0:0.5 | a:1 b:1 c:1
| a:0.5 b:2 c:1

I'm using version 8.10.1. I found this SO post and VowpalWabbit/vowpal_wabbit@431c270 by @jackgerrits that maybe was supposed to fix this but also could be unrelated.

Are multiple passes not supported for --cb_adf? If so, maybe some better error messaging might be useful here?
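
As a quick sanity check on the data itself, the number of labeled action lines per multiline example can be counted in plain Python. This is a simplified sketch and does not reproduce VW's parser: it assumes examples are separated by blank lines and that a label is any non-empty text before the first `|` other than `shared`. The function name `count_labeled_lines` is hypothetical.

```python
def count_labeled_lines(text):
    """Return the number of labeled action lines in each multiline example."""
    counts = []
    for block in text.strip().split("\n\n"):
        labeled = 0
        for line in block.splitlines():
            # Text before the first '|' is the label (or the 'shared' marker).
            label = line.split("|", 1)[0].strip()
            if label and label != "shared":
                labeled += 1
        counts.append(labeled)
    return counts


train_dat = """shared | a:1 b:0.5
0:0.1:0.75 | a:0.5 b:1 c:2
| a:1 c:3

shared | s_1 s_2
0:1.0:0.5 | a:1 b:1 c:1
| a:0.5 b:2 c:1
"""

counts = count_labeled_lines(train_dat)
print(counts)  # [1, 1]
```

Under these assumptions, both examples in train.dat carry exactly one cost, so the file itself looks well formed on a single read.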

K Krishna Chaitanya
@kkchaitu27

This is a sample dataset I created

1:1:1.0 2:2 3:3 4:4 | a b c
1:1 2:2:1.0 3:3 4:4 | a b c
1:1 2:2 3:3:1.0 4:4 | a b c
1:1 2:2 3:3 4:4:1.0 | a b c
1:1 2:2:0.7 3:3 4:4 | d e f

when I do

vw -d sampledata.vw --cb 4
I get
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
[critical] vw (cb_adf.cc:279): cb_adf: badly formatted example, only one cost can be known.