If anyone has any comments on this message I posted I'd be very grateful:
@pmineiro thanks for your patience answering all my questions. I did a quick sanity check: I'd expect explore_eval with 100% exploration, against a "world" that never changes and where exactly half of the actions are positive (-1 cost) and half negative (+1 cost), to report an estimated average loss of 0, but that's not the case. I'm not sure if this is due to some systematic bias, because in this particular case plain --cb_explore_adf reports the loss I'd expect. I filed an issue, but I'm not sure if it's a bug or intended behaviour: VowpalWabbit/vowpal_wabbit#2621
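To make that sanity check concrete, here is a rough sketch of the kind of simulated logged data I mean; the feature names, file name and action count are made up, and the ADF text format follows the usual VW examples:

# sketch: a stationary "world" with 4 actions, half good (cost -1) and half bad (cost +1),
# logged under a uniform-random (100% exploration) behaviour policy.
import random

NUM_ACTIONS = 4
COSTS = [-1, -1, 1, 1]  # half the actions are always good, half always bad

with open("simulated_cb.dat", "w") as f:   # hypothetical file name
    for _ in range(10000):
        chosen = random.randrange(NUM_ACTIONS)   # uniform logging policy
        prob = 1.0 / NUM_ACTIONS
        f.write("shared | user_feature\n")
        for a in range(NUM_ACTIONS):
            label = f"0:{COSTS[a]}:{prob} " if a == chosen else ""
            f.write(f"{label}| action_id={a}\n")
        f.write("\n")

Averaged over many rounds, the logged cost of this world is 0, which is why I expected an estimated average loss of 0.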
@maxpagels_twitter : you definitely do not ever run --cb_explore (or --cb_explore_adf) on an offline CB dataset without --explore_eval. You only run --cb_explore either 1) online, i.e., acting in the real world, 2) offline with a supervised dataset and --cbify (to simulate #1), or 3) offline with --explore_eval and an offline CB dataset (to simulate #1). Nothing else is coherent.
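For my own reference, this is how I read those three modes through the Python bindings; the flag combinations and values here are my assumptions, not something confirmed above:

# sketch of the three coherent ways to run --cb_explore(_adf); flags/values are illustrative.
from vowpalwabbit import pyvw

# 1) online: act in the real world, feeding each interaction through predict()/learn()
online = pyvw.vw("--cb_explore_adf --epsilon 0.1 --quiet")

# 2) offline with a supervised multiclass dataset and --cbify (simulates 1);
#    10 is a hypothetical number of classes
simulated = pyvw.vw("--cbify 10 --epsilon 0.1 --quiet")

# 3) offline with --explore_eval and a logged CB dataset (simulates 1)
evaluated = pyvw.vw("--explore_eval --epsilon 0.1 --quiet")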
@maxpagels_twitter I think Paul was referring to your question here
In contextual bandits, and in VW, doing this will fail because of the issue @pmineiro mentioned. The way to overcome this is to keep track of all predictions and their context in some DB or memory store and learn only when a reward arrives for a particular prediction/context, or a suitable amount of time has passed such that you can assume zero reward and learn on that.
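As a rough illustration of that bookkeeping (the store layout, timeout and example format here are just assumptions for the sketch, not how any particular system does it):

# sketch: join logged predictions with delayed rewards before learning on them.
import time

pending = {}            # event_id -> (shared_feats, action_feats, chosen, prob, timestamp)
TIMEOUT_SECONDS = 3600  # after this, assume zero reward and learn anyway

def record_prediction(event_id, shared_feats, action_feats, chosen, prob):
    # called at decision time, right after the policy picks an action
    pending[event_id] = (shared_feats, action_feats, chosen, prob, time.time())

def join_reward(event_id, reward):
    # called when the reward arrives; builds a labelled ADF example to learn on
    shared_feats, action_feats, chosen, prob, _ = pending.pop(event_id)
    cost = -reward
    lines = ["shared | " + shared_feats]
    for a, feats in enumerate(action_feats):
        label = f"0:{cost}:{prob} " if a == chosen else ""
        lines.append(label + "| " + feats)
    return lines

def flush_expired(now=None):
    # no reward arrived within the window: assume zero reward (cost 0) and learn on that
    now = time.time() if now is None else now
    expired = [eid for eid, (*_, t) in pending.items() if now - t > TIMEOUT_SECONDS]
    return [join_reward(eid, 0.0) for eid in expired]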
This join operation is done for you by Azure Personalizer (https://azure.microsoft.com/en-us/services/cognitive-services/personalizer/). We have done presentations and workshops at AI NextConn conferences where we show the detailed dataflow diagram, maybe you can find one of those ... or you could just use APS.
More questions: why, in cb_explore_adf with epsilon set to 0.0, do I see probability distributions with values other than 0.0 or 1.0? This only happens at the start of a dataset:
maxpagels@MacBook-Pro:~$ vw --cb_explore_adf test --epsilon 0.0
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = test
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.666667 0.666667 1 1.0 known 0:0.333333... 6
0.833333 1.000000 2 2.0 known 1:0.5... 6
0.416667 0.000000 4 4.0 known 2:1... 6
0.208333 0.000000 8 8.0 known 2:1... 6
0.104167 0.000000 16 16.0 known 2:1... 6
0.052083 0.000000 32 32.0 known 2:1... 6
0.026042 0.000000 64 64.0 known 2:1... 6
0.013021 0.000000 128 128.0 known 2:1... 6
0.006510 0.000000 256 256.0 known 2:1... 6
finished run
number of examples = 486
weighted example sum = 486.000000
weighted label sum = 0.000000
average loss = 0.003429
total feature number = 4374
maxpagels@MacBook-Pro:~$
All examples have the same number of arms (3), and on different datasets I see the same thing at the start of a dataset. One large dataset I have takes some 20,000 examples before giving correct probabilities. --first works as expected, but not --epsilon, which at 0.0 exploration should be greedy, i.e. the probability vector should have one value of 1.0 and the rest 0.0.
With vw --cb_explore_adf, is there a command line argument to make the policy class be decision trees?
Hi @darlwen,
The stack of reductions for every VW run is defined by two things:
1) the DAG of dependencies defined in the setup function of every reduction,
e.g. here:
https://github.com/VowpalWabbit/vowpal_wabbit/blob/b8732ffec3f8c7150dace1c41434bf3cdb4d8436/vowpalwabbit/cb_explore_adf_greedy.cc#L96
if the cb_explore_adf reduction is included, we also include the cb_adf one;
2) the topological order here: https://github.com/VowpalWabbit/vowpal_wabbit/blob/b8732ffec3f8c7150dace1c41434bf3cdb4d8436/vowpalwabbit/parse_args.cc#L1246
So the final stack of reductions for each VW run is actually the sub-stack from 2) that contains:
1) reductions that you explicitly provided on your command line,
2) reductions defined in the input model file (if any),
3) reductions populated as dependencies.
In your case, ccb_explore_adf and ftrl are provided explicitly by you; the others are populated as dependencies:
ccb_explore_adf -> cb_sample
ccb_explore_adf -> cb_explore_adf_greedy -> cb_adf -> csoaa_ldf
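A toy sketch of that filtering (the order and sets below are illustrative, not VW's actual lists):

# sketch: the final stack is the topological order restricted to the enabled reductions.
topological_order = ["ftrl", "scorer", "csoaa_ldf", "cb_adf",
                     "cb_explore_adf_greedy", "cb_sample", "ccb_explore_adf"]  # illustrative

requested = {"ccb_explore_adf", "ftrl"}                  # from the command line
dependencies = {"cb_sample", "cb_explore_adf_greedy",
                "cb_adf", "csoaa_ldf", "scorer"}          # pulled in by setup functions

enabled = requested | dependencies
final_stack = [r for r in topological_order if r in enabled]
print(final_stack)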
Thanks @ataymano, much clearer now. In VW::LEARNER::base_learner* setup_base(options_i& options, vw& all), when it enters the following logic:
else
{
all.enabled_reductions.push_back(std::get<0>(setup_func));
return base;
}
my understanding is that it won't call auto setup_func = all.reduction_stack.top(); anymore. For example, when we get "ftrl_setup" it enters the else branch, so how does it make the rest of the reductions (scorer, ccb_explore_adf etc.) enabled?
I can't get a pyvw.vw object to process a data file when I instantiate it with a --data argument. Based on this fairly recent s.o. answer https://stackoverflow.com/a/62876763, my understanding is that it should do just that, but I am not having any luck. I'm using vw version 8.9.0; did something change in a recent release? I have confirmed that using the same options from the command line works, so I don't think I'm doing something obviously wrong like using a wrong file name.
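A possible workaround, assuming single-line VW text examples (made-up file name), is to read the file myself and call learn() on each line:

# sketch of a workaround: drive the data through pyvw manually instead of --data.
from vowpalwabbit import pyvw

model = pyvw.vw("--loss_function squared --quiet")
with open("train.dat") as f:       # hypothetical file name
    for line in f:
        line = line.strip()
        if line:
            model.learn(line)      # multiline (ADF) formats would need grouping first
model.finish()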
For ips, the cost-sensitive cost is the reported cost/probability, or 0 if the cost is not reported (c(a) = cost/probability * I(observed action = a)); unbiased if the probabilities are correct, usually high variance.
This is the only thing I've found that describes the implementation for csoaa: http://users.umiacs.umd.edu/~hal/tmp/multiclassVW.html. As I read it, that means csc based bandit methods:
If so, is it reasonable to think of ips and mtr as essentially the same, except that:
ips uses cost * I(action = observed action)/probability as target and 1 as weight;
mtr uses cost as target and I(action = observed action)/probability as weight.
And is the resulting loss estimate just mean(cost * I(observed action = predicted action) / probability), or something more sophisticated, like https://arxiv.org/abs/1210?
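As a concrete reading of that mean(...) expression, here is the IPS value estimate I have in mind (purely illustrative, not VW's implementation):

# sketch: IPS estimate of a candidate policy's average cost from logged CB data.
# Each logged record is (observed_action, cost, probability); policy_actions holds
# the action the candidate policy would pick for the same context.
def ips_estimate(logged, policy_actions):
    total = 0.0
    for (observed_action, cost, prob), predicted in zip(logged, policy_actions):
        if predicted == observed_action:
            total += cost / prob          # cost * I(observed == predicted) / probability
    return total / len(logged)            # mean over all logged rounds

# example: uniform logging over 2 actions, candidate policy always picks action 0
logged = [(0, -1.0, 0.5), (1, 1.0, 0.5), (0, -1.0, 0.5), (1, 1.0, 0.5)]
print(ips_estimate(logged, [0, 0, 0, 0]))  # -> -1.0, the true cost of always playing action 0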
I'm using explore_eval to evaluate exploration algorithms (e-greedy with different epsilon values). Can someone confirm that explore_eval isn't intended for use with more than one pass over the data?
The core issue I have is that I'd like to evaluate the best policy + exploration algorithm for a system in which the policy is trained once per week and then deployed. So the model itself is stationary for a week, but across e.g. a year it isn't. I'd like to use data generated by this system to do offline evaluation of new policies + exploration algorithms.
Hi all, I am reading the code to understand how VW does epsilon-greedy exploration.
I found the following code in cb_explore_adf_greedy.cc:
void cb_explore_adf_greedy::predict_or_learn_impl(VW::LEARNER::multi_learner& base, multi_ex& examples)
{
  // Explore uniform random an epsilon fraction of the time.
  // Let the base cb_adf learner score the actions; its predictions come back
  // sorted so that preds[0] is the action with the lowest predicted cost.
  VW::LEARNER::multiline_learn_or_predict<is_learn>(base, examples, examples[0]->ft_offset);
  ACTION_SCORE::action_scores& preds = examples[0]->pred.a_s;
  uint32_t num_actions = (uint32_t)preds.size();
  // Number of actions tied for the lowest predicted cost.
  size_t tied_actions = fill_tied(preds);
  // Every action receives the uniform exploration mass epsilon / num_actions...
  const float prob = _epsilon / num_actions;
  for (size_t i = 0; i < num_actions; i++) preds[i].score = prob;
  // ...and the remaining 1 - epsilon goes to the greedy action(s).
  if (!_first_only)
  {
    for (size_t i = 0; i < tied_actions; ++i) preds[i].score += (1.f - _epsilon) / tied_actions;
  }
  else
    preds[0].score += 1.f - _epsilon;
}
So it gives the action with the lowest predicted cost a probability of (1 - epsilon) + epsilon/num_actions, and each of the remaining actions a probability of epsilon/num_actions. Is this how it does exploration based on epsilon? I am a little confused about it, can someone help explain it?
Epsilon greedy works as follows (example with 4 arms):
Per round, with probability 1-epsilon, choose the best arm given the context (i.e. the arm with the lowest predicted cost); with probability epsilon, choose an arm uniformly at random.
With epsilon = 0.1, at any given round, the probability of choosing the best arm is (1 - 0.1) plus 0.1 x 1/4 -> 0.925 ("exploit"). The probability of choosing each suboptimal arm is 0.1 x (1/4) = 0.025 ("explore").
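A small sketch of that probability assignment, mirroring the cb_explore_adf_greedy snippet quoted above (the helper name is mine):

# sketch: epsilon-greedy probability vector, with the greedy (lowest predicted cost)
# action assumed to be at index 0, as in the C++ snippet.
def epsilon_greedy_pmf(epsilon, num_actions):
    base = epsilon / num_actions      # uniform exploration mass for every action
    pmf = [base] * num_actions
    pmf[0] += 1.0 - epsilon           # the greedy action gets the remaining mass
    return pmf

print(epsilon_greedy_pmf(0.1, 4))  # approximately [0.925, 0.025, 0.025, 0.025]
print(epsilon_greedy_pmf(0.0, 3))  # [1.0, 0.0, 0.0] -- what epsilon 0.0 should give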
Watched the great content at https://slideslive.com/38942331/vowpal-wabbit, thanks to all involved! A related question:
I am implementing a ranking system where the action sets per slot are not disjoint, i.e. I basically want a ranking without duplicates. The video mentions that the theory behind slates is worked out for the intersected/joint action case, but that it's still being worked on in VW.
Am I shooting myself in the foot if I use CCB instead of slates now? Is there some rough estimate of when joint action sets will be supported in slates mode? Is slates mode planned as a replacement for CCB? @jackgerrits is probably the one to ask :)