I'm using explore_eval to evaluate exploration algorithms (epsilon-greedy with different epsilon values). Can someone confirm that explore_eval isn't intended for use with more than one pass over the data?
The core issue is that I'd like to evaluate the best policy + exploration algorithm for a system in which the policy is trained once per week and then deployed. So the model itself is stationary for a week, but across e.g. a year it isn't. I'd like to use data generated by this system to do offline evaluation of new policies + exploration algorithms.
Hi all, I am reading the code to understand how VW does epsilon-greedy exploration.
I found the following code in cb_explore_adf_greedy.cc:
template <bool is_learn>
void cb_explore_adf_greedy::predict_or_learn_impl(VW::LEARNER::multi_learner& base, multi_ex& examples)
{
  // Explore uniform random an epsilon fraction of the time.
  VW::LEARNER::multiline_learn_or_predict<is_learn>(base, examples, examples[0]->ft_offset);
  ACTION_SCORE::action_scores& preds = examples[0]->pred.a_s;

  uint32_t num_actions = (uint32_t)preds.size();
  size_t tied_actions = fill_tied(preds);

  // Every action gets a baseline probability of epsilon / num_actions.
  const float prob = _epsilon / num_actions;
  for (size_t i = 0; i < num_actions; i++) preds[i].score = prob;

  if (!_first_only)
  {
    // Split the remaining 1 - epsilon mass evenly across all tied best actions.
    for (size_t i = 0; i < tied_actions; ++i) preds[i].score += (1.f - _epsilon) / tied_actions;
  }
  else
  {
    // Give all of the remaining 1 - epsilon mass to the single best action.
    preds[0].score += 1.f - _epsilon;
  }
}
It gives the action with the lowest predicted cost a score of 1 - epsilon (plus epsilon/num_actions), and the remaining actions a score of epsilon/num_actions. Is this how it does exploration based on epsilon? I am a little confused about it; can someone help explain?
Epsilon greedy works as follows (example with 4 arms):
Per round, choose the best arm given the context (i.e. the arm with the lowest cost) with probability 1-epsilon. With probability epsilon, choose an arm uniformly at random.
With epsilon = 0.1, at any given round, the probability of choosing the best arm is (1 - 0.1) + 0.1 * (1/4) = 0.925 ("exploit"). The probability of choosing each suboptimal arm is 0.1 * (1/4) = 0.025 ("explore").
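To make that concrete, here's a tiny Python sketch of the same PMF computation the C++ above performs (names are mine, and I'm ignoring the tie handling):
def epsilon_greedy_pmf(costs, epsilon):
    """Epsilon-greedy PMF: epsilon spread uniformly, remaining 1 - epsilon on the best arm."""
    num_actions = len(costs)
    pmf = [epsilon / num_actions] * num_actions  # every arm gets the baseline exploration mass
    best = min(range(num_actions), key=lambda i: costs[i])  # arm with the lowest predicted cost
    pmf[best] += 1.0 - epsilon  # the greedy arm gets all the remaining mass
    return pmf

# 4 arms, epsilon = 0.1: the best arm gets 0.925, every other arm gets 0.025.
print(epsilon_greedy_pmf([0.2, 0.5, 0.9, 0.7], epsilon=0.1))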
Watched the great content at https://slideslive.com/38942331/vowpal-wabbit, thanks to all involved! A related question:
I am implementing a ranking system where the action sets per slot are not disjoint, i.e. I basically want a ranking without duplicates. The video mentions that the theory behind slates is worked out for the intersected/joint action case, but that it's still being worked on in VW.
Am I shooting myself in the foot if I use CCB instead of slates now? Is there some rough estimate of when joint action sets will be supported in slates mode? Is slates mode planned as a replacement for CCB? @jackgerrits is probably the one to ask :)
Regarding CCBs, I have a follow-up question. The docs mention this:
"If action_ids_to_include is excluded then all actions are implicitly included". What's the use case for action_ids_to_include?
It also states "This is currently unsupported". Does that refer to action_ids_to_include or the exclusion of action_ids_to_include :)?
Hi everyone, in the VW source code, when computing the prediction, we have the following code:
float finalize_prediction(shared_data* sd, vw_logger& logger, float ret)
{
  if (std::isnan(ret))
  {
    // A NaN prediction is forced to 0, with a warning unless the logger is quiet.
    ret = 0.;
    if (!logger.quiet)
    { std::cerr << "NAN prediction in example " << sd->example_number + 1 << ", forcing " << ret << std::endl; }
    return ret;
  }
  // Otherwise, clamp the prediction to the [min_label, max_label] range tracked in shared_data.
  if (ret > sd->max_label) return (float)sd->max_label;
  if (ret < sd->min_label) return (float)sd->min_label;
  return ret;
}
If I use squared loss, then the prediction passed into the above function is 1.36777e+09, but after finalize_prediction it becomes 0. Does that make sense?
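Not sure about your run, but here's a minimal Python mirror of the clamping logic that shows one way a huge prediction can come out as 0, assuming the tracked label range happens to be [0, 0]:
import math

def finalize_prediction(ret, min_label, max_label):
    """Mirror of the clamping above: NaN -> 0, otherwise clamp to [min_label, max_label]."""
    if math.isnan(ret):
        return 0.0
    return min(max(ret, min_label), max_label)

# Hypothetical values: if the tracked label range is [0, 0] (say, only 0-valued labels
# seen so far), a huge raw prediction gets clamped straight down to 0.
print(finalize_prediction(1.36777e9, min_label=0.0, max_label=0.0))  # -> 0.0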
Hi, I plan to contribute to RLOSF 2021. As per the website, applications opened on 14th January 2021, but I am not able to find a link to the application form. Any help would be highly appreciated.
Sorry about that, the date has been moved back to Feb 1 per https://www.microsoft.com/en-us/research/academic-program/rl-open-source-fest/
I'm training a model with --cb_explore_adf and I noticed that very often the model gets biased (the probability mass function output is mostly the same, regardless of context). I've tried using regularizers, modifying the LR, adding decay, and some other things, but I'm still not convinced the model isn't biased, because when I run a few predictions for visualization, the PMF is often the same, or at least the highest probability is at the same index.
A related question: why sample from the PMF instead of just taking max(prob)? I.e., why is it recommended to use sample_custom_pmf (from: https://vowpalwabbit.org/tutorials/cb_simulation.html#getting-a-decision-from-vowpal-wabbit)? As I understand it, this adds some kind of randomization, but isn't the model already exploring when we train it with explore_adf?
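For reference, sample_custom_pmf in the tutorial is basically inverse-CDF sampling over the returned scores; roughly this (paraphrased, not copied verbatim):
import random

def sample_custom_pmf(pmf):
    """Draw one action index from a PMF via inverse-CDF sampling; returns (index, prob)."""
    total = sum(pmf)
    pmf = [p / total for p in pmf]  # normalize, since the scores may not sum exactly to 1
    draw = random.random()
    cumulative = 0.0
    for index, prob in enumerate(pmf):
        cumulative += prob
        if cumulative > draw:
            return index, prob

# With an epsilon-greedy PMF, the greedy arm is drawn ~92.5% of the time,
# but every other arm still gets drawn ~2.5% of the time.
print(sample_custom_pmf([0.925, 0.025, 0.025, 0.025]))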
The bandit bakeoff paper mentions that
We run our CB algorithms in an online fashion using
Vowpal Wabbit: .... we consider online CSC or regression oracles. Online CSC itself reduces to multiple online
regression problems in VW...
I understand the loss function and the gradient updates, but I want to know: what is the online regression model class implemented in VW?
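To make the question concrete, my (possibly wrong) understanding is that the base learner is a linear model over hashed features, updated once per example; a toy sketch of plain online least-squares:
def online_linear_regression(stream, num_features, lr=0.1):
    """Toy online regression: one SGD step on squared loss per (x, y) example."""
    w = [0.0] * num_features
    for x, y in stream:  # x is a sparse dict {feature_index: value}
        pred = sum(w[i] * v for i, v in x.items())
        grad = 2.0 * (pred - y)  # derivative of (pred - y)^2 w.r.t. pred
        for i, v in x.items():
            w[i] -= lr * grad * v  # chain rule: d pred / d w[i] = v
    return w

# Hypothetical stream where y = 3 * x0; w[0] should approach 3.
stream = [({0: 1.0}, 3.0), ({0: 2.0}, 6.0)] * 50
print(online_linear_regression(stream, num_features=1))
VW's actual updates are fancier (adaptive, normalized, and importance-invariant variants), but as far as I can tell the model class itself is linear in the hashed feature space.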
I am following the tutorial on CTR with cb_explore_adf, and I would love to know if it is possible to use a numeric feature in the Action namespace instead of the categorical article feature.
In the tutorial, it tells us to do it like this:
shared |User user=Tom time_of_day=morning
|Action article=politics
|Action article=sports
|Action article=music
|Action article=food
Is it possible to pass numerical values and let the model generalize better when a new value appears in between?
shared |User user=Tom time_of_day=morning
|Action price:2.99
|Action price:10.99
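If it helps: my understanding is that name:value in VW's text format is parsed as a numeric feature, while article=politics is just a string feature (the whole token becomes the feature name). A quick, untested sketch with the Python bindings, using the same flags as the tutorial:
from vowpalwabbit import pyvw

# Same flags as the CTR tutorial; -q UA crosses User and Action features.
vw = pyvw.vw("--cb_explore_adf -q UA --quiet --epsilon 0.2")

# price:2.99 is parsed as a numeric feature named "price" with value 2.99,
# whereas article=politics is a one-hot string feature named "article=politics".
example = (
    "shared |User user=Tom time_of_day=morning\n"
    "|Action price:2.99\n"
    "|Action price:10.99"
)
print(vw.predict(example))  # PMF over the two candidate actions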
Hello everyone,
I'm Harsh Sharma, an undergraduate student from IIIT, Gwalior, pursuing Computer Science and Engineering. I'm interested in participating in the Microsoft RL Open Source fest this year, and I'm specifically interested in working on these projects:
17 - RL-based query planner for open-source SQL engine
20 - AutoML for online learning
Since I've worked with Deep Learning for the NL2SQL task before, I would like to work on 17. Could someone here please clarify what the "query planner" here refers to? Does it mean join query optimization? Also, I'd be really grateful if someone could guide me as to what would be the first step to implement such a query planner in an SQL engine.