In simple_label.h, is the Base field the one explained on this page: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format ?
Is initial only used for residual regression? In other cases, should it always be initialized to 0?
Thanks for the answer. I know for sure that VW works with costs. I have also taken care to have the probabilities in a logical range and not imbalanced.
I will give the --cb_type suggestion a try.
UPDATE: In fact, I realised that the problem only arises when I use --bag as my policy. --epsilon and --softmax, on the other hand, work well.
Something I've been breaking my head over for the past months (master's thesis) is the claim that IPS is unbiased. This raises the following questions for me:
A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res., 2015. ISSN 1532-4435. URL http://www.jmlr.org/papers/volume16/swaminathan15a/swaminathan15a.pdf
Please let me know if this is the right place to ask this question.
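For what it's worth, here is the toy simulation I used to convince myself of the unbiasedness claim. All numbers (actions, reward rates, logging probabilities) are made up; only the IPS reweighting itself is the point:

```python
import random

random.seed(0)

ACTIONS = [0, 1, 2]
TRUE_REWARD = {0: 0.2, 1: 0.5, 2: 0.8}    # hypothetical per-action expected rewards
LOGGING_PROBS = {0: 0.5, 1: 0.3, 2: 0.2}  # behavior policy (known, non-zero everywhere)
TARGET_ACTION = 2                         # deterministic target policy

def ips_estimate(n):
    """Estimate the target policy's value from logs of the behavior policy."""
    total = 0.0
    for _ in range(n):
        # log an interaction under the behavior policy
        a = random.choices(ACTIONS, weights=[LOGGING_PROBS[x] for x in ACTIONS])[0]
        r = 1.0 if random.random() < TRUE_REWARD[a] else 0.0
        # IPS: reweight the observed reward by 1[a == pi(x)] / p_logging(a)
        if a == TARGET_ACTION:
            total += r / LOGGING_PROBS[a]
    return total / n

print(ips_estimate(200_000))  # should land close to TRUE_REWARD[2] = 0.8
```

Even though the target action is logged only 20% of the time, the estimate converges to the target policy's true value, which is the unbiasedness claim in miniature.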
Hey @JohnLangford, thanks for the answer. I understand the proof. I just watched the video. At 21:50 you make the claim that "we don't try for every context every action, that would be crazy". I agree.
But then you're assuming some generalization over contexts, right? You're assuming that similar contexts have similar probabilities of occurring and similar action selection distributions.
vw -d <data_file> --cache_file <file_name> -t. The magic arg is -t, which causes no training to occur. You can use either -c or --cache_file; -c will just use a default name.
@JohnLangford, thanks for your answers so far. I understand the unbiasedness now.
W.r.t. variance I'm thinking the following. Obviously, variance depends on the importance weights, but also on the randomness in the contexts (and rewards). So when you say "the set of features x can be very large and high-dimensional", this is true, but more features also introduce more variance (when the behavior and target policy disagree much).
So w.r.t. contexts we're dealing with the classical bias/variance trade-off from supervised learning, right? Using more features to describe a context will give a less biased answer of a policy's true value using IPS, but will inherently also introduce more variance in the estimation.
Hm, I'm starting to doubt the correctness of my previous statement now. I read Section 4, "Variance Analysis", in . It states that the variance of IPS is composed of three terms:
1) Randomness in rewards
2) Randomness in x
3) Importance weighting penalty
1) and 3) fully make sense to me. It's 2) that I'm slightly confused about. The claim is that $\mathrm{Var}_x[\varrho_\pi(x)]$ is the contribution due to randomness in x (assuming that $\delta \approx 0$), irrespective of how many features you use to describe x.
The way I read this is: the more expected rewards differ per context-action pair, the higher this variance term. The assumption is then, that if contexts are more random these expected rewards are more likely to be different.
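To get a feel for term 3) in isolation, I ran a tiny simulation (a made-up toy setup, not from the paper): a deterministic target policy plays action 1, the behavior policy plays it with probability p_log, and action 1 always pays reward 1. The IPS estimate stays unbiased as p_log shrinks, but its per-sample variance blows up like 1/p:

```python
import random
import statistics

random.seed(1)

def ips_samples(p_log, n):
    """Per-sample IPS values for a target policy that always plays action 1,
    logged under a behavior policy that plays action 1 with probability p_log."""
    out = []
    for _ in range(n):
        a = 1 if random.random() < p_log else 0
        r = 1.0 if a == 1 else 0.0   # toy setup: action 1 always pays reward 1
        out.append(r / p_log if a == 1 else 0.0)
    return out

for p_log in (0.5, 0.1, 0.01):
    vals = ips_samples(p_log, 50_000)
    print(p_log, round(statistics.mean(vals), 3), round(statistics.variance(vals), 2))
```

The mean stays near 1 in every row (unbiasedness), while the variance grows roughly like 1/p_log - 1, which is the importance weighting penalty made concrete.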
 Doubly Robust Policy Evaluation and Learning
@JohnLangford I have another question which I hope you might be able to answer. I tried to reproduce the experiment from Section 5.1 in . I only used the ecoli data set.
1) [policy evaluation] Why do you fit the linear loss model on the training set? Doesn't it make more sense to fit it on the test set (with the actions as chosen by the behavior policy)? When I do this, DM has (almost) no bias.
2) [policy optimization] When I use the normal argmax-regressor approach (DM), it works just as well as IPS with DLM. I almost feel like this is to be expected: in the end we sample actions uniformly. However, in the paper DM was omitted because "[it] is significantly worse on all datasets, as indicated in Fig. 1". Am I missing something here?
 Doubly Robust Policy Evaluation and Learning
Hi all, I have a question regarding the costs in --cb_explore_adf.
It seems that the training algorithm is not scale invariant with respect to the costs: if one multiplies all costs by the same factor and trains a new bandit, the results are different.
For example, in my use-case the cost is 0/1 depending on click/no-click. However, the model converges faster if I set the cost for no-click to 10.
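One plausible source of the non-invariance (my guess, not verified against VW's source) is that adaptive step-size rules do not commute with rescaling the labels/costs: the gradient normalization depends on the history of gradient magnitudes, so the trajectory on 10x-scaled costs is neither equal to nor a constant multiple of the unscaled one. A toy AdaGrad-style SGD on a single constant feature illustrates this:

```python
def adagrad(labels, lr=0.5):
    """Toy AdaGrad-style SGD on squared loss with a single constant feature x = 1.
    This is an illustration of adaptive updates in general, NOT VW's actual rule."""
    w, G = 0.0, 1e-8
    for y in labels:
        g = w - y                 # gradient of 0.5 * (w - y)^2
        G += g * g                # accumulated squared gradient
        w -= lr * g / (G ** 0.5)  # normalized step
    return w

w_unit = adagrad([1.0, 0.0, 1.0])    # costs in {0, 1}
w_x10 = adagrad([10.0, 0.0, 10.0])   # same data, all costs scaled by 10
print(w_unit, w_x10)  # the second is neither w_unit nor 10 * w_unit
```

With plain fixed-rate SGD the scaled run would be an exact 10x copy of the unscaled one; with the adaptive rule the two runs genuinely diverge, which matches the observation that rescaling costs changes convergence behavior.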
I'm trying to understand the cache files produced by vw and am pretty lost. Based on what I understand from cache.cc, for each data point there should be, in order:
But for my simple test file
-1 | 40:0.5922770329868341 22:0.41005119833784964
1 | 9:0.7918840417547113 21:0.6883349776040766
I am not finding that in the cache file. I can find 0.592 in bytes 52:55 and 0.41 in bytes 58:61, but by my reckoning there should be 9 bytes between them. Additionally, I can't find 1 or -1 anywhere. Please help! Thanks
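One likely reason fixed byte offsets don't line up (worth double-checking against the cache.cc in your VW version): the cache stores integers with a variable-length, 7-bits-per-byte encoding, so the gap between fields depends on the values themselves. A generic decoder sketch for that LEB128-style scheme:

```python
def decode_varint(buf, pos=0):
    """Decode one unsigned integer stored 7 bits per byte, low bits first,
    high bit = 'more bytes follow' (the usual LEB128-style scheme; VW's
    run-length encode/decode in cache.cc looks like this, but verify)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):       # continuation bit clear: this was the last byte
            return result, pos
        shift += 7

print(decode_varint(bytes([0x96, 0x01])))  # 150 encoded in two bytes -> (150, 2)
print(decode_varint(bytes([0x05])))        # small values fit in one byte -> (5, 1)
```

So a feature index like 40 takes one byte while a larger hash takes several, which would explain why the distance between the two floats you found isn't constant. The labels may also live in a separate record rather than inline next to the features.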