Yuhan GUO
@silverguo
I found that LBFGS produces a bad model if all weights are initialized to 0. Warm-starting from a trained SGD model could be a good idea; are there other methods to solve that problem?
Jacob Alber
@lokitoth
Hey @silverguo: Sorry for taking so long to get to this. I am not sure about your first few questions, but per this paper (https://arxiv.org/pdf/1110.4198.pdf) it seems that your suggestion of warm-starting from a trained SGD model is a good approach.
mikgor
@mikgor
Rodrigo Kumpera
@kumpera
@mikgor how did you invoke VW?
Jack Gerrits
@jackgerrits
@mikgor I replied to your SO question, hopefully that helps you understand the format
Yuhan GUO
@silverguo
Hi, I have a question about the code. In the struct label_data defined in simple_label.h, is initial the same as the Base described on this page: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format ?
Is initial only used for residual regression? In other cases, should it always be initialized to 0?
John
@JohnLangford
Yes, yes, and yes :-)
Sandeep Bhutani
@sandeepbhutani304
Hi... Can anyone tell me what is special about Vowpal Wabbit? There are a lot of machine learning libraries; what does VW offer that makes it worth learning for developers?
John
@JohnLangford
The short answer is 'online interactive learning' which other systems do not really support. Here's a quora post https://www.quora.com/How-relevant-is-Vowpal-Wabbit-in-2016 I did a few years ago which is still relevant and the new website (http://vowpalwabbit.org) has more info.
hermes-z
@hermes-z
Hi all. I have a question regarding the bandit models of VW. I am currently running some simulations for a recommendation task with cb_explore_adf. Every arm has a probability of click and a cost, which is paid when there is no click. In this case, VW always identifies the arm with the minimum expected cost.
However, if I turn the problem around and try to reward the click events (by giving a negative cost), then it fails to converge to the optimal solution.
1. Is there an asymmetry between positive and negative costs in VW?
2. Is there a cli argument I should pass to the model?
Thanks
John
@JohnLangford
First, vw operates on a cost basis---make sure you are getting that right.
Second, an asymmetry may exist in your problem if the probability of a click is particularly low or high.
Third, try using --cb_type dr to automatically learn a good offset. This doesn't quite remove asymmetry in the solution (which does exist due to default parameters), but it goes a substantial way towards doing so.
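One way to see why the cost encoding can matter is a plain IPS estimate on a tiny hand-made log. This is only a sketch under stated assumptions: the log, the policy, and the function name ips_estimate are hypothetical, and none of this is VW code or VW's default estimator. It shows that shifting every cost by a constant (paying 1 for a no-click versus earning -1 for a click) does not shift a finite-sample IPS estimate by that same constant.

```python
# Hypothetical logged bandit data: (context, action, cost, probability of the logged action).
# Cost view: a no-click costs 1, a click costs 0.
logged = [
    ("u1", 0, 1.0, 0.5),
    ("u2", 0, 0.0, 0.5),
    ("u3", 0, 1.0, 0.5),
    ("u4", 1, 0.0, 0.5),
]


def always_zero(context):
    """Hypothetical target policy: always choose action 0."""
    return 0


def ips_estimate(log, policy, offset=0.0):
    """Average of (cost + offset) / p over examples where the policy agrees with the log."""
    total = 0.0
    for context, action, cost, prob in log:
        if policy(context) == action:
            total += (cost + offset) / prob
    return total / len(log)


# Same data under two encodings that differ by a constant offset of -1.
cost_view = ips_estimate(logged, always_zero)                 # no-click = 1, click = 0
reward_view = ips_estimate(logged, always_zero, offset=-1.0)  # no-click = 0, click = -1

# If the estimate were offset-invariant, reward_view + 1 would equal cost_view.
print(cost_view, reward_view + 1.0)
```

The two numbers differ because the importance weights of the matching examples average 1.5 here instead of 1. That average is 1 only in expectation, so on any finite log the choice of offset changes which policies look good.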
hermes-z
@hermes-z

Thanks for the answer. I know for sure that VW works with costs. I have also taken care to have the probabilities in a logical range and not imbalanced.

I will give the --cb_type suggestion a try.

UPDATE: In fact, I realised that the problem only arises when I use --bag as my policy. --epsilon and --softmax on the other hand work well.

John
@JohnLangford
I'm not sure what's up. --bag should have little difference from --epsilon in terms of loss asymmetry. If you have a concrete example you might want to share, you could open an issue on GitHub.
hermes-z
@hermes-z
Ok. I will try to create a reproducible example and open an issue.
Boris Mattijssen
@borismattijssen

Hi folks,

Something that I've been racking my brain over in the past months (master thesis) is the claim that IPS is unbiased. This raises the following questions for me:

1. We're still creating bias in how contexts are defined, right? The less specific a context is, the more bias we introduce.
2. What happens when contexts are unique? How valid are the results from IPS still? For example, in the data set presented in [1], about one in three contexts is unique while the mean probability of choosing an action is 0.14. This creates bias, right?
3. What is the difference between the IPS estimator and grouping data by (context,action) and using the average reward?

[1] A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res., 2015. ISSN 1532-4435. URL http://www.jmlr.org/papers/volume16/swaminathan15a/swaminathan15a.pdf

Please let me know if this is the right place to ask this question.

John
@JohnLangford
IPS is unbiased for every example individually. See http://hunch.net/~rwil for a simple proof of this.
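The per-example unbiasedness claim can be checked by brute force on a toy problem. The sketch below is an illustration under assumptions: the contexts, actions, costs, and probabilities are made up, and the only requirement is that the logging policy gives nonzero probability to every action the target policy might choose. It enumerates every (context, action) outcome and compares the expected IPS term with the true expected cost of the target policy.

```python
from fractions import Fraction as F

# Hypothetical toy problem (all numbers are invented for illustration).
context_prob = {"x0": F(1, 3), "x1": F(2, 3)}
logging_policy = {  # probability of each action given the context, all nonzero
    "x0": {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)},
    "x1": {"a": F(1, 10), "b": F(3, 10), "c": F(3, 5)},
}
cost = {  # deterministic cost of each (context, action) pair
    ("x0", "a"): F(1), ("x0", "b"): F(0), ("x0", "c"): F(1, 2),
    ("x1", "a"): F(1, 4), ("x1", "b"): F(1), ("x1", "c"): F(3, 4),
}
target_policy = {"x0": "b", "x1": "a"}  # deterministic policy to evaluate


def ips_term(x, a, c, p, policy):
    """IPS value of one logged example: c / p if the policy agrees with the log, else 0."""
    return c / p if policy[x] == a else F(0)


def true_value(policy):
    """Expected cost of following the policy."""
    return sum(px * cost[(x, policy[x])] for x, px in context_prob.items())


def expected_ips(policy):
    """Exact expectation of the IPS term over contexts and logged actions."""
    total = F(0)
    for x, px in context_prob.items():
        for a, pa in logging_policy[x].items():
            total += px * pa * ips_term(x, a, cost[(x, a)], pa, policy)
    return total


print(true_value(target_policy), expected_ips(target_policy))
```

The probability of logging the matching action cancels against the 1/p weight, so the two values agree exactly however rarely the logging policy picks that action. Unlike averaging rewards per (context, action) group, nothing here requires a context to be seen more than once.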
Boris Mattijssen
@borismattijssen

Hey @JohnLangford, thanks for the answer. I understand the proof. I just watched the video. At 21:50 you make the claim that "we don't try for every context every action, that would be crazy". I agree.

But then you're assuming some generalization over contexts, right? You're assuming that similar contexts have similar probabilities of occurring and similar action selection distributions.

John
@JohnLangford
Generalization is typically provided by a policy class. See the tutorial here: http://hunch.net/~jl/projects/prediction_bounds/prediction_bounds.html
Boris Mattijssen
@borismattijssen
I'm not really sure if I'm understanding how prediction bounds fit in.
John
@JohnLangford
They provide a definition of generalization---good performance on one data set implies good performance on another.
hermes-z
@hermes-z
How is it possible to use several input files? The wiki claims that this is possible, and the "num sources" that appears in the model info also suggests so, but I can't make it work. "vw -d file1 -d file2" throws an error, while "vw -d file1 file2" ignores file2. Thanks for the help.
Jack Gerrits
@jackgerrits
It is not currently supported to have multiple input files. There is an issue tracking this feature request here: VowpalWabbit/vowpal_wabbit#1895
Could you point me to what you found on the wiki?
John
@JohnLangford
I believe it is actually possible to use multiple cache files at present.
Jack Gerrits
@jackgerrits
You are right John, thanks for the correction. So @hermes-z, you could preprocess multiple data files into cache files and then run vw over the multiple cache files.
hermes-z
@hermes-z
@jackgerrits Thank you. How can I preprocess a data file into a cache file without actually doing the training?
I will try to find the part in the wiki that suggests that multiple input files are possible.
Jack Gerrits
@jackgerrits
I'm not sure what the best command line to use there would be. @cheng-tan, what are you doing for cache file generation?
But I think this would work decently well: vw -d <data_file> --cache_file <file_name> -t. The magic arg is -t, which causes no training to occur. You can use either -c or --cache_file; -c will just give a default name.
John
@JohnLangford
Parsing is the dominant cost so it's not too relevant what model-based parameters are used. Nevertheless, there is --noop if you really want to minimize computation :-)
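The suggestions above can be wrapped in a small helper that builds one command line per data file. This is a sketch only: the helper name cache_only_command and the file names are hypothetical, the flags are the ones quoted in this thread, and swapping -t for --noop (rather than combining them) is an assumption. The commands are printed, not executed.

```python
import shlex


def cache_only_command(data_file, cache_file, use_noop=False):
    """Build a vw command that parses data_file and writes cache_file.

    By default -t is used so that no training occurs. With use_noop=True the
    --noop flag is used in its place; whether the two flags can be combined
    is not stated in the thread, so they are kept separate here.
    """
    cmd = ["vw", "-d", data_file, "--cache_file", cache_file]
    cmd.append("--noop" if use_noop else "-t")
    return cmd


# Hypothetical file names: one cache file per input file.
data_files = ["file1.txt", "file2.txt"]
commands = [cache_only_command(f, f + ".cache") for f in data_files]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or passed to subprocess.run as a list if vw is on the PATH.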
Boris Mattijssen
@borismattijssen

@JohnLangford, thanks for your answers so far. I understand the unbiasedness now.

W.r.t. variance I'm thinking the following. Obviously, variance depends on the importance weights, but also on the randomness in the contexts (and rewards). So when you say "the set of features x can be very large and high-dimensional", this is true, but more features also introduce more variance (when the behavior and target policy disagree much).

So w.r.t. contexts we're dealing with the classical bias/variance trade-off from supervised learning, right? Using more features to describe a context will give a less biased answer of a policy's true value using IPS, but will inherently also introduce more variance in the estimation.

John
@JohnLangford
The number of features is in general unrelated to the degree to which the data gathering policy agrees with the evaluated policy.
Of course, variance does matter.
Boris Mattijssen
@borismattijssen
Though I would say that if there are more features, there is more data to disagree on, right?
Besides that, can you confirm that this statement is correct: "So w.r.t. contexts we're dealing with the classical bias/variance trade-off from supervised learning, right? Using more features to describe a context will give a less biased answer of a policy's true value using IPS, but will inherently also introduce more variance in the estimation."?
Boris Mattijssen
@borismattijssen

Hm I'm starting to doubt the correctness of my previous statement now. I read Section 4 "Variance Analysis" in [1]. So it states that variance in IPS is composed of three terms:

1) Randomness in rewards
2) Randomness in x
3) Importance weighting penalty

1) and 3) fully make sense to me. It's in 2) that I'm slightly confused. So the claim is that $\mathrm{Var}_x[\varrho_\pi(x)]$ is the contribution due to randomness in x (assuming that $\delta \approx 0$), irrespective of how many features you use to describe x.
The way I read this is: the more expected rewards differ per context-action pair, the higher this variance term. The assumption is then, that if contexts are more random these expected rewards are more likely to be different.

[1] Doubly Robust Policy Evaluation and Learning

Boris Mattijssen
@borismattijssen

@JohnLangford I have another question which I hope you might be able to answer. I tried to reproduce the experiment from Section 5.1 in [1]. I only used the ecoli data set.

1) [policy evaluation] Why do you fit the linear loss model on the training set? Doesn't it make more sense to fit it on test (with the actions as chosen by the behavior policy). When I do this, DM has (almost) no bias.
2) [policy optimization] When I use the normal argmax-regressor approach (DM) it works just as well as IPS with DLM. I almost feel like this is as expected. In the end we sample actions uniformly. However, in the paper DM was omitted because "[it] is significantly worse on all datasets, as indicated in Fig. 1". Am I seeing something wrong here?

[1] Doubly Robust Policy Evaluation and Learning

hermes-z
@hermes-z

Hi all, I have a question regarding the costs in --cb_explore_adf.

It seems that the training algorithm is not scale invariant with respect to the costs, meaning that if one multiplies all costs by the same factor and trains a new bandit, then the results are different.

1. Why is this happening? Isn’t the problem reduced to minimizing a linear function?
2. Are there any guidelines/heuristics on how to choose the scale of the costs?

For example, in my use-case the cost is 0/1 depending on click/no-click. However, the model converges faster if I set the cost for no-click to 10.

Thanks

AaEll
@AaEll
Hi all, I'm using a container w/ VW 8.7.0 and I'm running into a Heisenbug when running vw --daemon. Every other request causes a "Write Error : Bad File Descriptor".
Are there some flags I'm supposed to have turned on?
Jack Gerrits
@jackgerrits
@JohnLangford could you chime in about the cost sensitivity?
@AaEll you should not need to provide any other args. If you can put together a repro can you open an issue and I'll look into it?
John
@JohnLangford
Wrt costs, the essential issue is that the update rule is not scale or encoding invariant. For encoding invariance, using --cb_type dr typically helps. For scale invariance, you might try using --coin which has a scale-invariant update rule.
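To make "not scale invariant" concrete, here is a toy comparison of two update rules fitting a single constant prediction to a stream of costs. It is a sketch under assumptions: neither rule is VW's actual update, the AdaGrad-style step is chosen only because it is a familiar adaptive rule, and the function name fit and the cost stream are hypothetical. With zero initialization, plain SGD on squared loss scales exactly with the labels, while the adaptive step does not.

```python
import math


def fit(labels, adaptive, lr=0.5):
    """Fit a constant prediction w to labels with squared loss 0.5 * (w - y)**2.

    adaptive=False: plain SGD with a fixed learning rate.
    adaptive=True:  AdaGrad-style step, dividing by the root of the summed squared gradients.
    """
    w, grad_sq_sum = 0.0, 0.0
    for y in labels:
        g = w - y  # gradient of the squared loss at the current w
        if adaptive:
            grad_sq_sum += g * g
            if grad_sq_sum > 0.0:
                w -= lr * g / math.sqrt(grad_sq_sum)
        else:
            w -= lr * g
    return w


costs = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0]  # hypothetical 0/1 cost stream
scaled = [10.0 * c for c in costs]      # same stream with every cost multiplied by 10

sgd_1, sgd_10 = fit(costs, adaptive=False), fit(scaled, adaptive=False)
ada_1, ada_10 = fit(costs, adaptive=True), fit(scaled, adaptive=True)

# Rescale the predictions back to the original units before comparing.
print("plain SGD:", sgd_1, sgd_10 / 10.0)
print("adaptive: ", ada_1, ada_10 / 10.0)
```

For plain SGD the rescaled prediction matches the unscaled one. For the adaptive rule the step size does not grow with the cost scale, so the two runs end up at different points in the original units. Multiplying all costs by a factor therefore changes the learning dynamics and not just the units.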
hermes-z
@hermes-z
Thanks for the answer and the tip. Isn’t --cb_type dr the default one?
Jack Gerrits
@jackgerrits
Mtr is the current default
AaEll
@AaEll
@jackgerrits I opened an issue. Thanks!
pmcvay
@pmcvay

I'm trying to understand the cache files produced by vw and am pretty lost. Based on what I understand from cache.cc, for each data point there should be, in order:

1. 4 bytes for a UInt32 label
2. 4 bytes for a Float weight
3. 8 bytes for a size_t cached tag
4. 1 byte for an unsigned char num_indices (but if this is the number of indices it doesn't seem like vw can handle problems where the feature dimension is > 255 which confuses me)
Then for each index in num_indices there should be
1. 1 byte for unsigned char index
2. 8 bytes for size_t storage
3. Storage bytes for data

But for my simple test file
-1 | 40:0.5922770329868341 22:0.41005119833784964
1 | 9:0.7918840417547113 21:0.6883349776040766

I am not finding that in cache file. I can find 0.592 in bytes 52:55 and 0.41 in bytes 58:61 but there should be 9 bytes according to my thinking between them. Additionally, I can't find 1 or -1 anywhere. Please help! Thanks

Jack Gerrits
@jackgerrits
There is run length encoding and zig zag encoding in there as well. Another way to approach this could be to actually just create a base_learner whose role in life is to edit the label, and let the existing infrastructure handle parsing and writing cache files. This design would not be something that we'd merge into master, but for a quick hack it shouldn't be too hard. I can put together something soon and share it with you.
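For readers unfamiliar with these encodings, here is a generic sketch of zig-zag encoding plus a variable-length integer packer. Two assumptions are worth flagging: the function names are hypothetical, and treating "run length encoding" as a 7-bits-per-byte varint is a guess about what is meant here, not a description of VW's actual cache layout. The sketch only shows why small values can take far fewer bytes than a fixed 8-byte field.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return ((n << 1) ^ (n >> 63)) & MASK_64


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_encode(value):
    """Pack an unsigned integer into 7-bit groups, least significant first.

    The high bit of each byte is set when more bytes follow (assumed format).
    """
    out = bytearray()
    while value >= 128:
        out.append((value & 127) | 128)
        value >>= 7
    out.append(value)
    return bytes(out)


# Small signed values, such as a difference between two nearby feature indices,
# become small unsigned values and then fit in a single byte.
for delta in (0, -1, 1, -18, 300):
    encoded = varint_encode(zigzag_encode(delta))
    print(delta, zigzag_encode(delta), len(encoded))
```

If index differences are stored this way, an index can occupy a single byte instead of a fixed-width field, which would be consistent with the values in the test file sitting closer together than the fixed-size layout predicts.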
pmcvay
@pmcvay
That would be amazing - thanks!