
Otherwise you probably will not have direct access to the weights

Under Ubuntu, the easiest way to get this set up is likely via CMake; are you able to follow the CMake build instructions to build VW itself? (Guessing yes, but I want to confirm.)

@silverguo: For a really lightweight, simple app that shows how to initialize VW, parse some examples, feed them into it, and then save a model, take a look at: https://github.com/lokitoth/vw_use_demo

It does not really cover deeper scenarios, but it may help you get unblocked. I will be offline for a bit, but will be back on later in case you have questions, so please post them here :-)

Hi, I've hit a problem with the constant. I train a model on a large, imbalanced dataset with a precomputed constant (set via the `--constant` option; the value is about -4.4 because positive samples are ~1.2%), but after the first pass the value of the constant becomes nearly 0. I then turned off `l2`, but the same thing happens. Since I have a big training set and a lot of features, there are some collisions with VW's constant hash (116060). Do you think a collision with the constant hash could impact the value of the constant?
If you get feature collisions, then the value will definitely be impacted, because it is no longer actually constant. When you have multiple colliding features, what ends up happening (under sparse features; I am not completely sure what the behaviour is for dense) is that the feature is added multiple times, which has the effect of adding the feature values together.

Thus my expectation is that within a given namespace, if you have two features with the same feature hash (f), one with value 1 (the default for string features) and the other with value c, you will get two feature entries in your sparse vector for index=f, one with value=1 and the other with value=c. When you compute the model's prediction, you will end up multiplying the weight for feature index f (w_f) by 1 and by c, and adding both of those partial products to the final regression output.

(I am not very confident about what happens when the base learner is not the linear regressor; I will have to look that up)

Also, I'm a bit confused about the constant in VW. If I use logistic loss for binary classification, the constant should be the intercept of the linear function?

but in the VW training log, there is a "best constant", and that one is the weighted average of the labels

Just to make sure: the constant should be the intercept of the linear function (not the same as the best constant in the VW log)?

@silverguo - I spent a bit of time digging through this.

The idea of best constant is the following: Given a problem with examples X_i, labels Y_i, a linear model class (Y-hat_i = W * X_i + C), and loss function L_i = L(Y-hat_i, Y_i), find a value of 'C' such that the total loss is minimized for a weight-vector W = 0.

In other words:

`Min( Sum[i]( L(C, Y_i) ) )`

In the case of logistic loss, which is L(Y-hat_i, Y_i) = log( 1 + exp( -Y-hat_i * Y_i ) ), the constant-only loss for example i is:

`L_i = log( 1 + exp( -Y_i * C ) )`

Since Y_i is one of -1 or 1, let 'a' be the number of examples with label '1' and 'b' be the number of examples with label '-1'. The total loss over your dataset is:

`L_total = a * log ( 1 + exp(-C) ) + b * log ( 1 + exp (C) )`

`L_total` is minimized at the value of C where `d/dC{L_total} = 0`:

`d/dC{L_total} = [( b * exp(C) ) / ( 1 + exp(C) )] - [( a * exp(-C) ) / ( 1 + exp(-C) ) ] = 0`

Let z = exp(C):

`[(b * z) / (1 + z)] - [(a / z) / (1 + 1/z)] = 0`

For all k, C in R, 1 + exp(k * C) is positive, thus (1 + z)(1 + 1/z) != 0, so:

```
[(b * z) * (1 + 1/z)] - [(a / z) * (1 + z)] = 0
[b*z + b] - [a/z + a] = 0
```

For all C in R, z = exp(C) > 0, thus z != 0, so:

`b*z^2 + (b - a)*z - a = 0`

Solve for quadratic:

```
z = [(a - b) +- sqrt((b - a)^2 + 4 * b * a)] / (2*b)
z = [(a - b) +- sqrt(b^2 - 2ab + 4ab + a^2)] / (2*b)
z = [(a - b) +- sqrt((a + b)^2)] / (2*b)
z = [(a - b) +- (a + b)] / (2*b)
z = {(2*a) / (2*b), -(2*b)/(2*b)}
= {a / b, -1}
```

Since we have the restriction z > 0

```
z = a / b
C = ln(a / b)
```

Looking in best_constants.cc, at lines 48-59 we compute that value for best-constant:

```
else if (funcName.compare("logistic") == 0)
{
  label1 = -1.;  // override {-50, 50} to get proper loss
  label2 = 1.;
  if (label1_cnt <= 0)
    best_constant = 1.;
  else if (label2_cnt <= 0)
    best_constant = -1.;
  else
    best_constant = log(label2_cnt / label1_cnt);
}
```

Are you sure the value you are seeing is the weighted average, and not the log-likelihood?

Sorry for the wall-of-text

I think the reason is that I am currently using an old version of VW (7.7); best_constant in that version is calculated by https://github.com/VowpalWabbit/vowpal_wabbit/blob/7.7/vowpalwabbit/main.cc#L73

@silverguo: Thinking about the implications more deeply, I would not expect the final constant to be the same as the "best constant", because "best constant" assumes that the non-constant features have no value in separating the two classes. It asks: if I could only choose a constant as my regressor, what constant gives me the least loss? In essence: "If I had no features, what would be the best naive starting point?"

On the flip side, if the features capture the separation perfectly, one could end up with a situation where the best value for the constant is 0. Imagine one has a dataset with unequal counts of examples of class1 and class2, with only a single feature, whose value happens to be the label of the class (-1 or 1). As above, the best-constant will find a non-zero best, which, taken alone, will produce the best classifier one can get. However, if we run the training, I would expect the model to eventually converge on weight = 1, with the constant 0, because there is a feature inside the examples which perfectly captures the separation.

As described in the wiki page https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Command-Line-Arguments:

`--initial_pass_length` is a trick to make LBFGS quasi-online. You must first create a cache file; VW will then treat initial_pass_length as the number of examples in a pass, resetting to the beginning of the file after each pass. After running --passes many times, it starts over, warmstarting from the final solution, with twice as many examples.

Yes, except that it's 2n+1. The relevant code is here: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/parse_dispatch_loop.h#L35

I found that LBFGS will produce a bad model if all weights are initialized to 0. Warmstarting from a trained SGD model could be a good idea; are there other methods to solve that problem?

Hey @silverguo: Sorry for taking so long to get to this. I am not sure about your first few questions, but per this paper (https://arxiv.org/pdf/1110.4198.pdf) it seems that your suggestion of warmstarting from a trained SGD model is a good approach.

Hi, I have a question about the code. For the struct `label_data` defined in `simple_label.h`: is `initial` the same as the `Base` explained on this page https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format ? Is `initial` only used for residual regression? In other cases, should it always be initialized to 0?

The short answer is 'online interactive learning' which other systems do not really support. Here's a quora post https://www.quora.com/How-relevant-is-Vowpal-Wabbit-in-2016 I did a few years ago which is still relevant and the new website (http://vowpalwabbit.org) has more info.

Hi all. I have a question regarding the bandit models in VW. I am currently running some simulations for a recommendation task with cb_explore_adf. Every arm has a probability of click and a cost, which is paid when there is no click. In this setting, VW always identifies the arm with the minimum expected cost.

However, if I turn the problem around and try to reward the click events (by giving a negative cost), then it fails to converge to the optimal solution.

- Is there an asymmetry between positive and negative costs in VW?
- Is there a cli argument I should pass to the model?

Thanks

Second, an asymmetry may exist in your problem if the probability of a click is particularly low or high.

Third, try using --cb_type dr to automatically learn a good offset. This doesn't quite remove asymmetry in the solution (which does exist due to default parameters), but it goes a substantial way towards doing so.

Thanks for the answer. I know for sure that VW works with costs. I have also taken care to have the probabilities in a logical range and not imbalanced.

I will give the `--cb_type` suggestion a try.

UPDATE: In fact, I realised that the problem only arises when I use `--bag` as my exploration policy. `--epsilon` and `--softmax`, on the other hand, work well.

Hi folks,

Something that I've been breaking my head over for the past months (master's thesis) is the claim that IPS is unbiased. This raises the following questions for me:

- We're still creating bias in how contexts are defined, right? The less specific a context is, the more bias we introduce.
- What happens when contexts are unique? How valid are the results from IPS still? For example, in the data set presented in [1], about one in three contexts is unique while the mean probability of choosing an action is 0.14. This creates bias, right?
- What is the difference between the IPS estimator and grouping data by (context,action) and using the average reward?

[1] A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res., 2015. ISSN 1532-4435. URL http://www.jmlr.org/papers/volume16/swaminathan15a/swaminathan15a.pdf

Please let me know if this is the right place to ask this question.