    Jacob Alber
    @lokitoth
    Ah, yeah, in that case you need to be in C++
    Otherwise you probably will not have direct access to the weights
    Under Ubuntu, likely the easiest way to get this set up is via CMake; are you able to follow the CMake build instructions to build vw itself? (Guessing yes, but want to confirm)
    Jacob Alber
    @lokitoth
    @silverguo: For a really lightweight, simple app showing how to initialize VW, parse some examples, feed them into it, and then save a model, take a look at: https://github.com/lokitoth/vw_use_demo
    It does not really cover deeper scenarios, but it may help you get unblocked. I will be offline for a bit, but will be back on later in case you have questions, so please put them here :-)
    Yuhan GUO
    @silverguo
    @lokitoth Thanks a lot for creating that example. I'm able to follow the CMake build instructions to build vw itself; I will start from your demo and come back here once I have questions :)
    LifeHappens
    @qtbow_twitter
    Complete n00b here. Can anyone recommend a prerequisite learning/reading guide for understanding how to use VW? I tried reading the Github wiki and instantly became lost
    Jacob Alber
    @lokitoth
    Hello @qtbow_twitter - welcome to the community :-) Are you looking for a tutorial on some part of vw in particular? Could you share a bit more about the problem you are trying to solve?
    Yuhan GUO
    @silverguo
    Hi, I've run into a problem with the constant. I train a model on a large imbalanced dataset with a precomputed constant (using the --constant option; the value is about -4.4 because positive samples are ~1.2%), but after the first pass the value of the constant becomes nearly 0. I then turned off L2 regularization, but the same thing happens. Since I have a big training set and a lot of features, there are some collisions with the VW constant hash (116060); do you think a collision with the constant hash would impact the value of the constant?
    Yuhan GUO
    @silverguo
    Oh, I think I partly understand why: my bit_precision is 24, so the VW constant hash is no longer 116060
    Jacob Alber
    @lokitoth
    If you get feature collisions, then the value will definitely be impacted, because it is no longer actually constant. When you have multiple colliding features, what ends up happening (for sparse features; I'm not completely sure what the behaviour is for dense ones) is that the feature is added multiple times, which has the effect of adding the feature values together.
    Jacob Alber
    @lokitoth

    Thus my expectation is that within a given namespace, if you have two features with the same feature hash f, one with value 1 (the default value for string features) and the other with value c, you will get two feature entries in your sparse vector for index = f: one with value = 1 and the other with value = c. When you compute the model's value for this, you will end up multiplying the weight for feature-index f (w_f) by 1 and by c, and adding both of those partials to the final regression value.

    (I am not very confident about what happens when the base learner is not the linear regressor; I will have to look that up)

    Yuhan GUO
    @silverguo
    Oh, that makes a lot of sense
    Also, I'm a bit confused about the constant in VW: if I use logistic loss for binary classification, the constant should be the intercept of the linear function?
    but in the VW training log, we have a best constant, which is the weighted average of the labels
    Just to make sure: the constant should be the intercept of the linear function (not the same as the best constant in the VW log)?
    Jacob Alber
    @lokitoth

    @silverguo - I spent a bit of time digging through this.

    The idea of best constant is the following: Given a problem with examples X_i, labels Y_i, a linear model class (Y-hat_i = W * X_i + C), and loss function L_i = L(Y-hat_i, Y_i), find a value of 'C' such that the total loss is minimized for a weight-vector W = 0.

    In other words:

    Min( Sum[i]( L(C, Y_i) ) )

    In the case of logistic loss, L(Y-hat_i, Y_i) = log( 1 + exp( -Y-hat_i * Y_i ) ), the constant loss is:

    L_i = log( 1 + exp( -Y_i * C ) )

    Since Y_i is one of -1 or 1, let 'a' be the number of examples with label '1' and 'b' be the number of examples with label '-1'. The total loss over your dataset is:

    L_total = a * log ( 1 + exp(-C) ) + b * log ( 1 + exp (C) )

    Min(L_total) is attained at the C where d/dC{L_total}(C) = 0

    d/dC{L_total} = [( b * exp(C) ) / ( 1 + exp(C) )] - [( a * exp(-C) ) / ( 1 + exp(-C) )] = 0

    Let z = exp(C):

    [(b * z) / (1 + z)] - [(a / z) / (1 + 1/z)] = 0

    For all k, C in R, 1 + exp(k * C) is positive, thus (1 + z)(1 + 1/z) != 0, so:

    [(b * z) * (1 + 1/z)] - [(a / z) * (1 + z)] = 0
    
    [b*z + b] - [a/z + a] = 0

    For all C in R, z = exp(C) > 0, thus z != 0, so:

    b*z^2 + (b - a)*z - a = 0

    Solve for quadratic:

    z = [(a - b) +- sqrt((b - a)^2 + 4 * b * a)] / (2*b)
    z = [(a - b) +- sqrt(b^2 - 2ab + 4ab + a^2)] / (2*b)
    z = [(a - b) +- sqrt((a + b)^2)] / (2*b)
    z = [(a - b) +- (a + b)] / (2*b)
    
    z = {(2*a) / (2*b), -(2*b)/(2*b)}
      = {a / b, -1}

    Since we have the restriction z > 0

    z = a / b
    
    C = ln(a / b)

    Looking in best_constants.cc, at lines 48-59 we compute that value for best-constant:

    else if (funcName.compare("logistic") == 0)
    {
      label1 = -1.;  // override {-50, 50} to get proper loss
      label2 = 1.;
    
      if (label1_cnt <= 0)
        best_constant = 1.;
      else if (label2_cnt <= 0)
        best_constant = -1.;
      else
        best_constant = log(label2_cnt / label1_cnt);
    }

    Are you sure the value you are seeing is the weighted average, and not the log-odds value ln(a/b)?

    Sorry for the wall-of-text
    Jacob Alber
    @lokitoth
    Another thought: I would expect it to be weighted average of label if you are using square loss
    Yuhan GUO
    @silverguo
    Very clear explanation. It seems that the best constant is defined as ln(a / b) = ln(p / (1 - p)) = ln(odds), but I did see the weighted average of labels in the VW log with logistic loss
    I think the reason is that I currently use an old version of VW (7.7); best_constant in that version is calculated by https://github.com/VowpalWabbit/vowpal_wabbit/blob/7.7/vowpalwabbit/main.cc#L73
    Yuhan GUO
    @silverguo
    @lokitoth For the best constant computed the way you described, does it have any relationship to the constant value in a trained model? (In my case, the converged constant value is quite different from ln(odds), even without any regularization)
    Jacob Alber
    @lokitoth
    @silverguo: You ask the best questions :) I'll have to take a deeper look.
    Jacob Alber
    @lokitoth

    @silverguo: Thinking about the implications more deeply, I would not expect the final constant to be the same as "best constant", because "best constant" assumes that the non-constant features have no value in separating the two classes. It asks the question: if I only had the ability to choose a constant as my regressor, what is the constant that gives me the least loss? In essence: "If I had no features, what would be the best naive starting point?"

    On the flip side, if the features capture the separation perfectly, one could end up with a situation where the best value for the constant is 0. Imagine a dataset with unequal counts of examples of class1 and class2, with only a single feature, whose value happens to be the label of the class (-1 or 1). As above, best-constant will find a non-zero value which, taken alone, will produce the best classifier one can get. However, if we run the training, I would expect the model to eventually converge on weight = 1 with constant = 0, because there is a feature inside the examples which perfectly captures the separation.

    Yuhan GUO
    @silverguo
    @lokitoth Very nice example showing that there is no direct relation between the 'best constant' and the constant in the model, really appreciated!
    Yuhan GUO
    @silverguo
    Hi, I'm a little confused about an option in VW
    --initial_pass_length is a trick to make LBFGS quasi-online. You must first create a cache file, and then it will treat initial_pass_length as the number of examples in a pass, resetting to the beginning of the file after each pass. After running --passes many times, it starts over warmstarting from the final solution with twice as many examples.
    Yuhan GUO
    @silverguo
    So if I set --initial_pass_length to 30K and --passes to 10, will VW first use the first 30K samples in the cache file to train for 10 passes, then use the first 60K samples in the cache file to train for another 10 passes?
    Jacob Alber
    @lokitoth
    Hi @silverguo, sorry for the delay in getting back to you. Unfortunately, I am not super familiar with BFGS in general (and in VW in particular). I am going to get in touch with the author of the code, and either he or I will get back to you.
    John
    @JohnLangford
    Yuhan GUO
    @silverguo
    @JohnLangford @lokitoth thanks for the help! Why does it make BFGS quasi-online? It still uses half of the training set for warmstarting (like mini-batching, maybe); is there any paper or article about this trick?
    I found that LBFGS will produce a bad model if all weights are initialized to 0; warmstarting from a trained SGD model could be a good idea. Are there other methods to solve that problem?
    Jacob Alber
    @lokitoth
    Hey @silverguo: Sorry for taking so long to get to this. I am not sure about your first few questions, but per this paper (https://arxiv.org/pdf/1110.4198.pdf) it seems that your suggestion of warmstarting from a trained SGD model is a good approach.
    Rodrigo Kumpera
    @kumpera
    @mikgor how did you invoke VW?
    Jack Gerrits
    @jackgerrits
    @mikgor I replied to your SO question, hopefully that helps you understand the format
    Yuhan GUO
    @silverguo
    Hi, I have a question about the code: for the struct label_data defined in simple_label.h, is initial the same as Base explained on this page https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format ?
    Is initial only used for residual regression? In other cases, should it always be initialized to 0?
    John
    @JohnLangford
    Yes, yes, and yes :-)
    Sandeep Bhutani
    @sandeepbhutani304
    Hi... Can anyone tell me what is special about Vowpal Wabbit? There are a lot of machine learning libraries; what does VW offer that makes it worth learning?
    John
    @JohnLangford
    The short answer is 'online interactive learning', which other systems do not really support. Here's a quora post https://www.quora.com/How-relevant-is-Vowpal-Wabbit-in-2016 I did a few years ago which is still relevant, and the new website (http://vowpalwabbit.org) has more info.
    hermes-z
    @hermes-z
    Hi all. I have a question regarding the bandit models of VW. I am currently running some simulations for a recommendation task with cb_explore_adf. Every arm has a probability of click and a cost, which is paid when there is no click. In this case, VW always identifies the arm with the minimum expected cost.
    However, if I turn the problem around and try to reward the click events (by giving a negative cost), then it fails to converge to the optimal solution.
    1. Is there an asymmetry between positive and negative costs in VW?
    2. Is there a CLI argument I should pass to the model?
      Thanks
    John
    @JohnLangford
    First, vw operates on a cost basis---make sure you are getting that right.
    Second, an asymmetry may exist in your problem if the probability of a click is particularly low or high.
    Third, try using --cb_type dr to automatically learn a good offset. This doesn't quite remove asymmetry in the solution (which does exist due to default parameters), but it goes a substantial way towards doing so.
    hermes-z
    @hermes-z

    Thanks for the answer. I know for sure that VW works with costs. I have also taken care to have the probabilities in a logical range and not imbalanced.

    I will give the --cb_type suggestion a try.

    UPDATE: In fact, I realised that the problem only arises when I use --bag as my policy. --epsilon and --softmax, on the other hand, work well.

    John
    @JohnLangford
    I'm not sure what's up. --bag should have little difference from --epsilon in terms of loss asymmetry. If you have a concrete example you might want to share, you could open an issue on github.
    hermes-z
    @hermes-z
    Ok. I will try to create a reproducible example and open an issue.
    Boris Mattijssen
    @borismattijssen

    Hi folks,

    Something that I've been breaking my head over in the past months (master thesis) is the claim that IPS is unbiased. This raises the following questions for me:

    1. We're still creating bias in how contexts are defined, right? The less specific a context is, the more bias we introduce.
    2. What happens when contexts are unique? How valid are the IPS results then? For example, in the data set presented in [1], about one in three contexts is unique, while the mean probability of choosing an action is 0.14. This creates bias, right?
    3. What is the difference between the IPS estimator and grouping data by (context,action) and using the average reward?

    [1] A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res., 2015. ISSN 1532-4435. URL http://www.jmlr.org/papers/volume16/swaminathan15a/swaminathan15a.pdf

    Please let me know if this is the right place to ask this question.