    Gediminas Žylius
    @gediminaszylius
    The more I play around with MLDB, the more I like it. But I have a few questions:
    1) I saw that SVM is implemented as a plugin using LIBSVM. My question is, would it be hard to add other popular algorithms like BPR and binary matrix factorization (both implemented in LIBMF), useful and popular for web click data, and factorization machines (implemented in LIBFFM), useful for nonlinear modeling in sparse settings? Those seem to be popular and efficient implementations in C++ too.
    2) It seems to me that some popular statistical functions are missing (like standard deviation, median, percentile/quantile calculations, etc.). Is that on purpose, or did I miss something in the documentation?
    3) What about feature selection? Are there ways to implement it internally during model training (like CV), or is it only possible externally? Maybe the best option would be using jseval?
    I hope I'm not bothering you too much.
    Jeremy Barnes
    @jeremybarnes
    Hi @GedasZ, MLDB was designed to include other algorithms, especially those with a C or C++ interface, so it would certainly be useful to implement those algorithms by wrapping them similarly to LIBSVM.
    Standard deviation is implemented as an aggregator, as are most of the parametric distribution functions (and it's quite easy to add new ones). The nonparametric ones like median and percentile/quantile would also be relatively simple to add; if you're interested we could show you how to get started on them.
    For feature selection, most algorithms are relatively robust to dense features, so feature selection is not that much of an issue; we tend to use the explain feature on an algorithm like random forests, which does select features, in order to understand what it has chosen and why. For sparse features we typically do not include them directly in a classifier; we will embed them into a dense feature space using an unsupervised embedding like an SVD and then operate on the output of that feature space.
    Typically we will do feature engineering or selection using SQL; using COLUMN EXPR you can implement quite sophisticated feature selection logic. For example, in https://github.com/mldbai/mldb/blob/master/testing/MLDB-498-svd-apply-function.js#L101 we have a feature-selection query, SELECT COLUMN EXPR (AS columnName() WHERE rowCount() > 100 ORDER BY rowCount() DESC, columnName() LIMIT 1000) FROM reddit_dataset, which selects the 1,000 most frequent (sparse) features among those that occur more than 100 times. jseval is a great escape hatch for when you need to do something that's not possible otherwise, and it is pretty fast, so that's another option. And you're certainly not bothering us; keep the questions coming!
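    As a minimal sketch of running that kind of query from Python, assuming a pymldb connection like the one used in the snippets later in this thread (adjust the host to your instance):

    from pymldb import Connection
    mldb = Connection("http://localhost")  # adjust host/port as needed
    # Pick the 1,000 most frequent sparse columns occurring more than 100 times
    df = mldb.query("""
        SELECT COLUMN EXPR (AS columnName()
                            WHERE rowCount() > 100
                            ORDER BY rowCount() DESC, columnName()
                            LIMIT 1000)
        FROM reddit_dataset
    """)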
    François Maillet
    @mailletf
    @GedasZ small addition for #2. You asked for functions, so Jeremy’s answer is probably what you were looking for, but I’ll just add that there is the summary statistics procedure that calculates all the standard statistics on a dataset: https://docs.mldb.ai/doc/#builtin/procedures/SummaryStatisticsProcedure.md.html
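    A hedged sketch of invoking it via pymldb, assuming the procedure type id is summary.statistics with the usual inputData/outputDataset params (check the linked doc page for your version; the dataset names here are hypothetical):

    # Compute summary statistics for my_dataset into my_dataset_stats
    mldb.put("/v1/procedures/my_summary_stats", {
        "type": "summary.statistics",
        "params": {
            "runOnCreation": True,
            "inputData": "SELECT * FROM my_dataset",
            "outputDataset": "my_dataset_stats"
        }
    })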
    Gediminas Žylius
    @gediminaszylius
    @jeremybarnes Good to know that LIBMF and LIBFFM functionality could be added. Those libs implement really popular (or increasingly popular) algorithms efficiently. As I understand it, a user cannot add plugins written in C++ by himself, so my question is: should one raise this request on GitHub as a possibility for the MLDB developer team to consider adding those algorithms in the future, or are your plans very tight so that it is not expected in the near future?
    @mailletf thanks for pointing this out, I missed it somewhere in the documentation
    Jeremy Barnes
    @jeremybarnes
    @GedasZ it is possible for users to add C++ plugins by themselves, outside the main MLDB repo, though the user experience isn't as good as it could be. That's how we implement our enterprise versions and upcoming Geographic/LiDAR functionality. Essentially, you need to do the same as we've done for the mongodb plugin here: https://github.com/mldbai/mldb/tree/master/mongodb, except that mldb becomes a submodule of the main repo. You then put the external libraries in ext, and write the C++ code to bridge between their representation and MLDB's data representation. At the end, you compile the plugin and copy it into your mldb_data/plugins/autoload/<pluginname> directory, and it will be loaded by MLDB on startup.
    Another alternative would be to simply contribute it as a plugin to the open source version of MLDB
    If you're interested in pursuing either of these options, then we'd be happy to support you as you do so
    Gediminas Žylius
    @gediminaszylius
    Hi, is it possible to export a trained classifier file (for example "predictor.cls") from one MLDB instance to another and use it there, or must it be trained on a particular instance and is not reusable on other instances?
    Gediminas Žylius
    @gediminaszylius
    Another question: is it possible to train a regression model (say gradient boosted regression trees) on multiple outputs (1 feature vector -> N-dimensional output/target vector), or must I train a separate model for every label independently?
    Jeremy Barnes
    @jeremybarnes
    @GedasZ It is completely possible to export trained classifier files from one MLDB instance to another. They include all of the information about the feature names and types needed to run them in a different situation. You can also export a JSON dump of the classifier using /v1/functions/<classifier>/details, although currently it's not possible to re-import that elsewhere.
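    For example, on the second instance you could do something like this (a sketch; functions of type "classifier" taking a modelFileUrl is how trained models are normally loaded, but double-check the function docs for your version):

    # Load a .cls file trained on another instance as a scoring function
    mldb.put("/v1/functions/imported_predictor", {
        "type": "classifier",
        "params": {
            "modelFileUrl": "file:///mldb_data/predictor.cls"
        }
    })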
    Regression models can only be trained on a single output currently. Can you tell us more about what you want to do with multiple outputs over the same model structure vs two separate models?
    Gediminas Žylius
    @gediminaszylius
    About multi-output regression: I guess decision trees hardly apply to the multiple-output regression problem, but neural networks do, for example (there are many implementations that support multiple-output regression in neural networks). However, suppose I need to train decision-tree-based models for multiple-output regression. The assumption that every output is independent of the others is valid, so I can train N models, one for every output scalar. Training every model independently is not much of a problem in my case, but when I need to make a prediction, I need to write a function (say using jseval) that iterates through the N models and produces the output. Or is there a more efficient and easier way to do that?
    Jeremy Barnes
    @jeremybarnes
    Agreed that neural nets have lots of valid multiple-prediction cases, especially in a generative setting. To make a prediction, you can simply construct a vector of the outputs, for example SELECT [ classifier1(features), classifier2(features), classifier3(features) ] to return a 3-element vector with the three models' outputs.
    For trees, it would be hard to support multiple outputs within the optimized training framework that we have, as many of the optimizations are around reducing memory bandwidth, and multiple labels simply take more memory. For neural nets, implementing gradient descent training is still a WIP, but since the loss function is arbitrary, multiple outputs will be accepted.
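    A sketch of wrapping that pattern in a reusable function with sql.expression (the model names model0..model2 are hypothetical; each is assumed to be a regression-mode classifier function exposing a score output):

    mldb.put("/v1/functions/multi_output", {
        "type": "sql.expression",
        "params": {
            "expression": """
                [model0({features})[score],
                 model1({features})[score],
                 model2({features})[score]] AS prediction
            """
        }
    })
    # then: SELECT multi_output({features: {*}}) FROM my_dataset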
    Gediminas Žylius
    @gediminaszylius
    understood, thank you :)
    Gediminas Žylius
    @gediminaszylius
    So I tried to train a regressor for one of the multiple outputs (called V3) with the following python code:
    mldb.put("/v1/procedures/factor_regressor_model", {
    "type": "classifier.train",
    "params": {
    "mode": "regression",
    "trainingData": """
    select
    {* EXCLUDING("""+",".join(["V"+str(i+3) for i in xrange(50)])+""")} as features,
    V3 as label
    from query_factor_table_id
    """,
    "algorithm": "bbdt",
    "functionName":"factor_regressor",
    "modelFileUrl": "file:///mldb_data/factor_regressor.cls"
    }
    })
    With the bbdt and bbdt-5 configurations it gives me the following error and the database crashes (the imported data are gone):

    ResourceError                             Traceback (most recent call last)

    <ipython-input-71-bd023aae1a85> in <module>()
         11     "algorithm": "bbdt",
         12     "functionName": "factor_regressor",
    ---> 13     "modelFileUrl": "file:///mldb_data/factor_regressor.cls"
         14     }
         15 })

    /usr/local/lib/python2.7/dist-packages/pymldb/__init__.pyc in inner(*args, **kwargs)
         16     result = add_repr_html_to_response(fn(*args, **kwargs))
         17     if result.status_code < 200 or result.status_code >= 400:
    ---> 18         raise ResourceError(result)
         19     return result
         20 return inner

    ResourceError: '502 Bad Gateway' response to 'PUT http://localhost/v1/procedures/factor_regressor_model'

    {
        "error": "MLDB Unavailable",
        "details": {
            "message": "MLDB is unable to respond either because it has not finished booting or because it has crashed and has not finished rebooting. Recent log messages are available via HTTP at /logs/mldb"
        },
        "httpCode": 502
    }

    But it works with the bdt algorithm configuration.
    What could be the problem?
    Jeremy Barnes
    @jeremybarnes
    @GedasZ would it be possible to send us the dataset (in private is OK) so that we can debug it? Boosting on regression tasks requires special care as the margin is ill-defined and numerical errors can creep in due to the exponentiation in the boosting loss. It shouldn't crash the DB, however, and we'd like to fix that. I would run bagged decision trees, not bagged boosted decision trees, for a regression problem.
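    Concretely, that's just your procedure config from above with the algorithm swapped (a sketch; everything else is unchanged from your snippet):

    targets = ",".join(["V" + str(i + 3) for i in xrange(50)])  # V3..V52
    mldb.put("/v1/procedures/factor_regressor_model", {
        "type": "classifier.train",
        "params": {
            "mode": "regression",
            "trainingData": """
                select {* EXCLUDING (""" + targets + """)} as features,
                       V3 as label
                from query_factor_table_id
            """,
            "algorithm": "bdt",  # bagged decision trees instead of bbdt
            "functionName": "factor_regressor",
            "modelFileUrl": "file:///mldb_data/factor_regressor.cls"
        }
    })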
    Gediminas Žylius
    @gediminaszylius
    Also, when I looked at the log file I saw "MLDB exited due to signal 11 SIGSEGV: segmentation fault (internal error)"
    Gediminas Žylius
    @gediminaszylius
    It seems that the random forest (bagged decision trees) implementation for regression outputs -NaN data (after importing into a python pandas dataframe), whereas the same configuration in xgboost, for example, works fine. I also don't see whether it is possible to select the cost optimization function. So I assume that MSE is implemented for regression problems?
    Jeremy Barnes
    @jeremybarnes
    If you use bagged decision trees, then regression should work. You are right that the only cost function implemented to calculate split points in decision trees for regression is MSE. What other function have you had success with that you'd like to use? In general, in MLDB we try to avoid having too many options, so that it's possible to get reasonable results without doing any work on parameter tuning.
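    For intuition, here is a minimal sketch in plain Python (not MLDB's actual implementation) of how an MSE criterion scores candidate split points for one feature in a regression tree:

    def sse(ys):
        # Sum of squared errors around the mean; minimizing this at each
        # split is equivalent to the MSE criterion
        if not ys:
            return 0.0
        m = sum(ys) / float(len(ys))
        return sum((y - m) ** 2 for y in ys)

    def best_split(xs, ys):
        # Try each observed feature value as a threshold and keep the one
        # that minimizes the total squared error of the two partitions
        best_cost, best_t = float('inf'), None
        for t in sorted(set(xs)):
            left = [y for x, y in zip(xs, ys) if x <= t]
            right = [y for x, y in zip(xs, ys) if x > t]
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best_cost, best_t = cost, t
        return best_t, best_cost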
    Jeremy Barnes
    @jeremybarnes
    I just verified; the segfault in MLDB on boosted decision trees is due to boosting not having a suitable cost function for regression implemented. In the next release, it will return an error instead of segfaulting.
    In general, the naive approach (calculating the margin as the L1 norm of the difference between the prediction and the output) is problematic: the boosting loss function is exponential in that difference and can never go to zero, hence we end up with huge error terms. We would probably need some kind of acceptable-error zone as a hyperparameter to make it work.
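    A quick numeric illustration, using exp of the L1 residual as a stand-in for the boosting loss:

    import math
    # The loss is >= 1 for any residual and explodes quickly, so a
    # regression that can't reach zero error accumulates huge terms
    for resid in [0.0, 0.5, 2.0, 10.0]:
        print(resid, math.exp(resid))  # -> 1.0, 1.65, 7.39, 22026.47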
    for the bagged decision trees with the NaNs, could you show us what you did?
    Dan Croitoru
    @cromodnet
    New to MLDB: trying to install it. What is MLDB_IDS (the user ID I need to put in the command to install into a Docker container: -e MLDB_IDS="id")?
    Jeremy Barnes
    @jeremybarnes
    It's the user ID of the user you want to own the files that MLDB produces in mldb_data. Without it, those files end up owned by root or an unknown user ID, which makes it hard to work both inside and outside of MLDB at the same time.
    Dan Croitoru
    @cromodnet
    Thanks Jeremy, but whatever I put as the ID, I always obtain the same error message: usermod: invalid user ID 'xxx'
    Jeremy Barnes
    @jeremybarnes
    @cromodnet if you type id in your shell, what do you get as output?
    Vinay Agarwal
    @vinkaga
    Does mldb have inception-v4 model yet?
    François Maillet
    @mailletf
    @vinkaga you're talking about Inception-ResNet-v2? not yet. there is some work that needs to be done in mldb to support it. we’ve started but don’t have an ETA on it
    Vinay Agarwal
    @vinkaga
    @mailletf Thanks
    Da Kuang
    @dkuang1980
    @mailletf Nice talk yesterday at Pycon Canada! I am Da from SurveyMonkey, very nice to meet you.
    François Maillet
    @mailletf
    hey @dkuang1980 thanks for coming to the talk and it was also good to meet you. feel free to shoot me an email if you have any questions and I’d be happy to jump on a call with you if you want to explore the possibility of using mldb
    Simon Narang
    @simonnarang
    this is super cool
    Simon Narang
    @simonnarang
    I want to run it on AWS. Do I just pay for the AWS hosting and you guys take a portion of that, or do I need to pay an additional cost?
    François Maillet
    @mailletf
    Hey @simonnarang. If you’re using the open-source version, you can run it wherever you want and you don’t owe us anything. If you want to run the Docker image of the enterprise version, then there is a license fee for MLDB, plus any hosting costs that your provider will charge. We also offer a hosted version of MLDB where you essentially don’t have to deal with the hosting aspect at all and we manage it for you. Can you shoot me an email at francois@mldb.ai and we can go over things in more detail?
    Simon Narang
    @simonnarang
    sure
    Gediminas Žylius
    @gediminaszylius
    Hello, I wanted to ask how preprocessing of the image is performed when using the deepteach demo with the TensorFlow Inception v3 model. Is it cropped to the required dimensions, or resized and interpolated if the dimensions are too small, or handled somehow differently?
    Simon Narang
    @simonnarang
    I think the example just uses an image that already has the right dimensions
    Jeremy Barnes
    @jeremybarnes
    It is resized to the correct dimensions, as part of the model graph
    Gediminas Žylius
    @gediminaszylius
    And what about when the image is too small? Does this part of the graph fill the missing values with some constant, or generate the missing values by nearest-neighbour interpolation, for example?
    Jeremy Barnes
    @jeremybarnes
    I believe that it will interpolate the missing pixels.
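    For reference, a minimal TF1-style sketch of an in-graph bilinear resize (299x299 is Inception v3's input size; the exact ops in MLDB's graph are an assumption here):

    import tensorflow as tf
    # Batch of images of arbitrary height/width with 3 channels
    images = tf.placeholder(tf.float32, shape=[None, None, None, 3])
    # Bilinear resize interpolates pixel values when scaling up or down
    resized = tf.image.resize_bilinear(images, size=[299, 299])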
    Gediminas Žylius
    @gediminaszylius
    hi, I'm still confused about when I can access function input values via $ and when I cannot. For example, when I use the transpose function after a FROM statement with $ inputs I get an error: