    Gediminas Žylius
    @gediminaszylius
    mldb.get('/v1/functions/some_classifier/application', input={
        "features": {"feature1": 1.0,
                     "feature2": [],
                     "feature3": 0.0}})
    this would be one observation
    Simon Lemieux
    @simlmx
    yes
    Gediminas Žylius
    @gediminaszylius
    but I want to send data via HTTP
    Simon Lemieux
    @simlmx
    right
    Gediminas Žylius
    @gediminaszylius
    if I use it via SQL I need to write the data first; would that add much overhead?
    Simon Lemieux
    @simlmx
    I'm not sure there is a way to do multiple HTTP calls for a given function in one call right now, but give me a second and I'll make sure
    Gediminas Žylius
    @gediminaszylius
    as far as I checked, it doesn't seem to be possible
    Simon Lemieux
    @simlmx
    I think you might be right
    Gediminas Žylius
    @gediminaszylius
    using predictions in batch mode via SQL is great and fast, but if you have data arriving in batches or streams, that would mean you must write it to the DB first and then apply the SQL function. The other option would be to do it via the REST API using streams. If you have, say, 1000 values that fit nicely in a URL, that would be very efficient. I don't know whether using the REST API with batches of, say, 1000 samples sequentially adds much overhead or not?
    Jeremy Barnes
    @jeremybarnes
    you can create a custom route that performs a batch query, either using the sql.query function or defining a handler in Python or JS
    Gediminas Žylius
    @gediminaszylius
    I also assume that missing values are encoded as []?
    François Maillet
    @mailletf
    encoded as NULL. except if you use the replace_null() function. the models learn to handle missing values if they’re present in the training set
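The behaviour described above (a missing value is NULL unless you substitute a default) can be sketched in plain Python; this is a hypothetical illustration, not MLDB code, with the `replace_null` name borrowed from the SQL builtin:

```python
# Hypothetical Python sketch of the semantics described above: a missing
# feature is NULL (None here) unless a default is substituted, the way
# MLDB's replace_null() SQL builtin does.
def replace_null(value, default):
    """Return `default` when the value is missing, else the value itself."""
    return default if value is None else value

features = {"feature1": 1.0, "feature2": None, "feature3": 0.0}
cleaned = {k: replace_null(v, 0.0) for k, v in features.items()}
# cleaned == {"feature1": 1.0, "feature2": 0.0, "feature3": 0.0}
```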
    Simon Lemieux
    @simlmx
    also if that's possible, you could write all your examples to a file, use import.{text,json} to import them efficiently into a dataset, and then use a /v1/query to call the functions on that dataset
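The file-then-import flow suggested above might look roughly like this in Python. This is a sketch under assumptions: the `import.json` parameter names (`dataFileUrl`, `outputDataset`) and the `some_classifier` function name should be checked against the MLDB docs for your version.

```python
import json
import tempfile

# Hypothetical sketch: write the examples to a file, build an import.json
# procedure config for it, then query the resulting dataset in one call.
examples = [{"feature1": 1.0, "feature3": 0.0},
            {"feature1": 2.0, "feature2": 5.0}]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")   # one JSON record per line
    path = f.name

import_config = {
    "type": "import.json",
    "params": {"dataFileUrl": "file://" + path,
               "outputDataset": "to_score"},
}
query = {"q": "SELECT some_classifier({features: {*}}) FROM to_score"}
# then, against a running MLDB instance:
#   mldb.post("/v1/procedures", import_config)
#   mldb.get("/v1/query", **query)
```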
    Jeremy Barnes
    @jeremybarnes
    @GedasZ would you provide a REST call that you'd like to do (structure of the input and output data) and we can help you best define how to make it?
    Jeremy Barnes
    @jeremybarnes
    ah, OK, I see you already did above
    you can process multiple observations in one call using row_dataset and/or COLUMN EXPR; we'll post an example by the end of the day
    Gediminas Žylius
    @gediminaszylius
    Thank you, looking forward to seeing your example
    To be more precise, I need to produce prediction values for a batch (say 1000 observations) of instances over HTTP. Those instances are not available in the MLDB database; each instance is represented by some feature set (say feature1, feature2, feature3 in my example above) and should be sent as JSON using MLDB's REST API to the classifier function.
    I also need to get the prediction values for those 1000 observations back as the response
    Jeremy Barnes
    @jeremybarnes
    yep, got it
    Jeremy Barnes
    @jeremybarnes

    @GedasZ It's possible using the following constructs, but there are some little issues that will be fixed in the next release of MLDB

    var functionConfig = {
        type: 'sql.query',
        params: {
            query: 'select horizontal_sum(value) as value, column FROM row_dataset($input)',
            output: 'NAMED_COLUMNS'
        }
    };
    
    var fn = mldb.put('/v1/functions/score_many', functionConfig);
    
    mldb.log(fn);
    
    var functionConfig2 = {
        type: 'sql.expression',
        params: {
            expression: 'score_many({input: rowsToScore})[output] AS *',
            prepared: true
        }
    };
    
    var fn2 = mldb.put('/v1/functions/scorer', functionConfig2);
    
    mldb.log(fn2);
    
    var input = { rowsToScore: [ { x: 1, y: 2}, {a: 2, b: 3, c: 4} ] };
    
    var res = mldb.get('/v1/functions/scorer/application',
                       { input: input, outputFormat: 'json' });
    
    mldb.log(res);

    What this does is:
    a) creates an sql.query function to actually perform the multi-row classification (using row_dataset to turn the input row, which is variable-length, into a table). The horizontal_sum function is a placeholder; you would use whatever function implements your logic, e.g. a classifier;
    b) creates a prepared sql.expression as an interface (this avoids re-binding the expression on every call, which is expensive, although maybe it doesn't matter if you are calling with a payload of 1,000 values); and
    c) calls the endpoint with two rows. It produces a JSON output of [ 3, 9 ], which is the horizontal sum of the elements of each input row.
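For reference, what the placeholder computes per row can be sketched in plain Python (a hypothetical illustration, not MLDB code):

```python
# Hypothetical sketch of the placeholder logic: horizontal_sum adds up the
# values of each variable-length input row, yielding one number per row.
def horizontal_sum(row):
    """Sum the values of one input row (a dict of feature -> value)."""
    return sum(row.values())

rows_to_score = [{"x": 1, "y": 2}, {"a": 2, "b": 3, "c": 4}]
print([horizontal_sum(r) for r in rows_to_score])  # [3, 9]
```

In the real setup the per-row function would be your classifier rather than a sum, but the batching shape (a list of rows in, a list of scores out) is the same.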

    If you need to get going on this before the next release (next week), then we can show you how to set up an endpoint using a JS plugin that can perform arbitrary logic in response to an arbitrary route. There is slightly more code and it will be a little slower as it's implemented in JS, but it should work fine. Let us know.

    You could call that with as few or as many rows as you want
    Gediminas Žylius
    @gediminaszylius
    Great! Thank you, I will try this out and check how fast this approach is compared to simply streaming row by row.
    Gediminas Žylius
    @gediminaszylius
    I just tried to reproduce results using your example in python:

    mldb.put("/v1/functions/eval_function1", {
        "type": "sql.query",
        "params": {
            "query": "select horizontal_sum(value) as value, column FROM row_dataset($input)",
            "output": "NAMED_COLUMNS"
        }
    })

    mldb.put("/v1/functions/score_many1", {
        "type": "sql.expression",
        "params": {
            "expression": "eval_function1({input: rowToScore})[output] AS *",
            "prepared": True
        }
    })

    mldb.get('/v1/functions/score_many1/application',
             {"input": {"rowToScore": [{"x": 1, "y": 2}, {"a": 2, "b": 3, "c": 4}]}})

    and got :
    GET http://localhost/v1/functions/score_many1/application
    200 OK
    {
        "output": {
            "0.y": 2,
            "0.x": 1,
            "1.a": 2,
            "1.b": 3,
            "1.c": 4
        }
    }
    which is not the sum of the rows
    did I make an error?
    François Maillet
    @mailletf
    hey! there are actually a few little issues related to that code that we are fixing as we speak. are you running mldb with the release docker or are you working from source?
    Gediminas Žylius
    @gediminaszylius
    oh sorry, you meant that this is not working yet and will work next week, if I understood correctly?
    François Maillet
    @mailletf
    yes, apologies for the confusion. what would work now is doing it using a JS plugin that can perform arbitrary logic in response to an arbitrary route. more code, and not the way we would suggest doing it once we’ve fixed the issues
    Jeremy Barnes
    @jeremybarnes
    @GedasZ FYI mldbai/mldb#730
    Jeremy Barnes
    @jeremybarnes
    There will be a new /v1/functions/<function>/batch route that directly implements the solution to your problem, so no need to mess around with the more complex solution.
    Gediminas Žylius
    @gediminaszylius
    Very cool! Can't wait for new release :)
    Gediminas Žylius
    @gediminaszylius
    The more I play around with MLDB the more I like it. But I have a few questions:
    1) I saw that SVM is implemented as a plugin using LIBSVM. My question is: would it be hard to add other popular algorithms like BPR and binary matrix factorization (both implemented in LIBMF), useful and popular for web click data, and factorization machines (implemented in LIBFFM), useful for nonlinear modeling in sparse settings? Those seem to be popular and efficient implementations in C++ too.
    2) It seems to me that some popular statistical functions are missing (like standard deviation, median, percentile/quantile calculations, etc.). Is that on purpose, or did I miss something in the documentation?
    3) What about feature selection? Are there ways to implement it internally during model training (like CV), or is it only possible externally? Maybe the best option would be using jseval?
    I hope I'm not bothering you too much.
    Jeremy Barnes
    @jeremybarnes
    Hi @GedasZ, MLDB was designed to include other algorithms, especially those with a C or C++ interface, so certainly it would be useful to implement those algorithms by wrapping them similarly to libSVM.

    Standard deviation is implemented as an aggregator, as are most of the parametric distribution functions (and it's quite easy to add new ones). The nonparametric ones like median and percentile/quantile would also be relatively simple to add; if you're interested, we could show you how to get started on them.

    For feature selection, most algorithms are relatively robust to dense features, so feature selection is not that much of an issue; we tend to use the explain feature on an algorithm like random forests, which does select features, in order to understand what it has chosen and why. For sparse features we typically do not include them directly in a classifier; we embed them into a dense feature space using an unsupervised embedding like an SVD and then operate on the output of that feature space.

    Typically we do feature engineering or selection using SQL; with COLUMN EXPR you can implement quite sophisticated feature-selection logic. For example, in https://github.com/mldbai/mldb/blob/master/testing/MLDB-498-svd-apply-function.js#L101 we have a feature-selection query like select COLUMN EXPR (AS columnName() WHERE rowCount() > 100 ORDER BY rowCount() DESC, columnName() LIMIT 1000) from reddit_dataset, which selects the 1,000 most frequent (sparse) features among those that occur more than 100 times. jseval is a great escape hatch for when you need to do something that's not possible otherwise, and it is pretty fast, so that's another option. And you're certainly not bothering us; keep the questions coming!
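The COLUMN EXPR selection logic described above can be sketched in plain Python (a hypothetical illustration, not MLDB code):

```python
from collections import Counter

def select_columns(rows, min_count=100, limit=1000):
    """Mimic: COLUMN EXPR (AS columnName() WHERE rowCount() > min_count
    ORDER BY rowCount() DESC, columnName() LIMIT limit).

    `rows` is a list of dicts mapping column name -> value.
    """
    counts = Counter(col for row in rows for col in row)
    kept = [c for c in counts if counts[c] > min_count]
    # Most frequent first; ties broken by column name, as in the SQL above.
    kept.sort(key=lambda c: (-counts[c], c))
    return kept[:limit]
```

On a sparse dataset this keeps only the `limit` most frequent columns that appear in more than `min_count` rows, which is the same filtering the SQL expression performs inside MLDB.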
    François Maillet
    @mailletf
    @GedasZ small addition for #2. You asked for functions, so Jeremy’s answer is probably what you were looking for, but I’ll just add that there is the summary statistics procedure that calculates all the standard statistics on a dataset: https://docs.mldb.ai/doc/#builtin/procedures/SummaryStatisticsProcedure.md.html
    Gediminas Žylius
    @gediminaszylius
    @jeremybarnes Good to know that LIBMF and LIBFFM functionality could be added. Those libs implement really popular (or increasingly popular) algorithms efficiently. As I understand it, a user cannot add such C++ plugins himself, so my question is: should one raise a request on GitHub for the MLDB developer team to consider adding those algorithms in the future, or are your plans very tight and it is not expected in the near future?
    @mailletf thanks for pointing this out, I missed it somewhere in the documentation
    Jeremy Barnes
    @jeremybarnes
    @GedasZ it is possible for users to add C++ plugins by themselves, outside the main MLDB repo, though the user experience isn't as good as it could be. That's how we implement our enterprise versions and the upcoming Geographic/LiDAR functionality. Essentially, you need to do the same as we've done for the mongodb plugin: https://github.com/mldbai/mldb/tree/master/mongodb, except that mldb becomes a submodule of your repo. You then put the external libraries in ext, and write the C++ code to bridge between their representation and MLDB's data representation. At the end, you compile the plugin, copy it into your mldb_data/plugins/autoload/<pluginname> directory, and it will be loaded by MLDB on startup.
    Another alternative would be to simply contribute it as a plugin to the open source version of MLDB
    If you're interested in pursuing either of these options, then we'd be happy to support you as you do so
    Gediminas Žylius
    @gediminaszylius
    Hi, is it possible to export a trained classifier file (for example "predictor.cls") from one MLDB instance to another and use it there, or must it be trained on a particular instance and not reusable on other instances?
    Gediminas Žylius
    @gediminaszylius
    Another question: is it possible to train a regression model (say, gradient boosted regression trees) on multiple outputs (1 feature vector -> N-dimensional output/target vector), or must I train a separate model for every label independently?
    Jeremy Barnes
    @jeremybarnes
    @GedasZ It is completely possible to export trained classifier files from one MLDB instance to another. They include all of the information about the feature names and types needed to run them in a different situation. You can also export a JSON dump of the classifier using /v1/functions/<classifier>/details, although currently it's not possible to re-import that elsewhere.
    Regression models can only be trained on a single output currently. Can you tell us more about what you want to do with multiple outputs over the same model structure vs two separate models?
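The portability Jeremy describes could be sketched like this: copy the exported .cls file to the second instance, then create a function pointing at it. The "classifier" function type and "modelFileUrl" parameter follow MLDB's documented config shape, but treat the exact names as assumptions to verify against your version's docs.

```python
# Hypothetical sketch of reusing an exported classifier on a second MLDB
# instance: copy predictor.cls over, then register a classifier function
# that loads it from disk.
def make_classifier_config(model_path):
    """Build the function config for a previously trained classifier file."""
    return {
        "type": "classifier",
        "params": {"modelFileUrl": "file://" + model_path},
    }

config = make_classifier_config("predictor.cls")
# on the second instance: mldb.put("/v1/functions/predictor", config)
```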