    Gediminas Žylius
    @gediminaszylius
    understood, thank you :)
    Gediminas Žylius
    @gediminaszylius
    So I tried to train a regressor for one of multiple outputs (called V3) with the following Python code:
    mldb.put("/v1/procedures/factor_regressor_model", {
    "type": "classifier.train",
    "params": {
    "mode": "regression",
    "trainingData": """
    select
    {* EXCLUDING("""+",".join(["V"+str(i+3) for i in xrange(50)])+""")} as features,
    V3 as label
    from query_factor_table_id
    """,
    "algorithm": "bbdt",
    "functionName":"factor_regressor",
    "modelFileUrl": "file:///mldb_data/factor_regressor.cls"
    }
    })
    with the bbdt and bbdt-5 configurations it gives me the following error and the database crashes (the imported data are gone):

    ResourceError                             Traceback (most recent call last)

    <ipython-input-71-bd023aae1a85> in <module>()
         11     "algorithm": "bbdt",
         12     "functionName": "factor_regressor",
    ---> 13     "modelFileUrl": "file:///mldb_data/factor_regressor.cls"
         14     }
         15 })

    /usr/local/lib/python2.7/dist-packages/pymldb/__init__.pyc in inner(*args, **kwargs)
         16         result = add_repr_html_to_response(fn(*args, **kwargs))
         17         if result.status_code < 200 or result.status_code >= 400:
    ---> 18             raise ResourceError(result)
         19         return result
         20     return inner

    ResourceError: '502 Bad Gateway' response to 'PUT http://localhost/v1/procedures/factor_regressor_model'

    {
        "error": "MLDB Unavailable",
        "details": {
            "message": "MLDB is unable to respond either because it has not finished booting or because it has crashed and has not finished rebooting. Recent log messages are available via HTTP at /logs/mldb"
        },
        "httpCode": 502
    }

    but it works with the bdt algorithm configuration
    what could be the problem?
    Jeremy Barnes
    @jeremybarnes
    @GedasZ would it be possible to send us the dataset (in private is OK) so that we can debug it? Boosting on regression tasks requires special care as the margin is ill-defined and numerical errors can creep in due to the exponentiation in the boosting loss. It shouldn't crash the DB, however, and we'd like to fix that. I would run bagged decision trees, not bagged boosted decision trees, for a regression problem.
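    For reference, the configuration reported to work earlier in the thread differs from the failing one only in the "algorithm" parameter; this is a sketch based on the snippet above, not a verified setup:

    mldb.put("/v1/procedures/factor_regressor_model", {
        "type": "classifier.train",
        "params": {
            "mode": "regression",
            "trainingData": """
                select
                    {* EXCLUDING(""" + ",".join(["V" + str(i + 3) for i in xrange(50)]) + """)} as features,
                    V3 as label
                from query_factor_table_id
            """,
            "algorithm": "bdt",  # bagged decision trees, as recommended for regression
            "functionName": "factor_regressor",
            "modelFileUrl": "file:///mldb_data/factor_regressor.cls"
        }
    })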
    Gediminas Žylius
    @gediminaszylius
    Also when I went to the log file I saw "MLDB exited due to signal 11 SIGSEGV: segmentation fault (internal error)"
    Gediminas Žylius
    @gediminaszylius
    it seems that the random forest (bagged decision trees) implementation for regression outputs -NaN values (after importing into a Python pandas DataFrame), while for example the same configuration in xgboost works fine. I also don't see whether it is possible to select a cost optimization function. So I assume that for regression problems MSE is implemented?
    Jeremy Barnes
    @jeremybarnes
    If you use bagged decision trees, then regression should work. The only cost function implemented to calculate split points in decision trees for regression is MSE, you are right about that. What other function have you had success with that you'd like to use? In general, in MLDB we try to avoid having too many options so that it's possible to have reasonable results without doing any work on parameter tuning.
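    As a standalone illustration of the MSE criterion mentioned above (a sketch, not MLDB's actual implementation), a regression tree picks the split threshold that minimizes the summed squared error of the two children:

    import numpy as np

    def best_mse_split(x, y):
        """Return (threshold, sse) for the best MSE-based split on feature x."""
        order = np.argsort(x)
        x, y = x[order], y[order]
        best_threshold, best_sse = None, np.inf
        for i in range(1, len(x)):
            left, right = y[:i], y[i:]
            # summed squared error of each child around its own mean
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_threshold, best_sse = (x[i - 1] + x[i]) / 2.0, sse
        return best_threshold, best_sse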
    Jeremy Barnes
    @jeremybarnes
    I just verified; the segfault in MLDB on boosted decision trees is due to boosting not having a suitable cost function for regression implemented. In the next release, it will return an error instead of segfaulting.
    in general, the naive approach (calculating the margin as the L1 norm of the difference between the prediction and the output) is problematic, as the boosting loss function is exponential in that difference, and it can never go to zero, hence we end up with huge error terms. We would probably need some kind of acceptable error zone as a hyperparameter to make it work.
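    A small numeric illustration of that point (an assumed formulation, not MLDB code): if the regression margin is taken as the negative absolute residual, the exponential boosting loss exp(|y - f(x)|) bottoms out at 1 and explodes for large residuals:

    import numpy as np

    residuals = np.array([0.0, 0.5, 2.0, 10.0, 50.0])  # |y - f(x)|
    print(np.exp(residuals))  # [1.0, 1.65, 7.39, 2.2e+04, 5.18e+21] -- never reaches 0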
    for the bagged decision trees with the NaNs, could you show us what you did?
    Dan Croitoru
    @cromodnet
    new to MLDB: trying to install it. What is MLDB_IDS (the user id I need to put in the command to install into a docker container, -e MLDB_IDS="id")?
    Jeremy Barnes
    @jeremybarnes
    It's the user ID of the user who you want to own the files that MLDB produces in mldb_data. Without it, those files end up owned by root or an unknown user ID which makes it hard to work both inside and outside of MLDB at the same time.
    Dan Croitoru
    @cromodnet
    Thanks Jeremy, but whatever I put as the ID, I always obtain the same error message: usermod: invalid user ID 'xxx'
    Jeremy Barnes
    @jeremybarnes
    @cromodnet if you type id in your shell, what do you get as output?
    Vinay Agarwal
    @vinkaga
    Does mldb have inception-v4 model yet?
    François Maillet
    @mailletf
    @vinkaga you're talking about inception resnet2 ? not yet. there is some work that needs to be done in mldb to support it. we’ve started but don’t have an ETA on it
    Vinay Agarwal
    @vinkaga
    @mailletf Thanks
    Da Kuang
    @dkuang1980
    @mailletf Nice talk yesterday at Pycon Canada! I am Da from SurveyMonkey, very nice to meet you.
    François Maillet
    @mailletf
    hey @dkuang1980 thanks for coming to the talk and it was also good to meet you. feel free to shoot me an email if you have any questions and I'd be happy to jump on a call with you if you want to explore the possibility of using mldb
    Simon Narang
    @simonnarang
    this is super cool
    Simon Narang
    @simonnarang
    I want to run it on AWS. Do I just pay for the AWS hosting and you guys take a portion of that, or do I need to pay an additional cost?
    François Maillet
    @mailletf
    Hey @simonnarang. If you're using the open-source version, you can run it wherever you want and you don't owe us anything. If you want to run the docker image of the enterprise version, then there is a license fee for MLDB, plus any hosting costs that your provider will charge. We also offer a hosted version of MLDB where you essentially don't have to deal with the hosting aspect of it and we manage it all. Can you shoot me an email at francois@mldb.ai and we can go over things in more detail?
    Simon Narang
    @simonnarang
    sure
    Gediminas Žylius
    @gediminaszylius
    Hello, I wanted to ask how preprocessing of the image is performed when using the deepteach demo with the TensorFlow Inception v3 model. Is it cropped to the required dimensions, resized (interpolated if the dimensions are too small), or handled somehow differently?
    Simon Narang
    @simonnarang
    I think the example just uses an image that is already in the right dimensions
    Jeremy Barnes
    @jeremybarnes
    It is resized to the correct dimensions, as part of the model graph
    Gediminas Žylius
    @gediminaszylius
    And what about when the image is too small? Does this part of the graph fill missing values with some constant, or generate them by nearest-neighbour interpolation, for example?
    Jeremy Barnes
    @jeremybarnes
    I believe that it will interpolate the missing pixels.
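    As a rough, generic illustration of what interpolation does when upscaling a too-small image (this uses scipy for demonstration and is not the actual resize op inside the Inception graph):

    import numpy as np
    from scipy.ndimage import zoom

    small = np.arange(16, dtype=float).reshape(4, 4)  # a tiny 4x4 "image"
    upscaled = zoom(small, 299 / 4.0, order=1)        # order=1: linear interpolation between neighbours
    print(upscaled.shape)                             # (299, 299); missing pixels are interpolated, not padded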
    Gediminas Žylius
    @gediminaszylius
    hi, I'm still confused about when I can access function input values via $ and when I cannot. For example, when I use the transpose function after a FROM statement with $ inputs I get this error:
    {
        "httpCode": 500,
        "error": "Binding context MLDB::SqlExpressionMldbScope does not support bound parameters ($1... or $name)"
    }
    mldb.put("/v1/functions/nearestneighbors", {
    "type": "sql.query",
    "params": {
    "query": "SELECT * FROM transpose((SELECT nearest
    " + collection + "({coords: inception({url: $URL})})[distances] AS *))",
    "output": "FIRST_ROW"
    }
    })
    I want to get the transposed output of the embedding dataset that I get from an embedding.neighbors type function
    François Maillet
    @mailletf
    if you execute the same query in an mldb.query() does it work?
    Gediminas Žylius
    @gediminaszylius
    yes, when I hardcode the input instead of $URL it works fine
    François Maillet
    @mailletf
    so it's actually a known issue with parameters not making it into the subselect in an sql.query. The best workaround I can suggest is to change your calls to your nearestneighbors function into calls to /v1/query, putting in the complete SQL with your SELECT * FROM transpose() and adding a LIMIT 1 to get essentially the same output as you would with sql.query
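    A sketch of that workaround, assuming pymldb's mldb.query() and a hypothetical image_url variable substituted in place of the $URL parameter:

    # Instead of calling the sql.query function with a $URL parameter,
    # run the full query directly and append LIMIT 1 to mimic FIRST_ROW.
    query = """
        SELECT * FROM transpose((
            SELECT nearest%s({coords: inception({url: '%s'})})[distances] AS *
        ))
        LIMIT 1
    """ % (collection, image_url)

    result = mldb.query(query)  # returns the single transposed row as a DataFrame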
    Gediminas Žylius
    @gediminaszylius
    understood, thanks
    Gediminas Žylius
    @gediminaszylius
    question about the embedding dataset: is there a way to calculate approximately how much RAM an embedding dataset will take? For example, using your Inception model with second-to-last-layer features (~2k per instance) for kNN calculation on 10000 images, RAM consumption increases by about 2.5GB after this code:
    mldb.put("/v1/procedures/embedder", {
    "type": "transform",
    "params": {
    "inputData": """
    SELECT inception({url: URL}) AS *
    FROM images_tmp LIMIT 10000
    """,
    "outputDataset": {
    "id": "embeddedimages%s" % collection,
    "type": "embedding"
    }
    }
    })
    so I guess for 100ks of images this kind of embedding would require a huge amount of RAM for a kNN application, and there is no way to do it more efficiently?
    Gediminas Žylius
    @gediminaszylius
    also I see that RAM consumption does not differ much when I use a tabular dataset instead of embedding, so the reason is maybe not the embedding but how the dataset is stored, am I right?
    François Maillet
    @mailletf
    @GedasZ sorry for not answering earlier. There's no way to ask MLDB how much memory a resulting operation would take. However, things should be at worst linear, so you can infer the size from doing the operation on a few datapoints. The increase of RAM you mention seems bigger than it should be. We'll make some checks on our side and I'll let you know
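    A rough back-of-the-envelope check of that expectation (assuming a ~2048-dimensional embedding per image stored as 8-byte doubles, and ignoring MLDB's per-row and per-cell overhead):

    rows, dims, bytes_per_value = 10000, 2048, 8
    raw_gb = rows * dims * bytes_per_value / (1024.0 ** 3)
    print(raw_gb)  # ~0.15 GB of raw values, well under the 2.5 GB observed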
    Dr. Di Prodi
    @robomotic
    hello guys, I am new to mldb and I would like to add basic feature ranking functionality, from basic functions like counting attributes to mutual information ranking. What are the guidelines for implementing those, e.g. function vs procedure, native vs plugin, etc.?
    François Maillet
    @mailletf
    hey @robomotic. potentially you want a combination of both functions and procedures. Functions can be used to take one or many columns in a row and return one or many columns back, while a procedure can take the result of a whole query and return a new dataset or any other artefact. As for native vs plugin, I'd go with plugin. We have a sample plugin you can base your code on here: https://github.com/mldbai/mldb_sample_plugin
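    A minimal sketch of that function-vs-procedure split, with hypothetical names (square_it, my_column, my_dataset, my_squares): a function maps input values to output values row by row, while a procedure consumes a whole query and materializes a new dataset.

    # Function: maps named input values to output values, usable inside any SELECT
    mldb.put("/v1/functions/square_it", {
        "type": "sql.expression",
        "params": {
            "expression": "x * x AS x_squared"
        }
    })

    # Procedure: runs a query over a whole dataset and writes the result out
    mldb.put("/v1/procedures/make_squares", {
        "type": "transform",
        "params": {
            "inputData": "SELECT square_it({x: my_column}) AS * FROM my_dataset",
            "outputDataset": {"id": "my_squares", "type": "tabular"}
        }
    })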
    paperstack
    @paperstack
    Hello... We need to use Pandas version 0.18 or greater (maybe via an experimental.external procedure). Is there a way to do so? My assumption was that experimental.external would use the global Python interpreter, so I pip installed pandas 0.19, but the plugin keeps throwing errors relating to pandas 0.17.
    François Maillet
    @mailletf
    hey @paperstack. you’re using pandas in the notebook running on mldb?
    Dr. Di Prodi
    @robomotic
    hello guys, I can't find the Amazon AMI for launching an EC2 instance