    Henry Saputra
    @hsaputra
    Yay, it is working =)
    Holger Peters
    @HolgerPeters
    hi
    I recently reported a bug dmlc/xgboost#1995 and provided a fix with PR dmlc/xgboost#1996
    was wondering if there's anything else to do for a contribution (like writing to a mailing list etc)?
    Denis M Korzhenkov
    @denkorzh
    Hi!
    I've opened an issue #2140 to provide a possibility to have row id in prediction file. Unfortunately it's not a priority now.
    Are there any enthusiasts ready to develop this option for CLI mode?
    Priyanka Goyal
    @goyalpri
    I am working on a big data analytics project which has frequent updates. I have to perform a lot of analytical queries. Can you guys suggest a tech stack I should use for the project?
    I am thinking HBase and Hadoop but since I'm new to big data world, I'm kind of confused. Thanks in advance.
    geoHeil
    @geoHeil
    How big is big? How fast do you need to process updates? Streaming / realtime (sub-second to minutes) or batch queries?
    Priyanka Goyal
    @goyalpri
    ~1 insert/query per minute.
    and ~2 million db records.
    geoHeil
    @geoHeil
    How many columns / terabytes of data? If it is small, < 1TB (max up to 5TB), probably a single machine could hold it all in memory (for very small $$). If you consider your data too big (but still think it all fits into memory, i.e. fast queries), maybe https://www.snappydata.io is good. For medium-sized data Presto and Impala will be interesting, and for very big data Hive. What would you expect for query latency? Certainly Spark (unless you are using petabytes of data) would be an interesting fit as well, and with something like SnappyData probably sort of usable for fast one-off queries.
    Priyanka Goyal
    @goyalpri
    great, thanks
    but since I'm more familiar with SQL, can I do something transactional with low latency in hadoop cluster?
    yes, I just have GBs of data
    geoHeil
    @geoHeil
    Transactional usually scales badly. Most often eventual consistency is used. But for OLTP maybe HBase is a fit. But that is still just a NoSQL key-value store.
    Priyanka Goyal
    @goyalpri
    So can I write SQL for analytical and HBase for database?
    Does that mix well?
    geoHeil
    @geoHeil
    If it is only a couple of hundreds of GB you really should think about whether Hadoop is the right solution. Anyway, regarding SQL: do you need transactions or not? If you just want to output the results of analytical queries, probably not. Check https://hbase.apache.org/acid-semantics.html for HBase in case transactions are mandatory. Maybe https://phoenix.apache.org is a fit for your needs.
    Priyanka Goyal
    @goyalpri
    yeah, phoenix is awesome
    I think I'll go with phoenix and hbase. thank you so much.
    you're so kind and intelligent
    skywalkerytx
    @skywalkerytx
    hi hi, how to make feval in xgb.train() accept multiple custom eval functions?
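    A hedged sketch of one workaround (the metric names and the MockDMatrix stand-in below are illustrative, not from xgboost): feval normally has the signature feval(preds, dtrain) -> (name, value), and recent xgboost versions also accept a list of (name, value) tuples, so a single wrapper can report several custom metrics at once. If your version rejects the list form, you may need to fold the metrics into one callable instead.

    ```python
    import numpy as np

    def rmse(preds, dtrain):
        labels = dtrain.get_label()
        return "rmse", float(np.sqrt(np.mean((preds - labels) ** 2)))

    def mae(preds, dtrain):
        labels = dtrain.get_label()
        return "mae", float(np.mean(np.abs(preds - labels)))

    def combined_feval(preds, dtrain):
        # Evaluate every metric and return them together; a list of
        # (name, value) tuples is treated as multiple eval results.
        return [f(preds, dtrain) for f in (rmse, mae)]

    class MockDMatrix:
        # Stand-in exposing get_label() so the sketch runs without xgboost.
        def __init__(self, labels):
            self._labels = np.asarray(labels, dtype=float)

        def get_label(self):
            return self._labels

    dtrain = MockDMatrix([0.0, 1.0, 2.0])
    results = combined_feval(np.array([0.0, 1.0, 1.0]), dtrain)
    ```

    With a real DMatrix you would pass it as xgb.train(params, dtrain, feval=combined_feval, ...).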
    aprelkin
    @aprelkin
    hey guys, does anyone know how to solve this error when accessing through xgboost4j on Windows?
    "dmlc-core/include/dmlc/logging.h:235: [10:17:46] src/io/input_split_base.cc:190: file offset not calculated correctly", although the file looks good.
    Mateusz Dymczyk
    @mdymczyk
    hey, been trying (and failing) to implement distributed XGBoost4j using Rabit (like you guys did for Spark and Flink). Do I just need to start a RabitTracker, pass the rt.getWorkerEnvs() to all subnodes, start Rabit.init(env), do train and Rabit.shutdown() on each and just call rt.waitFor(0) on the driver, or is there something more to it? Seems my train instances aren’t communicating with each other and are just training on local data
    Mateusz Dymczyk
    @mdymczyk
    ok seems it only isn’t working on MacOS, works fine on Linux
    Jonathan Hourany
    @JonathanHourany
    Hello! I'm sorry if I missed this information somewhere on Google, but what's the best way to remove xgboost after a global install with sudo python setup.py install? I didn't realize there was a pip installable package already and I'd rather do that in my virtualenv
    El-Hassan Wanas
    @foocraft
    @JonathanHourany Could you try sudo pip uninstall xgboost
    Jonathan Hourany
    @JonathanHourany
    @foocraft Thanks for the reply. I did, and pip kicked back with a package-not-found error
    Sergei Lebedev
    @superbobry
    Hi! A question on the JVM API: why are there two overloads for setBaseMargin in Booster? One where the margin is Array[Float] and the other one for the nested case Array[Array[Float]]?
    David Hirvonen
    @headupinclouds
    It looks like v0.6.0 is the last stable release. This is about one year old. I'd like to add an update to the hunter (CMake) package manager (last version was 0.4.0), and am curious if this is the recommended version, if a new release is planned in the near future, or if there is some more recent tested tag/commit that should serve as a stable release point. Thanks!
    El-Hassan Wanas
    @foocraft
    Hi all, I've been having an issue for around 2 months now and it's been reported multiple times, #2286. I'm wondering if there's a fundamental reason why this has to happen
    I checked the code, and it seems that it occurs while preparing histograms
    More interestingly, when I set missing to some number, e.g. -9999, in the xgboost model parameters and pass a dataset that doesn't have missing values, AUC drops from 0.65 to 0.5. This is possibly an unrelated issue, but it seems handling of missing values causes multiple problems
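    For context on the missing-value collision suspected above: DMatrix treats every entry equal to the missing sentinel as absent, so if a genuine feature value happens to equal it, that entry silently disappears. A minimal numpy illustration of that masking semantics (the array and sentinel here are made up, not from the reported issue):

    ```python
    import numpy as np

    X = np.array([[1.0, -9999.0],
                  [2.0,     3.0]])
    missing = -9999.0

    # Entries equal to `missing` are the ones xgboost would treat as absent;
    # a genuine feature value colliding with the sentinel is lost the same way.
    absent = (X == missing)
    n_absent = int(absent.sum())
    ```

    Picking a sentinel guaranteed never to occur in the data avoids the collision.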
    lesshaste
    @lesshaste
    import xgboost gives the warning cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    but it looks like it should be fixed in rhiever/tpot#284
    was that never merged?
    Dmitry Mottl
    @Mottl
    Hi, everybody! Could someone help me with xgboost parameters? I have the following code:
    xgb = XGBRegressor(n_estimators=1000, silent=0)
    xgb.fit(train.as_matrix(), trainY, verbose=1, eval_metric="rmse")
    P = xgb.predict(test.as_matrix())
    it doesn't output the RMSE metric during training. Where am I wrong?
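    One likely answer, as a hedged sketch: the sklearn wrapper only prints per-iteration metrics when fit() is given an eval_set to score against; eval_metric alone produces no output. The synthetic data below is illustrative, and note that newer xgboost versions take eval_metric in the constructor rather than in fit():

    ```python
    import numpy as np
    from xgboost import XGBRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(100, 4)
    y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + 0.1 * rng.randn(100)

    model = XGBRegressor(n_estimators=10, eval_metric="rmse")
    # verbose=True prints the RMSE on each eval_set pair every boosting round.
    model.fit(X, y, eval_set=[(X, y)], verbose=True)
    P = model.predict(X)
    ```

    Passing a held-out (X_val, y_val) pair in eval_set instead of the training data gives a more honest metric.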
    Ketan Kunde
    @KetanKunde_twitter
    Hi
    i was looking to build xgboost from source
    just wanted to know if anyone on the group has tried doing it before?
    Ketan Kunde
    @KetanKunde_twitter
    also just wanted to confirm whether this is 100% open source
    Guryanov Alexey
    @Goorman
    Hello. Does anyone know if I can slice xgboost's DMatrix by column or block certain features from being used in a specific train instance?
    Chris Chow
    @ckchow
    @Goorman it's probably easier to make a new DMatrix with those rows removed or censored in whatever way you need.
    lesshaste
    @lesshaste
    how can you use the pearson correlation coefficient as the loss function with the xgboost regressor?
    Guryanov Alexey
    @Goorman
    @ckchow you probably meant columns removed, and yes, this is the only solution I see right now. The problem is that I have to construct the DMatrix from a sparse libsvm file, and, for example, to perform greedy feature selection I would have to create a new (big) libsvm file every iteration. Which is annoying.
    Chris Chow
    @ckchow
    Oh, I see. Can't you construct DMatrices in memory from arrays of arrays?
    Chris Chow
    @ckchow
    At least in Java there is a float[][] constructor, and I think there's a numpy constructor in Python as well. You might be out of luck if you're using the command line version.
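    To make the column-subset idea concrete, a hedged Python sketch (the data here is synthetic; with a real libsvm file, sklearn.datasets.load_svmlight_file yields a scipy CSR matrix that slices the same way): keep the full matrix in memory and rebuild a DMatrix from a column subset each iteration, instead of rewriting the libsvm file.

    ```python
    import numpy as np

    X = np.arange(20.0).reshape(4, 5)   # stand-in for data loaded from libsvm
    keep = [0, 2, 4]                    # columns allowed in this train instance

    # Select the column subset in memory, then hand it to xgboost, e.g.:
    #   dtrain = xgboost.DMatrix(X[:, keep], label=y)
    X_sub = X[:, keep]
    ```

    Each greedy-selection step then only pays for one in-memory slice plus one DMatrix construction, not a file rewrite.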
    lesshaste
    @lesshaste
    hi... does anyone understand why xgboost is so slow if you have lots of classes? This code shows the problem https://bpaste.net/show/f7573b5a2fb9 RandomForestClassifier takes about 15 seconds
    but xgboost never terminates at all for me
    Lyndon White
    @oxinabox
    I am training a binary classifier.
    In the problem I am working on,
    I can generate more training data at will.
    In that by running a simulation I can (deterministically) determine the correct label for any feature set.
    Each training case takes a bit to generate (say 0.5 seconds).
    The main motivation for training a classifier is that evaluating via simulation takes too long.