Dhwaj Sharma
Muhammad Yasir
i have a question regarding dataset imbalance
so i will elaborate
Muhammad Yasir

I have a dataset which is for binary classification ( or at least we are approaching it from a binary classification perspective )

There are a total of 2.5 million rows, with label 0 belonging to around 2,200,000 (2.2 million) rows and label 1 belonging to around 321,000 (0.3 million) rows. There are around 45 features.

The imbalance approaches a ratio of around 1 : 7

My problem is very straightforward, even WITHOUT any data preprocessing if i try to classify the data

the classification algorithms, no matter what parameters are set, give around 99% in ALL performance metrics ( accuracy, precision, recall, f1 score etc )

This would probably suggest a bad case of overfitting but i am not sure, feel free to explain and add your opinion to what could be the reason

I tried to visualize the data using t-SNE and saw that the entire data is shaped like an ellipse and there is heavy overlap between both the labels. This means that (1) data is badly imbalanced (2) data is badly overlapped. I highly doubt I can use anomaly detection there, as all the 'anomalies' (label 1) are sitting close to the 'normal' (label 0) data

any suggestions on how i should proceed ?
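
One way to read the numbers above: a model that always predicts label 0 would already score about 87% accuracy on these class counts, yet its recall and F1 for label 1 would be zero. So imbalance alone does not produce 99% on every metric, and it may be worth checking for a feature that leaks the label or for evaluation on rows the model was trained on. The sketch below is only an illustration in plain Python; `binary_metrics` is a hypothetical helper, and the counts are the approximate ones quoted in the question.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Approximate class counts from the question (assumed, not exact).
n_label0 = 2_200_000
n_label1 = 321_000

# Baseline "model" that predicts label 0 for every row:
# no true or false positives, every label-1 row is a false negative.
baseline = binary_metrics(tp=0, fp=0, fn=n_label1, tn=n_label0)

for name, value in baseline.items():
    print(f"{name}: {value:.3f}")
```

If a real model reports 99% precision and recall for label 1 on a properly held-out split, comparing it against this baseline shows that the result cannot come from predicting the majority class.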

Dr. Muhammad Anjum
@SyedMuhamadYasir Hi Dear I need some help plz
Hi, I need help with the API Key
HELP any chance somebody can help with this mysql install/config/socket error?
I am running "mysql_secure_installation" and getting
sudo mysql_secure_installation

Securing the MySQL server deployment.

Enter password for user root:
Error: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
Michael Li 🚀Publish Reproducible Jupyter Notebook
Does anyone else find Matplotlib's API hard to remember? I have friends who just export their Pandas data and plot with Excel. I spend a lot of time googling Matplotlib help. How do you make your Python Data Plots?
Josh Goldberg
Yes. Matplotlib is not an intuitive API in my opinion. I prefer ggplot2.
@becausealice2 @erictleung Still here after all these years!
Alice Jiang
@GoldbergData Kinda sorta.... I check in every once in a while but I've made a career change and have been busy trying to keep up with all that comes with that. How have you been? Any interesting projects?
I have a question. Let's say a car has a thrust output in the range [0, 1] and can steer left and right with an output range of [-1, 1], and an ANN should find the right values. Let's say the ANN should have the output [thrust, steering angle]. Can I just split the raw output of the last linear layer and apply sigmoid to the first output and tanh to the second output, or does the activation have to be the same for all entries of the output of the last linear layer?
Quincy Larson

Hey @/all freeCodeCamp is building a data science curriculum with advanced math and data science projects. Learn more here: https://www.freecodecamp.org/news/building-a-data-science-curriculum-with-advanced-math-and-machine-learning/

We are looking for open source contributors and experienced math + CS teachers for (paid) help with instructional design. If you are interested, please reach out to me at quincy@freecodecamp.org

Eric Leung
@theunknown22:matrix.org that's an interesting approach, where you'd apply different activation functions on nodes in a single layer. It appears to be possible, at least in PyTorch https://discuss.pytorch.org/t/control-specific-nodes-in-the-layer/78992. It hypothetically could help and give you more range on what is possible for it to predict. But it will make it more difficult to understand. I hope that helps.
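
The idea of using a different activation per output can be sketched without any framework. This is a minimal illustration in plain Python, assuming the last linear layer returns two raw numbers; `control_head`, `thrust_raw` and `steer_raw` are hypothetical names, not part of any library.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, maps any real number into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def control_head(raw):
    """Split the two raw outputs of the last linear layer and squash each one
    into the range its control signal needs."""
    thrust_raw, steer_raw = raw
    thrust = sigmoid(thrust_raw)   # range [0, 1]
    steer = math.tanh(steer_raw)   # range [-1, 1]
    return thrust, steer


print(control_head((0.0, 0.0)))
print(control_head((2.5, -1.2)))
```

In PyTorch the same split would amount to calling `torch.sigmoid` on the first column and `torch.tanh` on the second column of the layer output; both operations are differentiable, so gradients still flow through each branch.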
Josh Goldberg
@becausealice2 I still work in the field. No interesting open source projects at this time.
Piyush Hirapara
How hard is it to animate bivariate Gaussian distribution by varying mean, individual variance and covariance?
In Python
Eric Leung
@GoldbergData good to see you around! I couldn't help but notice you're at Amazon now. Are you working mostly in R or Python (or none of the above) these days?
Eric Leung
@Piyush-97 if you're using Jupyter Notebooks, you could consider using Jupyter Widgets to create a slider that can vary mean, variance, and maybe covariance for a distribution you want to create. You can probably then recreate visualizations like this https://rpsychologist.com/cohend/
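
The numerical part of such an animation is small: each frame is the bivariate Gaussian density evaluated on a grid for one setting of the means, variances and covariance. Below is a rough sketch in plain Python under that assumption; `bivariate_normal_pdf` and `density_grid` are hypothetical helper names, and the grid size and parameter sweep are arbitrary choices.

```python
import math


def bivariate_normal_pdf(x, y, mean_x, mean_y, var_x, var_y, cov_xy):
    """Density of a bivariate Gaussian with covariance [[var_x, cov_xy], [cov_xy, var_y]]."""
    det = var_x * var_y - cov_xy ** 2
    if var_x <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x - mean_x, y - mean_y
    # Quadratic form (v - mu)^T Sigma^{-1} (v - mu), written out for the 2x2 case.
    quad = (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


def density_grid(params, lo=-3.0, hi=3.0, steps=25):
    """Evaluate the density on a square grid; rows index y, columns index x."""
    axis = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return [[bivariate_normal_pdf(x, y, **params) for x in axis] for y in axis]


# One frame per covariance value, sweeping from -0.8 to 0.8 with unit variances.
frames = []
for step in range(9):
    params = dict(mean_x=0.0, mean_y=0.0, var_x=1.0, var_y=1.0,
                  cov_xy=-0.8 + 0.2 * step)
    frames.append(density_grid(params))

print(len(frames), "frames of", len(frames[0]), "x", len(frames[0][0]), "values")
```

Each grid in `frames` could then be drawn as a contour or heat map, either from a slider callback (for example with `ipywidgets.interact`) or frame by frame with `matplotlib.animation.FuncAnimation`.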

I digitized some roads as multilines, hospitals as multipoints, and the boundary as a polygon, taking the latitudes and longitudes from Google Maps. I then computed how many roads intersect using the Simple Features (sf) package and plotted the result using ggplot2. That worked well.

I then wanted to check and plot how many roads intersect with a hospital, so I created a 200 m buffer around it and tried using the st_intersects() function for this. It only gave 1:1 as the answer, along with a message saying

Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
 1: 1

And when I tried plotting it using ggplot, it gave this error message

Error: data must be a data frame, or other object coercible by fortify(), not an S3 object with class sgbp/list
Run rlang::last_error() to see where the error occurred.

I have added more details and code in a Stack Overflow question, please help 🥺.

Link as Plaintext: https://stackoverflow.com/questions/67350113/unable-to-plot-intersections-using-st-intersects-in-r

Alice Jiang
@GoldbergData Good on you! I was losing my mind over the local culture in the field and finally just threw in the towel. It's been a couple years since I looked at any data and I've honestly been missing it. I may take it back up as a hobby just to scratch that itch.
Hi, I am new to the concept of Gitter / channels. Is it OK to ask any question related to data science here right away, e.g. regarding a lavaan CFA model?
Hi, Is this group active?
Josh Goldberg
@becausealice2 culture can be disheartening. I see your point. Nice to hear from you. Definitely take a stab at some data stuff if it interests you. 🙂
Vicky Mbaka
Hello people. I'm a new coder studying data science and I'm joining this community in order to get exposure and learn from experts within the group.
Hello guys, I'm new to machine learning and I'm a little bit confused: is cross validation used to find the best hyperparameters, a well-fitting model, or both?
Josh Goldberg
@elhafayn Cross validation is typically used to determine the best hyperparameters of your model. It partitions your training data so you can get multiple data points for your metrics to determine your best parameters. You could do cross validation without parameter search also, but the former use-case is more typical, I think.
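
As a rough illustration of the parameter-search use described above, here is a sketch of k-fold cross validation in plain Python. Everything in it is an assumption made for the example: the toy data, the nearest-neighbour regressor, the candidate values, and the helper names `k_fold_indices`, `knn_predict` and `cv_error`.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def knn_predict(train_x, train_y, x, k):
    """Predict by averaging the targets of the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_error(xs, ys, k, n_folds=5):
    """Mean squared validation error of k-NN, averaged over the folds."""
    fold_errors = []
    for train, val in k_fold_indices(len(xs), n_folds):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        squared = [(knn_predict(train_x, train_y, xs[i], k) - ys[i]) ** 2 for i in val]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / len(fold_errors)


# Toy data: a noisy quadratic, generated with a fixed seed.
rng = random.Random(42)
xs = [i / 10 for i in range(40)]
ys = [x * x + rng.gauss(0, 0.5) for x in xs]

# Score each candidate hyperparameter with the same folds and keep the best one.
candidate_ks = [1, 3, 5, 9, 15]
scores = {k: cv_error(xs, ys, k) for k in candidate_ks}
best_k = min(scores, key=scores.get)
print("cross-validated MSE per k:", scores)
print("selected k:", best_k)
```

Calling `cv_error` for a single fixed `k` corresponds to the other use mentioned above: estimating how well one model fits, without any parameter search.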
Daniel Enemona Adama
Hi, I’m Daniel
Munsif Raza
Hello everyone, I'm Munsif Raza
Mery Me
I want to learn
Can you help me
Mery Me
Thank you @jaralasad
So from where can I start?
You want to learn Data Science?
Jesse Smith
I want to build
Mery Me
Alyona Kavyerina

Hi everybody,
Here is the first comparative benchmark and benchmark framework for vector databases / search engines - https://qdrant.tech/benchmarks/

It is an open-source project, so you can reproduce it, propose your modifications or new engines to compare, etc. - https://github.com/qdrant/vector-db-benchmark

Hello, everyone. I want to learn Data Science. Where should I start?
Data science is a difficult field as it requires skills and knowledge, but it is definitely possible to learn it, especially if you are motivated. I became interested in data science for sustainable business solutions when I started a company. If you are a complete beginner, I recommend doing a Codecademy course.
Jonathan Bennett
Hey there folks.
I'm new to Gitter, so this is my first post... I'm looking for advice on the potential career prospects of advanced-level autodidactic I. expertise
Jonathan Bennett
*advanced-level autodidactic experts in I.T. I am self-educated in a vast range of technical skills, from DevOps, to education, and beyond. I have developed my skills over 20+ years. I am a disabled person, looking to begin working from home as a technical professional. I would be interested to hear from anyone in this thread with regard to the potential career prospects for someone who lacks accreditation/certification, yet possesses a vast array of highly advanced technical expertise.
Jonathan Bennett
Could anyone in this thread possibly discuss with me the most important skills required of data scientists, the industry standards required of junior developers in data science, and any learning resources I might use to develop solid foundations in data science, mathematics, and other fields pertaining to data science?