These are chat archives for FreeCodeCamp/DataScience

13th
Jan 2019
Vinay Banakar
@VinayBanakar
Jan 13 07:30
Hi, I have 2M images (the dataset) that each need a 1-to-1 comparison against 1000 images (the target set) for a similarity check. I have worked out the similarity algo, and it takes about 600 ms per 1-to-1 comparison, so as you can see, brute force is going to have a lot of latency. Can you suggest any designs I could implement to improve throughput here? My initial thought is to split the 2M 1-to-1 comparisons across multiple threads, and then run a map-reduce over all the threads working on a single image to deduce the results. But I think there are better ways to do this.
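A minimal sketch of the split-then-reduce idea described above, not a definitive design. The `similarity` function is a hypothetical stand-in for the real 600 ms comparison, the "images" are plain numbers, and `map_chunk`, `merge`, and `best_matches` are illustrative names. It assumes a non-empty dataset and that a higher score means more similar.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def similarity(a, b):
    """Hypothetical placeholder for the real comparison; higher is more similar."""
    return -abs(a - b)


def map_chunk(offset, chunk, targets):
    """Map step: for each target, the best (score, dataset_index) inside one chunk."""
    return {
        j: max((similarity(d, t), offset + i) for i, d in enumerate(chunk))
        for j, t in enumerate(targets)
    }


def merge(left, right):
    """Reduce step: keep the higher-scoring match for each target."""
    return {j: max(left[j], right[j]) for j in left}


def best_matches(dataset, targets, workers=4, chunk_size=1000):
    """Split the dataset into chunks, score them in parallel, merge the partial results."""
    offsets = range(0, len(dataset), chunk_size)
    chunks = [(o, dataset[o:o + chunk_size]) for o in offsets]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: map_chunk(c[0], c[1], targets), chunks))
    return reduce(merge, partials)
```

Threads only give a speed-up here if the real comparison releases the GIL (native code in an image library often does); for a pure-Python comparison, a process pool would be the usual substitute. Either way this only divides the brute-force cost by the number of workers, it does not reduce the number of comparisons.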
Niranjan Salimath
@srniranjan
Jan 13 11:38
What is your similarity algo? Why aren’t you looking at this as a classification task which has 1000 labels, and then maybe use a CNN or something?
evaristoc
@evaristoc
Jan 13 12:05

Hi @VinayBanakar
@srniranjan has a point. If you are doing similarity, though, it is because the elements of your target set cannot be found in the learning/test set, or because your task doesn't consist of finding them.

A similarity test is applicable if you just want to see how similar they are "overall".

What a CNN does nicely is sampling (by panels). Can you do something similar? Can you then think of a way of indexing? You could index the panels and compare from there. I have also seen hashing used along the same lines.

Then you can try to find a branch-and-bound search algorithm over the sampled panels/hashes, which would be more efficient than comparing all images one by one. That is assuming that similarity has to do with the way the image is displayed (e.g. position).

An indexed branch-and-bound algo is also "parallelizable".

Just an idea.
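A rough sketch of the hashing suggestion above, under stated assumptions. The thread does not name a hash, so this uses a simple average hash as one possible choice. Images are assumed to be already reduced to small grayscale grids (lists of rows of integers), and `average_hash`, `hamming`, and `shortlist` are illustrative names. It shows only the cheap pre-filter, not the branch-and-bound search.

```python
def average_hash(pixels):
    """One bit per pixel of a small grayscale grid: 1 if the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | int(p > mean)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def shortlist(target_hash, dataset_hashes, max_distance=5):
    """Indices of dataset images whose hash is within max_distance bits of the target."""
    return [
        i for i, h in enumerate(dataset_hashes)
        if hamming(target_hash, h) <= max_distance
    ]
```

The dataset hashes would be computed once and stored. Each target is then compared against cheap integer hashes, and the expensive 1-to-1 comparison runs only on the shortlist. Whether this is valid depends on what "similarity" means for these images, which is still an open question in the thread.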

evaristoc
@evaristoc
Jan 13 12:12
It would be interesting to understand what you mean by "similarity" in this case though.
evaristoc
@evaristoc
Jan 13 12:21

People:

I recently wrote an article for fCC Medium:
https://medium.com/me/stats/post/71875ab184ee

Hope to hear your views!

Alice Jiang
@becausealice2
Jan 13 16:37
@pdurbin absolutely! I'm sorry, I thought I replied to this earlier :joy:
Alice Jiang
@becausealice2
Jan 13 17:15
@evaristoc The link is broken for those of us who aren't you. The link should be https://medium.com/post/71875ab184ee
Also, have you become a JS AR dev, now?
Did we lose you? D:
Philip Durbin
@pdurbin
Jan 13 17:21
@becausealice2 no problem. Did you decide if you're coming to DataFest or not?
Alice Jiang
@becausealice2
Jan 13 18:54
I couldn't reschedule my appointment for the 23rd but I signed up to be on a waitlist to be there the 22nd
I won't have a computer, but I remember better when I write by hand anyways.
Philip Durbin
@pdurbin
Jan 13 18:55
I'm sure you can just show up. You'll find a seat.
Alice Jiang
@becausealice2
Jan 13 19:51
:+1:
Alice Jiang
@becausealice2
Jan 13 20:38
Anyone here know SPSS and how to export to CSV without randomly selected observations' features wrapping around and becoming inconsistent new observations?