Rémy Marronnier
@rmarronnier
The WORKFLOWNAME is probably set to Crystal CI but you can change it
Chris Watson
@watzon
Haha I actually had done that for one of the other repos already
My example links to the workflow as well though. I'll see if I can find it.
Also, I'm playing around with the idea of using the Bayes Classifier to do language detection.
Rémy Marronnier
@rmarronnier
Hehe, you beat me to it ;-)
Chris Watson
@watzon
I've got a poc working already
Rémy Marronnier
@rmarronnier

Also, I'm playing around with the idea of using the Bayes Classifier to do language detection.

Have fun !

Wow !
That was fast :-)
Chris Watson
@watzon
Basically it tokenizes a string normally and then takes each of the word tokens and makes them into smaller tokens up to three characters long. That way it can guess the likelihood of a language based on the characters that are next to each other.
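A minimal sketch of that tokenization idea, written in Python rather than Crystal and not taken from Cadmium's code. The function name `char_ngrams`, the `\w+` word split, and the choice to emit every length from 1 to 3 are assumptions made for illustration; the message only says the sub-tokens are "up to three characters long".

```python
import re

def char_ngrams(text, max_n=3):
    """Split text into word tokens, then into character n-grams of length 1..max_n.

    Hypothetical helper: the name and the exact n-gram lengths are assumptions.
    """
    words = re.findall(r"\w+", text.lower())
    grams = []
    for word in words:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the word.
            for i in range(len(word) - n + 1):
                grams.append(word[i:i + n])
    return grams

print(char_ngrams("Hola"))
# ['h', 'o', 'l', 'a', 'ho', 'ol', 'la', 'hol', 'ola']
```

Feeding these short tokens to a classifier instead of whole words is what lets it pick up on which characters tend to sit next to each other in a given language.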
Rémy Marronnier
@rmarronnier
I'm playing with vectors and matrices for the summarizer
Chris Watson
@watzon
Ooh nice
image.png
The best part, it works fairly well on small text samples
These both return the correct answer
Rémy Marronnier
@rmarronnier
Fucking awesome !
Chris Watson
@watzon
I just need some good sample sets to train on now
Rémy Marronnier
@rmarronnier
I'm going to write a proposal for an Evaluation repo for Cadmium.
Chris Watson
@watzon
What kind of Evaluation?
Rémy Marronnier
@rmarronnier
It will be a collection of Crystal scripts that:
1 - Download a dataset
2 - Run a Cadmium::module against it
3 - Compare the results with the known-good values
For example, for language identification : http://www.cs.cmu.edu/~ralf/langid.html
There is a data set of Wikipedia texts in 100+ languages, with the matching ISO codes in separate text files
I ran our current language identification algo against it and got 24% correct results :-p
Our current algo only recognises 400 languages :-)
You can train your classifier on this data set
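The "compare the results with the known-good values" step could look roughly like the sketch below. It is in Python, not Crystal, and it skips the download step entirely. The names `accuracy` and `toy_detect`, and the four sample pairs, are made up for illustration and have nothing to do with the real dataset or with Cadmium's detector.

```python
def accuracy(predict, samples):
    """Return the fraction of (text, expected_iso_code) pairs that predict() gets right."""
    samples = list(samples)
    if not samples:
        return 0.0
    correct = sum(1 for text, expected in samples if predict(text) == expected)
    return correct / len(samples)

# Stand-in detector, purely hypothetical: a real run would call the module under test.
def toy_detect(text):
    return "es" if "ñ" in text else "en"

# Made-up samples; a real evaluation would read texts and ISO codes from the dataset files.
samples = [("mañana", "es"), ("tomorrow", "en"), ("amanhã", "pt"), ("año", "es")]

print(f"{accuracy(toy_detect, samples):.0%}")  # prints 75%
```

Keeping `predict` as a plain callable means the same scoring function can be reused for any module that maps a text to a label.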
Chris Watson
@watzon
Nice!
Chris Watson
@watzon
Holy shit I've even got it differentiating between Spanish and Portuguese
Which are very similar languages
Chris Watson
@watzon
Training on that dataset is taking a very long time :joy:
Rémy Marronnier
@rmarronnier
HahaHa !
Have you tried with MT :-p
Chris Watson
@watzon
Lol not yet
I'd have to make the classifier support it
Interesting note though, training the classifier on huge chunks of text takes an exorbitant amount of time, but training it on smaller chunks is extremely fast.
At first I tried to do classifier.train on the entire corpus for each of the languages and it was taking hours with no end in sight. But I just now tried training line by line instead and it finished in minutes.
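For readers who want to see the line-by-line calling pattern, here is a small Python sketch. `TinyBayes` is a hypothetical stand-in with add-one smoothing and no class priors; it is not Cadmium's BayesClassifier. It only illustrates feeding one line per `train` call, and it does not reproduce or explain the speed difference described above.

```python
import math
from collections import Counter, defaultdict

class TinyBayes:
    """Hypothetical toy naive Bayes classifier over whitespace tokens."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> token counts
        self.vocab = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = text.lower().split()
        best_label, best_score = None, float("-inf")
        for label, counter in self.counts.items():
            total = sum(counter.values())
            # Sum of log-probabilities with add-one (Laplace) smoothing.
            score = sum(
                math.log((counter[t] + 1) / (total + len(self.vocab)))
                for t in tokens
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up two-line corpora, one per language.
corpora = {
    "es": "el gato come pescado\nla casa es grande",
    "pt": "o gato come peixe\na casa é grande",
}

clf = TinyBayes()
for label, corpus in corpora.items():
    for line in corpus.splitlines():  # one train call per line, not per corpus
        clf.train(line, label)

print(clf.classify("el pescado"))  # prints es
```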
Chris Watson
@watzon
Getting close
image.png
Too bad I can't spell
Rémy Marronnier
@rmarronnier
I love this UI !
Chris Watson
@watzon
Me too :) I like having a progress bar
I would love to figure out a way to multithread the workflow and have one progress bar per thread, but idk if it's possible
Chris Watson
@watzon
Training just finished on a massive amount of data
188 languages
Rémy Marronnier
@rmarronnier

I would love to figure out a way to multithread the workflow and have one progress bar per thread, but idk if it's possible

I don't see why not, doesn't apt-get do this when downloading several packages?
Anyway, it's so cool to train your own models with your own code. Congrats !
Chris Watson
@watzon
I'm sure that a multithreaded workload is possible, it's just going to require some heavy refactoring in the actual BayesClassifier class
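One possible shape for such a refactor, sketched in Python and not based on Cadmium's code: each worker counts tokens for a single language into its own `Counter` and reports its own progress fraction, and the results are merged once all workers finish. The names `count_tokens` and `progress` and the sample corpora are assumptions. In CPython, threads will not speed up this kind of CPU-bound counting, so the sketch only illustrates the structure.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_tokens(label, lines, progress):
    """Count tokens for one language and record this worker's progress (0.0 to 1.0)."""
    counts = Counter()
    for done, line in enumerate(lines, start=1):
        counts.update(line.lower().split())
        progress[label] = done / len(lines)  # each worker writes only its own key
    return label, counts

# Made-up corpora, already split into lines.
corpora = {
    "es": ["el gato come pescado", "la casa es grande"],
    "pt": ["o gato come peixe", "a casa é grande"],
}

# One progress slot per worker; a UI could poll this dict to draw one bar per language.
progress = {label: 0.0 for label in corpora}

with ThreadPoolExecutor(max_workers=len(corpora)) as pool:
    futures = [
        pool.submit(count_tokens, label, lines, progress)
        for label, lines in corpora.items()
    ]
    # Merge step: collect each worker's independent counts into one model.
    model = dict(f.result() for f in futures)

print(progress)
print(model["es"].most_common(2))
```

Because no two workers share a counter, the training loop itself needs no locking; only the final merge touches all the results.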
Rémy Marronnier
@rmarronnier
Yeah, I guess. When MT is on by default in Crystal, we'll have to refactor our algos. (Good luck with GloVe :-p)
Chris Watson
@watzon
Luckily I already have the GloVe algo ready to go
Rémy Marronnier
@rmarronnier
yeah, you're right
Chris Watson
@watzon
It's technically multithreaded already, but it's using a library that spins up new threads itself
Rémy Marronnier
@rmarronnier

but it's using a library that spins up new threads itself

I'll have to look again at your code because apart from apatite I don't see any special library

Chris Watson
@watzon
Oh it looks like I never pushed my last update
That's why