Wencan Luo
@wencanluo
Out of the 7341968 names, 99.5% (7302479) are three words or fewer.
Dmitry Mozzherin
@dimus
Sounds about right
Wencan Luo
@wencanluo
Do you have experience querying a big table row by row? Right now I am hitting an 'out of memory' problem.
Dmitry Mozzherin
@dimus
Can you paste your query?
I talked to Paddy yesterday. He said he should be available this Friday at 7PM for our meeting; he did not read my email in time. He also said he will add a 'nomenclatural quality' column to his spreadsheet with GN data sources.
Wencan Luo
@wencanluo
select canonical_forms.name, name_string_indices.data_source_id, name_string_indices.classification_path
from name_strings
join canonical_forms
on name_strings.canonical_form_id = canonical_forms.id
join name_string_indices
on name_strings.id = name_string_indices.name_string_id
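One common way to avoid the 'out of memory' problem with a query like the one above is to stream the result in batches instead of loading it all at once. The following is only a minimal sketch: it uses Python's built-in sqlite3 module and a tiny in-memory database as a stand-in, since the chat does not say which database driver is in use, and the helper name `iter_rows` and the sample rows are hypothetical.

```python
import sqlite3

def iter_rows(conn, query, batch_size=1000):
    """Yield rows one at a time while fetching from the cursor in batches."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        while True:
            batch = cur.fetchmany(batch_size)  # at most batch_size rows in memory
            if not batch:
                break
            for row in batch:
                yield row
    finally:
        cur.close()

# Stand-in schema with only the columns used by the query (sample data is made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE canonical_forms (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE name_strings (id INTEGER PRIMARY KEY, canonical_form_id INTEGER);
CREATE TABLE name_string_indices (
    name_string_id INTEGER, data_source_id INTEGER, classification_path TEXT);
INSERT INTO canonical_forms VALUES (1, 'Homo sapiens'), (2, 'Pan troglodytes');
INSERT INTO name_strings VALUES (10, 1), (11, 2);
INSERT INTO name_string_indices VALUES (10, 1, 'a|b'), (10, 2, 'c'), (11, 1, 'd');
""")

QUERY = """
select canonical_forms.name, name_string_indices.data_source_id,
       name_string_indices.classification_path
from name_strings
join canonical_forms
  on name_strings.canonical_form_id = canonical_forms.id
join name_string_indices
  on name_strings.id = name_string_indices.name_string_id
"""

# Process rows as they arrive instead of building one big list.
row_count = 0
for name, data_source_id, classification_path in iter_rows(conn, QUERY, batch_size=2):
    row_count += 1
print(row_count)
```

With some MySQL client libraries the default cursor still buffers the whole result on the client, so an unbuffered (server-side) cursor may also be needed there; that depends on the driver and is not shown here.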
Dmitry Mozzherin
@dimus
Yes, now we can see new information that Paddy added
Dmitry Mozzherin
@dimus
I think you can significantly decrease the size of the query if you look at the problem from the point of view of the final result.
I imagine that our features will be written to a table similar to parsed_name_strings, where id is the same as in the name_strings table,
and this table would have a field "has classification" or something like that.
Once you have this field, it will be much cheaper to see how it relates to the canonical forms table.
Wencan Luo
@wencanluo
wencanluo/good-bad-names-for-GN#24
Check out the good and bad names given by the source weight
Dmitry Mozzherin
@dimus
@wencanluo I did submit the midterm evaluation for you, and yes, you did pass
Dmitry Mozzherin
@dimus
@wencanluo what do you think about having a meeting today/tomorrow?
Wencan Luo
@wencanluo
How about tomorrow? Any time after 7pm EST is ok
Dmitry Mozzherin
@dimus
Let's talk at 7PM today then
Wencan Luo
@wencanluo
Great. Talk to you then
Wencan Luo
@wencanluo
You can check out the classification results using the parser features here:
https://docs.google.com/document/d/1mblzmi1o0dm70OSvR0qR7vrQ69KBONw_wroArRMeCPc/edit
You can check out the simple bad names in the current database here:
https://docs.google.com/document/d/11df0O-VSOQH_pWwHsov8zuw9ZUbLn1KpfKmQ4vrQ5Ww/edit
Dmitry Mozzherin
@dimus
Thanks, Wencan
All the species names:
Wencan Luo
@wencanluo
Could you show me the fuzzy matching algorithm, so that I don't need to write it myself?
Dmitry Mozzherin
@dimus
This is the Ruby version of the algorithm; unfortunately no Python version exists to my knowledge.
It uses the Damerau-Levenshtein algorithm and custom weighting of results,
plus matching of years and authors.
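For readers who want a Python starting point for the edit-distance part, here is a minimal sketch. It implements the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions, and swaps of adjacent characters. It is not a port of the Ruby code mentioned above: the custom weighting of results and the matching of years and authors are not reproduced, and the function name `osa_distance` is hypothetical.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    prev2 = None                      # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))    # row i-1 (distance from empty prefix of a)
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,          # deletion
                cur[j - 1] + 1,       # insertion
                prev[j - 1] + cost,   # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]

print(osa_distance("Homo sapiens", "Homo sapeins"))  # swapped "ie" -> 1
```

Plain Levenshtein distance would score the swapped "ie"/"ei" above as two edits; treating it as one edit is the reason to prefer a Damerau-style distance for misspelled names.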
Dmitry Mozzherin
@dimus
One thing we have to take into account when generating the data source quality metric: if a name is both a synonym and a bad name, it should not influence the metric. Synonyms often contain "popular" alternative spellings and misspellings, which is good. We know a name is a synonym if the name_string_index's taxon id is different from the last taxon id in the path, or if the name is marked as a synonym in the table.
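The synonym rule above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the project's code: the function names, the `marked_as_synonym` flag, the record keys, and the idea that the path of taxon ids is a string delimited by "|" are all hypothetical.

```python
def is_synonym(taxon_id, classification_path_ids, marked_as_synonym=False, sep="|"):
    """Return True if the record looks like a synonym under the rule above."""
    if marked_as_synonym:
        return True
    # Assumed format: taxon ids joined by sep, e.g. "1|12|345".
    ids = [part for part in classification_path_ids.split(sep) if part]
    if not ids:
        return False  # no path available, so the path test cannot apply
    return str(taxon_id) != ids[-1]

def bad_names_for_metric(records):
    """Keep only bad names that are not synonyms, so synonyms do not hurt the metric."""
    return [
        r for r in records
        if r["is_bad"] and not is_synonym(
            r["taxon_id"], r["path_ids"], r.get("marked_as_synonym", False))
    ]

# Made-up example records.
records = [
    {"name": "A", "taxon_id": "345", "path_ids": "1|12|345", "is_bad": True},
    {"name": "B", "taxon_id": "999", "path_ids": "1|12|345", "is_bad": True},
    {"name": "C", "taxon_id": "345", "path_ids": "1|12|345", "is_bad": False},
]
print([r["name"] for r in bad_names_for_metric(records)])
```

Only record "A" is counted: "B" is a bad name but its taxon id differs from the last id in its path, so it is treated as a synonym and left out.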
Wencan Luo
@wencanluo
Great to know. The synonym information will also help the good and bad name classifier
Wencan Luo
@wencanluo
This is the document that shows some examples of bad names predicted by the model:
https://docs.google.com/document/d/11df0O-VSOQH_pWwHsov8zuw9ZUbLn1KpfKmQ4vrQ5Ww/edit
These are the detailed features of the prediction model:
https://docs.google.com/document/d/1mblzmi1o0dm70OSvR0qR7vrQ69KBONw_wroArRMeCPc/edit
The current model got an accuracy of 79.2% on the VertNet data set, evaluated with 10-fold cross-validation.
The final model was trained on the full VertNet data set.
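As a reminder of what the 10-fold cross-validation figure measures, here is a minimal sketch in plain Python. It is not the model or the features used for the result above: the majority-label "classifier", the function names, and the toy labels are hypothetical placeholders, and the data is left unshuffled only to keep the example deterministic (in practice it would usually be shuffled first).

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(X, y, train, predict, k=10):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [label for i, label in enumerate(y) if i not in held_out]
        model = train(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

# Placeholder classifier: always predicts the most common training label.
def train_majority(train_X, train_y):
    return Counter(train_y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

names = ["name%d" % i for i in range(20)]   # made-up inputs
labels = ["good"] * 15 + ["bad"] * 5        # made-up labels
accuracy = cross_val_accuracy(names, labels, train_majority, predict_majority, k=10)
print(accuracy)  # 0.75
```

Every example is used for testing exactly once, so the pooled accuracy is an estimate for a model trained on about 90% of the data, which is why the final model can then be retrained on the whole set.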
Wencan Luo
@wencanluo
These are the new data source rating results after updating the model:
https://docs.google.com/spreadsheets/d/12GXMyhjUBxOJJBW_TO9fC9vyDqCt7-_LxhkNE25t3FY/edit#gid=0