These are chat archives for FreeCodeCamp/DataScience

7th
Sep 2015
Suzanne Atkinson
@AdventureBear
Sep 07 2015 03:55 UTC
hello data science room
Caroline Dikibo
@cdikibo
Sep 07 2015 04:15 UTC
hi
evaristoc
@evaristoc
Sep 07 2015 08:37 UTC
@AdventureBear welcome! I hope you enjoy this room!
evaristoc
@evaristoc
Sep 07 2015 15:10 UTC

Hi people:

@BerkeleyTrue, @QuincyLarson, @dcsan, @benmcmahon100
@cdikibo, @andela-bfowotade
@abhisekp, @biancamihai, @Lightwaves, @AdventureBear, @SaintPeter, @mildused, @ArielLeslie, @qmikew1

Did you know that...?

  • In the set that we are still analysing (i.e. the Help channel, Jan-Jul 15) there are about 1.6 million words and punctuation marks, counting all messages. The data is raw: it also includes links, code-related lines, etc.
  • Just analysing the raw data, we still wanted to find some important bigram collocations (bigrams: pairs of words; collocations: multi-word expressions that commonly co-occur). I used two mixed samples of different sizes. After some treatment of the text (tokenization), the words that tended to appear together were:
    • proper names of food, places or objects (eg. "peanut butter", "eiffel tower");
    • proper names of programming tools or programming concepts (eg. "regular expressions", "action hero");
    • and of course, some bonfires names (eg. "chunky monkey").
      However, the results differed between samples because the methodology relies heavily on sample size. No larger samples were used because the algorithm that calculates collocations is CPU-intensive and was taking too long to evaluate the full set. This is just exploratory.
  • We could evaluate bigrams for the whole dataset. Although also exploratory, we went through detecting bigrams containing the word "bonfire(s)". There were over 700 such bigrams after eliminating stopwords and punctuation marks. Below I list the most frequent ones in reverse order. The result shows both the potential and the challenges of closing the gap to detect relevant words in a specific context.
[(('bonfire', 'links'), 11),
 (('finished', 'bonfire'), 11),
 (('advanced', 'bonfire'), 11),
 (('bonfire', 'info'), 11),
 (('bonfire', 'symmetric'), 12),
 (('bonfire', 'spoiler'), 12),
 (('bonfire', 'title'), 12),
 (('bonfire', 'search'), 12),
 (('bonfire', 'script'), 13),
 (('case', 'bonfire'), 13),
 (('bonfire', 'using'), 13),
 (('bonfire', '16'), 13),
 (('bonfire', 'http'), 14),
 (('bonfire', 'convert'), 14),
 (('bonfire', 'exact'), 14),
 (('bonfire', 'bfname'), 14),
 (('thou', 'bonfire'), 15),
 (('bonfire', 'roman'), 15),
 (('mutations', 'bonfire'), 15),
 (('done', 'bonfire'), 15),
 (('arrays', 'bonfire'), 16),
 (('bonfire', 'sorted'), 16),
 (('bonfire', 'challenges'), 16),
 (('entities', 'bonfire'), 16),
 (('please', 'bonfire'), 17),
 (('latin', 'bonfire'), 17),
 (('bonfire', 'mutations'), 17),
 (('bonfire', 'make'), 18),
 (('bonfire', 'seek'), 18),
 (('bonfire', 'chunky'), 19),
 (('bonfire', 'pairwise'), 19),
 (('monkey', 'bonfire'), 20),
 (('destroy', 'bonfire'), 21),
 (("'", 'bonfire'), 24),
 (('difference', 'bonfire'), 24),
 (('multiple', 'bonfire'), 24),
 (('palindrome', 'bonfire'), 24),
 (('string', 'bonfire'), 25),
 (('bonfire', '"'), 28),
 (('bonfire', 'arguments'), 28),
 (('optional', 'bonfire'), 31),
 (('person', 'bonfire'), 32),
 (('pairwise', 'bonfire'), 32),
 (('change', 'bonfire'), 37),
 (('bonfire', 'challenge'), 43),
 (('bonfire', "'"), 46),
 (('"', 'bonfire'), 102),
 (('bonfire', 'called'), 105),
 (('bonfire', 'name'), 117),
 ((':[', 'bonfire'), 207)]
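The counting step behind the list above can be sketched with just the standard library. This is a minimal sketch: the tokens, stopword set and punctuation set here are illustrative stand-ins for the real chat data and nltk's English stopwords, not the actual analysis inputs.

```python
from collections import Counter

# Illustrative tokens; the real analysis tokenized ~1.6 million words
tokens = ["finished", "the", "bonfire", "chunky", "monkey", ",",
          "bonfire", "name", "is", "bonfire", "name"]
stopws = {"the", "is", "a", "an"}                 # stand-in stopword set
puncts = set("!#$%&`()*+,-./:;<=>?@[\\]^_{|}~")   # stand-in punctuation set

# Pair each token with its neighbour, keep only pairs touching "bonfire",
# and drop a pair if either member is a stopword or punctuation
kept = [(a, b) for a, b in zip(tokens, tokens[1:])
        if "bonfire" in (a, b)
        and not {a, b} & stopws
        and not {a, b} & puncts]

counts = Counter(kept)
print(counts.most_common())
# [(('bonfire', 'name'), 2), (('bonfire', 'chunky'), 1)]
```

On real data you would feed in the tokenized corpus instead of the toy list; `Counter.most_common()` then gives the frequency ranking directly, descending rather than the ascending order shown above.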
CamperBot
@camperbot
Sep 07 2015 15:11 UTC
Sorry, can't find a bonfire called links 11 finished bonfire 11 advanced bonfire 11 bonfire info 11 bonfire symmetric 12 bonfire spoiler 12 bonfire title 12 bonfire search 12 bonfire script 13 case bonfire 13 bonfire using 13 bonfire 16 13 bonfire http 14 bonfire convert 14 bonfire exact 14 bonfire bfname 14 thou bonfire 15 bonfire roman 15 mutations bonfire 15 done bonfire 15 arrays bonfire 16 bonfire sorted 16 bonfire challenges 16 entities bonfire 16 please bonfire 17 latin bonfire 17 bonfire mutations 17 bonfire make 18 bonfire seek 18 bonfire chunky 19 bonfire pairwise 19 monkey bonfire 20 destroy bonfire 21 bonfire 24 difference bonfire 24 multiple bonfire 24 palindrome bonfire 24 string bonfire 25 bonfire 28 bonfire arguments 28 optional bonfire 31 person bonfire 32 pairwise bonfire 32 change bonfire 37 bonfire challenge 43 bonfire 46 bonfire 102 bonfire called 105 bonfire name 117 bonfire 207. [ Check the map? ]
evaristoc
@evaristoc
Sep 07 2015 15:11 UTC
hahaha!! ^^

Quick Report:
This week the team was not fully available (sickness, other issues). Additionally, @Lightwaves has said that he won't be able to continue as part of the active team, being busy with school. He has done a great job! Thanks, @Lightwaves!
DA app:
--- we worked on extending the app to include data from several rooms.
--- the project is progressing quite well and we are happy to recognise the help from other people:
------ @biancamihai gave some important clues on how to implement async functionality, and we will be in conversation about a potential implementation of a database and a daemon for scheduled data updates
------ @abhisekp reworked a reduce that was giving me headaches and guided a comparison with how camperbot gets data from the rooms

Text Mining:
--- the project as conceived is still behind, and fewer of us are now working on it, but it got a special mention in the Did you know... section

Other projects/ideas:
--- @SaintPeter's random thoughts:

  • correlations between Gitter activity about challenges vs solved challenges
  • Gitter utilization rate
  • surveying people to get more data about user performance and characteristics; use the information to improve service

--- @ArielLeslie's peanut gallery come-ups:

  • FCC plans for open data: when? how? which format? (possibly a question for @BerekelyTrue)
  • suggestions for how to analyse that data: logstash, elasticsearch; status of the data (sanitized? cleaned?)
  • (also @SaintPeter) stored in a database? --> some ideas of data attributes to load (date, etc.)
  • attrition analysis (my note: it can be calculated to some extent; that will be for the next edition of this report)
  • community involvement and categorization of users according to that
  • campers' types, based on data available
  • finding pairings based on progress level, like in Words With Friends (my note: interesting...)
  • user performance indicators (also @SaintPeter)

--- @Lightwaves' suspicious sounds:

  • unsupervised learning (clustering) for user characterization

--- @qmikew1's idea...:

  • opening a Security-related room (where is that room??)
CamperBot
@camperbot
Sep 07 2015 15:11 UTC
evaristoc sends brownie points to @lightwaves and @lightwaves and @biancamihai and @abhisekp and @saintpeter and @arielleslie and @berekelytrue and @saintpeter and @saintpeter and @lightwaves and @qmikew1 :sparkles: :thumbsup: :sparkles:
:warning: could not find receiver for berekelytrue
:star: 409 | @abhisekp | http://www.freecodecamp.com/abhisekp
:star: 532 | @saintpeter | http://www.freecodecamp.com/saintpeter
:star: 201 | @qmikew1 | http://www.freecodecamp.com/qmikew1
:star: 244 | @biancamihai | http://www.freecodecamp.com/biancamihai
:star: 532 | @saintpeter | http://www.freecodecamp.com/saintpeter
:star: 348 | @arielleslie | http://www.freecodecamp.com/arielleslie
:star: 171 | @lightwaves | http://www.freecodecamp.com/lightwaves
:star: 171 | @lightwaves | http://www.freecodecamp.com/lightwaves
:star: 171 | @lightwaves | http://www.freecodecamp.com/lightwaves
:star: 532 | @saintpeter | http://www.freecodecamp.com/saintpeter
evaristoc
@evaristoc
Sep 07 2015 15:12 UTC
hahahahahaha!!! ^^
but that was good...

This Week...:
DA app:
--- @andela-bfowotade will be working on the frontend a bit more (he is bringing a more experienced, professional view to some aspects of the project)
--- we will be working on improving the code and start preparing everything for an eventual deployment (possibly Heroku)
--- we will also evaluate the database aspect and the possibilities for scheduled updates

Text Mining:
--- the project as originally conceived will progress slowly, but other small things are planned, particularly using nltk or scikit-learn to carry out some analyses
--- we are trying to find small, feasible but relevant projects for this part and see how they fit FCC's interests and rationale; we will mention any progress when it occurs

Abhisek Pattnaik
@abhisekp
Sep 07 2015 15:26 UTC
@evaristoc ah! Nice. could you also include the project url along with your message?
evaristoc
@evaristoc
Sep 07 2015 15:26 UTC
@abhisekp hi doctor! which one?
The app?
Abhisek Pattnaik
@abhisekp
Sep 07 2015 15:27 UTC
@evaristoc your app??
evaristoc
@evaristoc
Sep 07 2015 15:27 UTC
Sure!!
Abhisek Pattnaik
@abhisekp
Sep 07 2015 15:28 UTC
@evaristoc i was saying.. it would have been better if you included the app url along with the analysis
evaristoc
@evaristoc
Sep 07 2015 15:30 UTC
The analyses I am doing are python-based... but that is a good comment (I haven't made any mention of the code): quick and dirty
#!/usr/bin/env python3

import os
import pickle
from operator import itemgetter

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords


if __name__ == "__main__":
    directory = "personal/directory"
    raw = pickle.load(open(os.path.join(directory, "help.pkl"), "rb"))

    # Dump the raw message texts to a flat file, then tokenize the whole thing
    with open(os.path.join(directory, "help_messages.txt"), "w") as textin:
        for elem in raw:
            textin.write(elem["text"] + "\n")
    with open(os.path.join(directory, "help_messages.txt")) as textout:
        tokens = nltk.wordpunct_tokenize(textout.read())
        msgs = nltk.Text(tokens)

    print(len(msgs))  # 1331417; 1651687
    words = [w.lower() for w in msgs]
    vocab = sorted(set(words))

    # Exploratory collocations (PMI) on two samples of different sizes;
    # the full set was too CPU-intensive to evaluate
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(msgs[:100000])
    print(finder.nbest(bigram_measures.pmi, 1000))
    finder = BigramCollocationFinder.from_words(msgs[:500000])
    print(finder.nbest(bigram_measures.pmi, 1000))

    # Collect bigrams containing "bonfire(s)", dropping stopwords and punctuation
    bgs = nltk.bigrams(words)
    bonfires_bigrams_list = []
    stopws = stopwords.words('english')
    # http://stackoverflow.com/questions/23317458/how-to-remove-punctuation
    punctuations = ['!', '...', '#', '$', '%', '&', '`', '```', '(', ')', '*',
                    '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@',
                    '[', '\\', ']', '^', '_', '{', '|', '}', '~']
    for bg in bgs:
        if bg[0] in ('bonfires', 'bonfire') or bg[1] in ('bonfires', 'bonfire'):
            content = [w for w in bg if w not in stopws and w not in punctuations]
            if len(content) == 2:  # neither member was filtered out
                bonfires_bigrams_list.append(bg)

    # Tally and sort by frequency, ascending (the "reverse order" listed above)
    bonfires_counts = {}
    for bon in bonfires_bigrams_list:
        bonfires_counts[bon] = bonfires_counts.get(bon, 0) + 1
    dd = sorted(bonfires_counts.items(), key=itemgetter(1))
Abhisek Pattnaik
@abhisekp
Sep 07 2015 15:54 UTC
@evaristoc oh! No!! pls don't paste codes here.
evaristoc
@evaristoc
Sep 07 2015 16:43 UTC
Oh... sorry, @abhisekp... next time I will find a better place...
Abhisek Pattnaik
@abhisekp
Sep 07 2015 16:45 UTC
@evaristoc btw, is there no regex in python for checking punctuations?
evaristoc
@evaristoc
Sep 07 2015 17:00 UTC

Yes, @abhisekp, but for me it was a quick way to get the punctuation marks from the string module (I just found that reference) and compare by set membership. There would be issues with regex around tokens like '...' or other things that I didn't want to give much thought.

But the more you really want to clean the text or extract relevant info, or the more you want to code away from the nltk library, the more you will rely on direct regex to do the work... (there should be a lot of regex inside the nltk library itself...)
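The two approaches discussed above can be compared directly. A minimal sketch (the token is illustrative): a set-membership check against string.punctuation, and an equivalent regex built with re.escape so metacharacters like '.' and '*' don't misfire inside the character class.

```python
import re
import string

token = "..."

# Set-membership check, the quick-and-dirty approach discussed above
is_punct_set = all(ch in string.punctuation for ch in token)

# Equivalent regex; re.escape neutralises metacharacters such as '.' and '*'
punct_re = re.compile("^[%s]+$" % re.escape(string.punctuation))
is_punct_regex = bool(punct_re.match(token))

print(is_punct_set, is_punct_regex)  # True True
```

Both checks agree on tokens like '...' here; the regex version scales better once you want more nuanced patterns (e.g. mixed punctuation and letters), which is the trade-off mentioned above.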