These are chat archives for FreeCodeCamp/DataScience

Jan 2018
Josh Goldberg
Jan 05 2018 00:22
@evaristoc when do you want to get started?
Josh Goldberg
Jan 05 2018 00:27
@timjavins these security implications seem pretty draconian. Is it as bad as it reads?
Hèlen Grives
Jan 05 2018 10:20
For all of you I wish you the best for 2018!
Jan 05 2018 12:15


Following my comments in yesterday's message about possible factors affecting data manipulation :point_up_2:, I took my script, changed it to remove some redundancies and unneeded checks, and this was the result:

100000 messages in around 0 mins 5 secs for an increase in userdict_kw length of 1703.
100000 messages in around 0 mins 1 secs for an increase in userdict_kw length of 5289.
100000 messages in around 0 mins 1 secs for an increase in userdict_kw length of 11467.
100000 messages in around 0 mins 1 secs for an increase in userdict_kw length of 14099.
27982 messages in around 0 mins 0 secs for an increase in userdict_kw length of 63072: Finished... 

Linear, as it should be. If it took longer overall, that was because I decided to leave 1 min between analyses.

For those of us who are analysts rather than programmers by training, this is a reminder of how important it is to get better at this.

Jan 05 2018 12:20

The changes above didn't implement any multiprocessing yet. From what I have read and experienced myself, the most important thing is to fix as many inefficiencies in your non-parallel code as possible before trying any multiprocessing (unless you have plenty of experience with it as a programmer).

Otherwise you will bring your inefficiencies with you into the parallel code, which in the best-case scenario means extra work, considering that parallelism is not easy.

However, parallelism is certainly an extraordinary way to improve your code's performance even more by assigning more resources to your program.
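To illustrate the "fix the serial code first" advice: profiling shows where the time actually goes before you reach for multiprocessing. This is a minimal sketch using the standard library's cProfile; the `process_messages` function and the toy message list are made-up stand-ins, not the real script.

```python
import cProfile
import pstats
import io

def process_messages(messages):
    # Toy stand-in for the per-message analysis in the real script.
    counts = {}
    for m in messages:
        for w in m.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

if __name__ == '__main__':
    messages = ["hello world"] * 50_000

    profiler = cProfile.Profile()
    profiler.enable()
    process_messages(messages)
    profiler.disable()

    # Print the five most expensive calls: these are the hotspots
    # worth fixing before any parallelization.
    s = io.StringIO()
    pstats.Stats(profiler, stream=s).sort_stats('cumulative').print_stats(5)
    print(s.getvalue())
```

If the hotspot turns out to be a redundant check or an avoidable data copy, removing it helps every future parallel worker too.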

@GoldbergData Great!!! I will let you know for sure. What I am going to do is upload the data to Kaggle so we can work on specific analyses related to the projects.

The data on Kaggle won't be the whole dataset to be used in the projects I have in mind, so your analyses will be more about tuning up the analytical tools we are going to use for the project as a whole. As soon as I get the complete data I will share it with you for the full analysis.

I'll keep you informed!

Jan 05 2018 13:32


Something about recovering memory in Python. Interestingly, multiprocessing is in fact a good option for that!

Why should multiprocessing be used with care? Because forking or spawning processes with multiprocessing actually makes copies of an existing parent. If the parent is big, the children will be big too.

Additionally, each new process starts Python again, and each Python process consumes memory (on my computer, IPython without libraries, just the IDE: 17KB). If you open too many processes, even with little data but many libraries, you could add too much memory overhead.
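A minimal sketch of the "recover memory via a child process" idea (the `heavy_sum` and `worker` names are mine, just for illustration): doing the memory-hungry step inside a `Process` means the memory is returned to the OS when the child exits, and the `spawn` start method starts a fresh interpreter instead of copying the parent.

```python
import multiprocessing

def heavy_sum(n):
    # Build a big temporary structure; it lives only inside the child.
    return sum(list(range(n)))

def worker(n, queue):
    queue.put(heavy_sum(n))

if __name__ == '__main__':
    # 'spawn' starts a fresh interpreter rather than forking a copy of
    # the parent, so a big parent does not automatically mean big children.
    ctx = multiprocessing.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(1_000_000, q))
    p.start()
    result = q.get()   # fetch the result before join() to avoid blocking on a full pipe
    p.join()           # when the child exits here, its memory goes back to the OS
    print(result)
```

Only the small result crosses the queue; the big intermediate list never touches the parent's memory.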

Adding to the link above, here is another one about how difficult it is to clear memory in Python (I think I already mentioned that the garbage collector was not enough):
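A small sketch of why `gc` alone is often not enough (variable names are made up for the example): even after dropping references and forcing a collection, CPython's allocator may keep the freed blocks cached for reuse rather than returning them to the OS.

```python
import gc
import sys

data = [str(i) for i in range(100_000)]
print(sys.getsizeof(data))   # size of the list object itself, not its contents

del data                     # drop the only reference
unreachable = gc.collect()   # force a full collection pass
print(unreachable)           # cycles found; plain lists are freed immediately anyway

# Even now, the process's resident memory often does not shrink:
# CPython may hold freed blocks for reuse instead of giving them back
# to the OS. That is why running the heavy step in a short-lived child
# process is a more reliable way to actually recover memory.
```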

Josh Goldberg
Jan 05 2018 13:37
@evaristoc sounds great! Thank you!
Jan 05 2018 13:37
goldbergdata sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 394 | @evaristoc |
Jan 05 2018 14:22


More about parallel processing with Python.

I will focus on Kaggle. In fact, on solutions for how to easily work with the datasets I am uploading to Kaggle. I'll keep you updated.
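For a taste of the parallel-processing angle, here is a minimal `multiprocessing.Pool` sketch; `word_count` and the toy message list are placeholders I made up, not the actual Kaggle analyses.

```python
from multiprocessing import Pool

def word_count(message):
    # Trivial per-message task standing in for a real analysis step.
    return len(message.split())

if __name__ == '__main__':
    messages = ["some chat message"] * 8
    # Pool.map splits the list across 4 worker processes and
    # reassembles the results in the original order.
    with Pool(processes=4) as pool:
        counts = pool.map(word_count, messages)
    print(sum(counts))
```

`Pool` hides the queue-and-worker plumbing that the longer script below manages by hand, which makes it a good first tool when each item can be processed independently.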

Robert Crawford
Jan 05 2018 14:43
Hi, I am having trouble parsing the data for the D3 project Visualize Data with a Bar Chart. The json data has no column definition that I can tell, here is the link
Neither the Date nor the Value column is specified.
Jan 05 2018 18:47
Following the above, a possible solution to the memory overhead with those inefficient big files I have could be:
import os, sys
import time
import multiprocessing


def feeder(feederQ):
    for i in range(1, 4):
        feederQ.put(i)
    for _ in range(3):
        feederQ.put(None)  # one poison pill per data worker

def preparing_file(feederQ, lock):
    ## last message 28 FEB 2017

    datadirectory = "/1_archive"

    while True:
        try:
            f = feederQ.get()
            if f is None:
                # Poison pill means shutdown
                print('%s: Exiting' % multiprocessing.current_process().name)
                feederQ.task_done()
                break
            try:
                with lock:
                    start = time.time()
                    print('I am solving the file number '+str(f)+'. It will take a lot of memory.')
                    # ... heavy per-file processing goes here ...
                    end = time.time()
                    print('Data was saved at emojiproject0'+str(f)+'_corr.pkl after {} secs. DONE.'.format(end-start))
            except Exception:
                print("do something with error handling for the file")
            finally:
                feederQ.task_done()
        except Exception:
            print("do something with error handling for the queue")

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    feederQueue = multiprocessing.JoinableQueue()
    feederWorker = multiprocessing.Process(target=feeder, args=(feederQueue,))
    dataWorkers = [multiprocessing.Process(target=preparing_file, args=(feederQueue, lock)) for i in range(3)]
    for dataWorker in dataWorkers:
        dataWorker.daemon = True
        dataWorker.start()
    feederWorker.start()
    feederQueue.join()  # block until every queued item has been marked done

So far each process breaks out of its loop after finishing its task, giving up the lock to the next process.

This script is still not generalizable, but I think it will work for a few processes, each corresponding to one of the files I have (3).

Jan 05 2018 19:31
I added a poison pill to the previous version based on: and it worked much better.
Jan 05 2018 19:41
@RobertCC18 I hope you found what you were looking for in the:
:point_left: HelpDataViz channel? Asking on the forum, or even using Google or the forum's internal search engine, is very advisable too. I personally can't help because I haven't done any of the exercises.
Jan 05 2018 21:52


The datasets of the main room are finally available on Kaggle!

Hope you will enjoy them! Happy (DS) Coding!

Alice Jiang
Jan 05 2018 22:22
@RobertCC18 try creating a new object using dataArray[0] as object keys and dataArray[1] as object values
var someObj = {};
function organize(dataArray){
  someObj[dataArray[0]] = dataArray[1];
}
Something like that might be a place to start, then run the data through that when you parse it
It's been a minute since I've done any JS at all, so that might be really bad code, but idk :/