These are chat archives for FreeCodeCamp/DataScience

5th
Jan 2018
Josh Goldberg
@GoldbergData
Jan 05 2018 00:22
@evaristoc when do you want to get started?
Josh Goldberg
@GoldbergData
Jan 05 2018 00:27
@timjavins these security implications seem pretty draconian. Is it as bad as it reads?
Hèlen Grives
@mesmoiron
Jan 05 2018 10:20
To all of you, I wish the best for 2018!
evaristoc
@evaristoc
Jan 05 2018 12:15

PEOPLE

Following my comments in yesterday's message about possible factors affecting data manipulation :point_up_2:, I took my script, changed it to remove some redundancies and unneeded checks, and this was the result:

100000 messages in around 0 mins 5 secs for an increase in userdict_kw length of 1703.
100000 messages in around 0 mins 1 secs for an increase in userdict_kw length of 5289.
100000 messages in around 0 mins 1 secs for an increase in userdict_kw length of 11467.
100000 messages in around 0 mins 1 secs for an increase in userdict_kw length of 14099.
(...)
27982 messages in around 0 mins 0 secs for an increase in userdict_kw length of 63072: Finished... 
ENDING TRANSFORMING DATA

Linear, as it should be. If it lasted longer, it was because I decided to leave 1 min between analyses.

For those of us who are analysts rather than programmers by training, this is a reminder of the importance of getting better at programming.
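
I don't have the original script pasted here, but one typical redundancy of this kind is a membership check against a growing list inside the per-message loop; switching that container to a dict (or set) keeps each check constant-time, which is what makes the throughput linear. A rough sketch, where process_message and extract_keywords are just placeholder names for the illustration:

def extract_keywords(message):
    # placeholder tokenizer, only for the sketch
    return message.lower().split()

def process_message(message, userdict_kw):
    # userdict_kw as a dict: "kw in userdict_kw" is O(1) on average,
    # whereas the same check against a list rescans the whole list every time
    for kw in extract_keywords(message):
        if kw not in userdict_kw:
            userdict_kw[kw] = 0
        userdict_kw[kw] += 1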

evaristoc
@evaristoc
Jan 05 2018 12:20

The changes above didn't implement any multiprocessing yet. From what I have read and experienced myself, the most important thing is to solve as many inefficiencies in your non-parallel code as possible before trying any multiprocessing (unless you have plenty of experience with it as a programmer).

Otherwise you will carry your inefficiencies over into the parallel code, which in the best-case scenario means extra work, considering parallelism is not easy.

However, parallelism is certainly an excellent way to improve your code even further by assigning more resources to your program.
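
One practical way to find those inefficiencies before reaching for multiprocessing is to profile the serial code first. A minimal sketch with the standard-library profiler (transform_data here is just a stand-in for the real transformation step):

import cProfile
import pstats

def transform_data():
    # stand-in for the real, slow transformation step
    return sum(i * i for i in range(10**6))

# Profile the serial version and look at the cumulative times before
# deciding whether multiprocessing is worth the added complexity.
cProfile.run('transform_data()', 'transform.prof')
stats = pstats.Stats('transform.prof')
stats.sort_stats('cumulative').print_stats(10)   # top 10 hotspots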


@GoldbergData Great!!! I will let you know for sure. What I am going to do is upload the data to Kaggle so we can work on specific analyses related to the projects.

The data in Kaggle won't be the whole data to be used in the projects I have in mind, so your analyses will be more dedicated to tuning up the analytical tools we are going to use for the project as a whole. As soon as I get the complete data I will share it with you for the full analysis.

I'll keep you informed!

evaristoc
@evaristoc
Jan 05 2018 13:32

PEOPLE

Something about recovering memory in Python. Interestingly, multiprocessing is in fact a good option for that!

https://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python

Why should multiprocessing be used with care? Because forking processes with multiprocessing actually makes copies of the existing parent. If the parent is big, the children will be big too.

Additionally, each new process will start Python again, and each Python process consumes memory (on my computer, IPython without libraries, just the IDE: 17KB). If you open too many processes, even with not much data but many libraries loaded, you could add too much overhead to memory.

Adding to the link above, another one about how difficult it is to clear memory in Python (I think I already mentioned that the garbage collector was not enough):
https://www.sjoerdlangkemper.nl/2016/06/09/clearing-memory-in-python/
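
The trick several of the Stack Overflow answers point to is to push the memory-hungry step into a child process and send back only the small result, so that all of the child's memory is returned to the OS when it exits. A minimal sketch of that idea (memory_heavy_step is just a placeholder for the real work):

import multiprocessing

def memory_heavy_step(out_q):
    # stand-in for a step that builds a big intermediate structure
    data = [i * i for i in range(10**7)]
    out_q.put(sum(data))          # send back only the small result

if __name__ == '__main__':
    out_q = multiprocessing.Queue()
    p = multiprocessing.Process(target=memory_heavy_step, args=(out_q,))
    p.start()
    result = out_q.get()          # read the result before joining
    p.join()                      # when the child exits, its memory goes back to the OS
    print(result)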

Josh Goldberg
@GoldbergData
Jan 05 2018 13:37
@evaristoc sounds great! Thank you!
CamperBot
@camperbot
Jan 05 2018 13:37
goldbergdata sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 394 | @evaristoc |http://www.freecodecamp.org/evaristoc
evaristoc
@evaristoc
Jan 05 2018 14:22

PEOPLE

More about parallel processing with Python.
https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b
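
The short version of that article, as I understand it: because of the GIL, threads in CPython only really help with I/O-bound work, while CPU-bound work needs separate processes. A small sketch to see the difference yourself (worker counts and problem size are arbitrary):

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # purely CPU-bound work: the GIL keeps threads from running it in parallel
    return sum(i * i for i in range(n))

def timed(executor_cls, label):
    start = time.time()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [2 * 10**6] * 4))
    print('%s: %.2f secs' % (label, time.time() - start))

if __name__ == '__main__':
    timed(ThreadPoolExecutor, 'threads (GIL-bound)')
    timed(ProcessPoolExecutor, 'processes')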

I will focus on Kaggle; in fact, on solutions for how to easily work with the datasets I am uploading to Kaggle. I'll keep you updated.

Robert Crawford
@RobertCC18
Jan 05 2018 14:43
Hi, I am having trouble parsing the data for the D3 project Visualize Data with a Bar Chart. The JSON data has no column definitions that I can tell; here is the link: https://raw.githubusercontent.com/FreeCodeCamp/ProjectReferenceData/master/GDP-data.json
Neither the Date nor the Value column is specified.
evaristoc
@evaristoc
Jan 05 2018 18:47
Following on from the above, a possible solution to the memory overhead with those inefficient big files I have could be:
import time
import multiprocessing

##https://stackoverflow.com/questions/20887555/dead-simple-example-of-using-multiprocessing-queue-pool-and-locking
##https://www.ploggingdev.com/2017/01/multiprocessing-and-multithreading-in-python-3/

def feeder(feederQ):
    # Put the number of each of the three files on the queue for the workers.
    for i in range(1, 4):
        feederQ.put(i)

def preparing_file(feederQ, lock):
    ##https://gitter.im/FreeCodeCamp/FreeCodeCamp?at=58b600e17ceae5376a526d13 last message 28 FEB 2017

    datadirectory = "/1_archive"

    while True:

        try:
            f = feederQ.get()
            if f is None:
                # Poison pill means shutdown
                print('%s: Exiting' % multiprocessing.current_process().name)
                feederQ.task_done()
                break
            with lock:
                try:
                    start = time.time()
                    print('I am solving file number ' + str(f) + '. It will take a lot of memory.')
                    time.sleep(5)  # stand-in for the real, memory-heavy processing
                    end = time.time()
                    print('Data was saved at emojiproject0' + str(f) + '_corr.pkl after {} secs. DONE.'.format(end - start))
                    feederQ.task_done()
                    break
                except Exception:
                    print("do something with error handling file")
                    feederQ.task_done()
                    break
        except Exception:
            print("do something with error handling queue")
            feederQ.task_done()
            break


if __name__ == '__main__':
    lock = multiprocessing.Lock()
    feederQueue = multiprocessing.JoinableQueue()
    feederWorker = multiprocessing.Process(target=feeder, args=(feederQueue,))
    # One worker per file; the lock is passed in explicitly so the workers
    # also find it when processes are spawned rather than forked.
    dataWorkers = [multiprocessing.Process(target=preparing_file, args=(feederQueue, lock))
                   for i in range(3)]
    feederWorker.start()
    for dataWorker in dataWorkers:
        dataWorker.daemon = True
        dataWorker.start()

    feederWorker.join()
    for dataWorker in dataWorkers:
        dataWorker.join()

So far each worker breaks after finishing its task, giving up the lock to the next process.

This version is still not generalizable, but I think it will work for a few processes, each corresponding to one of the files I have (3).

evaristoc
@evaristoc
Jan 05 2018 19:31
I added a "killer pill" to the previous one based on: https://docs.python.org/3/library/queue.html and it was much better.
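I haven't pasted the updated script, but the change is roughly this kind of thing (a sketch, not the exact code): the feeder puts one None per worker after the real items, and each worker keeps looping until it receives one:

import multiprocessing

NUM_WORKERS = 3

def feeder(feederQ):
    for i in range(1, 4):
        feederQ.put(i)
    # one "killer pill" (None) per worker so every process is told to shut down
    for _ in range(NUM_WORKERS):
        feederQ.put(None)

def preparing_file(feederQ, lock):
    while True:
        f = feederQ.get()
        if f is None:              # pill received: stop looping
            feederQ.task_done()
            break
        with lock:
            print('processing file %s' % f)
        feederQ.task_done()        # no break here, so a worker can take several files

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    feederQueue = multiprocessing.JoinableQueue()
    workers = [multiprocessing.Process(target=preparing_file, args=(feederQueue, lock))
               for _ in range(NUM_WORKERS)]
    multiprocessing.Process(target=feeder, args=(feederQueue,)).start()
    for w in workers:
        w.start()
    feederQueue.join()             # blocks until every put() has been matched by a task_done()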
evaristoc
@evaristoc
Jan 05 2018 19:41
@RobertCC18 I hope you found what you were looking for in the :point_left: HelpDataViz channel? Asking on the forum, or using Google or the forum's internal search engine, is very advisable too. I personally can't help because I haven't done any of the exercises.
evaristoc
@evaristoc
Jan 05 2018 21:52

PEOPLE

The datasets of the main room are finally available on Kaggle!

https://www.kaggle.com/free-code-camp/all-posts-public-main-chatroom

Hope you will enjoy them! Happy (DS) Coding!

Alice Jiang
@becausealice2
Jan 05 2018 22:22
@RobertCC18 try creating a new object using d.data[0] as object keys and d.data[1] as object values
someObj = {};
function organize(dataArray){
  someObj[dataArray[0]] = dataArray[1];
}
Something like that might be a place to start; then run the data through it when you parse the JSON.
It's been a minute since I've done any JS at all, so that might be really bad code, but idk :/