These are chat archives for FreeCodeCamp/DataScience

4th Jan 2018
evaristoc
@evaristoc
Jan 04 2018 11:16

@GoldbergData I felt challenged by the idea of implementing multiprocessing after your question and I have been working on that for the last few days.

However, although I got some ideas about what to do, implementing multiprocessing brings additional levels of complexity and in general I have to rewrite the whole code. There are several difficulties I have to deal with: concurrency, correctly killing the child and master processes... Designing a multiprocessing approach that really solves my memory issue and is also fast at the same time is not straightforward.
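To give an idea of the pattern I have in mind, here is a minimal sketch using the standard library's multiprocessing.Pool (the process_chunk worker and the toy data are placeholders I made up, not the real project code). Note that it still merges all partial results in the master process, so by itself it would not solve my memory issue:

```python
import multiprocessing as mp

def process_chunk(chunk):
    # Hypothetical worker: count messages per user for one chunk of (user, text) pairs.
    partial = {}
    for user, _text in chunk:
        partial[user] = partial.get(user, 0) + 1
    return partial

def run_parallel(chunks, workers=4):
    # The with-block guarantees the child processes are terminated on exit,
    # even if a worker raises, so no zombies are left behind.
    with mp.Pool(processes=workers) as pool:
        partials = pool.map(process_chunk, chunks)  # blocks until every chunk is done
    # Merging the partial results happens back in the master process,
    # so this sketch alone does not reduce peak memory use.
    merged = {}
    for partial in partials:
        for user, count in partial.items():
            merged[user] = merged.get(user, 0) + count
    return merged

if __name__ == "__main__":  # guard needed so child processes do not re-run the module top level
    toy_chunks = [[("alice", "hi"), ("bob", "hey")], [("alice", "hi again")]]
    print(run_parallel(toy_chunks, workers=2))
```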

Selecting the design and solving the issues might take a while and I don't want to delay the completion of the emoji project.

I will skip multiprocessing for the emoji project for now, but I still want to work on it. In fact, I have two other ideas using chatroom data that I will work on after finishing the emoji project, and they might benefit from multiprocessing.

@GoldbergData Do you want to help? You can own any part you feel suits your interests:

  • data collection and ETL (Python; here is where multiprocessing will be used)
  • data analysis (Python preferred, but R is also welcome)
  • data visualization (Python and R for the article viz + d3.js and some web development for accompanying link(s))

My idea for all my projects is to write an article accompanied by links to interactive visualizations that add information about the topic, or even open up a different topic over the same data.

For one of the projects I am planning the visualization part to be like a visualization-based recommender.
evaristoc
@evaristoc
Jan 04 2018 11:28
@mcbarlowe the invitation is also extended to you!
evaristoc
@evaristoc
Jan 04 2018 22:55

PEOPLE

Sharing a common but always relevant experience with those planning to become Data Scientists.

I am still manipulating some files that are too big for my RAM and that I need to transform into something simpler. The lack of memory is limiting my computer's ability to handle the required operations, slowing down the process.
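What I am doing for now instead is streaming the file record by record and keeping only the fields I need, rather than loading everything at once. A rough sketch of the idea (the file names and field names here are invented for the example, not the actual data):

```python
import json

def stream_messages(path):
    # Yield one parsed message at a time instead of loading the whole file into RAM.
    with open(path, encoding="utf-8") as fh:
        for line in fh:  # assumes one JSON record per line
            line = line.strip()
            if line:
                yield json.loads(line)

def simplify(in_path, out_path):
    # Write a much smaller file that keeps only the fields actually needed downstream.
    with open(out_path, "w", encoding="utf-8") as out:
        for msg in stream_messages(in_path):
            slim = {"user": msg.get("fromUser"), "sent": msg.get("sent")}
            out.write(json.dumps(slim) + "\n")

# Example call (hypothetical file names):
# simplify("datascience_messages.jsonl", "datascience_messages_slim.jsonl")
```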

@GoldbergData at one point suggested using some multiprocessing. I decided to study his suggestion and realized that, although it is a possible solution, there are some drawbacks that could make my problem worse if wrongly implemented.

I also decided to look into the slowdown a little bit more and measured the time it was taking to complete batches of operations (filling/updating a container with data found in a data file of about 1.5 million posted messages). Here is a summary of the simple logs:

100000 messages in around 0 mins for an increase in userdict length of 6370.
100000 messages in around 0 mins for an increase in userdict length of 11403.
100000 messages in around 1 mins for an increase in userdict length of 15402.
100000 messages in around 1 mins for an increase in userdict length of 19610.
100000 messages in around 2 mins for an increase in userdict length of 24731.
100000 messages in around 3 mins for an increase in userdict length of 29673.
100000 messages in around 4 mins for an increase in userdict length of 35137.
100000 messages in around 6 mins for an increase in userdict length of 40280.
100000 messages in around 7 mins for an increase in userdict length of 46249.
100000 messages in around 9 mins for an increase in userdict length of 52800.
100000 messages in around 11 mins for an increase in userdict length of 59170.
100000 messages in around 12 mins for an increase in userdict length of 66123.
100000 messages in around 13 mins for an increase in userdict length of 72125.
100000 messages in around 15 mins for an increase in userdict length of 78390.
100000 messages in around 17 mins for an increase in userdict length of 84186.
100000 messages in around 17 mins for an increase in userdict length of 89229.
100000 messages in around 19 mins for an increase in userdict length of 93911.

Notice that in the last steps, although the difference between the previous and current length of userdict stays relatively similar (~6,000 records), the time still increases.

A possible explanation is not only the lack of memory (which I noticed didn't change a lot) but something that can easily be overlooked: the data structure where I am collecting the data might be inefficient (a simple Python dict). In other words, it is possible that my program had less and less memory available to search through the collector quickly as it grew in number of records, because of its inefficient design.
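For reference, producing that kind of per-batch log needs nothing more than a counter and a timer around the main loop. A simplified, illustrative version (keying on the message's user and the fromUser field name are assumptions for the example, not my exact code):

```python
import time

def fill_userdict(messages, batch_size=100000):
    # Illustrative version of the timed loop: per-batch minutes and current dict size.
    userdict = {}
    batch_start = time.time()
    for i, msg in enumerate(messages, 1):
        user = msg["fromUser"]                     # hypothetical field name
        userdict.setdefault(user, []).append(msg)  # the growing in-memory collector
        if i % batch_size == 0:
            mins = int((time.time() - batch_start) / 60)
            print("%d messages in around %d mins for an increase in userdict "
                  "length of %d." % (batch_size, mins, len(userdict)))
            batch_start = time.time()              # restart the per-batch timer
    return userdict
```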

So it seems that problems with data management are at least three-fold:

  • Lack of capacity (memory and CPU) == HARDWARE
  • Inefficient code == SOFTWARE
  • Also very much ignored but important: Inefficient data structure == SOFTWARE (see the sketch below)
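On that last point, one mitigation I may try (not what the current code does) is a disk-backed mapping from the standard library's shelve module, which trades lookup speed for roughly flat RAM usage:

```python
import shelve

# Hypothetical alternative to the in-memory userdict: values are pickled to disk,
# so the collector can grow without eating RAM. Keys must be strings.
with shelve.open("userdict.db") as userdict:
    for user, sent in [("alice", "2018-01-04"), ("bob", "2018-01-04")]:  # toy data
        record = userdict.get(user, [])
        record.append(sent)
        userdict[user] = record  # re-assign so the change is actually written back
    print(len(userdict), "users stored on disk")
```

Whether the slower disk access beats the swapping I am seeing now is something I would still have to measure.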

In theory, a Data Scientist (probably together with a Data Engineer) should be able to help suggest solutions for all of those.

It is possible that you can mitigate the problem by dealing with only one of the above. For example, it seems that the current paradigm normally focuses on hardware. That is not totally true everywhere: HDFS and MapReduce are actually software solutions for low-cost hardware plus a really bad data structure (JSON).

In any case, dealing with data might require us to consider these kinds of situations.

Timothy Javins
@timjavins
Jan 04 2018 22:57
Just popping in to make sure you guys are informed:
global security threats of the moment: Spectre and Meltdown
You just gotta do some reading. It's complex, but not super scary like people make it seem.
There are fairly easy precautions to implement.
evaristoc
@evaristoc
Jan 04 2018 23:00
@timjavins thanks!
CamperBot
@camperbot
Jan 04 2018 23:00
evaristoc sends brownie points to @timjavins :sparkles: :thumbsup: :sparkles:
:cookie: 142 | @timjavins |http://www.freecodecamp.org/timjavins
Josh Goldberg
@GoldbergData
Jan 04 2018 23:12
@evaristoc yes I will help. This sounds fun!
@evaristoc I am probably better suited for the last two with R. I know Python but I am better versed in R.