These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
@GoldbergData I felt challenged by the idea of implementing multiprocessing after your question and I have been working on that for the last few days.
However, although I have some ideas about what to do, implementing multiprocessing brings additional levels of complexity and in general I have to rewrite the whole code. There are several difficulties to deal with: concurrency, correctly killing the child and master processes... Designing a multiprocessing solution that really solves my memory issue and at the same time is also fast is not straightforward.
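As a rough illustration of one possible design (a sketch only, with hypothetical names, not the actual project code): split the messages into chunks, count per chunk in worker processes, and merge the partial results. The `with` block handles terminating the child processes cleanly, which is one of the difficulties mentioned above.

```python
from multiprocessing import Pool

def count_users(chunk):
    """Count messages per user in one chunk of (user, text) records."""
    counts = {}
    for user, _text in chunk:
        counts[user] = counts.get(user, 0) + 1
    return counts

def merge(partials):
    """Combine the per-chunk counts into a single dict."""
    total = {}
    for d in partials:
        for user, n in d.items():
            total[user] = total.get(user, 0) + n
    return total

if __name__ == "__main__":
    messages = [("alice", "hi"), ("bob", "hello"), ("alice", "bye")]
    chunks = [messages[i:i + 2] for i in range(0, len(messages), 2)]
    # The context manager makes sure worker (child) processes exit cleanly.
    with Pool(processes=2) as pool:
        partials = pool.map(count_users, chunks)
    print(merge(partials))  # {'alice': 2, 'bob': 1}
```

Whether this actually helps with memory depends on chunk sizes and on how expensive it is to ship data to the workers, which is exactly the design trade-off that is not straightforward.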
Selecting the design and solving the issues might take a while and I don't want to delay the completion of the emoji project.
I will skip multiprocessing for the emoji project for now, but I still want to work on it. In fact I have 2 other ideas I will work on using chatroom data after finishing the emoji project that might benefit from using multiprocessing.
@GoldbergData Do you want to help? You can own any part you feel suits your interests:
My idea for all my projects is to write an article accompanied by links to interactive visualizations that add information about the topic, or even open up a different topic over the same data.
Sharing a common but always relevant experience with those planning to become Data Scientists.
I am still manipulating some files too big for my RAM that I need to transform into something simpler. The lack of memory is affecting my computer's ability to handle the required operations, slowing down the process.
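One standard way around files bigger than RAM is to stream them record by record instead of loading everything at once. A minimal sketch, assuming line-delimited JSON (the real files may be structured differently):

```python
import json

def iter_messages(path):
    """Yield one parsed message at a time instead of loading the whole file.

    Only one line is in memory at any moment, so file size no longer
    needs to fit in RAM.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

This only helps if the downstream aggregation is also bounded in size; if the collector itself grows with the data, memory pressure comes back.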
@GoldbergData at one point suggested using some multiprocessing. I decided to study his suggestion and realized that, although it is a possible solution, there are some drawbacks that could worsen my problem if wrongly implemented.
I also decided to look into the slowdown a little bit more and measured the time it was taking to complete each bunch of operations (filling/updating a container with data found in a data file of about 1.5 million posted messages). Here is a summary of the simple logs:
100000 messages in around 0 mins for an increase in userdict length of 6370.
100000 messages in around 0 mins for an increase in userdict length of 11403.
100000 messages in around 1 mins for an increase in userdict length of 15402.
100000 messages in around 1 mins for an increase in userdict length of 19610.
100000 messages in around 2 mins for an increase in userdict length of 24731.
100000 messages in around 3 mins for an increase in userdict length of 29673.
100000 messages in around 4 mins for an increase in userdict length of 35137.
100000 messages in around 6 mins for an increase in userdict length of 40280.
100000 messages in around 7 mins for an increase in userdict length of 46249.
100000 messages in around 9 mins for an increase in userdict length of 52800.
100000 messages in around 11 mins for an increase in userdict length of 59170.
100000 messages in around 12 mins for an increase in userdict length of 66123.
100000 messages in around 13 mins for an increase in userdict length of 72125.
100000 messages in around 15 mins for an increase in userdict length of 78390.
100000 messages in around 17 mins for an increase in userdict length of 84186.
100000 messages in around 17 mins for an increase in userdict length of 89229.
100000 messages in around 19 mins for an increase in userdict length of 93911.
Notice that in the last steps, although the difference between the previous and current length of userdict is relatively similar (~6,000 records), the time still increases.
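For reference, per-batch logging of this kind can be produced with a sketch like the following (hypothetical names; the real script differs):

```python
import time

def process_in_batches(messages, batch_size=100_000):
    """Fill userdict from (user, text) pairs, logging time and growth per batch."""
    userdict = {}
    prev_len = 0
    start = time.time()
    for i, (user, text) in enumerate(messages, 1):
        userdict.setdefault(user, []).append(text)
        if i % batch_size == 0:
            mins = (time.time() - start) / 60
            growth = len(userdict) - prev_len
            print(f"{batch_size} messages in around {mins:.0f} mins "
                  f"for an increase in userdict length of {growth}")
            prev_len = len(userdict)
            start = time.time()  # reset the clock for the next batch
    return userdict
```

Resetting the clock per batch, as here, makes the growing per-batch times in the log directly comparable to each other.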
A possible explanation for the slowdown is not only the lack of memory (which I noticed didn't change a lot) but something that could go unnoticed: the data structure where I am collecting the data might be inefficient (a simple Python dict). In other words, it is possible that my program had less memory to quickly search through the collector as it grew in number of records, because of its inefficient design.
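One way to test that hypothesis: CPython dict lookups are amortized O(1) on average, so a micro-benchmark can show whether lookup time actually grows with dict size or whether the slowdown comes from elsewhere (memory pressure, the values stored, etc.). A rough sketch:

```python
import time

def avg_lookup_time(n_keys, n_lookups=100_000):
    """Average time of a single lookup in a dict holding n_keys entries."""
    d = {i: i for i in range(n_keys)}
    start = time.perf_counter()
    for i in range(n_lookups):
        _ = d[i % n_keys]
    return (time.perf_counter() - start) / n_lookups

for size in (10_000, 100_000, 1_000_000):
    print(f"{size:>9} keys: {avg_lookup_time(size):.2e} s per lookup")
```

If the per-lookup times stay roughly flat across sizes, the dict itself is probably not the bottleneck and the memory pressure explanation becomes more likely.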
So it seems that problems with data management are at least three-fold:
In theory, a Data Scientist (probably together with a Data Engineer) should be able to help suggest solutions for all of those.
It is possible to mitigate the problem by dealing with only one of the above. For example, it seems that the current paradigm normally focuses on hardware. That's not totally true everywhere: HDFS and MapReduce are actually software solutions for low-cost hardware plus a really inefficient data structure (JSON).
In any case, dealing with data might require us to consider these kinds of situations.
evaristoc sends brownie points to @timjavins :sparkles: :thumbsup: :sparkles: