These are chat archives for FreeCodeCamp/DataScience

2nd
Jan 2018
evaristoc
@evaristoc
Jan 02 2018 11:37

Hmmm... I had to redo my Emoji project.

I had been using a Python library, emoji, to capture the emojis in the messages.

It turned out that it only captures emojis written as actual Unicode characters (e.g. 😂). I thought it was also capturing emojis written as keywords, but it is not. To capture emojis written as keywords (e.g. :joy:) I had to use a regular expression instead (fortunately a simple one).
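
A minimal sketch of that kind of regex (the allowed keyword characters here are an assumption, not Gitter's exact rules):

import re

# hypothetical example message containing keyword-style emojis
message = "thanks :joy: that worked :thumbsup:"

# assume keywords consist of lowercase letters, digits, '_', '+' and '-'
keyword_emojis = re.findall(r":([a-z0-9_+\-]+):", message)
print(keyword_emojis)   # ['joy', 'thumbsup']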

The thing is that the number of emojis written as keywords is much larger, and their distribution is likely different from the one I reported using the emoji library.

That also means the data to extract is much larger, because now my program has to visit more messages. That is having a huge impact on the running time. Yesterday I ran a test and after 3 hours it was still running over one raw dataset with no sign of finishing, and I have 3 raw datasets, each with around 2 million messages.

So I am taking a sample instead, otherwise I will never finish this project.

A lot of work, this project... also because I am using my own PC and CPUs for it.

Josh Goldberg
@GoldbergData
Jan 02 2018 13:05
@evaristoc wow! That’s insane. And a great lesson learned for this kind of issue. I thought UTF captures most text. I wonder what keyboards are producing if it is not UTF? A sample of the dataset seems a reasonable choice given how large it is turning out to be. Are you using parallel processing? How long is the run time with the sample?
evaristoc
@evaristoc
Jan 02 2018 13:05

The other thing is that I was scraping unicode.org for the first project to match some additional info about the emojis. Now, with the name shortcut being the most prevalent form in Gitter chatrooms, the unicode.org info is not the best reference... I have to prepare a different scraper for a different source. Fortunately that source exists. Still, more work to do.


People

I am sharing this process with you because it might happen to anyone planning a career as a Data Scientist. As I mentioned before, NO: it is not only about data analysis. Depending on the kind of job you have and where, you might be dealing with a lot of data extraction and manipulation.

Analysing data on Kaggle is probably the nicest part of it, nice because the data usually comes clean. However, cleaning data, solving issues with data, etc... is already a challenge in itself.

So if you are planning to become a Data Scientist or similar, be ready to get your hands dirty, sometimes VERY dirty.

Josh Goldberg
@GoldbergData
Jan 02 2018 13:09
@evaristoc I appreciate you sharing this. Especially since I just read about text encodings in http://r4ds.had.co.nz/data-import.html
evaristoc
@evaristoc
Jan 02 2018 13:32

@GoldbergData

Thanks! No, I decided not to use parallel processing because I needed a quick deployment, and although I know it, I don't practice it frequently.

Parallel is a possible option I was planning to explore. I have used it before with tremendous success.

However, I selected a random sample of 50% of the messages (about 800,000 records) from one of the raw datasets, and it ran considerably quicker than going through the whole dataset.

My impression is that my problem is RAM-related: I had to open the full raw dataset to create another data structure that also demands a large amount of memory. The more memory both datasets require, the less "allocable" memory the computer has left to complete an operation, which also increases the running time per operation.

I think my project is running faster now with sampling probably because the second data structure I am creating is smaller (800,000 records instead of 2,000,000), so there is more RAM available.

I think a parallel solution won't necessarily help with the memory overload. Streaming/generators in chunks could be more applicable. I haven't seriously tried streaming in Python yet. I also failed to understand Python's garbage collector: in my experiments Python didn't completely delete the previous chunks, something I needed in order to recover memory.

This is why I have to leave the Python IDLE to run the program over the next raw dataset: leaving the IDLE cleans up the RAM memory.

The raw dataset is not distributed; it is just a plain file.

CamperBot
@camperbot
Jan 02 2018 13:32
evaristoc sends brownie points to @goldbergdata :sparkles: :thumbsup: :sparkles:
:cookie: 127 | @goldbergdata |http://www.freecodecamp.org/goldbergdata
evaristoc
@evaristoc
Jan 02 2018 13:41

@GoldbergData Great! Encoding is something we usually ignore, but it can really affect your ETL.

I have substantial experience with that: getting "damaged" files during transformations because we ignored encodings, as well as characters that were "forbidden" when passing data between programs.
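
A minimal illustration of how that happens (hypothetical file name): the same UTF-8 bytes read back with the wrong encoding turn into mojibake, without any error being raised.

text = "café 😂"
with open("sample.txt", "w", encoding="utf-8") as f_out:
    f_out.write(text)

with open("sample.txt", "r", encoding="utf-8") as f_in:
    print(f_in.read())    # café 😂  (round-trips correctly)

with open("sample.txt", "r", encoding="latin-1") as f_in:
    print(f_in.read())    # cafÃ©... (bytes silently reinterpreted: mojibake)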

It can steal your sleep, for sure.

And well, this project I am doing is not an exception...

Correcting a phrase above, @GoldbergData:

This is why I have to leave the Python IDLE to run the program over the next raw dataset: leaving the IDLE cleans up the RAM memory.

I wanted to say

This is why I have to EXIT the Python IDLE after running the program over one raw dataset.


PEOPLE

Anyway: if any of you have an idea of how to process data in chunks effectively in Python, I WILL BE VERY HAPPY TO HEAR FROM YOU!!

evaristoc
@evaristoc
Jan 02 2018 14:28

Responding to myself, a possible solution:
https://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb

But I think I would have had to save the very, very raw data (the data coming from Gitter) with a different structure.

What I think is that by opening the file in chunks I will break the json "objects", making them useless.
But, as the SO reference suggests, opening the pickled JSON file might consume twice its size in RAM.
evaristoc
@evaristoc
Jan 02 2018 14:36
I need something that accesses the file line by line and deletes each line completely from RAM afterwards.
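
A minimal sketch of that idea, assuming the data were stored one record per line (the file name and the processing function are hypothetical):

with open("raw_dataset.jsonl", "r") as f_in:
    for line in f_in:          # the file object yields one line at a time (lazily)
        process(line)          # hypothetical per-record processing
        # when `line` is rebound on the next iteration, the previous string
        # becomes garbage-collectable, so only one line is held in RAM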
Josh Goldberg
@GoldbergData
Jan 02 2018 14:48
When I hear chunks I think of knitr files. You could insert Python code in R Markdown files and run each chunk until you’re finished? @evaristoc
evaristoc
@evaristoc
Jan 02 2018 14:54

My file is a pickled JSON, but I think I made a mistake: all the json lines were saved within a list structure after downloading all the data.

Thus, the only way to see the json is to open the list.

I think chunks are not a good solution: they break the lines. I need a structure that behaves as a generator.

I am not sure, but I think I should have appended the json lines to a plain file in the first place instead. A file like that behaves as a generator via the readline method (or plain iteration). Then each line can be converted to json using loads (per line) instead of load (full file) - see the sketch below.

But I am not sure if someone here has an idea.
I asked in the Python room and no-one has responded yet...
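
A minimal sketch of that append-then-loads idea (the file name is hypothetical, and record stands for one downloaded message as a dict):

import json

# while collecting: append one JSON object per line
with open("messages.jsonl", "a") as f_out:
    f_out.write(json.dumps(record) + "\n")

# later: read it back lazily, one line at a time
with open("messages.jsonl", "r") as f_in:
    for line in f_in:
        record = json.loads(line)   # loads (per line) instead of load (full file)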
Josh Goldberg
@GoldbergData
Jan 02 2018 14:57
You could unlist the data? @evaristoc
evaristoc
@evaristoc
Jan 02 2018 14:58
But then I will still have a single line of data, and it won't be a json structure when I open it.
One solution would be to find a way to convert the file into a generator before opening it fully.
No idea how to do that.
Josh Goldberg
@GoldbergData
Jan 02 2018 15:00
I’m not familiar with generators unfortunately. I am curious what the data looks like. Can you make a sample available for me?
@evaristoc
evaristoc
@evaristoc
Jan 02 2018 15:02
I think it is like this:
[jsonobj, jsonobj, jsonobj]
It should have been:
jsonobj
jsonobj
jsonobj
then I could have read the file line by line and used json.loads(line) on each line. File objects behave like generators, and readline works like a yield.
For the data structure I have I am using:
data = pickle.load(file)   # loads the entire pickled list into memory at once

for e in data:
    ...
Where data is the iterable. The only way to access it is by loading the full file into memory.
I will try to see if my idea can help me save the other files that I have to build from the very, very raw data, which are also jsons.
evaristoc
@evaristoc
Jan 02 2018 15:11
Well... actually the very very raw data is saved as:
jsonobj
jsonobj
jsonobj
So I still don't know if I can use it as a generator instead.
Hmmm....
Josh Goldberg
@GoldbergData
Jan 02 2018 15:33
Hmm. I will be interested in your solution. @evaristoc
evaristoc
@evaristoc
Jan 02 2018 16:39
Let's see, @GoldbergData. I think I will use something like this:
http://www.blopig.com/blog/2016/08/processing-large-files-using-python/
goel5
@goel5
Jan 02 2018 20:04
I'm learning data science. I just finished a course, 'intro to datascience using python', from DataCamp. Can anyone guide me further? I'm a beginner.
evaristoc
@evaristoc
Jan 02 2018 20:15

@GoldbergData
This is maybe what I should have done instead: saving each read line (or batches of them) and appending them one by one. But because I waited until the end of the collection to dump all the data into a pickle, apparently the only way to open my pickled data is a read over the full dataset.

So back to the beginning. New lesson to learn...

evaristoc
@evaristoc
Jan 02 2018 20:27

When I used pickling it was because, from former experience, using other Python persistence options (like shelve) or just appending pickles took a lot of time: the program had to write every single line to disk.

The very, very raw data already took a lot of time to download; I didn't want to increase the running time, so I kept everything I could in memory and flushed it to disk when finished.

But now I still have problems with memory.

Luckily it is not that problematic - I can still work on the whole file, only slowly. But I think the best solution for me right now is to continue with the sampling instead of looking at the full dataset, if I want to finish this project any time soon.

evaristoc
@evaristoc
Jan 02 2018 20:48

And here is more about pickling and why NOT to pickle:
https://stackoverflow.com/questions/26394768/pickle-file-too-large-to-load

I think it is the quickest but apparently not the best way to store data. Hmmm...

evaristoc
@evaristoc
Jan 02 2018 21:45

PEOPLE:

A lesson for all of us working with large datasets in Python?

Before serializing data with Python (pickle, json, etc.), think first about whether it is more convenient to save the file as a full dump or line by line.

Serializing line by line might take a bit more time for big files, but it lets you open the file back as a generator. That will help your memory availability a lot.

People tend to use sqlite3 as well. Again, my experience is that it slows your code down while writing to disk. Additionally, there are situations where you are better off with a flexible format (like JSON) rather than a schema-based format like any SQL.

So appropriate serialization is something to take into consideration.

evaristoc
@evaristoc
Jan 02 2018 21:57

@GoldbergData
I tried to use the pickled json file just by unpickling it and saving it as a single file. However, the unpickled json data, when read back as a plain document, is still stored as a list, so it looks like this:

[jsonobj, jsonobj, jsonobj]

To convert the file into a list, I need to open the whole file again (json.load(all_the_file)).

One way to handle this from the current position without opening the whole file is chunking:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

In that case, every single chunk is a string. If I can capture the beginning and end of every jsonobj in the chunk (or across chunks), I can use json.loads(jsonobj) to restore it as json.

I already did a test and it worked.

However, it involves A LOT of regular expressions and while-loops, and is therefore very prone to error.
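
One possible way to avoid the hand-written regexes and while-loops, assuming the re-saved file really is one valid JSON array (i.e. written with json.dump, not with str()): json.JSONDecoder.raw_decode can pull complete objects off the front of a growing string buffer.

import json

def iter_json_array(file_object, chunk_size=1024):
    """Lazily yield the objects of a file that holds one big JSON array."""
    decoder = json.JSONDecoder()
    buf = file_object.read(chunk_size).lstrip()
    if buf.startswith("["):
        buf = buf[1:]                                 # drop the opening bracket
    while True:
        buf = buf.lstrip().lstrip(",").lstrip()       # skip separators between objects
        if buf.startswith("]"):                       # end of the array
            return
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:                  # object still incomplete
            more = file_object.read(chunk_size)
            if not more:                              # EOF: nothing left to decode
                return
            buf += more
            continue
        yield obj
        buf = buf[end:]

Usage would be something like for record in iter_json_array(f_in): ... over the open file, keeping only one decoded object (plus the buffer) in memory at a time.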

The best option is always to save it again line by line.

I will think about it. Those are the files that will go to Kaggle. I might just leave them as they are and see what students on Kaggle are capable of doing with those files.

They could add a bit of a challenge for some people.

Matthew Barlowe
@mcbarlowe
Jan 02 2018 22:05
So when you say save line by line, do you mean each file contains only one line?
@evaristoc
evaristoc
@evaristoc
Jan 02 2018 22:15

@mcbarlowe
Instead of what I did (collecting all the data in memory and making a single dump at the end), for every record you collect, do:

recordline = process_record(record)               # process_record builds the record to store
with open("data_store", "a") as f_out:            # "a" appends instead of overwriting
    serialization_method.dump(recordline, f_out)  # e.g. json (for json, also write a "\n" so each record sits on its own line)

And when opening, you read each record back from data_store:

with open("data_store", "r") as f_in:
     for l in f_in:
          recordline = serialization_method.load(l)
As you see, you have the overhead of dumping every line to the data_store file. From my experience, this adds time, and it takes longer than writing a single file at the end.
evaristoc
@evaristoc
Jan 02 2018 22:22
The "a" of "append" in the first example is key, @mcbarlowe.
evaristoc
@evaristoc
Jan 02 2018 22:28

@mcbarlowe although some people suggest not using it, I wouldn't discard pickling. I suspect that in some cases it is the best way to avoid data corruption.

If you have utf-8 data and save it as text, for example, it could be that, if your system doesn't support some characters, you corrupt the data when opening the file again.

In order to guarantee that the stored files preserve the characteristics of the objects, I think the best way is to append the records as pickles. They are binary.
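
A minimal sketch of that append-as-pickles idea (hypothetical file name; note the binary "ab"/"rb" modes, which pickle requires):

import pickle

# while collecting: append each record as its own pickle frame
with open("data_store.pkl", "ab") as f_out:
    pickle.dump(record, f_out)            # record is one collected message

# later: read the records back one at a time
with open("data_store.pkl", "rb") as f_in:
    while True:
        try:
            record = pickle.load(f_in)    # reads exactly one pickled record
        except EOFError:                  # no more records in the file
            break
        # ... process record ...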

Matthew Barlowe
@mcbarlowe
Jan 02 2018 22:29
Yeah, but you should be able to encode it into utf-8 if writing to text, I would think.
I know you can read it in that way.
evaristoc
@evaristoc
Jan 02 2018 22:30
What if you don't know what kind of data you have?