Boleslav Březovský
@rebolek
However, your simple example parse data ";" would fail on a lot of real-world examples. You really need more complex handling.
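
A minimal sketch of the problem with made-up data, using split in place of the parse one-liner; a delimiter inside a quoted field gets split anyway:

>> split {"hello, world";123} ";"
== [{"hello, world"} "123"]
>> split {"hello; world";123} ";"
== [{"hello} { world"} "123"]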
Gregg Irwin
@greggirwin
@GiuseppeChillemi what is the driving need for that feature? That is, what can't you do today that makes you want it?
Boleslav Březovský
@rebolek

@pekr So, tell me - is this really complex?

>> save/as %test.csv [["hello world" 123]["nazdar světe" 456]] 'csv

>> print read %test.csv
"hello world",123
"nazdar světe",456

>> probe load/as %test.csv 'csv
[["hello world" "123"] ["nazdar světe" "456"]]
== [["hello world" "123"] ["nazdar světe" "456"]]

:smile:

I was trying to follow the "simple things should be simple, complex things should be possible" principle.
Petr Krenzelok
@pekr
Once again, I have nothing against the particular implementation, just trying to point out that the underlying implementation seems a bit complex, and sometimes we see even simpler stuff being rejected for its seeming complexity. But you are right, I have only historically used one concrete scenario ...
What I am not fully sure about is those various record/block modes. Are we sure we will use similar options for an SQL/ODBC codec?
Boleslav Březovský
@rebolek
The underlying code is more complex than simple parse data ";" because it has to correctly handle escaping, custom delimiters and other stuff. Just read this for some fun :) http://secretgeek.net/csv_trouble
Various record/block modes - this is where I'm mostly interested in user input. I can throw out ~50% of the code and support a block of blocks only. But I believe other modes have their purpose.
GiuseppeChillemi
@GiuseppeChillemi
@greggirwin Red datatypes could be used as building blocks for quite everything, I agree. But when you create complex structures, simple operations like select/next/back/tail/at and paths are no longer directly usable to work on the target data. You have to add more steps, making your expressions convoluted. A datatype which adds code to the basic actions lets you create shorter code, done with the very basics of Red words and concepts. It's not a matter of "show me what can't be done", it's a matter of simplicity while expressing and manipulating data.
xqlab
@xqlab
The help text of load, save and other functions with the /as refinement should show all possible encoding types.
Boleslav Březovský
@rebolek
@xqlab you can get all available codecs from system object:
>> print collect [foreach [name codec] system/codecs [keep form name]]
png jpeg bmp gif json csv
Oldes Huhuman
@Oldes
@xqlab You can get supported types using: ? system/codecs. But in Red it is in a block, which does not give nice output. In R3 it is an object.
xqlab
@xqlab
Thanks

Is it intended that the csv functions are not symmetric, i.e.

>> load-csv to-csv  [["hello world" 123]["nazdar svete" 456]] 
== [["hello world" "123"] ["nazdar svete" "456"]]

?

Oldes Huhuman
@Oldes
That is how CSV works. One should not use it unless one has to talk with foreign apps.
Vladimir Vasilyev
@9214
@xqlab a codec is not supposed to load all possible values; that's beyond its scope.
Boleslav Březovský
@rebolek
@xqlab yes, that's intended. Functions that try to convert values to appropriate datatype may be added later (or may not, that's not decided).
xqlab
@xqlab
system/codecs does not show UTF-8 used with read and write
Oldes Huhuman
@Oldes
Because read and write don't use real codecs for now. They were there before the first codec was implemented.
GaryMiller
@GaryMiller
In real-world large data-load applications, CSV files most often can be comma delimited, pipe delimited, or tab delimited. Some CSV files have a comment character, such as # in the first column position of the record, used to document different sections of the CSV file. # is commonly used but is often redefinable to handle files created by other systems. More and more often, Unicode characters occur in the strings being loaded. Embedded single quotes within double-quoted strings are common and to be expected, as are column delimiters such as commas. Both integer values and floating-point numbers are often included. Most Python CSV handling reads the CSV file and maps it into a list of objects where the object attributes are the column names. Numeric values are mapped directly into numeric variables to avoid data conversion overhead later when processing.
Boleslav Březovský
@rebolek

@GaryMiller Thanks for your input!

CSV files most often can be comma delimited, pipe delimited, or tab delimited.

The CSV codec uses comma as the default; a different character can be selected using the /with refinement. I did some work on delimiter auto-detection, but that's an experimental feature, not available in the master branch.
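
A minimal REPL sketch with made-up data (assuming /with takes the delimiter as a character):

>> load-csv/with "name;qty^/apple;10^/pear;5" #";"
== [["name" "qty"] ["apple" "10"] ["pear" "5"]]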

Some csv files have comment character...

Thanks, I’ll look into it.

UNICODE

If the file is UTF-8, Red handles it well.

Embedded single quotes within double-quoted strings are common and to be expected, as are column delimiters such as commas.

Quote chars are fully supported; you can choose whether you want to use a single or double quote (or something different, if you want).

Most Python CSV handling reads the csv file and maps it into a list of objects where the object attributes are the column names.

This can be achieved with the /as-records refinement.

Numeric values are mapped directly into numeric variables...

I need to run some tests to see how much code it adds, and how much it will slow down the conversion.

Gregg Irwin
@greggirwin

I thought about comments as well @GaryMiller. It's a bit of a pain to work around with remove-each read/lines + rejoin, but as we talked about more options, the number of refinements grows and we end up looking at deeper changes, which are probably not where we want to go. e.g. load/as could take a spec block as the type arg, which contains option settings.
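
A rough sketch of that workaround (filename and comment character made up for illustration):

lines: read/lines %data.csv
remove-each line lines [#"#" = first line]                             ; drop comment lines
data: load-csv rejoin collect [foreach line lines [keep line keep newline]]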

Auto-loading fields is another deep design question, which we discussed. Either you take a big hit by processing every field, or you add another option to guide what to load. @GaryMiller do you happen to know, or know who to ask, how Pythonistas feel about their CSV solution? We're happy to learn from their experience.

@xqlab the current doc string has e.g. to denote that not all codecs are listed. However, there's a precedent with checksum which does list them all, and that can be useful. A big difference is that checksum is a fixed func, while the codec system is dynamic. As soon as people start writing their own codecs, the doc string either has to stay incomplete, or be dynamic itself, when a codec is registered. If we keep the standard codecs small (and we should), I agree that listing them all there is a good idea.
Gregg Irwin
@greggirwin

@GiuseppeChillemi can you please give me a concrete example of what you want to do that makes the datatype solution more expressive? That will help me a lot in understanding what's in your head.

This is a good general design note as well. For me, the greatest benefit from Red's wealth of datatypes comes from their literal forms, by far. It's like a periodic table. And lexical space is tight. We're already seeing limits on new forms without adding special characters and sigils that are less human friendly. People can, of course, but remember that new forms also mean changes to the lexer, which is not something we can easily make dynamic and robust. We need to define Red's syntax, for interoperability and longevity. And for our own sanity. :^)

Having spent a fair number of years doing OOP, I do know what that view looks like. But nobody has proven that it's objectively better. It's good for some things, but not for others; most of the time it's just different, not better or worse.

There's a great book by Bertrand Meyer, an OOPer if ever there was one, called Reusable Software. In it, he argues for naming consistency. He explains it as a Linnaean Classification model. Things are grouped because they share common characteristics, and you use the same names to talk about them. An example is the series in Red. We have insert, append, remove, take, etc., which work the same on all of them. We don't have enqueue/dequeue and push/pop, and we shouldn't. You can write those, when you want the meaning, but you don't need new queue!/stack! datatypes to do it. It's tempting, I know. I've done it.

The interesting thing, and where Red's difference comes into play, is that Red is aaaaalllll about the data. How that data is processed is context sensitive. OOP, and the idea of adding more datatypes with associated actions, conflates data and behavior. Of course, we do this too, and have objects, rather than a strict separation at all times. But our objects are different. Datatypes are static by design, while objects are really data containers and some of that data is evaluated for its behavior (e.g. functions). You can do the same thing with blocks, though you lose the implicit context aspect of the words in the object. That is what makes Red special, so it's key. Of course, we can mix and match and create any other system or model on top of this.

As to your thought (I know it was a question, not a feature request), it leans toward this, where it has to be thought about deeply, and championed by someone who really wants it.

GaryMiller
@GaryMiller
@Greg Irwin Pythonistas used to use the standard library. But for big jobs now they usually call NumPy or Pandas, which are much faster since they are based on optimized, statically compiled code. Invoking them is usually a few lines shorter than the standard library. This article gives a comparison of the three different approaches. https://machinelearningmastery.com/load-machine-learning-data-python/
Rudolf Meijer
@meijeru
@GaryMiller Please be careful with @-names; your above post was signalled to a totally different person than you intended, and who probably does not know what to do with it...
Gregg Irwin
@greggirwin

Thanks for the link @GaryMiller!

I often have to redo @ names because Gitter selects the wrong one.

Gregg Irwin
@greggirwin

Some good considerations in that article @GaryMiller.

  • The most common format for machine learning data is CSV files. (if that's true ;^)
  • Pandas lets you specify names to go with the data, if they aren't included. I don't know what the result looks like, but we don't have an option for that, which would go with as-records/as-columns.
  • A comment talked about auto conversion errors. Once we have HOFs, mapping conversions will be short to write, and shouldn't suffer a performance hit, compared to doing it inside the codec. Looks like numpy has a converters option for this.
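
As a rough illustration of that post-load conversion (made-up data; plain loops today, HOFs later):

data: [["1" "2" "3"] ["4" "5" "6"]]
foreach row data [forall row [row/1: load row/1]]    ; convert each field in place
probe data    ; == [[1 2 3] [4 5 6]]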

Reading more:

from io import StringIO
import numpy as np

obj = StringIO("1, 2, 3\n4, 5, 6")
n, m, p = np.loadtxt(obj, delimiter=', ', usecols=(0, 1, 2), unpack=True)
print("value of n: ", n)
print("value of m: ", m)
print("value of p: ", p)

# Output
value of n:  [ 1.  4.]
value of m:  [ 2.  5.]
value of p:  [ 3.  6.]

For us, that would be (minus float auto-loading):

>> load-csv/as-columns "1, 2, 3^/4, 5, 6"
== #(
    "A" ["1" "4"]
    "B" [" 2" " 5"]
    "C" [" 3" " 6"]
)

This is the crux of the issue, and why CSV is a great and terrible thing. It's used for 3 types of data, none of which is "the one".

  • Spreadsheet model: rows, cols (named or numbered), cells
  • DB Model: rows of records, often with named columns
  • Column model: pivoted spreadsheet model.

As I commented on the PR at one point, if we have to pick only one, it has to be a block of blocks. The question is, are we better off minimizing the codec and putting all the other logic in a module? I.e., load/as will only return a block of blocks. Ever. The rest is up to a blessed module or users themselves.
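
A minimal user-land sketch of that split, with made-up names and data: the codec hands back a block of blocks, and a module (or the user) reshapes it, e.g. into per-row maps:

to-records: function [data [block!]][
    header: first data
    collect [
        foreach row next data [
            rec: make map! []
            repeat i length? header [put rec header/:i row/:i]
            keep rec
        ]
    ]
]

records: to-records [["name" "qty"] ["apple" "10"] ["pear" "5"]]
; first record is now a map: #("name" "apple" "qty" "10")
select first records "name"    ; == "apple"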

Your mission, dear community, should you choose to accept it: Mock up what that looks like in your mind. What's the best design if we go that way?

Boleslav Březovský
@rebolek

specify names to go with the data, if they aren't included

If names are not included, they are currently auto-generated; adding support for custom names is a very small change.

Gregg Irwin
@greggirwin
A small change, yes, but another user-facing option.
Boleslav Březovský
@rebolek
Right.

To avoid spaces in data, you can use /trim:

>> load-csv/as-columns/trim "1, 2, 3^/4, 5, 6"
== #(
    "A" ["1" "4"]
    "B" ["2" "5"]
    "C" ["3" "6"]
)

Keeping the spaces is the default, as it's more "correct". But maybe going with the more useful option would be preferable, and having the opposite refinement of /trim?

OTOH they use unpack = True which is equivalent of /trim.
Boleslav Březovský
@rebolek
I've studied Python CSV parsers when working on Red one, so the features should roughly match.
GaryMiller
@GaryMiller

Also a very nice Pandas feature for large CSV files is working directly with a zipped file.

import pandas as pd
import zipfile

zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))

You can also process multiple CSV files from inside of one zip file like this:

with zipfile.ZipFile('file.zip') as zip:
    with zip.open('file.csv') as myZip:
        df = pd.read_csv(myZip)

Gregg Irwin
@greggirwin
I would like a standard zip codec, but it's not a great fit for the codec model by itself. It could load the content info, but then we need supporting funcs to do something with them, unless load returns a full object that includes them, making it a much more complex codec. So zip may be a module, as convenient as it would be to build in.
Boleslav Březovský
@rebolek
Archives are a better fit for ports than codecs.
Gregg Irwin
@greggirwin
Agreed.
GiuseppeChillemi
@GiuseppeChillemi
@greggirwin Thanks for your long answer. The real fact is that I have many ideas and usage examples I am trying to combine in my mind, but I am somewhat intimidated by the high-level audience. Also, my other fear regards a psychological effect: once you introduce something and start receiving the first negative feelings and answers because you are not good at introducing ideas, it will be harder to let others see them as positive later. So my wish was to make a good presentation, to not damage the ideas with my errors (if the ideas are really interesting). OK, yes, I should overcome my fears, write about my views, and let others go beyond my "simple" words and errors. I need a little more courage to write down my thoughts.
Gregg Irwin
@greggirwin
I'm patient. Take your time. :^)
nedzadarek
@nedzadarek
@GiuseppeChillemi @greggirwin If we care only about meanings we can do this (at least in the interpreter):
  stack!: block!
; block!
  push: function [s [stack!] x][append s x]
; func [s [stack!] x][append s x]
  st: make stack! []
; []
  push st 1
; [1]
  push st 2
; [1 2]
Gregg Irwin
@greggirwin
That way lies madness. It won't keep you from passing a block you don't intend to treat as a stack. And it will create the false impression that it has stack-like behavior. The helper funcs (e.g. push/pop) are enough.
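
For instance, a minimal sketch of such helpers over a plain block! (names illustrative, not a proposed API):

push: func [stack [block!] value][append/only stack value]
pop:  func [stack [block!]][also last stack remove back tail stack]

s: []
push s 1
push s 2
pop s    ; == 2, s is now [1]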
xqlab
@xqlab
@greggirwin If the doc string does not contain all codecs, it could at least contain a reference to where to find them, e.g. "see also system/codecs".
I prefer a simple block over a block of blocks as long as there is no native find/deep.
Gregg Irwin
@greggirwin
@xqlab something like "For full list use extract system/codecs 2"?
And I hear some brain cylinders pumping now... Can we embed code in doc strings for help to use? ;^)
Boleslav Březovský
@rebolek
That would be cool.
>> ? now
USAGE:
     NOW 

DESCRIPTION: 
     Returns current date and time (17-9-2019, 9:55).
It would be a security risk, but a cool one :smile: