@pekr So, tell me - is this really complex?
>> save/as %test.csv [["hello world" 123]["nazdar světe" 456]] 'csv
>> print read %test.csv
"hello world",123
"nazdar světe",456
>> probe load/as %test.csv 'csv
[["hello world" "123"] ["nazdar světe" "456"]]
== [["hello world" "123"] ["nazdar světe" "456"]]
It can't be just parse data ";" because it has to correctly handle escaping, custom delimiters and other stuff. Just read this for some fun :) http://secretgeek.net/csv_trouble
Is it intended that the CSV functions are not symmetric? i.e.
>> load-csv to-csv [["hello world" 123]["nazdar svete" 456]]
== [["hello world" "123"] ["nazdar svete" "456"]]
@GaryMiller Thanks for your input!
CSV files are most often comma delimited, pipe delimited, or tab delimited.
The CSV codec uses comma as the default; a different character can be selected using the /with refinement. I did some work on delimiter auto-detection, but that's an experimental feature, not available in the master branch.
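For illustration, a quick sketch of what that looks like (assuming /with takes the delimiter as a char!):
>> load-csv/with "1|2|3^/4|5|6" #"|"
== [["1" "2" "3"] ["4" "5" "6"]]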
Some CSV files have a comment character...
Thanks, I’ll look into it.
If the file is UTF-8, Red handles it well.
Embedded single quotes within double-quoted strings are common and to be expected, as are embedded column delimiters such as commas.
Quote chars are fully supported; you can choose whether to use single or double quotes (or something different, if you want).
Most Python CSV handling reads the csv file and maps it into a list of objects where the object attributes are the column names.
This can be achieved with
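For comparison, a minimal Red sketch of that mapping, built on the plain block-of-blocks result and assuming the first row holds the column names:
csv: load-csv "name,age^/Alice,42^/Bob,7"
header: first csv
records: collect [
    foreach row next csv [
        rec: make map! []
        repeat i length? header [put rec header/:i row/:i]
        keep rec
    ]
]
select records/1 "name"    ; == "Alice"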
Numeric values are mapped directly into numeric variables...
I need to run some tests to see how much code it adds, and how much it will slow down the conversion.
I thought about comments as well @GaryMiller. It's a bit of a pain to work around with remove-each + read/lines + rejoin (sketched below), but as we talk about more options, the number of refinements grows and we end up looking at deeper changes, which are probably not where we want to go. e.g. load/as could take a spec block as the type arg, which contains option settings.
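The workaround, sketched (assuming # marks comment lines):
lines: read/lines %data.csv
remove-each line lines [line/1 = #"#"]    ; drop comment lines
text: copy ""
foreach line lines [append append text line newline]
data: load/as text 'csv                   ; block of blocks of strings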
Auto-loading fields is another deep design question, which we discussed. Either you take a big hit by processing every field, or you add another option to guide what to load. @GaryMiller do you happen to know, or know who to ask, how Pythonistas feel about their CSV solution? We're happy to learn from their experience.
The doc string could use e.g. to denote that not all codecs are listed. However, there's a precedent with checksum, which does list them all, and that can be useful. A big difference is that checksum is a fixed func, while the codec system is dynamic. As soon as people start writing their own codecs, the doc string either has to stay incomplete, or be dynamic itself, updated when a codec is registered. If we keep the standard codecs small (and we should), I agree that listing them all there is a good idea.
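If it were dynamic, the list could come straight from the registry at registration time. A one-line sketch, assuming system/codecs is the codec registry and keys-of applies to it (the exact shape may differ):
print ["Available codecs:" mold sort keys-of system/codecs]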
@GiuseppeChillemi can you please give me a concrete example of what you want to do that makes the datatype solution more expressive? That will help me a lot in understanding what's in your head.
This is a good general design note as well. For me, the greatest benefit from Red's wealth of datatypes comes from their literal forms, by far. It's like a periodic table. And lexical space is tight. We're already seeing limits on new forms without adding special characters and sigils that are less human friendly. People can, of course, but remember that new forms also mean changes to the lexer, which is not something we can easily make dynamic and robust. We need to define Red's syntax, for interoperability and longevity. And for our own sanity. :^)
Having spent a fair number of years doing OOP, I do know what that view looks like. But nobody has proven that it's objectively better. It's good for some things, but not for others; most of the time it's just different, not better or worse.
There's a great book by Bertrand Meyer, an OOPer if ever there was one, called Reusable Software. In it, he argues for naming consistency. He explains it as a Linnaean classification model: things are grouped because they share common characteristics, and you use the same names to talk about them. An example is series in Red. We have insert, append, remove, take, etc., which work the same on all of them. We don't have push/pop, and we shouldn't. You can write those when you want the meaning, but you don't need new queue!/stack! datatypes to do it. It's tempting, I know. I've done it.
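A minimal sketch, for the record:
push: func [stack [series!] value][append/only stack value]
pop:  func [stack [series!]][take/last stack]

s: copy []
push s 1
push s 2
pop s    ; == 2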
The interesting thing, and where Red's difference comes into play, is that Red is aaaaalllll about the data. How that data is processed is context sensitive. OOP, and the idea of adding more datatypes with associated actions, conflates data and behavior. Of course, we do this too, and have objects, rather than a strict separation at all times. But our objects are different. Datatypes are static by design, while objects are really data containers and some of that data is evaluated for its behavior (e.g. functions). You can do the same thing with blocks, though you lose the implicit context aspect of the words in the object. That is what makes Red special, so it's key. Of course, we can mix and match and create any other system or model on top of this.
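A tiny sketch of that difference:
o: object [x: 1 show: does [x]]    ; x is bound to o's context
b: [x: 1 show: does [x]]           ; just data; nothing is bound or evaluated
o/show                             ; == 1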
As to your thought (I know it was a question, not a feature request), it leans toward that kind of change, where it has to be thought about deeply, and championed by someone who really wants it.
Some good considerations in that article @GaryMiller.
Numpy has a converters option for this. Look at how many options loadtxt has, though; that's the problem we face, and they're only concerned with the numeric view. There's also an unpack option, which transposes the result and lets you assign it, like so:

from io import StringIO
import numpy as np

obj = StringIO("1, 2, 3\n4, 5, 6")
n, m, p = np.loadtxt(obj, delimiter=', ', usecols=(0, 1, 2), unpack=True)
print("value of n: ", n)
print("value of m: ", m)
print("value of p: ", p)

# Output
# value of n:  [ 1.  4.]
# value of m:  [ 2.  5.]
# value of p:  [ 3.  6.]
For us, that would be (minus float auto-loading):
>> load-csv/as-columns "1, 2, 3^/4, 5, 6"
== #("A" ["1" "4"] "B" [" 2" " 5"] "C" [" 3" " 6"])
This is the crux of the issue, and why CSV is a great and terrible thing. It's used for 3 types of data, none of which is "the one".
As I commented on the PR at one point, if we have to pick only one, it has to be a block of blocks. The question is, are we better off minimizing the codec and putting all the other logic in a module? i.e., load/as would only return a block of blocks. Ever. The rest is up to a blessed module or users themselves.
Your mission, dear community, should you choose to accept it: Mock up what that looks like in your mind. What's the best design if we go that way?
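To seed the mocking-up, here's one hypothetical shape (illustrative names only, nothing blessed): the codec stays minimal, and a module adds the other views on top of the block of blocks. e.g. a columns view:
csv-columns: function [blk [block!] "block of blocks, as load/as returns"][
    collect [
        repeat i length? first blk [
            keep/only collect [foreach row blk [keep row/:i]]
        ]
    ]
]

csv-columns [["1" "2" "3"] ["4" "5" "6"]]
; == [["1" "4"] ["2" "5"] ["3" "6"]]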
To avoid spaces in data, you can use /trim:
>> load-csv/as-columns/trim "1, 2, 3^/4, 5, 6"
== #("A" ["1" "4"] "B" ["2" "5"] "C" ["3" "6"])
It's on by default, as it's more "correct". But maybe going with the more useful option would be preferable, and have the opposite refinement instead.
unpack = True is the equivalent of our /as-columns.
You can also process multiple CSV files from inside one zip file, like this:
import zipfile
import pandas as pd

with zipfile.ZipFile('file.zip') as archive:    # 'archive' avoids shadowing built-in zip
    with archive.open('file.csv') as csv_file:
        df = pd.read_csv(csv_file)
A zip codec would help there, but it's not a great fit for the codec model by itself. It could load the content info, but then we need supporting funcs to do something with them, unless load returns a full object that includes them, making it a much more complex codec. So zip may be a module, as convenient as it would be to build in.
help to use? ;^)