These are chat archives for nrinaudo/kantan.csv

14th May 2016
杨博 (Yang Bo)
@Atry
May 14 2016 17:35
Hi, @nrinaudo ,
I found that all the examples work with the file system. I wonder if kantan.csv is able to work with Spark's RDD. For example, given an HDFS file URL and a row definition case class MyCsvRow(cell0: String, cell1: Int), is kantan.csv able to load the HDFS file into a RDD[MyCsvRow]?
Nicolas Rinaudo
@nrinaudo
May 14 2016 18:09
@Atry honestly, I've no idea. I've never used Spark in my life. What type does your CSV data come as? If it's a File or URL, for instance, you're good to go
otherwise, have a look at CsvInput, which allows you to turn any type into something that can be read as CSV data
Here, for example, is how kantan.csv turns all instances of URL into sources of CSV data.
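The idea behind CsvInput can be sketched in plain Scala. This is a conceptual illustration only, not kantan.csv's actual trait (the real name, method signatures, and instances differ): a typeclass that knows how to open a java.io.Reader on some type, so any value of that type can serve as a source of CSV data.

```scala
import java.io.{BufferedReader, InputStreamReader, Reader, StringReader}
import java.net.URL

// Conceptual stand-in for kantan.csv's CsvInput (hypothetical names):
// a typeclass that opens a Reader on some type A.
trait CsvSourceLike[A] {
  def open(a: A): Reader
}

object CsvSourceLike {
  // A String *is* its own CSV data.
  implicit val stringSource: CsvSourceLike[String] =
    (s: String) => new StringReader(s)

  // A URL merely points at CSV data; opening it yields the stream's contents.
  implicit val urlSource: CsvSourceLike[URL] =
    (u: URL) => new InputStreamReader(u.openStream())
}

// Any type with an instance can be consumed uniformly.
def firstLine[A](a: A)(implicit src: CsvSourceLike[A]): String = {
  val r = new BufferedReader(src.open(a))
  try r.readLine() finally r.close()
}
```

The point of the pattern: kantan.csv ships such instances for String, File, URL and more, and supporting a new source type is a matter of providing one more instance, typically by adapting an existing one.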
杨博 (Yang Bo)
@Atry
May 14 2016 18:12
The problem is that RDD is not an in-memory collection. It is a handle to distributed, typed data.
Nicolas Rinaudo
@nrinaudo
May 14 2016 18:12
yeah, that's fine. File or URL don't contain the data either, just the path to it
can you open a java.io.Reader on RDD?
杨博 (Yang Bo)
@Atry
May 14 2016 18:12
No
Nicolas Rinaudo
@nrinaudo
May 14 2016 18:12
ok. So RDD is an actual scala type, then?
杨博 (Yang Bo)
@Atry
May 14 2016 18:13
Yes
Nicolas Rinaudo
@nrinaudo
May 14 2016 18:13
fine. I'm off for dinner for a bit, but I'll have a look and see if I can come up with something
杨博 (Yang Bo)
@Atry
May 14 2016 18:13
Does kantan.csv provide the ability to convert a String containing a single row to a case class?
Nicolas Rinaudo
@nrinaudo
May 14 2016 18:13
sure, there's a CsvInput instance for String
杨博 (Yang Bo)
@Atry
May 14 2016 18:14
I mean one row
Nicolas Rinaudo
@nrinaudo
May 14 2016 18:14
"a,b,c\nd,e,f".asCsvRows[(Char,Char,Char)](',', false)
杨博 (Yang Bo)
@Atry
May 14 2016 18:14
Not for the entire CSV
Nicolas Rinaudo
@nrinaudo
May 14 2016 18:14
oh
not simply, no
but let me have a look at RDD and I'll get back to you later
杨博 (Yang Bo)
@Atry
May 14 2016 18:15
Is there a Decoder from a String to a case class?
Nicolas Rinaudo
@nrinaudo
May 14 2016 18:15
I really need to be off for a bit, my wife is giving me the eye. Catch you in a bit.
杨博 (Yang Bo)
@Atry
May 14 2016 18:15
Thank you!
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:00
ok, it seems I might have been misunderstanding your question from the get go. You're not asking me how to turn a RDD into CSV, but CSV into a RDD, right?
杨博 (Yang Bo)
@Atry
May 14 2016 19:01
Yes. And I wonder if it works both ways
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:02
you mean read from a RDD, write as CSV?
杨博 (Yang Bo)
@Atry
May 14 2016 19:04
I asked about reading a CSV file into a RDD. At the same time, I also wonder how to write a RDD as a CSV.
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:05
right. The second part is easy: it seems you can get an iterator off of a RDD, so you can just pass that to, say, new File("output.csv").writeCsv(',')
杨博 (Yang Bo)
@Atry
May 14 2016 19:08
Not that easy, because the file is usually on HDFS instead of the native file system
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:08
that's still an instance of File though
or Path
杨博 (Yang Bo)
@Atry
May 14 2016 19:09
Not an instance of File or Path, unfortunately.
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:09
I don't understand, HDFS is just a file system like another. You mount it and access it like a local file system
杨博 (Yang Bo)
@Atry
May 14 2016 19:10
HDFS is not a regular file system; it only exposes a file-system-like API.
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:10
can you open a Writer on it?
杨博 (Yang Bo)
@Atry
May 14 2016 19:11
No
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:11
how do you write to it?
杨博 (Yang Bo)
@Atry
May 14 2016 19:12
Because it is distributed, the data shouldn't be read and written through local memory.
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:12
so, you have a file system you cannot read from or write to? I think we might be having a language issue there
This stack overflow post seems to imply you can get an OutputStream on HDFS
杨博 (Yang Bo)
@Atry
May 14 2016 19:14
Yes. But that's not recommended when working with Spark
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:14
so how do you write to it?
杨博 (Yang Bo)
@Atry
A RDD is distributed and a HDFS file is also distributed. You may want to write or read the file on the cluster instead of on the local machine.
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:18
sure, but in theory (and according to what I googled), you're abstracted from that complexity and can get actual OutputStream and InputStream instances on HDFS
if you need to use a different abstraction layer, then kantan.csv cannot deal with it at this time
杨博 (Yang Bo)
@Atry
May 14 2016 19:19
OK
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:19
well, for the writing part I suppose you could serialise your CSV data as a String in memory and then save that, but that will probably end up being prohibitively expensive
杨博 (Yang Bo)
@Atry
May 14 2016 19:20
That's why I wonder if I could serialise one row into a String
So that I can get a RDD[String] and then write the RDD into HDFS
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:22
you could hack something together, I suppose. Say you have a RowEncoder[A]. That's essentially an A => Seq[String]
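That "A => Seq[String]" shape can be sketched with a minimal stand-in typeclass (hypothetical names; kantan.csv's real RowEncoder lives in the library and is derived automatically for case classes):

```scala
// Minimal sketch of the RowEncoder idea: an encoder is essentially
// a function from A to a sequence of cells.
trait SimpleRowEncoder[A] {
  def encode(a: A): Seq[String]
}

// The row type from the discussion above.
case class MyCsvRow(cell0: String, cell1: Int)

// A hand-written instance; kantan.csv would derive the equivalent for you.
implicit val myCsvRowEncoder: SimpleRowEncoder[MyCsvRow] =
  (r: MyCsvRow) => Seq(r.cell0, r.cell1.toString)

def encodeRow[A](a: A)(implicit e: SimpleRowEncoder[A]): Seq[String] =
  e.encode(a)
```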
杨博 (Yang Bo)
@Atry
May 14 2016 19:22
I see
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:23
If you're working with non-text data, you can then just call mkString(","), if you want your column separator to be ,
something like this, maybe:
def transform[A: RowEncoder](input: RDD[A]): RDD[String] = input.map(a => RowEncoder[A].encode(a).mkString(","))
杨博 (Yang Bo)
@Atry
May 14 2016 19:23
Does seq.mkString(",") work? Is there any escaping issue about the comma?
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:24
there absolutely is an issue about the comma
well, I suppose you could also consider that each entry in the RDD is a CSV stream of one row exactly?
hang on, yeah, of course that would work
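The comma issue is easy to demonstrate: per RFC 4180, a cell containing the separator, a quote, or a line break must be wrapped in quotes, with embedded quotes doubled. A minimal quoting sketch (not kantan.csv's actual implementation, which handles all of this for you):

```scala
// Quote a single cell if it contains a character that would break
// naive mkString-based serialisation (RFC 4180 rules).
def quoteCell(cell: String, sep: Char = ','): String =
  if (cell.exists(c => c == sep || c == '"' || c == '\n' || c == '\r'))
    "\"" + cell.replace("\"", "\"\"") + "\""
  else cell

// Serialise one row: quote each cell as needed, then join on the separator.
def encodeLine(cells: Seq[String], sep: Char = ','): String =
  cells.map(quoteCell(_, sep)).mkString(sep.toString)
```

With this, a row like Seq("a", "b,c") serialises to a,"b,c" rather than the ambiguous a,b,c that a plain mkString(",") would produce.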
杨博 (Yang Bo)
@Atry
May 14 2016 19:26
Thank you!
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:26
def transform[A: RowEncoder](input: RDD[A]): RDD[String] = input.map(a => List(a).asCsv(','))
you might need to trim the result, I forget whether kantan.csv adds a trailing line break or not
as for turning a CsvReader[A] into a RDD[A], frankly, I've no idea. I'd need to learn quite a bit more about spark than I know right now.
Not necessarily a bad idea, and it might result in a spark-specific kantan.csv module, but not going to happen overnight
If you know of some resources that could get me started quickly, it'd be much appreciated
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:31
@Atry how big is your CSV data? Can you load it all in memory?
杨博 (Yang Bo)
@Atry
May 14 2016 19:33
If I wanted to load the data into memory, I would not choose Spark.
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:34
I'm asking because I see that you're not supposed to create RDD from iterators
杨博 (Yang Bo)
@Atry
May 14 2016 19:35
Maybe not
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:35
ok, so I think I misunderstood your first question then. What type do you have to begin with?
杨博 (Yang Bo)
@Atry
May 14 2016 19:36
A String containing an HDFS file URL
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:37
you mean something like "hdfs://localhost:1234/path/to/file.csv", right?
杨博 (Yang Bo)
@Atry
May 14 2016 19:37
Yes
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:38
and assuming you wanted to read, say, all lines from that URL, how would you do it?
(bear with me, I know nothing about Spark's APIs)
with something like a StreamingContext?
Nicolas Rinaudo
@nrinaudo
May 14 2016 19:49
Alright, I'll have a closer look. It's not looking good, mind - I've seen a few things, such as the JSON module, that seem to hint that your data should be split by lines, which is an assumption that just does not work for CSV
thanks for pointing me in the right direction, I'll let you know if I can come up with something
杨博 (Yang Bo)
@Atry
May 14 2016 19:51
Thank you!
Nicolas Rinaudo
@nrinaudo
May 14 2016 20:00
You're quite welcome, I appreciate your patience and suggestions