These are chat archives for nrinaudo/kantan.csv

6th
Jun 2018
Anton Kulaga
@antonkulaga
Jun 06 2018 06:41
Looks like kantan.csv IDEA issue is still not resolved :(
Nicolas Rinaudo
@nrinaudo
Jun 06 2018 06:43
it's not a kantan.csv IDEA issue though. It's a general IDEA issue that you just happen to have encountered using kantan.csv
Anton Kulaga
@antonkulaga
Jun 06 2018 06:44
yes, that is why I suggest to put more votes there=)
I have a question: is there a way to reduce kantan.csv's memory consumption? For instance, if I only need a few columns of a CSV, how can I get them without loading everything into memory? I have to deal with CSVs of hundreds of megabytes
Nicolas Rinaudo
@nrinaudo
Jun 06 2018 06:45
I think that a better way of getting it fixed would be to not tie it to a relatively obscure library but to show that IDEA is bad with syntax enrichment
well, first, kantan.csv works in a way that unless you want to, you only ever hold a single row in memory at any given time
so the size of the CSV file is not very relevant, the size of each ROW is relevant
second, yes and no. When decoding a CSV row, kantan.csv must have the entire row as a Seq[String] in memory before passing it to a RowDecoder. But you can certainly decode rows to a type that doesn't use all cells
to take a concrete example, if you have a CSV where each row is composed of a hundred strings, and you only need the first one:
  • there will be, for a moment, one hundred strings in memory
  • if you decode to String, the remaining 99 strings will be discarded (and gc-ed out, if you have good gc options) immediately
let me know if that's not clear enough
Anton Kulaga
@antonkulaga
Jun 06 2018 06:49

In my use case I have large .tsv-s with

 val headers = Seq("Name", "Length", "EffectiveLength", "TPM", "NumReads")

There I only want to get Name and TPM.
Currently I do it with:

  type SimpleSalmon = (String, Int, Double, Double, Double)

p.toIO.unsafeReadCsv[Vector, SimpleSalmon](config.withHeader).map(v=>v._1 -> v._4)

is there a way to decrease memory consumption here?

p is Ammonite path to the File
Nicolas Rinaudo
@nrinaudo
Jun 06 2018 06:50
there are multiple ways to reduce the memory usage here, yes. As I said, you don't load the entire file unless you ask for the entire file to be loaded. And you do
do you need the entire file in memory?
Anton Kulaga
@antonkulaga
Jun 06 2018 06:51
I need pairs Name -> TPM in memory
Nicolas Rinaudo
@nrinaudo
Jun 06 2018 06:51
ok, so then instead of taking each row in memory, why don't you only take the data you need in memory?
by which I mean write a custom decoder for SimpleSalmon (a RowDecoder by index, or a HeaderDecoder by column name) that only takes the first and fourth cell
something like:
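import kantan.csv._
import kantan.csv.ops._

final case class SimpleSalmon(name: String, tpm: Double)
// Pick the "Name" and "TPM" columns by header name; all other cells are dropped.
implicit val simpleSalmonDecoder: HeaderDecoder[SimpleSalmon] = HeaderDecoder.decoder("Name", "TPM")(SimpleSalmon.apply _)

p.toIO.unsafeReadCsv[Vector, SimpleSalmon](config.withHeader)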
(off the cuff, might not compile)
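A positional variant of the same idea, as a minimal sketch rather than a tested recipe: it uses a RowDecoder that keeps only the first and fourth cells by index instead of matching on header names. The names NameTpm, tsv and sample are hypothetical and only there for illustration, the inline string stands in for the real file, and the tab separator is an assumption based on the .tsv input mentioned earlier.

```scala
import kantan.csv._
import kantan.csv.ops._

// Hypothetical target type: only the two fields that are actually needed.
final case class NameTpm(name: String, tpm: Double)

// Decode by position: keep cell 0 (Name) and cell 3 (TPM), ignore the others.
implicit val nameTpmDecoder: RowDecoder[NameTpm] =
  RowDecoder.decoder(0, 3)(NameTpm.apply _)

// Assumed configuration: tab-separated cells, first row is a header to skip.
val tsv = rfc.withCellSeparator('\t').withHeader

// Hypothetical stand-in for the real file contents.
val sample =
  "Name\tLength\tEffectiveLength\tTPM\tNumReads\n" +
  "tx1\t1000\t850.5\t12.5\t40.0\n"

// Each row still exists briefly as a full Seq[String], but only NameTpm is retained.
val rows = sample.unsafeReadCsv[Vector, NameTpm](tsv)
```

Decoding by index does not depend on the header text, but it breaks silently if the column order ever changes; the header-based decoder is the safer choice when the header row is reliable.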
Nicolas Rinaudo
@nrinaudo
Jun 06 2018 06:57
or, if you're really really keen on decoding all cells (because you might want to validate the entire CSV file, even the fields you don't use, say), you could also use asCsvReader, do the mapping on the iterator-like structure this returns, and then load the whole thing in memory with toList
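A rough sketch of that second approach, under a few assumptions: it uses asUnsafeCsvReader (which throws on the first decoding error) rather than asCsvReader (which yields one result per row that may be a failure), the inline string stands in for the real file, and the names SalmonRow, tsv and sample are hypothetical.

```scala
import kantan.csv._
import kantan.csv.ops._

// Every cell is decoded, so a malformed Length or NumReads still fails loudly.
type SalmonRow = (String, Int, Double, Double, Double)

// Assumed configuration: tab-separated cells, first row is a header to skip.
val tsv = rfc.withCellSeparator('\t').withHeader

// Hypothetical stand-in for the real file contents.
val sample =
  "Name\tLength\tEffectiveLength\tTPM\tNumReads\n" +
  "tx1\t1000\t850.5\t12.5\t40.0\n" +
  "tx2\t2000\t1900.0\t0.5\t3.0\n"

// The reader is lazy: rows are decoded one at a time, mapped to the pair we
// care about, and only those pairs are accumulated by toList.
val pairs: List[(String, Double)] =
  sample
    .asUnsafeCsvReader[SalmonRow](tsv)
    .map(row => row._1 -> row._4)
    .toList
```

Compared with reading into a Vector of full tuples and mapping afterwards, this never holds more than one fully decoded row at a time alongside the accumulated pairs.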