These are chat archives for kite-sdk/kite

23rd
Sep 2014
Manthosh Kumar
@manthosh
Sep 23 2014 11:48

My application continuously writes to partitioned Avro files using the Hive module. I need to query them with Impala, but the files are persisted in HDFS only after writer.close() is called. Before close() is called I only see the temporary file. Calling writer.sync() didn't help either. What should I do?

I tried calling close() periodically, but that created many very small Avro files in each partition. If I need to periodically call close() and initialize a new writer, how often should I call close() for efficient querying?
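
One way to frame the "how often should I call close()" question is as a roll policy: close the current writer and open a new one when it has either grown large enough or been open long enough, whichever comes first. The sketch below is only an illustration of that idea in plain Python. The `RollPolicy` class and both default thresholds are hypothetical and are not part of the Kite API.

```python
from dataclasses import dataclass


@dataclass
class RollPolicy:
    """Decides when to close the current file and start a new one."""

    # Both defaults are illustrative assumptions, not Kite settings.
    max_bytes: int = 1024 ** 3        # roll once the file reaches this size
    max_age_seconds: float = 3600.0   # roll once the file has been open this long

    def should_roll(self, bytes_written: int, age_seconds: float) -> bool:
        # Roll on whichever limit is hit first.
        return bytes_written >= self.max_bytes or age_seconds >= self.max_age_seconds


policy = RollPolicy()
print(policy.should_roll(bytes_written=10 * 1024 ** 2, age_seconds=120.0))
print(policy.should_roll(bytes_written=10 * 1024 ** 2, age_seconds=3600.0))
```

With a low write rate the age limit fires first, so the number of files per partition is bounded by the age limit rather than by how often close() happens to be called.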

Joey Echeverria
@joey
Sep 23 2014 16:27
What is your SLA for having the data available for query?
Manthosh Kumar
@manthosh
Sep 23 2014 19:51
It's like producer and consumer.
Joey Echeverria
@joey
Sep 23 2014 19:52
Sure, but how quickly must data from a producer be available to a consumer?
Manthosh Kumar
@manthosh
Sep 23 2014 19:55
For hourly aggregation
Joey Echeverria
@joey
Sep 23 2014 19:56
For that use case, your best bet is to use a dataset partitioned by hour and to execute the query at the top of the next hour.
If you care about a real-time view, then you should split your flow:
one to HDFS and one to a streaming system like Spark Streaming
that would do the real-time part of the aggregation
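
The hourly suggestion above can be sketched as two small helpers: one maps a record timestamp to its hour partition, the other picks the most recent hour that has fully elapsed, which is the one that is safe to query at the top of the next hour. This is plain Python for illustration only. The `year=/month=/day=/hour=` path layout and both function names are assumptions, not something confirmed in this chat.

```python
from datetime import datetime, timedelta


def hour_partition(ts: datetime) -> str:
    """Hive-style partition path for the hour containing ts (layout is assumed)."""
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}"


def latest_complete_partition(now: datetime) -> str:
    """Partition for the most recent hour that has fully elapsed."""
    start_of_current_hour = now.replace(minute=0, second=0, microsecond=0)
    return hour_partition(start_of_current_hour - timedelta(hours=1))


# A query that runs shortly after 20:00 reads the 19:00 partition.
print(latest_complete_partition(datetime(2014, 9, 23, 20, 1)))
```

The writer for an hour still has to be closed before that partition is queried, so in practice the query would be scheduled a little after the hour boundary rather than exactly on it.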
Manthosh Kumar
@manthosh
Sep 23 2014 20:01
Thanks. What file size should I aim for with Avro if I don't need real-time aggregation?
Joey Echeverria
@joey
Sep 23 2014 20:34
if you're aiming to process it with Impala
the rule of thumb is you want roughly 1GB files
it's also useful to set the block size to 1GB
so each file is a single block
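
To relate the roughly 1GB rule of thumb to hourly partitions, it helps to work out how much data one hour actually produces. The sketch below does that arithmetic in plain Python. The ingest rates are made-up examples, the helper names are hypothetical, and 1GB is taken as 1 GiB for simplicity.

```python
import math

GIB = 1024 ** 3  # treating "1GB" as 1 GiB, an assumption made for round numbers


def seconds_to_fill(bytes_per_second: float, target_bytes: int = GIB) -> float:
    """How long a writer must stay open to reach the target file size."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return target_bytes / bytes_per_second


def files_per_hour(bytes_per_second: float, target_bytes: int = GIB) -> int:
    """Number of target-sized files one hourly partition would hold (at least one)."""
    hourly_bytes = bytes_per_second * 3600
    return max(1, math.ceil(hourly_bytes / target_bytes))


# Hypothetical ingest rates: 5 MiB/s and 100 KiB/s.
for rate in (5 * 1024 ** 2, 100 * 1024):
    print(rate, seconds_to_fill(rate), files_per_hour(rate))
```

When an hour of data is well under the target, as with the slower rate here, a single file per hourly partition is the largest file that partitioning scheme allows, and closing the writer more often only makes the files smaller. In HDFS the block size is controlled by the `dfs.blocksize` property.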
Manthosh Kumar
@manthosh
Sep 23 2014 20:36
Thanks.