apt-get -y install python
@yongtang ok, thanks for the pointers. I have gone ahead and opened 2 (test) issues (below) for completing the port of contrib.cloud/*, just so it can be tracked and it is more explicit what is missing from the original contrib.cloud. Outside of these tf_py_test targets, the rest is already ported.
@henrytansetiawan Oh by the way, to invoke pytest it should be
TFIO_DATAPATH=bazel-bin python -m pytest -s -v tests/test_xxx.py
The earlier message missed the TFIO_DATAPATH=bazel-bin prefix.
dataset = tf_io.VideoDataset(filename=['file1.mp3', 'file2.mp3'], batch=10)
TensorShape([Dimension(None), Dimension(None), Dimension(None), Dimension(3)])
@yongtang After summer break, we finally have some time now to implement our sophisticated "Streaming Kafka ML" demo in the next few weeks and want to leverage TensorFlow IO.
We have some test data from car sensors here: https://github.com/kaiwaehner/hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference/blob/master/testdata/car-sensor-data.csv
While this is CSV right now (and easy to process, similar to our earlier example: https://github.com/kaiwaehner/hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference/blob/master/python-scripts/autoencoder-anomaly-detection/Sensor-Kafka-Consumer-and-TensorFlow-Model-Training.py), we will use the Avro format for the whole pipeline, i.e. the car sensors will already produce Avro messages.
Is it possible to also easily use a consumer for TensorFlow I/O which deserializes Avro data? We will probably use KafkaAvroSerializer (https://docs.confluent.io/current/schema-registry/serializer-formatter.html). I think this should be no problem, as the TensorFlow I/O Kafka plugin probably does not require a specific deserializer?
Can I come back to you when we have all the details ready?
(in the MVP, we will just implement the pipeline from car via mqtt and kafka broker to consumer, but in V2 in a few weeks, we want to add TensorFlow I/O for model training...)
Thanks @kaiwaehner for the update.
I agree CSV is not a good format for IoT messages.
In TensorFlow I/O we have Avro support for the file format. That is not the same as an Avro deserializer, though it should be straightforward to add one.
If you have some sample messages I can play with, I could probably add Avro message support easily. (TensorFlow I/O itself is generic, so it is not necessarily tied to Avro. However, it would be really interesting to see tensorflow-io generating deserialized Avro messages out of the box.)
Let me know if there is any update; in the meantime, I will take a look at the Avro deserializer and see if I can get started early.
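As background for the deserializer discussion: messages produced by Confluent's KafkaAvroSerializer carry a small wire-format header (a 0x00 magic byte plus a 4-byte big-endian schema registry ID) in front of the Avro binary body. A minimal, standard-library-only sketch of splitting that framing (the helper name `parse_confluent_message` is hypothetical, not part of TensorFlow I/O):

```python
import struct

def parse_confluent_message(raw: bytes):
    """Split a Confluent-framed Kafka message into (schema_id, avro_payload).

    KafkaAvroSerializer prepends 5 bytes: a magic byte (0x00) followed by
    the 4-byte big-endian schema registry ID. The rest is the Avro
    binary-encoded record, to be handed to an Avro decoder.
    """
    if len(raw) < 5 or raw[0] != 0:
        raise ValueError("not a Confluent-framed message")
    schema_id = struct.unpack(">I", raw[1:5])[0]
    return schema_id, raw[5:]

# Example: frame a dummy payload with schema ID 42, then parse it back.
framed = b"\x00" + struct.pack(">I", 42) + b"\x02abc"
schema_id, payload = parse_confluent_message(framed)
# schema_id is 42; payload is the Avro body for the actual decoder.
```

Since the Kafka dataset in TensorFlow I/O delivers raw message bytes, a deserializer could strip this header first and then decode the payload against the schema fetched from the registry.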
@AlexisBRENON The tf.data API is best suited when you already have preprocessed data stored in files (parquet/tfrecord/etc.) and ready to be fed into tf.keras. In that case multiple columns may not help a lot. In other situations, tf.data gives you an iterable, and further processing might be limited.
Though you may have use cases for multiple columns in generic data engineering. In such a case you could still zip multiple columns with parquet. However, I would expect subpar performance, as parquet is naturally RowGroup-based.
@AlexisBRENON Can you open an issue on GitHub? The issue could be easily addressed with some simple C++ maneuvering, I think.
(feature, label) could use the zip method to form a tuple dataset.
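A minimal sketch of the zip approach (assuming TensorFlow is installed), pairing two single-column datasets into (feature, label) tuples with tf.data.Dataset.zip; the literal values are just placeholders for columns read from parquet:

```python
import tensorflow as tf

# Two single-column datasets, e.g. each read from a separate parquet column.
features = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0])
labels = tf.data.Dataset.from_tensor_slices([0, 1, 0])

# zip pairs them element-wise into a (feature, label) tuple dataset,
# which can be passed directly to tf.keras Model.fit.
dataset = tf.data.Dataset.zip((features, labels))

pairs = [(float(f), int(l)) for f, l in dataset]
```

Because zip iterates the underlying datasets in lockstep, each source column is read independently, which is where the RowGroup-related performance penalty mentioned above comes from.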
Under the current implementation there might be some performance penalties with the above-mentioned method. But that could be resolved easily as well; it just needs several PRs. One PR is on TensorFlow's core repo:
I haven't had enough time to complete this PR recently, but I plan to get it updated and merged in the next week or so.
It's time for our next monthly meeting again tomorrow. There are several important items we would really like the community's help with:
1) Documentation for SIG I/O and linkage to tensorflow.org (sent with previous email)
2) StructTensor RFC (https://github.com/tensorflow/community/pull/151). The StructTensor is very much related to our work with columnar data formats. (also sent with previous email)
We would really appreciate the community joining and discussing these in the upcoming meeting.
The SIG IO monthly meeting is scheduled for tomorrow, Thursday 09/12, 11:00 AM to 12:00 PM Pacific Time.
Below is the link to the meeting doc we can build on: