
Yang Zhuohan
@billstark
Thanks for the help!
Daniel Robert
@drobert
maybe a long shot, but if anyone here is using Ratatool and the magnolia-based CaseClassGenerator ... I have a case class with a Long parameter, and deriveGen[MyCaseClass] fails with magnolia: could not find Gen.TypeClass for type Long. I'm surprised Long doesn't work out of the box, but curious how I might fix this
Daniel Robert
@drobert
ah. It's because the CaseClassGenerator is aware of how to interpret the positions and types of the arguments of a case class, but the developer has to specify the appropriate Gen for each required type. Makes sense.
also, Ratatool is incredible and is saving me a huge amount of boilerplate
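(A sketch of the fix described above — assuming Ratatool's generator builds on ScalaCheck's Gen and requires an implicit instance for every primitive field type; the import path is from ratatool 0.3.x and may differ by version:)

```scala
// Sketch: magnolia-based deriveGen needs an implicit ScalaCheck Gen in
// scope for each field type; ScalaCheck does not provide implicit
// Gen[Long] out of the box, hence the error above.
import org.scalacheck.Gen
import com.spotify.ratatool.scalacheck._  // assumed import path for deriveGen

case class MyCaseClass(id: Long, name: String)

implicit val genLong: Gen[Long]     = Gen.chooseNum(Long.MinValue, Long.MaxValue)
implicit val genString: Gen[String] = Gen.alphaStr

val gen: Gen[MyCaseClass] = deriveGen[MyCaseClass]
```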
Daniel Robert
@drobert
actually, is anyone using Ratatool with Scio right now? I'm on scio 0.8.0-beta2 and ratatool 0.3.13 and I can't get tests to run due, presumably, to incompatible ratatool versions (error is java.lang.NoSuchMethodError: magnolia.TypeName.<init> etc.).
I see spotify/ratatool#176 from Neville noting the version difference, and I'm including exclude ("me.lyh", "magnolia") in my dependency of ratatool, but I'm still getting this incompatibility. Is this solvable right now?
Daniel Robert
@drobert
^ update: it was PEBKAC. I wasn't properly excluding a transitive dependency on the older magnolia version from ratatool. Things are OK now.
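(A sketch of the sbt exclusion discussed above — the organization ID `me.lyh` comes from the conversation; check which magnolia artifact your ratatool version actually pulls in:)

```scala
// build.sbt — drop ratatool's transitive dependency on the older
// magnolia fork so it doesn't clash with the one scio brings in
libraryDependencies += ("com.spotify" %% "ratatool-scalacheck" % "0.3.13")
  .excludeAll(ExclusionRule(organization = "me.lyh"))
```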
Neville Li
@nevillelyh
my fork is only for Scala 2.11; for Scala 2.12 the conflict is from magnolia 0.11 vs 0.12 transitive deps. Pin it for now until all our libs are published with the latest version
Daniel Robert
@drobert
Is there a changelog published anywhere for magnolia? I learned most of the incompatibilities (and even the fact that 0.12 was released) from twitter :-/
Mimi Mikusek
@MikusekMimi_twitter
I am new to Scio and wondering if there's a way to mock the Bigtable connection in unit tests?
tosun-si
@tosun-si
Hi, I've been working with Apache Beam and Scio for a few weeks now.
I like it a lot, thanks. I have a little problem. I use the latest version of Scio, 0.8.2-beta. I use the saveAsTypedBigQuery method with a Scala macro. The code compiles and works well with sbt but not with IntelliJ. I have the latest version of IntelliJ and I have the Scio plugin installed, but the problem remains. I work on Ubuntu. Do you have any ideas about this issue? Thanks for your help
tosun-si
@tosun-si
@MikusekMimi_twitter if you want to mock the Bigtable connection, maybe you can use ScalaMock or Mockito for Scala.
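(Another common approach, sketched here in plain Scala — not a Scio API; the trait and names are illustrative — is to hide the Bigtable connection behind a small interface and substitute an in-memory fake in unit tests:)

```scala
// Sketch: wrap Bigtable access behind a trait so unit tests can swap in
// an in-memory fake instead of a real connection (names are illustrative).
trait BigtableReader {
  def read(rowKey: String): Option[String]
}

// Production code would implement this with the real Bigtable client;
// tests use a Map-backed fake:
final class FakeBigtableReader(data: Map[String, String]) extends BigtableReader {
  def read(rowKey: String): Option[String] = data.get(rowKey)
}

object BigtableMockDemo {
  // Pipeline logic depends only on the trait, so it is unit-testable.
  def lookupAll(reader: BigtableReader, keys: Seq[String]): Seq[String] =
    keys.flatMap(reader.read)
}
```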
tosun-si
@tosun-si
Today I updated the Scio IntelliJ plugin and it works 👍🏻
tosun-si
@tosun-si
I had another problem:
Daniel Robert
@drobert
@tosun-si I didn't know there was such a plugin. What does it do?
tosun-si
@tosun-si
@drobert it is a plugin (Scio) that gives some support in IntelliJ, for example for Scala macros generated by Scio.
But the issue doesn't occur anymore after updating the plugin.
Neville Li
@nevillelyh
scio-parquet is designed to work with Avro compiled classes (.java files), not the macro annotation
Parquet has so many in-memory representations, so it really depends on ur use case; the Avro one (projecting into SpecificRecord) is not perfect and has a lot of quirkiness
Yang Zhuohan
@billstark
Guys, sorry for bothering again. I have one more question about loading Parquet data. I think @nevillelyh has provided a way to do projection and predicate. However, I believe these two are suitable for a single Parquet file. What if, for instance, we have a time-partitioned structure like bucket/year=2019/month=xx/input.parquet and we want to do push-down filtering on "month", would it be possible? Or would it be good enough to just specify month at the beginning of file loading, as load("bucket/year=2019/month=11/*.parquet")?
Neville Li
@nevillelyh
If month is part of the file pattern, not a schema field, then yes, use a file-pattern wildcard
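(A minimal sketch of that — assuming scio-parquet's Avro reader; `MyRecord` and the path are illustrative:)

```scala
// Sketch: when the partition is encoded in the path rather than in the
// schema, select it up front with a file-pattern wildcard.
val points = sc.parquetAvroFile[MyRecord]("gs://bucket/year=2019/month=11/*.parquet")
```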
Jacob Brouwer
@jabrouwer82
Hello! My team is building our own internal set of ScioIOs and one issue we've encountered is that saveAsCustomOutput requires an explicit name. Is there a reason it doesn't use the same name generation machinery as other transformations like applyTransform, and is there a way we can get access to that machinery without using private methods?
Neville Li
@nevillelyh
the name is also used as unique id for JobTest IO stubbing so it's better to have it
Jacob Brouwer
@jabrouwer82
I understand why it might be better to have an explicit name in some cases, but couldn't you just use withName to add a constant name?
saveAsCustomOutput is the only method on SCollection that forces the user to come up with a unique name
To clarify, does that name have to be unique, or can we just pick an arbitrary string to name these for users of our custom ScioIOs?
Nick Aldwin
@NJAldwin

is there a way we can get access to that machinery without using private methods?

I would love it if CallSites were made less private. We ended up having to copy the entire file to use it (which is an option you could consider here)

Neville Li
@nevillelyh
you can, but if u use JobTest#input(CustomIO("name"), data) the name needs to match
@NJAldwin feel free to file issue, if u have use case snippet that'd be great
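(A sketch of that name matching — assuming Scio's JobTest DSL; the job object, record type, and "MyOutput" id are illustrative:)

```scala
// Sketch: the name passed to saveAsCustomOutput must match the id used
// in JobTest's CustomIO stub, otherwise the test can't intercept the IO.
import com.spotify.scio.io.CustomIO
import com.spotify.scio.testing._

JobTest[MyJob.type]
  .args("--output=gs://bucket/out")
  .output(CustomIO[MyRecord]("MyOutput")) { out =>
    out should containInAnyOrder(expected)
  }
  .run()
```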
Jacob Brouwer
@jabrouwer82
So if we're not using JobTest then the names don't need to be unique?
@NJAldwin Unfortunately, we've been declaring extension methods in the scio package, e.g. package com.spotify.scio.ingest { ... }, and I'm trying to cut down on our reliance on scio internals
Neville Li
@nevillelyh
@jabrouwer82 correct, also if u need to extend scio internals, feel free to file issues with snippet illustrating ur use case, maybe it's better to make our code more extendable than hack around it
Jacob Brouwer
@jabrouwer82
I agree, I just wanted to reach out here to see if we were just missing something that already existed first. Thanks!
Jacob Brouwer
@jabrouwer82
I created a new issue, please let me know what I can do to help with this! spotify/scio#2447
Jacob Brouwer
@jabrouwer82
Just a quick followup, I opened a pr with what I think are the changes I want, spotify/scio#2449, please feel free to ping me if there's anything I can do with this pr!
Daniel Robert
@drobert

question about Coders. We saw in production a huge amount of GC. Heap dumps showed it to be Kryo-related. I re-read https://labs.spotify.com/2019/05/30/scio-0-7-a-deep-dive/ and saw that there can be a memory leak using Kryo in streaming pipelines, so I've started replacing the Kryo fallback coders with more explicit coders.

Two questions:
1) given the leak, should kryo serialization be avoided entirely?
2) I can't seem to register or utilize a custom coder for a RabbitMqMessage (from our Source). The DirectRunner, during its immutability checks, goes through a ser/deser process, and I see Kryo fallback deserialization in place rather than any custom coders I've written (or the default SerializableCoder.of) in use. I don't see how I can use an implicit here since this isn't code directly in my control; how do I impact this/avoid Kryo?

Nejc Ilenic
@inejc
Hello, I also have a question regarding coders. I have this step in my pipeline: .withName("CombinePoints").combine[List[Point]](x => List(x))((combined, x) => x :: combined)(_ ++ _), which essentially causes streaming job updates to fail under load with Workflow failed. Causes: The new job is not compatible with … The original job has not been aborted. The Coder or type for step CombinePoints/Combine.perKey(anon$2)/GroupByKey has changed. Updates fail without changing any of the code. Thanks for the help.
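(For reference, the three argument lists of that combine correspond to createCombiner, mergeValue, and mergeCombiners; a plain-Scala model of the same semantics, with a made-up Point class:)

```scala
// Plain-Scala model of the combine step above (Point is illustrative).
case class Point(x: Double, y: Double)

object CombineDemo {
  val createCombiner: Point => List[Point]                  = x => List(x)
  val mergeValue: (List[Point], Point) => List[Point]       = (combined, x) => x :: combined
  val mergeCombiners: (List[Point], List[Point]) => List[Point] = _ ++ _

  // Simulate combining the values seen by one worker/partition.
  def combinePartition(points: Seq[Point]): List[Point] =
    points.tail.foldLeft(createCombiner(points.head))(mergeValue)
}
```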
Jayadeep Jayaraman
@cejj
Hello, I have a requirement to read json files from pubsub and write to GCS and then create external tables in BQ. This would be a streaming job and I am not able to figure out how to create the BQ table with --autodetect option in Scio
I see a feature of Tap which can be used to perform lightweight orchestration, but the documentation says it can only run at the end of the job, which in the case of a streaming job will not work. Any other ideas on how I can achieve this?
Daniel Robert
@drobert
ignore my number 2 above. Just number 1: should kryo serialization be avoided entirely for long-running streaming jobs (given the memory leak)?
Neville Li
@nevillelyh
@drobert yes, especially for custom POJOs or memory/latency-critical jobs; the Kryo serializer is stateful and hard to manage within Beam's coder framework
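(A sketch of replacing the Kryo fallback with an explicit coder — assuming RabbitMqMessage is java.io.Serializable, so Beam's SerializableCoder works as a stopgap; a hand-written coder would be more efficient:)

```scala
// Sketch: pin an explicit coder so Scio doesn't fall back to Kryo
// for a class outside your control.
import com.spotify.scio.coders.Coder
import org.apache.beam.sdk.coders.SerializableCoder

implicit val rabbitCoder: Coder[RabbitMqMessage] =
  Coder.beam(SerializableCoder.of(classOf[RabbitMqMessage]))
```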
Daniel Robert
@drobert
excellent. ok, thanks
I got this information principally from a blog post, would you be open to me PRing/contributing some notes on this directly to the Scio docs?
Neville Li
@nevillelyh
@cejj u need to create tables before running the job, using BQ's java API or gcloud CLI
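(A sketch of doing that with the `bq` command-line tool before launching the streaming job — bucket, dataset, and table names are illustrative:)

```shell
# Sketch: define an external table over the GCS JSON files, letting
# BigQuery autodetect the schema, then create the table up front.
bq mkdef --autodetect --source_format=NEWLINE_DELIMITED_JSON \
  "gs://my-bucket/events/*.json" > table_def.json
bq mk --external_table_definition=table_def.json mydataset.events
```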
@drobert sure, we have this page about Kryo, but that's before the new magnolia-based coder: https://spotify.github.io/scio/internals/Kryo.html, so either that or maybe an FAQ topic
@inejc more details plz, what changed between the jobs?
Nejc Ilenic
@inejc
@nevillelyh nothing changed, that's what's bothering me. If I update the job with exactly the same state (code and parameters) it will fail, given that the already running job has processed some data (or is processing at the time of the update). The problematic pipeline is very similar to this (withGlobalWindow followed by combine).
Neville Li
@nevillelyh
file issue with snippet plz