Leo
@leobenkel

Hello everyone! I would like to announce an open source library I wrote to easily merge Spark with the functionality of ZIO. All the boilerplate is already set up, and you just have to take advantage of ZIO's speed and improved performance: https://github.com/leobenkel/Zparkio . One example of a performance gain can be seen here: https://github.com/leobenkel/Zparkio/blob/master/ProjectExample_MoreComplex/src/main/scala/com/leobenkel/zparkioProfileExampleMoreComplex/Transformations/UserTransformations.scala

Let me know what you think :)

PhillHenry
@PhillHenry
@edmondo1984 Do you know why "Spark has inferred a String"? And why do you not "want to perform a select using from_json"?
Edmondo Porcu
@edmondo1984
@PhillHenry my JSON has an irregular structure, so you cannot infer a single schema for the whole dataset; you will need to split it into two parts (and I am a newbie with Spark, I have to say). I thought that simply applying a single parsing function, df.map(parse), would be easier since my dataset is small enough.
PhillHenry
@PhillHenry
Yes, this question is asked quite a lot in this forum and I've never seen an answer.
It seems to me (and I might be wrong) that if you don't have a regular JSON schema, you can't have a regular Spark schema, and there's no way around this. This isn't a failing of Spark; it simply can't be done with any technology (how can a product know what schema you want when you don't even know?).
My $0.02.
Edmondo Porcu
@edmondo1984
@PhillHenry but once I am in a row, I have a string type which I know holds JSON, and it can be either an array or an object.
How can I test that? Should I introduce an additional library?
@PhillHenry also, I don't want to mix the two approaches.
PhillHenry
@PhillHenry
Test whether it is an array or an object? If that's all you want, then using a JSON library to parse it in a map or filter function seems OK to me, FWIW.
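A minimal sketch of that suggestion, using json4s (which ships with Spark) to tag each raw JSON string by its top-level shape inside a map. The column name "payload" and the example values are assumptions, not from the original question:

import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods

object JsonShapeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-shape").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical column "payload" holding the raw JSON strings of irregular shape.
    val raw = Seq("""{"a": 1}""", """[1, 2, 3]""").toDF("payload")

    // Parse on the executors and tag each row by whether the top level is an array or an object.
    val tagged = raw.as[String].map { s =>
      JsonMethods.parseOpt(s) match {
        case Some(_: JArray)  => (s, "array")
        case Some(_: JObject) => (s, "object")
        case _                => (s, "other")
      }
    }.toDF("payload", "shape")

    tagged.show(truncate = false)
    spark.stop()
  }
}

From there the two shapes can be filtered into separate datasets and each given its own schema.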
PhillHenry
@PhillHenry
@drorsssssss Doesn't something like Ambari do this?
Frank Dekervel
@kervel
hello, i'm still not sure i understand how Spark memory allocation works. i have a job on a dataset with 11 partitions, and i got a Kryo buffer overflow + out of memory. i increased the number of partitions to 40 and my errors are gone. i didn't increase worker nodes or memory size, so i don't understand why this would make a difference.
Eric K Richardson
@ekrich
@kervel Are you deploying your Spark app as a fat jar with dependencies other than the ones that are included with Spark?
drorsssssss
@drorsssssss
@PhillHenry I’ve searched over the net for solutions and found that Ganglia and Grafana+Prometheus are the most used. I’m using CDH, but found nothing about it (I guess I can build something of my own, though).
@kervel The fact that you increased the number of partitions made each partition hold a smaller amount of data. Then, when a task is submitted to an executor, it processes less data, which means less memory consumption per executor (one task per partition).
As a disclaimer, I would say it also depends on the number of cores per executor.
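To illustrate the point (a sketch only; the paths and partition counts are placeholders, not from the original job):

// Same data, more and therefore smaller partitions: each task now holds less data
// at a time, which also makes it easier to stay under Kryo's buffer limit.
val df = spark.read.parquet("/path/to/input")     // hypothetical input
println(df.rdd.getNumPartitions)                  // e.g. 11 in the original job

val repartitioned = df.repartition(40)            // 40 smaller partitions -> smaller tasks
repartitioned.write.parquet("/path/to/output")    // hypothetical output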
Frank Dekervel
@kervel
@ekrich i'm using a fat jar indeed (using sbt-assembly), but my fat jar is becoming obese (>200MB), and i'm using Spark in an interactive environment (notebooks), so a fast startup is comfortable. In the meantime, i tried just adding my jar to /opt/spark/jars on both the driver and executor images instead, but i have the impression that's not the way to do it (spark-shell still works but Polynote refuses to start).
@drorsssssss well yes, but i didn't increase the number of executors, so every executor will have more partitions to process. So i would think the total size of my dataset divided by the total number of executors remains the same.
@ekrich also, sbt-assembly works nicely on my Linux laptop but is insanely slow on Windows for some reason (i guess lots of small files), so a build takes ages.
Eric K Richardson
@ekrich
@kervel Ok, wanted to check. Maybe you could split up your pipeline into different apps to reduce the dependencies if you have different sources and sinks?
Maybe it's a memory thing on Windows, so swapping is excessive?
Frank Dekervel
@kervel
unfortunately i can't really split it up in my case: i have one source (tabular data) and one sink (a GeoTrellis raster).
but i'm wondering how other people do Spark Scala development. Do you do a full "sbt assembly" on every iteration, or just build a jar, or run directly from sbt?
Eric K Richardson
@ekrich
When I was doing it, I was just running the driver locally, but then assembly and spark-submit for the cluster test. I have only a few months of experience, though. Seems like a lot of dependencies.
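A sketch of that local-first workflow (the app name and the placeholder pipeline are made up): only default to a local master when none was supplied, so the same main class runs straight from sbt or an IDE during development and can still be submitted unchanged with spark-submit for the cluster test.

import org.apache.spark.sql.SparkSession

object DevJob {
  def main(args: Array[String]): Unit = {
    val builder = SparkSession.builder().appName("dev-job")   // hypothetical app name

    // spark-submit sets spark.master itself, so only fall back to local[*]
    // when running directly from sbt or an IDE.
    val spark =
      if (sys.props.contains("spark.master")) builder.getOrCreate()
      else builder.master("local[*]").getOrCreate()

    spark.range(10).show()   // trivial placeholder pipeline
    spark.stop()
  }
}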
Godsgift ThankGod AKARI
@giftakari
Hi everyone, I just started learning Scala, but I am not new to programming.
I'm seeking some tips and advice on things one can do with Scala and how to succeed as a Scala developer.
Eric K Richardson
@ekrich
@giftakari Welcome, you should try this chat https://gitter.im/scala/scala
Eric K Richardson
@ekrich
@kervel You could try giving sbt more memory to see if it improves the Windows assembly - maybe like this https://leonard.io/blog/2015/06/setting-sbt-memory-options/
Frank Dekervel
@kervel
Ok, will try tomorrow..
@giftakari this video got me up to speed quickly https://youtu.be/4QIgEMvUfIE
Godsgift ThankGod AKARI
@giftakari
Thanks @ekrich and @kervel
Alejandro Drago
@Aleesminombre_twitter
Hello group, has someone used K-means with Scala?
I want to compute the closest cluster center for each observation
for a dataframe.
Alejandro Drago
@Aleesminombre_twitter
I have built this function in R:
closest.cluster <- function(x) {
  cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x - y)^2)))
  return(which.min(cluster.dist)[1])
}
but I need a little help to translate it to Scala.
S. Guzeloglu
@guzeloglusoner

Hello guys!

I would like to log the execution times of functions/methods in my Spark SQL code. I saw that spark.time(df.show) prints "Time taken" to the console, but spark.time returns Unit, so I couldn't get the time-taken value.

I wonder whether this is the right way to get the execution time, and whether there are any other ways to achieve it?

Thanks in advance!

sumitya
@sumitya

@guzeloglusoner - what's your use case? Checking the function's wall-clock time probably won't be very useful if the Spark SQL is run from within the function: Spark will run multiple stages and tasks, any of which might be straggling, and that in turn gives you a longer or shorter function execution time. Job execution time can be extracted from the Spark UI.
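If logging the number yourself is still the goal, a minimal sketch is to measure around the action with System.nanoTime (the timed helper below is hypothetical, not a Spark API). It measures whatever the action triggers, with all the caveats about stages and straggling tasks above:

// Hypothetical helper: runs a block, logs how long it took, and returns the result plus the time.
def timed[T](label: String)(block: => T): (T, Long) = {
  val start  = System.nanoTime()
  val result = block
  val millis = (System.nanoTime() - start) / 1000000L
  println(s"$label took $millis ms")   // or hand this to your logger of choice
  (result, millis)
}

// Usage: time the action that actually triggers the work.
val (_, elapsedMs) = timed("df.show")(df.show())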

sanjay sharma
@sanjubaba1984_twitter
Hi experts, I would like to know what common things we should keep in mind before writing a Spark job.
Ethan
@esuntag
Keep in mind which operations will run remotely on the executors, and which will pull everything down to the driver. Don't collect on a dataset with millions of rows, and also remember that if you only have a few MB of data you can just use a List instead of an RDD.
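A small sketch of that first point (the path is a placeholder): collect() materialises every row in driver memory, so prefer take or limit when you only need a peek, and keep genuinely tiny data in a plain Scala collection.

val bigDf = spark.read.parquet("/path/to/big/table")   // hypothetical large input

// Risky on millions of rows: the whole dataset lands on the driver.
// val everything = bigDf.collect()

// Safer: pull back only what you actually need.
val preview = bigDf.take(20)

// A few MB of reference data? A plain List on the driver is often enough.
val lookup: List[String] = List("a", "b", "c")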
D Cameron Mauch
@DCameronMauch
Anyone know where to ask Zeppelin questions?
I’m trying to run the 0.8.2 docker container, and it asks for a login/password...
Can’t find anything about this anywhere.
The previous versions don’t do this and automatically log you in as anonymous.
Deepak Singh
@dpk3d
Hi guys, can you help me with available libraries to convert XML/XSD schemas to Avro schemas in Scala/Spark?
Alejandro Drago
@Aleesminombre_twitter
Hello guys, I am working with k-means and I need to assign the closest cluster to each observation. This function in R does that, but I need to do it in Scala:
df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)
closest.cluster <- function(x) {
  cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x - y)^2)))
  return(which.min(cluster.dist)[1])
}
clusters2 <- apply(df2, 1, closest.cluster)
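A sketch of the same idea with Spark ML, assuming the data sits in two numeric columns x and y as in the R data frames above (the column names and k=3 mirror the R snippet). KMeansModel.transform adds a "prediction" column holding the index of the closest cluster centre for each row, which is what closest.cluster computes:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object ClosestCluster {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("closest-cluster").master("local[*]").getOrCreate()
    import spark.implicits._

    val rng = new scala.util.Random(42)
    val df1 = Seq.fill(100)((rng.nextDouble(), rng.nextDouble())).toDF("x", "y")  // like R's df1
    val df2 = Seq.fill(100)((rng.nextDouble(), rng.nextDouble())).toDF("x", "y")  // like R's df2

    // Pack the numeric columns into the single vector column Spark ML expects.
    val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")

    // Fit k-means with 3 centres on df1, like kmeans(df1, centers=3).
    val model = new KMeans().setK(3).setSeed(1L).fit(assembler.transform(df1))

    // "prediction" is the closest cluster centre for every observation in df2.
    val clusters2 = model.transform(assembler.transform(df2)).select("x", "y", "prediction")
    clusters2.show(5)

    spark.stop()
  }
}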
Godsgift ThankGod AKARI
@giftakari
Hi everyone,
what are good resources for learning Apache Spark?
astrabelli
@astrabelli