Michael K. Edwards
@mkedwards
compile is the default
Eric K Richardson
@ekrich
Not added to the end of the dependency?
Michael K. Edwards
@mkedwards
and it results (via the maven-assembly-plugin) in incorporation into the fat jar
it's a scope
Eric K Richardson
@ekrich
Are you using Maven not sbt?
Michael K. Edwards
@mkedwards
correct
that may have been a poor choice to begin with :D
Eric K Richardson
@ekrich
ok, should be okay then - have you unzipped the assembly to inspect?
Michael K. Edwards
@mkedwards
yes
this dependency is in the sql-kafka component
relatively newly added
part of a switch to a different connection pool strategy in the singleton connection pool tracker
bit of a misleading error message, but you really helped
looks like I'm past that!
Eric K Richardson
@ekrich
Ok, good luck - it should be fine as long as the classfile versions are compatible; I'm not seeing any evidence of a problem there, and Scala targets Java 8 bytecode by default.
Michael K. Edwards
@mkedwards
it seems that the actual query logic is running now, on actual data from Kafka
wow, that's a relief
I thought I was going to be fighting this for days or weeks
Eric K Richardson
@ekrich
Fantastic, glad you made progress - sometimes just another set of eyes helps.
Michael K. Edwards
@mkedwards
thank you! and yes, sometimes just a nudge from somebody who isn't tunneled and has different troubleshooting tactics swapped in
yep, there it goes, writing to Kafka
except the topic isn't created, but that's an easy fix
Eric K Richardson
@ekrich
:thumbsup:
adampauls
@adampauls
Hi, I'm trying to understand why Spark seems to want to serialize data inside an RDD for each task that's executed against it. I asked a longer question here, but the short question is: if I have an RDD[Foo] that I've persisted with MEMORY_ONLY, Spark will serialize Foos each time I call any action on that RDD. I would expect the semantics of MEMORY_ONLY to mean that I only need to serialize each Foo in the RDD once. Furthermore, I've tested this all in local mode, and I'm a little sad that Foo needs to be serialized at all, given the performance penalty, but as far as I can tell, that is expected behavior.
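A minimal reproduction sketch of the scenario described above (Foo, the sizes and the pipeline are hypothetical, and an existing SparkSession named spark is assumed, as in local-mode testing): persist with MEMORY_ONLY, run two actions, and compare timings or the Storage tab to see whether serialization recurs.

import org.apache.spark.storage.StorageLevel

case class Foo(id: Long, payload: Array[Byte])

// build, cache and materialize the RDD once
val foos = spark.sparkContext
  .parallelize(0L until 100000L)
  .map(i => Foo(i, Array.fill(64)(i.toByte)))
  .persist(StorageLevel.MEMORY_ONLY)

foos.count()  // first action: computes and caches the partitions
foos.count()  // second action: should hit the deserialized in-memory cache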
David
@dmatton
Hello, I've got a problem inserting a dataframe into a PostgreSQL table. I would like to use a SQL statement and a batch system, but I can't find anything that works... Does anybody have any advice? Thanks
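Not a definitive answer, but one common pattern is to open a JDBC connection per partition and batch a hand-written INSERT. A hedged sketch follows; the table name, columns, URL and credentials are placeholders, and df.rdd is used to sidestep the Dataset foreachPartition overloads.

import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, Row}

def insertInBatches(df: DataFrame, url: String, user: String, password: String): Unit = {
  df.rdd.foreachPartition { rows: Iterator[Row] =>
    // one connection and one prepared statement per partition
    val conn = DriverManager.getConnection(url, user, password)
    try {
      conn.setAutoCommit(false)
      val stmt = conn.prepareStatement("INSERT INTO my_table (id, name) VALUES (?, ?)")
      rows.grouped(1000).foreach { batch =>
        batch.foreach { row =>
          stmt.setInt(1, row.getInt(0))
          stmt.setString(2, row.getString(1))
          stmt.addBatch()
        }
        stmt.executeBatch()   // send the whole batch in one round trip
        conn.commit()
      }
      stmt.close()
    } finally {
      conn.close()
    }
  }
}

For plain appends without custom SQL, df.write.format("jdbc") with the batchsize option is simpler; the manual route is mainly useful when you need a specific statement (e.g. an upsert).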
Almaz Murzabekov
@Almaz-KG

Hello community!

I'm trying out Spark 3 on our STAGE environment; to do that I'm submitting the SparkPi example to our YARN service, but I always get this error

LogType:stderr
Log Upload Time:Mon Aug 03 16:45:15 +0300 2020
LogLength:88
Log Contents:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

What should I change to fix it?

hadoop version
Hadoop 2.7.3.2.6.1.0-129
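A common cause of that error (an assumption here, not a confirmed diagnosis) is that the YARN containers cannot see the Spark 3 jars, e.g. because spark.yarn.jars / spark.yarn.archive still point at the cluster's old Spark build. The setting is usually passed as a --conf on spark-submit; a hedged Scala-side sketch of the same idea, assuming the Spark 3 jars were uploaded to a hypothetical HDFS path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("yarn")
  .appName("spark3-sanity-check")
  // hypothetical HDFS location; point this at wherever the Spark 3 jars were uploaded
  .config("spark.yarn.jars", "hdfs:///apps/spark3/jars/*.jar")
  .getOrCreate()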

Virashree H Patel
@VirashreeP_twitter
Hi guys, I just joined this channel. It's my first time here. I am new to Scala and especially new to testing Spark code with ScalaTest. I am using companion objects in my application, something like below. I want to test the method loadFinalDF, which uses the date variable from its companion object. How can I test this method using ScalaTest? The ScalaTest page suggests org.scalamock.annotation.mockObject, but that annotation class does not seem to be part of ScalaMock when I tried importing it in my test class. Any suggestions on how I can test this method? I have not done mock testing before, so I am not able to figure this out. Any help will be appreciated.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

class Foo {

  def loadFinalDF(df: DataFrame): DataFrame = {
    // adds the companion object's date as a literal "Date" column, then de-duplicates
    val finaldf: DataFrame = df
      .withColumn("Date", lit(Foo.date)).distinct()
    finaldf
  }
}

object Foo {

  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val someDF = Seq(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ).toDF("number", "word")

  val obj = new Foo
  private val date = "2019-03-01"
  val finalDF = obj.loadFinalDF(someDF)
}
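A hedged sketch of how loadFinalDF could be tested with ScalaTest (AnyFunSuite from ScalaTest 3.1+ is assumed to be on the test classpath). No mocking is strictly needed: a local SparkSession is enough to exercise the method, and the companion's date only shows up as a literal column.

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class FooSpec extends AnyFunSuite {

  test("loadFinalDF adds a Date column and removes duplicates") {
    val spark = SparkSession.builder().master("local[2]").appName("foo-test").getOrCreate()
    import spark.implicits._

    val input  = Seq((1, "a"), (1, "a"), (2, "b")).toDF("number", "word")
    val result = new Foo().loadFinalDF(input)

    assert(result.columns.contains("Date"))
    assert(result.count() == 2)   // the duplicate row is dropped by distinct()

    spark.stop()
  }
}

If you need to control the date value itself, a lighter refactor than mocking the companion object is to pass the date in as a method parameter (or a default argument) so the test can supply its own.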
Aviral Srivastava
@kebab-mai-haddi

Hey everyone!

I wanted to understand: how can I add rows to an existing partition in Spark?

Say I have 10 rows in created_year=2016/created_month=10/created_day=10. Today I get some more data with the same values in the year, month, and day columns, so it should go into the same folder.

I want to add the new rows to that partition with the intention of keeping 1 s3 object (1 file) per day. How do I do that?

Virashree H Patel
@VirashreeP_twitter
@kebab-mai-haddi I think you can use df.write with savemode append
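A hedged sketch of that suggestion (the base path and column names follow the question; newRowsDf is the hypothetical dataframe holding today's extra rows):

newRowsDf
  .coalesce(1)                                   // aim for a single new file per partition
  .write
  .mode("append")
  .partitionBy("created_year", "created_month", "created_day")
  .parquet("s3://my-bucket/events/")             // hypothetical base path

Note that append adds a second object next to the existing one inside created_day=10; keeping literally one S3 object per day would mean reading the existing partition, unioning it with the new rows, and overwriting just that partition (e.g. with spark.sql.sources.partitionOverwriteMode=dynamic).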
Oğuzhan Kaya
@oguzhan10
Hello, can I change a Spark job's RAM and core usage from Scala code while the job is running?
Eric Peters
@er1c
Is there a Scala SQL value interpolation syntax for escaping values? Since a dataframe isn't strongly typed, I was curious whether it automatically handles escaping a string value vs. an int value in a SQL statement.
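As far as I know there is no escaping interpolator built into Spark's SQL strings; a hedged sketch of the usual workaround is to pass values through the Column API as literals instead of splicing them into SQL text (df, the people view and the name column are placeholders):

import org.apache.spark.sql.functions.{col, lit}

val userInput = "O'Brien"   // would break naive string interpolation into SQL text

// safe: the value travels as a literal expression, not as SQL source code
val safe = df.filter(col("name") === lit(userInput))

// risky equivalent, shown only for contrast:
// spark.sql(s"SELECT * FROM people WHERE name = '$userInput'")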
Luke LaFountaine
@lukelafountaine
If my Spark application writes some files to a data sink and then fails for whatever reason, there are a bunch of extra files leftover in the sink. What is the best practice when it comes to cleaning up those files? Are there common solutions/patterns for this problem? I realize that the output files for a run contain the same UUID and this could be used to manually find the output files to remove. Another option would be to write to a different location and then after the job has finished successfully, move the files to the final location. As a new Spark user, any guidance or direction on this would be very helpful, as I don't want to reinvent the wheel or overlook Spark best practices. Thanks!
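A hedged sketch of the second option described above (write to a staging location, promote only on success); the paths are placeholders and the rename assumes source and destination sit on the same Hadoop-compatible filesystem.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

def writeThenPromote(spark: SparkSession, df: DataFrame, finalDir: String): Unit = {
  val stagingDir = finalDir + ".staging-" + java.util.UUID.randomUUID()

  df.write.parquet(stagingDir)   // if this fails, leftovers stay confined to the staging dir

  val hadoopConf = spark.sparkContext.hadoopConfiguration
  val fs = new Path(finalDir).getFileSystem(hadoopConf)
  fs.delete(new Path(finalDir), true)                  // drop any previous output
  fs.rename(new Path(stagingDir), new Path(finalDir))  // promote only after a clean write
}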
krishna challa
@krishna_challa_gitlab
Hi All,
I am running Spark jobs in a Kubernetes cluster with the Kubernetes resource manager. Basically, the user submits the Spark job from a UI screen, and we provide a way to terminate the job from that UI. This works fine in CDH and HDP environments when running on a YARN cluster. I want to implement the same functionality for the Kubernetes cluster: how do I do it? To kill the YARN application I used YarnClient; for Kubernetes there are also a lot of Java clients available, but I want to know whether there is any built-in functionality to kill jobs when running in Kubernetes cluster mode.
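I'm not aware of a single built-in equivalent of YarnClient.killApplication for the Kubernetes backend; the usual approach is to delete the driver pod, since the executor pods are owned by it and get cleaned up automatically. A hedged sketch assuming the fabric8 kubernetes-client (the client Spark itself uses internally); the namespace and pod name are placeholders. Recent Spark versions also accept spark-submit --kill against a k8s:// master for cluster-mode submissions.

import io.fabric8.kubernetes.client.DefaultKubernetesClient

def killSparkApp(namespace: String, driverPodName: String): Unit = {
  // picks up in-cluster credentials or ~/.kube/config automatically
  val client = new DefaultKubernetesClient()
  try {
    client.pods()
      .inNamespace(namespace)
      .withName(driverPodName)
      .delete()
  } finally {
    client.close()
  }
}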
Oğuzhan Kaya
@oguzhan10
How can I tune memory and core consumption during a Spark Structured Streaming job?
Oğuzhan Kaya
@oguzhan10
Has anyone run Flask via spark-submit?
kelvin-qin
@kelvin-qin
Has anyone encountered the problem that Spark cannot read Hive 3.1 managed tables?
I went to Jira to check the related issues, and it seems the maintainers are not keen on integrating Hive 3 into Spark. So, does anyone have a workaround they can point me to?
I find it incredible: Flink supports Hive 3, but Spark is indifferent, even though batch processing is Spark's strength.
kelvin-qin
@kelvin-qin
@oguzhan10 I don't know if I understand your question. spark-submit --help lists the relevant parameters, including memory and cores.
@oguzhan10 While a Spark job is running, you cannot adjust those parameters through code. If your job has not started yet, refer to the Spark configuration page on the official website.
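To make that concrete, a hedged sketch of where those knobs live: they are fixed when the application is submitted (or when the SparkSession is first built), not adjustable from code mid-run. Values here are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuned-streaming-job")
  .config("spark.executor.memory", "4g")
  .config("spark.executor.cores", "2")
  .config("spark.executor.instances", "10")
  // dynamic allocation is the closest thing to run-time adjustment of resources
  // .config("spark.dynamicAllocation.enabled", "true")
  .getOrCreate()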
Oğuzhan Kaya
@oguzhan10
I found an answer here, @kelvin-qin: Spark job core/memory adjustments - if the job is idle, adjust the memory usage: https://www.programmersought.com/article/58571515055/
Oğuzhan Kaya
@oguzhan10
Hello, I am trying to write streaming data to an HDFS file, but I am getting an error. val query = df3.writeStream.format("csv").outputMode("complete").option("path", "hdfs://IP:8020/test").option("checkpointLocation","hdfs://IP:8020/test").start() The error is: Data source csv does not support Complete output mode;
But when I change the output mode to append, I get this error: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
PhillHenry
@PhillHenry
@oguzhan10 Makes sense. You can't write a file sink in Complete mode, and you need watermarking (i.e. the point beyond which you're allowed to ignore late data) for Append.
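A hedged sketch of what that looks like in code (column names, window size, the aggregation and the paths are placeholders; some streaming aggregation is assumed since the original error mentions one): the watermark makes Append mode legal, and the csv file sink only accepts Append.

import org.apache.spark.sql.functions.{col, count, window}

val query = df3
  .withWatermark("eventTime", "10 minutes")                      // how late data may arrive
  .groupBy(window(col("eventTime"), "5 minutes"), col("key"))
  .agg(count("*").as("cnt"))
  .select(col("window.start").as("window_start"), col("key"), col("cnt"))  // csv cannot store the window struct
  .writeStream
  .format("csv")
  .outputMode("append")
  .option("path", "hdfs://IP:8020/test")
  .option("checkpointLocation", "hdfs://IP:8020/test_checkpoint")
  .start()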
Adam Davidson
@Adi255
Hi. I'm trying to use Spark 3 in client mode with kubernetes (via a Jupyter notebook) and it mostly works well except it's not possible to write dataframes/RDDs to the local file system of the client. The underlying exception is 'java.io.IOException: Mkdirs failed to create file'. I'm not sure what mechanism is used to get data from the executors back to the driver's file system but obviously it's not working. I can however .collect() dataframes to the client and get the data that way, so it should be possible.
PhillHenry
@PhillHenry
@Adi255 What's your underlying distributed file system? I'd write your dataframes to the DFS as usual, then use the tools that DFS offers to pull them down locally (az in Azure, hadoop fs -cp ... I think, for HDFS, etc.)
collect is asking for trouble, as you can get OOMEs
Adam Davidson
@Adi255
@PhillHenry we have HDFS however a lot of our users would rather just write out CSV to the local filesystem; I know collect should be used very carefully
I'm pretty sure you could do this kind of thing with non-k8s spark deployments (i.e. write to the client's filesystem from distributed jobs)
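A hedged sketch of PhillHenry's suggestion done from the driver side of a client-mode session, for users who just want a local CSV at the end (paths are placeholders; df and spark come from the notebook):

import org.apache.hadoop.fs.Path

val hdfsDir  = "hdfs:///tmp/report-csv"          // hypothetical HDFS output directory
val localDir = "/home/jovyan/report-csv"         // hypothetical notebook-local path

// write the CSV to HDFS as usual
df.coalesce(1).write.option("header", "true").csv(hdfsDir)

// then pull the result down to the client's local filesystem
val fs = new Path(hdfsDir).getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(false, new Path(hdfsDir), new Path(localDir), true)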
abharath9
@abharath9
Hi all! Is spark dataframe/dataset show function a side effect function?