When I submit the code in client mode, it works:

spark-submit --master yarn sparkPOC.py > sparkPOC.log

When I submit it in cluster mode, it always fails:

spark-submit --master yarn-cluster --deploy-mode cluster --driver-cores 8 --driver-memory 30g --num-executors 216 --executor-cores 10 --executor-memory 12g sparkPOC.py > sparkPOC.log

16/06/19 12:10:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/19 12:10:55 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/06/19 12:10:57 WARN Client: spark.yarn.am.extraJavaOptions will not take effect in cluster mode
Exception in thread "main" org.apache.spark.SparkException: Application application_1464009396577_187043 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Please let me know: does cluster mode work with Python?

What is the Spark equivalent of Python's Counter([1, 2, 1, 1])?
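
One possible way to get Counter-style counts in Spark, as a minimal sketch: `RDD.countByValue()` returns the counts to the driver as a dictionary, while `groupBy(...).count()` keeps them distributed as a DataFrame. The helper names `count_values`, `count_column` and `demo`, and the local session settings, are illustrative assumptions and not part of any library.

```python
def count_values(rdd):
    """Counter-like counts for an RDD, collected to the driver as a plain dict."""
    # countByValue() brings every distinct value to the driver, so it only
    # suits data with a modest number of distinct values.
    return dict(rdd.countByValue())


def count_column(df, column):
    """Distributed alternative: one row per distinct value plus a `count` column."""
    return df.groupBy(column).count()


def demo():
    # Spark is imported here so the helpers above can be loaded without it.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[1]")          # assumption: a local test session
             .appName("counter-sketch")   # hypothetical application name
             .getOrCreate())
    try:
        rdd = spark.sparkContext.parallelize([1, 2, 1, 1])
        print(count_values(rdd))  # same counts as Counter([1, 2, 1, 1])

        df = spark.createDataFrame([(1,), (2,), (1,), (1,)], ["value"])
        count_column(df, "value").show()
    finally:
        spark.stop()
```

Calling `demo()` needs a working PySpark installation; for data with many distinct values, the DataFrame version is probably the safer choice because nothing is collected to the driver.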
Hi, I have pulled the image and I am following the README.md to run PySpark, but I am not able to connect on my localhost.
Hello ... I have a query regarding PySpark streaming.
I am trying to connect PySpark streaming to an online CherryPy API ...
Is anyone active?

I'm running a PySpark script to create a Hive table with partitioning and bucketing enabled. I got the partitioning working, but I'm unable to perform bucketing on it! Can anyone suggest how to perform bucketing for Hive tables in a PySpark script?

This is what I included in the script:

out.write.partitionBy('partition_key').bucketBy(4,'ClaimType').saveAsTable('default.f_claimhdr2',format='parquet',compression='snappy',mode='overwrite', path='s3')

This is the error I'm stuck with:

'DataFrameWriter' object has no attribute 'bucketBy'

Could someone help me with this issue?
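
That error usually means the installed PySpark predates `DataFrameWriter.bucketBy`, which the Python API gained in Spark 2.3.0. Below is a hedged sketch of a version check plus the same write expressed with builder calls. `supports_bucket_by` and `write_bucketed` are hypothetical helper names, and the table name and path are parameters rather than real locations. Note that Spark's bucketing layout is its own and is not compatible with Hive's bucketing.

```python
def supports_bucket_by(spark_version):
    """True if this Spark version string is 2.3 or newer."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (2, 3)


def write_bucketed(df, table, path):
    """Write `df` partitioned by partition_key and bucketed by ClaimType.

    `table` and `path` are supplied by the caller; the column names are the
    ones from the question above.
    """
    (df.write
       .partitionBy("partition_key")
       .bucketBy(4, "ClaimType")          # needs PySpark 2.3.0 or newer
       .format("parquet")
       .option("compression", "snappy")
       .option("path", path)
       .mode("overwrite")
       .saveAsTable(table))               # bucketing only works with saveAsTable
```

A possible usage, assuming an active session named `spark`: check `supports_bucket_by(spark.version)` first, then call `write_bucketed(out, "default.f_claimhdr2", path)`.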
Sanjeev Singh
How do I run PySpark on Zeppelin?
Prakritidev Verma

Hi everyone,
I'm new to Spark and I want to use it for ETL, because it should speed up my ETL process.

Data -> tar files (1 GB each); total data is around 4 TB, hosted on S3
tar files -> contain PDF files
task -> extract text from the PDF files

My ETL pipeline would be: S3 (data) -> EMR (cluster) -> Spark job -> S3 (save back to S3)

I need help with how to read those tar files in Spark and process each PDF using Tika.
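
A rough sketch of one approach, under stated assumptions: `SparkContext.binaryFiles` hands each tar file to a task as whole-file bytes, and the standard-library `tarfile` module unpacks the PDFs in memory. The names `iter_pdfs` and `pdf_text_rdd` are made up for illustration, and `extract_text` is a caller-supplied function (for example, a wrapper around a Tika client) that is not shown here.

```python
import io
import tarfile


def iter_pdfs(tar_bytes):
    """Yield (member_name, pdf_bytes) for every .pdf file in an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.lower().endswith(".pdf"):
                extracted = archive.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()


def pdf_text_rdd(sc, tar_glob, extract_text):
    """Build an RDD of (pdf_name, text) pairs from tar archives.

    sc           -- an existing SparkContext
    tar_glob     -- path or glob of the tar files (placeholder, supplied by the caller)
    extract_text -- hypothetical callable: PDF bytes -> text (e.g. a Tika wrapper)
    """
    # binaryFiles yields (path, file_bytes); each archive is read whole by one
    # task, so executors need enough memory for a full 1 GB tar file.
    return (sc.binaryFiles(tar_glob)
              .flatMap(lambda path_and_bytes: iter_pdfs(path_and_bytes[1]))
              .mapValues(extract_text))
```

The resulting pairs could then be written back with something like `saveAsTextFile` or converted to a DataFrame first; which is better depends on the output format you need.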

John Damilola
Hi everyone... has anyone worked on a genome dataset using PySpark before? I need some assistance, please.
Alexey Protchenko
Hi everyone. Does anyone know whether Spark might make it possible to use Dataset instead of DataFrame in Python?
@Severyanin As far as I know, there are no Datasets in PySpark.
Alexey Protchenko
@veeru-007, that's why I'm asking about the future =)
Does somebody know about the connection between JupyterHub and CDH Spark 2.2?
Ernesto Espinosa
@EderPinedaClaros I need to do the same thing,
but all on Google Cloud Platform.
@jadianes I want to build a recommendation system app for my use case. I saw your tutorial: https://github.com/jadianes/spark-movie-lens
I'm not sure what database you are using. Does Spark come with a database? Also, the edX courses are not available now. Can you point me to the right learning resource?
Hello there, what is the best way to implement Z-score normalization in Spark 2.4.7?
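
One option, offered as a sketch and not as the single best way: aggregate the mean and standard deviation once, then derive the standardized column with ordinary column arithmetic. `add_zscore` and `zscore_reference` are hypothetical helper names, and the choice of population standard deviation (`stddev_pop`) over the sample version is an assumption you may want to change. The pure-Python reference is only there for checking Spark's output on a small sample.

```python
import statistics


def zscore_reference(values):
    """Plain-Python z-scores (population standard deviation) for spot checks."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


def add_zscore(df, column, out_column=None):
    """Return `df` with an extra column holding the z-score of `column`."""
    # Imported here so the reference helper works without Spark installed.
    from pyspark.sql import functions as F

    out_column = out_column or column + "_z"
    # One pass over the data to get both statistics.
    stats = df.agg(F.mean(column).alias("mu"),
                   F.stddev_pop(column).alias("sigma")).first()
    mu, sigma = stats["mu"], stats["sigma"]
    if not sigma:
        raise ValueError("standard deviation is zero or undefined")
    return df.withColumn(out_column, (F.col(column) - F.lit(mu)) / F.lit(sigma))
```

For many feature columns at once, `pyspark.ml.feature.StandardScaler` with `withMean=True` on an assembled vector column may be more convenient than repeating this per column.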