A kernel that enables applications to interact with Apache Spark (http://toree.apache.org/)
Hi, we are testing Apache Toree with Jupyter Enterprise Gateway on k8s. It seems that when the kernel is launched, the dependencies specified by `--packages` are downloaded in the enterprise-gateway pod, and the driver then has no access to them.
Another solution we tried was the `%AddDeps` line magic: the dependencies are downloaded into `/tmp` of the driver pod, but the kernel fails to use them. I think this is related to the SparkContext already being created, since we are trying to load `hadoop-azure` and similar packages to access Azure Blob Storage.
Does anybody have suggestions on how I can solve this dependency issue? Many thanks! :)
We also tried setting `TOREE_SPARK_OPTS` in the kernel.json with volume information (https://spark.apache.org/docs/2.4.6/running-on-kubernetes.html#dependency-management). `%AddJar` works for now, but I don't think it is viable to store all the dependencies there.
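For illustration, here is a hypothetical kernel.json fragment along those lines (the volume name, claim name, and jar path are made up; the `spark.kubernetes.*.volumes` options come from the linked Spark page):

```json
{
  "env": {
    "__TOREE_SPARK_OPTS__": "--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.deps.mount.path=/deps --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.deps.options.claimName=deps-pvc --jars local:///deps/hadoop-azure-2.7.3.jar"
  }
}
```

With this, jars are read from the mounted volume via the `local://` scheme instead of being resolved at launch time.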
When adding `--packages ${MVN_DEPS}` to the `__TOREE_SPARK_OPTS__` in kernel.json, Toree downloads the dependencies to the enterprise_gateway pod, where they are then forgotten. Since the gateway serves multiple users, we can't put a shared volume on the `$HOME/.ivy2` folder of the gateway.
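One idea that might be worth testing (an untested sketch on my part): redirect the Ivy cache with Spark's `spark.jars.ivy` setting so that `--packages` resolution doesn't land in the gateway's `$HOME/.ivy2`, e.g.:

```
--packages ${MVN_DEPS} --conf spark.jars.ivy=/shared/ivy2
```

Whether this helps depends on where the package resolution actually runs in your deployment.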
Another approach was using `%AddDeps` and setting a shared volume where the dependencies are downloaded. However, in my case the Spark driver was not detecting the packages that provide another FileSystem implementation for Hadoop (I also tried adding the `--transitive` flag). For example:
// We run on Spark 2.4.6 compiled for Scala 2.11
%AddDeps io.delta delta-core_2.11 0.6.1
%AddDeps org.apache.hadoop hadoop-azure 2.7.3
%AddDeps com.azure azure-storage-blob 12.8.0
%AddDeps com.azure azure-storage-common 12.8.0
Marking io.delta:delta-core_2.11:0.6.1 for download
Obtained 2 files
Marking org.apache.hadoop:hadoop-azure:2.7.3 for download
Obtained 2 files
Marking com.azure:azure-storage-blob:12.8.0 for download
Obtained 2 files
Marking com.azure:azure-storage-common:12.8.0 for download
Obtained 2 files
import org.apache.hadoop.fs.azure.NativeAzureFileSystem // Just to verify it can be found
val df=spark.read.format("delta").load("wasb://######") // This line fails
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.delta.DeltaTableUtils$.findDeltaTableRoot(DeltaTable.scala:160)
at org.apache.spark.sql.delta.sources.DeltaDataSource$.parsePathIdentifier(DeltaDataSource.scala:252)
at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:153)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 50 elided
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 64 more
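For what it's worth, the trace looks like a classloader split: the import above shows the REPL sees the class, while Hadoop's `Configuration` resolves filesystem implementations through a different loader. A quick (hypothetical) check from the notebook:

```scala
// Resolves via the REPL classloader: succeeds, matching the import above
Class.forName("org.apache.hadoop.fs.azure.NativeAzureFileSystem")

// Resolves via the classloader Hadoop's Configuration uses: if this
// prints null, it matches the ClassNotFoundException in the trace
val cls = spark.sparkContext.hadoopConfiguration
  .getClassByNameOrNull("org.apache.hadoop.fs.azure.NativeAzureFileSystem")
println(cls)
```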
Since I’m not a Spark expert, I can’t tell if you’re running into issues with k8s and Spark in general or something with the way EG works. Here are some other pieces of information you might find helpful.
The ‘vanilla’ Scala and Spark Scala kernelspecs use the same image. As a result, you could try launching the vanilla Scala image and configure the Spark context yourself to see if that helps, or you can modify the Spark Scala kernelspec and set --RemoteProcessProxy.spark-context-initialization-mode=None so that it doesn’t create a Spark context. The difference between these is that with the former, we control the kernel-pod configuration, whereas with the latter, Spark controls the pod and you need to specify pod-related settings in SPARK_OPTS (`__TOREE_SPARK_OPTS__`).
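As an illustrative sketch (the launcher path and surrounding fields are placeholders, not the exact shipped kernelspec), the change amounts to passing the option in the kernel.json argv:

```json
{
  "argv": [
    "python", "/usr/local/share/jupyter/kernels/spark_scala_kubernetes/scripts/launch_kubernetes.py",
    "--RemoteProcessProxy.kernel-id", "{kernel_id}",
    "--RemoteProcessProxy.spark-context-initialization-mode", "none"
  ]
}
```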
Also note that any envs prefixed with `KERNEL_` flow from the client to EG and are available to the kernel. As mentioned, the vanilla kernels are associated with a kernel-pod.yaml template file (in their scripts directories) that can be altered for things like mounts, init containers, etc., and the user-specific “names” could be specified via a `KERNEL_` env value. (Note that `KERNEL_USERNAME` is already used by many applications for this.)
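For example, a hypothetical kernel-pod.yaml fragment (the real template in the kernelspec's scripts directory has more fields) that mounts a per-user volume derived from the username supplied by the client:

```yaml
spec:
  containers:
    - name: kernel
      volumeMounts:
        - name: user-deps
          mountPath: /deps
  volumes:
    - name: user-deps
      persistentVolumeClaim:
        claimName: deps-{{ kernel_username }}
```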
In EG 3.0, which will support Spark 3.0, we can take advantage of the “pod templates” feature that Spark on Kubernetes introduces there, enabling the convergence of the two launch approaches (Spark 2.x on k8s controls the pod creation, so we can’t use our templated approach in that case). PR jupyter/enterprise_gateway#559 has been waiting on this, but we’re not there yet.
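The pod-template mechanism referenced above boils down to configuration like the following (paths are illustrative), which lets Spark-managed pods pick up the same kinds of customizations EG applies through its own template:

```
--conf spark.kubernetes.driver.podTemplateFile=/path/to/driver-pod-template.yaml
--conf spark.kubernetes.executor.podTemplateFile=/path/to/executor-pod-template.yaml
```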
Hello. I am trying to use Toree 0.4.0 with JupyterHub 1.1.0. Everything looks great, but I can't see any exception traces from the code I'm running in the notebook. What can I try, and how can I debug this issue?
I am facing a similar issue: I can only see a few lines of the whole stack trace. I am using Toree 0.6.0.
Since Toree creates the SparkContext (`sc`) at startup, I think it would be too late to apply magics for this. It looks like Toree’s support for magics is more at the line and cell level, with the notion that the context is already established. Copying @lresende for confirmation/correction.
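If the pre-created context is indeed the blocker, one workaround sketch (assuming the vanilla kernel or the spark-context-initialization-mode=None setting mentioned above) is to build the session yourself so the packages are resolved before anything touches Hadoop:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: create the session manually so spark.jars.packages is applied
// before the context exists; the coordinates mirror the ones tried above
val spark = SparkSession.builder()
  .appName("notebook")
  .config("spark.jars.packages",
          "org.apache.hadoop:hadoop-azure:2.7.3,io.delta:delta-core_2.11:0.6.1")
  .getOrCreate()
```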
Hi! I have another problem, using Toree 0.5.0-rc4 with Spark 3.2.0. Please advise how to resolve it.
Exception in thread "main" scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:24)
at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:25)
at scala.reflect.internal.Mirrors$RootsBase.$anonfun$getModuleOrClass$5(Mirrors.scala:61)
at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:61)
at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:198)
at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:198)
at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:199)
at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:199)
at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1251)
at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1250)
at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1408)
at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1407)
at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1450)
at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1450)
at scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1506)
at scala.tools.nsc.Global$Run.<init>(Global.scala:1213)
at scala.tools.nsc.interpreter.IMain.compileSourcesKeepingRun(IMain.scala:432)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.compileAndSaveRun(IMain.scala:814)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.compile(IMain.scala:772)
at scala.tools.nsc.interpreter.IMain.bind(IMain.scala:637)
at org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific.$anonfun$start$1(ScalaInterpreterSpecific.scala:291)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:206)
at org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific.start(ScalaInterpreterSpecific.scala:282)
at org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific.start$(ScalaInterpreterSpecific.scala:266)
at org.apache.toree.kernel.interpreter.scala.ScalaInterpreter.start(ScalaInterpreter.scala:43)
at org.apache.toree.kernel.interpreter.scala.ScalaInterpreter.init(ScalaInterpreter.scala:94)
at org.apache.toree.boot.layer.InterpreterManager.$anonfun$initializeInterpreters$1(InterpreterManager.scala:35)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:214)
at org.apache.toree.boot.layer.InterpreterManager.initializeInterpreters(InterpreterManager.scala:34)
at org.apache.toree.boot.layer.StandardComponentInitialization.initializeComponents(ComponentInitialization.scala:87)
at org.apache.toree.boot.layer.StandardComponentInitialization.initializeComponents$(ComponentInitialization.scala:69)
at org.apache.toree.Main$$anon$1.initializeComponents(Main.scala:35)
at org.apache.toree.boot.KernelBootstrap.initialize(KernelBootstrap.scala:102)
at org.apache.toree.Main$.delayedEndpoint$org$apache$toree$Main$1(Main.scala:35)
at org.apache.toree.Main$delayedInit$body.apply(Main.scala:24)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at org.apache.toree.Main$.main(Mai
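For context, this MissingRequirementError ("object scala.runtime in compiler mirror not found") at interpreter startup usually indicates a Scala version mismatch on the kernel's classpath. Toree 0.5.x targets Scala 2.12 and Spark 3.2.0 ships with Scala 2.12.15, so a stray scala-library or scala-compiler from another version is a likely culprit. Illustrative checks, assuming a standard Spark layout:

```
spark-submit --version              # should report "Using Scala version 2.12.x"
ls "$SPARK_HOME/jars" | grep scala-
# expect matching scala-library-2.12.x.jar / scala-compiler-2.12.x.jar,
# with no second Scala version alongside them
```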