    Pol Santamaria
    @polsm91
    Another option could be downloading the packages with %AddDeps and setting up a shared volume where they are downloaded. However, in my case, the Spark driver was not detecting the packages that provide another FileSystem for Hadoop (I also tried adding the --transitive flag). For example:
    // We run on Spark 2.4.6 compiled for Scala 2.11
    %AddDeps io.delta delta-core_2.11 0.6.1
    %AddDeps org.apache.hadoop hadoop-azure 2.7.3
    %AddDeps com.azure azure-storage-blob 12.8.0
    %AddDeps com.azure azure-storage-common 12.8.0

    Marking io.delta:delta-core_2.11:0.6.1 for download
    Obtained 2 files
    Marking org.apache.hadoop:hadoop-azure:2.7.3 for download
    Obtained 2 files
    Marking com.azure:azure-storage-blob:12.8.0 for download
    Obtained 2 files
    Marking com.azure:azure-storage-common:12.8.0 for download
    Obtained 2 files

    import org.apache.hadoop.fs.azure.NativeAzureFileSystem // Just to verify it can be found
    val df = spark.read.format("delta").load("wasb://######") // This line fails

    java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.delta.DeltaTableUtils$.findDeltaTableRoot(DeltaTable.scala:160)
    at org.apache.spark.sql.delta.sources.DeltaDataSource$.parsePathIdentifier(DeltaDataSource.scala:252)
    at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:153)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    ... 50 elided
    Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    ... 64 more

    Pol Santamaria
    @polsm91
    I assume the reason is that the SparkContext was already created when the Azure packages were added, and the Hadoop FileSystem submodule needs to be reloaded in some way :/
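    (A possible workaround, sketched here and untested: the %AddDeps jars end up on the interpreter's class loader, while Hadoop's Configuration resolves fs.*.impl classes through its own loader. Pointing the driver's Hadoop configuration at the interpreter's loader and clearing the FileSystem cache might let the class be found; note this only affects the driver, not the executors.)

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.setClassLoader(getClass.getClassLoader) // loader that holds the %AddDeps jars
    hadoopConf.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    org.apache.hadoop.fs.FileSystem.closeAll() // FileSystem instances are cached per scheme
    val df = spark.read.format("delta").load("wasb://######")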
    Kevin Bates
    @kevin-bates

    Since I’m not a Spark expert, I can’t tell if you’re running into issues with k8s and spark in general or something with the way EG works. Here are some other pieces of information you might find helpful.

    The image for the ‘vanilla’ Scala and Spark Scala kernelspecs is the same image. As a result, you could try launching the vanilla Scala image and configuring the Spark context yourself to see if that helps, or you can modify the Spark Scala kernelspec and set --RemoteProcessProxy.spark-context-initialization-mode=None so that it doesn’t create a Spark context. The difference between these is that with the former we control the kernel-pod configuration, whereas with the latter Spark controls the pod and you need to specify pod-related settings in SPARK_OPTS (_TOREE_SPARK_OPTS).
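    (A minimal sketch of the first option above, i.e. starting without a Spark context and configuring one yourself from the notebook; the master URL, namespace, and image name are placeholders, not values from this conversation:)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("toree-manual-context")
      .master("k8s://https://<api-server-host>:443")
      .config("spark.kubernetes.namespace", "<namespace>")
      .config("spark.kubernetes.container.image", "<executor-image>")
      .config("spark.executor.instances", "2")
      .getOrCreate()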

    Also note that any envs prefixed with KERNEL_ flow from the client to EG and are available to the kernel. As mentioned, the vanilla kernels are associated with a kernel-pod.yaml template file (in their scripts directories) that can be altered for things like mounts, init containers, etc. and the user-specific “names” could be specified via a KERNEL_ env value. (Note that KERNEL_USERNAME is already used by many applications for this.)
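    (A small illustration of the KERNEL_ env flow described above: values set on the client are forwarded by EG and can be read inside the kernel like any environment variable; KERNEL_USERNAME is the one mentioned above.)

    val user = sys.env.getOrElse("KERNEL_USERNAME", "unknown")
    println(s"Notebook running for user: $user")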

    In EG 3.0, which will support Spark 3.0, Spark on Kubernetes enables “pod templates”, which will allow the convergence of the two launch approaches (Spark 2.x on k8s controls the pod creation, so we can’t use our templated approach there). PR jupyter/enterprise_gateway#559 has been waiting on this, but we’re not there yet.

    Luciano Resende
    @lresende
    Another simple test would be to check Spark's behavior when doing this outside of the kernel environment, e.g. when you submit an app passing the opts, and see how Spark behaves.
    Pol Santamaria
    @polsm91
    Many thanks @kevin-bates for the extensive reply. I will experiment with your suggestions and also explore how this will work in EG 3.0. We run Spark 2.X but plan on upgrading to 3.X before 2021.
    Todd Tao
    @muxuezi
    Spark 3.0 support?
    Luciano Resende
    @lresende
    It is available in the release candidate.
    Leiglad
    @Leiglad
    Hello. I am trying to use Toree 0.4.0 with JupyterHub 1.1.0. Everything looks great, but I can't see any exception traces from the code that I'm running in the notebook. What can I try, and how can I debug this issue?
    Leiglad
    @Leiglad
    Oh, I see there is already a similar issue on the bug tracker:
    https://issues.apache.org/jira/projects/TOREE/issues/TOREE-522
    Any workarounds?
    Leiglad
    @Leiglad
    Oh, and my JupyterHub version is 6.1.4.
    Devendra Singh
    @devendrasr
    Hi there,
    I am unable to view the complete stack trace in my notebook when errors are thrown by the Scala kernel; it shows only 3-4 lines of the whole stack trace. Could you please help me configure Toree to show the complete stack trace in such cases?

    Hello. I am trying to use Toree 0.4.0 with JupyterHub 1.1.0. Everything looks great, but I can't see any exception traces from the code that I'm running in the notebook. What can I try, and how can I debug this issue?

    I am facing a similar issue: I can see only a few lines out of the whole stack trace. I am using Toree 0.6.0.

    Luciano Resende
    @lresende
    I believe I fixed that on master.
    I should be able to try another RC before the end of the week; I was still trying to fix another issue related to magics before I do that.
    Stanislav G.
    @StanislavKabish
    Hi,
    How do I add autocomplete for Spark? For me, only highlighting works.
    Arthur Stemmer
    @apstemmer
    Hi! Which version of Toree would be able to support Spark 3.x? Would 0.5.0 suffice or do I need to build from master?
    Luciano Resende
    @lresende
    0.5.0 was not approved yet, it’s just an RC, but that would work, or master…
    I have been using the RC for a little while, but I need to find time to wrap up the release.
    Arthur Stemmer
    @apstemmer
    Great, that works. Although it does seem like the kernel is not providing error information in the notebook on, say, a syntax error (it seems to fail silently). Is that expected?
    Luciano Resende
    @lresende
    Are you using classic notebook? Or lab?
    Arthur Stemmer
    @apstemmer
    I'm using a classic notebook via Jupyter Enterprise Gateway
    Arthur Stemmer
    @apstemmer
    Is there a publicly available list of past security vulnerabilities (CVEs)?
    Rahul Goyal
    @rahul26goyal
    Hi
    Does Toree support magic commands similar to sparkmagic's %config, which lets notebook users specify Spark configs dynamically?
    Kevin Bates
    @kevin-bates
    Since the Toree kernel’s startup creates the Spark context (using parameters conveyed via the kernelspec, and available via sc), I think it would be too late to apply magics for this. It looks like Toree’s support for magics is more at the line and cell level, with the notion that the context is already established. Copying @lresende for confirmation/correction.
    Rahul Goyal
    @rahul26goyal
    I agree with you @kevin-bates that the Toree kernel would already have started the Spark context, so we cannot apply config-level magics. Based on my limited understanding of the "sparkmagic" kernel, this problem is solved by delaying the creation of the Spark context until it is needed; that gives scope to execute magic cells with "%config", and the kernel accumulates them. Is it not possible to do the same with Toree?
    Kevin Bates
    @kevin-bates
    I don’t think so, but I need @lresende to confirm.
    Luciano Resende
    @lresende
    If you set the Toree option to not start the context, you can create your own with SparkSession.
    Kevin Bates
    @kevin-bates
    :+1:
    Rahul Goyal
    @rahul26goyal
    How do I do that, @lresende? Is there any link for reference?
    Luciano Resende
    @lresende
    Note that I just saw a comment there which implies that there might be a bug in the code … I would appreciate help investigating and providing a patch.
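    (A minimal sketch of creating your own session, assuming the kernel has been configured not to start a Spark context as described above; the config values are illustrative only:)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("notebook-session")
      .config("spark.executor.memory", "4g")
      .config("spark.sql.shuffle.partitions", "64")
      .getOrCreate()

    val sc = spark.sparkContext // if later cells expect the usual sc handle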
    David M.
    @david1155
    Hi
    I have installed Apache Toree on Jupyter:
    pip install toree
    jupyter toree install --spark_home=/usr/local/spark-3.2.0-bin-hadoop3.2/
    but the kernel crashes after start:
    Exception in thread "main" java.lang.NoClassDefFoundError: scala/App$class
    at org.apache.toree.Main$.<init>(Main.scala:24)
    at org.apache.toree.Main$.<clinit>(Main.scala)
    at org.apache.toree.Main.main(Main.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.ClassNotFoundException: scala.App$class
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ... 15 more

    Could you please advise how to fix it?
    Zakk
    @a1mzone

    Hi @s3uzz, my guess would be that you are perhaps using a version like 0.4, which is still on Scala 2.11; if you have the hadoop 3.2.x build of Spark, you have Scala 2.12.

    Compile a newer version for Scala 2.12 support.
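    (A quick way to confirm the mismatch, run from a spark-shell started from the same SPARK_HOME; the Scala version of the Spark distribution must match the Scala version Toree was built against:)

    println(scala.util.Properties.versionString) // e.g. "version 2.12.x" for Spark 3.2.0
    println(org.apache.spark.SPARK_VERSION)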

    David M.
    @david1155
    Thank you. My fault, the version from pypi.org is 0.4.0 from Aug 2020. I just followed the documentation...
    David M.
    @david1155

    Hi! Another problem using 0.5.0-rc4 with Spark 3.2.0. Please advise how to resolve it.

    Exception in thread "main" scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.
    at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:24)
    at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:25)
    at scala.reflect.internal.Mirrors$RootsBase.$anonfun$getModuleOrClass$5(Mirrors.scala:61)
    at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:61)
    at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:198)
    at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:198)
    at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:199)
    at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:199)
    at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1251)
    at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1250)
    at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1408)
    at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1407)
    at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1450)
    at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1450)
    at scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1506)
    at scala.tools.nsc.Global$Run.<init>(Global.scala:1213)
    at scala.tools.nsc.interpreter.IMain.compileSourcesKeepingRun(IMain.scala:432)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.compileAndSaveRun(IMain.scala:814)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.compile(IMain.scala:772)
    at scala.tools.nsc.interpreter.IMain.bind(IMain.scala:637)
    at org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific.$anonfun$start$1(ScalaInterpreterSpecific.scala:291)
    at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:206)
    at org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific.start(ScalaInterpreterSpecific.scala:282)
    at org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific.start$(ScalaInterpreterSpecific.scala:266)
    at org.apache.toree.kernel.interpreter.scala.ScalaInterpreter.start(ScalaInterpreter.scala:43)
    at org.apache.toree.kernel.interpreter.scala.ScalaInterpreter.init(ScalaInterpreter.scala:94)
    at org.apache.toree.boot.layer.InterpreterManager.$anonfun$initializeInterpreters$1(InterpreterManager.scala:35)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:214)
    at org.apache.toree.boot.layer.InterpreterManager.initializeInterpreters(InterpreterManager.scala:34)
    at org.apache.toree.boot.layer.StandardComponentInitialization.initializeComponents(ComponentInitialization.scala:87)
    at org.apache.toree.boot.layer.StandardComponentInitialization.initializeComponents$(ComponentInitialization.scala:69)
    at org.apache.toree.Main$$anon$1.initializeComponents(Main.scala:35)
    at org.apache.toree.boot.KernelBootstrap.initialize(KernelBootstrap.scala:102)
    at org.apache.toree.Main$.delayedEndpoint$org$apache$toree$Main$1(Main.scala:35)
    at org.apache.toree.Main$delayedInit$body.apply(Main.scala:24)
    at scala.Function0.apply$mcV$sp(Function0.scala:39)
    at scala.Function0.apply$mcV$sp$(Function0.scala:39)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
    at scala.App.$anonfun$main$1$adapted(App.scala:80)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at scala.App.main(App.scala:80)
    at scala.App.main$(App.scala:78)
    at org.apache.toree.Main$.main(Mai

    Should I downgrade the Java version? I am using openjdk 11.0.13 (2021-10-19).
    WARN Main$$anon$1: No external magics provided to PluginManager!
    [init] error: error while loading Object, Missing dependency 'class scala.native in compiler mirror', required by /modules/java.base/java/lang/Object.class
    David M.
    @david1155
    OK, I have to downgrade Java 11 -> 8.
    David M.
    @david1155
    Hi!
    I installed Toree with 'jupyter toree install --spark_home=${SPARK_HOME} --interpreters=Scala,PySpark,SQL --python_exec=/opt/conda/bin/python3.8'.
    However, in JupyterLab I see only two kernels: Scala and SQL.
    Could you please advise how to debug this? Thanks in advance.
    Toree version 0.4.0
    David M.
    @david1155
    It seems that I have to downgrade Python 3.8->3.7
    David M.
    @david1155
    [ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
    Zakk
    @a1mzone

    Hi @s3uzz

    Yes, you could try an older Python; I am still using 3.6, as mentioned previously.

    Below is a small example bash script I use to install:

    #!/bin/sh
    # Placeholder values; adjust the paths for your environment.
    export VERSION=0.1.0
    export SPARK_HOME=/path/to/sparkHome
    # Comma-separated list of extra jars handed to Spark via --jars.
    jars="file:///path/to/jar"
    jars="$jars,file:///path/to/jar"
    
    # --replace overwrites an existing kernelspec of the same name, --user installs
    # into the user's kernel directory, and spark_opts is passed through to spark-submit.
    jupyter-toree install \
        --replace \
        --debug \
        --user \
        --kernel_name "project $VERSION" \
        --spark_home=${SPARK_HOME} \
        --spark_opts="--master yarn --jars $jars"
    David M.
    @david1155
    Zakk, thank you. Maybe the PySpark interpreter is not available in version 0.4.0? I use
    jupyter toree install --spark_home=${SPARK_HOME} --interpreters=Scala,PySpark,SQL --python_exec=/opt/conda/bin/python3.8
    and get the error:
    [ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
    amitzo
    @amitzo
    Hi, I am running Toree 0.5.0 rc4 with Spark 3.1.2. Everything is working fine except there is a problem with errors not displaying in the cell output, while they do show up in the text if I do "Download as notebook .ipynb". For example, if I type "blah" in a cell and run it, I see a blank response in the notebook, but this value in the downloaded file:
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "68eb215f",
      "metadata": {},
      "outputs": [
        {
          "ename": "Compile Error",
          "evalue": "<console>:26: error: not found: value blah\n blah\n ^\n",
          "output_type": "error",
          "traceback": []
        }
      ],
      "source": [
        "blah"
      ]
    }
    Is this a bug or am I missing some configuration? How do I get these types of errors to display in the cell output?
    Rahul Goyal
    @rahul26goyal
    Hi team
    Regarding the issue that @amitzo has brought up: we are seeing a similar issue, and I think this is preventing end users from getting to know what exactly happened in the backend.
    Is there a ticket open on this already? What is the plan in general to improve and address user-experience gaps reported by the community?
    I will be more than happy to help out if someone can guide me on this.
    Thanks