Hey @shadaj:matrix.org, firstly, I really appreciate the work that you have put into building ScalaPy! I have used it extensively in day-to-day tasks, and it has proven to be a great asset in reducing manual conversion efforts from Pandas code to Scala.
I have been trying to run a custom PySpark script from a Scala-based notebook on Databricks, but I am facing an issue when I try to pass a Spark session to a function in a custom Python package. ScalaPy throws a type mismatch error. I have attached the error below for your reference.
Importing required libraries using ScalaPy:
val pd = py.module("pandas")
val s3fs = py.module("s3fs")
val py_spark_sql = py.module("pyspark.sql")
val pyspark_package = py.module("pyspark_tier2_test.pyspark_tier2_test")
This is the error I get when I pass the spark session to the function in my custom package:
val result_df = pyspark_package.py_driver_func(py_spark_sql.SparkSession)
command-3273291744808514:1: error: type mismatch;
found : org.apache.spark.SparkSession
required: me.shadaj.scalapy.py.Any
Pandas works perfectly with ScalaPy, but I have a requirement to make PySpark scripts run with ScalaPy in order to make things more scalable and distributed!
Can you please suggest a fix or point me in the right direction? Any help will be much appreciated!
py_spark_sql.SparkSession should automatically be py.Any since it's just a member of another Python module. I wonder if the Databricks notebook environment is doing something funky. Could you try printing out the type of py_spark_sql.SparkSession (py_spark_sql.SparkSession.getClass)?
py_spark_sql.SparkSession in a variable before using it?
"Exception in thread "main" me.shadaj.scalapy.py.PythonException: <class 'ModuleNotFoundError'> No module named 'xgboost'"
xgboost right next to numpy and the modules that do work?
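A quick way to check this from the Scala side is to ask the embedded interpreter which executable and module search path it is actually using. This is only a minimal sketch: PythonEnvCheck is a hypothetical name, and it assumes ScalaPy can already load libpython in your setup.

```scala
import me.shadaj.scalapy.py

// Hypothetical helper object, not part of ScalaPy.
object PythonEnvCheck {
  private val sys = py.module("sys")

  // Interpreter path as reported by Python (may be empty when embedded).
  def executable: String = sys.executable.toString

  // Directories Python searches when importing modules.
  def searchPath: String = sys.path.toString

  def main(args: Array[String]): Unit = {
    println(s"sys.executable = $executable")
    println(s"sys.path       = $searchPath")
  }
}
```

If the site-packages directory that contains xgboost is missing from the printed search path, the JVM is probably bound to a different Python install than the one pip installed into.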
3.9.9 and anaconda 3-2011.11. Has anyone had experience with this approach and would be able to share any pointers? Many thanks in advance!
To make this slightly easier, I've removed pyenv from the equation and pushed this skeleton example to GitHub.
At this point, upon executing runMain hello in the sbt shell, the error begins with:
java.lang.UnsatisfiedLinkError: Unable to load library 'python3':
dlopen(libpython3.dylib, 0x0009): tried: '/Applications/IntelliJ IDEA CE.app/Contents/jbr/Contents/Home/bin/../lib/jli/libpython3.dylib' (no such file) ...
And it's correct, that file doesn't exist, it's actually /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9.dylib, but how do I get this to be picked up?
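One way this might be picked up, sketched as a build.sbt fragment: ScalaPy loads libpython through JNA, so the directory can be supplied via the jna.library.path system property and the library name via scalapy.python.library. The directory below is an assumption derived from the path quoted above, and python3.9 as the library name is likewise an assumption; adjust both to the actual install.

```scala
// build.sbt (fragment)

// Assumption: the Python shared library lives in this directory.
lazy val pythonLibDir = "/Library/Frameworks/Python.framework/Versions/3.9/lib"

// JVM options are only applied when sbt forks a new JVM for `run`.
fork := true

javaOptions ++= Seq(
  s"-Djna.library.path=$pythonLibDir",   // where JNA should look for libpython
  "-Dscalapy.python.library=python3.9"   // assumed library name, without the "lib" prefix
)
```

After reloading sbt, runMain hello should start a forked JVM with these properties set.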
Next, I've switched to my local install of anaconda by changing the path to python in build.sbt:
lazy val python = Python("/opt/anaconda3/bin/python3.9")
and the existing example works fine.
However, when I then try to experiment with numpy, at runtime a particular library can't be loaded:
[info] INTEL MKL ERROR: dlopen(/opt/anaconda3/lib/libmkl_intel_thread.1.dylib, 0x0009): Library not loaded: @rpath/libiomp5.dylib
I notice that /opt/anaconda3/lib/libiomp5.dylib does exist, although /opt/anaconda3/lib/libmkl_intel_thread.1.dylib does not.
Has anyone experienced a similar problem?
[info] INTEL MKL ERROR: dlopen(/opt/anaconda3/lib/libmkl_intel_thread.1.dylib, 0x0009): Library not loaded: @rpath/libiomp5.dylib
[info] Referenced from: /opt/anaconda3/lib/libmkl_intel_thread.1.dylib
[info] Reason: tried: '/Applications/IntelliJ IDEA CE.app/Contents/jbr/Contents/Home/bin/../lib/jli/libiomp5.dylib' (no such file), '/usr/lib/libiomp5.dylib' (no such file).
[info] Intel MKL FATAL ERROR: Cannot load libmkl_intel_thread.1.dylib.
On the executable /opt/anaconda3/bin/python3.9 it would appear (from using otool) that LC_RPATH is correct:
Load command 14
cmd LC_RPATH
cmdsize 272
path /opt/anaconda3/lib (offset 12)
and in /opt/anaconda3/lib/libmkl_intel_thread.1.dylib itself, I see:
Load command 10
cmd LC_LOAD_DYLIB
cmdsize 48
name @rpath/libiomp5.dylib (offset 24)
I'm in a world of macOS/rpath pain now and well out of my depth, but none of the above looks incorrect to me.
Would anyone care to venture why it doesn't pick up @rpath/libiomp5.dylib from /opt/anaconda3/lib?
javaOptions specifically for the Docker image (I believe javaOptions in Universal should work) to -Djna.library.path=$pythonLibsDir where $pythonLibsDir is replaced with the Python installation path that python3-config prints in the container
import ai.kien.python.Python
Python().scalapyProperties.fold(
ex => println(s"Error while getting ScalaPy properties: $ex"),
props => props.foreach { case(k, v) => System.setProperty(k, v) }
)
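If the properties need to be handed to a forked or packaged JVM rather than set at runtime as above, the same key/value pairs can be rendered as -D flags. A minimal sketch, where JvmFlags and toJavaOptions are hypothetical names and not part of ScalaPy or python-native-libs:

```scala
object JvmFlags {
  // Hypothetical helper: turn property pairs into -Dkey=value options,
  // sorted by key so the output order is deterministic.
  def toJavaOptions(props: Map[String, String]): Seq[String] =
    props.toSeq.sortBy(_._1).map { case (k, v) => s"-D$k=$v" }
}

// Example:
// JvmFlags.toJavaOptions(Map("jna.library.path" -> "/opt/anaconda3/lib"))
// yields Seq("-Djna.library.path=/opt/anaconda3/lib")
```

The resulting strings are in the form that javaOptions expects.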
pip tool, ScalaPy just forwards to the underlying Python implementation when loading modules
I've been experimenting with ScalaPy and it is amazing! I did try to import the Python library wandb and got the following exception:
[error] Exception in thread "main" me.shadaj.scalapy.py.PythonException: <class 'IndexError'> list index out of range
Any ideas on why this might be happening? I'm using
wandb was installed via pip3 install wandb and I invoked it in my application via
import me.shadaj.scalapy.py
val wandb = py.Module("wandb")
Thanks for any help!
[error] at me.shadaj.scalapy.interpreter.CPythonInterpreter$.$anonfun$throwErrorIfOccured$2(CPythonInterpreter.scala:328)
[error] at me.shadaj.scalapy.interpreter.Platform$.Zone(Platform.scala:10)
[error] at me.shadaj.scalapy.interpreter.CPythonInterpreter$.$anonfun$throwErrorIfOccured$1(CPythonInterpreter.scala:314)
[error] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[error] at me.shadaj.scalapy.interpreter.CPythonInterpreter$.withGil(CPythonInterpreter.scala:160)
[error] at me.shadaj.scalapy.interpreter.CPythonInterpreter$.throwErrorIfOccured(CPythonInterpreter.scala:313)
[error] at me.shadaj.scalapy.interpreter.CPythonInterpreter$.$anonfun$importModule$2(CPythonInterpreter.scala:233)
[error] at me.shadaj.scalapy.interpreter.CPythonInterpreter$.withGil(CPythonInterpreter.scala:160)
[error] at me.shadaj.scalapy.interpreter.CPythonInterpreter$.$anonfun$importModule$1(CPythonInterpreter.scala:230)
[error] at me.shadaj.scalapy.interpreter.Platform$.Zone(Platform.scala:10)
[error] at me.shadaj.scalapy.interpreter.CPythonInterpreter$.importModule(CPythonInterpreter.scala:229)
[error] at me.shadaj.scalapy.py.ModuleApply.apply(ModuleApply.scala:9)
[error] at me.shadaj.scalapy.py.ModuleApply.apply$(ModuleApply.scala:8)
[error] at me.shadaj.scalapy.py.Module$.apply(Module.scala:7)
[error] at me.shadaj.scalapy.py.package$.module(package.scala:14)
[error] at example.WandB$.delayedEndpoint$example$WandB$1(WandB.scala:8)
[error] at example.WandB$delayedInit$body.apply(WandB.scala:6)
[error] at scala.Function0.apply$mcV$sp(Function0.scala:39)
[error] at scala.Function0.apply$mcV$sp$(Function0.scala:39)
[error] at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
[error] at scala.App.$anonfun$main$1$adapted(App.scala:80)
[error] at scala.collection.immutable.List.foreach(List.scala:392)
[error] at scala.App.main(App.scala:80)
[error] at scala.App.main$(App.scala:78)
[error] at example.WandB$.main(WandB.scala:6)
[error] at example.WandB.main(WandB.scala)
[error] Nonzero exit code returned from runner: 1
wandb when it's being imported. The Python stack traces aren't great right now when there are crashes in Python unfortunately, but let me try to see if I can reproduce locally.