    Ignacio
    @ghoto
    So I have this problem in a notebook that I'm writing, where I define an Integer in a Scala cell and then need to access it from a Python cell. Sometimes the object is not accessible from the Python cell (it doesn't even appear in the autocompletion), and sometimes it is. The error I get is pretty short and I have no idea what is going on.
    Uncaught exception: NameError: name 'dim' is not defined (java.lang.RuntimeException)
    <string>.__polynote_run__(<string>:54)
    <ast>.<module>(<ast>:1)
    Ignacio
    @ghoto
    I have the exact same code in cells 12 and 22; in 12 it passes, in 22 it fails.
    GOOD
    
         {
           "cell_type" : "code",
           "execution_count" : 12,
           "metadata" : {
             "cell.metadata.exec_info" : {
               "startTs" : 1573259447898,
               "endTs" : 1573259447961
             },
             "language" : "python"
           },
           "language" : "python",
           "source" : [
             "dim"
           ],
           "outputs" : [
             {
               "execution_count" : 12,
               "data" : {
                 "text/plain" : [
                   "9318"
                 ]
               },
               "metadata" : {
                 "name" : "Out",
                 "type" : "Long"
               },
               "output_type" : "execute_result"
             }
           ]
         },
    
    BAD
    
         {
           "cell_type" : "code",
           "execution_count" : 22,
           "metadata" : {
             "cell.metadata.exec_info" : {
               "startTs" : 1573259569448,
               "endTs" : 1573259569473
             },
             "language" : "python"
           },
           "language" : "python",
           "source" : [
             "dim"
           ],
           "outputs" : [
             {
               "ename" : "java.lang.RuntimeException",
               "evalue" : "NameError: name 'dim' is not defined",
               "traceback" : [
               ],
               "output_type" : "error"
             }
           ]
         },
    (copy/pasted from the notebook file)
    jonathanindig
    @jonathanindig
    Where is dim defined?
    Ignacio
    @ghoto
    in cell 10
    val Row(dim: Int) = featuresWCountTransform(training).select(dimUDF($"features")).first()
    jonathanindig
    @jonathanindig
    Hmm, I guess something must be happening between cells 12 and 22?
    If you change cell 22 to a Scala cell does it work?
    Ignacio
    @ghoto
    I'm not executing anything between the two cells though (there are other cells, but I'm not executing them). If I switch cell 22 to Scala it evaluates the variable. Also, in Python, when I'm in cell 12 the variable shows up in the autocompletion, but in cell 22 it doesn't show up in Python mode (it does in Scala mode). Perhaps this can help to figure out what is happening.
    Michael Pilosov
    @mathematicalmichael

    @mathematicalmichael what do you mean by run as root? I can run it just fine without "--privileged" (are you talking about the user inside the container?)

    oh, I'm not surprised it can be run without root. I was just referring to the fact that the docker image, as it currently is, runs as the default user, which is root.

    @jonathanindig for real? Yeah, sure, I ... can actually do that ... surprisingly. Should I raise an issue first? what's your protocol?

    jeremyrsmith
    @jeremyrsmith

    @ghoto does it change anything if you don’t use the Row extractor? Like

    val dim = featuresWCountTransform(training).select(dimUDF($"features").as[Int]).head()

    ?

    the extractor should work, just trying to figure out where the bug might be
    jonathanindig
    @jonathanindig

    @jonathanindig for real? Yeah, sure, I ... can actually do that ... surprisingly. Should I raise an issue first? what's your protocol?

    @mathematicalmichael just sending a PR is fine :) no process here!

    Ignacio
    @ghoto
    @jeremyrsmith I had to restart Polynote over the weekend, and now the same notebook isn't having the issue (with the Row extractor). I wish I could reproduce it.
    Ignacio
    @ghoto
    for i in range(field_size - 1):
        for j in range(i+1, field_size):
            interaction_list.append(tf.multiply(embeddings[:,i,:], embeddings[:,j,:]))
            idx_i_mapped = sum([mapper(feat_index[:,i]) for mapper in category_mappers])
            idx_j_mapped = sum([mapper(feat_index[:,j]) for mapper in category_mappers])
            interaction_idx.append(c - (cats - idx_i_mapped + 1) * (cats - idx_i_mapped ) / 2 + idx_j_mapped - idx_i_mapped)

    This gave me NameError: name 'i' is not defined (java.lang.RuntimeException). I had to set values for i and j before the for loop:

    i = 0
    j = 0

    This is a bit odd..

    jonathanindig
    @jonathanindig
    A simple nested for seems to work for me:
    for i in range(10): 
        for j in range(i, 10):
            print(i, j)
    What line does it complain about i on?
    Ignacio
    @ghoto
    NameError: name 'i' is not defined (java.lang.RuntimeException)
    I need to define it before the for.
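    Roughly, the version that runs looks like this (simplified, with a placeholder field_size and a placeholder body; the real body uses other variables from the notebook):
    field_size = 5  # placeholder value; the real one is computed elsewhere in the notebook
    # workaround: predefine the loop variables before the for
    i = 0
    j = 0
    for i in range(field_size - 1):
        for j in range(i + 1, field_size):
            print(i, j)  # placeholder for the real loop body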
    jonathanindig
    @jonathanindig
    Oh dang I forgot that Python name errors don’t give you a line number :(
    Filed polynote/polynote#652 for that
    So @ghoto I tried to reproduce with some simple examples but wasn’t able to unfortunately. It’s definitely weird that you had to predefine those variables. Have you been able to come up with a simpler reproduction?
    Ignacio
    @ghoto
    I can't reproduce it either… perhaps there is a way I could debug these errors? I could modify the code to print stuff to stdout.
    jonathanindig
    @jonathanindig
    I think we can at least figure out which i it’s not happy about by adding some print statements inside the for loop
    unfortunately debugging the python internals is kind of tricky
    For example, this code will print “outer” and then “inner” before erroring:
    for i in range(10):
        print("outer")
        for j in range(100):
            print("inner")
            print(x, j) # <-- pretend the `x` on this line is the problematic `i`
    Ignacio
    @ghoto

    Also, I think JEP is playing some tricks on me

    ws = [1.0, 1.0, 1.0]
    print(type(ws))
    pythonDF.randomSplit(ws)

    I get the error TypeError: Error converting parameter 1: Expected [D but received a JavaObject. (java.lang.RuntimeException).

    ws type is <class 'list'> as it should be
    jeremyrsmith
    @jeremyrsmith
    hmmm, it looks like pythonDF may be a Scala DataFrame rather than a python one? Can you do type(pythonDF)? How was that created?
    I say that because if pythonDF was a Scala DataFrame then its randomSplit method would expect a Java array of doubles (hence the [D)
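    i.e. roughly this, just to see which interop layer the object is coming through (pythonDF standing in for whatever the variable is called in the notebook):
    print(type(pythonDF))                         # pyspark DataFrame wrapper, or a raw Scala Dataset handle?
    print(type(getattr(pythonDF, "_jdf", None)))  # if it is a pyspark wrapper: py4j JavaObject vs. jep PyJObject underneath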
    Ignacio
    @ghoto
    oh, this is the one in the example..
    from pyspark.sql import DataFrame
    pythonDF = DataFrame(scalaDF, sqlContext) # sqlContext is provided by Polynote
    pandaDF = pythonDF.toPandas()
    pandaDF.head()
    This is how it's defined
    jeremyrsmith
    @jeremyrsmith
    hmm, that certainly seems like it ought to work. :disappointed:
    Ignacio
    @ghoto
    (it's from the "Scala Spark to Pandas" example notebook)
    jonathanindig
    @jonathanindig
    @jeremyrsmith I’m wondering whether creating Python DFs from Scala ones might have some weird edge cases - I think this method might only work superficially. I don’t think it’s a 100% supported way to do things (not sure if there’s a better way to do it though)
    Stanis Shkel
    @sshkel
    Quick question about the SQL kernel: is it possible to get data displayed immediately when you do select * from blah? Right now it's stored in an Out variable and I end up adding another cell below to do Out.show, or using the data preview button.
    jeremyrsmith
    @jeremyrsmith
    Early on we had it display some data right away, but it’s really less flexible that way – much of the time it’s not that useful to just see a few rows of the data in the notebook, so we just treat it as a DataFrame and that way you can choose what to do with it
    Open to suggestions on how that could be configured though
    Ignacio
    @ghoto

    I reviewed the PySpark code for randomSplit and I think it's more of a problem on their side. What they do is use the Scala DataFrame's randomSplit method, so they convert the Python list to a Scala list, and I think JEP doesn't handle this conversion well.

    rdd_array = self._jdf.randomSplit(_to_list(self.sql_ctx._sc, weights), long(seed))

    Here _jdf is the scalaDF that was passed in at creation, DataFrame(scalaDF, sqlContext).
    weights is a Python list of doubles that gets converted to a Scala List[Double].
    If you call from the notebook

    pythonDF._jdf.randomSplit(weights)

    you get the result you were supposed to get.
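
    For completeness, the rough manual equivalent (skipping pyspark's _to_list and then re-wrapping the pieces the way pyspark's own randomSplit does on its last line) would be something like the sketch below; whether the re-wrapping behaves the same under jep is another question:

    from pyspark.sql import DataFrame

    ws = [0.8, 0.2]
    # call the Scala Dataset.randomSplit directly with a plain Python list (the part that works)
    jdfs = pythonDF._jdf.randomSplit(ws)
    # re-wrap each returned Scala DataFrame, mirroring what pyspark does after its _to_list conversion
    splits = [DataFrame(jdf, sqlContext) for jdf in jdfs]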

    jeremyrsmith
    @jeremyrsmith
    Hmm, but that conversion should be handled by Py4J in the pyspark case. If it’s going through jep then something’s wrong...
    Ignacio
    @ghoto
    That was my guess.. but I think you might be right
    from pyspark.sql.column import _to_list
    
    ws = [0.1, 0.9]
    ws_sparky = _to_list(SparkContext, [0.1, 0.9])
    print(type(ws))
    print(type(ws_sparky))
    # training_data._jdf == trainingData the original dataframe in scala
    training_data._jdf.randomSplit(ws) # OK
    training_data._jdf.randomSplit(ws_sparky) # Fails and it shouldn't (this is the equivalent of running training_data.randomSplit in PySpark)
    Output:
    <class 'list'>
    <class 'py4j.java_gateway.JavaObject'>
    jeremyrsmith
    @jeremyrsmith
    What’s the type of training_data._jdf?
    Ignacio
    @ghoto
    <class 'jep.PyJObject'>
    jeremyrsmith
    @jeremyrsmith
    Maybe the issue is that it’s supposed to be a Py4J JavaObject but is actually a jep PyJObject? That would be something we’d have to handle specially
    ah, yeah that’s gotta be the problem
    interesting that a lot of stuff does work regardless of using the wrong interop mechanism… the joys of dynamic typing :smile:
    Ignacio
    @ghoto
    :joy:
    jeremyrsmith
    @jeremyrsmith
    So I guess the fix would be to specially handle spark stuff in the python interpreter when using pyspark, so that we can put it into python with py4j rather than jep
    Ignacio
    @ghoto
    The problem is that the original DataFrame in Scala is, for the Python cells, a PyJObject. So perhaps all DataFrames should be converted to py4j JavaObjects? I'm trying to think how to make that possible; perhaps it's somewhere in the Python interpreter.