Michał Gołębiewski
@mjgolebiewski
ldconfig -p | grep libgdal
root@dp-test-01:~# ldconfig -p | grep libgdal
    libgdal.so.20 (libc6,x86-64) => /usr/local/lib/libgdal.so.20
    libgdal.so (libc6,x86-64) => /usr/local/lib/libgdal.so
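A quick cross-check, assuming it is run on the same node (just a sketch, not something RasterFrames itself does): trying to dlopen the library from Python confirms the loader can actually resolve the soname that ldconfig reports.
import ctypes

try:
    ctypes.CDLL("libgdal.so.20")  # soname reported by `ldconfig -p` above
    print("libgdal.so.20 loaded OK")
except OSError as e:
    print("could not load libgdal:", e)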
Simeon H.K. Fitch
@metasim
huh
Simeon H.K. Fitch
@metasim
the JVM should be able to find it then :/
I'd try the spark job with UDF calling gdal_version next.
Michał Gołębiewski
@mjgolebiewski
I'm actually not sure how to do it
Simeon H.K. Fitch
@metasim
OK... you'll have to give me a bit to put out this fire
Simeon H.K. Fitch
@metasim
FWIW, this is what I thought I'd have you do (but it doesn't work):
from pyrasterframes import *
from pyrasterframes.utils import create_rf_spark_session, gdal_version
import pyspark.sql.functions as F

spark = create_rf_spark_session()

df = spark.range(1, 10, numPartitions=10)

@F.udf
def check_version(dummy):
    return gdal_version()

df.select(check_version(df.id)).show()
The reason is that gdal_version() works by talking with the JVM via the SparkContext, and the process forked to evaluate the UDF doesn't have one.
Thinking of another way.
In case you're interested, this is what happens behind the scenes:
https://dzone.com/storage/temp/9998386-figure2.png
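A rough workaround sketch, not from the thread: since gdal_version() needs the driver-side JVM gateway, probe the workers directly from Python instead. This only shows whether the forked Python worker on each node can dlopen libgdal, not whether the executor JVM can, so treat it as a hint rather than proof.
import ctypes
import socket
import pyspark.sql.functions as F
from pyrasterframes.utils import create_rf_spark_session

spark = create_rf_spark_session()
df = spark.range(1, 10, numPartitions=10)

@F.udf
def probe_gdal(dummy):
    # runs inside the Python worker process on each executor node
    try:
        ctypes.CDLL("libgdal.so.20")
        return socket.gethostname() + ": ok"
    except OSError as e:
        return socket.gethostname() + ": " + str(e)

df.select(probe_gdal(df.id)).show(truncate=False)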
Simeon H.K. Fitch
@metasim
@mjgolebiewski I'm about half way through figuring something out, but I have to step away for a whole series of meetings :( Just know I'm looking into it.
Simeon H.K. Fitch
@metasim
I'm coming up empty figuring out how to do this in Python. I know how to do it in Scala, so maybe that's the route to at least test this?
Michał Gołębiewski
@mjgolebiewski
@metasim hmm, well, I lack any experience with Scala. However, if you think it's the only way to fix it, I can try to set up an environment. At this point I need to get it to work by all means.
I don't know how important it is, but the Spark UI shows only failed jobs on the executor machines; the driver and master are all good.
Simeon H.K. Fitch
@metasim
That's important ^^^
Michał Gołębiewski
@mjgolebiewski
I tried creating the Spark session directly from the pyspark shell using local jars:
dpuser@dp-test-03:~$ pyspark --master spark://dp-test-01.tap-psnc.net:7077 --py-files /opt/data-platform/rasterframes/pyrasterframes_2.11-0.9.0-python.zip --conf spark.jars=/opt/data-platform/rasterframes/pyrasterframes/jars/pyrasterframes-assembly-0.9.0.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrator=org.locationtech.rasterframes.util.RFKryoRegistrator --conf spark.kryoserializer.buffer.max=500m
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Python version 3.6.9 (default, Apr 18 2020 01:56:04)
SparkSession available as 'spark'.
>>> import pyrasterframes
>>> spark = spark.withRasterFrames()
>>> df = spark.read.raster('https://landsat-pds.s3.amazonaws.com/c1/L8/158/072/LC08_L1TP_158072_20180515_20180604_01_T1/LC08_L1TP_158072_20180515_20180604_01_T1_B5.TIF')
20/06/25 13:40:33 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
I'm not even sure if calling RasterFrames without Jupyter will help, but this is the error I'm getting when calling any functions.
Michał Gołębiewski
@mjgolebiewski
adding --conf spark.sql.catalogImplementation=in-memory helped
@metasim at this point I'm basically sure my master and driver are okay, but 2 nodes are returning errors.
Also, the pyspark shell returns the same errors as Jupyter, so it's not Jupyter's fault.
Sometimes it works, most of the time it doesn't - it seems like my nodes are not working as they should.
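For reference, the same settings can be set programmatically; this is just the flags from the pyspark command above expressed through SparkSession.builder, including the in-memory catalog setting that helped (a sketch; the --py-files part still has to be handled separately).
from pyspark.sql import SparkSession
import pyrasterframes

spark = (
    SparkSession.builder
    .master("spark://dp-test-01.tap-psnc.net:7077")
    .config("spark.jars", "/opt/data-platform/rasterframes/pyrasterframes/jars/pyrasterframes-assembly-0.9.0.jar")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.locationtech.rasterframes.util.RFKryoRegistrator")
    .config("spark.kryoserializer.buffer.max", "500m")
    .config("spark.sql.catalogImplementation", "in-memory")  # the setting that helped
    .getOrCreate()
    .withRasterFrames()  # added by importing pyrasterframes
)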
Michał Gołębiewski
@mjgolebiewski
I tried to run it with --master yarn --deploy-mode client and it just works... I'm not touching it, I don't want to break anything. @metasim thanks for your help :)
Michał Gołębiewski
@mjgolebiewski

Analysing the processes launched by Spark standalone and by YARN gives a hint about what was happening (and it confirms what you said yesterday):
spark standalone:

hadoop   17359 51.0  7.6 4700676 625464 ?      Sl   14:50   0:22 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/opt/data-platform/spark-current/conf/:
/opt/data-platform/spark-current/jars/*:
/opt/data-platform/hadoop-2.7.3/etc/hadoop/:
/opt/data-platform/hadoop-2.7.3/share/hadoop/common/lib/*:
/opt/data-platform/hadoop-2.7.3/share/hadoop/common/*:
/opt/data-platform/hadoop-2.7.3/share/hadoop/hdfs/:
/opt/data-platform/hadoop-2.7.3/share/hadoop/hdfs/lib/*:
/opt/data-platform/hadoop-2.7.3/share/hadoop/hdfs/*:
/opt/data-platform/hadoop-2.7.3/share/hadoop/yarn/lib/*:
/opt/data-platform/hadoop-2.7.3/share/hadoop/yarn/*:
/opt/data-platform/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:
/opt/data-platform/hadoop-2.7.3/share/hadoop/mapreduce/* 
-Xmx1024M 
-Dspark.driver.port=40359 
org.apache.spark.executor.CoarseGrainedExecutorBackend 
    --driver-url spark://CoarseGrainedScheduler@dp-test-03:40359 
    --executor-id 2 
    --hostname 192.168.0.228 
    --cores 3 
    --app-id app-20200625145053-0019 
    --worker-url spark://Worker@192.168.0.228:34149

yarn:

hadoop   17777  0.0  0.0  13316  3012 ?        Ss   15:10   0:00 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java 
-server 
-Xmx1024m 
-Djava.io.tmpdir=/tmp/hadoop-hadoop/nm-local-dir/usercache/dpuser/appcache/application_1592828471999_0002/container_1592828471999_0002_01_000002/tmp 
'-Dspark.driver.port=39827' 
-Dspark.yarn.app.container.log.dir=/opt/data-platform/hadoop-2.7.3/logs/userlogs/application_1592828471999_0002/container_1592828471999_0002_01_000002 
-XX:OnOutOfMemoryError='kill %p' 
org.apache.spark.executor.CoarseGrainedExecutorBackend 
    --driver-url spark://CoarseGrainedScheduler@dp-test-03:39827 
    --executor-id 1 
    --hostname dp-test-01 
    --cores 1 
    --app-id application_1592828471999_0002 
    --user-class-path file:/tmp/hadoop-hadoop/nm-local-dir/usercache/dpuser/appcache/application_1592828471999_0002/container_1592828471999_0002_01_000002/__app__.jar 
    --user-class-path file:/tmp/hadoop-hadoop/nm-local-dir/usercache/dpuser/appcache/application_1592828471999_0002/container_1592828471999_0002_01_000002/pyrasterframes-assembly-0.9.0.jar 1>/opt/data-platform/hadoop-2.7.3/logs/userlogs/application_1592828471999_0002/container_1592828471999_0002_01_000002/stdout 2>/opt/data-platform/hadoop-2.7.3/logs/userlogs/application_1592828471999_0002/container_1592828471999_0002_01_000002/stderr
The standalone one had no user class path for the assembly jar mentioned - no clue why, though.
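A hedged guess, not a confirmed fix: the YARN launch adds the assembly jar with --user-class-path, while the standalone launch has no such entry. If the jar exists at the same path on every worker, explicitly putting it on the executor classpath is one thing that could be tried in standalone mode.
from pyspark.sql import SparkSession

ASSEMBLY = "/opt/data-platform/rasterframes/pyrasterframes/jars/pyrasterframes-assembly-0.9.0.jar"

spark = (
    SparkSession.builder
    .master("spark://dp-test-01.tap-psnc.net:7077")
    .config("spark.jars", ASSEMBLY)                     # ship the jar to executors
    .config("spark.executor.extraClassPath", ASSEMBLY)  # and add it to their classpath
    .getOrCreate()
)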
Jason T Brown
@vpipkt
Tomorrow we have a talk at Spark+AI Summit at 10:00 AM PDT / 1:00 PM EDT ... the conference is all virtual and free! There are some other cool spatial data sessions too! Please check us out; it will be somewhat broad and high-level, aimed at a data scientist audience, but hopefully interesting background on the project! Thanks all!
https://databricks.com/session_na20/bring-satellite-and-drone-imagery-into-your-data-science-workflows
Simeon H.K. Fitch
@metasim
:metal:
Jason T Brown
@vpipkt
Also, since I know many people will ask... the slides and notebook used in the demo are here: https://github.com/vpipkt/spark_ai_summit_satellite
Stephen S
@stephensx
Awesome, @vpipkt ... look forward to watching the videos!
Jason T Brown
@vpipkt
Our live session just ended. Some very good feedback from the audience
Yuri de Abreu
@yurigba

I have used GDAL to read GRIB files and use them in RasterFrames; however, there is an issue in GDAL's degrib library that needs a fix, and I'll open an issue there when I have some time. When they fix this, I think RasterFrames will be able to read weather data without problems.

As of now I am using cfgrib to convert to NetCDF and read it with xarray, then convert to GeoTIFF and then read that in RasterFrames. Kinda cumbersome.
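A rough sketch of that conversion path, skipping the explicit NetCDF intermediate since cfgrib can open the GRIB directly in xarray. The file names, the variable name t2m, and the CRS are illustrative assumptions, and it assumes cfgrib and rioxarray are installed alongside xarray.
import xarray as xr
import rioxarray  # noqa: F401  (registers the .rio accessor)

ds = xr.open_dataset("sample.grib", engine="cfgrib")   # read GRIB via cfgrib
band = ds["t2m"]                                       # pick one data variable (assumed name)
band.rio.set_spatial_dims(x_dim="longitude", y_dim="latitude", inplace=True)
band.rio.write_crs("EPSG:4326", inplace=True)          # tag a CRS so the GeoTIFF is georeferenced
band.rio.to_raster("sample.tif")                       # write GeoTIFF

df = spark.read.raster("sample.tif")                   # then read it in RasterFrames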

Jason T Brown
@vpipkt
@yurigba thanks so much for that. Personally, I have not used GRIB files.
Yuri de Abreu
@yurigba
[image attached]
This is RasterFrames loading a sample GRIB file.
The issue with degrib is related to the way that it calculates the raster extent
Hajji Hicham
@hajjihi
@vpipkt thanks for the excellent presentation and demo - if only the video background were different, but I guess that was imposed by Databricks.
Jason T Brown
@vpipkt
I appreciate it very much @hajjihi, and yes, they had their own production team working on many aspects of the videos.
Michał Gołębiewski
@mjgolebiewski
@vpipkt great talk! just finished watching it
Yuri de Abreu
@yurigba
I see that GeoTrellis is already bumping a placeholder issue to add support for Spark 3.0; how is this going for RasterFrames? Is this a priority for the next RasterFrames versions?
Jason T Brown
@vpipkt
We have discussed it a little internally, but we have not yet done a more formal exploration of what would be involved... From Spark+AI Summit we know there have been a lot of changes in Spark SQL, but we're unsure yet how extensively these will affect the project.
Yuri de Abreu
@yurigba
Right, thanks for the information.
tieuthienvn1987
@tieuthienvn1987
[image attached]
Hi all, can anyone help me? I installed the pyrasterframes lib on Spark with Anaconda3 and got the following error:
tieuthienvn1987
@tieuthienvn1987
[image attached]
Yuri de Abreu
@yurigba
@tieuthienvn1987 have you tried to do conda install -c conda-forge gdal=2.4.4 before pip installing rasterframes?
tieuthienvn1987
@tieuthienvn1987
Thank you. I will try doing it your way. If there are errors again, please help me.
tieuthienvn1987
@tieuthienvn1987
[image attached]
tieuthienvn1987
@tieuthienvn1987
I tried that, but got the following errors:
Yuri de Abreu
@yurigba
As far as I know, RasterFrames is not compatible with GDAL 3.0.4. Try again with GDAL 2.4.4.
Jason T Brown
@vpipkt
Yes, I do not know of any testing done with GDAL 3.x yet.
GDAL 2.4.4 is what the underlying Java library expects.
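For what it's worth, a quick sanity check (a sketch; it assumes the osgeo Python bindings came with the conda-forge gdal package) to see which GDAL version each side reports:
from osgeo import gdal
print("Python bindings see GDAL:", gdal.VersionInfo("RELEASE_NAME"))

from pyrasterframes.utils import create_rf_spark_session, gdal_version
spark = create_rf_spark_session()
print("RasterFrames reports:", gdal_version())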