import sqlContext._
brings SQLContext#createDataFrame into scope, so it can be called directly.
Use the --packages option to include the spark-mongodb dependency, e.g.:
cd somewhere_you_extracted_spark_bins/
./bin/spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.1
Hi! I'm trying to get Spark-MongoDB working. I'm using the example from the First_Steps document, but I get the following exception:
# pyspark --packages com.stratio.datasource:spark-mongodb_2.10:0.10.3
>>> from pyspark.sql import SQLContext
>>> sqlContext.sql("CREATE TEMPORARY TABLE col_table USING com.stratio.datasource.mongodb OPTIONS (host 'host:port', database 'db', collection 'col')")
[...]
Exception in thread "Thread-1998" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/parse/VariableSubstitution
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2625)
at java.lang.Class.privateGetPublicMethods(Class.java:2743)
at java.lang.Class.getMethods(Class.java:1480)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:365)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:317)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
I'm using Spark 1.5.0 from Cloudera. I'll provide any other information if needed. Could you point out where I am going wrong here?
@darroyocazorla I've tried it, but no luck. These are the relevant log lines from the pyspark startup:
2016-05-10 13:59:10,404 WARN [Thread-2] spark.SparkConf (Logging.scala:logWarning(71)) - Setting 'spark.executor.extraClassPath' to ' [ ... ] ' as a work-around.
Both spark.executor.extraClassPath and spark.driver.extraClassPath contain jars from /usr/lib/hive/lib/. Maybe I'm still missing something?
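A quick way to double-check this from inside the running session (assuming sc is the SparkContext that the pyspark shell creates) is something like:

# Print the classpath-related settings the running session actually uses;
# `sc` is the SparkContext provided by the pyspark shell.
for key, value in sc.getConf().getAll():
    if "extraClassPath" in key:
        print("%s:\n  %s" % (key, value.replace(":", "\n  ")))

The class from the stack trace (org.apache.hadoop.hive.ql.parse.VariableSubstitution) comes from Hive, typically the hive-exec jar, so it is worth confirming that jar really shows up in both lists.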
How can I add query filters to load only selected data from MongoDB using PySpark?
Using
reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='ip:27017', database='db', collection='coll').load()
would load the whole collection into a DataFrame, while I only want to use a chunk of the collection. Is there any equivalent of Mongo's find() method where I can specify query filters?
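Not connector-specific, but a sketch of what usually works with the standard DataFrame API (the 'status' and 'name' field names below are made up): apply filter()/where() and select() to the loaded DataFrame. Spark can push such predicates down to data sources that support filtered scans, so it behaves roughly like a Mongo find() with a query document and a projection instead of materialising the whole collection.

reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='ip:27017', database='db', collection='coll').load()

# Keep only the matching documents and the listed fields;
# 'status' and 'name' are placeholder field names from the collection.
subset = data.filter(data.status == 'active').select('name', 'status')
subset.show()

# An equivalent SQL-style predicate string also works:
subset2 = data.filter("status = 'active'")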