Andrew Kelley
Hello, I am having some issues following your Scala code example. I get the error "not found: value createDataFrame". Am I missing an import or something?
Pablo Francisco Pérez Hidalgo
Hi @kelleyaj, the example for the Scala API uses spark-shell, so a sqlContext reference is in scope by default.
The line import sqlContext._ brings SQLContext#createDataFrame into the same scope, so it can be called right away.
If you want to reproduce the example environment easily, you can download a binary distribution of Spark and launch its shell with the --packages option to include the spark-mongodb dependency, e.g.:
cd somewhereyouextracted_sparkbins/
./bin/spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.1
Pablo Francisco Pérez Hidalgo

Alternatively, you can create a SQLContext instance yourself in your Spark application code:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // or new HiveContext(sc), from org.apache.spark.sql.hive
import sqlContext._

I hope that helps.

Hi, could somebody help me update a collection? I know how to insert and select.
Pablo Francisco Pérez Hidalgo
@wingerli You can update by inserting the data from a DataFrame, using the saveToMongodb method of MongodbDataFrame.
Pablo Francisco Pérez Hidalgo
Alternatively, you can use the Casbah library, which is already a dependency of Spark-MongoDB.
Jesus Liebana Losada
Hi, I have a question: does this framework support batching?
I need a "lazy" DataFrame.
Miguel Angel Fernandez Diaz
Hi @xuskorea, DataFrames are lazy by design: you can add transformations to a DataFrame, and those transformations are not applied until an output action is invoked.
Spark-MongoDB performs batch operations through Spark and accesses MongoDB efficiently.

Hi! I'm trying to get Spark-MongoDB working. I use the example in the First_Steps document, but get the exception:

# pyspark --packages com.stratio.datasource:spark-mongodb_2.10:0.10.3
>>> from pyspark.sql import SQLContext
>>> sqlContext.sql("CREATE TEMPORARY TABLE col_table USING com.stratio.datasource.mongodb OPTIONS (host 'host:port', database 'db', collection 'col')")
Exception in thread "Thread-1998" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/parse/VariableSubstitution
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2625)
    at java.lang.Class.privateGetPublicMethods(Class.java:2743)
    at java.lang.Class.getMethods(Class.java:1480)
    at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:365)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:317)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

I'm using Spark 1.5.0 from Cloudera. I'll provide any other information if needed. Could you point out where I am going wrong here?

This message was deleted

@darroyocazorla I've tried it, but no luck. According to the log lines at pyspark startup:

2016-05-10 13:59:10,404 WARN  [Thread-2] spark.SparkConf (Logging.scala:logWarning(71)) - Setting 'spark.executor.extraClassPath' to '  [ ... ]   ' as a work-around.

Both spark.executor.extraClassPath and spark.driver.extraClassPath contain jars from /usr/lib/hive/lib/ . Maybe I'm still missing something?

Miguel Angel Fernandez Diaz
Hi @tunasalat, as @darroyocazorla mentioned above, this issue is not related to the Spark-MongoDB datasource; you will face the same problem if you use any other datasource. It is a problem with how the Cloudera distribution uses Spark, so you'll find the proper support for this issue in the Cloudera forums.
@miguel0afd thanks! I'll try my luck there.
Ankit Gohil

How can I add query filters to load selective data from MongoDB using PySpark?

reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='ip:27017', database='db', collection='coll').load()

This would load the whole collection into a DataFrame, but I only want to use a chunk of the collection. Is there any equivalent of Mongo's find() method where I can specify query filters?

Miguel Angel Fernandez Diaz
Hi @gohilankit,
the load method is lazy, so you can add transformations (also lazy) to the data variable, and they won't be applied until an output action is called.
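
A minimal sketch of that advice in PySpark, assuming the sqlContext from the pyspark shell and the reader options from the question. The helper name load_filtered and the field "age" are made up for illustration. Whether the predicate is pushed down to MongoDB or applied by Spark after reading depends on the connector, and is not confirmed in this thread.

```python
MONGO_SOURCE = "com.stratio.datasource.mongodb"


def load_filtered(sql_context, host, database, collection, condition):
    """Return a lazy DataFrame restricted to rows matching `condition`.

    `condition` is a SQL expression string, e.g. "age > 30" (hypothetical field).
    """
    reader = sql_context.read.format(MONGO_SOURCE)
    data = reader.options(host=host, database=database, collection=collection).load()
    # filter() is a transformation: no rows are fetched until an output
    # action such as count(), collect() or show() runs on the result.
    return data.filter(condition)


# In the pyspark shell, where sqlContext already exists:
#   adults = load_filtered(sqlContext, "ip:27017", "db", "coll", "age > 30")
#   adults.show()
```

The filter is passed as a string so the same helper works for any column without building Column objects first.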
Hi all, I have a question about this lib: how do I access an array inside a document? For example, given the document { "day":"2016-06-06", "items":[{"ts":123, "val":123}]}, how can I access items.ts?
Pablo Francisco Pérez Hidalgo
Hi @Misfit-John , according to SparkSQL syntax you can try with: SELECT items[0].ts FROM ....
Hi, I am using Java, Spark and the Spark-MongoDB library for one of our applications. Is there a way to query using Mongo syntax in addition to SQL-like statements? I need to perform an unwind operation on the collection, and that is simple with a Mongo query.
Miguel Angel Fernandez Diaz
Hi @dipayan90, it is not possible to use Mongo syntax from our connector. Spark SQL was born with the idea of being a SQL abstraction over any data source, unifying the way data is queried, so Spark is not designed for this kind of operation. In addition, no other Spark datasource allows native syntax to be used.
@miguel0afd I ended up using Spring's MongoTemplate to get the aggregate results I wanted and then converted them into a DataFrame. Anyway, thanks for your response. That said, https://github.com/mongodb/mongo-hadoop allows you to pass in Mongo queries using the "mongo.input.query" parameter. However, one of its caveats is that it doesn't allow complex aggregate queries.
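
For the unwind case above, one possible alternative that stays within Spark SQL is its explode function, which emits one row per array element. This is a rough sketch, not something the connector authors recommended in this thread. The helper name unwind is made up, the column names come from the earlier { "day": ..., "items": [...] } example, and on Spark 1.x explode inside selectExpr may require a HiveContext (an assumption worth checking against your version).

```python
def unwind(df, array_column, keep_columns):
    """Return one output row per element of `array_column`, similar to $unwind.

    `df` is any Spark DataFrame; `keep_columns` are copied to every output row.
    """
    exprs = list(keep_columns) + ["explode({0}) AS {0}".format(array_column)]
    # selectExpr() is lazy as well; nothing runs until an output action.
    return df.selectExpr(*exprs)


# With a DataFrame `data` loaded from the collection:
#   flat = unwind(data, "items", ["day"])
#   flat.select("day", "items.ts").show()
```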
Excuse me, can we only discuss Spark-MongoDB here? May I ask a question about Spark SQL?
Reinis Vicups
Hi, is there a way to eliminate this exception 'CodecConfigurationException: Can't find a codec for class breeze.linalg.SparseVector' without coding my own Codec for breeze Vectors?
Pushpinder Heer
What options do I have when using Stratio to access MongoDB 2.6.7?