Dermot Haughey
@hderms
Presumably Spark works well with automatic retrying of failures, but everyone keeps telling me Spark isn't well suited to lots of asynchronous IO that can fail, for reasons that aren't clear to me.
Peng Cheng
@tribbloid
@hderms I believe that's because Spark doesn't implement atomic retry, only task retry, which always starts from the beginning of the partition.
In SpookyStuff the already-completed IO is cached, and a local retry happens before the partition retry, so that problem no longer exists.
There is one thing Spark doesn't handle very well: ordering the IO requests to honour the robots.txt protocol. I'll add this in my next major release.
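
[A toy sketch of the "cache completed IO, retry locally first" idea described above; the helper names are hypothetical, not SpookyStuff's actual API.]

import scala.collection.concurrent.TrieMap
import scala.util.{Failure, Success, Try}

// Pages already fetched by this task survive a local retry, so only the
// failed requests are re-run instead of the whole partition.
val pageCache = TrieMap.empty[String, String] // url -> page

def fetchWithLocalRetry(url: String, maxLocalRetries: Int)(io: String => String): String =
  pageCache.getOrElseUpdate(url, {
    def attempt(n: Int): String = Try(io(url)) match {
      case Success(page)                      => page
      case Failure(_) if n < maxLocalRetries  => attempt(n + 1) // local retry, no partition restart
      case Failure(e)                         => throw e // only now does Spark's task retry kick in
    }
    attempt(0)
  })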
Dermot Haughey
@hderms
cool man
thanks
Peng Cheng
@tribbloid
cheers mate
Roger Giuffrè
@webroboteu
Hi, how is the refactoring going? Any news on the joint use with Zeppelin? I've put the API implementation for interop with it on standby.
Peng Cheng
@tribbloid
Interop with Zeppelin?
The Thrift API will still take some time. There is an interesting graph exploration language introduced by GraphFrames 0.4, which I may end up using.
So far only the Scala API is available.
BTW, does Zeppelin have an interpreter API now?
Peng Cheng
@tribbloid
@webroboteu Speaking of the Scala API, I think it's stable enough and all further changes will be backward compatible. But I'd still prefer you to wait for the Thrift API, as it's much faster than Zeppelin's interpreter.
Roger Giuffrè
@webroboteu

Then you will use Thrift to generate the Scala code.

Is there something in the code already?

Roger Giuffrè
@webroboteu
How would you rate the ability to interoperate Nutch or StormCrawler with SpookyStuff for extracting specific data?
Peng Cheng
@tribbloid
I don't think Apache Nutch is as scalable as the modern big data stack, or good with the deep web. Generally, most crawlers can't handle the deep web.
Haven't tried StormCrawler yet.
Technically, I'm using Thrift to generate RDDs; since Thrift and Hive are written in Java, and both are already integrated into spark-hive-thriftserver, there is very little I need to do for the integration. Most of the work is just writing the query parser.
If I'm lucky I can steal some of it from GraphFrames' parser, but that's not guaranteed.
Eric Pugh
@epugh
I wanted to pick up trying out Spookystuff again, saw some new commits!
Peng Cheng
@tribbloid
How is it going so far? I haven't released for a long time; I'm trying to include most of the drone stuff in the 0.4 release.
Roger Giuffrè
@webroboteu

Can you explain, in general, how in the Scala code a string-typed expression is converted to an Extractor[Any], which is then consumed by the wget action's parsing process?

// Shorthand of fetch
def wget(
    ex: Extractor[Any],
    filter: DocFilter = Const.defaultDocumentFilter,
    failSafe: Int = -1,
    genPartitioner: GenPartitioner = spooky.conf.defaultGenPartitioner
): FetchedDataset = {

  var trace: Set[Trace] = Wget(ex, filter)

  if (failSafe > 0) trace = ClusterRetry(trace, failSafe)

  this.fetch(
    trace,
    genPartitioner = genPartitioner
  )
}

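[A minimal sketch of the implicit-conversion pattern the question is about; illustrative only, the real conversions live in SpookyStuff's dsl package and are more elaborate.]

import scala.language.implicitConversions

trait Extractor[+T]
case class GetField(name: Symbol) extends Extractor[Any]

// An implicit def lets a plain Symbol such as 'url stand in for an Extractor,
// so wget('url) compiles as wget(symbolToExtractor('url)).
implicit def symbolToExtractor(s: Symbol): Extractor[Any] = GetField(s)

def wget(ex: Extractor[Any]): Unit = println(s"fetching via $ex")

wget('url) // prints: fetching via GetField('url)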

Peng Cheng
@tribbloid

Sure, :)

An Extractor applied to a Spark SQL Row has no type of its own; the generic type parameter is only there to enable the implicit functions in the DSL.

So Wget can take an Extractor of any type and apply it to each data row to get the URL, then use that to fetch web pages. E.g. for a DataFrame that looks like:

   url
1  http://a
2  http://b

you can write:

 Wget('url)

to fetch 2 pages into 2 data rows.

Does that answer your question?

Yours Peng
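
[A conceptual sketch of the behaviour described above, with hypothetical types rather than SpookyStuff's internals: the Wget action applies its Extractor to every data row to obtain a URL, then fetches one page per row.]

case class DataRow(data: Map[Symbol, Any])

// ex pulls a value (the URL) out of each row; http fetches the page for it.
def wgetAll(rows: Seq[DataRow], ex: DataRow => Any)(http: String => String): Seq[String] =
  rows.map(row => http(ex(row).toString))

val rows = Seq(DataRow(Map('url -> "http://a")), DataRow(Map('url -> "http://b")))
wgetAll(rows, _.data('url))(url => s"<page of $url>") // 2 pages for 2 rows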

Roger Giuffrè
@webroboteu

Forgive me, the flow is not yet clear to me. I'm studying Scala and DSLs.

Where can I find the Extractor type, and where is the point at which the URL string is converted to an Extractor?
Can you suggest, step by step, the flow from the wget call to the construction of the extractor? Must I first understand the logic of Spark SQL?
Do you have any pointers, handouts, or other bibliographic sources for properly interpreting your code?

Thank you

Roger Giuffrè
@webroboteu
You should add architecture documentation if you want to make it easier for programmers to contribute. In any case, I've decided to implement an additional method that starts from a simple Java string, deserializes it from YAML, and from there, directly in Scala, maps it onto the query.
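[A minimal sketch of that YAML-to-query idea, assuming SnakeYAML on the classpath and hypothetical field names; SpookyStuff itself exposes no such helper.]

import org.yaml.snakeyaml.Yaml
import scala.collection.JavaConverters._

// Hypothetical intermediate form that would later be mapped onto a
// SpookyStuff query in Scala.
case class WgetQuery(url: String, failSafe: Int)

def parseQuery(raw: String): WgetQuery = {
  val m = new Yaml().load(raw).asInstanceOf[java.util.Map[String, Any]].asScala
  WgetQuery(
    url = m("url").toString,
    failSafe = m.get("failSafe").map(_.toString.toInt).getOrElse(-1)
  )
}

parseQuery("url: http://a\nfailSafe: 3") // WgetQuery("http://a", 3)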
Peng Cheng
@tribbloid
Agreed. I believe I have some from a previous presentation, but it's about algorithms, not architecture. Right now I'll have to give simple answers.
Roger Giuffrè
@webroboteu
Hi, what do you think about using this to create an external DSL? http://www.antlr.org/tools.html
Roger Giuffrè
@webroboteu
Before deriving the grammar of the query language myself, I was wondering if you could make it available quickly.
Peng Cheng
@tribbloid
@webroboteu Extractor is like Expression in Spark SQL, except that it is applied to a FetchedRow (which holds both structured data and unstructured web pages) instead of an InternalRow. It can have a generic type parameter so the Scala compiler can decide what method members it has, and fail early at compile time when such a method member doesn't exist.
So far the extractor syntax is different from that of the Spark SQL DataFrame API DSL; that's because SpookyStuff doesn't use InternalRow for structured data (InternalRow was introduced in 1.4, after SpookyStuff's architecture was finished). This will be fixed in 0.5, which should allow the DataFrame DSL to be used out of the box.
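[A toy illustration of that compile-time trick, with made-up types (the real Extractor is richer): the generic parameter decides which DSL methods exist, via implicit enrichment.]

trait Extractor[+T] { def apply(row: Map[String, Any]): T }

// .toUpper is only available when the extractor's type parameter is String;
// calling it on an Extractor[Int] fails at compile time, not at run time.
implicit class StringExtractorOps(ex: Extractor[String]) {
  def toUpper: Extractor[String] = new Extractor[String] {
    def apply(row: Map[String, Any]): String = ex(row).toUpperCase
  }
}

val url: Extractor[String] = new Extractor[String] {
  def apply(row: Map[String, Any]): String = row("url").toString
}

url.toUpper // compiles; an Extractor[Int] would have no such member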
Peng Cheng
@tribbloid
@webroboteu ANTLR looks sweet! But right now it's overkill, because SpookyStuff's 'DSL' is built on Scala operators; it's not a real DSL with a complete parser. This is frontend work, not quite my specialization.
Roger Giuffrè
@webroboteu
Hi, a stupid question about string interpolation in Scala: is the notation "abc{d}" equivalent to x"abc${'d}"? Is that right?
Roger Giuffrè
@webroboteu
I'm writing a wrapper-induction strategy in Spark. It's an unsupervised method. With this layer I can auto-generate a SpookyStuff query in my web service. Good.
Peng Cheng
@tribbloid
@webroboteu Yes, "abc{d}" is x"abc${'d}"; this is Scala's string interpolation feature.
@webroboteu Sounds like a great idea. Are you using the 0.4 API? 0.4 is close to release by now.
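[For reference, a minimal sketch of how such a custom interpolator can be defined on Scala's StringContext; illustrative only, since SpookyStuff's real x-interpolator builds an Extractor, not a String.]

implicit class XInterpolator(sc: StringContext) {
  // parts always has one more element than args; zipAll pads the last gap.
  def x(args: Any*): String =
    sc.parts.zipAll(args, "", "").map { case (p, a) => p + a }.mkString
}

val d = "hello"
x"abc${d}" // "abchello"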
Roger Giuffrè
@webroboteu
Hi, no, 0.3.2 at the moment. Can I use it with Spark 1.5.1?
On GitHub, can you tell me the coordinates to check out the 0.3.2 release? I can only find a release candidate.
Roger Giuffrè
@webroboteu
Is it this? tribbloid/spookystuff@3f29682
Roger Giuffrè
@webroboteu

var result = sc.parallelize(scala.collection.mutable.Seq(
    "http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA"
  ), 1)
  .flatMap(url => scala.collection.mutable.Seq("#comments").map(tag => tag + "\t" + url + tag))
  .tsvToMap("type\turl")
  .fetch(
    Visit("http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA")
      +> Click("ul.PostTabs li.active a")
      +> WaitFor("div.tab-pane.active li.review")
  )
  .flatExtract(S"div.tab-pane.active li.review")(
    x"${A"li.comment div.title".text}: ${A"li.comment div.body".text}" ~ 'comment,
    A"time.posted-on".text ~ 'date_status,
    A"div.details > div.aside".text ~ 'stars,
    A"div.details > div.Helpful div.vote-stats".text ~ 'useful,
    A"div.byline a:nth-of-type(1)".text ~ 'user_name,
    A"span.type".text ~ 'user_location,
    A"div.byline a:nth-of-type(2)".text ~ 'review_count
  )
  .toDF()
result.rdd.collect()

I get this error when I use fetch with a parallel collection:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 13.0 (TID 49) had a not serializable result: scala.collection.immutable.MapLike$$anon$2
Serialization stack:

- object not serializable (class: scala.collection.immutable.MapLike$$anon$2, value: Map('type -> #comments, 'url -> http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA#comments))
- field (class: com.tribbloids.spookystuff.row.DataRow, name: data, type: interface scala.collection.immutable.Map)
- object (class com.tribbloids.spookystuff.row.DataRow, DataRow(Map('type -> #comments, 'url -> http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA#comments),Some(c101ea74-1c5b-4e9f-a7ee-9c8ccfab3c51),0,false))
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at HelloWorld$.main(test.scala:55)
at HelloWorld.main(test.scala)

Roger
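
[For what it's worth, a well-known Scala gotcha produces exactly this MapLike$$anon$2 symptom; whether it is the cause here is an assumption, not a confirmed diagnosis. Map#mapValues and Map#filterKeys return lazy, non-serializable views rather than real Maps.]

val m = Map("type" -> " #comments ").mapValues(_.trim) // scala.collection.immutable.MapLike$$anon$2
// Forcing a strict copy restores serializability before the map reaches a Spark task:
val strict = m.map(identity) // plain immutable Map, Serializable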

Roger Giuffrè
@webroboteu
This is the only version that seems to give guarantees: https://github.com/tribbloid/spookystuff/tree/release-0.3.2
Do you know when the 0.4.0 version will be officially released? I'll have to update the query engine, since you changed the APIs.
Peng Cheng
@tribbloid
Strange, what I get is a different error:
.ActionException: 
{
| Visit(http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA,0 seconds,true)
+> Click(ul.PostTabs li.active a,0 seconds,true)
}
Snapshot: saved to: /home/peng/git/datapassport/spookystuff/core/temp/spooky-unit/errorDump/Visit/http/www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA/0_seconds/true/ErrorDump/Click/ul.PostTabs_li.active_a/0_seconds/true/--1854029617/d38ce84b-9a78-4ff8-9466-5da92b18d6a5
Screenshot: saved to: /home/peng/git/datapassport/spookystuff/core/temp/spooky-unit/errorDump/Visit/http/www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA/0_seconds/true/ErrorScreenshot/Click/ul.PostTabs_li.active_a/0_seconds/true/--1854029617/24ceb323-9f86-4971-a2b7-84e0b57eb01d
+- org.openqa.selenium.WebDriverException: {"errorMessage":"Can't find variable: Sizzle","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"75","Content-Type":"application/json; charset=utf-8","Host":"localhost:18164","User-Agent":"Apache-HttpClient/4.5.2 (Java/1.8.0_121)"},"httpVersion":"1.1","method":"POST","post":"{\"script\":\"return Sizzle(arguments[0])\",\"args\":[\"ul.PostTabs li.active a\"]}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/e508a590-0ce6-11e7-acc0-c9a91156371c/execute"}}
Command duration or timeout: 645 milliseconds
Build info: version: '2.53.0', revision: '35ae25b1534ae328c771e0856c93e187490ca824', time: '2016-03-15 10:43:46'
System info: host: 'peng-tribbloids', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-67-generic', java.version: '1.8.0_121'
Driver info: org.openqa.selenium.phantomjs.PhantomJSDriver
Capabilities [{applicationCacheEnabled=false, rotatable=false, phantomjs.page.settings.loadImages=false, handlesAlerts=false, databaseEnabled=false, version=2.1.1, platform=LINUX, browserConnectionEnabled=false, proxy={proxyType=direct}, nativeEvents=true, acceptSslCerts=false, phantomjs.page.settings.resourceTimeout=60000, driverVersion=1.2.0, phantomjs.page.settings.userAgent={hCode=1404654269, class=com.tribbloids.spookystuff.SpookyConf$$anonfun$$lessinit$greater$default$6$1}, locationContextEnabled=false, webStorageEnabled=false, browserName=phantomjs, takesScreenshot=true, driverName=ghostdriver, javascriptEnabled=true, cssSelectorsEnabled=true}]
Session ID: e508a590-0ce6-11e7-acc0-c9a91156371c
   +- org.openqa.selenium.remote.ScreenshotException: Screen shot has been taken
Build info: version: '2.53.0', revision: '35ae25b1534ae328c771e0856c93e187490ca824', time: '2016-03-15 10:43:46'
System info: host: 'peng-tribbloids', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-67-generic', java.version: '1.8.0_121'
Driver info: driver.version: RemoteWebDriver
      +- org.openqa.selenium.WebDriverException: {"errorMessage":"Can't find variable: Sizzle","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"75","Content-Type":"application/json; charset=utf-8","Host":"localhost:18164","User-Agent":"Apache-HttpClient/4.5.2 (Java/1.8.0_121)"},"httpVersion":"1.1","method":"POST","post":"{\"script\":\"return Sizzle(arguments[0])\",\"args\":[\"ul.PostTabs li.active a\"]}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/e508a590-0ce6-11e7-acc0-c9a91156371c/execute"}}
Build info: version: '2.53.0', revision: '35ae25b1534ae328c771e0856c93e187490ca824', time: '2016-03-15 10:43:46'
System info: host: 'peng-tribbloids', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-67-generic', java.version: '1.8.0_121'
Driver info: driv
This is another Sizzle compatibility error (0.3.2 works because the Sizzle selector wasn't used). But apparently the error you mentioned has been fixed in a previous patch.
Could you make sure that you have updated the project to the latest master?
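[In case it helps, a hedged sketch of the usual workaround for a "Can't find variable: Sizzle" error, assuming the cause is simply that the page has no Sizzle library loaded, so it must be injected before Sizzle-based selectors are evaluated; not SpookyStuff's actual fix.]

import org.openqa.selenium.JavascriptExecutor
import org.openqa.selenium.phantomjs.PhantomJSDriver

val driver = new PhantomJSDriver()
driver.get("http://example.com") // placeholder URL
// Assuming a local copy of the Sizzle library:
val sizzleJs = scala.io.Source.fromFile("sizzle.min.js").mkString
driver.asInstanceOf[JavascriptExecutor].executeScript(sizzleJs)
// Now scripts like "return Sizzle(arguments[0])" can resolve selectors.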
Roger Giuffrè
@webroboteu
OK, I will retry.
Vishesh Mangla
@Teut2711
How do you web-scrape websites where a login form pops up, and where we can either fill in the form or close the login window?