Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark
Codeship Error: tribbloid/spookystuff/54099151
add some missing test cases and comments (master) tribbloid/spookystuff@fe78a50 by tribbloid
Codeship Error: tribbloid/spookystuff/54097218
Now it is logically impossible to create a new batch without registering it for cleanup.
Cleanable.batches delegates to its lifespan
disable printing passed tests in test-reports.sh (master) tribbloid/spookystuff@fbade73 by tribbloid
Codeship Error: tribbloid/spookystuff/54092556
ghostdriver log set to a file
retry validateBeforeAndAfterAll
selenium version upgrade to 4.1.4
dependent ecosystems are also upgraded accordingly (master) tribbloid/spookystuff@0ef2098 by tribbloid
Codeship Error: tribbloid/spookystuff/54092550
ghostdriver log set to a file
retry validateBeforeAndAfterAll (selenium4/dev1) tribbloid/spookystuff@d9c8501 by tribbloid
Codeship Success: tribbloid/spookystuff/54091735
disable all logging in BeforeAndAfterShipping
default console log level changed to WARN (phantomjs-fix/dev2) tribbloid/spookystuff@9fec992 by tribbloid
Codeship Success: tribbloid/spookystuff/54090828
instancesShouldBeClean condition is improved to also check Compound types (phantomjs-fix/dev2) tribbloid/spookystuff@6b0d911 by tribbloid
Codeship Success: tribbloid/spookystuff/54090040
fix a bug of Compound Lifespan
should be fully working now
TestHelper lifespan changed to use Hadoop ShutdownHookManager
reorganize Lifespan impls
Cleanable API has been rewritten to avoid redundant registration of clean up hooks
EXPERIMENTAL
a unit test no longer relies on manual check (phantomjs-fix/dev2) tribbloid/spookystuff@7ef1a72 by tribbloid
Codeship Error: tribbloid/spookystuff/54090004
fix a bug of Compound Lifespan
should be fully working now
TestHelper lifespan changed to use Hadoop ShutdownHookManager
reorganize Lifespan impls
Cleanable API has been rewritten to avoid redundant registration of clean up hooks
EXPERIMENTAL
a unit test no longer relies on manual check (phantomjs-fix/dev2) tribbloid/spookystuff@5c4f822 by tribbloid
Can you explain, in general Scala terms, how a string-typed expression is converted to an Extractor[Any] and then consumed by the wget action during parsing?
// Shorthand of fetch
def wget(
    ex: Extractor[Any],
    filter: DocFilter = Const.defaultDocumentFilter,
    failSafe: Int = -1,
    genPartitioner: GenPartitioner = spooky.conf.defaultGenPartitioner
): FetchedDataset = {
  // wrap the extractor in a Wget trace; retry on the cluster if failSafe is set
  var trace: Set[Trace] = Wget(ex, filter)
  if (failSafe > 0) trace = ClusterRetry(trace, failSafe)
  this.fetch(
    trace,
    genPartitioner = genPartitioner
  )
}
Sure, :)
An Extractor applied to a Spark SQL Row doesn't need a concrete type of its own; the generic type parameter only helps the implicit functions in the DSL.
So Wget can take an Extractor of any type and apply it to each data row to get the URL, then use it to fetch web pages. E.g., for a DataFrame that looks like:
 | url
---|---
1 | http://a
2 | http://b
you can write:
Wget('url)
to fetch 2 pages into 2 data rows.
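To make that lifting concrete, here is a minimal self-contained sketch of the implicit-conversion pattern; the names below (Row, KeyExtractor, symbolToExtractor, wgetLike) are illustrative stand-ins, not spookystuff's actual identifiers:

import scala.language.implicitConversions

object ExtractorSketch extends App {

  case class Row(data: Map[Symbol, Any])

  // an extractor is essentially a function from a data row to an optional value
  trait KeyExtractor[+T] extends (Row => Option[T])

  // lifts a Symbol such as 'url into an extractor that looks it up in each row
  implicit def symbolToExtractor(key: Symbol): KeyExtractor[Any] =
    new KeyExtractor[Any] {
      def apply(row: Row): Option[Any] = row.data.get(key)
    }

  // an API like wget only needs a KeyExtractor[Any], so the implicit applies
  def wgetLike(ex: KeyExtractor[Any]): Seq[String] = {
    val rows = Seq(Row(Map('url -> "http://a")), Row(Map('url -> "http://b")))
    rows.flatMap(row => ex(row).map(_.toString)) // one URL per matching row
  }

  println(wgetLike('url)) // List(http://a, http://b): the Symbol is lifted implicitly
}

The real DSL does the same kind of implicit lifting, just with richer types: the compiler picks the conversion because the API parameter demands an Extractor[Any].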
Does that answer your question?
Yours, Peng
Forgive me, the flow is still not clear to me. I'm studying Scala and DSLs.
Where can I find the Extractor type, and at what point is the URL string converted to an Extractor?
Can you walk me, step by step, through the flow from the wget call to the construction of the Extractor? Do I first need to understand the logic of Spark SQL?
Do you have any pointers (documentation, handouts, or other references) that would help me read your sources properly?
Thank you
var result = sc.parallelize(scala.collection.mutable.Seq(
"http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA"
),1)
.flatMap(url => scala.collection.mutable.Seq("#comments").map(tag => tag+"\t"+url+tag))
.tsvToMap("type\turl")
.fetch(
Visit("http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA")
+> Click("ul.PostTabs li.active a")
+> WaitFor("div.tab-pane.active li.review")
)
.flatExtract(S"div.tab-pane.active li.review")(
x"${A"li.comment div.title".text}: ${A"li.comment div.body".text}" ~ 'comment,
A"time.posted-on".text ~ 'date_status,
A"div.details > div.aside".text ~ 'stars,
A"div.details > div.Helpful div.vote-stats".text ~ 'useful,
A"div.byline a:nth-of-type(1)".text ~ 'user_name,
A"span.type".text ~ 'user_location,
A"div.byline a:nth-of-type(2)".text ~ 'review_count
)
.toDF()
result.rdd.collect()
I get this error when I use fetch with a parallel collection:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 13.0 (TID 49) had a not serializable result: scala.collection.immutable.MapLike$$anon$2
Serialization stack:
- object not serializable (class: scala.collection.immutable.MapLike$$anon$2, value: Map('type -> #comments, 'url -> http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA#comments))
- field (class: com.tribbloids.spookystuff.row.DataRow, name: data, type: interface scala.collection.immutable.Map)
- object (class com.tribbloids.spookystuff.row.DataRow, DataRow(Map('type -> #comments, 'url -> http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA#comments),Some(c101ea74-1c5b-4e9f-a7ee-9c8ccfab3c51),0,false))
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at HelloWorld$.main(test.scala:55)
at HelloWorld.main(test.scala)
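(Side note: the scala.collection.immutable.MapLike$$anon$2 in this trace is the well-known Scala 2.10–2.12 pitfall where mapValues/filterKeys return a lazy, non-serializable view of the original map. Whether tsvToMap builds the row's Map that way is an assumption on my part, but the sketch below reproduces the failure and the usual workaround independently of spookystuff; MapViewPitfall and serializes are illustrative names.)

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object MapViewPitfall extends App {

  // returns true iff Java serialization accepts the object
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // mapValues returns a lazy view: an anonymous inner class of immutable.MapLike
  val lazyView: Map[String, String] = Map("url" -> "http://a").mapValues(_.trim)
  println(lazyView.getClass.getName) // typically scala.collection.immutable.MapLike$$anon$2
  println(serializes(lazyView))      // false: this is what Spark chokes on

  // forcing the view into a strict Map restores serializability
  val strict: Map[String, String] = lazyView.map(identity)
  println(serializes(strict))        // true
}

If that is indeed the cause here, appending .map(identity) wherever the row's Map is produced should let collect() succeed.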
Roger
.ActionException:
{
| Visit(http://www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA,0 seconds,true)
+> Click(ul.PostTabs li.active a,0 seconds,true)
}
Snapshot: saved to: /home/peng/git/datapassport/spookystuff/core/temp/spooky-unit/errorDump/Visit/http/www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA/0_seconds/true/ErrorDump/Click/ul.PostTabs_li.active_a/0_seconds/true/--1854029617/d38ce84b-9a78-4ff8-9466-5da92b18d6a5
Screenshot: saved to: /home/peng/git/datapassport/spookystuff/core/temp/spooky-unit/errorDump/Visit/http/www.urbanspoon.com/r/5/1435892/restaurant/Downtown/Bottega-Louie-LA/0_seconds/true/ErrorScreenshot/Click/ul.PostTabs_li.active_a/0_seconds/true/--1854029617/24ceb323-9f86-4971-a2b7-84e0b57eb01d
+- org.openqa.selenium.WebDriverException: {"errorMessage":"Can't find variable: Sizzle","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"75","Content-Type":"application/json; charset=utf-8","Host":"localhost:18164","User-Agent":"Apache-HttpClient/4.5.2 (Java/1.8.0_121)"},"httpVersion":"1.1","method":"POST","post":"{\"script\":\"return Sizzle(arguments[0])\",\"args\":[\"ul.PostTabs li.active a\"]}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/e508a590-0ce6-11e7-acc0-c9a91156371c/execute"}}
Command duration or timeout: 645 milliseconds
Build info: version: '2.53.0', revision: '35ae25b1534ae328c771e0856c93e187490ca824', time: '2016-03-15 10:43:46'
System info: host: 'peng-tribbloids', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-67-generic', java.version: '1.8.0_121'
Driver info: org.openqa.selenium.phantomjs.PhantomJSDriver
Capabilities [{applicationCacheEnabled=false, rotatable=false, phantomjs.page.settings.loadImages=false, handlesAlerts=false, databaseEnabled=false, version=2.1.1, platform=LINUX, browserConnectionEnabled=false, proxy={proxyType=direct}, nativeEvents=true, acceptSslCerts=false, phantomjs.page.settings.resourceTimeout=60000, driverVersion=1.2.0, phantomjs.page.settings.userAgent={hCode=1404654269, class=com.tribbloids.spookystuff.SpookyConf$$anonfun$$lessinit$greater$default$6$1}, locationContextEnabled=false, webStorageEnabled=false, browserName=phantomjs, takesScreenshot=true, driverName=ghostdriver, javascriptEnabled=true, cssSelectorsEnabled=true}]
Session ID: e508a590-0ce6-11e7-acc0-c9a91156371c
+- org.openqa.selenium.remote.ScreenshotException: Screen shot has been taken
Build info: version: '2.53.0', revision: '35ae25b1534ae328c771e0856c93e187490ca824', time: '2016-03-15 10:43:46'
System info: host: 'peng-tribbloids', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-67-generic', java.version: '1.8.0_121'
Driver info: driver.version: RemoteWebDriver
+- org.openqa.selenium.WebDriverException: {"errorMessage":"Can't find variable: Sizzle","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"75","Content-Type":"application/json; charset=utf-8","Host":"localhost:18164","User-Agent":"Apache-HttpClient/4.5.2 (Java/1.8.0_121)"},"httpVersion":"1.1","method":"POST","post":"{\"script\":\"return Sizzle(arguments[0])\",\"args\":[\"ul.PostTabs li.active a\"]}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/e508a590-0ce6-11e7-acc0-c9a91156371c/execute"}}
Build info: version: '2.53.0', revision: '35ae25b1534ae328c771e0856c93e187490ca824', time: '2016-03-15 10:43:46'
System info: host: 'peng-tribbloids', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-67-generic', java.version: '1.8.0_121'
Driver info: driv