`sparkR`: this is not the right channel… if you are using `sparklyr` then we can definitely help. To copy large data frames, you could try this feature: rstudio/sparklyr#1762, which attempts to copy data incrementally, as in:
```r
# specify a callback that uploads subsets of the dataset instead of loading all at once
iris_tbl <- copy_to(sc, function() iris, overwrite = T)
```
`copy_to()` was originally designed to be a convenience function for copying secondary tables that are not necessarily large.
```r
my_file <- spark_read_csv(sc, "my-file", path = "./testfile.csv")
```

results in an error:

```
Error: org.apache.spark.sql.AnalysisException: Invalid view name: my-file;
```

Without the dash, it should work perfectly.
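Spark registers the CSV as a temporary view, and a dash is not valid in a view name. A minimal sketch of the workaround, assuming the same `sc` connection and file as above, is simply to pick a dash-free name for the view (the `path` argument may still contain dashes):

```r
library(sparklyr)

# the second argument becomes the Spark view name and must be a valid
# SQL identifier; an underscore avoids the AnalysisException
my_file <- spark_read_csv(sc, "my_file", path = "./testfile.csv")
```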
`spark_apply` and the benchmarks here look great. However, even though the GitHub README instructs to use `install.packages("arrow")`, I could not find the package on CRAN. Any links/updates on the current status would be appreciated! Thanks.
@javierluraschi thanks, I see `arrow` is now on CRAN, congratulations! Using the CRAN version, I come across errors with simple operations, for example:
```r
config <- sparklyr::spark_config()
sc <- sparklyr::spark_connect(master = "local", config = config)
mtcars_sp <- dplyr::copy_to(sc, datasets::mtcars, overwrite = TRUE)

# Works fine
if ("arrow" %in% .packages()) detach("package:arrow")
mtcars_sp %>% sparklyr::spark_apply(function(df) df) %>% collect()

# Error
library(arrow)
mtcars_sp %>% sparklyr::spark_apply(function(df) df) %>% collect()
```
Looking at the worker log, this seems to be relevant:
```
ERROR sparklyr: RScript (6891) terminated unexpectedly: object 'as_tibble' not found
```
R version 3.6.0, x86_64-redhat-linux-gnu (64-bit)
Packages: arrow_0.14.1 dplyr_0.8.3 sparklyr_1.0.1
I run the `digest` function on all partitions of the `tbl_spark`, collect the hashes into a data.frame, and apply `digest` to it again. It works, but is extremely slow.
You could instead `mutate(contents = concat_ws(',', collect_list(column_name)))` and then use `sha2()` from dplyr as well to compute the SHA2 hash, keeping the whole computation inside Spark.
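A sketch of that approach, assuming a local Spark connection and using `mtcars` with the `mpg` column purely for illustration. Note that `concat_ws()`, `collect_list()`, and `sha2()` are Spark SQL functions that dplyr translates rather than R functions, and that `summarise()` is used here because `collect_list()` is an aggregate:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_sp <- copy_to(sc, mtcars, overwrite = TRUE)

# collapse one column to a single comma-separated string inside Spark,
# then hash it with SHA-256; only the final hash is collected to R.
# caveat: collect_list() does not guarantee row order, so arrange()
# first if the hash must be stable across runs
mtcars_sp %>%
  summarise(contents = concat_ws(",", collect_list(as.character(mpg)))) %>%
  mutate(hash = sha2(contents, 256L)) %>%
  collect()
```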