    Prosper Burq
    @waterponey
    Hi, I'm currently looking at the docker-spark repo and I'm wondering how it behaves, since it only uses the standalone Spark cluster manager instead of a regular cluster manager like YARN or Mesos. Do you plan to integrate one of them later?
    Ivan Ermilov
    @earthquakesan

    Hi @waterponey,

    I have played around with Spark/YARN integration and have a working setup (not pushed to BDE repos). If you can describe your use case (i.e. why you need YARN/Mesos when deploying Spark in docker), that would be helpful. In my case, I needed YARN for the history server and that's pretty much it.

    Prosper Burq
    @waterponey
    Well, you can deploy a history server for Spark without YARN. It's just that in terms of resource usage, I'm not sure how you can free resources in this setup. Also, if you're doing streaming, do you split the resources at the Swarm level or at the standalone Spark level (meaning, do you spawn another standalone cluster)?
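    (For reference, a minimal sketch of Spark's standalone history server setup, no YARN involved; the HDFS event-log path is a placeholder:)

        # enable event logging so finished apps show up in the history server
        echo "spark.eventLog.enabled true" >> $SPARK_HOME/conf/spark-defaults.conf
        echo "spark.eventLog.dir hdfs://namenode:8020/spark-logs" >> $SPARK_HOME/conf/spark-defaults.conf
        echo "spark.history.fs.logDirectory hdfs://namenode:8020/spark-logs" >> $SPARK_HOME/conf/spark-defaults.conf
        # start the standalone history server (serves a UI on port 18080)
        $SPARK_HOME/sbin/start-history-server.sh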
    Ivan Ermilov
    @earthquakesan
    If you are running a couple of other frameworks on the same hardware, true - you need YARN then
    in our workflows, however, we deploy and destroy the frameworks as necessary
    e.g. if it's just a batch job, we can spawn a cluster with the necessary resources, run the job and then destroy it
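    (A sketch of that ephemeral workflow, assuming a compose file describing the Spark cluster; the service name, paths, and job class are placeholders:)

        docker-compose up -d          # spawn a cluster sized for the job
        docker-compose exec spark-master \
          /spark/bin/spark-submit --class com.example.BatchJob /app/job.jar
        docker-compose down           # destroy the cluster once the job is done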
    Prosper Burq
    @waterponey
    ok I see. But I guess you can't use dynamic allocation if it's a "one job per cluster" kind of thing, or do you have a way to do so?
    Ivan Ermilov
    @earthquakesan
    you can control application cores and memory with spark.cores.max and spark.executor.memory, but the apps will run in FIFO order
    To have several apps run in parallel, you will need to use YARN.
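    (For illustration, a spark-submit sketch that caps one application this way; the master URL, class, and jar are placeholders:)

        /spark/bin/spark-submit \
          --master spark://spark-master:7077 \
          --conf spark.cores.max=8 \
          --conf spark.executor.memory=4g \
          --class com.example.MyApp /app/my-app.jar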
    I did not understand your question regarding streaming, can you elaborate?
    Prosper Burq
    @waterponey
    I was wondering how it would work if you wanted to have several concurrent streaming applications. One possibility would be to statically declare one Spark cluster per app and define the resources that Swarm would allocate to each Spark cluster at startup. The other would be to define one specific "streaming Spark cluster" in Swarm and let the Spark resource manager do the allocation for each streaming app.
    Ivan Ermilov
    @earthquakesan
    you will still need to specify the amount of resources per executor for your apps, even when you use a resource manager
    Prosper Burq
    @waterponey
    I'm not sure which level of resource manager you're talking about, Swarm or Spark?
    Ivan Ermilov
    @earthquakesan
    do you mean docker swarm?
    Prosper Burq
    @waterponey
    yes
    I mean you deploy a docker container containing a Spark cluster on a Swarm cluster, or am I missing something?
    Ivan Ermilov
    @earthquakesan
    The proper setup would be as follows:
    1. Deploy a Spark cluster with YARN/Mesos, restrict CPU/memory usage in Swarm, and configure the resource managers to see that limit (unfortunately you need to do it manually or with Ansible). Let's say you give it 64 cores and 256G of RAM.
    2. When deploying Spark apps inside your cluster, restrict CPU usage and memory per application (e.g. number of executors, cores per executor, memory per executor). If you want to run 2 applications, then you will do a setup where each app consumes 32 cores and 128G of RAM.
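    (A sketch of the Swarm-side limit from step 1; the service name, image tag, and values are illustrative:)

        docker service create \
          --name spark-master \
          --limit-cpu 64 \
          --limit-memory 256g \
          bde2020/spark-master:2.3.0-hadoop2.7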
    that's correct, I am clarifying that you are not using swarm as a general term for a cluster %)
    Prosper Burq
    @waterponey
    ok so now I'm confused, why would I need to keep docker swarm if I have to deploy YARN or Mesos?
    Ivan Ermilov
    @earthquakesan
    you deploy YARN in swarm as well
    that's just another container
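    (i.e. something along these lines; the image tag and the overlay network name are assumptions:)

        docker network create -d overlay spark-net
        docker service create --name resourcemanager --network spark-net \
          bde2020/hadoop-resourcemanager:2.0.0-hadoop2.7.4-java8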
    Stephen Baynham
    @CannibalVox
    Hey I'm sorry, I'm trying to run https://github.com/big-data-europe/docker-hive and the data nodes are failing with Datanode denied communication with namenode because hostname cannot be resolved (ip=10.0.0.10, hostname=10.0.0.10): DatanodeRegistration(0.0.0.0:50010, datanodeUuid=d64d014a-4467-4065-95e6-596590148f75, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-ab7b488d-d3c3-470d-aea0-c6e6ac6708b9;nsid=2064164277;c=0)
    If you have any clarity I'd appreciate it!
    Ivan Ermilov
    @earthquakesan

    Hi Stephen! @CannibalVox

    Which docker-compose are you using? From master branch?

    SasidharT
    @SasidharT
    Hi guys, has anyone deployed a Spark cluster using docker-compose.yml on ECS in AWS?
    Anton Kulaga
    @antonkulaga
    @earthquakesan is it possible for you to publish a Docker image with Spark 2.3.1? I have some minor dependency clashes with the 2.3.0 container when using SANSA-RDF dependencies
    Anton Kulaga
    @antonkulaga
    Any plans for Alluxio docker-swarm configs?
    comboo
    @wings-xue
    I can't open http://<dockerhadoop_IP_address>:8088/, and when I look at run.sh, there is no run command for YARN?
    Peter Viskovics
    @jr.visko_gitlab
    Hi,
    when running docker pull bde2020/hive I face this problem:
    hive_1 | mkdir: Call From 57c424f72c21/172.22.0.2 to 57c424f72c21:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    hive_1 | mkdir: Call From 57c424f72c21/172.22.0.2 to 57c424f72c21:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    hive_1 | chmod: Call From 57c424f72c21/172.22.0.2 to 57c424f72c21:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    hive_1 | chmod: Call From 57c424f72c21/172.22.0.2 to 57c424f72c21:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    and beeline cannot connect to localhost:10000
    ➜ epam-qa-metrics git:(master) ✗ docker-compose exec hive bash
    root@57c424f72c21:/opt# /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    Connecting to jdbc:hive2://localhost:10000
    18/11/14 13:11:30 [main]: WARN jdbc.HiveConnection: Failed to connect to localhost:10000
    Could not open connection to the HS2 server. Please check the server URI and if the URI is correct, then ask the administrator to check the server status.
    Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
    Beeline version 2.3.2 by Apache Hive
    beeline> CREATE TABLE pokes (foo INT, bar STRING);
    No current connection
    this is my docker version:
    ➜ ~ docker -v
    Docker version 18.06.1-ce, build e68fc7a
    can anyone help please?
    Peter Viskovics
    @jr.visko_gitlab
    The issue is resolved, but I think it may be worth mentioning on bde2020/hive that it is just part of the whole project, which can be found at https://github.com/big-data-europe/docker-hive
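    (For anyone hitting the same errors, a sketch of running the whole project rather than the single image; the hive-server service name is assumed to match that repo's compose file:)

        git clone https://github.com/big-data-europe/docker-hive
        cd docker-hive
        docker-compose up -d
        # once the namenode is up, beeline can connect via the hive-server container
        docker-compose exec hive-server /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000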
    Andrii Gakhov
    @gakhov
    Hi guys! For anyone interested in learning space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications, take a look at my recently published book "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 978-3748190486). In this book, you can find algorithms and data structures for Membership querying (Bloom filter, Counting Bloom filter, Quotient filter, Cuckoo filter), Cardinality (Linear counting, probabilistic counting, LogLog, HyperLogLog, HyperLogLog++), Frequency (Majority algorithm, Frequent, Count Sketch, Count-Min Sketch), Rank (Random sampling, q-digest, t-digest), and Similarity (LSH, MinHash, SimHash). Check it out on Amazon or on the book's webpage.
    Zhoodar
    @zhoodar
    Hello there, is anyone familiar with this issue: big-data-europe/docker-hadoop#38?
    It appeared when I tried to write data into a remote HDFS.
    purbanow
    @purbanow
    Hi guys, when running a job with a 1 Spark worker + Hadoop setup everything goes well, but when I'm trying to run with 2 workers I'm getting: JvmPauseMonitor: Detected pause in JVM or host machine (eg GC)
    any ideas?
    Juan Santillana
    @ratasxy_twitter
    Hi, I have a question about big-data-europe/hbase-docker regarding the use of port 9090.
    Should I just expose the port?
    Xining Li
    @xiningli
    Hello, I am new here.
    Diego Quintana
    @diegoquintanav
    Should I just expose the port?
    to connect a client to hdfs from the docker host, yes
    @ratasxy_twitter ^

    I'm also running into problems with that repo. How should I connect a client using the Java API?

                    import org.apache.hadoop.conf.Configuration;
                    import org.apache.hadoop.hbase.HBaseConfiguration;
                    import org.apache.hadoop.hbase.client.HBaseAdmin;
                    // point the client at the dockerized ZooKeeper quorum
                    Configuration config = HBaseConfiguration.create();
                    config.set("hbase.zookeeper.quorum", "localhost");
                    config.set("hbase.zookeeper.property.clientPort", "2181");
                    HBaseAdmin.checkHBaseAvailable(config);

    Returns org.apache.hadoop.hbase.MasterNotRunningException: java.net.UnknownHostException: can not resolve hbase-master,16000,1592487871967

    Diego Quintana
    @diegoquintanav
    I'm getting that error
    Diego Quintana
    @diegoquintanav
    any ideas?
    billsteve
    @billsteve
    could you "ping hbase-master" or "telnet hbase-master 16000"?
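    (If the name doesn't resolve from the docker host, one common workaround is mapping the container hostnames to localhost, assuming the ports are published there and the names match your compose file:)

        # hostnames are assumptions - use the ones from your docker-compose.yml
        echo "127.0.0.1 hbase-master hbase-region zookeeper" | sudo tee -a /etc/hosts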
    luigi-asprino
    @luigi-asprino
    Hi all, can anyone help me with this issue: big-data-europe/docker-hadoop#79?
    Anatoly Danilov
    @anatolyD
    Hey guys, 3.0.0 is announced in the README.md, although I fail to find it in the Docker registry. Has anyone had it working?