Kevin Bates
@kevin-bates
Is there a reason you need 5 minute startup times?
Is this initial-image-seeding stuff? If so, you should configure the KernelImagePuller - unless you have a restriction on daemonsets.
ms-lolo
@ms-lolo
it is initial image seeding stuff. we have something equivalent to the kernel image puller, but it does not ensure images are always 100% fresh
for example, moments after the cluster scales up we have nodes available for kernels but not yet primed with all our different kernel images
similar to the few moments after our master branch builds, when our kernel image is updated
oh, and when people are developing new kernel images which are not configured to be pulled on every node by default
I am on EG 2.2 so looking forward to the better responsiveness improvements with the async changes
ms-lolo
@ms-lolo
@kevin-bates does jupyter lab lack a cli option for the KERNEL_LAUNCH_TIMEOUT config? I'm checking jupyter-lab --help-all but only see these two options available:
--GatewayClient.connect_timeout=<Float>
    Default: 40.0
    The time allowed for HTTP connection establishment with the Gateway server.
    (JUPYTER_GATEWAY_CONNECT_TIMEOUT env var)

--GatewayClient.request_timeout=<Float>
    Default: 40.0
    The time allowed for HTTP request completion.
    (JUPYTER_GATEWAY_REQUEST_TIMEOUT env var)
oh I see. your PR creates that option based on the configs above
Kevin Bates
@kevin-bates
correct. iirc, KERNEL_LAUNCH_TIMEOUT is solely env-based (if set by user)
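For anyone following along, a minimal sketch of wiring that up (the gateway URL and timeout value here are placeholders, not taken from this conversation):

# Set KERNEL_LAUNCH_TIMEOUT in the notebook/lab server's environment; as a
# KERNEL_-prefixed variable it is flowed to EG with the kernel start request.
export KERNEL_LAUNCH_TIMEOUT=120
jupyter lab --GatewayClient.url=http://my-eg-host:8888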
ms-lolo
@ms-lolo
interesting ok, thanks
I think we finally have a configuration up and running that allows me to iterate quickly on lab/eg instance configurations. it took eliminating our jupyter hub unfortunately
Kevin Bates
@kevin-bates
hmm - I’m curious why that was necessary (removal of hub).
ms-lolo
@ms-lolo

we spoke in the past about the issues with having all requests funnel through a single proxy. there seems to be a pattern of adding components to the jupyter stack that mostly act as reverse proxies to components further down (hub → lab → eg → kernel), which introduces a lot of stability issues. hub sitting at the very top of this funnel made it responsible for a lot of problems. having every cell execution run through the hub gave us latency issues, and a crash of the hub instance generally meant everyone lost their lab instances, which caused people to lose all their kernels. deploying a change to the hub is a huge hassle for this same reason.

ultimately, I don't want something monitoring my lab instances. kubernetes has native functionality to represent services, monitor them, and ensure they remain healthy. I don't need a custom solution for jupyter pods that differs from the other services running in my cluster.

on top of that, we have a central auth solution that covers our entire cluster and any service running inside it, which means we have little use for the hub's multi-user features. I don't need to handle authentication in jupyter

it seems like there's generally a lot of tightly coupled complexity flowing through all of these components. I've been trying to get to a stable enough point to be able to try and contribute on some of these issues. at the very least, we've built some tooling to deploy custom kernelspecs and custom lab/eg images to my instance alone without affecting everyone else on the cluster. so I'm hoping I can actually create some custom builds now but I haven't tried packaging modified eg/lab code quite yet :)
ms-lolo
@ms-lolo
@kevin-bates 2.2 seems to be a huge improvement, but I've made a custom kernel that has a time.sleep(120) during startup to simulate slow launches, and I'm noticing that if I launch two kernels, EG does not begin provisioning the second one until the first has fully started. The timeout setting has improved things, though, and eventually both kernels do run. am I missing some settings to make EG handle more than one request at a time?
seeing a gateway timeout error now that I'm trying to track down. unsure if this is on jupyter or one of the more general proxies
ms-lolo
@ms-lolo
it looks like calls to api/sessions are failing with the gateway timeout if EG is in the middle of provisioning a kernel
Kevin Bates
@kevin-bates
Can you please open an issue in EG and include pertinent logs (EG and Lab w/ DEBUG) and other information (e.g., what facet of the custom kernel is sleeping for 2 minutes relative to its launch, etc.) and we’ll try to work through this? For example, placing a 2-minute sleep that delays the kernel process starting by 2 minutes is not a realistic simulation. However, placing that same delay after the kernel has been launched is a better one.
ms-lolo
@ms-lolo
yeah I'll try to gather some useful logs and file a ticket. it mostly looks like there is still some single-threadedness somewhere during the launching of kernels, before the connection between the kernel and EG is made
Kevin Bates
@kevin-bates
Ok - we’ll take a look. I know I’ve started 3-5 kernels simultaneously (depending on my env’s capacity), where each takes about 5-15 seconds to fully start in a YARN or k8s cluster, and all the startup messages are interleaved in the EG logs (something that was not possible prior to EG 2.2).
ms-lolo
@ms-lolo
@kevin-bates after enabling all the debug logs and poking around, it seems like the above timeout configs were the only real issue. I noticed EG seemed to be slowing down while it was handling slow kernel starts, but I doubled the CPU/memory on those EG pods and I'm seeing the interleaved logs now. things look pretty good. thanks again for the 2.2 work
Kevin Bates
@kevin-bates
That’s great news @ms-lolo, thank you for the update and the focused examination!
coder-forever-2020
@coder-forever-2020
Hello, I have integrated JupyterHub with Enterprise Gateway, both in k8s. I can start remote kernels from JupyterHub, yet in JupyterHub the kernels are not interactive.
That is, if I input some code like "print('test')" into the ipynb web UI, the code is not returning any value.
Is there any suggestion to make the remote kernel interactive?
Thanks!
Kevin Bates
@kevin-bates
Hi @coder-forever-2020. This implies there’s a disconnect between the Notebook/Lab instance and EG, or between EG and the remote kernel. If the kernel names presented in the list appear to be kubernetes related, then your issue is probably the latter. Another factor that can come into play is that the initial seeding of the kernel images can take a while; if the KernelImagePuller daemonset is not working correctly, that can lead to timeout issues when attempting initial kernel startups on a given node. All in all, it can take a little bit to get all the communications working correctly.
Please take a look at your EG pod logs and see if anything can be gleaned from them as to what might be going on. Should you still be stuck, please open an issue in the repo and provide the output of the EG pod logs, as well as a screenshot or two of the notebook that includes the kernel name, its status, and the executed cell. Thanks.
coder-forever-2020
@coder-forever-2020
Thanks @kevin-bates will do.
coder-forever-2020
@coder-forever-2020
@kevin-bates and all, it turns out tornado 6.x is not compatible with Jupyter Notebook 6. I got this warning: "RuntimeWarning: coroutine 'WebSocketHandler.get' was never awaited"
After I downgraded tornado to 5.1.1, the whole integration with the remote kernel works like a charm. Thanks everyone for the effort you put into this amazing project.
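For anyone hitting the same warning, a minimal sketch of the workaround described above, i.e. pinning tornado in the client-side notebook image (5.1.1 is simply the version reported working here):

# Rebuild or patch the notebook image with the older tornado release.
pip install "tornado==5.1.1"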
Kevin Bates
@kevin-bates
Hmm - I haven't seen that recently. Please ensure you're running EG 2.2 and Notebook 6.1+. Is that being produced on the client or EG side of things? Please provide a complete traceback of the exception and surrounding log entries.
coder-forever-2020
@coder-forever-2020
Yah, the tornado version issue is under EG 2.2, and below are the detailed jupyter component versions.
jupyter core : 4.6.3
jupyter-notebook : 6.1.3
qtconsole : not installed
ipython : 7.17.0
ipykernel : 5.3.4
jupyter client : 6.1.6
jupyter lab : 2.2.5
nbconvert : 5.6.1
ipywidgets : not installed
nbformat : 5.0.7
traitlets : 4.3.3
The warning is produced in JupyterHub's notebook pod.
Kevin Bates
@kevin-bates
Hmm - I guess I was under the impression the issue was occurring where EG is running, but your last comment implies it's where notebook is running - which makes sense. Does that fit your scenario?
Could you please provide the traceback you’re seeing? And is the image available for me to pull and look at? Thanks.
coder-forever-2020
@coder-forever-2020
@kevin-bates The Jupyter Notebook Docker image that got the warning is jupyter/minimal-notebook:latest. The traceback only has the warning "RuntimeWarning: coroutine 'WebSocketHandler.get' was never awaited"
May I also check: in EG v2.2, can it be configured to allow the enterprise-gateway pod to pass arbitrary environment variables to the remote kernel? The use case is dynamically provisioning the kernel's user account for access control.
Kevin Bates
@kevin-bates

@coder-forever-2020 - where are you using jupyter/minimal-notebook:latest? Is this for your client-side notebook server? Please include the full traceback rather than just the "was never awaited" message.

Regarding additional envs, besides the unconditional KERNEL_-prefixed envs, you can add a list of env names to EG_ENV_WHITELIST, or set --EnterpriseGatewayApp.env_whitelist from the CLI:

--EnterpriseGatewayApp.env_whitelist=<List>
    Default: []
    Environment variables allowed to be set when a client requests a new kernel.
    (EG_ENV_WHITELIST env var)
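As a rough sketch of how that might look (MY_ACCESS_TOKEN is a hypothetical variable name; the comma-separated env-var form is the usual convention, but verify against your EG version):

# On the EG pod/deployment: whitelist the variable so client-supplied values are accepted.
export EG_ENV_WHITELIST=MY_ACCESS_TOKEN
# Or equivalently on the EG command line:
jupyter enterprisegateway --EnterpriseGatewayApp.env_whitelist="['MY_ACCESS_TOKEN']"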
coder-forever-2020
@coder-forever-2020
Thanks a lot
coder-forever-2020
@coder-forever-2020
@kevin-bates If we would like to allow users to specify kernel parameters, e.g. the number of GPUs, when requesting a remote kernel, do you have any suggestions? Thanks.
Kevin Bates
@kevin-bates
This depends on whether or not you own the UI that would allow the user to specify such parameters. However, the answer (at this time) is still the same... short of adding proper support for parameterized kernel launch, the only way this can be accomplished is to use KERNEL_ env variables, then modify the kernelspec files to use those values accordingly.
If you own the UI, then you can set the KERNEL_ values (e.g., KERNEL_NUM_GPUS=2) into the env: stanza of the kernel start request body. If you don't have control over the UI, then those envs would need to be present in the notebook process making the kernel start request since the gateway logic automatically flows any KERNEL_ envs.
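A minimal sketch of what that could look like as a direct kernel start request (the host, kernelspec name, and values are placeholders; a UI you control would build the same body):

# POST a kernel start request with KERNEL_ values in the env stanza.
curl -X POST http://my-eg-host:8888/api/kernels \
  -H 'Content-Type: application/json' \
  -d '{"name": "python_kubernetes", "env": {"KERNEL_USERNAME": "jovyan", "KERNEL_NUM_GPUS": "2"}}'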
Rick Lamers
@ricklamers

I noticed the latest version of JupyterLab (3.0.0-rc.8) doesn't have the check_origin() that was added to the gateway websocket handler (https://github.com/jupyter/jupyter_server/blob/master/jupyter_server/gateway/handlers.py#L35).

In general, does Enterprise Gateway already support JupyterLab 3.0.0? Because even with the above patched, I couldn't get kernel messages flowing from JupyterLab (3.0.0-rc.8) to EG (tested with 2.2). No errors, just stuck (kernel had lightning icon) executing a cell.

Luciano Resende
@lresende
@ricklamers we will take a look at it; feel free to create an issue on EG so we can track it
Rick Lamers
@ricklamers
I’ll first try to create a minimal reproducible setup on 2.3. I’ll also submit a PR to JLab master for the handler.
If the problem persists in the minimal reproducible setting, I’ll create an issue for tracking! Thanks for replying
Rick Lamers
@ricklamers

Seems like the same issue happens on 2.3 + 3.0.0-rc9. Created a PR with jupyter/jupyter_server for handler.py and created an issue for tracking interop on Enterprise Gateway.

Could be related to awaiting connect in the handler (see issue #903).

Kevin Bates
@kevin-bates

Thanks Rick. Yes, the check_origin change is good. However, I'm still unable to complete the kernel's startup from lab3 to EG. I'm not seeing the kernel-info-request/reply sequence that typically occurs during the establishment of the websocket - nor am I seeing the websocket being established.

There shouldn't be anything required on the EG side of things, but I need to ensure there aren't any PRs in notebook that haven't yet made it into jupyter_server - like the check_origin change.

Ivar Stangeby
@qTipTip

Hey!

I love the idea of Enterprise Gateway and would love to take it for a spin, but I am having some installation issues. Is there any way of testing this locally in KIND (Kubernetes in Docker)? I am running into issues with the kernel-image-puller not being able to connect (error 111):

Traceback (most recent call last):
  File "./kernel_image_puller.py", line 25, in <module>
    docker_client = DockerClient.from_env()
  File "/usr/local/lib/python3.8/site-packages/docker/client.py", line 84, in from_env
    return cls(
  File "/usr/local/lib/python3.8/site-packages/docker/client.py", line 40, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.8/site-packages/docker/api/client.py", line 212, in _retrieve_server_version
    raise DockerException(
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))
Kevin Bates
@kevin-bates
Hi @qTipTip. I'm not sure what is going on with the KernelImagePuller; it can be a little finicky. If you could provide some more details, go ahead and open an issue in the repo. However, its functionality is more for larger clusters, and you should still be able to complete your "spin" w/o it. KIP is helpful because the image download tends to take longer than the kernel startup timeout - so the idea is that it will have pulled the images prior to the first request. You could either preload your kernel images, or suffer a timeout and try to start again after the initial reference has completed the download.
The image names are baked into the kernelspec files located in the EG image's /usr/local/share/jupyter/kernels directories. I'd just pick a couple kernelspecs you're interested in and pull those images.
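For example, a rough sketch of preloading a single kernel image into a KIND cluster (the image name and tag are illustrative; use the image_name referenced by the kernelspec you picked):

# Pull the kernel image locally, then side-load it onto the KIND nodes.
docker pull elyra/kernel-py:2.2.0
kind load docker-image elyra/kernel-py:2.2.0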
dummys
@dummys
hey guys
Kevin Bates
@kevin-bates
Hi - I just replied to you on DM.