Detecting and shutting down idle SageMaker instances

As an indie hacker on a low budget, I want to minimize my cloud spend when doing deep learning experimentation.  Since GPUs are quite costly, it's crucial to avoid having them sit idle.  

It's very easy to kick off a training job that you think will run all night, have it fail 20 minutes into the job, and then wake up to see that you paid for an idle GPU instance for 7.5 hours.

Here's how I got idle detection working with AWS SageMaker.  If there is no activity in a Jupyter notebook for a configurable amount of time, the instance associated with that notebook is automatically shut down.

For the overall approach, I took the recommendation of the perplexity.ai LLM search engine:

"To automatically shutdown idle resources on SageMaker, you can use an auto-shutdown Jupyter extension. One such extension is the SageMaker Studio Auto-Shutdown Extension, which automatically shuts down KernelGateway Apps, Kernels, and Image Terminals in SageMaker Studio when they are idle for a stipulated period of time. You can configure an idle time limit of your preference. To install the extension, you can use a lifecycle configuration script."

Install the extension manually

First install the script as documented in Option 2 of SageMaker-Studio-Autoshutdown-Extension.  This approach is more straightforward than the recommended approach of using lifecycle configurations, but the downside is that if you delete the JupyterServer app for your user profile, you'll need to go through these steps again.

Here is a simplified recap of those instructions.

  1. Open a System Terminal (as opposed to instance terminal)
  2. Download the script with curl -O https://raw.githubusercontent.com/aws-samples/sagemaker-studio-lifecycle-config-examples/main/scripts/install-autoshutdown-server-extension/on-jupyter-server-start.sh
  3. Change TIMEOUT_IN_MINS in the script as per your needs using vi on-jupyter-server-start.sh - for testing I set it to 10 mins.
  4. Run the script: chmod +x on-jupyter-server-start.sh; ./on-jupyter-server-start.sh  This creates a file called set-time-interval.sh in the .auto-shutdown folder.  NOTE: It may kill your current terminal.
  5. Change directory to .auto-shutdown and run set-time-interval.sh

The output from set-time-interval.sh should look like this:

 ./set-time-interval.sh 
Succeeded, idle timeout set to 10 minutes
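
If you'd rather not open vi to change TIMEOUT_IN_MINS, that edit can be scripted.  Here is a minimal Python sketch, assuming the downloaded script sets the value on a line of the form TIMEOUT_IN_MINS=<number> (optionally prefixed with export).  The helper name set_timeout is hypothetical and not part of the extension.  Run it after downloading the script and before executing it.

```python
import re
from pathlib import Path

# Assumption: the script assigns the timeout on a line like
# "TIMEOUT_IN_MINS=120" or "export TIMEOUT_IN_MINS=120".
_ASSIGNMENT = re.compile(
    r"^([ \t]*(?:export[ \t]+)?TIMEOUT_IN_MINS=)\S*", re.MULTILINE
)


def set_timeout(script_text: str, minutes: int) -> str:
    """Return script_text with the first TIMEOUT_IN_MINS assignment set to minutes."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    new_text, count = _ASSIGNMENT.subn(rf"\g<1>{minutes}", script_text, count=1)
    if count == 0:
        # Fail loudly rather than silently leaving the default timeout in place.
        raise ValueError("no TIMEOUT_IN_MINS assignment found")
    return new_text


if __name__ == "__main__":
    path = Path("on-jupyter-server-start.sh")
    if path.exists():
        # 10 minutes is the value used for testing in this post.
        path.write_text(set_timeout(path.read_text(), 10))
```

If the script's format differs from the assumption, the helper raises an error instead of writing anything, so you'll know to fall back to editing by hand.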

Now test it out:

  1. curl -O https://raw.githubusercontent.com/aws-samples/sagemaker-studio-auto-shutdown-extension/main/check_idle_timeout_configuration.py
  2. python check_idle_timeout_configuration.py

The output should look something like this:

sagemaker-user@studio$ python check_idle_timeout_configuration.py 
<Response [200]>
{'idle_time': 600, 'keep_terminals': False, 'count': 7}
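
The idle_time value appears to be in seconds, since 600 lines up with the 10 minute timeout set earlier.  That unit is an inference from the output above, not something I've seen documented.  Under that assumption, a small sketch for checking that the reported value matches what you configured (timeout_matches is a hypothetical helper name):

```python
def timeout_matches(settings: dict, expected_minutes: int) -> bool:
    """Compare the reported idle_time (assumed to be in seconds) with the expected minutes."""
    return settings["idle_time"] == expected_minutes * 60


# The dictionary printed by check_idle_timeout_configuration.py above.
reported = {"idle_time": 600, "keep_terminals": False, "count": 7}
print(timeout_matches(reported, 10))  # True
```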

Now really test it out: leave an instance untouched for the amount of time you set for TIMEOUT_IN_MINS and check that it is shut down.  Then try the reverse: set a very short timeout and make sure that notebook activity keeps the instance alive.

Major caveat - what about non-Jupyter activity?

If you're using an image terminal to do your work, but not using any Jupyter notebooks or code consoles, your activity "won't count" and the instance will still be shut down.  In other words, any activity that the Jupyter server doesn't know about won't extend the lease that keeps the instance from being shut down.  This is also mentioned in this issue.

Workaround idea: write a script that runs in the background, monitors for user activity on the terminal, and sends keep-alive commands to the Jupyter kernel.
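
A rough, untested sketch of the monitoring half of that idea in Python.  It assumes the terminal's tty device (the path printed by the tty command) has its access and modification times updated on input and output, as on typical Linux systems; I have not verified this inside a SageMaker image terminal.  How to actually deliver the keep-alive is left open, so send_keepalive is a hypothetical callback you would have to supply, and both function names are made up for illustration.

```python
import os
import time
from typing import Callable, Optional


def seconds_since_activity(tty_path: str, now: Optional[float] = None) -> float:
    """Seconds since the file at tty_path was last accessed or modified."""
    st = os.stat(tty_path)
    last_touched = max(st.st_atime, st.st_mtime)
    current = time.time() if now is None else now
    return max(0.0, current - last_touched)


def watch(
    tty_path: str,
    send_keepalive: Callable[[], None],  # hypothetical: supplied by you
    active_within_s: float = 300.0,
    poll_s: float = 60.0,
    max_polls: Optional[int] = None,  # None means run until killed
) -> None:
    """Poll the tty and call send_keepalive whenever it was used recently."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if seconds_since_activity(tty_path) < active_within_s:
            send_keepalive()
        polls += 1
        time.sleep(poll_s)
```

The thresholds are arbitrary defaults.  Whatever you plug in as send_keepalive needs to produce activity the Jupyter server can see, otherwise the extension will still treat the instance as idle.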

In the meantime, I dealt with this issue by setting TIMEOUT_IN_MINS to a fairly large value of 8 hours.  It's not ideal, because idle instances can still burn a lot of cash, but it's better than having no backstop and realizing weeks or months later that you forgot to shut down that GPU.  Also, is it just me, or could AWS do more to make this easier?  I guess they are disincentivized and are enjoying the extra revenue.

Persisting the behavior

As mentioned earlier, the problem with this approach is that any time you restart your JupyterServer, for example when shutting down and updating SageMaker Studio, the idle detection will be removed and you will have to redo the steps above.

The solution is to use the recommended approach (Option 1) in SageMaker-Studio-Autoshutdown-Extension.  It looked a bit complicated though, so I haven't tried it yet.  Maybe next time I blow through my AWS budget I'll give it a shot.

If you have any thoughts or suggestions, post a reply to the Twitter thread for this blog post.

Happy hacking!

References