Set Up a Working Directory in a Containerized MicroK8s Environment with Canonical Data Science Stack and GitHub Integration
Step-by-step process for setting up a working directory in a containerized MicroK8s environment with Canonical Data Science Stack and GitHub.
Table of contents
- Typical ML Workflow
- Benefits of Canonical Data Science Stack
- Step 1: Setting Up Git on Your Local Machine
- Step 2: Creating a Project Directory in JupyterLab
- Step 3: Initializing Git and Setting the Default Branch to "main"
- Step 4: Creating a GitHub Repository
- Step 5: Linking Your Local Repository to GitHub
- Step 6: Creating a Python File and Committing Changes
- Step 7: Pushing Changes to GitHub
- Step 8: Optional but Mandatory - Setting Up Automatic Push
- Conclusion
In this article, we will walk through the detailed process of setting up a working directory inside a PyTorch container running on MicroK8s, a lightweight Kubernetes distribution, using the Canonical Data Science Stack. You also have the choice of two other types of containers: one with TensorFlow and another built for Intel CPUs and GPUs. Additionally, we will link this working directory to GitHub for version control and backup. This ensures that our work inside the container remains organized, versioned, and backed up externally, offering a robust workflow for data science and machine learning projects.
I have previously installed the Canonical Data Science Stack using one of its pre-configured images. The stack includes essential tools such as PyTorch (CPU version), MLflow, JupyterLab, Git, Python, and their necessary dependencies. This pre-built environment helps streamline machine learning workflows. If you want detailed installation steps, feel free to check out my blog where I go over each installation in depth. For now, let’s focus on setting up our GitHub-linked directory inside this environment.
Before getting to the actual steps, let's briefly explore why I chose this path for my data science and machine learning endeavors.
MicroK8s: A lightweight Kubernetes distribution that allows you to deploy applications and manage containers easily. It's perfect for local development and testing.
Containerized Environments (PyTorch, TensorFlow): By containerizing ML frameworks like PyTorch or TensorFlow, I ensure consistent environments across development, testing, and production. This helps avoid issues related to dependency mismatches.
MLflow: A platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. It integrates well with frameworks like TensorFlow and PyTorch, allowing you to track experiments, log models, and deploy them easily.
JupyterLab: An interactive development environment that enables you to write code, visualize data, and perform exploratory data analysis. Jupyter notebooks can be used to prototype and experiment with ML models.
Git Integration: Version control for your code and experiments. Integrating Git allows you to track changes in your codebase, collaborate with others, and manage different versions of your ML models.
I plan to use this deployment to experiment with the entire ML workflow, from environment setup (which, as you will all agree, is very hectic) through model building and testing to deployment. Here’s how this stack is going to help in any such ML workflow:
Typical ML Workflow
Here’s a typical ML or DL workflow using the components that Canonical's Data Science Stack offers:
Environment Setup
Use MicroK8s to create a local Kubernetes cluster.
Deploy containerized applications (like JupyterLab and MLflow) in this cluster.
Development
Start a JupyterLab instance to write and run Python code.
Use Git to manage your code and track changes. You can push changes to a remote repository to collaborate with others.
Model Training
Train your ML models using TensorFlow or PyTorch within JupyterLab or through automated scripts.
Log parameters, metrics, and models to MLflow to keep track of your experiments.
Model Tracking
Use MLflow to compare different runs of your models and visualize performance metrics.
Store models in MLflow's model registry for versioning and easy access.
Deployment
Deploy the trained model to a production environment using Kubernetes.
You can expose your model as a REST API or serve it directly using MLflow's built-in serving capabilities, as shown in the sketch after this list.
Monitoring & Iteration
Monitor model performance in production.
Iterate on model improvements based on new data or feedback, and manage changes through Git and MLflow.
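To make the tracking and deployment stages above more concrete, here is a minimal bash sketch of the MLflow commands involved. It assumes MLflow is installed (as it is in the Canonical stack images) and that a model has already been logged and registered under the hypothetical name my-model; adjust the name, version, and ports to your setup, and note that the --env-manager local flag is only available in newer MLflow releases.

# Launch the MLflow tracking UI to browse and compare experiment runs
mlflow ui --port 5000

# Serve a registered model version as a local REST API
# ("my-model" and version "1" are placeholders)
mlflow models serve -m "models:/my-model/1" --port 5001 --env-manager local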
Benefits of Canonical Data Science Stack
Scalability: Kubernetes allows you to scale your applications seamlessly.
Reproducibility: Using containers ensures that experiments can be reproduced consistently.
Collaboration: Git integration facilitates team collaboration, while MLflow manages experiment tracking and model versioning.
Flexibility: You can easily switch between different ML frameworks and tools as needed.
Now let us proceed with the steps. Believe me, you can follow these steps to set up and experiment with your models with ease.
Step 1: Setting Up Git on Your Local Machine
Git comes pre-installed in all six containers provided by the Canonical Data Science Stack. Let's verify the installation and configure Git before proceeding further.
Open JupyterLab: Navigate to JupyterLab within your container environment. You can open a terminal from the JupyterLab interface by going to File → New → Terminal.
Verify Git Installation: In the terminal inside JupyterLab, check whether Git is installed by running the following command:
git --version
This command checks if Git is installed and displays the version number.
Configure Git: To commit changes to a GitHub repository, you’ll need to configure Git with your GitHub credentials. This ensures that every commit and push is attributed to you. Set your username and email with the following commands:
git config --global user.name "Your GitHub Username"
git config --global user.email "Your GitHub Email"
These commands ensure that your commits are associated with your GitHub account.
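To double-check that the configuration took effect, you can list your global Git settings; this is just a quick sanity check:

# Show all global settings, including user.name and user.email
git config --global --list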
Step 2: Creating a Project Directory in JupyterLab
Once Git is set up, the next step is to create a project directory inside JupyterLab, which runs within the MicroK8s Kubernetes cluster.
Create a Directory for the Project: Inside the terminal, create a directory where your Python project will reside. For this example, we will create a directory called python-test:
mkdir python-test
cd python-test
The mkdir command creates a directory named python-test, and the cd command changes the terminal’s current working directory to this newly created folder.
Step 3: Initializing Git and Setting the Default Branch to "main"
With your project directory ready, the next step is to initialize Git within this directory and set main as the default branch. By default, when you initialize a repository inside the container, Git will set master as the default branch. I have experienced difficulty merging master into main, so I suggest setting main as the default to avoid further complications.
Initialize Git: In the terminal, initialize a new Git repository inside the python-test directory:
git init
The git init command creates a new .git directory inside your project folder. This directory tracks all changes to your files, making version control possible.
Set Default Branch to "main": By default, Git creates the master branch when initializing a repository. Since modern Git practices favor using main instead, you can configure Git to always set main as the default branch:
git config --global init.defaultBranch main
This command ensures that future repositories will automatically set main as the default branch.
Rename the Current Branch to "main": If the repository has already created a master branch, you can rename it to main:
git branch -m main
The -m option allows you to rename the branch from master to main, ensuring consistency with the new naming conventions.
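To confirm the setup, you can inspect the repository state; git branch --show-current requires Git 2.22 or later, which the stack's containers should comfortably exceed:

# Report the repository status (no commits yet at this stage)
git status

# Print the name of the current branch; it should say: main
git branch --show-current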
Step 4: Creating a GitHub Repository
Now that your local repository is set up, it’s time to create a remote repository on GitHub.
Log In to GitHub: Go to GitHub and log in to your account.
Create a New Repository:
Click on the "New" button to create a new repository.
Name the repository python-test (the working directory name inside your containerized environment and the name of the new GitHub repository must be the same).
Set the repository to "Private" (or "Public", depending on your needs).
Do not initialize the repository with a README, .gitignore, license, or any other files.
At this point, you now have a GitHub repository ready to be linked with your local project.
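As a side note, if you happen to have the GitHub CLI (gh) installed and authenticated, the repository can also be created directly from the terminal; gh is not part of the stack's pre-installed tooling, so treat this as an optional alternative:

# Create a private repository named python-test under your account
gh repo create python-test --private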
Step 5: Linking Your Local Repository to GitHub
Now, let’s connect your local Git repository inside the MicroK8s container to the remote GitHub repository. This step will allow you to push changes from your local project to GitHub.
Link Local Repository to GitHub: You will need to use a personal access token for authentication (you can create one from GitHub settings). Use the following command to link your local Git repository to the remote GitHub repository:
git remote add origin https://<your github user name>:<your_token>@github.com/<your git hub name>/python-test.git
In this command, replace <your github user name> with the username of your GitHub profile and <your_token> with your actual GitHub personal access token (I hope you know how to generate and save a personal access token; if not, leave a comment on the post and I will guide you through generating one). This allows Git to authenticate your connection with GitHub. The git remote add command adds a reference to your GitHub repository, and origin is the default name given to the remote repository.
Alternative Authentication (avoid this method): If you do not want to embed your token directly into the URL, you can use a plain HTTPS remote. GitHub will prompt you for your username and a password, where the password must be your personal access token:
git remote add origin https://github.com/<your github user name>/python-test.git
Every time you push a change, Git will prompt you for your username and token, so I suggest avoiding this method.
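Whichever method you choose, verify that the remote was registered correctly:

# List the configured remotes and their fetch/push URLs
git remote -v

If you prefer the plain HTTPS remote but want to be prompted only once, Git's credential helper can remember the token after the first push. Be aware that the store helper saves credentials in plain text in ~/.git-credentials, so use it only on machines you trust:

git config --global credential.helper store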
Step 6: Creating a Python File and Committing Changes
With Git set up and linked to GitHub, let's create a Python file and push it to GitHub.
Create a Python File: In JupyterLab, create a new Python file named hello.py. Write the following code in the file:
print("Hello, Python!")
This is a simple Python program that prints "Hello, Python!" to the console.
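You can optionally run the file from the terminal to confirm it behaves as expected before committing; this assumes the container exposes its Python interpreter as python3:

# Run the script; it should print: Hello, Python!
python3 hello.py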
Add the File to the Git Staging Area: In the terminal, add the file to Git’s staging area. The staging area is where Git tracks changes that are ready to be committed.
git add hello.py
Commit the File: Next, commit the changes with a message describing the update:
git commit -m "First commit - hello.py"
The -m flag allows you to add a commit message directly from the command line. In this case, the message describes that this is the first commit, which includes the hello.py file.
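At this point you can verify that the commit was recorded:

# Show the history in a compact one-line-per-commit format
git log --oneline

# Confirm the working tree is clean
git status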
Step 7: Pushing Changes to GitHub
Now that the changes are committed locally, it’s time to push them to the remote GitHub repository.
Push the Changes to GitHub: Use the following command to push your changes to the main branch of the GitHub repository:
git push -u origin main
The -u option sets origin/main as the default upstream branch, so in the future, you can just run git push without specifying the remote and branch.
Note: You can create Jupyter notebooks and subfolders inside this working directory; once committed, all of them will be pushed to your GitHub repository in one go.
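Since JupyterLab silently creates .ipynb_checkpoints folders next to your notebooks, it is worth adding a small .gitignore before committing whole directories, so these autosave artifacts stay out of your repository. A minimal sketch:

# Ignore JupyterLab autosave checkpoints and Python bytecode caches
printf ".ipynb_checkpoints/\n__pycache__/\n" > .gitignore
git add .gitignore
git commit -m "Add .gitignore"
git push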
Step 8: Optional but Mandatory - Setting Up Automatic Push
If you’d like to simplify the process of adding, committing, and pushing changes, you can automate it with a bash alias. Please do set this up, as it reduces all the stages of committing to a single command.
Create an Alias: Add the following line to your .bashrc or .bash_profile file:
alias gitpush="git add . && git commit -m 'auto commit' && git push origin main"
Reload Your Shell: After updating the .bashrc file, reload your terminal session with the following command:
source ~/.bashrc
Now, every time you want to push changes, you can simply type gitpush in the terminal. This command will add all files, commit them with the message "auto commit", and push them to the main branch.
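One limitation of the alias is its fixed "auto commit" message. If you would rather pass a custom message while keeping the one-command workflow, a small bash function can replace the alias in your .bashrc; this is a sketch, not part of the stack:

# gitpush with an optional commit message, defaulting to 'auto commit'
gitpush() {
  git add . && git commit -m "${1:-auto commit}" && git push origin main
}

After reloading your shell, run gitpush "Refactor data loader" for a custom message, or plain gitpush for the default.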
Conclusion
By following the steps outlined above, you can successfully set up a working directory inside a containerized environment on a MicroK8s Kubernetes cluster using the Canonical Data Science Stack. Moreover, you have linked this directory to a GitHub repository, enabling seamless version control and backups for your projects. Whether you're working on machine learning models, data analysis, or any other Python-based project, this workflow provides an efficient, reproducible, and scalable development environment.