Set Up a Working Directory in a Containerized MicroK8s Environment with Canonical Data Science Stack and GitHub Integration

Step-by-step process for setting up a working directory in a containerized MicroK8s environment with Canonical Data Science Stack and GitHub.

In this article, we will walk through the detailed process of setting up a working directory inside a PyTorch container running on MicroK8s, a lightweight Kubernetes distribution, using the Canonical Data Science Stack. You also have the choice of two other container types: one with TensorFlow and another optimized for Intel CPUs and GPUs. Additionally, we will link this working directory to GitHub for version control and backup. This ensures that our work inside the container remains organized, versioned, and backed up externally, offering a robust workflow for data science and machine learning projects.

I have previously installed the Canonical Data Science Stack using one of its pre-configured images. The stack includes essential tools such as PyTorch (CPU version), MLflow, JupyterLab, Git, Python, and their necessary dependencies. This pre-built environment helps streamline machine learning workflows. If you want detailed installation steps, feel free to check out my blog, where I go over each installation in depth. For now, let’s focus on setting up our GitHub-linked directory inside this environment.

Before getting to the actual steps, let’s briefly explore why I chose this path for my data science and machine learning endeavors.

  • MicroK8s: A lightweight Kubernetes distribution that lets you deploy applications and manage containers easily. It's perfect for local development and testing.

  • Containerized Environments (PyTorch, TensorFlow): By containerizing ML frameworks like PyTorch or TensorFlow, I ensure consistent environments across development, testing, and production. This helps avoid issues related to dependency mismatches.

  • MLflow: A platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. It integrates well with frameworks like TensorFlow and PyTorch, allowing you to track experiments, log models, and deploy them easily.

  • JupyterLab: An interactive development environment that enables you to write code, visualize data, and perform exploratory data analysis. Jupyter notebooks can be used to prototype and experiment with ML models.

  • Git Integration: Version control for your code and experiments. Integrating Git allows you to track changes in your codebase, collaborate with others, and manage different versions of your ML models.

I plan to use this deployment to experiment with the entire ML workflow, from environment setup (which, you will all agree, is very hectic) through model building and testing to deployment. Here’s how this stack is going to help in any such ML workflow.

Typical ML Workflow

Here’s a typical ML or DL workflow using the components the Canonical Data Science Stack offers:

  1. Environment Setup

    • Use MicroK8s to create a local Kubernetes cluster (a minimal command sketch follows this list).

    • Deploy containerized applications (like JupyterLab and MLflow) in this cluster.

  2. Development

    • Start a JupyterLab instance to write and run Python code.

    • Use Git to manage your code and track changes. You can push changes to a remote repository to collaborate with others.

  3. Model Training

    • Train your ML models using TensorFlow or PyTorch within JupyterLab or through automated scripts.

    • Log parameters, metrics, and models to MLflow to keep track of your experiments.

  4. Model Tracking

    • Use MLflow to compare different runs of your models and visualize performance metrics.

    • Store models in MLflow's model registry for versioning and easy access.

  5. Deployment

    • Deploy the trained model to a production environment using Kubernetes.

    • You can expose your model as a REST API or serve it directly using MLflow's built-in serving capabilities.

  6. Monitoring & Iteration

    • Monitor model performance in production.

    • Iterate on model improvements based on new data or feedback, and manage changes through Git and MLflow.
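
For reference, here is a minimal sketch of what the environment-setup step looks like on the command line, assuming a fresh Ubuntu host. The Canonical Data Science Stack installer takes care of most of this for you, and addon behavior can vary slightly between MicroK8s versions, so treat this as orientation rather than a script to run blindly:

     # Install MicroK8s and wait until the single-node cluster is ready
     sudo snap install microk8s --classic
     microk8s status --wait-ready

     # Enable the basic addons a data science workload typically needs
     microk8s enable dns
     microk8s enable hostpath-storage

     # Confirm the node is up
     microk8s kubectl get nodes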

Benefits of Canonical Data Science Stack

  • Scalability: Kubernetes allows you to scale your applications seamlessly.

  • Reproducibility: Using containers ensures that experiments can be reproduced consistently.

  • Collaboration: Git integration facilitates team collaboration, while MLflow manages experiment tracking and model versioning.

  • Flexibility: You can easily switch between different ML frameworks and tools as needed.

Now let us proceed with the steps. Believe me, you can follow them to set up and experiment with your models with ease.

Step 1: Setting Up Git on Your Local Machine

Git comes pre-installed in all six containers provided by the Canonical Data Science Stack. Let's verify the installation and configure Git before proceeding further.

  1. Open JupyterLab: Navigate to JupyterLab within your container environment. You can open the terminal from the JupyterLab interface by going to File → New → Terminal.

  2. Verify Git Installation: In the terminal inside JupyterLab, check whether Git is installed by running the following command:

     git --version
    

    This command checks if Git is installed and displays the version number.

  3. Configure Git: To commit changes to a GitHub repository, you’ll need to configure Git with your GitHub credentials. This ensures that every commit and push is attributed to you. Set your username and email with the following commands:

     git config --global user.name "Your GitHub Username"
     git config --global user.email "Your GitHub Email"
    

    These commands ensure that your commits are associated with your GitHub account.
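
To confirm the settings took effect, you can list your global Git configuration. This is read-only and safe to run at any time:

     git config --global --list

You should see your user.name and user.email entries in the output.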

Step 2: Creating a Project Directory in JupyterLab

Once Git is set up, the next step is to create a project directory inside JupyterLab, which runs within the MicroK8s Kubernetes cluster.

  1. Create a Directory for the Project: Inside the terminal, create a directory where your Python project will reside. For this example, we will create a directory called python-test:

     mkdir python-test
     cd python-test
    

    The mkdir command creates a directory named python-test, and the cd command changes the terminal’s current working directory to this newly created folder.
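
Optionally (this is my own habit, not a requirement of the stack), you can add a basic .gitignore right away so that JupyterLab’s hidden checkpoint folders and Python caches never clutter your commits:

     echo ".ipynb_checkpoints/" > .gitignore
     echo "__pycache__/" >> .gitignore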

Step 3: Initializing Git and Setting the Default Branch to "main"

With your project directory ready, the next step is to initialize Git within this directory and set main as the default branch. By default, when you integrate the container working directory with GitHub, Git will use master as the default branch. I have experienced difficulty merging master with main, so I suggest setting main as the default to avoid complications later.

  1. Initialize Git: In the terminal, initialize a new Git repository inside the python-test directory:

     git init
    

    The git init command creates a new .git directory inside your project folder. This directory tracks all changes to your files, making version control possible.

  2. Set Default Branch to "main": By default, Git creates the master branch when initializing a repository. Since modern Git practices favor using main instead, you can configure Git to always set main as the default branch:

     git config --global init.defaultBranch main
    

    This command ensures that future repositories will automatically set main as the default branch.

  3. Rename the Current Branch to "main": If the repository has already created a master branch, you can rename it to main:

     git branch -m main
    

    The -m option allows you to rename the branch from master to main, ensuring consistency with the new naming conventions.
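
As a quick sanity check that you are now on main (the --show-current flag needs Git 2.22 or newer, which the stack’s images should comfortably satisfy):

     git branch --show-current

The command should print main.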

Step 4: Creating a GitHub Repository

Now that your local repository is set up, it’s time to create a remote repository on GitHub.

  1. Log In to GitHub: Go to GitHub and log in to your account.

  2. Create a New Repository:

    • Click on the "New" button to create a new repository.

    • Name the repository python-test (the working directory name inside your containerized environment and the name of this new GitHub repository must be the same).

    • Set the repository to "Private" (or "Public" depending on your needs).

    • Do not initialize the repository with a README, .gitignore, license, or any other files.

At this point, you now have a GitHub repository ready to be linked with your local project.
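
As an aside, if you happen to have the GitHub CLI installed and authenticated (it is not part of the stack’s images as far as I know, and it needs a one-time gh auth login), the same repository can be created without leaving the terminal:

     gh repo create python-test --private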

Step 5: Linking Your Local Repository to GitHub

Now, let’s connect your local Git repository inside the MicroK8s container to the remote GitHub repository. This step will allow you to push changes from your local project to GitHub.

  1. Link Local Repository to GitHub: You will need to use a personal access token for authentication (you can create one from GitHub settings). Use the following command to link your local Git repository to the remote GitHub repository:

     git remote add origin https://<your github user name>:<your_token>@github.com/<your github user name>/python-test.git
    

    In this command, replace <your github user name> with the username of your GitHub profile and <your_token> with your actual GitHub personal access token (I hope you know how to generate and save a personal access token; if not, comment on the post and I will guide you through generating one). This allows Git to authenticate your connection with GitHub. The git remote add command adds a reference to your GitHub repository, and origin is the default name given to the remote repository.

  2. Alternative Authentication (avoid this method): If you do not want to embed your token directly in the URL, you can use a plain HTTPS remote instead. GitHub will prompt you for your username and personal access token when you push changes:

     git remote add origin https://github.com/<your github user name>/python-test.git
    

    Every time you push a change, Git will prompt you for your username and token, so I suggest avoiding this method.
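
A middle ground, if you prefer the plain HTTPS remote but dislike retyping credentials: Git’s built-in credential store remembers your username and token after the first push. Be aware that it saves the token in plain text in ~/.git-credentials inside the container, so weigh that against your security needs:

     git config --global credential.helper store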

Step 6: Creating a Python File and Committing Changes

With Git set up and linked to GitHub, let's create a Python file and push it to GitHub.

  1. Create a Python File: In JupyterLab, create a new Python file named hello.py. Write the following code in the file:

     print("Hello, Python!")
    

    This is a simple Python program that prints "Hello, Python!" to the console.

  2. Add the File to the Git Staging Area: In the terminal, add the file to Git’s staging area. The staging area is where Git tracks changes that are ready to be committed.

     git add hello.py
    
  3. Commit the File: Next, commit the changes with a message describing the update:

     git commit -m "First commit - hello.py"
    

    The -m flag allows you to add a commit message directly from the command line. In this case, the message describes that this is the first commit, which includes the hello.py file.
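
Before pushing, it is worth confirming that the commit actually landed. Both commands below are read-only:

     git status
     git log --oneline

git status should report a clean working tree, and git log --oneline should list your "First commit - hello.py" entry.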

Step 7: Pushing Changes to GitHub

Now that the changes are committed locally, it’s time to push them to the remote GitHub repository.

  1. Push the Changes to GitHub: Use the following command to push your changes to the main branch of the GitHub repository:

     git push -u origin main
    

    The -u option sets origin/main as the default upstream branch, so in the future, you can just run git push without specifying the remote and branch.

Note: You can create Jupyter notebooks or subfolders inside this working directory; when you commit and push, all of them will be uploaded to your GitHub repository in one go.
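
If you want to double-check that the push succeeded and the upstream is wired correctly, these read-only commands help:

     git remote -v
     git status

git remote -v shows the origin URL, and git status should say your branch is up to date with origin/main.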

Step 8: Optional but Strongly Recommended - Setting Up Automatic Push

If you’d like to simplify the process of adding, committing, and pushing changes, you can automate it with a bash alias. I recommend this setup, as it reduces all the stages of committing to a single command.

  1. Create an Alias: Add the following line to your .bashrc or .bash_profile file:

     alias gitpush="git add . && git commit -m 'auto commit' && git push origin main"
    
  2. Reload Your Shell: After updating the .bashrc file, reload your terminal session with the following command:

     source ~/.bashrc
    

Now, every time you want to push changes, you can simply type gitpush in the terminal. This command will add all files, commit them with the message "auto commit", and push them to the main branch.
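
One limitation of the alias is the fixed "auto commit" message. If you would like to pass a custom message, replace the alias with a small shell function in your .bashrc instead (a sketch; it falls back to "auto commit" when no message is given):

     gitpush() {
         # Use the first argument as the commit message, or "auto commit" by default
         git add . && git commit -m "${1:-auto commit}" && git push origin main
     }

After reloading your shell, run gitpush "Add data-loading notebook" to commit with a custom message.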

Conclusion

By following the steps outlined above, you can successfully set up a working directory inside a containerized environment on a MicroK8s Kubernetes cluster using the Canonical Data Science Stack. Moreover, you have linked this directory to a GitHub repository, enabling seamless version control and backups for your projects. Whether you're working on machine learning models, data analysis, or any other Python-based project, this workflow provides an efficient, reproducible, and scalable development environment.