Setting Up a Complete Machine Learning Pipeline on a Linux Tablet: A Guide to the Data Science Stack

Learn how to set up a complete environment for the machine learning lifecycle using the Data Science Stack on your Linux tablet.

Introduction

In the world of data science and machine learning (ML), the tools and environments you work with are just as critical as the algorithms and data themselves. For those starting out in ML, managing the development environment can be overwhelming, with constant challenges around setup, package dependencies, and compatibility. That’s why I strongly recommend using a Linux command-line interface (CLI) environment. The Linux CLI stands out because of its flexibility, powerful package management, and direct control over system resources. It is widely considered the best choice for data scientists who want to focus on model building and experimentation, with minimal time spent on environment management issues.

To streamline my own workflow, I recently bought a Linux-based tablet, the Star Labs StarLite MK-V, and installed the Data Science Stack (DSS). DSS simplifies the process of setting up and managing an ML environment by bundling essential open-source tools into a single solution. It includes everything you need, from Jupyter Notebooks to ML frameworks like TensorFlow and PyTorch, allowing seamless integration with GPU resources while avoiding complex configuration steps. Leveraging the powerful hardware of my tablet, DSS creates a smooth, efficient ML workflow without the usual setup headaches.

This guide will walk you through the process of installing DSS on my StarLite MK-V tablet, but the same steps can be applied to any Linux distribution of your choice. By using the Linux CLI and DSS, you can create a robust, scalable machine learning environment that maximizes productivity, minimizes downtime, and allows you to focus entirely on building and optimizing your ML models.

System Details: The Heart of the Setup

Before diving into the installation process, let's take a closer look at the specifications of my tablet, which make it an ideal candidate for machine learning tasks.

Hardware Information

  • Model: Star Labs StarLite

  • Memory: 16.0 GiB RAM

  • Processor: Intel® N200 × 4

  • Graphics: Intel® Graphics (ADL-N)

  • Disk Capacity: 1.0 TB

Software Information

  • Firmware Version: 24.06

  • Operating System: Ubuntu 24.04.1 LTS

  • OS Type: 64-bit

  • GNOME Version: 46

  • Windowing System: Wayland

  • Kernel Version: Linux 6.8.0-44-generic

With this powerful setup, I feel well-equipped to handle various machine learning workloads.

Why Data Science Stack?

New data scientists often face significant challenges when setting up their environments, managing package dependencies, and accessing compute resources; I certainly faced my share. These issues lead to wasted time on tooling, complex configurations, a steep learning curve, and high initial costs. With new tools and frameworks constantly being introduced, selecting and integrating the right ones is overwhelming, especially when they come with dependencies and versioning constraints that complicate upgrades and maintenance.

To address these issues, I think data scientists should look for platforms that cover the entire machine learning lifecycle, are easy to use and scalable, and integrate popular tools for exploratory data analysis and model building. The Data Science Stack (DSS) by Canonical provides a solution by bundling leading open-source tools into a ready-made environment. DSS is designed to simplify setup and maximize GPU usage on any Ubuntu AI workstation, making it easy for users to develop, optimize, and store models without the hassle of managing individual tools or dealing with complex configurations.

DSS includes popular machine learning environments like Jupyter Notebook, MLflow, PyTorch, and TensorFlow, with a simple command-line interface for deployment. It also takes care of package dependencies, ensuring compatibility and seamless integration across all tools and hardware. Additionally, DSS simplifies GPU configuration through its GPU operator, allowing data scientists to fully leverage their machine's capabilities without technical barriers, enabling faster development and optimization of machine learning models.

What is the Data Science Stack (DSS)?

The Data Science Stack is a comprehensive environment designed to simplify the machine learning workflow. It allows you to build, train, and deploy machine learning models efficiently using containerized environments.

The key features of DSS include:

  • Container orchestration system (microK8s): Manages containerized applications and services.

  • Containerized ML environments: Provides pre-configured environments tailored for different machine learning tasks.

  • CLI for management: Streamlines the management of containers and resources.

  • Seamless GPU utilization: Allows your applications to leverage the power of the Intel integrated GPU (ADL-N).

  • Easy integration with MLflow: For tracking lineage and managing ML workflows.
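
To give a sense of what that MLflow integration looks like in practice, here is a minimal sketch of logging a run from Python. The tracking URI below is a placeholder, not the address DSS will actually use; point it at whatever MLflow endpoint your own deployment exposes, and the parameter and metric values are only example numbers.

     import mlflow

     # Placeholder tracking URI; replace with the MLflow endpoint exposed by your DSS deployment.
     mlflow.set_tracking_uri("http://10.152.183.74:5000")
     mlflow.set_experiment("dss-quickstart")

     with mlflow.start_run():
         # Example values only: log one hyperparameter and one metric for this run.
         mlflow.log_param("learning_rate", 0.001)
         mlflow.log_metric("accuracy", 0.92)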

Prerequisites

Before we begin the installation process, ensure that your system meets the following prerequisites:

  1. Sudo Access: You need administrative privileges to install packages and services.

  2. Internet Connection: Required for downloading packages and updates.

  3. Terminal Access: Familiarity with basic terminal commands.

Step-by-Step Installation Guide

Step 1: Install microK8s

microK8s is a lightweight Kubernetes distribution that enables you to run containerized applications on your local machine.

  1. Open the terminal on your tablet.

  2. Install microK8s by executing the following command:

     sudo snap install microk8s --channel 1.28/stable --classic
    
    • This command uses Snap to install microK8s from the specified channel. The --classic flag installs the snap with classic confinement, giving it direct access to system resources.
  3. Enable necessary microK8s services:

     sudo microk8s enable hostpath-storage
     sudo microk8s enable dns
     sudo microk8s enable rbac
    
    • hostpath-storage: Enables persistent storage for your containers.

    • dns: Sets up a DNS service for your Kubernetes cluster, allowing easy service discovery.

    • rbac: Enables Role-Based Access Control, enhancing security for managing resources.

Step 2: Install the Data Science Stack CLI

The DSS CLI simplifies managing the containerized environments and resources.

  1. Install the DSS CLI with the following command:

     sudo snap install data-science-stack --channel latest/stable
    
    • This command installs the Data Science Stack CLI, allowing you to manage your data science environments easily.

Step 3: Initialize the Data Science Stack

Now that you have installed microK8s and the DSS CLI, you need to initialize DSS.

  1. Run the following command:

     dss initialize --kubeconfig="$(sudo microk8s config)"
    
    • This command initializes the DSS environment, using the configuration of your microK8s installation to set up the necessary connections and permissions.

Step 4: Launch a Jupyter Notebook

With DSS set up, you can now launch a Jupyter Notebook pre-configured for TensorFlow, a popular machine learning framework.

  1. Create a new notebook environment:

     dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0
    
    • This command creates a new Jupyter Notebook environment using a pre-defined Docker image that includes TensorFlow and CUDA support for GPU acceleration.

    • dss create: Command to create a new environment.

    • my-tensorflow-notebook: The name of your notebook environment.

    • --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0: Specifies the Docker image to use for the notebook, which includes TensorFlow with CUDA support.

  2. Access the Jupyter Notebook:

    After the notebook environment is created, you can access it by running:

     dss list
    
    • After running the above command, you will see output with a URL to access the Jupyter Notebook. Open this URL in your web browser to start using the notebook.
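
Once the notebook is open, a quick sanity check is to train a tiny model on random data; this confirms the TensorFlow environment works end to end without needing a real dataset. The numbers below are arbitrary:

     import numpy as np
     import tensorflow as tf

     # Tiny synthetic dataset: 100 samples, 10 features, binary labels.
     X = np.random.rand(100, 10).astype("float32")
     y = np.random.randint(0, 2, size=(100,))

     # A minimal network, just enough to exercise the framework.
     model = tf.keras.Sequential([
         tf.keras.Input(shape=(10,)),
         tf.keras.layers.Dense(16, activation="relu"),
         tf.keras.layers.Dense(1, activation="sigmoid"),
     ])
     model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
     model.fit(X, y, epochs=2, batch_size=16)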

Step 5: Integrating the Intel Integrated GPU (ADL-N)

To leverage the Intel integrated GPU for your machine learning tasks, ensure that the necessary drivers and configurations are in place.

  1. Install Intel Graphics Drivers: Make sure your system has the latest Intel graphics drivers installed. On Ubuntu, you can usually get these through the package manager:

     sudo apt update
     sudo apt install intel-media-va-driver-non-free
    
    • This command updates your package list and installs the Intel media driver, which provides GPU acceleration capabilities.
  2. Verify GPU Access in Jupyter Notebook: To check if the GPU is accessible in your Jupyter Notebook, run the following code in a new cell:

     import tensorflow as tf
     print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    
    • This code checks and prints the number of available GPUs. If everything is set up correctly, it should show a count greater than 0.
  3. Alternatively, you can verify that the GPU is recognized by the system by running the following command:

     lspci | grep VGA
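
If lspci shows the GPU but the TensorFlow count above stays at 0, it can help to list every device TensorFlow sees and to log where operations are actually placed. A small diagnostic sketch:

     import tensorflow as tf

     # List every physical device TensorFlow can see (CPU, GPU, ...).
     for device in tf.config.list_physical_devices():
         print(device)

     # Log the device each operation lands on while running a small matmul.
     tf.debugging.set_log_device_placement(True)
     a = tf.random.uniform((256, 256))
     b = tf.random.uniform((256, 256))
     print(tf.matmul(a, b).shape)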
    

Running Your Notebook

Every time you want to run your Jupyter Notebook, follow these steps:

  1. Start microK8s (if not running):

     sudo microk8s start
    
    • This command starts the microK8s services if they were previously stopped.
  2. Access your Jupyter Notebook:

    • Run the command to create your notebook environment, or use dss list to retrieve the URL of an existing one:

        dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0
      
    • Open the provided URL in your web browser.

  3. Install any additional libraries:
    Inside the Jupyter Notebook, you might want to install libraries specific to your projects:

     !pip install <library-name>
    
    • This command installs any additional Python libraries you need for your projects.
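
For example, to pull in pandas (only an example here; substitute whatever library your project needs) and confirm it imports cleanly:

     !pip install pandas

     import pandas as pd

     # Quick check that the freshly installed library imports and works.
     df = pd.DataFrame({"feature": [1, 2, 3], "label": [0, 1, 0]})
     print(df.describe())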

Exploring Kubeflow Integration

Kubeflow is a machine learning toolkit for Kubernetes that makes it easier to deploy and manage ML workflows. I was able to set up a single-node Kubeflow installation on my machine.

Installing Kubeflow

  1. Install the Kubeflow CLI:

     sudo snap install kfctl --classic
    
    • This command installs the Kubeflow CLI tool, which helps you manage Kubeflow installations.
  2. Deploy Kubeflow on microK8s: You will need to create a configuration file for your Kubeflow deployment. Follow the official Kubeflow deployment documentation for detailed steps.

  3. Access Kubeflow: After the installation, you can access the Kubeflow dashboard to manage your ML workflows.

Congratulations to me! I have successfully set up the Data Science Stack on my Star Labs StarLite MK-V tablet, enabling me to build, train, and deploy machine learning models. Additionally, I have integrated Kubeflow for managing ML workflows and configured my Intel integrated GPU for optimized performance.

Conclusion

By following this guide, you will now have a robust machine learning environment at your fingertips, ready for experimentation and learning. Your Linux machine, coupled with the Data Science Stack, provides an excellent platform for mastering machine learning and data science. Happy learning!