HPC-AI convergence
# Resources
- #TALK Innovator Insights: HW & SW Platforms for HPC, AI and ML
- #TALK HPC + AI: Machine Learning Models in Scientific Computing
- The increasing interest in using Machine Learning (ML) and Artificial Intelligence (AI) techniques to tackle high-impact research problems requires High Performance Computing (HPC) resources to efficiently compute and scale complex algorithms across thousands of nodes (Brayford et al. 2019). A diverse group of disciplines could be impacted by the integration of HPC and AI, especially those dealing with massive quantities of multi-dimensional data, such as high energy physics, astrophysics and medical imaging.
# HPC vs AI programming models
- One of the main challenges in the convergence of HPC and AI is the gap between programming models.
- The worlds of HPC and AI rely on different computing paradigms and programming models. On one hand, HPC centers typically run well-tested, often closed-source codes written in low-level languages (C, Fortran) on top of distributed computing APIs (MPI, for example through the Open MPI implementation), driven from the command line through workload managers (SLURM), on systems that grant users restricted privileges and often no connection to external systems. On the other hand, ML/AI applications are usually developed in high-level languages and frameworks (Python, Julia, TensorFlow) that hide much of the difficulty of writing performant, distributed code, and they rely heavily on open source libraries that must be downloaded from external sources (the internet). ML development requires an interactive computing platform where new models can be tested and validated quickly, without the overhead of restrictive administrative rules (very often related to security measures). The interactive computing paradigm covers code development, real-time exploratory data analytics, and visualization of inputs and results. Designing ML models, as opposed to calling pre-coded numerical models, requires a lot of experimentation.
- The most recent family of ML algorithms, based on deep neural networks (Deep Learning, DL), has become the workhorse of AI. To handle complex problems, DL relies on training increasingly large deep networks on large amounts of data, which usually requires specialized hardware such as Graphics Processing Units (GPUs).
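To make the gap between the two models concrete, here is a minimal sketch of how a command that an ML practitioner would type interactively ends up wrapped in a SLURM batch script. The `JobSpec` class, its field names and the `train.py` command are hypothetical and only serve the illustration; the `#SBATCH` directives are standard SLURM options, but the right values (and whether GPUs are requested through `--gres`) depend on the site configuration.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Hypothetical description of a training job (illustrative field names)."""
    name: str
    nodes: int
    gpus_per_node: int
    walltime: str  # HH:MM:SS
    command: str   # what the user would otherwise run interactively


def render_sbatch(job: JobSpec) -> str:
    """Render a SLURM batch script that launches job.command once per node."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gres=gpu:{job.gpus_per_node}",
        f"#SBATCH --time={job.walltime}",
        "",
        f"srun {job.command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    spec = JobSpec("train-demo", nodes=4, gpus_per_node=2,
                   walltime="01:00:00", command="python train.py")
    print(render_sbatch(spec))
```

The generated text would be saved to a file and submitted with `sbatch`; the point is only that the experiment has to be described up front and queued, rather than explored interactively.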
# Interactive computing and containers
- Data Science and ML workflows require interactive, exploratory computing platforms and flexible environments that allow on-the-fly modifications of the software stack (such as installing new libraries) while keeping that stack reproducible. For Python this can be achieved with Pipenv or Conda (without sudo). A more general solution, however, is to use containers, which combine the Linux kernel features “cgroups” and “namespaces” to create isolated environments. Docker provides a complete toolchain that simplifies working with containers from build to run. Containers address a few of the aspects described above, which are very relevant to modern scientific computing: reusability, collaboration, reproducibility and portability. Unfortunately, some characteristics of Docker prevent it from being deployed on HPC systems:
    - It uses an all-or-nothing security model, which would grant users system-level (root) privileges.
    - It does not play well with batch systems.
    - It assumes a local disk.
    - It requires a very modern kernel.
    - Its adoption would add new layers of complexity to HPC systems.
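The kernel building blocks mentioned above, namespaces and cgroups, are visible to any unprivileged process on Linux. The following sketch, which assumes a Linux system with `/proc` mounted and returns empty results elsewhere, lists the namespaces and cgroups the current process belongs to. The function names are illustrative, not part of any container tool.

```python
import os
from pathlib import Path


def current_namespaces(proc_ns: str = "/proc/self/ns") -> dict:
    """Map each namespace type (e.g. 'pid', 'mnt') to its kernel identifier."""
    ns_dir = Path(proc_ns)
    if not ns_dir.is_dir():  # not Linux, or /proc is not mounted
        return {}
    result = {}
    for entry in sorted(ns_dir.iterdir()):
        try:
            # Each entry is a symlink whose target looks like 'pid:[4026531836]'
            result[entry.name] = os.readlink(entry)
        except OSError:
            continue  # entry not readable for this process
    return result


def current_cgroups(path: str = "/proc/self/cgroup") -> list:
    """Return the raw lines describing this process's cgroup membership."""
    try:
        return Path(path).read_text().splitlines()
    except OSError:
        return []


if __name__ == "__main__":
    for kind, ident in current_namespaces().items():
        print(f"{kind:20s} {ident}")
    for line in current_cgroups():
        print(line)
```

Running this once on the host and once inside a container should show different identifiers for the namespaces the container runtime has isolated, which is a quick way to see what "isolated environment" means in practice.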
- To address the security and integration issues of Docker, several HPC container systems were created. These solutions are usually backwards compatible with Docker; the most widespread and mature are:
    - Shifter: developed at NERSC to enable Docker images to be securely executed on an HPC ecosystem, enabling user independence, shared resource availability and high security.
    - Singularity: a very popular container system used in many supercomputing centers, with a runtime similar to that of Shifter.
    - Charliecloud: a lightweight container system developed at LANL with high standards of security.
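As an illustration of the backwards compatibility with Docker, the sketch below assembles a Singularity command line that runs a program from a Docker Hub image. It only builds the argument list and does not execute anything, so it runs on machines without Singularity installed. The helper name and the example image tag are assumptions made for the illustration; `singularity exec`, the `docker://` URI prefix and the `--nv` flag are part of the Singularity command line, but availability and behavior depend on the installed version.

```python
import shlex


def singularity_exec_argv(docker_image: str, command: list, use_gpu: bool = False) -> list:
    """Build the argv for running `command` inside a Docker image via Singularity."""
    argv = ["singularity", "exec"]
    if use_gpu:
        argv.append("--nv")  # make the host NVIDIA driver stack visible in the container
    argv.append(f"docker://{docker_image}")  # pull and convert the image from a Docker registry
    argv.extend(command)
    return argv


if __name__ == "__main__":
    argv = singularity_exec_argv("python:3.11-slim", ["python", "--version"])
    # Print a shell-safe version of the command instead of running it
    print(shlex.join(argv))
```

On a cluster, the resulting list could be passed to `subprocess.run` or prefixed with `srun` inside a batch script; no root privileges are involved at run time, which is the property that makes these systems acceptable to HPC centers.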
# Talks
- #TALK Reproducible Science with Containers on HPC through Singularity
- #TALK Containers in HPC. IDEAS webinar by Shane Canon
- #TALK Distributed HPC Applications with Unprivileged Containers