
Creating Scale for AI Workloads: Pushing the Boundaries of Deep Learning


It’s September, and SC24 in Atlanta is just around the corner. The intersection of AI and HPC is one of the hottest topics in the supercomputing zeitgeist as the community continues to explore how these technologies impact everything from scientific research to industrial applications. For the seventh year running, the SC Conference will host a tutorial session on “Deep Learning at Scale,” led by the National Energy Research Scientific Computing Center (NERSC), operated by Berkeley Lab. This year, NERSC is partnering with experts from NVIDIA and Oak Ridge National Laboratory. The session will dive into the current strategies and tools that push the boundaries of what is possible with deep learning on the world’s most powerful supercomputers.

Pushing Boundaries

We spoke with Wahid Bhimji, the Group Lead for Data & Analytics Services (as well as the Division Deputy for AI and Science) at NERSC, to gain some insight into the state of deep learning at scale, explore the challenges and opportunities ahead, and get an idea about what attendees can expect from this popular session.

Wahid Bhimji

SC24 Communication Team: Can you provide an overview of the current state of deep learning at scale? What are some of the most exciting developments and trends happening currently?

Wahid Bhimji: Machine learning has been used in the sciences for decades, but the recent revolution has been driven by deep learning and modern AI techniques. This shift has been significant in science, and NERSC has been involved in related projects since around 2015. A lot of early work in this modern era was presented at SC, particularly SC17, where we scaled up to tens of thousands of nodes on the Cori supercomputer. The paper we presented was titled “Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data.”

Back then, government HPC centers and the scientific community were pushing the boundaries of scale, but the frameworks for doing this easily, like TensorFlow and PyTorch, weren’t as developed, so a lot of the work was done manually. Now, HPC and AI have come together more seamlessly, especially in areas like large language models, and now industry is really pushing the envelope. Deep learning for the sciences has also moved beyond proof-of-concept stages to production use cases, and, in many instances, these are now achieving or surpassing the scale of traditional scientific modeling, simulation, or data analysis methods.

One of the most exciting developments is the ability to scale these models, particularly by leveraging large GPU systems. Initially, data parallelism—splitting the training set across different devices—was the main approach. But now, we’re seeing advancements in model parallelism, where the model itself is split across devices. This is a key recent development, driven largely by industry for large language models. What’s even more promising is extending these techniques beyond large language models and applying them to scientific and industrial applications. Further developments are needed if we are to create foundation models for science that can transfer learning across vastly different scientific examples and industry problems. We’re starting to see some of this, but there’s more work to be done.
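To make the distinction concrete, here is a minimal, hypothetical PyTorch sketch (not part of the tutorial materials): in model parallelism the layers themselves are placed on different GPUs, while in data parallelism each GPU holds a full replica of the model and sees a different slice of the batch.

```python
# A toy, hypothetical sketch (not from the tutorial materials) contrasting the two
# parallelism styles in PyTorch. Assumes a machine with at least two GPUs.
import torch
import torch.nn as nn

# --- Model parallelism: the model itself is split across devices. ---
class TwoStageModel(nn.Module):
    """First half of the network lives on GPU 0, the second half on GPU 1."""

    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))  # activations hop between devices

model = TwoStageModel()
out = model(torch.randn(32, 1024))  # output tensor ends up on cuda:1

# --- Data parallelism: every device holds a full replica of the model and sees a
# different shard of the batch; gradients are averaged across replicas each step.
# (See the NCCL/DistributedDataParallel sketch later in this article.)
```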


Bringing these techniques to a broader audience is what our tutorial is about—expanding these approaches beyond just large language model frameworks to something more universally applicable.

SC24: What are the primary challenges associated with scaling deep neural network models? How are these challenges being addressed within the broader scientific and industrial communities?

Bhimji: Despite all the progress we’ve discussed, scaling models remains a significant challenge. It’s not as simple as taking something that works on a single GPU and scaling it up to a large HPC machine to exploit all its resources. We see this limitation, particularly in model sizes, which are constrained by the available high-bandwidth memory on a single GPU. As GPU technology evolves, we will see leaps in what can be done, but it highlights the difficulties in scaling.

This challenge stems from several factors. First, deep learning scaling—whether through data parallelism, model parallelism, or pipeline parallelism—is complex and highly dependent on the specific problem and model. This complexity can make it difficult for people to navigate.

Once you’re operating at scale, you often need to redo hyperparameter tuning, which is expensive. While methodological changes could improve this, there’s also a need for better tooling and software. There is a lot of software available to retrain or fine-tune large language models, but beyond that, tools are sparse. Different use cases and models require different approaches, and that diversity adds to the challenge.

Another challenge is the ability to reuse expensive models. When a huge model consumes a significant amount of HPC system time for training, being able to reuse that model efficiently becomes critical. The gap in tools and technologies that allow for this reuse is another area that needs addressing.

SC24: What opportunities do you see for the advancement of deep learning at scale? Are there emerging technologies or methodologies that hold particular promise for overcoming current limitations?

Bhimji: Certainly, addressing some of the previously mentioned challenges is key to advancing the field. Building platforms that allow for fine-tuning, sharing foundation models, and experimentation is an important part of what’s needed. Another promising area is the development of resilient models through data-driven learning, where models are stress-tested with different datasets. Again, building platforms that enable people to retrain models with various datasets is crucial.

Another important aspect is benchmarking. We’ve been heavily involved in the HPC working group of the MLPerf benchmark suite, which is part of MLCommons. MLPerf has become an industry-standard benchmarking suite, and tools like that are very important for advancing the field.

SC24: Can you share some of the more extreme examples of how deep learning at scale is being applied to real-world problems across various domains? What impact are these applications having?

Bhimji: Most of the examples I’m familiar with are in the realm of science. One example is in weather and climate prediction. At NERSC, we collaborated with NVIDIA and others to develop the FourCastNet model, which was the first deep learning model to approach the skill of numerical weather prediction. This model has since been built upon by various other efforts, and now several groups have state-of-the-art models. This work is now being extended to conduct large-scale ensembles for climate applications, with people in my group actively collaborating on these projects.

In addition, we’ve seen significant advancements in simulation, particularly in particle physics. For example, deep learning is being used to model detectors at the Large Hadron Collider with incredible precision, offering orders of magnitude improvement over traditional methods. The advantage of these deep learning approaches is that once a model is trained—which can require extensive HPC resources—running it to perform new simulations is dramatically faster, sometimes up to 10,000 times faster than traditional approaches. This speed opens up new possibilities, such as conducting large-scale ensembles for extreme weather forecasting and other applications.


Another important area is anomaly detection. At the Large Hadron Collider, for example, deep learning is being used to find new fundamental particles, potentially unveiling secrets of the universe such as dark matter. Traditional methods allow you to search in a model-dependent way, but deep learning and AI enable us to explore beyond known theories, potentially discovering what we might otherwise miss. This capability is opening new doors for science applications that were previously impossible.

SC24: What strategies and techniques are essential for optimizing the performance of deep learning models on HPC systems? Are there specific tools or approaches that you find particularly effective?

Bhimji: I think a multi-layered approach is essential. First, it’s important to optimize at the single device level, such as the single GPU. Using profilers to identify bottlenecks and track improvements is a critical first step. One common area where bottlenecks occur, especially for new users, is in data loading. This isn’t unique to AI and machine learning, but it’s prevalent in those areas.
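As a rough illustration of that first step, the sketch below (using a placeholder model and synthetic data, not an example from the tutorial) profiles a single-GPU training loop with PyTorch’s built-in profiler; parallel DataLoader workers and pinned memory are the usual first remedies when the profile shows the GPU waiting on input data.

```python
# A minimal, hypothetical sketch of profiling a single-GPU training loop to spot
# data-loading bottlenecks. The model and dataset are placeholders.
import torch
from torch.profiler import profile, ProfilerActivity
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
# Parallel workers and pinned memory are common fixes for input-pipeline stalls.
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

# Large gaps between CUDA kernels in this table usually point at the data pipeline.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```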

Moving to more advanced techniques, just-in-time compilation can be used to fuse kernels and keep everything running on the GPU. For AI and deep learning, it’s crucial to leverage the compute power of GPUs, particularly as architectures evolve to favor lower-precision operations through units such as tensor cores. Using mixed precision to take advantage of these tensor cores is becoming increasingly important.
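A minimal sketch of those two techniques in PyTorch, assuming a recent 2.x release and a placeholder model: torch.compile provides the just-in-time compilation and kernel fusion, while autocast runs the forward pass in half precision so the matrix multiplications land on the tensor cores.

```python
# A minimal sketch (placeholder model and data) of JIT compilation plus mixed
# precision in PyTorch. Assumes PyTorch 2.x and an NVIDIA GPU with tensor cores.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
model = torch.compile(model)                      # JIT-compile and fuse kernels

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # guards against fp16 underflow

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    # Run the forward pass in half precision so matmuls hit the tensor cores.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()                 # keep gradients in range
    scaler.step(optimizer)
    scaler.update()
```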

Once you’ve optimized at the single GPU level, the next step is scaling to large HPC machines. Here, it’s important to ensure the HPC system is well-configured for distributed learning, using libraries like NCCL (NVIDIA Collective Communication Library) for NVIDIA GPUs. Balancing efficiency on single devices with overall time to solution is key, as is optimizing parallelization strategies.
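For reference, here is a bare-bones sketch of that configuration in PyTorch, assuming the job is launched with torchrun so the rank environment variables are set; the model is only a placeholder.

```python
# A minimal sketch of distributed data-parallel training over the NCCL backend.
# Assumes launch via torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # NCCL handles GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda()
    model = DDP(model, device_ids=[local_rank])   # replicas sync grads via all-reduce

    # ... build a DistributedSampler-backed DataLoader and train as usual ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A typical launch across, say, two nodes with four GPUs each might look like `torchrun --nnodes=2 --nproc_per_node=4 train.py`, with torchrun handling the rendezvous between nodes.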

Data parallelism is often the most efficient approach, especially when models can fit on a single GPU. However, for larger models, more complex strategies involving model parallelism, along with data parallelism, may be necessary. Our tutorial offers advice on these strategies, and we’re also developing tools to help identify the optimal pattern for the system you’re running on. I think these tools will become increasingly important as the field progresses.
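One widely used example of such a strategy, sketched below under the same torchrun/NCCL assumptions as above, is PyTorch’s Fully Sharded Data Parallel (FSDP), which shards parameters, gradients, and optimizer state across ranks when the full model will not fit on a single GPU; it is shown here as a general illustration rather than the specific tooling used in the tutorial.

```python
# A hedged sketch of sharded training with PyTorch FSDP for a model too large to
# replicate in full on every GPU. Assumes a torchrun launch and the NCCL backend.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in for a model that would exhaust a single GPU's high-bandwidth memory.
big_model = torch.nn.Sequential(*[torch.nn.Linear(8192, 8192) for _ in range(24)])

# FSDP shards parameters, gradients, and optimizer state across ranks, gathering
# each layer's weights only while that layer is being computed.
sharded = FSDP(big_model, device_id=local_rank)

x = torch.randn(8, 8192, device="cuda")
loss = sharded(x).sum()
loss.backward()

dist.destroy_process_group()
```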

SC24: Looking ahead, what innovations and trends do you anticipate will shape the future of deep learning at scale? How do you see the field evolving over the next few years, and what new challenges might arise?

Bhimji: I think we’ll see AI becoming increasingly pervasive in science and HPC. Wherever AI can benefit an application, we should enable it. This means not only applying AI to scientific simulation and data analysis but also to the operation of HPC systems, experiment design, and automation.

A key trend will be the development of robust scientific models that can be transferred across different problems. These models need to be resilient and have the infrastructure for thorough testing and reliable transfer. To support this, we’ll need to build integrated platforms that combine HPC systems with the ability to handle other scientific simulations, data pipelines, and AI. Effective workflow management for these systems and everything running on them will also be crucial.

SC24: You are hosting a tutorial on Deep Learning at Scale at SC24 this year. Who should be attending this session, and what types of information do you expect them to come away with?

Bhimji: We’ve been running this session since SC18 with various partners over the years. This year, it’s a collaboration between NERSC, NVIDIA, and Oak Ridge National Laboratory. The session is relevant to anyone who wants to run deep learning at scale and learn the latest tips, tricks, and methodologies. There are various challenges people might face if they want to move beyond training a single model on a single device. We cover topics like that, but the primary focus is on scaling up to training at large scales.


The session has evolved over the years, and now we incorporate real large-scale science use cases. For example, we’ll be running a model-parallel extreme weather forecasting case, which I mentioned earlier. Participants will have hands-on access to Perlmutter, the newest supercomputer at NERSC, and they’ll be able to run examples on dozens of GPUs. We’ll also cover single-GPU profiling and optimization, with our NVIDIA colleagues leading that part. And with Oak Ridge involved this year, we’ll have some AMD examples and scaling approaches on Frontier. This is a great opportunity to learn from people pushing the boundaries of deep learning and connect with peers who are facing the same challenges.

Join Us in Atlanta

Collaboration and continuous learning are key to realizing supercomputing’s full potential. SC24 offers an opportunity to expand your knowledge and enrich your experiences within the HPC community.

Attendees engage with technical presentations, papers, workshops, tutorials, posters, and Birds of a Feather (BOF) sessions – all designed to showcase the latest innovations and practical applications in AI and HPC. The conference offers a unique platform where experts from leading manufacturers, research organizations, industry, and academia come together to share insights and advancements that are driving the future.

Join us for a week of innovation at SC24 in Atlanta, November 17-22, 2024, where you can discover the future of quantum, supercomputing, and more. Registration is open!

