A Way to Reduce AI Model Training Costs That Run to Hundreds of Billions of Won
Recently, large-scale artificial intelligence (AI) models such as ChatGPT and DeepSeek have been used across a wide range of fields and are drawing considerable attention.
Large-scale language models are trained on large-scale distributed systems equipped with tens of thousands of data-center GPUs. In the case of GPT-4, training the model is estimated to have cost approximately 140 billion won.
A Korean research team has developed a technology that helps derive the optimal parallelization configuration, increasing GPU utilization and reducing training costs.
Changes in MT-NLG training time and GPU utilization according to various parallelization techniques. [Photo = KAIST]
The Korea Advanced Institute of Science and Technology (KAIST, President Lee Kwang-hyung) announced on the 13th that the research team of Professor Yoo Min-soo of the Department of Electrical and Electronic Engineering, in collaboration with Samsung Electronics' Samsung Advanced Institute of Technology, has developed a simulation framework (vTrain) that can predict and optimize the training time of large-scale language models (LLMs) in large-scale distributed systems.
Finding the optimal distributed training strategy is essential to improving the training efficiency of large-scale language models. However, the number of possible strategies is enormous, and testing the performance of each one in a real environment takes a great deal of time and money.
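To make the scale of that search space concrete, here is a minimal sketch (illustrative only, not vTrain code; the function name and micro-batch choices are assumptions) that enumerates 3D-parallel configurations, where the data-, tensor-, and pipeline-parallel degrees must multiply to the GPU count:

```python
def enumerate_3d_configs(num_gpus, micro_batches=(1, 2, 4, 8)):
    """List (data, tensor, pipeline) parallel degrees whose product
    equals num_gpus, crossed with candidate micro-batch sizes."""
    configs = []
    for dp in range(1, num_gpus + 1):
        if num_gpus % dp:
            continue
        remainder = num_gpus // dp
        for tp in range(1, remainder + 1):
            if remainder % tp:
                continue
            pp = remainder // tp
            for mb in micro_batches:
                configs.append((dp, tp, pp, mb))
    return configs

# The candidate count grows with cluster size, and real search spaces
# include further knobs (activation recomputation, batch size, etc.):
for n in (8, 64, 512, 4096):
    print(f"{n} GPUs -> {len(enumerate_3d_configs(n))} candidate configurations")
```

Each of these candidates would cost real GPU-hours to benchmark directly, which is why trying them all on hardware is impractical.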
Currently, companies that train large-scale language models rely on only a small number of empirically verified strategies, which leads to low GPU utilization and unnecessary cost increases. The lack of simulation technology for large-scale systems has kept them from addressing this problem effectively.
The KAIST research team developed vTrain to accurately predict the training time of large-scale language models and to quickly explore various distributed parallelization strategies.
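The "explore" step can be pictured with a minimal sketch: rank candidate configurations by predicted time instead of benchmarking each one on real hardware. Everything below is hypothetical and illustrative; `predict_iteration_ms` is a toy stand-in for a profiling-based simulator such as vTrain, not its actual API.

```python
def predict_iteration_ms(dp: int, tp: int, pp: int) -> float:
    """Toy stand-in for a simulator's time prediction (NOT vTrain's API).
    Real predictions come from profiled per-kernel compute times plus
    modeled communication costs."""
    if tp * pp < 4:                          # toy memory constraint: shard the
        return float("inf")                  # model across at least 4 GPUs
    compute = 1000.0 / (dp * tp * pp)        # ideal compute scaling
    comm = 5.0 * (tp - 1) + 3.0 * (pp - 1)   # toy communication penalty
    return compute + comm

# Candidate (data, tensor, pipeline) degrees for a 64-GPU cluster:
candidates = [(64, 1, 1), (16, 2, 2), (16, 4, 1), (8, 4, 2), (4, 8, 2)]
best = min(candidates, key=lambda c: predict_iteration_ms(*c))
print("predicted-best (dp, tp, pp):", best)  # -> (16, 2, 2) under this toy model
```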
The research team compared vTrain's predictions against training times measured for various large-scale language models in an actual multi-GPU environment, and verified that vTrain predicts training time with a mean absolute percentage error (MAPE) of 8.37% on a single node and 14.73% across multiple nodes.
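For context, MAPE (mean absolute percentage error) is computed as follows; this is a generic sketch with made-up numbers, not data or code from the paper.

```python
def mape(measured, predicted):
    """Mean absolute percentage error between measured and predicted
    iteration times (same units, e.g. milliseconds)."""
    errors = [abs(m - p) / m for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Illustrative values only (not from the paper):
measured = [112.0, 245.5, 389.2]    # measured ms per training iteration
predicted = [105.3, 260.1, 371.0]   # simulator-predicted ms per iteration
print(f"MAPE = {mape(measured, predicted):.2f}%")
```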
The research team conducted the work jointly with Samsung Electronics' Samsung Advanced Institute of Technology and released the vTrain framework, along with more than 1,500 actual training-time measurements, as open source so that AI researchers and companies can use them freely.
Professor Yoo Min-soo said, "vTrain is a profiling-based simulation technique that explores training strategies that can raise GPU utilization and lower training costs compared with existing empirical methods, and we have released it as open source." He added, "This will allow companies to efficiently reduce the cost of training ultra-large AI models."
The results of this research (paper title: vTrain: A Simulation Framework for Evaluating Cost-Effective and Compute-Optimal Large Language Model Training), with Ph.D. candidate Bang Je-hyun as first author, were presented in November at the IEEE/ACM International Symposium on Microarchitecture (MICRO), one of the top academic conferences in the field of computer architecture.
https://www.inews24.com/view/blogger/1822748