Transformer-based Large Language Models (LLMs) have become central to a wide range of applications, delivering state-of-the-art performance across diverse domains [26, 27, 29, 34, 44, 51, 64, 80, 84, 86, 90]. To improve task performance, LLMs are reaching unprecedented scales, with models such as LLaMA 3.1 (405B) [34], DeepSeek-V3 (671B) [27], and Kimi-K2 (1T) [78] pushing the ...