Shiwei Liu

SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

arXiv 2025

Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

arXiv 2025

GradientStabilizer: Fix the Norm, Not the Gradient

arXiv 2025

SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers

arXiv 2025

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

arXiv 2025

Diffusion Language Models Know the Answer Before Decoding

arXiv 2025

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

arXiv 2025

Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

arXiv 2025

LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning

arXiv 2025

GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching

arXiv 2025

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

arXiv 2024

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding

arXiv 2024

From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

arXiv 2024

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

arXiv 2024

Composable Interventions for Language Models

arXiv 2024

Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

arXiv 2024

AdaMerging: Adaptive Model Merging for Multi-Task Learning

arXiv 2023

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

arXiv 2023

Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs

arXiv 2023

The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
