# 🤖 MegaBlocks

MegaBlocks is a lightweight library for mixture-of-experts (MoE) training. The core of the system is its efficient "dropless-MoE" (dMoE; see the paper) and standard MoE layers.

MegaBlocks is integrated with Megatron-LM, where we support data, expert and pipeline parallel training of MoEs. Stay tuned for tighter integration with Databricks libraries and tools!

## 🚀 Performance

*(Figure: MegaBlocks performance.)*

MegaBlocks dMoEs outperform MoEs trained with Tutel by up to 40% over Tutel's best-performing `capacity_factor` configuration. MegaBlocks dMoEs use a reformulation of MoEs in terms of block-sparse operations, which allows us to avoid token dropping without sacrificing hardware efficiency. In addition to being faster, MegaBlocks simplifies MoE training by removing the `capacity_factor` hyperparameter altogether. Compared to dense Transformers trained with Megatron-LM, MegaBlocks dMoEs can accelerate training by as much as 2.4x. Check out our paper for more details!

## 🏗️ Installation

NOTE: This assumes you have `numpy` and `torch` installed.

**Training models with Megatron-LM:** We recommend using NGC's `nvcr.io/nvidia/pytorch:23.09-py3` PyTorch container. The Dockerfile builds on this image with additional dependencies. To build the image, run `docker build . -t megablocks-dev` and then `bash docker.sh` to launch the container. Once inside the container, install MegaBlocks with `pip install .`. See Usage for instructions on training MoEs with MegaBlocks + Megatron-LM.

**Using MegaBlocks in other packages:** To install the MegaBlocks package for use in other frameworks, run `pip install megablocks`. For example, Mixtral-8x7B can be run with vLLM + MegaBlocks with this installation method.
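
As a quick orientation after installing the pip package, the sketch below constructs a standalone dMoE layer in PyTorch. It assumes the `megablocks.layers.arguments.Arguments` config and the `megablocks.layers.dmoe.dMoE` module shipped with the package; field names and the forward return type can vary between releases, so treat this as an illustrative sketch rather than a reference.

```python
# Illustrative sketch (not a reference implementation): build a standalone dMoE
# layer from the pip-installed `megablocks` package. The config fields below are
# assumptions based on the package's Arguments dataclass and may differ by version.
import torch

from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

args = Arguments(
    hidden_size=1024,       # token hidden dimension
    ffn_hidden_size=4096,   # per-expert FFN hidden dimension
    moe_num_experts=8,      # total number of experts
    moe_top_k=2,            # experts routed to per token
)

# The block-sparse kernels target CUDA and half precision.
layer = dMoE(args).cuda().half()

x = torch.randn(8, 512, 1024, device="cuda", dtype=torch.float16)
out = layer(x)
# Depending on the release, forward may return a tensor or an (output, bias)
# tuple; unpack defensively.
if isinstance(out, tuple):
    out = out[0]
print(out.shape)  # (8, 512, 1024)
```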

**Extras:** MegaBlocks has optional dependencies that enable additional features.

- Installing `megablocks[gg]` enables dMoE computation with grouped GEMM. This feature is enabled by setting the `mlp_impl` argument to `grouped` (see the sketch after this list). This is currently our recommended path for Hopper-generation GPUs.
- MegaBlocks can be installed with all dependencies via the `megablocks[all]` package.
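
Relative to the earlier sketch, selecting the grouped GEMM path is just a config change. The `mlp_impl` name comes from the text above; the rest is an assumption about the pip package and may differ by version.

```python
# Assumed continuation of the earlier sketch: pick the grouped-GEMM MLP
# implementation (requires the `megablocks[gg]` extra to be installed).
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

gg_args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=8,
    moe_top_k=2,
    mlp_impl="grouped",  # instead of the block-sparse "sparse" path
)
layer = dMoE(gg_args).cuda().half()
```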

## 🚂 Usage

We provide scripts for pre-training Transformer MoE and dMoE language models under the top-level directory. The quickest way to get started is to use one of the experiment launch scripts. These scripts require a dataset in Megatron-LM's format, which can be created by following their instructions.

## ✍️ Citation

```bibtex
@article{megablocks,
  title={{MegaBlocks: Efficient Sparse Training with Mixture-of-Experts}},
  author={Trevor Gale and Deepak Narayanan and Cliff Young and Matei Zaharia},
  journal={Proceedings of Machine Learning and Systems},
  volume={5},
  year={2023}
}
```