From 55bc2e60c4e37beeae7628816bd8574868bdb16e Mon Sep 17 00:00:00 2001
From: pouya samie
Date: Mon, 18 Mar 2024 13:03:01 +0330
Subject: [PATCH] Update readme and add break downs for each class

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 7c3db3a..70fc3da 100644
--- a/README.md
+++ b/README.md
@@ -137,7 +137,7 @@ The `LM_PARTITION_RULES` list contains the following rules:
 - #### `(("language_model", "positional_embeddings"), P(None, ("data", "model")))`:
- This rule matches the positional embeddings tensor in the language model module. The PartitionSpec P(None, ("data", "model")) specifies that this tensor should be partitioned along the "data" and "model" dimensions, but not partitioned along the leading dimension (represented by None). This means that the positional embeddings will be split across multiple devices along the "data" and "model" dimensions, but replicated along the leading dimension (e.g., batch dimension).
+ This rule matches the positional embeddings tensor in the language model module. The PartitionSpec `P(None, ("data", "model"))` specifies that this tensor should be partitioned along the "data" and "model" dimensions, but not partitioned along the leading dimension (represented by None). This means that the positional embeddings will be split across multiple devices along the "data" and "model" dimensions, but replicated along the leading dimension (e.g., batch dimension).
 - #### `(("language_model", "in_out_embed", "embeddings"), P(None, ("data", "model")))`:
  This rule matches the embeddings tensor of the InOutEmbed module (used for input and output embeddings) in the language model. Similar to the previous rule, it specifies that this tensor should be partitioned along the "data" and "model" dimensions, while being replicated along the leading dimension.
@@ -282,7 +282,7 @@ The `MoELayer` class is a module that implements the Mixture of Experts (MoE) la
 - It selects the top `num_selected_experts` experts for each input token based on the routing probabilities.
-` It creates a broadcasted version of the input tensor, duplicating it num_selected_experts times for each token position.
+- It creates a broadcasted version of the input tensor, duplicating it num_selected_experts times for each token position.
 - It initializes the expert networks (specified by `layer_fn`) by creating a batched version of the initialization function.
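
Reviewer note on the second hunk: the two `MoELayer` steps it touches (selecting the top `num_selected_experts` experts per token, then duplicating each token's input once per selected expert) can be sketched roughly as below. This is only an illustration under stated assumptions, not the repository's JAX implementation: nested Python lists stand in for tensors, and the helper names `select_top_experts` and `broadcast_inputs` are hypothetical.

```python
# Illustrative sketch only: plain lists stand in for JAX arrays, and both
# helper names are hypothetical (they do not come from the patched README).

def select_top_experts(routing_probs, num_selected_experts):
    """Return, for each token, the indices of its highest-probability experts."""
    selected = []
    for probs in routing_probs:
        # Rank expert indices by routing probability, highest first.
        ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
        selected.append(ranked[:num_selected_experts])
    return selected


def broadcast_inputs(tokens, num_selected_experts):
    """Duplicate each token vector once per selected expert."""
    return [[list(tok) for _ in range(num_selected_experts)] for tok in tokens]


# Two tokens, three experts, two experts selected per token.
routing_probs = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]]
tokens = [[1.0, 2.0], [3.0, 4.0]]

top_experts = select_top_experts(routing_probs, num_selected_experts=2)
expert_inputs = broadcast_inputs(tokens, num_selected_experts=2)
```

Here `expert_inputs` has one copy of each token per selected expert, so its nesting is (tokens, selected experts, features), and `top_experts[i][j]` names the expert that would receive copy `j` of token `i`.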