MosaicML launches MPT-7B-8K, a 7B-parameter open-source LLM with 8k context length

MosaicML has unveiled MPT-7B-8K, an open-source large language model (LLM) with 7 billion parameters and an 8k context length.

According to the company, the model was trained on the MosaicML platform, starting from the MPT-7B checkpoint. The additional pretraining ran for three days on 256 Nvidia H100 GPUs and covered a further 500 billion tokens of data.
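To put those training figures in perspective, the short Python sketch below works out the average throughput they imply. It assumes the three days were continuous wall-clock time with no downtime, which the announcement does not state, so the result is a rough estimate rather than a reported benchmark. The function name is illustrative.

```python
# Figures quoted in the article for the 8k-context training stage.
TOKENS = 500e9   # additional tokens processed
GPUS = 256       # Nvidia H100 GPUs
DAYS = 3         # training time; assumed to be continuous wall-clock time

SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_gpu_second(tokens: float, gpus: int, days: float) -> float:
    """Average tokens processed per GPU per second, assuming no downtime."""
    gpu_seconds = gpus * days * SECONDS_PER_DAY
    return tokens / gpu_seconds


rate = tokens_per_gpu_second(TOKENS, GPUS, DAYS)
print(f"{rate:,.0f} tokens per GPU per second")
```

Under that assumption the run averages roughly 7,500 tokens per GPU per second.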

Previously, MosaicML had made waves in the AI community with its release of MPT-30B, an open-source, decoder-based LLM licensed for commercial use. The company claimed it was more powerful than the 175-billion-parameter GPT-3 despite having 30 billion parameters, about 17% as many.

According to MosaicML, MPT-30B surpassed GPT-3's performance across various tasks and proved more efficient to train than models of similar size. For instance, LLaMA-30B used approximately 1.44 times the FLOPs budget of MPT-30B, while Falcon-40B used about 1.27 times as much.
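The ratios quoted above are easy to check. The following Python sketch recomputes the parameter share and converts the relative FLOPs budgets into the compute MPT-30B saved against each model. It uses only the numbers given in the article; the variable names are illustrative.

```python
# Parameter counts quoted in the article.
GPT3_PARAMS = 175e9
MPT30B_PARAMS = 30e9

param_share = MPT30B_PARAMS / GPT3_PARAMS
print(f"MPT-30B has {param_share:.0%} of GPT-3's parameters")

# Training FLOPs budgets relative to MPT-30B, as quoted in the article.
relative_flops = {"LLaMA-30B": 1.44, "Falcon-40B": 1.27}

# If another model needed `ratio` times MPT-30B's budget, MPT-30B used
# 1/ratio of that model's budget, i.e. it saved a fraction of 1 - 1/ratio.
savings = {name: 1 - 1 / ratio for name, ratio in relative_flops.items()}
for name, saving in savings.items():
    print(f"MPT-30B used {saving:.0%} less training compute than {name}")
```

By these figures, MPT-30B's training budget was about 31% smaller than LLaMA-30B's and about 21% smaller than Falcon-40B's.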

MosaicML claims that the new model MPT-7B-8K exhibits exceptional proficiency in document summarization and question-answering tasks compared to all previously released models.

The company said the model is specifically optimized for faster training and inference. It can also be fine-tuned on domain-specific data within the MosaicML platform.

The company has also made MPT-7B-8K available under a commercial-use license, and notes that the model was trained on 1.5 trillion tokens, more than similar models such as XGen, LLaMA, Pythia, OpenLLaMA and StableLM.

MosaicML claims that through the use of FlashAttention and FasterTransformer, the model excels in rapid training and inference while benefiting from the open-source training code available through the llm-foundry repository.

The company has released the model in three variations:

MPT-7B-8k-Base: This decoder-style transformer starts from the pretrained MPT-7B and is further trained with an extended sequence length of 8k. The additional 500 billion tokens bring its total training data to 1.5 trillion tokens of text and code.

MPT-7B-8k-Instruct: This model is designed for long-form instruction tasks, including summarization and question-answering. It is crafted by fine-tuning MPT-7B-8k using carefully curated datasets.

MPT-7B-8k-Chat: This variant functions as a chatbot-like model, focusing on dialogue generation. It is created by fine-tuning MPT-7B-8k with approximately 1.5 billion tokens of chat data.
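The token counts quoted for the three variants can be tied together with a little arithmetic. The sketch below assumes the original MPT-7B checkpoint accounts for the 1 trillion tokens left after subtracting the 500 billion additional tokens from the 1.5 trillion total; the article implies this figure but does not state it directly.

```python
# Token counts quoted in the article.
TOTAL_TOKENS = 1.5e12           # full training corpus for the 8k base model
EXTRA_8K_TOKENS = 500e9         # additional training at 8k sequence length
CHAT_FINETUNE_TOKENS = 1.5e9    # approximate chat fine-tuning data

# Assumption: the remainder comes from the original MPT-7B checkpoint.
base_tokens = TOTAL_TOKENS - EXTRA_8K_TOKENS

extra_share = EXTRA_8K_TOKENS / TOTAL_TOKENS
chat_share = CHAT_FINETUNE_TOKENS / TOTAL_TOKENS

print(f"Tokens from the MPT-7B checkpoint: {base_tokens:.1e}")
print(f"Share of training done at 8k context: {extra_share:.0%}")
print(f"Chat data relative to the pretraining corpus: {chat_share:.1%}")
```

On these numbers, about a third of the training tokens were seen at the 8k sequence length, and the chat fine-tuning data is around 0.1% of the size of the pretraining corpus.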

MosaicML asserts that the MPT-7B-8K models perform comparably to or better than other currently available open-source models with an 8k context length, as measured by the company's in-context learning evaluation harness.

The announcement coincides with Meta's unveiling of LLaMA 2, now available on Microsoft Azure. LLaMA 2 comes in three sizes, with 7, 13 and 70 billion parameters.

Meta says the pretrained models were trained on two trillion tokens, 40% more data than LLaMA 1, and have double the context length of their predecessor. According to Meta's benchmarks, LLaMA 2 outperforms LLaMA 1.