Google Open-Sources Trillion-Parameter AI Language Model Switch Transformer

Researchers at Google Brain have open-sourced the Switch Transformer, a natural-language processing (NLP) AI model. The model scales up to 1.6T parameters and improves training time up to 7x compared to the T5 NLP model, with comparable accuracy.

The team described the model in a paper published on arXiv. The Switch Transformer uses a mixture-of-experts (MoE) paradigm, replacing the Transformer's dense feed-forward layers with sets of expert feed-forward blocks and a router that sends each input token to one of them. Because only a subset of the model is used to process a given input, the number of model parameters can be increased while holding computational cost steady. Compared to Google's state-of-the-art T5 NLP model, baseline versions of the Switch Transformer can achieve target pre-training perplexity metrics in 1/7 the training time. The 1.6T-parameter version outperforms T5-XXL on the perplexity metric, with comparable or better performance on downstream NLP tasks, despite training on half the data.
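To see why this kind of sparsity decouples parameter count from compute, consider a back-of-the-envelope sketch in Python. The layer sizes and expert counts below are illustrative assumptions, not figures from the paper; the point is only that adding experts multiplies the parameter count while per-token compute stays roughly constant, because each token still passes through a single expert-sized feed-forward block plus a small router.

```python
# Illustrative sizes only (not the paper's); counts are approximate.
d_model, d_ff = 1024, 4096
ffn_params = 2 * d_model * d_ff                 # two weight matrices per expert FFN

for num_experts in (1, 8, 64):
    router_params = d_model * num_experts       # tiny linear router over the experts
    total_params = num_experts * ffn_params + router_params
    # Each token still visits exactly one expert: ~2 FLOPs per weight actually used.
    flops_per_token = 2 * (ffn_params + router_params)
    print(f"experts={num_experts:>3}  params={total_params:>12,}  "
          f"flops/token~{flops_per_token:,}")
```

Growing the expert count from 1 to 64 in this sketch multiplies the parameters by roughly 64x while leaving the per-token FLOPs almost unchanged.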

The Transformer architecture has become the primary deep-learning model used for NLP research. Recent efforts have focused on increasing the size of these models, measured in number of parameters, with results that can exceed human performance. A team from OpenAI, creators of the GPT-3 model, found that NLP performance does indeed scale with number of parameters, following a power-law relationship. In developing the Switch Transformer, the Google Brain team sought to maximize parameter count while keeping constant the number of FLOPS per training example and training on "relatively small amounts of data."

To achieve this, the model uses a mixture-of-experts (MoE) scheme. MoE was developed in 1991 by a research team that included deep-learning pioneer Geoff Hinton, then at the University of Toronto and now at Google Brain. In 2017, Hinton and Google Brain colleagues used MoE to create an NLP model based on a recurrent neural network (RNN) with 137B parameters, which achieved state-of-the-art results on language-modeling and machine-translation benchmarks.

The Switch Transformer uses a modified MoE algorithm called Switch Routing: instead of activating multiple experts and combining their outputs, Switch Routing sends each input to a single expert. This simplifies the routing computation and reduces communication costs, since the individual experts are hosted on different hardware devices and each input now needs to reach only one of them. One drawback of the scheme, however, is an increased chance of training instability, especially when using reduced-precision arithmetic, because of the "hard" switching decisions. The team mitigated this by reducing the scale factor used to initialize the model parameters.
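As a concrete illustration of the routing step, the sketch below implements top-1 ("switch") routing for a small batch of token vectors in plain NumPy. It is not the released Mesh-TensorFlow code; the dimensions, weight names, and the bare ReLU experts are assumptions chosen for readability, and the paper's auxiliary load-balancing loss and expert capacity limits are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, d_ff, num_experts = 8, 16, 64, 4

tokens = rng.normal(size=(num_tokens, d_model))
w_router = 0.1 * rng.normal(size=(d_model, num_experts))     # small init scale
w_in = 0.1 * rng.normal(size=(num_experts, d_model, d_ff))   # per-expert FFN weights
w_out = 0.1 * rng.normal(size=(num_experts, d_ff, d_model))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# 1. The router scores every expert for every token.
router_probs = softmax(tokens @ w_router)                    # (num_tokens, num_experts)

# 2. Switch routing: keep only the single highest-probability expert per token.
expert_index = router_probs.argmax(axis=-1)                  # (num_tokens,)
gate = router_probs[np.arange(num_tokens), expert_index]     # chosen expert's probability

# 3. Each token is processed by its chosen expert only; the output is scaled by
#    the gate value so the router would receive a gradient in an autodiff setup.
outputs = np.zeros_like(tokens)
for e in range(num_experts):
    idx = np.where(expert_index == e)[0]
    if idx.size == 0:
        continue                                             # this expert got no tokens
    hidden = np.maximum(tokens[idx] @ w_in[e], 0.0)          # ReLU feed-forward expert
    outputs[idx] = (hidden @ w_out[e]) * gate[idx, None]

print(outputs.shape)  # (8, 16) -- same shape as the input
```

In a distributed setting, the loop over experts corresponds to dispatching each token to the one device that holds its chosen expert, which is where the communication savings described above come from.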

The team used Mesh-TensorFlow (MTF) to train the model, taking advantage of both data- and model-parallelism. To investigate how the architecture performs at different scales, the team trained models ranging from 223M to 1.6T parameters, finding that the "most efficient dimension for scaling" was the number of experts. Model performance on pre-training and downstream NLP tasks was compared against T5 models requiring similar FLOPs per sample. Baseline-sized Switch Transformer models outperformed T5 on the GLUE, SuperGLUE, and SQuAD benchmarks while achieving a 7x speedup in pre-training time. The large-scale Switch Transformer, with 1.6T parameters and 2048 experts, outperformed a 13B-parameter T5 model in pre-training perplexity, while finishing in 1/4 the time.

In a discussion on Reddit, commenters pointed out that the Google Brain team did not compare their model's performance to GPT-3, speculating that this was due to a lack of information in OpenAI's published results. Another commenter noted:

[T]he time to accuracy gains are remarkable, albeit coming at a cost for hardware requirements. All these are non-issues for Google, but I can see why OpenAI isn't too keen on these models, at least, so far.

Although Google has not released the pre-trained model weights for the Switch Transformer, the implementation code is available on GitHub.  
