Large Language Models for Product Managers: 5 Things to Know

A Product Manager's guide to understanding Large Language Models and the building blocks of Conversational AI.

The widespread use of ChatGPT has led millions to embrace Conversational AI tools in their daily routines. From copywriting and brainstorming to customer service and email streamlining, the potential applications span vastly different areas.

With these complex algorithms often labeled as "giant black boxes" in media, there's a growing need for accurate and easy-to-understand resources, especially for Product Managers wondering how to incorporate AI into their product roadmap.

ChatGPT is part of a group of AI systems called Large Language Models (LLMs), which excel in various cognitive tasks involving natural language. This emerging technology has recently experienced remarkable growth. ChatGPT alone reached 100 million unique users soon after its launch, marking the fastest adoption of any internet service in history.

ChatGPT reached an estimated 100 million users in just 2 months (source).

LLMs are transforming the AI commercial landscape at unprecedented speed. Industry leaders like Microsoft and Google recognize the importance of LLMs in driving innovation, automation, and enhancing user experiences.

Before deciding to build with LLMs, it's crucial to grasp how they work and develop a basic understanding of their capabilities. This article offers a succinct overview of the essential aspects of language models, from foundational concepts to the latest advancements, without compromising accuracy.

1. Understanding Language Models

Language Models (LMs) are probabilistic models designed to identify and learn statistical patterns in natural language. Their main function is to calculate the probability of a word following a given input sentence.

A language model predicts the most probable word(s) to follow a phrase based on learned statistical patterns. For example, a Language Model may estimate a 91% probability that the word "blue" follows "The color of the sky is."
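
To make this concrete, here is a minimal sketch of the idea using a toy bigram model that estimates next-word probabilities from word-pair counts. Real language models use neural networks, but the probabilistic principle is the same; the corpus here is invented for illustration.

```python
from collections import Counter, defaultdict

# A toy bigram language model: estimate P(next word | current word)
# from word-pair counts in a tiny, invented corpus.
corpus = "the sky is blue . the sky is clear . the grass is green .".split()

counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_word_probs(word):
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_probs("is"))  # {'blue': 0.33..., 'clear': 0.33..., 'green': 0.33...}
```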

These models are trained using self-supervised learning, a technique that utilizes the data's inherent structure to generate labels for training.

During training, an LM is fed a large text dataset and tasked with predicting the next word in a sentence. This is typically done by truncating the last part of an input sentence and training the model to fill in the missing word(s).

As the model processes numerous examples, it learns linguistic patterns, rules, and relationships between words and concepts, creating internal representations of language and knowledge.

During training, text sequences are extracted from the corpus and truncated. The language model estimates probabilities for the missing words; these predictions are compared against the ground truth, and the model's parameters are adjusted to reduce the difference. This process is repeated across the whole text corpus.
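
As a rough illustration, the snippet below shows how self-supervised training pairs can be derived from raw text with no manual labeling: every prefix of a sentence becomes an input, and the word that follows becomes its label. The sentence is an invented example.

```python
# Deriving self-supervised training pairs from raw text: the labels
# come from the text itself, so no manual annotation is needed.
sentence = "the color of the sky is blue".split()

training_pairs = [
    (sentence[:i], sentence[i])  # (context words, next-word label)
    for i in range(1, len(sentence))
]

for context, label in training_pairs:
    print(f"input: {' '.join(context):28} -> label: {label}")
```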

The result is a pre-trained language model with a foundation for understanding natural language and generating contextually relevant, coherent text. These pre-trained models are often referred to as foundation models (for language).

How does fine-tuning benefit foundation models?

Fine-tuning is a process that unlocks a foundation model's potential by specializing it for specific tasks or domains. It refines the model's general knowledge, acquired during pre-training, to suit specialized applications.

Fine-tuning typically involves training the foundation model on a smaller, task-specific labeled dataset using supervised learning. This builds on the linguistic foundation established during pre-training, enabling the model to perform practical tasks with greater accuracy.

For instance, in machine translation, a foundation model can be fine-tuned on a parallel corpus containing sentences in the source language and their translations in the target language. This teaches the model to map linguistic structures and patterns between languages, allowing it to translate text effectively.

A foundation model can be adapted for machine translation using a parallel dataset of sentences in both languages.
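
As a hedged sketch of what this looks like in practice, the snippet below runs a couple of fine-tuning steps on a toy parallel corpus using the Hugging Face transformers library. The model choice (t5-small), the learning rate, and the two sentence pairs are illustrative assumptions, not recommendations; a real fine-tuning run would use a large dataset, batching, and many epochs.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"  # illustrative small pre-trained encoder-decoder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A (tiny) parallel corpus: source sentences and their translations.
pairs = [
    ("The sky is blue.", "Le ciel est bleu."),
    ("The grass is green.", "L'herbe est verte."),
]

for src, tgt in pairs:
    inputs = tokenizer("translate English to French: " + src, return_tensors="pt")
    labels = tokenizer(tgt, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy vs. the reference
    loss.backward()   # compute gradients
    optimizer.step()  # update the model's parameters
    optimizer.zero_grad()
```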

Fine-tuning is also used to adapt foundation models to specialized knowledge domains, such as medicine or law. This process enables the model to handle the unique vocabulary, syntax, and conventions specific to that domain.

A language model can be fine-tuned on medical documents for specialized tasks in the medical field.

2. Neural Networks and Transformers

What determines a language model's effectiveness?

The performance of LMs in various tasks is significantly influenced by the size of their architectures, which are based on artificial neural networks. These networks, inspired by the human brain, consist of interconnected layers of nodes, or "neurons," that process and learn from data.

The connections between neurons are associated with numbers, known as the neural network's parameters, which represent the strength of each connection. These parameters are adjusted during training to minimize the difference between the model's predictions and the target values.

A simple artificial neural network with three layers. Nodes (neurons) in each layer are shown as circles, and connections between nodes are lines. The parameters in the network are numerical values assigned to each connection, determining the strength of the signal passed between nodes.

In the context of language models, an increase in the number of parameters translates to an increase in an LM's storage capacity. However, it's important to note that LMs don't store information the way standard storage devices (hard drives) do: a higher number of parameters enables the model to learn more diverse patterns through the numerical relationships among its parameters.

However, larger models also require more computational resources and training data to reach their full potential.

A neural network with 100 nodes and 1842 parameters (edges). The first layer represents a numerical encoding of the input. Intermediate layers process this information by applying linear and non-linear operations. The output layer generates a single number, which, when scaled appropriately, can be interpreted as a probability estimate.
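
The toy network below makes this concrete. It is a minimal NumPy sketch with arbitrary layer sizes (not the 100-node network from the figure): the "parameters" are simply the numbers in the weight matrices and bias vectors, and counting them is straightforward.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny feedforward network: 10 inputs -> 8 hidden units -> 1 output.
layer_sizes = [10, 8, 1]
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

# The parameter count is just the total number of weights and biases:
# (10*8 + 8) + (8*1 + 1) = 97.
n_params = sum(w.size for w in weights) + sum(b.size for b in biases)
print("parameters:", n_params)

def forward(x):
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0, x @ w + b)       # linear layer + ReLU activation
    logit = x @ weights[-1] + biases[-1]   # final linear layer
    return 1 / (1 + np.exp(-logit))        # sigmoid -> value in (0, 1)

print("output:", forward(rng.normal(size=10)))
```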

Modern language models consist of multiple components or blocks, often implemented as separate neural networks, each designed for a specific task and featuring a specialized architecture. Almost all current LMs are based on a highly successful architecture: the Transformer model, introduced in 2017.

Initially developed for machine translation tasks, Transformers have revolutionized numerous applied AI fields due to their ability to process large amounts of data simultaneously (parallelization) rather than sequentially. This feature enables training on much larger datasets than previous architectures.

Transformers excel at natural language contextual understanding, making them the go-to choice for most language tasks today. Two key components contribute to their success: Word Embeddings and Attention.
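
While a full treatment of embeddings and attention is beyond this article's scope, a minimal sketch of scaled dot-product attention (the core Transformer operation) shows the idea: each token's output is a weighted mix of all tokens' value vectors, with weights derived from query-key similarity. The random vectors below stand in for word embeddings.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over one sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 tokens, embedding dimension 8
print(attention(x, x, x).shape)  # (4, 8): one context-aware vector per token
```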

3. Large Language Models

In recent years, LLM development has seen a significant increase in size, as measured by the number of parameters.

This trend started with models like the original GPT and ELMo, which had millions of parameters, and progressed to models like BERT and GPT-2, with hundreds of millions of parameters. The largest models, such as Megatron-Turing NLG and Google's PaLM, now exceed 500 billion parameters.

Put differently, in just the last four years, the size of LLMs has doubled roughly every 3.5 months on average.
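
A quick back-of-the-envelope check shows what that doubling rate compounds to:

```python
# Doubling every 3.5 months, compounded over 4 years:
months = 4 * 12
doublings = months / 3.5
print(f"{doublings:.1f} doublings -> roughly {2 ** doublings:,.0f}x more parameters")
# about 13.7 doublings -> roughly 13,000x growth
```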

Language model parameter counts over time. Note: the value axis is in log scale (source).

Training an LLM can be costly, with estimates ranging from 10 to 20 million US dollars for pre-training a model like PaLM on commercial cloud services. This figure doesn't include the costs of engineering, research, and testing associated with these complex systems.

The size of a model is only one aspect to account for during training. The size of the training dataset is also crucial.

Determining the necessary data for training an LLM is challenging. Previous heuristics suggested that increasing model size improved performance, while scaling training datasets was less important. However, recent research shows that many current LLMs are undertrained with respect to their pre-training data.

The landmark Chinchilla paper by DeepMind revealed that most current language models are undertrained and established a new set of scaling laws for LLMs.

This shift has led to new heuristics, emphasizing the importance of training large models with more extensive datasets. To fully train the next massive LLM, an immense amount of data would be required, possibly corresponding to a significant fraction, if not all, of the text data available on the internet today.
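
A commonly cited rule of thumb distilled from the Chinchilla results is roughly 20 training tokens per model parameter; this is an approximation of the paper's scaling laws, not an exact prescription. The sketch below shows what it implies for a few model sizes:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
def compute_optimal_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param

for n_params in [70e9, 175e9, 540e9]:
    tokens = compute_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B parameters -> ~{tokens / 1e12:.1f}T tokens")
# 70B parameters -> ~1.4T tokens (the setup Chinchilla itself was trained with)
```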

The implications of this perspective are profound. The total amount of available training data might be the true bottleneck for LLMs. Moreover, even an ideal model that perfectly replicates the internet's knowledge would not be the "ultimate LLM." Risks and safety concerns arise from the growing use of these models in people's daily lives, and they are correlated with model scaling, a process that has led to many "unexpected" capabilities in large language models.

4. Capabilities and Prompting

Scaling language models leads to unexpected results.

As expected, LLM performance improves on various quantitative metrics when scaled, such as perplexity, a measure of how well a model predicts text (lower is better). However, scaling language models also involves training them on vast quantities of data, often sourced from the extensive text available online. As LLMs are exposed to a diverse range of linguistic patterns and structures, they learn to emulate and reproduce these patterns with high fidelity.
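
For the curious, perplexity has a simple definition: it is the exponential of the average negative log-probability the model assigns to each observed token. The minimal sketch below uses made-up token probabilities:

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0: always guessing among 4 options
print(perplexity([0.9, 0.8, 0.95]))          # ~1.14: a confident, fluent model
```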

Interestingly, this process gives rise to unexpected qualitative behaviors. Studies have shown that as LLMs scale, they "unlock" new capabilities in a discontinuous manner, unlike the more predictable linear improvement of quantitative metrics.

New capabilities are unlocked as the number of parameters surpasses certain thresholds.

These emergent abilities cover various tasks, such as translation between languages, writing programming code, and more. Notably, LLMs acquire these skills through observation of recurring patterns in natural language during training, without explicit task-specific supervision.

The phenomenon of emergence is not limited to LLMs and has been observed in other scientific contexts. For a more general discussion, readers can refer to Emergent Abilities of Large Language Models.

Surprisingly, emergent abilities are sometimes accessible through well-crafted prompts: an LLM can perform certain tasks simply by receiving the appropriate query in natural language. For example, it can generate a concise summary when prompted with a passage followed by a summarization request.
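
A summarization prompt can be as simple as the passage followed by a plain-language request; the wording below is our own illustration, not a prescribed template:

```python
# The task is specified entirely in natural language: a passage followed
# by a plain-language request.
passage = (
    "Large Language Models are probabilistic models of natural language. "
    "They are pre-trained on large text corpora and can then be adapted "
    "to many tasks through fine-tuning or prompting."
)
prompt = passage + "\n\nSummarize the passage above in one sentence."
# `prompt` can now be sent to any LLM endpoint or chat interface.
```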

However, pre-trained LLMs may not always follow prompts accurately, possibly due to replicating patterns observed in training data. For instance, if asked about the capital of France, the model might respond with a question about Italy's capital, having picked up this pattern from online quizzes.

To overcome this, researchers developed Instruction Tuning, a strategy that trains LLMs on a small dataset of prompts or instructions followed by correct actions. Fine-tuning the model on these examples helps it better understand and follow natural language instructions.

The main advantage of Instruction Tuning is the LLM's generalization capability, enabling it to follow instructions for a variety of tasks beyond those seen in the small dataset. This has partly replaced the need for extensive fine-tuning of smaller, specialized models for certain tasks, as large, scaled models can effectively perform them after exposure to diverse data and simple instruction tuning.

LLMs can be prompted to perform tasks that previously required fine-tuning a model through supervised learning.
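
Instruction-tuning datasets pair natural-language instructions with demonstrations of the correct response. The records below sketch the general shape of such data; the format and examples are invented for illustration, not taken from any specific dataset:

```python
# Each record pairs a natural-language instruction with a demonstration
# of the correct response.
instruction_data = [
    {"instruction": "What is the capital of France?",
     "response": "The capital of France is Paris."},
    {"instruction": "Translate 'good morning' into Spanish.",
     "response": "Buenos días."},
    {"instruction": "Summarize: The team shipped the new feature and fixed two bugs.",
     "response": "The team shipped a feature and fixed two bugs."},
]
```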

5. LLM Behavior and Risks

Although the emergence of new cognitive capabilities in LLMs is not yet fully understood, the general pattern that enables this is reasonably clear. Consider the example of question-answering: the vast amount of text on the internet contains numerous question-answer pairs, allowing LLMs to perform a kind of information retrieval and answer questions on various topics.

However, the internet also contains a significant amount of false information, making it difficult for developers to regulate the content LLMs are exposed to during training. As a result, LLMs may exhibit undesirable behavior, such as reproducing harmful or biased content, or generating hallucinations by fabricating false facts.

When LLMs are used as general-purpose conversational chatbots (like ChatGPT), identifying all potential threats from mass use becomes challenging, as it is nearly impossible to predict all possible scenarios beforehand.

The dangers of LLMs extend beyond incorrect answers or fabricated information. Risks vary based on the specific use case, and as general-purpose chatbots become increasingly popular, it is crucial to ensure these models are not exploited for malicious purposes.

Some examples of potential harm include:

  • LLMs with coding abilities being used to easily create sophisticated malware.
  • Mass propaganda via coordinated networks of chatbots on social media platforms, aimed at distorting public discourse.
  • Privacy risks from LLMs inadvertently replicating personally identifiable information.
  • Psychological harm when users seek emotional support from chatbots, only to receive harmful responses.

AI safety concerns emphasize the importance of LLMs adhering to three general principles:

  1. Helpfulness: Following instructions, performing tasks, and providing relevant information.
  2. Truthfulness: Providing factual information and acknowledging uncertainties and limitations.
  3. Harmlessness: Avoiding toxic, biased, or offensive responses and refusing to assist in dangerous activities.

An LLM is considered aligned if it meets these guidelines. Addressing these diverse problems may require different strategies. Reinforcement Learning from Human Feedback (RLHF) is one technique that has made significant strides in aligning LLMs with human values, potentially addressing all these issues simultaneously.
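
At the heart of RLHF is a reward model trained on human preference comparisons. The sketch below shows the pairwise preference loss commonly used for that step (as in InstructGPT-style training); the reward values are made up for illustration:

```python
import math

# Pairwise preference loss for the reward model: given a human-preferred
# response and a rejected one, the loss is small when the model scores
# the preferred response higher.
def preference_loss(reward_chosen, reward_rejected):
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

print(preference_loss(2.0, 0.5))  # ~0.20: reward model agrees with the human
print(preference_loss(0.5, 2.0))  # ~1.70: reward model disagrees
```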

Read more about RLHF and how it is used to train ChatGPT in this article.

Final Words

LLMs hold the promise to reshape various business aspects, from customer support automation to content creation, real-time sentiment analysis, and more.

To fully grasp the influence of LLMs in the enterprise context, it's essential to see how they fit within the bigger picture of Generative AI, a growing field covering a range of innovative techniques that are driving the AI revolution and changing how technology integrates with our daily lives. Learn more about it in our dedicated blog series on Generative AI.

Incorporating LLMs into existing products can be hard for a variety of reasons: choosing the right language model, mastering the subtleties of prompt engineering, and dealing with varying context windows, to name a few. That's why we've built LeMUR, a framework that makes it easy to use and embed various LLMs when working with audio data.

Try LeMUR in our Playground