Educational

What is an LLM? Understanding Large Language Models

2024-11-10 · 15 min
LLMs · AI · Machine Learning

A comprehensive introduction to Large Language Models, their capabilities, and limitations

What is a Large Language Model?

A ‘large language model’ (LLM) is a type of artificial intelligence (AI) that is ‘trained’ on a massive amount of text data. By ingesting that data through an iterative training process, the model learns patterns and relationships in language, which enables it to do some pretty amazing things.

LLMs are fed enormous amounts of text data, which is often scraped from the internet, books, code repositories, and other sources. This data is the ‘learning material’ for the model and can best be conceptualized as a library.

The model uses complex algorithms, usually based on neural networks, to analyze the training data and identify patterns and relationships between words, phrases, and sentences. As it is trained, the model learns grammar, syntax, and semantics (meaning). It learns which words tend to follow each other, how sentences are structured, and how language is used in different contexts.

For example, given the sentence "The quick brown fox jumps over the lazy _____" the model learns to predict the word "dog." This seemingly simple task, performed billions of times on massive datasets and made practical by significant advances in computing power and algorithms, is what teaches the model the intricacies of language.
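
As a rough illustration of what "predicting the next word" means, here is a toy sketch in Python. The candidate words and probabilities are made up; a real LLM computes a probability for every token in its vocabulary rather than consulting a hand-written table:

```python
# Toy illustration of next-word prediction. The probabilities are invented;
# a trained LLM assigns a probability to every token in its vocabulary.
context = "The quick brown fox jumps over the lazy"

candidate_probs = {
    "dog": 0.72,
    "cat": 0.11,
    "river": 0.02,
    "keyboard": 0.001,
}

# Pick the highest-probability candidate as the predicted next word.
prediction = max(candidate_probs, key=candidate_probs.get)
print(context, prediction)  # The quick brown fox jumps over the lazy dog
```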

The training process involves constantly adjusting the model's internal parameters (known as weights and biases) to improve its ability to predict the next word accurately. This is done using optimization techniques that aim to minimize a "loss function," which measures the difference between the model's predictions and the actual text in the training data. The model goes through the training data multiple times (epochs), refining its understanding of language with each pass. The process is iterative, with the model gradually improving its ability to generate text that resembles human-written language.
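
For readers who like to see the shape of that loop in code, here is a heavily simplified sketch using PyTorch. The tiny embedding-plus-linear "model" and the random token data are stand-ins, not a real language model, but the predict, measure loss, adjust weights cycle is the same one that plays out at vastly greater scale during LLM training:

```python
import torch
import torch.nn as nn

# A drastically simplified "language model": an embedding layer plus a linear
# layer that scores every token in the vocabulary as the possible next token.
vocab_size, embed_dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # the "loss function" measuring prediction error

# Toy training data: random (current token, next token) pairs for illustration.
inputs = torch.randint(0, vocab_size, (64,))
targets = torch.randint(0, vocab_size, (64,))

for epoch in range(3):               # multiple passes (epochs) over the data
    logits = model(inputs)           # model's predictions for the next token
    loss = loss_fn(logits, targets)  # how far off the predictions are
    optimizer.zero_grad()
    loss.backward()                  # compute gradients of the loss
    optimizer.step()                 # adjust weights and biases to reduce it
    print(f"epoch {epoch}: loss = {loss.item():.3f}")
```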

A successfully trained LLM can understand the nuances of language (grammar, syntax, semantics), which can then be used to generate creative text formats, respond to prompts and questions, and perform various language-based tasks such as translation, summarization, and text classification.

The ‘Large’ in Large Language Models

The size of the training data, and the cost to train an LLM, are both substantial and vary significantly depending on the model’s size and complexity. Training data for the largest models is measured in terabytes and even petabytes, where 1 petabyte = 1,000 terabytes. To put this into context, 1 terabyte (“TB”) is roughly equivalent to the storage capacity of a high-end laptop; said another way, a 1 TB drive could store ~100-200 high-definition movies.

Training a large language model can cost millions of dollars due to the cost of computational resources, energy consumption, data storage and processing, along with personnel costs. New LLMs are often announced alongside their parameter count, where a parameter can be thought of as an internal value that encodes patterns in the data, relationships between words, grammatical rules, and so on.

AI pioneer OpenAI announced GPT-3, which has 175 billion parameters, in June 2020, while GPT-4 (estimated to have as many as 1 trillion parameters) was released in March 2023. Current SOTA (state of the art) models are estimated to have as many as 1.75 trillion parameters, although such figures are understandably difficult to verify because parameter counts are often not disclosed.

How do LLMs work?

Once an LLM is trained, it operates by leveraging the patterns and relationships it learned from the massive dataset it was trained on. Typically, an LLM receives input in the form of text (often called a ‘prompt’), which sets the context for what the model should generate.

The input text is then broken down into smaller units called ‘tokens’, which may be whole words, parts of words (such as prefixes or suffixes), or even individual characters. Tokenizing input text is comparable to breaking a sentence into individual words so it can be analyzed piece by piece.
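
To make this concrete, here is a brief sketch using OpenAI's open-source tiktoken library (one tokenizer among many; this assumes the package is installed), which turns text into integer token IDs and back:

```python
# Sketch of tokenization with tiktoken (assumes `pip install tiktoken`).
# Other models ship their own tokenizers, but the idea is the same:
# text in, integer token IDs out.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The quick brown fox jumps over the lazy dog")
print(tokens)              # a list of integer token IDs
print(enc.decode(tokens))  # decoding the IDs recovers the original text
```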

Each token is then converted into a numerical representation known as a vector embedding. These vectors capture the meaning and context of the token based on the relationships learned during training, so similar words or tokens end up with similar embeddings. Translating text into numbers in this way is what allows the model to work with the semantic meaning of the input.
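
A toy sketch of this idea: the vectors below are made-up four-dimensional embeddings (real models use hundreds or thousands of dimensions), and cosine similarity shows that related words sit closer together than unrelated ones:

```python
import numpy as np

# Hypothetical embeddings, invented purely to illustrate that related words
# end up with similar vectors.
embeddings = {
    "dog":         np.array([0.80, 0.10, 0.60, 0.20]),
    "puppy":       np.array([0.75, 0.15, 0.55, 0.25]),
    "spreadsheet": np.array([0.05, 0.90, 0.10, 0.70]),
}

def cosine_similarity(a, b):
    # Measures how closely two vectors point in the same direction (1.0 = identical).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))        # high
print(cosine_similarity(embeddings["dog"], embeddings["spreadsheet"]))  # low
```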

The encoded input is then processed through a transformer decoder, which generates a probability distribution over the model’s entire vocabulary. This distribution represents the likelihood of each token being the next one in the sequence, and can be conceptualized as the model brainstorming possible next words and assigning a probability to each.
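
One common way to picture that step: the decoder produces a raw score (a "logit") for every vocabulary entry, and a softmax turns those scores into probabilities. The tiny vocabulary and the numbers below are invented purely for illustration:

```python
import numpy as np

# The decoder emits one raw score ("logit") per vocabulary word;
# a softmax converts the scores into a probability distribution.
vocab = ["dog", "cat", "fence", "moon"]
logits = np.array([3.2, 1.1, 0.4, -2.0])

probs = np.exp(logits) / np.sum(np.exp(logits))
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")  # the probabilities sum to 1.0
```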

The model selects the next token based on this probability distribution, either by choosing the most likely token or by sampling from the distribution to introduce variety. This process repeats, one token at a time and very quickly, until a complete response to the input has been generated. The resulting sequence of tokens is then converted back into human-readable text, which is the final output the user sees.
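
Putting the pieces together, a toy generation loop might look like the sketch below. The next_token_probs function is a placeholder standing in for a real model's forward pass (it ignores the context and returns random probabilities over a four-word vocabulary), but the sample-append-repeat structure is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def next_token_probs(sequence):
    # Stand-in for a real model: ignores the context and returns a made-up
    # distribution over a tiny vocabulary. In practice this would be a full
    # forward pass through the LLM.
    vocab = ["the", "lazy", "dog", "."]
    logits = rng.normal(size=len(vocab))
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return vocab, probs

sequence = ["The", "quick", "brown", "fox", "jumps", "over"]
for _ in range(4):                      # generate 4 more tokens
    vocab, probs = next_token_probs(sequence)
    token = rng.choice(vocab, p=probs)  # sample from the distribution
    sequence.append(token)

print(" ".join(sequence))
```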

Who are the key players in LLM development?

Several key players are driving the development and advancement of Large Language Models:

  • OpenAI: Among the first to achieve significant breakthroughs in LLMs, OpenAI is regarded as having a first-mover advantage. Its GPT models are widely praised for their strong performance in text generation, conversation, and various other tasks. The company also offers a well-developed API, enabling easy integration into various applications.

  • Google AI: Known for its LaMDA, PaLM, BERT, and MUM models, Google is recognized for its massive resources, including immense computational power and vast amounts of data, which are crucial for training large models. The tech titan also has deep integration with existing Google products, as well as a focus on multimodal models that combine text, images, and other modalities.

  • Microsoft: A fellow tech titan with a close partnership with OpenAI and a strong enterprise focus. Microsoft integrates LLMs into its suite of products, including Azure AI services, enhancing cloud-based AI capabilities for businesses.

  • Meta AI (Facebook): Meta has released several open-source LLMs and has a strong interest in developing conversational AI agents and chatbots.

  • Anthropic: Formed by former OpenAI members, Anthropic focuses on AI safety and developing "constitutional AI" models like Claude.

  • Cohere: Focused on providing LLM-powered solutions to enterprises.

  • AI21 Labs: Developed the Jurassic-1 model, contributing to the diversity of available LLMs.

  • Stability AI: Focused on open-source and multimodal models, including Stable Diffusion (text-to-image generation).

  • Nvidia: Plays a critical role by providing the GPUs (graphics processing units) essential for training LLMs.

Future LLM use cases

Future LLMs are likely to exhibit improved abilities in:

  • Logical reasoning
  • Problem-solving
  • Common-sense reasoning

Models will likely become better at understanding and maintaining context over longer conversations and more complex interactions. LLMs could be personalized to individual users, learning their preferences and adapting their responses accordingly, which could lead to more engaging and helpful AI assistants and chatbots. Future models will likely seamlessly integrate multiple modalities, including text, images, audio, and video, enabling richer and more natural interactions with AI systems.

The high cost of training and running LLMs is a major challenge. Developing more efficient training algorithms and model architectures that require fewer computational resources is essential. Techniques like model compression, pruning, and knowledge distillation can reduce model size and complexity without significant performance loss, and specialized hardware optimized for deep learning workloads can significantly speed up training and inference, reducing overall costs. Carefully curating and cleaning training data, along with techniques like data augmentation, can reduce the amount of data required for training, leading to further savings.
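
As an illustration of one of those techniques, the sketch below captures the core idea of knowledge distillation: a small "student" model is trained to match the output distribution of a large "teacher" model. The logits here are random placeholders rather than the outputs of real models:

```python
import torch
import torch.nn.functional as F

# Minimal distillation sketch: the student learns to imitate the teacher's
# probability distribution over the vocabulary, not just the single correct token.
temperature = 2.0
teacher_logits = torch.randn(8, 100)                       # batch of 8, vocab of 100
student_logits = torch.randn(8, 100, requires_grad=True)   # placeholder student output

teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

# KL divergence between teacher and student distributions is the distillation loss.
distill_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
distill_loss.backward()   # in a real setup, gradients would update the student model
print(distill_loss.item())
```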

Applications in Investment Management

Advancements in LLMs present significant potential to revolutionize investment management and equity research. Consider the enormous volume of filings companies submit to the US Securities and Exchange Commission every year, amounting to tens of terabytes of text annually. Certainly, no human, or even an entire investment firm, has enough bandwidth to analyze all of this data in a timely manner, but an LLM might be able to.

  • Data Processing: LLMs can process vast amounts of financial data, including company filings (10-Ks, 10-Qs), earnings call transcripts, news articles, and social media sentiment, to identify trends, assess risks, and generate insights much faster than human analysts.

  • Market Sentiment Analysis: By analyzing text data, LLMs can gauge market sentiment towards specific companies or industries, informing investment decisions and helping predict market movements (a brief code sketch follows this list).

  • Automated Reporting: LLMs can automate the generation of investment reports, summarizing key findings and providing actionable insights, thereby increasing efficiency and accuracy.
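
As a minimal sketch of the sentiment-analysis idea above, the example below prompts an LLM through OpenAI's Python SDK to classify a filing excerpt. The model name, prompt wording, and excerpt are illustrative assumptions, an API key is assumed to be configured, and any LLM provider could be substituted:

```python
# Hypothetical sentiment classification of a filing excerpt using OpenAI's
# Python SDK (assumes `pip install openai` and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()
excerpt = "Revenue declined 12% year over year, and management withdrew full-year guidance."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; substitute whichever model you use
    messages=[
        {"role": "system",
         "content": "Classify the sentiment of the following filing excerpt as "
                    "bullish, bearish, or neutral, with a one-sentence rationale."},
        {"role": "user", "content": excerpt},
    ],
)
print(response.choices[0].message.content)
```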

Conclusion

Large Language Models represent a monumental advancement in artificial intelligence, offering unparalleled capabilities in understanding and generating human language. While challenges such as high training costs and the need for efficient algorithms remain, the potential applications of LLMs across various industries, including investment management, are vast and transformative. As technology continues to evolve, LLMs will undoubtedly play a pivotal role in shaping the future of AI-driven solutions.