A Primer on AI Technologies

Introduction

 

Over the last five years, as machine learning and AI models have become almost omnipresent in our daily lives, they have also become more inscrutable to the general public. Understanding the core concepts of how AI algorithms function and why they work does not require an extensive computer science or mathematics background. Regrettably, though, there are few resources available to the interested layperson who wants to understand the nuts and bolts of AI, even if they have no plans to write code or build models themselves. This article seeks to remedy that gap.

 

In this post, we will walk through the building blocks of AI, explain how and why they work, and show where they sometimes go wrong. We'll then demonstrate how these components are combined and expanded to create the extremely large models that are now ascendant. Many people are familiar with the basic statistical tools that AI algorithms employ, such as regression or randomization. While a background in these concepts is helpful, we assume no particular statistical knowledge on the part of the reader. We will begin with the simplest statistical models, and then explain the process of machine learning, in which algorithms learn underlying patterns from training data. We will then discuss the main lines of progress in these algorithms over the last few decades, from better methods for turning information into training data to what neural networks are and why they now work so well in domains such as images and language.

 

Throughout this post, we will highlight not only why these techniques and models function but also what their limitations are. In light of those limitations, we will then review where and how models can go wrong. Fundamentally, we need to have a broad public conversation about where, when, and how to deploy AI and machine learning tools. Some of this conversation will play out in newspapers, much of it will take place in living rooms, and a small (but important) part will occur in courtrooms. This conversation must be one about values. What costs are we willing to bear to gain the advantages these new tools offer? Who should pay those costs? Who reaps these advantages? These are age-old questions that law and political philosophy have useful frameworks for contemplating. That is why it is vital that an understanding of how this technology functions is held widely: answering these questions will be the work of many years, many disciplines, and many people.

 

 

Finding underlying patterns in data

 

Consider the following problem: you are a fisherman planning your first day at sea of the fishing season. You need to know how long you'll be out so you can carry enough fuel (but not much more than you need, since extra fuel is dead weight). How long you'll be out is determined by how many fish you're trying to catch: the longer you stay out, the more fish. But your license is limited, so you're only allowed to land 1,000 lbs of fish per day. How long a voyage should you plan for? Your historical catches and voyage times probably contain helpful information, but you've never caught exactly 1,000 pounds of fish. Consider the chart of historical fishing data in Figure 1, which shows hours at sea and pounds of fish caught. A simple way to determine how many hours we should plan to be at sea is to draw a line connecting the points of our historical data. When we do this, we see that if we are at sea for 6 hours and 40 minutes, we could expect to catch 1,000 pounds of fish. This process is known as interpolation, and it is an important component of how AI systems are able to generate answers to questions that aren't a perfect match for historical data.

 

But what if we think the relationship between fish caught and hours isn't linear? What if we believe that fish stocks are better away from shore, up to some limit, so that the relationship between time and fish caught is non-linear? Then our graph might look like Figure 2, where we draw an upward-curving line rather than a straight one. This doesn't yield a very different answer to the question of how long it would take to catch 1,000 lbs of fish, but if we were trying to find the answer for 3,000 lbs, the difference could be meaningful. Of course, to know to 'connect the dots' differently, we needed to know something about the underlying reality we were attempting to model, or we needed much more data from which to learn the pattern. This is why, particularly when there is less training data available than we would like, data scientists building new models often work with domain experts who have particular expertise in solving that exact problem. That expertise influences how the model is built.

But what if our data about our historical catches is a little messier than what's shown in Figure 1? Perhaps the scale weight of the catch was known to be somewhat inexact. That would mean that simply 'connecting the dots' wouldn't lead to great results, because our data is 'noisy.' This is a common problem because many aspects of our world are noisy: measurements are inexact, and physical systems are entropic. If we think there is an underlying linear relationship between the hours at sea and the pounds of fish caught, we may want to find a line that is as close as possible to our data points while still being straight. When we do this, we are assuming that these two types of data are related and, in fact, that one is dependent on the other. Specifically, we think that 'hours at sea' is a data feature that carries information useful to the model we want to build, as an input. 'Pounds of fish caught' is a model output. This kind of model is called a linear regression because the relationship between the input (time in hours) and the output (lbs of fish) is linear. We will find this 'closest line' by first defining a way of measuring the distance from our data points, called a cost function (also sometimes called a loss function), and then minimizing the value of this cost. While this function is simple in our example (a measure of how far the line falls from each point), cost functions become much more important as models grow more complex over many dimensions of data. The closest line to our data points can be represented by a function: F = 300 * H. We have learned this function from the historical training data.
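To make this concrete, here is a minimal sketch in code of learning such a line. The data points are invented for illustration (they are not the values in Figure 1, though they are chosen so the learned line comes out near F = 300 * H), and the cost used here is the common squared-distance measure; a real model would also learn an intercept and use a less brute-force search.

```python
# A minimal sketch of fitting a straight line to (invented) catch data.
hours = [2.0, 3.5, 5.0, 6.5, 8.0]        # hours at sea (input feature)
pounds = [620, 1010, 1540, 1930, 2410]   # pounds of fish caught (model output)

def cost(slope):
    """Mean squared distance between the line F = slope * H and the data."""
    return sum((slope * h - p) ** 2 for h, p in zip(hours, pounds)) / len(hours)

# Try many candidate slopes and keep the one with the lowest cost.
candidates = [s / 10 for s in range(2000, 4001)]   # slopes from 200.0 to 400.0
best_slope = min(candidates, key=cost)

print(f"Learned function: F = {best_slope:.1f} * H")
print(f"Hours needed to catch 1,000 lbs: {1000 / best_slope:.2f}")
```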

 

Linear regressions are among the simplest types of models, and you will hear about many others, including another broad category called classifiers, which seek to assign inputs to categories rather than output a continuous value. Still, fundamentally, all these models are attempting to do the same thing: identify an underlying pattern in training data that can be used to make useful future predictions.

 

Overcoming Noise through Ample and Representative Data

 

In our prior example, the underlying pattern we were trying to learn was a simple one. This made the presence of noise in our data easy to deal with. The noise was also quite small compared to the strength of the relationship between our input data (hours) and output data (lbs of fish). In real-world examples, relationships and patterns are more complex, outcomes depend on multiple inputs, and the data we might learn these patterns and relationships from is much messier.

 

The more complex the underlying pattern, the easier it becomes for noisy data to obscure that pattern. The answer to this problem has been to increase the amount of training data models use to learn patterns. As long as the 'noise' in the data is randomly distributed and we have enough training data, the noise will eventually cancel itself out, and the true underlying pattern will be learnable. And because most real-world patterns we want to learn are complex, the most important factor in improving the quality of machine learning models over the past several decades has been the growth of training data, of methods to gather more training data, and of techniques to make the training data we have more learnable.
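The small simulation below sketches this effect. The 'true' slope, the noise level, and the data set sizes are all invented; the point is only that the learned slope drifts toward the true one as the amount of (randomly) noisy data grows.

```python
# How random, zero-mean noise "averages out" as training data grows.
import random

random.seed(0)
TRUE_SLOPE = 300  # the underlying pattern we hope to recover (lbs per hour)

def simulate_catches(n):
    """Generate n noisy (hours, pounds) observations around the true pattern."""
    data = []
    for _ in range(n):
        h = random.uniform(1, 10)
        noise = random.gauss(0, 200)   # random noise centered on zero
        data.append((h, TRUE_SLOPE * h + noise))
    return data

def fit_slope(data):
    """Least-squares slope for a line through the origin."""
    return sum(h * p for h, p in data) / sum(h * h for h, _ in data)

for n in (5, 50, 500, 5000):
    print(f"{n:>5} data points -> learned slope {fit_slope(simulate_catches(n)):.1f}")
```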

 

Now we have hit the first snag that all machine learning-driven systems must face: how do you know your noise is random? The answer, often, is that it isn't. This is where bias creeps into AI systems, and it can happen in two ways. First, data collection itself might be biased. Perhaps we collect data in slightly different ways from different groups, or our tools for taking measurements are miscalibrated. This is collection bias. Second, an underlying pattern might manifest differently in different portions of a population, so we will not fully observe the pattern if we do not sample a representative portion of the population. This is typically referred to as sampling bias.

 

To illustrate how these different types of bias can occur, imagine that instead of a single fishing boat trying to make predictions based on its own historical data, we are a harbor master trying to make predictions for all the boats in a port. To help us do that, we have several new types of data, such as the types and sizes of boats and crews. However, not all of those boats may measure data the same way, and we might not be collecting data from them in the same way. Perhaps some of the boats that use the port are pleasure craft. They still go out to sea to fish, but they aren't commercial fishing boats. Some of them still report data to the harbor master, but they aren't required to. The non-commercial fishermen are competitive, though: if they think they had a particularly good catch, they are much more likely to report it, hoping to beat the record for best haul by a non-commercial fishing boat. But this means we aren't getting a representative sample of the catches from the pleasure craft; only the best catches get registered. Also, among the commercial fishermen, some have on-board scales that accurately measure their catch's weight, while others estimate weight from the volume of fish in the hold. If we aren't aware of these disparities in data collection, we won't be very good at predicting results for non-commercial fishers in particular, and we may give systematically bad predictions to everyone if we don't find a way to identify historical data that was estimated by volume instead of directly weighed.

 

We note that these types of bias are separate from another phenomenon that is often described as algorithmic bias. Sometimes an algorithm is trained to perform a task that is inherently biased, or is trained to perform a task in the same biased way humans currently perform it. For example, some jurisdictions have attempted to use machine learning algorithms to predict recidivism. Because there are statistically significant racial differences in rates of arrest and incarceration, race is a factor with significant predictive power for recidivism. Even if we do not allow race to be included as a feature, other data features that are strongly correlated with race (such as home address) will carry this predictive power. Even if it were possible to build a perfectly accurate algorithm for recidivism (which it is not), such an algorithm would say that, of two otherwise similar defendants, the Black one has a higher likelihood of reoffending, because that is an accurate reflection of the outcomes our criminal justice system produces.

 

Learning: Measuring the Distance to Reality

In the previous example, we needed a cost function (also sometimes called a loss function) to measure the distance between the line we were learning and the historical examples. We found the line that minimized the value of that cost function, a process called optimization, or learning. It's important to note that optimizing was very easy in that case, because the number of dimensions of our data was small enough that we could simply see it. As the number of data features grows beyond a few, however, optimization becomes a challenging process. For all but the simplest models, we can't just solve an equation to minimize our cost function, and it would take far too long to try every possible line that could be drawn. Instead, we successively try different lines, checking each time whether the value of our cost function got smaller or larger. As long as it keeps getting smaller, we keep adjusting the line in the direction we've been going. If it starts to get bigger, we adjust it in the other direction. In this case, we're simply trying to fit our line as well as possible, but there may be other external factors we need to prioritize that might lead us to change this function. Perhaps we want to err on the side of a shorter journey for some reason. If so, we would change the cost function so that it favors a steeper line. In a very real way, cost functions represent what you value and prioritize when training an algorithm.
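Here is a sketch of that 'try, check, adjust' loop in code, applied to the slope of our line and using the same invented catch data as in the earlier sketch. The starting guess, step size, and number of iterations are arbitrary choices made for illustration.

```python
# The "try, check, adjust" loop described above, applied to one slope.
hours = [2.0, 3.5, 5.0, 6.5, 8.0]
pounds = [620, 1010, 1540, 1930, 2410]

def cost(slope):
    return sum((slope * h - p) ** 2 for h, p in zip(hours, pounds)) / len(hours)

slope = 100.0     # an arbitrary first guess
step = 10.0       # how much we adjust the slope on each try
direction = 1     # +1 means "increase the slope", -1 means "decrease it"

for _ in range(200):
    candidate = slope + direction * step
    if cost(candidate) < cost(slope):
        slope = candidate          # cost got smaller: keep going this way
    else:
        direction = -direction     # cost got bigger: reverse direction...
        step = step / 2            # ...and take smaller steps to avoid overshooting

print(f"Learned slope after optimization: {slope:.1f}")
```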

 

So if we can't simply visualize our data, how do we start? How do we draw the first line, and how do we try new lines after that? How long do we keep going, trying new lines and perhaps more and more complicated underlying patterns? The answers to these questions are typically referred to as learning parameters or hyperparameters. One particularly important decision is how much to adjust our line each time we try to reduce our cost function. Do we assume our existing line is mostly right and change it only a little at a time? If we make only small changes, it may take a long time to find the 'correct' line when we start a long way off; but if we make big changes in response to outliers, we may swing wildly from one extreme to another, taking even longer to find the 'right' line. The rate at which we are willing to change what we think is the underlying pattern or relationship is referred to as the learning rate. The right one depends on several factors, such as what cost function you are using, how noisy your data is, and how complicated your underlying function is.
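The sketch below illustrates that trade-off on the same invented data, this time using a slightly different adjustment rule called gradient descent, in which each adjustment is proportional to how quickly the cost is changing. The three learning rates are arbitrary: the smallest is still far from the right slope after 100 steps, the middle one settles near it, and the largest swings wildly and never settles.

```python
# How the learning rate changes training, using gradient descent on invented data.
hours = [2.0, 3.5, 5.0, 6.5, 8.0]
pounds = [620, 1010, 1540, 1930, 2410]

def gradient(slope):
    """How quickly the mean squared cost changes as the slope changes."""
    return sum(2 * h * (slope * h - p) for h, p in zip(hours, pounds)) / len(hours)

def train(learning_rate, steps=100):
    slope = 0.0
    for _ in range(steps):
        slope -= learning_rate * gradient(slope)
    return slope

for lr in (0.0001, 0.01, 0.04):   # too small, reasonable, too large
    print(f"learning rate {lr}: slope after 100 steps = {train(lr):g}")
```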

 

How to Guess Right? Guess More (Randomly)!

 

It is probably intuitively obvious that if you cannot directly solve a problem but instead must optimize by repeatedly guessing, it helps to take many guesses. For many reasons, incremental improvements to a candidate solution often 'cap out' surprisingly far from the best solution. One thing that helps is introducing elements of randomness into how we make guesses. This can be done in many ways, but a common one is trying many different starting guesses. This type of randomization has multiple benefits. In addition to giving us more chances to start 'near' the best solution, it allows us to parallelize our search. When we try to find a line that minimizes our cost function by making small changes and checking whether the cost goes up or down, we have to work sequentially: we can't move on to the next line until we finish evaluating the current one, because the current result determines what we try next. But if we randomly pick many starting values, we can have many processors try to improve from each of those starting values simultaneously.
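The sketch below shows the idea on a deliberately 'bumpy' cost function, invented so that a single greedy search can get stuck in a local dip far from the best answer. Twenty random starting guesses are each improved independently, and the best finishing point is kept; because each search is independent, they could all run on separate processors at once.

```python
# Many random starting guesses, each improved independently; keep the best result.
import math
import random

random.seed(1)

def bumpy_cost(x):
    """A cost with many local dips, so greedy search from one start can get stuck."""
    return (x - 3) ** 2 + 10 * math.sin(3 * x)

def improve(x, step=0.01, tries=2000):
    """Greedy local search: accept a small move only if it lowers the cost."""
    for _ in range(tries):
        candidate = x + random.choice([-step, step])
        if bumpy_cost(candidate) < bumpy_cost(x):
            x = candidate
    return x

starts = [random.uniform(-10, 10) for _ in range(20)]   # 20 random starting guesses
finishes = [improve(s) for s in starts]                 # these could run in parallel
best = min(finishes, key=bumpy_cost)

print(f"Best solution found: x = {best:.2f}, cost = {bumpy_cost(best):.2f}")
```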

 

In practice, unless we have strong domain knowledge about a problem, using randomness in our guessing often improves results. Many algorithms have words like Monte Carlo or stochastic in their names. These terms signal that the algorithm contains a random element. As computer scientists have experimented with more strategies and types of models, incorporating elements of randomness has become more and more common.

 

 

Understanding Neural Networks

Thus far, we haven't discussed any particular families of machine learning models, focusing instead on broad properties that apply to all learning algorithms. However, neural networks have come into very broad use and have particular strengths and weaknesses, so we need to discuss them separately. In the first section of this article, we discussed a simple function that took hours at sea as an input and predicted pounds of fish caught: one input and one output. In reality, many problems we want to solve have dozens, hundreds, thousands, or even more input features. Some models deal with this complexity by using more complex functions that can take many inputs. But there is another way of managing that complexity: rather than have one function that captures all the inputs, we could have many functions that each capture the relative importance of individual inputs. If we then connect all those functions together, we can still solve the problem well and capture the overall relationships between all the different inputs and the eventual output. This is, essentially, what neural networks do.

 

This may seem like a distinction without a difference, but this difference in organization allows us to do some very useful things. To explain why, we first need to discuss how these networks of functions are structured. In a neural network, each 'node' is a function like the ones we have discussed. Computer scientists have experimented with different structures, but today most neural networks are 'deep' and consist of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer.

 

1.     Input Layer: Nodes in the input layer take in all the input features directly. These nodes, like all others in the network, are initialized with random multipliers, or weights, that will be adjusted through the training process.

 

2.     Hidden Layers: Nodes in hidden layers take data from all of the nodes in the input layer (or the previous hidden layer, if there is more than one), effectively combining them. Like nodes in the input layer, these nodes are initialized with random weights.

 

3.     Output Layer: The output layer delivers the final results and has a number of nodes corresponding to the number of outputs. (A minimal code sketch of this layered structure follows below.)
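Below is that sketch: two input features, a small input layer and hidden layer, and a single output node, all with randomly initialized weights. The layer sizes, the example features, and the simple 'zero out negative values' nonlinearity are illustrative choices, not a description of any particular production network.

```python
# A tiny, untrained neural network: input layer -> hidden layer -> output layer.
import random

random.seed(42)

def make_layer(n_inputs, n_nodes):
    """Each node gets one random weight per input (biases omitted for simplicity)."""
    return [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_nodes)]

def run_layer(layer, inputs):
    """Each node multiplies the inputs by its weights, sums them, and zeroes
    out negative results (a simple nonlinearity known as ReLU)."""
    return [max(0.0, sum(w * x for w, x in zip(node, inputs))) for node in layer]

input_layer  = make_layer(n_inputs=2, n_nodes=3)   # takes the raw features
hidden_layer = make_layer(n_inputs=3, n_nodes=3)   # combines the input nodes' outputs
output_layer = make_layer(n_inputs=3, n_nodes=1)   # produces the single prediction

features = [6.5, 2.0]   # e.g. hours at sea and crew size
x = run_layer(input_layer, features)
x = run_layer(hidden_layer, x)
prediction = run_layer(output_layer, x)

print("Prediction from random (untrained) weights:", round(prediction[0], 2))
```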

 

In all learning models, the model 'learns' by trying new things with its inputs and seeing whether it does better or worse. Using our example from the first section, we might try a different multiplier for the input, hours; this multiplier is called a feature weight. In neural networks, it is not just the input nodes that 'learn' weights: all the hidden layers do as well. This means we don't just learn the importance of individual features; we learn the importance of combinations of features. In practice, we've observed that earlier layers in a network learn simpler, cruder features, and deeper layers learn more complicated ones. This layered approach is particularly helpful in situations where we don't have domain expertise. Unfortunately, in practice, neural networks need much, much more training data than other types of models to perform well.

 

Even non-technical readers are likely familiar with Moore's Law: the number of transistors on a microchip doubles roughly every two years. This has meant that the cost of computation has fallen at roughly the same rate. That rate of improvement has held since at least 1970, but we are still no closer to making time run faster than one second per second. This means that parallelization is the most effective means we have of 'learning' faster. Not everything can be parallelized, but when we find ways of making our algorithms parallelizable, those methods can be transformational. This isn't universally true, but neural networks, because of their distributed structure, often parallelize very well. This is another reason why, as computing resources have grown and become cheaper, deep learning models have come into greater use.

 

 

It All Comes Down to the Numbers - Vectorization & Embeddings

You may have noticed that until now, we've been talking about things that can be measured in numbers. A natural question would be: how do these concepts (interpolation, regression, learning, networks of nodes) apply to things that are not numbers? The answer, unfortunately, is that they don't. AI and machine learning systems that work with written language, sound, or images only work because we have found ways of representing those things with numbers. Some of these representations are simple. We have long known how to measure the frequency of a sound wave, which gives it a numerical representation. We have also long used pixels and color values as numerical representations of images. This is a major reason why the earliest successful AI systems worked with things that were easily represented with numbers, such as money, and the next generation of commercially successful applications were image recognition systems that relied on long-standing methods for turning image data into numbers.
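As a small illustration of the 'images are just numbers' point, here is a toy 5-by-5 grayscale image written out as a grid of brightness values; the specific image and values are invented, and real images use the same idea at vastly larger scale, usually with separate red, green, and blue values per pixel.

```python
# A tiny grayscale image of a plus sign: each pixel is a brightness from 0 to 255.
image = [
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
    [255, 255, 255, 255, 255],
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
]

# Once an image is a grid of numbers, numerical questions become easy to ask.
total_brightness = sum(sum(row) for row in image)
print("Average pixel brightness:", total_brightness / 25)
```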

 

But what about language? The effective strategies we now use for representing a document numerically are surprisingly new. Much of the advancement in working ML and AI systems for text has come from advances over just the last decade in strategies for word vectorization, also known as word embeddings, rather than from new algorithms or hardware. Early approaches simply tried to capture which words were in a document and how often they occurred.[1] Clearly, this isn't enough information to understand a text! Word order matters quite a bit, and earlier sentences give context to later ones. Many complicated strategies have been tried to encode the relationships between words in a document, but for the most part those strategies led to dead ends that were expensive to run and did not improve performance much.

 

The current strategy for developing a numerical representation of a written document is a two-step process. First, we create relatively simple numerical representations of words and their positions in a document by assigning every word a token ID number and converting every document into a long list of token IDs. Then we feed many, many documents (called a corpus) represented in this way into a fairly simple neural network and let it 'learn' the relationships between all the different words. Each of these relationships is represented by a number in a vector, so each word ends up with a vector of numbers that encodes the word's meaning in the context of the entire corpus the model was trained on. If this is done well, similar words will have similar numbers in their vectors. If we think of each vector as describing a location in a high-dimensional space, then very similar words will be near each other in that space. Word embeddings learned in this way typically have hundreds of dimensions. This turns out to be a very effective way to learn which words are like others, because similar words often end up being used the same way in sentences, and different dimensions end up capturing different aspects of language.
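The sketch below walks through both steps with a toy vocabulary and hand-written three-dimensional vectors. In a real system the vocabulary is far larger, the vectors have hundreds of dimensions, and their values are learned from a corpus rather than written by hand; the numbers here are invented only to show how 'nearby vectors' translate into 'similar words.'

```python
# Step 1: tokenize. Step 2: look up each token's vector and compare vectors.
import math

# Step 1: assign every word a token ID and turn a document into a list of IDs.
vocabulary = {"the": 0, "boat": 1, "ship": 2, "caught": 3, "fish": 4}
document = "the boat caught the fish"
token_ids = [vocabulary[word] for word in document.split()]
print("Token IDs:", token_ids)

# Step 2: each token ID points to a vector. These toy vectors are hand-made so
# that "boat" and "ship" sit close together, as a trained model might learn.
embeddings = {
    0: [0.1, 0.0, 0.2],   # the
    1: [0.9, 0.8, 0.1],   # boat
    2: [0.8, 0.9, 0.2],   # ship
    3: [0.1, 0.2, 0.9],   # caught
    4: [0.7, 0.1, 0.6],   # fish
}

def cosine_similarity(a, b):
    """Closeness of two vectors' directions: values near 1.0 mean 'very similar.'"""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

print("boat vs. ship:  ", round(cosine_similarity(embeddings[1], embeddings[2]), 3))
print("boat vs. caught:", round(cosine_similarity(embeddings[1], embeddings[3]), 3))
```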

 

 

Large (Language) Models

 

Large Language Models (LLMs) such as GPT-4 mark a significant breakthrough in natural language processing and understanding. But how have we been able to train such large models? This family of models, to which GPT-4, LaMDA, and LLaMA all belong, is defined by structural similarities that allow the models to represent the relationships between all the words in a document while simultaneously allowing training to be parallelized. First, LLMs use word embeddings similar to the ones we have already described. Next, when training data is fed in, it is provided not just as a series of tokens but with the position of each token paired to its token ID. This simple step allows the work of learning to be parallelized, because positionality has been made explicit in the position number rather than implicit in the order of the tokens. Compared with previous neural networks, LLMs also simplify what happens inside each node, allowing for a larger number of nodes rather than more complexity in each one. This, again, allows these models to parallelize very well.
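The pairing of tokens with explicit positions is simple enough to sketch directly; the vocabulary and sentence below are invented for illustration.

```python
# Pair each token ID with its position so word order is explicit in the data.
vocabulary = {"the": 0, "boat": 1, "caught": 2, "fish": 3}
sentence = "the boat caught the fish"

token_ids = [vocabulary[word] for word in sentence.split()]
tokens_with_positions = list(enumerate(token_ids))   # (position, token ID) pairs

print(tokens_with_positions)   # [(0, 0), (1, 1), (2, 2), (3, 0), (4, 3)]
# Because each pair carries its own position, the pairs can be processed in any
# order, or split across many processors, without losing the word order.
```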

 

At its core, the fundamental task of GPT-4 is predicting the next word, or token, in a given sequence. It is akin to completing a sentence or continuing a conversation based on the preceding words. The primary challenge lies in predicting the next token accurately, given the prior set of tokens: it requires capturing the context, syntax, semantics, and nuances of human language to make the most probable prediction. To do so, the network relies on building blocks closely related to the regression models described earlier. Its layers are built from many learned linear functions, combined with simple non-linearities, that identify patterns and relationships between words, and its final step converts a score for every possible next token into a probability, much as logistic regression does. Stacked together at enormous scale, these simple components progressively build a coherent response.
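That very last step, turning a score for each candidate next token into a probability, can be sketched on its own. The scores below are invented; in a real model they are produced by the network's many layers, and the vocabulary contains tens of thousands of tokens rather than four.

```python
# Turning next-token scores into probabilities with a softmax, the same
# transformation used in (multinomial) logistic regression.
import math

scores = {"fish": 4.1, "boat": 2.3, "harbor": 1.7, "piano": -0.5}

def softmax(score_dict):
    exps = {token: math.exp(s) for token, s in score_dict.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

for token, p in sorted(softmax(scores).items(), key=lambda kv: -kv[1]):
    print(f"P(next token = {token!r}) = {p:.3f}")
```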

 

 

 

A Few Closing Words About Scale

Throughout this article, we have used examples with only a few data points and only one or two features to make the concepts and tools used to create AI systems easier to understand. The difference between these examples and real-world applications is only one of scale, but the scale at which most modern AI algorithms operate can be hard to think about. GPT-4, for example, is reported to have over a trillion parameters, and while OpenAI hasn't disclosed how many computational cycles were used to train it, the company has said training cost more than $100 million. GPT-4's proficiency is deeply intertwined with the abundance of data it has been trained on: essentially, a large fraction of the text on the internet. The move toward ever-larger models has been driven by the rise of neural networks. These networked algorithms only begin to outperform earlier ML algorithms when a truly enormous amount of training data is available. In turn, as we have been able to build larger training data sets, the growth of the largest models has continued apace.

 

There are two major limitations on the scale of models. The first is clock time: time as humans experience it. We have discussed how learning is fundamentally an iterative process of making guesses and then improving on them. That process can take billions of cycles to begin to perform well, which can take much longer than humans are willing to wait. This is why much of the advancement in models has come from strategies that break up the work so that as much of it as possible can be done simultaneously, in parallel, reducing the amount of clock time required to learn over the same number of cycles. Broadly speaking, cutting-edge engineering is the art of doing the impossible by transforming the problem until it can be broken into small chunks that are, on their own, possible. This is essentially what we do when we transform an iterative algorithm that must run in sequence into one that can run in parallel. While this turns an impossible problem into a possible one, the possible problem is often extremely expensive to solve. Unlike in many other fields, when computer scientists talk about throwing money at a problem, it is generally perceived as a good thing, and the best option if you can find it, because the alternative is throwing time at the problem, or not being able to solve it at all.

 

The second major limitation on the scale of AI systems is the availability of training data. As we have discussed, neural networks only begin to perform really well when they have a lot of training data. This is why many early machine learning successes came on problems where some natural training set was available. Once the low-hanging fruit of natural training sets was exhausted, the race began to build new corpora of human-labeled training data for all sorts of problems. The need for this human-labeled data has driven an entire industry of data labelers in middle-income countries, as well as the photo CAPTCHAs familiar to most internet users. This labeled data has been absolutely transformational in AI systems' ability to understand visual data in particular. The other approach to generating more training data is to adjust the problems we ask AI systems to 'learn.' As we discussed, GPT-4 and other LLMs work by trying to predict the next word given a previous set of words. Framing the problem this way transforms the entire library of human-generated text from unlabeled data into high-quality training data.
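The sketch below shows how a single sentence becomes a series of next-word training examples with no human labeling required; the sentence is invented for illustration.

```python
# Turning plain text into (context, next word) training pairs.
text = "the boat returned to the harbor before the storm"
words = text.split()

training_examples = []
for i in range(1, len(words)):
    context = words[:i]    # everything seen so far is the input
    target = words[i]      # the very next word serves as the "label"
    training_examples.append((context, target))

for context, target in training_examples[:3]:
    print(f"input: {context} -> predict: {target!r}")
```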

 

As scale grows, it gets more and more difficult to understand why models generate the outputs that they do. Models also sometimes generate unexpected outputs; this is sometimes referred to as emergent behavior. When this leads to good performance on tasks models weren't initially trained for, it is typically celebrated as transferability. When it leads to confidently wrong output, it is often derided as hallucination. But fundamentally, both manifestations are unintended behavior; the difference is one of outcome, not of process.


[1] We use the term ‘document’ to refer to any cohesive piece of text we might be working with – from a social media post to a long paper.