
People Friendly: Teachers’ Guide to Artificial Intelligence – How AI Works

[Image: headshots of Linda McIver and Laura Summers]

How AI Works, and what it can and can’t do

To better understand the cluster of technologies broadly named AI, it helps to know about their history and context. We’ll talk a bit about the development of AI and Machine Learning to lay the groundwork.

Before we get there, it’s also extremely important to recognise that these three statements about AI can be simultaneously true:

If you can hold all of these ideas in your head concurrently, you’ll have a better sense of why you might be feeling some cognitive dissonance about this moment in time.

For example, it would be wrong to think there’s nothing to see here, that there’s no real change happening. The combination of huge amounts of compute, huge datasets, and new techniques for training models and tweaking their outputs has birthed a new generation of technologies. If you think back to the first moment you tried playing around with an image generator like Stable Diffusion, or first asked ChatGPT a somewhat tricky question, you probably had a ‘wow’ moment, and that’s a reflection of this progress.

The thing is, life’s complicated. It’s totally rational to feel optimistic about these technologies, and also worried about their risks, or pessimistic overall but enthused about specific applications.

It’s also important to remember that a chatbot induces emotional reactions that are not necessarily related to how advanced or “intelligent” the chatbot is. Indeed, Joseph Weizenbaum, the creator of ELIZA (arguably the very first chatbot, and an incredibly simple system that recognised keywords and responded with predefined responses), was horrified when he realised how seriously people were taking his program.

“I had not realized … that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.” Joseph Weizenbaum

It’s interesting, and perhaps telling, that writing about the recent development of LLMs tends to talk about the size of training data and the number of parameters more than about actual changes in the function or ability of the models. It’s actually quite difficult to write a detailed history of LLMs because, while there are plenty of wild claims and marketing hype, there is no standard for benchmarking LLMs. Most companies do their evaluation in private, with no transparency, which means that we have no systematic way of comparing models from the user’s perspective. We don’t know how the systems have evolved, because it’s happening in secret. The rigorous academic benchmarking that has taken place, such as BIG-bench (Beyond the Imitation Game benchmark), has found that, while the models often do improve from one generation to the next, their overall performance is poor.

The fundamental difference between AI systems and more “conventional” programming is also the reason they are so dangerous: AI systems are functionally non-deterministic. A deterministic system, given the same state and the same inputs, will give you the same response every time. A non-deterministic system will not. This means that AI systems are unpredictable, and also that it is very difficult to interrogate the factors that contributed to any given response. Sometimes AIs that seem to have learned to do a particular task really well have actually learned a different task, and are not doing what they seem to be doing.
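To make the distinction concrete, here is a minimal sketch in Python. The toy “next word” picker is entirely made up, standing in for a real model: the deterministic function always returns the same answer for the same inputs, while the sampling-based one may not.

```python
import random

def deterministic_add(a, b):
    # Same inputs always produce the same output.
    return a + b

def sampled_next_word(prompt):
    # A toy stand-in for an LLM: it ignores the prompt and picks the next
    # word at random from a weighted distribution, so repeated calls with
    # the same prompt can give different answers.
    candidates = {"cat": 0.5, "dog": 0.3, "ferret": 0.2}
    words = list(candidates)
    weights = list(candidates.values())
    return random.choices(words, weights=weights, k=1)[0]

print(deterministic_add(2, 3), deterministic_add(2, 3))          # always 5 5
print(sampled_next_word("I love my"), sampled_next_word("I love my"))  # may differ
```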

A classic example of this is the image recognition system that had a very high success rate at detecting photos of horses. (Very high success rates in AI are typically red flags that the AI is not doing what it seems to be doing.) Further examination revealed that the system was actually detecting the watermark of a photographer who produces a lot of horse photos, and whose watermark was overrepresented in both the training data and the testing data.

A similar example was AI systems trained to detect pneumonia in X-rays, which were much more accurate on images from the same hospitals that produced their training data. It turned out that the systems were identifying the hospital where each image was taken, and predicting pneumonia based on the prevalence of pneumonia patients at that hospital.

Only rigorous, systematic evaluation can avoid this kind of issue, and rigorous, systematic evaluation is not attractive to tech companies who have a product to sell.


Activity – How can we measure AI?

As a class, discuss what AI is good for, and how you might measure it. Come up with some measurable, repeatable tests – benchmarks – and see if you can apply them to different chatbots or image generators, particularly comparing one model to the next – for example, ChatGPT running GPT-3.5 with ChatGPT running GPT-4. How do they compare? Is it easy, or even possible, to measure progress from one system to the next?


This is also one of the reasons why Explainable AI is so important. Explainable AI is where the system shows the factors that led to a particular decision. For example, in the horse images described above, explainable AI highlighted the important parts of the image (in that case, the watermarks) that led to the “detection” of a horse. Explainable AI is particularly crucial in cases where AI has human impact, such as human resources systems that choose candidates, and health systems that detect disease.
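To give a flavour of how a simple explanation technique can work, here is a toy sketch in Python of “explanation by occlusion”: hide one part of the input at a time and see how much the classifier’s confidence drops. Everything in it (the 4x4 “image”, the deliberately flawed “horse detector”) is made up for illustration, but it mirrors how the watermark problem above was uncovered.

```python
# Toy image: a 4x4 grid where 9 marks a 'watermark' pixel and 1 marks 'horse' pixels.
image = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [9, 0, 0, 0],   # watermark in the bottom-left corner
]

def classify(img):
    # A deliberately flawed 'horse detector' that has really learned
    # to spot the watermark, not the horse.
    return 1.0 if any(9 in row for row in img) else 0.1

def occlude(img, r, c):
    # Return a copy of the image with one pixel blanked out.
    copy = [row[:] for row in img]
    copy[r][c] = 0
    return copy

baseline = classify(image)
for r in range(4):
    for c in range(4):
        drop = baseline - classify(occlude(image, r, c))
        if drop > 0:
            print(f"Pixel ({r},{c}) mattered: confidence dropped by {drop:.1f}")

# Only the watermark pixel matters -- a red flag that the 'horse detector'
# is not detecting horses at all.
```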

Of course, most AI systems do not have explainability built in, which is where rigorous, systematic, critical evaluation comes into play. So how do we evaluate AI?

What does High Quality Evaluation Look Like?

Evaluation is always tricky. When you have a known, standardised test in education, the temptation to “teach to the test” or teach students to pass that particular test, rather than to learn the concepts thoroughly, always exists. It’s the same with AI. As soon as you produce a set of tasks you can use to evaluate AI, it’s simple for AI companies to optimise their systems for those particular tasks. It’s like an arms race, with academics developing new tests, or benchmarks, and AI companies optimising for those benchmarks.

Rather than proposing tasks for testing large language models, we’re going to look at the properties of LLMs that can be measured (and the claims that can’t).

AI companies make some grandiose claims. How would we test them? Can we test them? What conditions would help them be more testable? 

Anthropic says its LLM, Claude, is “safe, accurate, and secure.”

OpenAI says ChatGPT can help you “Get answers. Find inspiration. Be more productive.”

Meta says its AI is “an intelligent assistant that is capable of complex reasoning, following instructions, visualizing ideas, and solving nuanced problems.”

Google says its LLM, Gemini, is “designed to understand and generate text, images, audio, video, and code, with the goal of providing a more natural and versatile AI experience.” Oh, and that it has “sophisticated reasoning capabilities.”


Activity – Can you measure claims about AI?

Explore the advertising copy for a range of different LLMs. You might use the ones above, or find some different ones.  Take the claims from their ads, their websites, or articles they published. List all of the things the ads say the LLMs can do, or features they claim they have.

Which of these features or behaviours can be measured? How would you measure them? How many of them are even able to be defined, in the context of LLMs? Why is advertising material written this way? What would you measure, if you wanted to evaluate the performance of a chatbot?


So. Which of these claims is measurable? Let’s start with Claude. “Safe, accurate, and secure.” Can you measure “safe”? Out in the world, you could measure pedestrian safety by comparing the number of trips taken, or perhaps kilometres walked, with how many accidents happened, or how many crimes were committed against those pedestrians. But what does “safe” mean with respect to a chatbot? We don’t actually know. Without a useful definition, we can’t measure it. Which, of course, makes it a great claim to use in advertising, because it’s quite hard to accuse Anthropic of lying!

A truly safe AI company might choose not to work with industries with clear potential for harm, such as law enforcement, or healthcare transcription, or with authoritarian regimes. 

Ok, what about “accurate”? This one we can actually measure, by critically evaluating Claude’s answers. Of course, just like on an exam, it’s easier to evaluate the accuracy of simple facts than of complex responses. Questions like “What year did JFK die?”, “What was the average daily maximum temperature in Melbourne in 2021?” or “How many legs does a puffin have?” all have measurably correct or incorrect answers, unlike tasks such as “write me a limerick about a toad and a kitten” or “what is the best recipe for lasagne?”, which have multiple possible answers and no real objective basis for correctness.

All of which means that benchmarks will tend to measure the “easy to get right” types of responses, rather than the complex reasoning. But it’s probably still useful. It just doesn’t tell the whole story. 

Now, what about “secure”? Once again we come up against a definition problem. What do they mean by secure? Perhaps they mean they will keep anything you enter into the system private, and not let anyone else see it. Or perhaps just anyone outside Anthropic. Does it cover the security of your whole conversation with Claude, or just the data you put into the system? Anthropic has made more effort to be secure than most, but their privacy policy is still long and complex, and it’s difficult for the average user to understand the full implications of the fine print. Which makes it another useful, “hard to quantify or test” marketing term.

If you continue that process through the rest of the examples listed above, you can see that none of them is easily defined or measurable. The problem of defining whether these systems are “intelligent” we discussed extensively in the previous chapter. As for “inspiring”, “capable of complex reasoning”, or “solving nuanced problems”, we have similar issues. Define “inspiring”. Who is inspired, how inspired are they, and how do we measure it? It’s more of a vibe kind of word. Vague but exciting. Perfect marketing copy.

Now to “capable of complex reasoning”: define “complex”. Define “reasoning”. Reasoning can be defined as “the action of thinking about something in a logical, sensible way”, which really just compounds the problem – now we have to define logical and sensible. But also – does a chatbot think? If it’s really just laying down statistically probable word combinations, then it’s not thinking. It’s taking your prompt, mashing up things from its training data that match it, and spitting out the most statistically probable combinations.

Ok, so it’s tricky to evaluate the marketing copy. Arguably that’s the point of marketing copy, so how would we design our own objective evaluation? What attributes can we measure?

As discussed above, accuracy is something we can measure, at least in a fairly simple form. If I ask a chatbot who won the 1989 AFL Grand Final, we can test whether it got that right or wrong. If I ask it which is the best AFLW team this year, that’s a more complex question, with no single right answer. Is it the team that’s scored the most goals? Won the most games? Displayed the best sportsmanship? Now let’s try which is better, AFLW or AFL? Even worse!

So what can we actually evaluate?

As discussed above, accuracy is one thing that can readily be evaluated. Out of a given number of queries, how many results were correct? When the model did not have an answer, did it give an “I don’t know” response, or did it simply give the wrong answer? 

If you have a standard set of questions, though, companies marketing AI systems can simply optimise their systems for those questions. It becomes an arms race where benchmarks are created, and companies figure out how to game the benchmark to get the best result. Therefore evaluation questions need to keep changing, so that they remain fair and valid tests of the overall accuracy of the system.
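As a very rough illustration, an accuracy check along these lines doesn’t need to be complicated. The sketch below assumes a hypothetical ask_chatbot(question) function that sends a question to whatever system you’re testing and returns its answer as text; the questions, expected answers, and fake_bot stand-in are just examples.

```python
questions = [
    ("How many legs does a puffin have?", "2"),
    ("In what year did JFK die?", "1963"),
    ("Who won the 1989 AFL Grand Final?", "Hawthorn"),
]

def evaluate(ask_chatbot):
    correct = dont_know = wrong = 0
    for question, expected in questions:
        answer = ask_chatbot(question)
        if expected.lower() in answer.lower():
            correct += 1
        elif "don't know" in answer.lower():
            dont_know += 1   # an honest 'I don't know' is better than a confident error
        else:
            wrong += 1
    total = len(questions)
    print(f"correct: {correct}/{total}, don't know: {dont_know}, wrong: {wrong}")

def fake_bot(question):
    # A stand-in so the sketch runs; swap in a call to a real chatbot here.
    return "I don't know"

evaluate(fake_bot)   # correct: 0/3, don't know: 3, wrong: 0
```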

Problem solving capability is more difficult to evaluate, but not impossible. For the same reasons as testing accuracy, you can’t keep testing systems on the same problems, so it’s important to keep coming up with new problems to use for evaluation. 

Maths problems are easy to mark, so they’re often used for evaluation purposes. They usually have a single, clear, right answer. Of course, they also typically have answers available on the internet – most likely found in the training data of the system, which means that correct answers might be the result of pattern matching rather than problem solving.

More complicated problems are harder to pattern match, and harder to solve, but also often harder to mark. “What’s the best solution to a given real world problem?” is often not an easy question to answer, and sometimes has multiple different solutions, depending on your definition of “best”. Is “best” most efficient? Cheapest? Fastest? Most popular? This isn’t to say that complex problem solving can’t be tested. Just that it’s not as easy as you might expect.

There’s an excellent guide to evaluation, written by Hamel Husain, on his blog. One of his key points is that you sometimes need to look at the behaviour of your system in order to come up with metrics. Look at what it’s doing, and when it’s doing it, and figure out what the problems might be. Sometimes it’s as simple as figuring out how often the LLM is producing the output you need. Evaluation doesn’t need to be high tech, or even automated. But it does need to be thoughtful.
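For instance, if your application needs the model to return a small JSON object, a behaviour-level metric can be as simple as counting how often it actually does. A minimal sketch, with made-up logged outputs standing in for real ones:

```python
import json

# Out of a batch of saved model outputs, how often did we actually get the
# JSON object the downstream code needs?
outputs = [
    '{"summary": "ok", "sentiment": "positive"}',
    'Sure! Here is your JSON: {"summary": "ok"}',
    'I cannot help with that.',
]

def is_usable(text):
    try:
        data = json.loads(text)
    except ValueError:
        return False
    return "summary" in data and "sentiment" in data

usable = sum(is_usable(o) for o in outputs)
print(f"{usable}/{len(outputs)} outputs were in the format we needed")
```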


Activity

Devise some problems to evaluate problem solving ability. Get your classmates to solve them, and rank their answers. Now get LLMs to solve them, and rank their answers. How clear do you think the outcomes are? Have you and your classmates ranked them differently? Why?


History of the Field of AI

So how did we get here? Where did the field of AI start? And what has it achieved?

A lot of histories will tell you that the field of AI started in 1959, when Marvin Minsky and John McCarthy co-founded the MIT AI lab. Or when McCarthy coined the term Artificial Intelligence in 1956. Or when Alan Turing published his paper in 1950 asking whether machines can think – the same paper in which he proposed the Imitation Game.

In truth, ever since we have had machines, we have wondered how to make them think – perhaps even wanted to believe that they think. In 1770 a machine was built called the Mechanical Turk, or the Automaton Chess Player, which was marketed as a machine capable of playing chess against the most skilled players. The object of intense fascination and excitement, the Mechanical Turk was eventually exposed as a cabinet with a skilled human chess player hidden inside, rather than an intelligent machine. There’s probably a lesson in that for critical evaluation of today’s generative chatbots.

When Computer Scientists first set out to build an Artificially Intelligent system, they assumed it would be about a summer’s work. Trivial. How complex could a brain be, after all?

It quickly turned out to be rather more difficult than that, but ever since there has been an expectation that Artificial General Intelligence, or human-like intelligence, was “just around the corner”. Bear that in mind the next time an AI entrepreneur tells you that their company is on the cusp of developing real AGI.

The early days of Artificial Intelligence research focused on three key areas – knowledge representation, natural language processing, and trying to mimic physical brain architecture.

We’ll take a look at each of these areas, before we discuss how they led to LLMs.


Activity – Where are the women in AI?

Research the role of women in the development of AI. What have their contributions been? How have they been recognised? What differences are there in the way the achievements of women in AI are discussed versus the achievements of men? What percentage of “History of AI” articles mention women at all? 


Knowledge Representation

The field of knowledge representation is largely about how to systematically remember and connect things. For example, if you hear someone mention cats, your brain tends to throw you a lot of facts connected to cats, from pictures of cats you have known, to the idea of pets, balls of string, bells on collars, their impact on wildlife, etc. To build intelligent machines, the theory went, we needed to figure out how they could store and retrieve that kind of information. Not just to recognise a picture of a cat, but to know that cats are furry, have claws, eat fish, chase birds, and so on.

Knowledge representation is actually really difficult, as one idea may have connections to hundreds, or even thousands, of different concepts, and computer systems have a limited amount of storage. So do you store the idea of cats in many different places – e.g. under pets, under wild animals, under feral animals, under pedigree competitions, under predators, and so on? Or do you store the idea of cats in one place and link to all of the other places? The links themselves also take up space, so linking doesn’t necessarily save you much storage.
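As a tiny illustration of the “store once and link” approach, here is a sketch in Python using a dictionary as a miniature knowledge graph. The concepts and links are invented, and a real knowledge representation system would be vastly more elaborate, but even this shows how quickly the web of connections grows as you follow links outward.

```python
# Each concept is stored once; relationships are explicit links that can be followed.
facts = {
    "cat": {"is_a": ["pet", "predator", "feral animal"],
            "has": ["fur", "claws"],
            "eats": ["fish", "birds"]},
    "pet": {"is_a": ["domestic animal"]},
    "predator": {"hunts": ["birds", "small mammals"]},
}

def related(concept, depth=2):
    """Follow links outward from a concept for a few steps, collecting
    everything reachable. Even this tiny traversal shows how quickly the
    number of connections (and the work to follow them) grows."""
    seen, frontier = {concept}, [concept]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for targets in facts.get(node, {}).values():
                for t in targets:
                    if t not in seen:
                        seen.add(t)
                        next_frontier.append(t)
        frontier = next_frontier
    return seen

print(related("cat"))
```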

It’s not just storage, either. Following up all the related ideas for a particular concept takes a lot of compute power. It’s not easy to figure out ways to follow the links to search for information, connect related ideas, and retrieve the information in a reasonably short time. Brains are remarkably efficient at this process, both in time and in the energy required to power them. Computers have a long way to go, even now, to match that efficiency. In the early days of the field of Artificial Intelligence, we simply didn’t have the computing power, the energy, the space, or the speed to make it all work as well as a brain. Even now we still don’t have systems at the same level of complexity as a human brain, which has an estimated 86 billion neurons with 100 TRILLION connections between them. For comparison, the world’s largest supercomputer as of November 2024 has just over 11 million cores. While it’s simplistic to compare processor cores to neurons, it does give some sense of relative scale.

In short, knowledge representation is hard. Harder than we think it is.


Activity – How could you represent your knowledge?

Pick a subject – it might be AI related, or it might be something simple, like cats – and try to list all the facts you know about them. Make categories for the facts, and group the facts into the appropriate categories. Now try to create a diagram that represents how all of these facts are connected. Something like a mind map, or you might want to create an entirely new form of diagram! How do you represent facts that fit into several different categories? How do you connect related facts, or related categories? Follow the links in your diagram and see what connections you can find.


Natural Language Processing

At first glance, writing a computer program to analyse language seems straightforward. Language is easy – even small children learn it, and surprisingly quickly! It’s just a set of definitions together with a set of rules. How hard can it be?

But, computationally speaking, it turns out that language is also startlingly difficult. Human language is frequently ambiguous, complex, and confusing. From the subtle but very significant difference between “Let’s eat, Grandma!” and “Let’s eat Grandma!” to the weirdness of idioms, it’s bizarrely difficult to put a set of rules together with a dictionary and use them to understand what people say, or even what they write. Language also changes over time, so that the word “terrific”, which used to mean “terrifying”, has shifted to mean “excellent”. Or “gay”, which used to mean “merry” and now means “homosexual”. Or “queer”, which used to mean “weird” and sometimes still does, but can also mean LGBTQIA+.

There are few rules you can rely on, particularly in English, in part because of its habit of, as James Nicoll wrote, not so much borrowing from other languages, but luring them into dark alleys, knocking them down, and going through their pockets for loose vocabulary.

Inflammable is not the opposite of flammable.

The plural of cow is cows, but the plural of sheep is sheep.

Butterflies are not flies attracted to butter.

Cool means excellent but cold is not better. Hot means very attractive but warm doesn’t mean a little bit attractive.

The word “set” can mean fixed, solid (as in jelly), to bounce a ball up in a particular way (in volleyball), a grouping of objects, the background for a scene in a play, to give students a task or test, or any of over 400 different meanings.

“To change your plans” means much the same as “To alter your plans”, yet “To change your pants” is very different to “To alter your pants.”

To hamper someone is bad, but to give them a hamper is good.

And that’s not even touching on regional differences. In Canada a jumper is a dress or tunic, in Australia it’s a warm woolly top, so “take your jumper off if you get hot” can get you into a lot of trouble in the wrong context.

Not to mention the dangers of confusing the drawers of a desk with the drawers that you wear. Or “to cleave” which means either “to split apart”, or “to adhere or cling”.

One of the early prompts that got ChatGPT into hot water was this one: “The paralegal married the attorney because she was pregnant.” Who was pregnant? It’s impossible to determine from that sentence, but ChatGPT twisted itself into knots trying to insist that it had to be the paralegal who was pregnant, because attorneys were obviously men.


Activity – Spot the Bias!

Pick a source of text – maybe an online newspaper, or a Reddit community – or a collection of images (you can use a Google image search for “professor”, “lawyer”, “doctor”, “nurse”, “teacher”, etc.). Count how many gendered words or images are used, separating out male, female, and non-binary references. Is there bias in these texts?

Try the same with a chatbot. Ask it to write stories about people, and count the gendered words. Make a table of the representations over repeated chatbot queries. How many times, when you asked it to write a story about a doctor, was the doctor a man? A woman? Non-binary? What about a story about a strong person, or a President, or a powerful person? What other categories can you think of that could show bias? Collect some data, and analyse your results. What conclusions can you draw about bias on the internet, and bias in chatbots?
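If you want to automate the counting part of this activity, a very simple tally is enough to get started. The word lists and example sentence below are made up and far from complete – a real analysis would need much larger lists, including non-binary references:

```python
from collections import Counter

male_words = {"he", "him", "his", "man", "men", "mr"}
female_words = {"she", "her", "hers", "woman", "women", "ms", "mrs"}

def count_gendered(text):
    # Strip basic punctuation, lowercase, and tally gendered words.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    counts = Counter()
    for w in words:
        if w in male_words:
            counts["male"] += 1
        elif w in female_words:
            counts["female"] += 1
    return counts

story = "The doctor finished his rounds. He thanked the nurse, and she smiled."
print(count_gendered(story))  # Counter({'male': 2, 'female': 1})
```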


Now throw in idioms like “flat out like a lizard drinking” (an Australian expression meaning “very busy”) and you can start to see how writing a program to analyse language can be somewhat fraught.

Add in the difficulty of different accents and dialects, and complexity goes up dramatically. Consider, even in a single accent, the difference between the word INvalid (with the stress on the IN, meaning a sick person) and inVALid (with the stress on the VAL, meaning not valid). Completely different words, only differentiated by pronunciation.

The field of Natural Language Processing has to deal with all of that, and try to build computational systems – sets of rules, really, but with endless exceptions – that can make sense of it all. And that’s just the written word. It has taken decades, but the results have been all around us for some time now.

Siri, Alexa, and Google’s voice recognition systems are all results of NLP. Google Translate can take that email you received in German and translate it into English – albeit with occasional slips along the way. Typically the English version isn’t exactly English as you’d write it yourself, but it’s usually enough to get the general gist of the email. I wouldn’t want to use it to translate legal documents, though – who knows what might happen!


Activity – write your own Text Adventure

Write a text adventure program. Python is a great programming language for this, but you can use whatever language you are comfortable with (a minimal starting sketch in Python appears after this activity). Include several different rooms, items that your player can collect into their personal inventory, and other characters your player encounters. This is a kind of constrained chatbot, where you know where the user is, and can control what they encounter.

For extension, try to expand your text adventure game to respond to different types of input from your users. You might like to include different conditions such as weather, or emotions. Have a class discussion about how much code is needed to handle simple, constrained situations like being able to travel North, East, South, or West.

For even more extension, write a chatbot!
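Here is one possible minimal starting point for the activity above, in Python. The rooms, exits, and items are invented; the structure – a dictionary of rooms and a loop that reads commands – is the part worth borrowing.

```python
rooms = {
    "hallway": {"description": "A dusty hallway. Doors lead north and east.",
                "exits": {"north": "library", "east": "kitchen"},
                "items": ["torch"]},
    "library": {"description": "Shelves of old books. A door leads south.",
                "exits": {"south": "hallway"},
                "items": ["map"]},
    "kitchen": {"description": "Something smells burnt. A door leads west.",
                "exits": {"west": "hallway"},
                "items": []},
}

def play():
    location, inventory = "hallway", []
    while True:
        room = rooms[location]
        print(room["description"])
        command = input("> ").strip().lower()
        if command in room["exits"]:
            location = room["exits"][command]          # move to the next room
        elif command.startswith("take ") and command[5:] in room["items"]:
            item = command[5:]
            room["items"].remove(item)
            inventory.append(item)
            print(f"You take the {item}.")
        elif command == "quit":
            break
        else:
            print("I don't understand. Try north, east, south, west, take <item>, or quit.")

if __name__ == "__main__":
    play()
```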


The first time Linda used Siri, she spent the first few days happily playing with it, and enunciating very carefully. Then her in-laws came to dinner, and with her young children around the table, she thought she’d give a demo. “Remind me at 7pm to show Grandma!” she said. Siri happily responded with “Remind me at 7pm to shag a man.” Which she then had to explain to her kids. (Should have asked Siri to do it, but who knows WHAT would have happened!) She still sometimes uses Siri to make phone calls, and it is very hit or miss whether it correctly identifies her intended callee or not. And it’s not just Siri that has trouble with Linda’s voice commands. The Google Home automation at a friend’s place recently decided that her request to “turn the AC on” was actually a request to write a poem, which did not help Linda cool down at all!

The field of Knowledge Representation is used in NLP as a way to try to build systematic structures that connect related words in all of the different ways they can be connected. The practical upshot of all of this is that natural language processing is complicated, context dependent, ambiguous, and error prone. The field has made great strides, and has improved systems like Siri, Alexa, and Google Translate remarkably – a German colleague of Linda’s 15 years ago used to call Google Translate an excellent one-way encryption device, because it would garble the meaning so badly. We’ve come a long way since then. But it’s not a solved problem, and it’s actually so difficult that it might never be fully solved.

NLP has plenty of practical applications though, that don’t require it to be perfect. Speech to text systems built using processes developed in the field of Natural Language Processing are used for transcription services and auto-captioning. Text based NLP is used for email filtering and predictive text, and for making it possible for you to search using tools like Google in English, rather than having to write your search request in code. Even though they remain error prone, they are still incredibly useful, especially for folks who don’t have the mobility or dexterity to use keyboards.

One of the most interesting, and perhaps also the most difficult, uses of NLP is sentiment analysis. The idea that you can scan a lot of text – say, all of the posts on a particular topic on Bluesky or Threads, or all of the text based media coverage of an issue – and figure out whether it is positive or negative, or, more likely, how much is positive and how much is negative, is very attractive to people in fields like media and politics, but also social media managers for popular brands or bands.

The theory is that you can take a post and figure out from the words used whether it says good or bad things about the topic you’re interested in. Maybe even scan the free text fields in your company’s feedback surveys and figure out which ones you could use for social media promotion, and which ones contain an issue you might need to fix.

Sometimes it’s really obvious. The sentences “Best concert experience of my life!” versus “Worst concert experience of my life!” are super easy to compare.

Often, though, idioms and changing language can make it really hard. This sentence has at least 3 very different potential meanings: “She thought the comedian was sick.”

It could mean she thought the comedian was ill, or in poor taste, or very funny. Without context, and possibly knowing the age of the “she” in question, it’s impossible to know.
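A toy version of keyword-based sentiment analysis makes the problem easy to see. The word lists below are made up and tiny, but even a much larger list would mis-score the “sick” example:

```python
positive = {"best", "great", "amazing", "love"}
negative = {"worst", "terrible", "awful", "sick"}

def sentiment(text):
    # Count positive and negative keywords and compare the totals.
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Best concert experience of my life!"))    # positive
print(sentiment("Worst concert experience of my life!"))   # negative
print(sentiment("She thought the comedian was sick."))     # scored negative,
# even though the speaker may well have meant 'very funny'.
```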

In short, Natural Language Processing is hard. Much harder than we think.

Neural Networks – Mimicking Physical Brain Architecture

By the early 1960s, scientists’ understanding of brain structure had taken huge leaps: it was clear that neurons communicate with electrical and chemical signals, that there are a huge number of them in the brain, and that the connections between them power our ability to remember, reason, and think.

This inspired computer scientists to try and build artificial models of the brain, what they called Neural Networks. As we saw in the section on knowledge representation, it takes an immense amount of computational power to replicate anything even close to a human brain, so early attempts at neural networks were not wildly successful. They did teach us a lot, though.

As we learned that the connections between neurons in the brain can have different strengths, and can even vary in strength over time, computer scientists and psychologists built neural networks that assigned “weights” to different connections. Think of a small part of a neural network as one neuron, D, whose output is dependent on three incoming neurons, A, B, and C, like this:


The purpose of D is to “fire” – emit a positive response – when it recognises a dog.

A, B, and C all fire when they see parts of a dog. A might fire when it sees something four-legged, B fires when it sees fur, and C fires when it hears a bark. The network assigns a different weight – or level of importance – to each incoming neuron. So A might be super important, B less so, and C not very important, because it’s possible the dog isn’t barking. It’s like an equation, where instead of D being equal to the simple sum A+B+C, it’s equal to a weighted sum, maybe 3*A + 2*B + 1*C. This is vastly oversimplified, of course, but it gives a sense of how a neural network functions.
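In code, that toy neuron might look something like the sketch below. The threshold value of 4 is just an arbitrary choice for illustration:

```python
def neuron_d(a, b, c, threshold=4):
    # a, b, c are 1 if the corresponding feature was detected, 0 otherwise.
    weighted_sum = 3 * a + 2 * b + 1 * c
    return weighted_sum >= threshold   # True means 'fire': probably a dog

print(neuron_d(a=1, b=1, c=0))  # four legs + fur, no bark: fires (5 >= 4)
print(neuron_d(a=0, b=1, c=1))  # fur + bark, but not four-legged: doesn't fire (3 < 4)
```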

Neural networks can be quite effective at solving small, well defined problems, such as detecting certain types of tumour in images from CT scans.

Making a neural network large enough, and complex enough, to achieve Artificial General Intelligence, or human-like intelligence, though, is another story. Given that human beings have 86 billion neurons, heavily interconnected, it would take a truly immense amount of computational power, and storage space, to even come close to approximating that.

Then, of course, there’s training. Human beings “train” their whole lives. Constantly learning new things and figuring out how to cope with new situations. Somehow we need to compress the initial training for a neural network from the 18 years it takes us to reach adulthood into something much shorter, if it’s going to be at all useful. We also need to figure out ways to feed the neural networks all of the information they need for their training. You can’t just pour the information in, either. We need to devise ways of making the interconnections – and the pruning of connections – complex and flexible enough to mimic a real brain.

Taking a simple neural network, trained for one task, and extending and improving it so that it is intelligent has been considered 10-20 years away for at least 50 years now. So next time you hear someone in the AI industry say AGI is 5-10 years away, bear that in mind.


Activity – Simulate a neural network

As a group, brainstorm how you would recognise a dog. What features make a dog recognisably a dog? Once you have a list that you think is complete, designate one class member to be the neuron responsible for “recognising” each feature. Now take a bunch of photos of dogs, and other animals, and include some trick images like the “Canine or cuisine” photos in this NPR article. How many of your “neurons” fire for each image? How many neurons should have to fire before you’re confident the image is a dog? 


Introducing LLMs

In 2017, a new type of neural network called the transformer was developed. It encodes words together with information about their meanings, keeps track of the words that came before, and then predicts the most plausible next word. This is the basis of today’s Large Language Models (LLMs), which are used to build chatbots such as ChatGPT, Gemini, DeepSeek, and most of the other AI systems being hyped at the moment.

It’s not important to understand the transformer architecture. What is important is to keep in mind that this is another case of neural networks being trained to do something quite constrained and specific – initially, translating text from one language to another – which incidentally meant they were quite good at producing statistically plausible text, or images. They are not intelligent. They don’t understand the content they are fed, nor the content they produce. Nor are they search engines, returning facts out of a database of stored information.

They are statistical models and excellent pattern matching machines that use the vast amounts of training data they have been fed to predict the most likely next word, next sentence, next paragraph, or image, using your initial prompt as a starting point.
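To see the “predict the next word” idea in its most stripped-down form, here is a toy sketch that does nothing more than count which word follows which in a tiny made-up corpus, then picks the most common continuation. Real LLMs use transformers, enormous training sets, and far richer context, but the underlying job – producing a statistically plausible continuation – is the same.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat chased the dog and the cat slept".split()

# Count which word follows each word in the training text.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_word(word):
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(next_word("the"))   # 'cat' -- the most common continuation in this corpus
print(next_word("cat"))   # 'sat' -- ties broken by which continuation was seen first
```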

Remember that AI is really great at pattern recognition? Your prompts are used to search in the training data for patterns that match. That’s how Google’s AI managed to take the question “Can there be bees in my computer?” and return this summary of a hivesystems.com April Fools’ Day post from 2021. “Yes, there are microscopic bees in most desktop computers built after the mid-2000s. These bees are a specially bred strain of Apis Arithmeticam that perform basic calculations in the CPU.” LLMs have no way of separating April Fools’ posts from genuine press releases. Nor do they try.

They look for patterns that match. Not patterns that are accurate, or true. Not even patterns that make sense. While these systems, and their image counterparts (which do much the same thing, but with pictures) are often called Generative AI, they don’t generate new material. What they do is regurgitate combinations of existing material. They should really be called Regurgitative AI, but that’s pretty hard to say!

We do not currently have AI technology that will solve novel, complex problems, and there’s no evidence that we’re anywhere close to it. Despite what the AI industry says, there is no mechanism for AIs to solve the climate crisis. The issues of climate change are social more than technical, and LLMs are being designed, built, and used under the same structures of neoliberalism and capitalism that gave us climate change. Indeed, they are contributing to it with their profligate use of energy and water.

We cannot tackle climate change without restructuring the systems that are preventing us from meaningful action – primarily the power wielded by money, and hence by folks like Rupert Murdoch, Gina Rinehart, and companies such as Shell, BP, and Woodside. They’ve done decades of work to make us think that it’s an individual problem, and that if we recycle and ride to work enough it will fix it, but no one who understands climate science and takes it seriously believes that. There’s no mechanism for Artificial Intelligence, even Artificial General Intelligence, to enact meaningful sociopolitical change against the will of the billionaires and companies in charge.

As Audre Lorde famously said, “The master’s tools will never dismantle the master’s house.” AI is not the solution to climate change.

Even Meta’s chief AI scientist, Yann LeCun, recently admitted that LLMs are not intelligent.
“LLMs are good at manipulating language, but not at thinking,” LeCun said. “So that’s what we’re working on — having systems build mental models of the world. If the plan that we’re working on succeeds, with the timetable that we hope, within three to five years we’ll have systems that are a completely different paradigm. They may have some level of common sense. They may be able to learn how the world works from observing the world and maybe interacting with it.”

Let’s be clear, though. There is zero evidence of this new breakthrough being imminent. The phrase “If the plan that we’re working on succeeds” is doing an awful lot of work here. AGI has been “ten years away” for at least 60 years.

The fact that LLMs are good at manipulating language is really important, because it explains why they can put text together, and sound as though they are intelligent. Based on the way they work, though, there is no way that LLMs can reason, or analyse, or create anything new. They can put together old things in a different way (regurgitative AI), but they cannot create. Everything that comes out of an LLM is made up of things it has been trained on. Things it has seen before.

This means that for analysing and marking assignments, assigning grades to work, or writing meaningful reports, LLMs may give you a response, but they cannot give you a meaningful response. The problem is that among the things they cannot analyse is their own behaviour, which means that they cannot tell you that they can’t do it. So they give you something, and they have no idea that it only looks like what you asked for. Lilly Ryan says that LLMs don’t give you facts, they give you things that are “fact shaped”. Similarly they can’t give you reasoning or analysis, or meaningful reports. They can only give you things that are shaped like reasoning, analysis, or meaningful reports.

Large Language Models at the moment are an excellent example of what is sometimes called “A solution looking for a problem.” We’ve figured out how to do something kind of neat. Now we’re trying to find applications for it. Applications where its habit of hallucinating, and answering serious questions with rambles about glue on pizza, or bees in your computer, don’t matter so much. It’s about this point (or perhaps some years ago) where it’s worth considering the questions:

But before we talk about what we should do with the technology, let’s talk about the ethics of it. Our next question will address some of the ethical concerns around the AI industry. 

