Cross-entropy
Crossing the bridge between physics and machine learning.
We’re all gonna die.
Blame the Second Law of Thermodynamics. Entropy increases, we get older, and we die.

This post is about a variant of entropy—called cross-entropy—that I wrote about in a post on machine learning. There, I presented cross-entropy as a measure of the difference between two probability distributions. Most explanations of the concept, including its Wikipedia entry, mainly focus on its relevance in information theory, not physics.
I learned in a talk by Sean Carroll during the Maryland Quantum-Thermodynamics Symposium that cross-entropy plays a central role in the informational reformulation of the Second Law. This way of thinking about entropy and the Second Law builds a fascinating bridge between machine learning and physics. Before we cross that bridge, let’s talk about plain old entropy.
Entropy Without a Cross
Entropy is one of the most important concepts in physics. It’s the main character of the Second Law of Thermodynamics, which states that the entropy of an isolated system increases over time.
Despite its importance, entropy is not as widely known or used as energy. Whether you’re trying to count your calories, arguing about the geopolitics of natural gas, or worrying about climate change, energy seems to take center stage. But that doesn’t quite add up. We know that energy is conserved; all we ever do is transform it from one form to another. Yet we sense that something is irreversibly lost when we “spend energy.” What exactly are we spending when we burn food or natural gas? Check out the next paragraph. The answer will surprise you!
Well, it’s entropy. And its story starts with heat.
The Birth of Heat
Thermodynamics is the study of heat, energy, and work. It was born in the 19th century during the Industrial Revolution from the desire to understand how to efficiently convert heat energy into mechanical work.
Sadi Carnot showed that the efficiency of a heat engine depends only on the temperature difference between the hot and cold reservoirs and not on the specific working substance or the details of the engine design.1 While this observation had huge practical implications, his main contribution for our purposes is the distinction between reversible and irreversible processes, which led to the notion of entropy.
The term entropy was coined by the German physicist Rudolf Clausius in 1865 as a counterpart to the term energy. Nineteenth-century German intellectuals were enamored with neoclassical Hellenism, which resulted in lots of Greek words in the scientific literature. Entropia roughly means “transformation” in Greek, so “Entropie” is a Greek-derived stand-in for the German word “Verwandlungsinhalt,” which Clausius used to describe the transformational content of energy.
When you burn natural gas to generate heat, you spend the transformational content of the natural gas. The heat that results from this process cannot be transformed back; entropy increases. Clausius formulated the observation that heat flows naturally from a hot body to a cooler one through the inequality

$$ \Delta S = \frac{Q}{T_{\text{cold}}} - \frac{Q}{T_{\text{hot}}} \geq 0, $$

which says that when an amount of heat $Q$ leaves a hot body at temperature $T_{\text{hot}}$ and enters a cooler one at $T_{\text{cold}}$, the total entropy can only grow.
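If you like to see the bookkeeping spelled out, here is a minimal sketch in Python with made-up numbers (they are not from Clausius, just an illustration): the hot body loses $Q/T_{\text{hot}}$ of entropy, the cold body gains $Q/T_{\text{cold}}$, and the total change comes out positive because the cold temperature is the lower one.

```python
# Entropy bookkeeping for heat Q flowing from a hot body to a cold one.
# The numbers are made up for illustration.
Q = 100.0       # heat transferred, in joules
T_hot = 400.0   # temperature of the hot body, in kelvin
T_cold = 300.0  # temperature of the cold body, in kelvin

dS_hot = -Q / T_hot    # the hot body loses entropy
dS_cold = Q / T_cold   # the cold body gains entropy
dS_total = dS_hot + dS_cold

print(f"Total entropy change: {dS_total:.4f} J/K")  # positive, as the inequality demands
```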
Entropy encapsulates the irreversible processes that we typically associate with energy usage. Concepts like the energy crisis actually refer to an entropy crisis: we need a continuous supply of low entropy to keep the world running.
Clausius’ inequality doesn’t give an origin story or an explanation for entropy. For that, we need statistical physics.
Atoms
The famous equality that describes entropy is engraved on Boltzmann’s tombstone2

$$ S = k \log W, $$

where $k$ is Boltzmann’s constant and $W$ is the number of microstates, the microscopic arrangements of atoms that correspond to a given macrostate.
The logarithm in the formula arises from known observations about entropy and simple combinatorics. Consider two systems. It was known that their total entropy is the sum of their entropies, $S_{12} = S_1 + S_2$, while the number of microstates of the combined system is the product of the individual counts, $W_{12} = W_1 W_2$. The logarithm is exactly the function that turns that product into a sum: $\log(W_1 W_2) = \log W_1 + \log W_2$.
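Here is a tiny numerical sketch of that argument (the microstate counts are made up): the combined system has $W_1 W_2$ microstates, and the logarithm turns that product into the sum of the individual entropies.

```python
import math

# Made-up microstate counts for two independent systems.
W1, W2 = 1_000, 5_000

S1 = math.log(W1)               # entropy of system 1, in units where k = 1
S2 = math.log(W2)               # entropy of system 2
S_combined = math.log(W1 * W2)  # the combined system has W1 * W2 microstates

print(S_combined, S1 + S2)  # equal, up to floating-point rounding
```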
The Second Law is then a probabilistic statement: among different macrostates, the system evolves towards a more probable configuration, one with a larger number of microstates. In this picture, we don’t expect entropy to always increase. It just happens to be more probable. You will run into fluctuations where entropy goes down if you wait long enough.3 A rather outrageous extrapolation of this idea is the Boltzmann brain: a self-aware brain that spontaneously appears in a universe through random fluctuations rather than through biological evolution.4
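To make “more probable means more microstates” concrete, here is a quick sketch with a toy system of my own choosing (not from Boltzmann): flip $N$ coins and call the number of heads the macrostate. Each macrostate contains a binomial number of microstates, and the evenly split macrostate overwhelmingly dominates the ordered ones.

```python
from math import comb

N = 100  # number of coins, or molecules choosing the left or right half of a box

# Count the microstates for a few macrostates "n heads out of N".
for n in (0, 10, 50):
    print(f"macrostate with {n:2d} heads: {comb(N, n)} microstates")

# n=0 has a single microstate, while n=50 has about 1e29 of them:
# a randomly shuffled system is overwhelmingly likely to look evenly mixed.
```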

Surpriiise!
With the rise of calculators, computers, and communication devices in the 20th century, information started to play a fundamental role in our description of physical phenomena.5 Shannon’s reformulation of entropy in The Mathematical Theory of Communication relates information to surprise.
What is surprise? To be surprised, you must have a prior expectation, some sense that things happen in a certain way. The more you expect something, the less surprised you are to see it, and vice versa. Therefore, the surprise $s(x)$ of an event $x$ should be a decreasing function of its probability $p(x)$:

- If you’re absolutely certain of $x$, then $p(x) = 1$ and you’re not surprised: $s(x) = 0$.
- If you’re absolutely certain that $x$ can never happen, then $p(x) = 0$ and its occurrence surprises you infinitely: $s(x) = \infty$.6
- Surprise should be additive: the total surprise for multiple events should be the sum of the surprises associated with each event. For two independent events $x$ and $y$, the combined probability is $p(x)\,p(y)$ and the total surprise should be $s(x) + s(y)$.

These conditions are satisfied by a formula that depends logarithmically on the inverse of the probability:

$$ s(x) = \log \frac{1}{p(x)} = -\log p(x). $$
Entropy is then the probability-weighted sum of surprise. In other words, entropy is expected surprise:

$$ S = \sum_x p(x)\, s(x) = -\sum_x p(x) \log p(x). $$
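In code, these two definitions are one-liners. Here is a minimal sketch with a made-up distribution over four outcomes:

```python
import math

def surprise(p):
    """Surprise of an event with probability p: log(1/p) = -log(p)."""
    return -math.log(p)

def entropy(dist):
    """Expected surprise: the probability-weighted sum of surprises."""
    return sum(p * surprise(p) for p in dist if p > 0)

# A made-up distribution over four outcomes.
dist = [0.5, 0.25, 0.125, 0.125]

print(surprise(0.5))  # ~0.69: a coin-flip outcome is mildly surprising
print(entropy(dist))  # ~1.21: the expected surprise of the whole distribution
```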
Boltzmann used a similar formula already in 1866, yet the expression is named after Gibbs and Shannon. It reduces to Boltzmann’s first formula when all $W$ microstates are equally likely, $p(x) = 1/W$, giving $S = \log W$ in units where $k = 1$.7
An increase in entropy means that the expected surprise increases. This might sound a bit counterintuitive. We learned that entropy is a measure of disorder. How are disorder and surprise related?
It may be simpler to understand that patterns reduce total expected surprise. Let’s say every time I order a taxi, I get a yellow cab. Over time, the total expected surprise about the color of the taxi cab will be low even though I might get a blue cab once in a blue moon. If, however, the color of the taxi cab is different every single time, those little surprises add up and maximize the total expected surprise. Disorder increases total expected surprise over a collection of events. It’s highest when the events are random.
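Here is the taxi story in numbers, a quick sketch with made-up color probabilities: a city where almost every cab is yellow produces far less expected surprise than one where any color is equally likely.

```python
import math

def expected_surprise(dist):
    """Entropy: the probability-weighted sum of surprises -log(p)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Made-up probabilities over five possible cab colors.
mostly_yellow = [0.96, 0.01, 0.01, 0.01, 0.01]  # a yellow cab almost every time
anything_goes = [0.20, 0.20, 0.20, 0.20, 0.20]  # a different color every ride

print(expected_surprise(mostly_yellow))  # ~0.22: rarely surprised
print(expected_surprise(anything_goes))  # ~1.61: maximal surprise for five colors
```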
The Cross of Entropy
Boltzmann’s entropy, and its Gibbs-Shannon generalization, is built from a single probability distribution. Cross-entropy involves two: a true distribution $p$, which describes how events actually occur, and an assumed distribution $q$, which encodes our expectations.8 It is the expected surprise when the surprise is computed from the assumed distribution while the events follow the true one:

$$ H(p, q) = -\sum_x p(x) \log q(x). $$
Learning to Expect the Unexpected
In Gibbs-Shannon entropy, the same distribution plays both roles: it governs how events actually occur and it sets our expectations, so the expected surprise is as small as it can possibly be. Cross-entropy separates those roles.
We may formally use the true distribution, $p$, to weight the surprise while computing the surprise itself from the assumed distribution, $q$. Any mismatch between the two only adds expected surprise: the cross-entropy $H(p, q)$ is never smaller than the entropy of the true distribution, and the two are equal only when $q = p$.9 The excess measures how much we still have to learn about the system.
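Here is a minimal sketch of that property with made-up distributions: the expected surprise is smallest when the assumed distribution matches the true one, and any mismatch only adds to it.

```python
import math

def cross_entropy(p, q):
    """Expected surprise when events follow p but surprise is computed from q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up true and assumed distributions over three outcomes.
p_true = [0.7, 0.2, 0.1]
q_bad  = [0.1, 0.2, 0.7]  # expectations far from reality
q_good = [0.7, 0.2, 0.1]  # expectations that match reality

print(cross_entropy(p_true, q_bad))   # ~1.97: plenty left to learn
print(cross_entropy(p_true, q_good))  # ~0.80: equals the entropy of p_true, the minimum
```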
This property is why cross-entropy is so useful in machine learning, where it serves as the loss function in multiclass classification tasks. The true labels of the training samples serve as the true distribution; the output probabilities of the neural network serve as the assumed distribution. The cross-entropy loss is iteratively reduced by numerical optimization. Eventually, the predicted distribution from the neural network matches the true distribution of the labels sufficiently well. At that point, we say the machine has learned the training set.
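As a sketch of how this looks as a loss function, here is a made-up three-class example with a hand-rolled softmax (not any particular library’s API): the one-hot label plays the role of the true distribution, the network’s softmax output plays the role of the assumed distribution, and training drives the loss down.

```python
import math

def softmax(logits):
    """Turn raw network outputs into an assumed probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(one_hot, predicted):
    """Cross-entropy between the true (one-hot) label and the predicted distribution."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, predicted) if t > 0)

# A made-up training sample whose true class is the second of three.
true_label = [0.0, 1.0, 0.0]

early_logits = [1.0, 1.2, 0.9]  # untrained network: a nearly uniform guess
late_logits  = [0.5, 4.0, 0.3]  # after training: confident in class 2

print(cross_entropy_loss(true_label, softmax(early_logits)))  # ~0.94
print(cross_entropy_loss(true_label, softmax(late_logits)))   # ~0.05
```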
The Second Coming of the Second Law
The Second Law of Thermodynamics has been reformulated using cross-entropy by Bartolotta, Carroll, Leichenauer, and Pollack to “incorporate the effects of a measurement of a system at some point in its evolution.” The Bayesian Second Law of Thermodynamics uses an information-theoretic approach. Sean Carroll has a great blog post about this paper; you should read it. Here’s a short description in our context.
According to the Bayesian Second Law, the cross-entropy of the updated (“true”) distribution with respect to the original (“assumed”) distribution, plus the generalized heat flow, is larger when evaluated at the end of the experiment than at the beginning. For zero heat transfer, the expected amount of information an observer would learn by being told the true microstate of the system is larger at the final time than at the initial one. Therefore, cross-entropy can change over time according to how well our initial assumptions about a system match its true underlying distribution and how much new information we gain through measurements and updates to our assumptions.
This updated Second Law describes the increase in cross-entropy as

$$ \Delta H(\rho_m, \rho) + \langle Q \rangle \geq 0, $$

where $H(\rho_m, \rho)$ is the cross-entropy of the updated distribution $\rho_m$ with respect to the original distribution $\rho$, $\Delta$ is the change between the initial and final times of the experiment, and $\langle Q \rangle$ is the generalized heat flow.
When the assumed distribution differs significantly from the correct distribution during time evolution, it can lead to information loss and, therefore, a large increase in cross-entropy. Cross-entropy increases with time even with zero heat transfer. In this interpretation, what happens during optimization in a machine learning model (decreasing cross-entropy) is the opposite of what happens in stochastic evolution (increasing cross-entropy): The act of learning is a revolt against disorder and decay!
The Death of Heat
At the beginning of the post, I blamed the Second Law for our aging and death. When it comes to the Universe, however, things are much simpler. The Universe is evolving towards heat death.
As the Universe continues to expand and matter becomes more dispersed, it will become increasingly difficult for matter to interact with other matter, and energy will become more evenly distributed. Eventually, all stars will have exhausted their fuel, and the Universe will be a cold, dark, lifeless place where nothing happens.
One of my favorite science-fiction short stories is Asimov’s The Last Question. It’s a story about the heat death of the Universe with the perfect punch line. The story begins with two technicians attending to a giant, self-adjusting, and self-correcting computer called Multivac, which has found a way to fulfill the Earth’s energy needs by drawing energy from the Sun. The technicians argue that the Sun and all the stars in the Universe will eventually run out of fuel. They ask Multivac whether entropy can be reversed, to which Multivac replies, “INSUFFICIENT DATA FOR MEANINGFUL ANSWER.” The story follows the history of humanity across many eons, through interstellar travel and immortality. The last question remains and is asked repeatedly.
I won’t give away the punchline, but it does fit our observation that learning acts against entropy. I posed the last question to ChatGPT, our version of the Multivac. Maybe somewhere among the weights and biases of its billions of connections, ChatGPT is still thinking about it.

Footnotes
Unfortunately, Carnot died from cholera at the relatively young age of 36. His book, Reflections on the Motive Power of Fire, self-published in 1824, was largely ignored by the scientific community at the time. ↩︎
Unfortunately, Boltzmann committed suicide while on a beach vacation with his wife and daughter near Trieste, shortly before the experimental verification of his ideas. ↩︎
In small systems with a few parts, such fluctuations happen frequently. Their study is a relatively new topic of research that falls under stochastic thermodynamics. One of the main results in that area is the Jarzynski equality that relates the free energy difference between two equilibrium states to the average work performed on the system during a non-equilibrium process. As the system size increases, however, it becomes increasingly unlikely that such fluctuations reduce entropy and we recover classical thermodynamics. ↩︎
It took me about 5 minutes to generate, modify, and upscale this image using Midjourney. An actual Boltzmann brain would presumably take much longer to form, but some people argue that it’s more likely than the formation of our entire Universe. Personally, I don’t like talking about likelihood in the context of the entire Universe. I’d rather think darüber muss man schweigen (“thereof one must be silent”). ↩︎
As an example of how fundamental information has become in physics, consider that one of the most influential physicists of the 20th century, John Wheeler, divided his physics career into three phases: “Everything is Particles,” “Everything is Fields,” and “Everything is Information.” These stages may sum up the development of physics over the last four centuries. As we are now fully in the informational stage, it will be fascinating to see how machine learning will impact fundamental developments in physics, not only as a tool, but as a conceptual framework for our quest to understand Nature. ↩︎
We could consider setting a maximum here. We now know that, indeed, there is a maximum amount of entropy for a given volume of space. This upper bound for entropy is named after John Wheeler’s student Jacob Bekenstein and has to do with black holes. But let’s leave the quantization of gravity to a later time. ↩︎
There are some subtleties here related to the dimensions and underlying probability distributions. The equivalence of the various formulations of entropy must be demonstrated using certain assumptions. If you notice such subtleties, you probably didn’t need to read this post, but I hope you enjoyed it. ↩︎
There are other generalizations, such as Rényi entropy, that are interesting but today’s focus is on cross-entropy. ↩︎
This interpretation is better understood with the Kullback–Leibler divergence, defined by

$$ D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p), $$

the cross-entropy minus the entropy of the true distribution. This expression vanishes when $q = p$, in accordance with the interpretation that there is nothing left to learn when the true distribution equals our assumed expectation. ↩︎