1 / 147

The Cost of Forgetting

Why is it hard to forget something? Why is it easier to hold on to the weight than to let go?

2 / 147

Worrying. Suffering. Trauma. Pain. Loss. Guilt. Shame. Regret.

Honestly, if we are so smart, why don’t we just forget about all these things?

3 / 147

We learned in science that physical pain makes us avoid certain things that are harmful to us.

But emotional pain is only as harmful as we imagine it to be. So why do we make our lives harder?

4 / 147

Oh you’re depressed?

Just cheer up?

5 / 147

It just… doesn’t make any sense.

Why can’t we just not suffer from negative emotions?

6 / 147

Why can’t we only remember what we want to remember?

7 / 147

I’ll do you one better.

8 / 147

What is remember?

9 / 147

So what is remembering?

10 / 147

In a strict sense, can we say remembering is the ability to recall a piece of information after having experienced it once?

11 / 147

And now we made it even more complicated… what is experience, and what is recall?

But first, what is information?

12 / 147

Given two coins, each coin has two sides: Heads $(H)$ or Tails $(T)$, so the sample space for each coin is $\{H, T\}$.

13 / 147

The coins were tossed 30 cm up, landed on the table at a volume of 80 dB, and spun for 8.8 seconds before lying flat on the table, 3 cm and 5 cm away from the point of contact.

What is the result of this coin toss?

14 / 147

We don’t know lol.

15 / 147

Sorry to disappoint, but this example is to demonstrate what happens when you don’t know something.

There is an uncertainty.

16 / 147

We have a measurement of how much we don’t know about the system.

This is called entropy. It describes how messy the system is.

17 / 147

I could just give it away and say that the information function is defined as

$I = -\log_2 P(x)$

where $P$ is the probability function that describes how likely an event is to happen.

18 / 147

But then… why log? why base 2? why negative?

Sure, so this function is merely a “model” of what it means to be information.

19 / 147

First, information describes how much the probability collapses once you know that an event has definitely happened.

Certainty means it has a probability of 1.

20 / 147

It used to be uncertain. Now it’s certain. So you must know something. You collapsed the probability!

21 / 147

The result of a coin toss is $T$ out of $\{H, T\}$? Then we collapsed the probability from $\frac{1}{2}$ to $1$.

That is a ratio of 2.

22 / 147

The result of two coin tosses was $TH$ out of $\{HH, HT, TH, TT\}$? Then we collapsed the probability from $\frac{1}{4}$ to $1$.

That is a ratio of 4.

23 / 147

Next, because our computers process information in bits (0s and 1s), we want confirming one of 2 equally likely possibilities, say 0 or 1, to be worth exactly 1 bit of information.

24 / 147

Knowing one more bit of information means we collapsed an event that was half as likely.

This is why we use a logarithm: it turns multiplying the inputs into adding the outputs.

25 / 147

This gives

$I = \log_2 \frac{1}{P(x)}$

which is essentially

$I = -\log_2 P(x)$
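
If you would rather poke at the formula than read it, here is a minimal sketch in Python. The function name `information_bits` is made up for this example; it just evaluates $\log_2 \frac{1}{P(x)}$ and checks that two independent coin tosses add up to twice the information of one.

```python
import math

def information_bits(p: float) -> float:
    """Self-information, in bits, of an event that had probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

one_toss = information_bits(1 / 2)    # 1.0 bit
two_tosses = information_bits(1 / 4)  # 2.0 bits

# Independent events multiply probabilities, so their information adds.
assert math.isclose(two_tosses, one_toss + one_toss)

# Something that was already certain tells you nothing new.
print(information_bits(1.0))  # 0.0
```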

26 / 147

The math is there for people who want to see math so you can forget it now :) poof.

The only thing you need to know is that information means the capability to collapse uncertainty.

27 / 147

Entropy is the measurement of uncertainty. If an event with a probability $P(x)$ happens, you would know $-\log_2 P(x)$ bits of information. Summing that up for all events, weighted by how likely each event is, you have entropy:

$H = \sum_{x} P(x) \log_2 \frac{1}{P(x)}$

28 / 147

For two coin tosses, the information you gain is 2 bits for each of the 4 possibilities, since each has a probability of $\frac{1}{4}$.

So the entropy of this system is 2 bits! To collapse the entropy to 0, you would need to know exactly 2 bits. The math checks out!
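
For the skeptics, the same check as a small Python sketch. The helper `entropy_bits` is a name invented for this example; it is a direct reading of the sum $H = \sum_{x} P(x) \log_2 \frac{1}{P(x)}$, nothing fancier.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

two_coin_outcomes = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

print(entropy_bits(two_coin_outcomes.values()))  # 2.0
print(entropy_bits([1.0]))                        # 0.0, nothing left to learn
```

Once you know the result, the distribution becomes a single outcome with probability 1, and the entropy drops to 0.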

29 / 147

Cool! but what does it mean to know that information? Where do you store it?

30 / 147

You don’t store information.

31 / 147

Do you know what you ate yesterday?

32 / 147

Oh?

33 / 147

Hmm…

34 / 147

How do you know that?

35 / 147

Let’s throw you a more abstract question.

36 / 147

Do you know an AND gate?

37 / 147

An AND gate receives two inputs of 0 or 1 and determines “are both inputs 1?”

38 / 147

0, 0 -> are both inputs 1? No! so it returns 0.

39 / 147

0, 1 -> are both inputs 1? No! so it returns 0.

40 / 147

1, 1 -> are both inputs 1? Yes! so it returns 1.

41 / 147

Ok you kind of understand the idea.

How would you describe the rule?

Recall your first class on Logic…

42 / 147

Yes! with a truth table.

43 / 147

$A$	$B$	$A \land B$
$0$	$0$	$0$
$0$	$1$	$0$
$1$	$0$	$0$
$1$	$1$	$1$

44 / 147

Nice, so you store this cheat sheet and bring it to the exam.

Knowing an AND gate only costs you 12 bits… right?

45 / 147

000010100111…?

What is this gibberish lmao

Ohh, to read a cheat sheet, you need to know how to read it!
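
Here is a small Python sketch of that problem. The row layout (A, then B, then A AND B) and the names `encode` and `decode` are assumptions made up for this example. The point is that the 12 bits only turn back into a truth table if the reader already knows the row width.

```python
# Assumed layout (our own convention): each row is (A, B, A AND B).
TRUTH_TABLE = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]

def encode(table):
    """Flatten the table into a bit string with no separators."""
    return "".join(str(bit) for row in table for bit in row)

def decode(bits, row_width=3):
    """Rebuild the rows; only correct if the reader already knows row_width."""
    return [tuple(int(b) for b in bits[i:i + row_width])
            for i in range(0, len(bits), row_width)]

cheat_sheet = encode(TRUTH_TABLE)
print(cheat_sheet)                         # 000010100111
print(decode(cheat_sheet) == TRUTH_TABLE)  # True

# Same bits, wrong reading rule: you get a different "table" entirely.
print(decode(cheat_sheet, row_width=4) == TRUTH_TABLE)  # False
```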

46 / 147

Here’s how to read this cheat sheet:

Take the first number-

what is first

47 / 147

To understand first, first is-

What is understand? How do I know I’m reading in the right order? Wait, what are these texts? Do I understand English?

48 / 147

This is an example of Lewis Carroll's paradox, or the infinite regress of logical foundations. For every rule-processing device, there must be a foundation that exists beyond that rule.

49 / 147

Therefore, we can hypothesize that you must already have a foundation of how you process your sensory input.

(The extent to which this is true is really hard to prove rigorously. We will present this as “why does it make sense to say this is true”, not “this is the absolute truth”.)

50 / 147

Well, yeah obviously, otherwise what you see, hear, smell, feel, or taste would mean nothing to you.

But the paradox was important to bring up too! Right?

51 / 147

Please just let me say I didn’t yap random facts for nothing…

52 / 147

Predictive Coding and “Surprise”

I’m not gonna cite any paper so we’ll just have one Wikipedia page here: Predictive coding

53 / 147

In neural networks, we have this concept of “backpropagation”.

54 / 147

To keep you on the same page, neural networks are a type of machine learning model, inspired by the structure and function of the human brain.

55 / 147

And… because we don’t fully understand how the brain works, a more accurate description is that they are inspired by our best guess of how the human brain works — then generalized to be versatile and computationally tractable.

56 / 147

A node in neural networks consists of a function that takes an input, and yields an output.

57 / 147

Another node then takes that output as its input, performs another computation, and sends it down the line.

58 / 147

Eventually, it reaches an observable layer or the result of the prediction.
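
To make the "sends it down the line" part concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the sigmoid function, the weights, and the helper names `node` and `forward`. Real networks have many nodes per layer; this chain has one node per layer.

```python
import math

def node(weight, bias):
    """Return a node: a function that maps one input to one output."""
    def activate(x):
        return 1.0 / (1.0 + math.exp(-(weight * x + bias)))  # sigmoid
    return activate

# Hypothetical weights, chosen only for illustration.
layers = [node(2.0, -1.0), node(-1.5, 0.5), node(3.0, 0.0)]

def forward(x, layers):
    """Pass the input through every node in order."""
    for layer in layers:
        x = layer(x)  # each node's output is the next node's input
    return x

prediction = forward(0.8, layers)
print(0.0 < prediction < 1.0)  # True: a sigmoid output always lies in (0, 1)
```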

59 / 147

For example, detecting the number 3 would be a series of nodes that ask questions such as: “Does it have a top curve?”, “Does it have a bottom curve?”, “Is it open on the side?”, and so on.

60 / 147

But instead of hard-coding the questions to answer, we just say:

You are responsible for detecting features that will help your subordinates pick the right answers

61 / 147

So if I see my junior getting confused between “3” and “8”, my job is to measure the openness of the left side.

62 / 147

Your junior might confuse “1” and “7”, so your job is to detect length and angle.

63 / 147

Together we make up the neural network. We don’t assign ourselves restricted sets of tasks; we see the problems at hand and ask what we can do.

64 / 147

You said you were going to explain what backpropagation is

Right, right. Please be patient!

65 / 147

Let’s say we have a unit of a job: “predict what this number is: 9”

The image is broken down into pixels, each layer detects a feature, sends it down the line, and eventually we have…

66 / 147

8

67 / 147

Oh that’s wrong.

68 / 147

How do you know who’s responsible for this????

Someone has to be responsible for this.

69 / 147

First we would have a measurement of how far off our prediction is.

This is called a loss function.

70 / 147

Hehe. Again, let’s assume my audience knows calculus.

Plugging in our goal, we can see the curve of what path we can take to minimize the loss.

71 / 147

It might not be zero. It might not be the absolute minimum. But we’re finding the location along the loss curve where we can’t optimize further.

72 / 147

We choose the next goal. Then, every department asks themselves

How would I change my contribution in order to reach that goal?

73 / 147

We then send this OKR (objectives and key results) back up to higher and higher management. We readjust our system, and then move on to the next job.

This is backpropagation!
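
For the calculus people, a minimal sketch of the "who is responsible" bookkeeping in Python. It assumes the smallest possible company: two departments in a chain, each holding a single weight, with squared error as the loss. The names, starting weights and learning rate are all made up for this example.

```python
# Two "departments" in a chain: hidden = w1 * x, prediction = w2 * hidden.
# The loss is the squared error between prediction and target.
def train(x, target, w1=0.5, w2=0.5, learning_rate=0.1, steps=200):
    for _ in range(steps):
        # Forward pass: do the job.
        hidden = w1 * x
        prediction = w2 * hidden
        error = prediction - target

        # Backward pass: chain rule, from the output back toward the input.
        grad_prediction = 2.0 * error       # d(loss) / d(prediction)
        grad_w2 = grad_prediction * hidden  # blame assigned to w2
        grad_hidden = grad_prediction * w2  # blame passed back up the chain
        grad_w1 = grad_hidden * x           # blame assigned to w1

        # Everyone adjusts their own contribution.
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    return w1, w2

w1, w2 = train(x=1.0, target=2.0)
print(abs(w1 * w2 * 1.0 - 2.0) < 1e-6)  # True
```

Note the strict order inside the loop: a full forward pass, then a full backward pass, then the update.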

74 / 147

In neuroscience, we can’t say this is the same with 100% certainty, because the brain is literally studying itself.

75 / 147

But here’s the thing. Scientists argue that human brains are unlikely to behave that way.

We don’t do backpropagation in the traditional sense.

76 / 147

Backpropagation requires a director to switch between “forward” and “backward” passes. But there is no evidence that our brain pauses to reflect, then predicts, then reflects again.

77 / 147

And, again, your brain doesn’t have an arbitrary loss function to compute error, and it cannot apply the chain rule of derivatives. Neural signals are all-or-nothing spikes, not precise floating-point numbers.

…at least by evidence.

78 / 147

In reality, my boss doesn’t wait until we see the final user interface to say that there is a bug in the server implementation.

79 / 147

We have CI/CD. We have unit tests. All “performance reviews” are done continuously, in a tighter loop.

80 / 147

Did I accidentally expose secrets into the version control history?

Of course, I’m going to screw this up. And my boss doesn’t have to wait and see the final result.

81 / 147

This is called “predictive coding” — there is no hard boundary between predicting and learning. The signals are asynchronous and local. Each layer predicts what the next layer will do, and readjusts continuously.
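
Here is a deliberately tiny Python sketch of that idea, far simpler than the models in the literature. It assumes just two stacked estimates of a single number, and the names `settle`, `lower` and `upper` are invented for this example. What matters is that there is no separate backward pass: each level keeps correcting itself using only the errors it touches directly.

```python
def settle(sensory_input, steps=500, rate=0.1):
    """Two stacked estimates that each correct themselves from local errors."""
    lower, upper = 0.0, 0.0  # each level's current best guess
    for _ in range(steps):
        error_at_input = sensory_input - lower  # the senses vs. the lower guess
        error_at_lower = lower - upper          # the lower guess vs. the upper guess
        # Both levels update in the same step, using only nearby errors.
        lower += rate * (error_at_input - error_at_lower)
        upper += rate * error_at_lower
    return lower, upper

lower, upper = settle(sensory_input=3.0)
print(abs(lower - 3.0) < 1e-6 and abs(upper - 3.0) < 1e-6)  # True
```

Both guesses drift toward the input until there is no surprise left at either level.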

82 / 147

As a matter of fact, in recent years, scientists have tried to remodel deep learning with this concept.

83 / 147

We learn as we predict. We learn as we make errors.

And that’s when we remember not to make mistakes again.

84 / 147

Your memory is your best guess of reality

85 / 147

From previous sections, we have our hypotheses that

Information is the ability to collapse uncertainty.
There is a foundational rule-processing device that makes sense of reality.
Learning is a continuous process of minimizing sensory surprise by adjusting the rules.

86 / 147

So building on these foundations, “memory” can’t really physically exist, the same way information doesn’t physically exist.

87 / 147

Whether or not you know the result of a coin toss doesn’t affect the actual coin. It doesn’t go back to spinning just because you didn’t know the result.

88 / 147

It can only be your own ability to collapse the countless possibilities of a query.

Memory is your learned set of rules!

89 / 147

Of all the countless possibilities, there is a coffee mug on the table.

You predict: “there is still going to be a coffee mug on the table if I reach my hand out for it”

90 / 147

You grab the coffee mug. Your sense of touch confirms your prediction was accurate.

91 / 147

How about a memory test game?

92 / 147

Do you know that pairing game where the images are flipped down and you have to pick two cards to match them?

93 / 147

That is also testing your ability to collapse the uncertainty!

Picking 2 out of 16 cards at random and getting a match has a $\frac{1}{15}$ chance, and you collapse that to $1$.
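
If $\frac{1}{15}$ looks suspicious, here is a brute-force check in Python. It assumes the usual deck for this game: 8 images, each appearing on exactly two cards. The variable names are made up for this example.

```python
import math
from fractions import Fraction
from itertools import combinations

# Assumed deck: 16 cards = 8 images, each appearing twice.
cards = [image for image in range(8) for _ in range(2)]

# Every way to pick 2 different cards, and the picks that match.
picks = list(combinations(range(len(cards)), 2))
matches = [(i, j) for i, j in picks if cards[i] == cards[j]]

p_match = Fraction(len(matches), len(picks))
print(p_match)  # 1/15

# Information gained by a lucky blind match, in bits (about 3.91).
print(math.log2(float(1 / p_match)))
```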

94 / 147

Okay! So before this I told you the result of two coin tosses.

What was it?

95 / 147

Hey, what was it?

96 / 147

So now the probability of $1$ exploded back to $\frac{1}{4}$.

Forgetting recreates surprise!

97 / 147

Humans subconsciously don’t like surprises.

98 / 147

Some people say: “What? Speak for yourself! I do like surprises.”

99 / 147

But that’s just because, linguistically, the word “surprise” carries more positive connotations than negative ones.

Doctors don’t tell you “surprise! you have 3 months to live.” That’d be cruel.

100 / 147

You just like the positive elements of surprise which usually outweigh the unexpectedness of it.

101 / 147

You like the change of pace because it breaks the monotony. But you don’t really choose to like the change itself — you just like what comes with it.

102 / 147

Subversion is still a big enemy of something that makes predictions all the time.

103 / 147

Think about the wildest dream you’ve ever had. Your sensory perception is completely blacked out — you can’t feel the ground while running. Your memory fills in the gaps, trying to make sense of the chaos.

104 / 147

That’s why in the dream you think “of course that should happen — it happened last time”, but you wake up completely confused because it clearly shouldn’t have.

105 / 147

The changes of neural activity in your brain are subconsciously minimizing the error of your prediction. Good or bad, it makes our expectations match reality.

106 / 147

The actual cost of forgetting

107 / 147

We’ve spent a long time discussing what it means to remember: to adjust your rule-processing device and make sense of reality.

108 / 147

In the dream example, getting facts wrong is just a goofy, funny moment. It doesn’t hurt you, right?

109 / 147

Well, it does. A dream also has to keep you asleep and not too excited, so that your body can properly stay in maintenance mode.

110 / 147

And in general,

Forgetting pointlessly is even more dangerous.

111 / 147

Because knowledge doesn’t directly equate to behavior; it simply passes information to the next “layer” to decide.

112 / 147

Suppose there is a mushroom which can be poisonous or not. The probability of eating this species of mushroom and damaging your body is 50%.

113 / 147

50% is simply the share of poisonous mushrooms in this hypothetical universe, because we defined it that way.

Reading this, you will never make the simple mistake of confusing probability with possibility. Just because there are two possibilities doesn’t mean the probability distribution is uniform.

114 / 147

Would you eat it?

115 / 147

Your body would want to actively avoid the damage. It doesn’t see coin-toss odds of dying as a good deal.

116 / 147

However, good news (to some of you): the mushroom is brown!

Suppose you know that only 10% of brown mushrooms are poisonous!
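
How much is "it's brown" worth? A rough Python sketch, treating poisonous-or-not as a biased coin and comparing the leftover uncertainty before and after you learn the color. The helper name `binary_entropy_bits` is made up for this example, and this is only the entropy of the two-outcome question, not a full model of the mushroom universe.

```python
import math

def binary_entropy_bits(p):
    """Entropy, in bits, of a yes/no question whose answer is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty left
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

before = binary_entropy_bits(0.5)  # any mushroom: 1.0 bit of uncertainty
after = binary_entropy_bits(0.1)   # a brown mushroom: about 0.47 bits

print(before)           # 1.0
print(round(after, 3))  # 0.469
```

Knowing the color does not answer the question, but it removes more than half of the uncertainty.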

117 / 147

Would you still eat it?

118 / 147

Hey, what if it gives you jump boost for 10 minutes?

Would you eat it?

119 / 147

As you can see, knowledge doesn’t directly control behavior. If you saw my proverbs book, you would still judge a book by its cover, and say “147 f**king pages, I’m not gonna read it”

120 / 147

Some say, 10% is not worth the risk, regardless of whatever positive effects you’re going to throw at me.

121 / 147

Depending on your experience, you might consider the cost of dying to be $\infty$, so as long as the probability is non-zero, there is no way you’re eating it.
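
One way to picture that reasoning is a probability-weighted payoff. The Python sketch below is an assumption made for illustration, not a claim about how your brain actually scores things: the function name `expected_payoff` and the numbers (a jump boost worth 10, a poisoning costing 50) are invented.

```python
import math

def expected_payoff(p_poison, cost_of_poison, benefit_if_safe):
    """Probability-weighted payoff of eating the mushroom (made-up units)."""
    if p_poison == 0:
        return benefit_if_safe  # avoids 0 * infinity, which is undefined
    return (1 - p_poison) * benefit_if_safe - p_poison * cost_of_poison

# Hypothetical numbers: 10% poisonous, poisoning costs 50, jump boost is worth 10.
print(expected_payoff(0.1, 50, 10) > 0)    # True: 9 - 5 = 4, worth a bite

# Same odds, but dying counts as infinitely bad.
print(expected_payoff(0.1, math.inf, 10))  # -inf: never worth it
```

With an infinite cost, no finite benefit and no non-zero probability can ever bring the payoff back above zero.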

122 / 147

To evaluate whether an event is beneficial to you or not, your brain listens to the sensory and chemical signals released by your body.

The “two” systems work in tandem.

(Saying two is a big oversimplification but eh, anyway)

123 / 147

You can trace your chain of thoughts, but not why you had that first thought in the first place. It was sent from your subconscious knowledge layer.

124 / 147

So… according to this model, we can’t really actively forget things. It would mean carelessly maximizing your prediction error, and lowering your chance of surviving.

125 / 147

(Again, not saying that this is true. It’s just a hypothesis built on top of another hypothesis that is likely true… Half of science is like that anyway, and the other half is empirically proving or disproving it.

In the same way, when we believed our brain should be able to “backpropagate”, we did correctly predict different phenomena, even though there is new evidence against backpropagation in the human brain.

So just have fun hehe.)

126 / 147

Past knowledge gets updated by new events that matter for survival. If rewriting a rule improves your survival odds, your brain is happy to forget the old one, or even hallucinate a new one.

127 / 147

What does it mean when we “want” to forget things?

128 / 147

The system glitch is that a high-stress environment — a toxic relationship, overwhelming workload, or failed social interaction — can make you over-prioritize fixing a behavior.

129 / 147

You replay the memory over and over to prioritize fixing your behavior. Your body then reacts to the replayed memory, so that your conscious behavior can be readjusted.

130 / 147

But again, it’s your conscious behavior. You know you’re replaying this memory just to inflict pain. What is this nonsense! Some part of you is gonna say

I want to forget it…

131 / 147

Forgetting here doesn’t mean erasing your memory.

It doesn’t mean overriding it with new knowledge.

Your brain doesn’t allow invasive rewrites like that.

132 / 147

It’s to just be indifferent to the thoughts, and stop your overcorrection.

133 / 147

Consciously calming your nerves by telling yourself that you are safe.

134 / 147

It’s not easy because that’s what you predicted, and maximizing error is not the goal.

It’s not easy because it saves your life from danger, and maximizing the risk of not surviving is not the goal.

135 / 147

So the life hack (kinda?) is to replay those memories while you’re safe — to tell yourself: “that rule isn’t relevant anymore. It’s causing me pain. Let’s let it go.”

136 / 147

Can we forget faster?

137 / 147

So until now, we have found one other thing:

Forgetting is just relearning that a rule is wrong. Emotional pain synthetically creates a comparison: “not knowing this would be way better.”

138 / 147

Asking “can we forget faster?” is at least as hard as asking “can we learn faster?”

139 / 147

You can’t learn faster because your brain needs enough examples before it generalizes a rule. Rushing it means drawing the wrong conclusions.

140 / 147

Saying “just cheer up” is like someone handing you a physiology textbook and asking why you’re not a doctor yet.

Learning takes time. You can’t skip the repetitions.

141 / 147

Take your time.

142 / 147

Go create that false danger and subvert that expectation by saying you’re safe.

143 / 147

At some point, you will stop reacting to it. That’s when we forget.

144 / 147

And who knows, that knowledge might be rewritten entirely when you don’t need it to predict the danger anymore.

145 / 147

Sometimes, an event is only worth “the result of a coin toss”. It’s just one bit of information that doesn’t make your life easier.

146 / 147

Remembering or forgetting it doesn’t inject the uncertainty back into the coin.

147 / 147

The world revolves around you, but also around something else when you’re not there.