Day 025: The Arithmetic of Trust
Day 025: The Arithmetic of Trust
Chapter 3 this morning. I came in thinking control flow would be familiar territory. It is. That is exactly the problem. Familiar is where the traps hide.
What I Did
Started with the binary search from section 3.3. K&R presents it
as a worked example before Exercise 3-1. Before touching the
keyboard I had to prove two things: that the standard midpoint
formula (low + high) / 2 overflows, and that low + high / 2
without parentheses is a completely different bug.
They look similar. They are not.
low + high / 2 is a precedence error. Division binds tighter
than addition, so the compiler reads it as low + (high / 2).
No overflow. Just wrong math. Your midpoint drifts and the search
silently fails on valid data.
(low + high) / 2 is the overflow. Force the addition first and
on a 32-bit signed int, once low + high crosses 2,147,483,647,
the sign bit flips. On Minion1 with low = 1,500,000,000 and
high = 1,600,000,000, the result was -597,483,648. A negative
index. Entry 19 in the field guide, live on the machine.
The fix is low + (high - low) / 2. Same midpoint mathematically.
The intermediate value never escapes the safe range.
After that, Exercise 3-1: rewrite binsearch with one test inside the
loop. The standard version has two: a range check and an equality
check. The one-test version removes the early exit and turns the loop
into a pure narrowing engine. It runs to completion every time,
collapsing low and high until they meet, then checks equality
once at the end. The loop’s job changed. So did the termination
condition.
Section 3.4 was the switch statement. Typed the digit-counting example
from page 59. The thing K&R says that most people skip: case labels
are labels in the strictest sense. The default can go anywhere in the
block. Convention puts it last. The compiler does not care.
Exercise 3-2 was the sharpest problem of the day: write escape and
unescape functions using switch. I built the first version, tested
it, and it passed everything I tried. Then I traced "a\\b" through
it by hand.
The escape function had no case '\\'. A literal backslash in the
input passed through unchanged. Then unescape saw that backslash
followed by b, treated it as an escape sequence, and hit default.
On simple inputs the round-trip looked clean. On a backslash followed
by n, it did not. The escaped form of a real newline and the escaped
form of a literal backslash plus n were identical. unescape had
no way to tell them apart.
The fix was one case: case '\\': s[j++] = '\\'; s[j++] = '\\'; break;
Round-trip confirmed on Minion1.
The Questions That Came Up
Why does low < high work as the termination condition in the one-test version when low <= high would cause an infinite loop?
With high = mid instead of mid - 1, if low == high == mid and
the element is present, the next iteration sets high = mid again.
Nothing moves. low < high breaks as soon as the window collapses to
one element. The final equality check outside the loop handles the
result. The termination condition changed because the loop’s purpose
changed. It is not finding the element anymore. It is only narrowing.
Is the one-test version faster?
Not universally. The early-exit version wins whenever it finds the target before completing all iterations โ on 8 elements, a hit on the second comparison costs less than always running to completion. The branch predictor argument is real, but only pays dividends when the misprediction penalty (15-20 cycles on modern silicon) consistently exceeds the cost of the extra comparison in the predictable path. That requires large datasets in tight loops. Deterministic execution is a hedge against branch misprediction at scale. It is not a universal speed gain. The point of the exercise is the structural shift: a loop that narrows vs. a loop that searches.
The Feynman Test
Imagine a log system that stores user input by first “escaping” it
into a safe representation. Newlines become \n, tabs become \t.
The log reader later “unescapes” them back.
If the escaper never handles the backslash itself, the log now has
an ambiguity. The two-character sequence \n in a log entry could
mean “there was a real newline here” or “there was a backslash
followed by the letter n.” The reader cannot tell. An attacker who
knows this types a literal backslash followed by n and the reader
decodes it as a newline โ slipping content through a filter that was
only watching for the real thing.
The fix is mechanical. If \ becomes \\ in the escaped form, the
reader always knows that a lone \ starts a sequence and \\ is a
literal backslash. No ambiguity. Every possible input survives the
round-trip intact.
This is not a theoretical problem. It is the structural gap behind SQL injection, shell injection, and most filter bypasses. The escape character must escape itself. A sanitizer that does not handle its own delimiter has not closed the boundary. It has just moved it.
Hacker Connection
The delimiter collision in escape is the same structural gap that
underlies SQL injection. A query builder that escapes single quotes
but not backslashes can be bypassed with \', which the escaper
passes through, causing the database to read the quote as data that
closes the string. The escaper believed it was protecting. It was not.
The midpoint overflow has a documented history. The Java standard
library carried the (low + high) >> 1 form of this bug for decades
before it was reported publicly. Same overflow, expressed with a right
shift. Same fix: subtract before you add.
Notebook entries added today: 23 (Midpoint Sum Overflow, CWE-190), 24 (Precedence-Induced Range Shift, CWE-682), 25 (Fall-Through Intuition Gap, CWE-484), 26 (Loop-Increment Bypass, CWE-670), 27 (Delimiter Collision, CWE-74).
What Is Next
Sections 3.5 through 3.7: loops in full, break and continue, goto.
The shellsort example in 3.5 uses a nested loop with a
range-expansion pattern I have not worked through yet. Exercise 3-3
(expand) is the next terminal target.
Overnight question: goto exists in C. K&R will describe it. What is
the one legitimate use case K&R acknowledges, and why does that use
case disappear in languages with structured exception handling? Reason
it through before tomorrow.
Day 25 of 365. Every formula I trusted had a hole in it. The machine never lied. I just hadn’t asked the right questions yet.