Day 012: Two Counters
Three exercises today. All built on the getline and copy pattern
from yesterday. All forced me to think about what happens when reality
exceeds the space you gave it.
What I Did
Exercise 1-16 asks you to handle arbitrarily long input lines. The
original K&R longest-line program silently truncates them and lies
about the length. The fix needs two counters. i tracks the actual
length of the line; every character increments it. j tracks the
buffer position; it stops when the buffer is full. The function
returns i, so the caller gets the truth while the buffer holds only
what fits safely.
That separation between “how big is this really” and “how much can I safely store” is the question every input handler has to answer. When those two values diverge, you have a problem. Or an opportunity, depending on which side of the keyboard you sit on.
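The dual-counter idea can be sketched with a string-based stand-in for the real getchar-driven function (copy_line and its signature are my illustration, not the exercise's exact code). i counts every character of the source; j stops writing once the destination is full:

```c
/* Dual-counter copy: dst receives at most lim-1 characters plus
   '\0', but the return value is the true length of the line in
   src, newline included. i counts reality, j counts capacity. */
int copy_line(const char *src, char dst[], int lim)
{
    int c;
    int i = 0;  /* true length: incremented for every character */
    int j = 0;  /* buffer position: stops when dst is full */

    while ((c = src[i]) != '\0' && c != '\n') {
        if (j < lim - 1)
            dst[j++] = c;
        ++i;
    }
    if (c == '\n') {
        if (j < lim - 1)
            dst[j++] = c;
        ++i;
    }
    dst[j] = '\0';
    return i;  /* the caller learns reality; dst holds what fits */
}
```

With a 5-byte buffer, a 12-character line still reports 12 even though only "hell" is stored.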
I also hit a name collision. getline is a POSIX standard library
function now. The compiler rejected it. Renamed to get_line and
moved on. Function name collisions are a real bug class. Imagine
accidentally shadowing a security-critical library function and
nobody notices until production.
Exercise 1-17 prints lines longer than 80 characters. Simple filter.
get_line already returns true length. Define THRESHOLD as 80 and
check against it. But then I tested with exactly 80 characters and
it printed. Off by one. get_line counts the newline, so 80 visible
characters plus \n equals 81, and 81 > 80 is true. Fixed it to
len - 1 > THRESHOLD so the comparison works on visible characters.
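The corrected comparison is small enough to isolate in a predicate (exceeds_threshold is a hypothetical name for illustration):

```c
#define THRESHOLD 80

/* len includes the stored '\n'; subtracting it compares visible
   characters. An exactly-80-character line gives len == 81, and
   81 - 1 > 80 is false, so it is correctly not printed. */
int exceeds_threshold(int len)
{
    return len - 1 > THRESHOLD;
}
```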
Then I wanted to see it for myself.
Loaded the program into GDB. Created a test file with exactly 81 A’s
and no newline. Set a breakpoint on get_line, ran it, used finish
to let the function complete, and checked $eax. It said 0x51. That
is 81 in hex. len - 1 gives 80. 80 > 80 is false. The program
correctly produces no output.
I could have trusted the math. I asked the machine instead.
Exercise 1-18 removes trailing blanks and tabs and deletes blank
lines. Walking the string backward with a for loop felt natural.
Start at the end, check for spaces and tabs, replace with '\0'
until you hit a real character. Efficient. No wasted passes.
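A sketch of the backward walk, assuming trim receives the length that get_line already computed (the exact signature is my guess, not necessarily the code as written):

```c
/* Walk backward from the end, overwriting trailing blanks, tabs,
   and the newline with '\0' until a real character appears.
   Returns the trimmed length: 0 means the line was blank. */
int trim(char s[], int len)
{
    int i;

    for (i = len - 1; i >= 0; --i) {
        if (s[i] != ' ' && s[i] != '\t' && s[i] != '\n')
            break;
        s[i] = '\0';
    }
    return i + 1;
}
```

One pass, no scanning forward first: the known length points straight at the trailing whitespace.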
Then I tested with blank lines in the middle of the input and
everything after them vanished. The bug was subtle. My first version
of get_line did not store the newline. A blank line returned 0.
The while loop in main treated 0 as EOF and quit early. The
program never saw the lines after the blank ones.
Zero meant two things. Empty line and end of file. Same return value, different realities. The program could not tell them apart.
Fixed it by putting the newline back in get_line. Now a blank
line returns 1 and the loop survives. trim strips the newline along
with the other trailing whitespace and returns 0 for a blank line.
The if (len > 0) guard on the trimmed length suppresses it. EOF
still returns 0 from get_line because there is no newline to store.
The two cases are distinguishable again.
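To see the fix end to end, here is a string-backed simulation of the whole loop (get_line_s and count_nonblank are stand-ins I made up for the real stdin-driven code):

```c
#define MAXLINE 100

/* String-backed stand-in for get_line. The '\n' is stored and
   counted, so a blank line returns 1; only exhausted input
   returns 0. Advances *src past what it consumes. */
int get_line_s(const char **src, char s[], int lim)
{
    int c, i = 0, j = 0;

    while ((c = **src) != '\0' && c != '\n') {
        if (j < lim - 1)
            s[j++] = c;
        ++i;
        ++*src;
    }
    if (c == '\n') {
        if (j < lim - 1)
            s[j++] = c;
        ++i;
        ++*src;
    }
    s[j] = '\0';
    return i;
}

/* Overwrite trailing blanks, tabs, and the newline with '\0';
   return the trimmed length (0 for a blank line). */
int trim(char s[], int len)
{
    int i;

    for (i = len - 1; i >= 0; --i) {
        if (s[i] != ' ' && s[i] != '\t' && s[i] != '\n')
            break;
        s[i] = '\0';
    }
    return i + 1;
}

/* The main-loop logic: a blank line (return value 1) keeps the
   loop alive; the guard on the trimmed length drops it. */
int count_nonblank(const char *input)
{
    char line[MAXLINE];
    int len, kept = 0;

    while ((len = get_line_s(&input, line, MAXLINE)) > 0)
        if (trim(line, len) > 0)
            ++kept;
    return kept;
}
```

A blank line in the middle of the input no longer terminates the loop; it is simply skipped.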
The Questions That Came Up
Why does the newline-terminated edge case matter?
Because every protocol, file format, and input stream has to decide how it signals “end of record” versus “end of stream.” When those two signals can be confused, parsers break. I have seen this in log ingestion pipelines, HTTP chunked encoding, and certificate parsing. The details change. The pattern does not.
Why walk backward for trimming?
Walking forward means scanning the entire string to find the end, then backtracking. Walking backward from the known length gets to the trailing whitespace immediately. In C, where you are responsible for every operation, the efficient approach is the correct approach.
The Feynman Test
Imagine you are filling out a form and the “Name” field allows 20 characters. Your name is 25 characters long. A good form tells you “your name is 25 characters but we can only show 20.” A bad form silently chops your name at 20 and pretends that is all you typed. The dual counter is the difference between those two forms. One counter tracks reality. The other tracks capacity. When they disagree, the honest program tells you both numbers.
Hacker Connection
Sentinel value collisions. When a function uses the same return
value to mean two different things, the caller cannot distinguish
between them. In my exercise, 0 meant both “empty line” and “no
more input.” In real systems, this pattern shows up everywhere.
malloc may return NULL both when zero bytes are requested and
when allocation fails. Old APIs return -1 for errors but also for
legitimate values. strstr returns NULL for "not found," and a
caller that skips the check hands that NULL straight to a
dereference.
Every ambiguous sentinel is a potential logic bug. Every logic bug in input handling is a potential vulnerability. The fix is always the same: make sure every distinct condition has a distinct signal.
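The general fix is to give each condition a signal no legitimate value can collide with. A hypothetical sketch: return -1 for end of stream, a value no record length can take, so 0 can only mean an empty record:

```c
/* read_record: illustrative only. Returns the record length
   (0 for an empty record) and -1 for end of stream, a value
   no real length can produce. No ambiguous sentinel. */
int read_record(const char **src, char out[], int lim)
{
    int j = 0;

    if (**src == '\0')
        return -1;               /* end of stream: unique signal */
    while (**src != '\0' && **src != '\n') {
        if (j < lim - 1)
            out[j++] = **src;
        ++*src;
    }
    if (**src == '\n')
        ++*src;                  /* consume terminator, don't count it */
    out[j] = '\0';
    return j;                    /* 0 now means only "empty record" */
}
```

The caller can now write `while ((len = read_record(...)) != -1)` and treat 0 as an ordinary, empty record.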
What Is Next
Section 1.10 on scope and external variables. This is the last section before Chapter 1 ends. Then we close out Phase 1’s K&R track and Chapter 2 begins. The vulnerability pattern notebook gets a new entry today: sentinel value collisions.
Day 12 of 365. When zero means two things, the program believes whichever one it hits first.