[CSDb] - User Forums - The tale of an elusive bug

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > The tale of an elusive bug

2016-07-27 07:39

lft

Registered: Jul 2007
Posts: 369

The tale of an elusive bug

This is the story of a strange little bug that I ran into while working on some
VSP-safe code. It is the middle ground between light and shadow, between
science and superstition, and it lies between the pit of man's fears and the
summit of his knowledge. It is an area which we call... the Twilight Zone.

So I'm working on this program that is similar to VSP Lab; it performs VSP and
gathers statistics about the memory corruption. As you may remember, VSP may
randomly corrupt all memory locations with addresses ending in 7 or F (I call
these locations "fragile"). So the program itself must not depend on the value
of any such byte, and I have to jump through hoops to avoid every eighth byte
of memory. All good fun, especially when the code also needs to be
cycle-accurate.

My approach to this is to organise the code in eight-byte chunks, where each
chunk ends with one of two macros that I call "nim" and "nab". The first macro
("nim" for "nop immediate") expands to $80 $00, which is an illegal instruction
that can be used to skip a byte, in this case a fragile byte. It takes two
cycles and doesn't affect CPU state. The second macro ("nab" for "nop
absolute") expands to $0c $00 $00, which is likewise an illegal instruction, in
this case a three-byte, four-cycle nop.

The overall behaviour of my program is this: On every frame, some fragile
memory locations are filled with a predictable pattern (I generate a new
pattern each time, using an LFSR). I have enough rastertime to prepare three
pages in this way (3 * 32 = 96 bytes), and I gradually cycle through all pages.
At the the bottom of the screen, the VSP effect is triggered. Then, the
prepared memory locations are verified against the known pattern. If there's a
difference, a VSP crash must have happened, and this fact is recorded in the
statistics. There are also six pages of video matrix and font definition (ECM),
and since the VSP crash may have corrupted the fragile bytes in those areas,
they need to be restored from safe copies. This is done using speedcode
(lda #value, sta addr, nab) and there's enough time to do this just before the
VIC chip is ready to fetch the data again for the next frame.

Finally, on each frame I update the video matrix to reflect the new statistics.
I also update an on-screen timer that shows for how many minutes and seconds
the program has been running. The timer is driven by the TOD clock in CIA#2.
And this is where the bug comes in.

I suddenly noticed that the on-screen timer had stopped, so I assumed the
program had crashed. This is not unusual when writing VSP-safe code: It only
takes a very small lapse in vigilance, e.g. mistaking a three-byte instruction
for a two-byte instruction because the operand looked liked a zero-page
variable when in fact it wasn't, to throw off the alignment of all code that
follows. A VSP crash may then suddenly and subtly corrupt that code, and as a
result the system will most probably hang. So when the timer stopped, that's
naturally what I assumed had happened. But then, after a few seconds, the timer
suddenly started counting again. What's more, it jerked ahead several seconds,
compensating for the time during which it had remained static. And the program
seemed to run fine afterwards.

My first instinct was that the computer must have been busy with something else
for a while, so that the code for redrawing the timer didn't get to run. This
instinct was dismissed immediately of course; we are used to this sort of
non-responsive behaviour on modern computers, but the C64 doesn't work that
way. In particular, I knew that all code in the program runs in the same raster
interrupt, so if anything were to take too much time, that'd mess up the
display. Next idea: It could have been some kind of hickup in the TFT monitor.
That's modern technology, so it can be non-responsive. It's at least
conceivable that the TFT monitor could, under some weird conditions, freeze the
display for a couple of seconds.

But then the phenomenon occurred again, and yet again. This was all the more
strange because I was sure that this exact version of the program had worked
flawlessly before. It certainly worked flawlessly in vice. The next time the
timer froze, I entered the machine-code monitor and verified that, indeed, the
frozen time was there in the video matrix. So it wasn't the TFT monitor. I also
dumped the IO registers, and found that the correct time was in the CIA
register. So apparently my code did not successfully copy the value from the
CIA register onto the screen. Because, quite simply, the value in BCD in the
CIA register was several seconds ahead of the value represented by digits in
the video matrix. But the code for updating the video matrix was nevertheless
running on every frame.

So I scrutinised the code, but I couldn't find anything wrong with it. The way
I do on-screen timers (in compo-tunes, for instance) is that I read the seconds
register from the TOD clock, then the minutes register, and then I verify that
the seconds register still contains the old value. If not, I branch back and do
it all over again. This ensures that the minutes and seconds match up, that
when it goes from e.g. 0:59 to 1:00 it doesn't briefly display 1:59 or 0:00. I
was doing the exact same thing here (while also navigating around every eighth
byte using my macros), and somehow the code didn't work, although I couldn't
see anything wrong with it.

Then, from the back of my mind, I seemed to remember something. Looking up the
datasheet for the CIA chips, I found it: The TOD clock has a latch function.
When you read the hours register, the other registers freeze so that you can
retrieve them without any risk of getting values that don't match up. As soon
as you read the tenths-of-a-second register, all the registers return to their
normal function. Incidentally, this un-latching would have happened when I
dumped the IO registers in the machine-code monitor, which would explain why I
could see one value in the register and a different value in the video matrix.
The latch feature would also explain how the on-screen timer could stop and
then jerk ahead and resume. Except that I didn't read the hours register. And
the datasheet explicitly said that if you don't read the hours register, the
other registers behave normally, and you don't need to worry about the latching
business at all.

So now I was pretty confident that the TOD latch was the culprit, and so I was
at a crossroads: There was a quick fix, which was to simply read the
tenths-of-a-second register (disabling the latch) before reading out the
minutes and seconds. But that would only hide the real bug, which might have
other consequences. And even if it didn't have any other consequences, there
was still some element of my code that behaved in an unpredictable way, that I
didn't fully understand. I couldn't just leave that in, now, could I? So I
decided that I was not allowed to read the tenths-of-a-second register (except
once, at startup). I would make sure to find and fix the root cause of the
problem instead. And that meant figuring out why in the name of all that is
sacred the TOD latch was sometimes enabled or disabled, spuriously, sometimes
not at all, and only on real hardware.

After thinking a bit, I homed in on the "nab" macro as a primary suspect. This
is a three-byte, four-cycle nop instruction with the operand $0000. What
happens in the fourth cycle? Well, I had assumed that it would read from the
address $0000. Now, due to VSP crashes and my own corruption of the fragile
bytes (by overwriting them with predictable but ever-changing bit patterns),
the MSB of that address would change randomly all the time. Inevitably, the MSB
of a "nab" operand would occasionally be $dd. So in fact those instructions
would read from address $dd00 in CIA#2. But the hours register is $dd0b! So
what if the instruction actually performs an indexed read, and one of the index
registers happens to contain the value $0b? Visual6502 to the rescue! But,
disappointingly, I'd been right. The instruction does an ordinary, absolute,
non-indexed memory access. And the "nab" macro expands to $0c $00 $00, so the
LSB is always zero. So where does the spurious read come from?

I went away from the computer and took a bath, and then I realised it. (The
bath was planned anyway, but the analogy with Archimedes is still amusing.)
When I wrote the code that generates the speedcode for restoring the video
matrix and font definition, I hadn't bothered to fill in the operand bytes. So
the code generator wrote the $0c, but left the operand as whatever random value
that happened to be in that memory. And thus, when I wrote my patterns into
fragile bytes all over memory, I would occasionally set the MSB of the operand
of a "nab" instruction in the speedcode to $dd. And if the LSB of that operand
happened to be $0b (or any value ending in $b, due to CIA mirroring), that
speedcode would then enable the TOD clock latch. And if the LSB happened to end
in $8, the latch would be disabled again, and the on-screen timer would jerk
ahead. After a cold boot, or in vice, the uninitialised memory would always be
$00 or $ff, and the program would appear to work flawlessly. But after having
run some other program, the bug would manifest itself, and keep doing that even
across power-cycles (which, if done quickly enough, do not clear the ram).

My first fix was to modify the speedcode generator to put $00 in the operands,
and also to modify the pattern generator to never write any value in the range
$d0-$df anywhere. This made the bug go away. But of course, there was still the
odd chance that an actual VSP crash might write, say, $de or $df into an MSB
somewhere, and this would cause spurious I/O activity that might cause some
cartridge somewhere to do something strange. The solution to this puzzle was
surprisingly difficult to find, even though it is trivial and even mundane in
hindsight. Instead of having my "nab" macro expand into a three-byte
instruction, I can make it expand into a two-instruction sequence: A regular
nop followed by a nop-immediate ($ea $80 $00). This way, there is no spurious
read cycle, and the program is truly independent of the values of the fragile
bytes.

I really wonder what my next bug will be.

... 15 posts hidden. Click here to view all posts....

2016-07-28 19:19

Copyfault

Registered: Dec 2001
Posts: 478

Great read!!!

@lft: just wondering why you started with two different macros "nim" and "nab"; if I get it right, a "nim" would fully suffice to circumvent the potentially trashed bytes (at the fragile memory positions as you called them). Guess you wanted to ensure 4 read cylces in a row in some of the routines. Funny thing is that if you didn't need it, that TOD (un)latch features would've never been brought back to discussion that prominently ;))

Now some go exploit that for some new trick...

2016-07-28 20:44

Slajerek

Registered: May 2015
Posts: 63

Cool read. Bed time story it is. Great finding. Anyway, I also find myself the best in solving coding puzzles during a bath or some other bathroom activities... :)

2016-07-28 21:15

lft

Registered: Jul 2007
Posts: 369

Quoting Copyfault

@lft: just wondering why you started with two different macros "nim" and "nab"; if I get it right, a "nim" would fully suffice to circumvent the potentially trashed bytes (at the fragile memory positions as you called them).

That's because instructions are 1, 2 or 3 bytes. When filling up an 8-byte chunk, you may for instance have:

    lda #1
    sta addr

And the next instruction is perhaps an "lda #2".

Now you've used up five bytes out of eight, but you can't put a two-byte instruction next, because then the fragile byte would follow that. So you need a quick way of padding the remaining three bytes. Hence a macro that expands into three bytes that do nothing (supposedly). And I couldn't find an instruction that would do that in less than four cycles. A branch would do, if its condition is always true.

Very rarely one ends up with four remaining bytes just when a three-byte instruction is due. In these cases, I pad with two "nim" instructions rather than define a four-byte macro.

2016-07-28 21:58

Copyfault

Registered: Dec 2001
Posts: 478

Ahh I see. So it was not due to the precise no.of cycles; instead, you wanted to "crunch" (in a sense) the loose bytes of each chunk to a single command.

Thanks for that explanation... and thanks again for that really "novellesque bug report" ;)

2016-07-29 05:48

Radiant

Registered: Sep 2004
Posts: 639

Oh the wonderful memories of debugging sessions, and the sense of simultaneous elation and forehead-slapping when you realise what's happening. :-) Well written, thank you for the tale!

2016-07-29 07:19

bugjam

Registered: Apr 2003
Posts: 2589

[shamelessselfpromotion]

Coder's Delight!

[/shamelessselfpromotion]

2016-07-29 07:22

Oswald

Registered: Apr 2002
Posts: 5094

btw this every 8 th byte fragile is an excellent area for kickass scripting, I guess it could be totally automatized.

2016-07-29 07:31

Radiant

Registered: Sep 2004
Posts: 639

Is "Kickass scripting" the new name for "preprocessing" or what? :-)

This would be pretty easy to automate using any method - as long as you don't care about timing.

2016-07-29 08:00

Perplex

Registered: Feb 2009
Posts: 255

Nice story, thanks for posting.

2016-07-29 20:11

Digger

Registered: Mar 2005
Posts: 437

Ghost in the machine, great read! :) I'd like to see the .prg running in Slajerek's visual debugger.

Previous - 1 | 2 | 3 - Next

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

HOL2001/Quantum
Steffan/BOOM!
Flashback
diabolus
The Human Co../Maste..
LightSide
Guests online: 108

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)

Top onefile Demos

1 Layers  (9.6)
2 No Listen  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Rainbow Connection  (9.5)
7 Dawnfall V1.1  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)

Top Groups

1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)

Top Webmasters

1 Slaygon  (9.6)
2 Perff  (9.6)
3 Sabbi  (9.5)
4 Morpheus  (9.4)
5 CreaMD  (9.1)

Page generated in: 0.072 sec.