Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in - nap
CSDb User Forums


Forums > C64 Coding > The tale of an elusive bug
2016-07-27 07:39
lft

Registered: Jul 2007
Posts: 369
The tale of an elusive bug

This is the story of a strange little bug that I ran into while working on some
VSP-safe code. It is the middle ground between light and shadow, between
science and superstition, and it lies between the pit of man's fears and the
summit of his knowledge. It is an area which we call... the Twilight Zone.

So I'm working on this program that is similar to VSP Lab; it performs VSP and
gathers statistics about the memory corruption. As you may remember, VSP may
randomly corrupt all memory locations with addresses ending in 7 or F (I call
these locations "fragile"). So the program itself must not depend on the value
of any such byte, and I have to jump through hoops to avoid every eighth byte
of memory. All good fun, especially when the code also needs to be
cycle-accurate.

My approach to this is to organise the code in eight-byte chunks, where each
chunk ends with one of two macros that I call "nim" and "nab". The first macro
("nim" for "nop immediate") expands to $80 $00, which is an illegal instruction
that can be used to skip a byte, in this case a fragile byte. It takes two
cycles and doesn't affect CPU state. The second macro ("nab" for "nop
absolute") expands to $0c $00 $00, which is likewise an illegal instruction, in
this case a three-byte, four-cycle nop.

The overall behaviour of my program is this: On every frame, some fragile
memory locations are filled with a predictable pattern (I generate a new
pattern each time, using an LFSR). I have enough rastertime to prepare three
pages in this way (3 * 32 = 96 bytes), and I gradually cycle through all pages.
At the the bottom of the screen, the VSP effect is triggered. Then, the
prepared memory locations are verified against the known pattern. If there's a
difference, a VSP crash must have happened, and this fact is recorded in the
statistics. There are also six pages of video matrix and font definition (ECM),
and since the VSP crash may have corrupted the fragile bytes in those areas,
they need to be restored from safe copies. This is done using speedcode
(lda #value, sta addr, nab) and there's enough time to do this just before the
VIC chip is ready to fetch the data again for the next frame.

Finally, on each frame I update the video matrix to reflect the new statistics.
I also update an on-screen timer that shows for how many minutes and seconds
the program has been running. The timer is driven by the TOD clock in CIA#2.
And this is where the bug comes in.

I suddenly noticed that the on-screen timer had stopped, so I assumed the
program had crashed. This is not unusual when writing VSP-safe code: It only
takes a very small lapse in vigilance, e.g. mistaking a three-byte instruction
for a two-byte instruction because the operand looked liked a zero-page
variable when in fact it wasn't, to throw off the alignment of all code that
follows. A VSP crash may then suddenly and subtly corrupt that code, and as a
result the system will most probably hang. So when the timer stopped, that's
naturally what I assumed had happened. But then, after a few seconds, the timer
suddenly started counting again. What's more, it jerked ahead several seconds,
compensating for the time during which it had remained static. And the program
seemed to run fine afterwards.

My first instinct was that the computer must have been busy with something else
for a while, so that the code for redrawing the timer didn't get to run. This
instinct was dismissed immediately of course; we are used to this sort of
non-responsive behaviour on modern computers, but the C64 doesn't work that
way. In particular, I knew that all code in the program runs in the same raster
interrupt, so if anything were to take too much time, that'd mess up the
display. Next idea: It could have been some kind of hickup in the TFT monitor.
That's modern technology, so it can be non-responsive. It's at least
conceivable that the TFT monitor could, under some weird conditions, freeze the
display for a couple of seconds.

But then the phenomenon occurred again, and yet again. This was all the more
strange because I was sure that this exact version of the program had worked
flawlessly before. It certainly worked flawlessly in vice. The next time the
timer froze, I entered the machine-code monitor and verified that, indeed, the
frozen time was there in the video matrix. So it wasn't the TFT monitor. I also
dumped the IO registers, and found that the correct time was in the CIA
register. So apparently my code did not successfully copy the value from the
CIA register onto the screen. Because, quite simply, the value in BCD in the
CIA register was several seconds ahead of the value represented by digits in
the video matrix. But the code for updating the video matrix was nevertheless
running on every frame.

So I scrutinised the code, but I couldn't find anything wrong with it. The way
I do on-screen timers (in compo-tunes, for instance) is that I read the seconds
register from the TOD clock, then the minutes register, and then I verify that
the seconds register still contains the old value. If not, I branch back and do
it all over again. This ensures that the minutes and seconds match up, that
when it goes from e.g. 0:59 to 1:00 it doesn't briefly display 1:59 or 0:00. I
was doing the exact same thing here (while also navigating around every eighth
byte using my macros), and somehow the code didn't work, although I couldn't
see anything wrong with it.

Then, from the back of my mind, I seemed to remember something. Looking up the
datasheet for the CIA chips, I found it: The TOD clock has a latch function.
When you read the hours register, the other registers freeze so that you can
retrieve them without any risk of getting values that don't match up. As soon
as you read the tenths-of-a-second register, all the registers return to their
normal function. Incidentally, this un-latching would have happened when I
dumped the IO registers in the machine-code monitor, which would explain why I
could see one value in the register and a different value in the video matrix.
The latch feature would also explain how the on-screen timer could stop and
then jerk ahead and resume. Except that I didn't read the hours register. And
the datasheet explicitly said that if you don't read the hours register, the
other registers behave normally, and you don't need to worry about the latching
business at all.

So now I was pretty confident that the TOD latch was the culprit, and so I was
at a crossroads: There was a quick fix, which was to simply read the
tenths-of-a-second register (disabling the latch) before reading out the
minutes and seconds. But that would only hide the real bug, which might have
other consequences. And even if it didn't have any other consequences, there
was still some element of my code that behaved in an unpredictable way, that I
didn't fully understand. I couldn't just leave that in, now, could I? So I
decided that I was not allowed to read the tenths-of-a-second register (except
once, at startup). I would make sure to find and fix the root cause of the
problem instead. And that meant figuring out why in the name of all that is
sacred the TOD latch was sometimes enabled or disabled, spuriously, sometimes
not at all, and only on real hardware.

After thinking a bit, I homed in on the "nab" macro as a primary suspect. This
is a three-byte, four-cycle nop instruction with the operand $0000. What
happens in the fourth cycle? Well, I had assumed that it would read from the
address $0000. Now, due to VSP crashes and my own corruption of the fragile
bytes (by overwriting them with predictable but ever-changing bit patterns),
the MSB of that address would change randomly all the time. Inevitably, the MSB
of a "nab" operand would occasionally be $dd. So in fact those instructions
would read from address $dd00 in CIA#2. But the hours register is $dd0b! So
what if the instruction actually performs an indexed read, and one of the index
registers happens to contain the value $0b? Visual6502 to the rescue! But,
disappointingly, I'd been right. The instruction does an ordinary, absolute,
non-indexed memory access. And the "nab" macro expands to $0c $00 $00, so the
LSB is always zero. So where does the spurious read come from?

I went away from the computer and took a bath, and then I realised it. (The
bath was planned anyway, but the analogy with Archimedes is still amusing.)
When I wrote the code that generates the speedcode for restoring the video
matrix and font definition, I hadn't bothered to fill in the operand bytes. So
the code generator wrote the $0c, but left the operand as whatever random value
that happened to be in that memory. And thus, when I wrote my patterns into
fragile bytes all over memory, I would occasionally set the MSB of the operand
of a "nab" instruction in the speedcode to $dd. And if the LSB of that operand
happened to be $0b (or any value ending in $b, due to CIA mirroring), that
speedcode would then enable the TOD clock latch. And if the LSB happened to end
in $8, the latch would be disabled again, and the on-screen timer would jerk
ahead. After a cold boot, or in vice, the uninitialised memory would always be
$00 or $ff, and the program would appear to work flawlessly. But after having
run some other program, the bug would manifest itself, and keep doing that even
across power-cycles (which, if done quickly enough, do not clear the ram).

My first fix was to modify the speedcode generator to put $00 in the operands,
and also to modify the pattern generator to never write any value in the range
$d0-$df anywhere. This made the bug go away. But of course, there was still the
odd chance that an actual VSP crash might write, say, $de or $df into an MSB
somewhere, and this would cause spurious I/O activity that might cause some
cartridge somewhere to do something strange. The solution to this puzzle was
surprisingly difficult to find, even though it is trivial and even mundane in
hindsight. Instead of having my "nab" macro expand into a three-byte
instruction, I can make it expand into a two-instruction sequence: A regular
nop followed by a nop-immediate ($ea $80 $00). This way, there is no spurious
read cycle, and the program is truly independent of the values of the fragile
bytes.

I really wonder what my next bug will be.
 
... 20 posts hidden. Click here to view all posts....
 
2016-07-29 07:19
bugjam

Registered: Apr 2003
Posts: 2589
[shamelessselfpromotion]

Coder's Delight!

[/shamelessselfpromotion]
2016-07-29 07:22
Oswald

Registered: Apr 2002
Posts: 5094
btw this every 8 th byte fragile is an excellent area for kickass scripting, I guess it could be totally automatized.
2016-07-29 07:31
Radiant

Registered: Sep 2004
Posts: 639
Is "Kickass scripting" the new name for "preprocessing" or what? :-)

This would be pretty easy to automate using any method - as long as you don't care about timing.
2016-07-29 08:00
Perplex

Registered: Feb 2009
Posts: 255
Nice story, thanks for posting.
2016-07-29 20:11
Digger

Registered: Mar 2005
Posts: 437
Ghost in the machine, great read! :) I'd like to see the .prg running in Slajerek's visual debugger.
Previous - 1 | 2 | 3 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
csabanw
Alakran_64
Durandal
Jammer
heavymett
/Panor..
Guests online: 106
Top Demos
1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)
Top onefile Demos
1 No Listen  (9.6)
2 Layers  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Dawnfall V1.1  (9.5)
7 Rainbow Connection  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)
Top Groups
1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)
Top Graphicians
1 Mirage  (9.8)
2 Archmage  (9.7)
3 Pal  (9.6)
4 Carrion  (9.6)
5 Sulevi  (9.6)

Home - Disclaimer
Copyright © No Name 2001-2024
Page generated in: 0.08 sec.