[CSDb] - User Forums

2012-03-02 17:31

Bitbreaker

Registered: Oct 2002
Posts: 508

Drivecode

Hi guys,

I finally wanted to give drivecode a try, but the transfer is the bottleneck (and 2 KB of memory sucks as well). I would only push this further if the transfer of two bytes is seriously faster than 154 cycles, as that is what I need to transform one vertex on the C64, which is the work I thought of offloading to the drive. I'd also love to implement backface culling within the drive, but that seems mostly impossible due to lack of memory (assumed we do more complex stuff than a cube).

So what I do so far on the C64 side is:

-
     lda $d012
     sbc #$31
     bcc +
     clc
     and #$07
     beq -
+
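     lda #%00001011  ;set ATN out (bit 3) to trigger the drive
     sta $dd00
     nop
     eor #%00001000  ;clear ATN out again
     sta $dd00
     lda #$ff
     eor $dd00       ;bits 0/1 arrive in bits 6/7
     lsr
     lsr
     eor $dd00       ;bits 2/3
     lsr
     lsr
     eor $dd00       ;bits 4/5
     lsr
     asr #$fe     ;lets carry be cleared after lsr!
     eor $dd00       ;bits 6/7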

And on the 1541 side:

!align 255,0
bin2ser
     !byte %1111, %0111, %1101, %0101, %1011, %0011, %1001, %0001
     !byte %1110, %0110, %1100, %0100, %1010, %0010, %1000, %0000


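     ldx #$0f
     sbx #$00      ;x = a and $0f (low nibble), a is kept
     lsr
     lsr
     lsr
     lsr           ;a = high nibble
     sta .y1+1     ;keep y free
     lda bin2ser,x ;pattern for bits 0-3
-
     ldx $1800
     bpl -         ;wait until bit 7 (ATN in) is set
     sta $1800     ;send bits 0/1
     asl
     and #$0f
     sta $1800     ;send bits 2/3
.y1  lda bin2ser   ;pattern for bits 4-7 (low byte patched above)
     sta $1800     ;send bits 4/5
     asl
     and #$0f
     sta $1800     ;send bits 6/7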

Any idea how to get this reasonably faster? I'd also be okay if just bits 0-6 are transferred from each byte, but that does not seem to help much, as bits 6 and 7 are the last in the transfer. I also thought of doing a burst of two bytes per sync, but that somehow did not work, as I then get jitter in the second byte :-(
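To make the bit order of the two listings above easier to follow, here is a rough Python model of one byte going through the bin2ser table and the eor/lsr sequence. It is only a sketch under stated assumptions: it ignores timing completely, and it assumes that a 1 written to DATA out (bit 1) or CLK out (bit 3) of $1800 is read back as a 0 in bit 7 or bit 6 of $dd00, with the low bits of $dd00 reading %000011. The function names are made up for this illustration.

```python
# Toy model of the 2-bit transfer above (not cycle-exact, no real hardware).
# Assumption: a 1 in $1800 bit 1 (DATA out) or bit 3 (CLK out) pulls the line
# low, and the C64 then reads 0 in $dd00 bit 7 (DATA in) or bit 6 (CLK in).

BIN2SER = [0b1111, 0b0111, 0b1101, 0b0101, 0b1011, 0b0011, 0b1001, 0b0001,
           0b1110, 0b0110, 0b1100, 0b0100, 0b1010, 0b0010, 0b1000, 0b0000]


def drive_writes(byte):
    """Return the four values the 1541 routine stores to $1800."""
    writes = []
    for nibble in (byte & 0x0F, byte >> 4):       # low nibble goes first
        a = BIN2SER[nibble]
        writes.append(a)                           # sta $1800
        writes.append((a << 1) & 0x0F)             # asl / and #$0f / sta $1800
    return writes


def dd00_read(via_value):
    """What the C64 is assumed to see in $dd00 while the drive holds via_value."""
    data_in = 0 if via_value & 0b0010 else 1       # DATA out -> bit 7, inverted
    clk_in = 0 if via_value & 0b1000 else 1        # CLK out  -> bit 6, inverted
    return (data_in << 7) | (clk_in << 6) | 0b00000011


def c64_receive(reads):
    """Replay the eor/lsr sequence on four successive $dd00 reads."""
    a = 0xFF ^ reads[0]         # lda #$ff / eor $dd00
    a >>= 2                     # lsr / lsr
    a ^= reads[1]               # eor $dd00
    a >>= 2                     # lsr / lsr
    a ^= reads[2]               # eor $dd00
    a >>= 1                     # lsr
    a = (a & 0xFE) >> 1         # asr #$fe (and, then shift right)
    a ^= reads[3]               # eor $dd00
    return a


def transfer(byte):
    """Send one byte through the modelled drive and C64 routines."""
    return c64_receive([dd00_read(v) for v in drive_writes(byte)])


print(hex(transfer(0xA5)))
```

In this model every byte comes out unchanged, and bits 6 and 7 are indeed the last pair on the wire, which is why dropping bit 7 alone does not shorten the transfer.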

Bitbreaker

2012-03-02 17:57

tlr

Registered: Sep 2003
Posts: 1790

Really fast transfers are done by synchronizing more precisely than a plain bit/bne poll loop does (i.e. 0-6 cycles of jitter) and then transferring many bytes in a row.

This synchronization can be accomplished by sending a pattern from the transmitting side and then using several single-cycle adjustment polls on the receiving side.

Check stuff like the AR turbo, Oliver Stiller's loaders and Graham's WarpCopy for examples of this.
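As a toy illustration of that idea (not the routine of any of the loaders named above), the following Python sketch models a 7-cycle poll loop, which leaves 0-6 cycles of uncertainty, followed by six polls that each remove at most one cycle. The rule used here, sampling one cycle before an expected edge and shortening the path by one cycle when the new level is already visible, is an assumption made for the example, as are the function names.

```python
def coarse_sync_error(edge_time, poll_start, loop_cycles=7):
    """Cycles between the edge and the first poll that sees it."""
    t = poll_start
    while t < edge_time:
        t += loop_cycles            # one round of a 4-cycle read + 3-cycle branch
    return t - edge_time            # 0 .. loop_cycles - 1


def fine_sync(error, steps=6):
    """Apply `steps` adjustment polls; each removes one cycle of lateness."""
    for _ in range(steps):
        # The sender toggles the line at a time both sides know. The receiver
        # samples one cycle before it expects that edge. If it already sees
        # the new level it is running late and takes a 1-cycle shorter path.
        sees_new_level = error >= 1
        if sees_new_level:
            error -= 1
    return error


errors = sorted({coarse_sync_error(100, start) for start in range(7)})
print(errors)
print([fine_sync(e) for e in errors])
```

After the six polls the modelled receiver is cycle-exact regardless of where in the poll loop the first edge arrived, which is what makes a long burst without further handshaking possible.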

2012-03-02 18:47

Bitbreaker

Registered: Oct 2002
Posts: 508

But how large is the overhead of the tedious synchronisation compared to the gain of a burst transfer? On the 1541 side the preparation of the byte to be transferred is also somewhat costly, so there is not much to save here except the sync. Or is there a faster way of splitting a byte into 2-bit slices on the 1541 side?
Also, needless to say, I want the screen turned on during the transfer and thus have to cope with badlines, which should not interrupt the transfer.

2012-03-03 10:27

MagerValp

Registered: Dec 2001
Posts: 1078

According to my calculations you should be able to sync and transfer 8 bytes between two badlines, at 44 (drive) cycles per byte.

2012-03-04 02:16

Repose

Registered: Oct 2010
Posts: 225

There's code for this
http://codebase64.org/doku.php?id=base:drivecalc_vectors

And off the top of my head, you can get almost 256 bytes from one precise sync; someone told me they had to do a 1-cycle sync every once in a while to keep it going.
Really, I doubt there's much advantage to drivecalc: the communication overhead is huge, so the ratio of calculation to data has to be huge too, and the 1541 is quite slow due to its small memory, whereas the C64 can use huge tables.

2012-03-04 02:39

Repose

Registered: Oct 2010
Posts: 225

Looks like:
; NTSC: 16 bytes * 45 cycles at 1022727 Hz = 704.0002 µs
; drive: 16 bytes * 44 cycles at 1000000 Hz = 704 µs

2012-03-04 09:23

MagerValp

Registered: Dec 2001
Posts: 1078

That's with the screen closed, or in the border, which doesn't help here. It also doesn't work out so well in PAL, where you have to alternate between 43 and 44 cycles.
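The difference between the two video standards can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the NTSC and drive clocks are the ones quoted above, while the PAL clock of 985248 Hz is an assumed nominal value that is not stated in the thread.

```python
from fractions import Fraction

DRIVE_HZ = 1_000_000
NTSC_HZ = 1_022_727
PAL_HZ = 985_248        # assumed nominal PAL clock, not from the thread


def c64_cycles_per_byte(c64_hz, drive_cycles=44):
    """C64 cycles that pass while the drive spends drive_cycles on one byte."""
    return Fraction(c64_hz * drive_cycles, DRIVE_HZ)


ntsc = c64_cycles_per_byte(NTSC_HZ)
pal = c64_cycles_per_byte(PAL_HZ)

print(float(ntsc))      # very close to 45, so a fixed 45-cycle loop fits
print(float(pal))       # between 43 and 44, so no fixed loop length fits
```

With these numbers a fixed 45-cycle loop on an NTSC machine stays in step with a 44-cycle drive loop almost perfectly, while on PAL the matching loop length falls between 43 and 44 cycles.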

2012-03-04 10:04

Bitbreaker

Registered: Oct 2002
Posts: 508

@Repose:
The article from codebase64 is well known to me, but it uses a full handshake on each byte and even calls the transfer routine with a jsr. So I would not say it is the fastest way to transfer.
I am rather asking whether the preparation of the 2 nybbles on the 1541 side could be made faster, as well as the stuff before the sync on the C64 side. And whether eor/lsr/lsr is the only way to go for transferring 2 bits, or if there are other (faster) ways to do so.
And yes, please refrain from telling me that drivecode makes no sense. That is not part of my question. My math is at least good enough to calculate the gain/loss when using drivecode. I just want to give this a try and make it perform as fast as possible. If it is not of much use in the current case, it might come in handy for some future demo I am coding, or be used for a dedicated drivecode article on codebase64.

2012-03-04 10:10

Bitbreaker

Registered: Oct 2002
Posts: 508

@MagerValp
Hmm, 44 cycles sounds tight, but with the ~63 cycles of the badline available on the 1541 side to prepare the next 8 bytes and set up the next burst, it could work out well. I'll give that a try and would then sync on the first good line only.

2012-03-04 16:37

MagerValp

Registered: Dec 2001
Posts: 1078

You have 63 * 7 + 20 = 461 cycles between each badline. At 43.5 cycles per byte in PAL you have 461 - (43 * 4 + 44 * 4) = 113 cycles to sync. You have 8 µs between each bit pair, and a clock skew of about 1.2 µs, so you should be fine with a sync to within ±2 cycles.
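The budget above can be re-derived in a few lines of Python. This is a sketch of the arithmetic only; the 20 extra cycles are taken from the post as given, and the PAL clock of 0.985248 MHz used for the skew estimate is an assumption that is not stated in the thread.

```python
CYCLES_PER_LINE = 63        # PAL raster line
FREE_LINES = 7              # lines between two badlines
EXTRA_CYCLES = 20           # extra cycles, as counted in the post

window = CYCLES_PER_LINE * FREE_LINES + EXTRA_CYCLES
burst = 43 * 4 + 44 * 4     # 8 bytes, alternating 43 and 44 C64 cycles
sync_budget = window - burst

DRIVE_MHZ = 1.0
PAL_MHZ = 0.985248          # assumed nominal PAL clock

drive_us = 8 * 44 / DRIVE_MHZ       # time the drive needs for 8 bytes
c64_us = burst / PAL_MHZ            # time the C64 needs for the same burst
skew_us = c64_us - drive_us

print(window, burst, sync_budget)
print(round(skew_us, 2))
```

Under these assumptions the skew accumulated over the 8-byte burst comes out at a bit over one microsecond, well inside the roughly 8-cycle spacing between bit pairs.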

2012-03-04 16:42

Dano

Registered: Jul 2004
Posts: 234

afaik Krill did the drivecalls in the lower border. imho drivecalc makes sense where you can parallelize computing, e.g. clearing the screen while the drive does the calc, so you can compensate for the slower code in the drive.

i'd be pretty much interested in some sources to try a little myself, yet i'm not into that stuff more than doing some bits of thinking.

2012-03-05 09:36

Bitbreaker

Registered: Oct 2002
Posts: 508

Thanks for talking about the obvious and explaining the sense of drivecode again *sigh*. Now that we have discussed all the irrelevant stuff, I'd be happy to return to the core questions: Is it possible to save cycles within the code? Can we transfer only 7 bits in less time? At least the possibility of bursts is now clarified after the comments from MagerValp, so thanks for that! It seems I have to do more proper syncing, but can therefore burst for quite a while. But still, is there more to optimize?

And: Don't think code, write code. Quickly writing a code uploader for the floppy and the transfer routines I presented here was a piece of cake with all that information and documentation at hand. Now it is about optimizing, and that's the fun part.

2012-03-05 13:44

Fresh

Registered: Jan 2005
Posts: 101

Can't think of anything better than a 2-byte burst copy (as you may have already tried). I haven't tested the code, so take it just as a suggestion. IIRC, the 1541's CPU is a bit faster than the PAL C64's, so you may need to wait a cycle to prevent jitter. The instructions marked with (*) can be replaced with sta $1800,y, which takes one cycle more (provided you put an ldy #$00 somewhere before the routine).
You said you only need bits 0-6, so you may even skip the last 2 bits by rolling one bit of val2 into val1 beforehand.
My humble 2 cents.

(C64)

	 
     ...
     lda #%00001011
     sta $dd00
     nop
     eor #%00001000
     sta $dd00	 
     lda #$ff
     eor $dd00
     lsr
     lsr
     eor $dd00
     lsr
     lsr
     eor $dd00
     lsr
     asr #$fe     ;lets carry be cleared after lsr!
     eor $dd00
     tay
     lda #$ff
     eor $dd00
     lsr
     lsr
     eor $dd00
     lsr
     lsr
     eor $dd00
     lsr
     asr #$fe     ;lets carry be cleared after lsr!
     eor $dd00	 
     ...          ; 1st byte on Y, 2nd byte in A

(1541)

     !align 255,0
bin2ser
     !byte %1111, %0111, %1101, %0101, %1011, %0011, %1001, %0001
     !byte %1110, %0110, %1100, %0100, %1010, %0010, %1000, %0000

     ...
     lda val1
     ldx #$0f
     sbx #$00
     stx .y0+1
     lsr
     lsr
     lsr
     lsr
     sta .y1+1
     lda val2
     ldx #$0f
     sbx #$00
     stx .y2+1
     lsr
     lsr
     lsr
     lsr
     sta .y3+1
.y0
     lda bin2ser
-
     ldx $1800
     bpl -
     sta $1800
     asl
     and #$0f
     sta $1800
.y1  
     lda bin2ser
     sta $1800
     asl
     and #$0f
     sta $1800
.y2
     lda bin2ser
     sta $1800 ; (*)
     asl
     and #$0f
     sta $1800 ; (*)
.y3
     lda bin2ser
     sta $1800 ; (*)
     asl
     and #$0f
     sta $1800
     ...

2012-03-05 19:16

Bitbreaker

Registered: Oct 2002
Posts: 508

@Freshness79:
I am afraid that this won't work out, as the preparation of the data on the 1541 side then consumes too many cycles (it needs to prepare 2 bytes, while on the C64 side the sync is only done once; the rest of the transaction is 28 cycles per byte on both sides).
The aggregation of bit 6 of both bytes might however be an option; I'll think about that and see if it will be faster.
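The figure of 28 cycles per byte can be tallied from the listings. The Python sketch below simply sums the usual 6502 cycle counts for the instructions between the first and the last access to the port; the handshake and the lda #$ff are left out, and the lists are transcribed by hand from the first post.

```python
# Cycle counts per addressing mode (no page crossings involved here).
CYCLES = {"lda abs": 4, "sta abs": 4, "eor abs": 4,
          "lsr": 2, "asl": 2, "and #": 2, "asr #": 2}

# C64: from the first eor $dd00 to the last one.
C64_RECEIVE = ["eor abs", "lsr", "lsr", "eor abs", "lsr", "lsr",
               "eor abs", "lsr", "asr #", "eor abs"]

# 1541: from the first sta $1800 to the last one.
DRIVE_SEND = ["sta abs", "asl", "and #", "sta abs", "lda abs",
              "sta abs", "asl", "and #", "sta abs"]


def total(sequence):
    """Sum the cycle counts of a list of instructions."""
    return sum(CYCLES[op] for op in sequence)


print(total(C64_RECEIVE), total(DRIVE_SEND))
```

Both sides come out at 28 cycles, so any saving has to come from the preparation before the sync or from sending fewer bit pairs.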

2012-03-06 01:17

Fresh

Registered: Jan 2005
Posts: 101

Ok, I've worked some more on the problem.
I've found a solution with.. ehm... some constraints:
- Only 7 bits supported, the highest bit MUST be 0
- You have to live with scrambled bits (which can however be easily descrambled with a 256-byte table).
Beware, I haven't tested it!

(64)

	 ...
         lda #%00001011
         sta $dd00
         nop
         eor #%00001000
	 sta $dd00
	 nop
         lda $dd00	     	; c=x A=hf000011
	 lsr			; c=1 A=0hf00001
	 ora $dd00		; c=1 A=gef00011 (h must be 0!)
	 lsr			; c=1 A=0gef0001
	 ldx $dd00		; X = db000000
	 lsr			; c=1 A=00gef000 
	 ora $dd00		; c=0 A=cagef000
	 ora table,x		; Translation table only for moving X on lower bits
	 ...

(1541)

	...
	ldx #$0f	; 2 - abcdefgh
	sbx #$00	; 2 X = 0000efgh
	lsr		; 2
	lsr		; 2
	lsr		; 2
	tay		; 2 Y = 000abcde
	txa		; 2	
	asl		; 2
	ldx #%00001010	; 2 A = 000efgh0 => 0000C0D0
loop
	bit $1800
	bpl loop
	sax $1800	; 4 fh
	lsr		; 2
	sax $1800	; 4 eg
	tya		; 2
	sax $1800	; 4 bd
	lsr		; 2
	sax $1800	; 4 ac
	...

2012-03-06 07:20

Bitbreaker

Registered: Oct 2002
Posts: 508

Kudos for the nice brainfuck :-) And nice to see sax in action, but still, it takes more cycles on the C64 than my first proposal :-) The additional

ora table,x
tax
lda descramble,x

adds 10 cycles (though three lsr are saved).

2012-03-06 12:50

Fresh

Registered: Jan 2005
Posts: 101

Yep, ending with a scrambled byte may be considered cheating, I guess. :)
Anyway, I post a corrected version just for completeness:

(1541)

	...
	ldx #$0f		        ; 2 - abcdefgh
	sbx #$00		        ; 2 X = 0000efgh
	lsr				; 2
	lsr				; 2
	lsr				; 2
	tay				; 2 Y = 000abcde
	txa				; 2 A = 0000efgh
	ldx #%00001010	                ; 2 X mask 0000C0D0
loop
	bit $1800
	bpl loop
	sax $1800	        ; 4 save (eg)
	asl			; 2 A=000efgh0
	sax $1800	        ; 4 save (fh)
	tya			; 2 A=000abcde
	sax $1800	        ; 4 save (bd)
	lsr			; 2 A=0000abcd
	sax $1800	        ; 4 save (ac)
	...

(64)

	...
	lda #%00001011
	sta $dd00
	nop
	eor #%00001000
	sta $dd00
	nop
        lda $dd00		;4 A=ge000011
	lsr			;2 A=0ge00001
	ldx $dd00		;4 X=hf000011
	lsr			;2 A=00ge0000
	ora $dd00		;4 A=dbge0011
	lsr			;2 A=0dbge001	
	ora $dd00		;4 A=cdbge011
	ora table,x		;4 A=cdbgehf1 (table move bits 7,6 to 2,^1)
	...

Partially OT comment: this may not be useful in your case, but there are at least two situations in which this solution may become interesting:
- if the transmitted 7 bits are used as an index, then it's just a matter of scrambling the indexed table.
- if the transmitted 7 bits need further calculations anyway, then you can build a table that descrambles and does that calculation in one go.
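Both cases boil down to reordering a table once, offline. Here is a minimal Python sketch, assuming the received byte has the layout c d b g e h f 1 that the comment in the listing above aims for (original byte abcdefgh with a = 0). Whether the routine really delivers that layout on hardware is not verified here, and the squares table is just a hypothetical stand-in for whatever the 7 bits index.

```python
def scramble(value):
    """Map a 7-bit value abcdefgh (a = 0) to the layout c d b g e h f 1."""
    def bit(n):
        return (value >> n) & 1

    b, c, d, e, f, g, h = bit(6), bit(5), bit(4), bit(3), bit(2), bit(1), bit(0)
    return ((c << 7) | (d << 6) | (b << 5) | (g << 4) |
            (e << 3) | (h << 2) | (f << 1) | 1)


def scramble_table(table):
    """Return a 256-entry table that is indexed by the scrambled byte."""
    out = [0] * 256
    for value, entry in enumerate(table):
        out[scramble(value)] = entry
    return out


squares = [v * v for v in range(128)]     # hypothetical 7-bit lookup table
scrambled = scramble_table(squares)

print(scrambled[scramble(12)])
```

The C64 can then index the reordered table directly with the byte as received, so the descrambling costs no cycles at run time, only 256 bytes of table space.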

2012-03-07 08:33

Dano

Registered: Jul 2004
Posts: 234

don't know if this helps a little?

http://www.pagetable.com/?p=568

2012-03-07 10:23

The Human Code Machine

Registered: Sep 2005
Posts: 112

Scrambling the bytes in the floppy before sending the data should do the trick. Everything saving cycles on the c64 side should be highest priority.

2012-03-07 16:24

Fresh

Registered: Jan 2005
Posts: 101

Two different proposals for the 1541 side, both of which include scrambling.
In the former the only gain is an iny 'inside' the transfer. In the latter you gain more cycles, but it's quite expensive in terms of memory.
I can't imagine anything faster on the C64 side: those 2 adjacent bits are a nightmare.
I post a link to avoid flooding the thread.

http://pastebin.com/ZS8kUNKb

2012-03-07 17:31

The Human Code Machine

Registered: Sep 2005
Posts: 112

I wouldn't do the scrambling inside the transfer loop. Just calculate your stuff using the floppy, scramble the data, sync with the c64 and then burst all data to the c64 as fast as possible.

2012-03-07 18:13

Fresh

Registered: Jan 2005
Posts: 101

Quote:

Everything saving cycles on the c64 side should be highest priority.

@THCM
I didn't immediately understand what you meant.
You're definitely right: there's no point in wasting cycles during the transfer on something that can easily be done beforehand.
@Bitbreaker
Maybe the problem is that you need to get little pieces of data at a constant (high?) pace. It mainly depends on the data size and frequency that you're expecting on the C64 side.
If you think you can do a big burst on a per-frame basis, then follow THCM's advice.

2012-03-09 07:12

Bitbreaker

Registered: Oct 2002
Posts: 508

Well, I could scramble the data to be transferred on the 1541, as the 1541 finishes its calculations well in time. I'd do it in software, however; big tables are not an option, as I need every single byte on the 1541 for my calculations anyway. This way I could indeed save a bunch of cycles on a single transfer and also burst without long preparations.

2012-03-09 08:07

The Human Code Machine

Registered: Sep 2005
Posts: 112

Shouldn't it be possible to take over the whole 2 KB of memory when using custom transfer routines? You could use the whole zeropage for self-modifying code etc., and the complete stack.

2012-03-09 09:26

Bitbreaker

Registered: Oct 2002
Posts: 508

$100 squaretables
$100 sine/cosinetable
$100 table for perspective correction
$156 bytes for vertices (no indexes for faces yet)
$e4 bytes for the resulting data (stored in zeropage)

and then there is still some code for doing the calculations and the transfer. So it's getting tight :-)
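Adding up the list above against the 2 KB of drive RAM gives a feeling for how tight it is. This is a plain sum in Python; the labels are shortened from the list, and the result says nothing about which addresses the drive's own routines still need.

```python
DRIVE_RAM = 0x800               # 2 KB of RAM in the 1541

usage = {
    "square tables": 0x100,
    "sine/cosine table": 0x100,
    "perspective table": 0x100,
    "vertices": 0x156,
    "results (zeropage)": 0xE4,
}

used = sum(usage.values())
free = DRIVE_RAM - used

print(hex(used), hex(free))
```

That leaves $2c6 bytes for the calculation code, the transfer routine and the stack.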

2012-03-09 11:35

JackAsser

Registered: Jun 2002
Posts: 2014

Quote: $100 squaretables
$100 sine/cosinetable
$100 table for perspective correction
$156 bytes for vertices (no indexes for faces yet)
$e4 bytes for the resulting data (stored in zeropage)

and then there is still some code for doing the calculations and the transfer. So it's getting tight :-)

$156 bytes for vertices for ONE object? If it's for many objects then simply upload new vertices when you switch object to save memory in the drive (to save max space, simply reset the drive to get the original transfer routines back and upload the whole shit again but with a different object).

2012-03-09 13:31

Bitbreaker

Registered: Oct 2002
Posts: 508

Yes, for ONE object :-) As you can read in my first post: "(assumed we do more complex stuff than a cube)"
If it were all just about a cube, I would not bother at all; I did that already in my last demo and optimized it for the codebase64 example.
However, I just managed to get the transformation even faster (128 cycles per vertex), so now drivecode gains even less compared to doing everything on the C64 :-)

2012-03-09 16:36

Graham
Account closed

Registered: Dec 2002
Posts: 990

Quoting JackAsser

simply reset the drive to get the original transfer routines back

That's a bad idea, since old 1541s do a head bump on reset, which is quite unhealthy and noisy, and will cause a deadlock when the serial bus is accessed during that time.

2012-03-09 18:21

tlr

Registered: Sep 2003
Posts: 1790

Quote: Quoting JackAsser
simply reset the drive to get the original transfer routines back

That's a bad idea, since old 1541s do a head bump on reset, which is quite unhealthy and noisy, and will cause a deadlock when the serial bus is accessed during that time.

Which revision is that? I can't remember that from my long board 1541 but perhaps I've stuck a non-standard ROM in there. Can't remember.

2012-03-09 18:42

chatGPZ

Registered: Dec 2001
Posts: 11386

isn't it just the 1541-II that does that crap on reset?
