| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Drivecode
Hi guys,
finally i wanted to give drivecode a try, but the transfer is the bottleneck (and 2 KB of memory sucks as well). Actually i would only push this further if the transfer of two bytes is seriously faster than 154 cycles, as that is what i need to transform one vertex, which is the work i thought of offloading to the drive. I'd love to also implement backface culling within the drive, but that seems mostly impossible due to lack of memory (assuming we do more complex stuff than a cube).
So what i do so far on the c64 side is:
-
lda $d012
sbc #$31
bcc +
clc
and #$07
beq -
+
lda #%00001011
sta $dd00
nop
eor #%00001000
sta $dd00
lda #$ff
eor $dd00
lsr
lsr
eor $dd00
lsr
lsr
eor $dd00
lsr
asr #$fe ;lets carry be cleared after lsr!
eor $dd00
And on 1541 side:
!align 255,0
bin2ser
!byte %1111, %0111, %1101, %0101, %1011, %0011, %1001, %0001
!byte %1110, %0110, %1100, %0100, %1010, %0010, %1000, %0000
ldx #$0f
sbx #$00
lsr
lsr
lsr
lsr
sta .y1+1 ;keep y free
lda bin2ser,x
-
ldx $1800
bpl -
sta $1800
asl
and #$0f
sta $1800
.y1 lda bin2ser
sta $1800
asl
and #$0f
sta $1800
Any idea how to get this reasonably faster? I'd also be okay if just bits 0-6 are transferred from each byte, but that does not seem to help much, as bits 6 and 7 are the last in the transfer. I also thought of doing a burst of two bytes per sync, but that somehow did not work, as i then get jitter in the second byte :-(
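Editor's aside, not part of the original post: the bin2ser table above appears to follow a simple rule, namely swap bits 0 and 3 of the nibble and then invert all four bits. The Python sketch below regenerates the table under that assumption and compares it with the posted values. The helper name `bin2ser_entry` is made up for illustration, and the remark about inverted lines is an interpretation, not something the post states.

```python
# Regenerate the 16-entry bin2ser table from the drive code above.
# Assumption: each entry is the input nibble with bits 0 and 3 swapped,
# then inverted (presumably because the serial lines read back inverted).

def bin2ser_entry(nibble: int) -> int:
    """Return the table value for a 4-bit input (hypothetical helper)."""
    b0 = nibble & 1
    b3 = (nibble >> 3) & 1
    middle = nibble & 0b0110           # bits 1 and 2 stay where they are
    swapped = (b0 << 3) | middle | b3  # bit 0 <-> bit 3
    return ~swapped & 0x0F             # invert, keep only 4 bits

# Values copied from the two !byte lines in the post.
POSTED = [
    0b1111, 0b0111, 0b1101, 0b0101, 0b1011, 0b0011, 0b1001, 0b0001,
    0b1110, 0b0110, 0b1100, 0b0100, 0b1010, 0b0010, 0b1000, 0b0000,
]

GENERATED = [bin2ser_entry(n) for n in range(16)]

if __name__ == "__main__":
    for n, value in enumerate(GENERATED):
        print(f"{n:2d} -> %{value:04b}")
```

Generating the table this way could be handy if the bit order on the wire is changed later, since only `bin2ser_entry` would need to be adapted.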
Bitbreaker |
|
| |
tlr
Registered: Sep 2003 Posts: 1790 |
Really fast transfers are done by synchronizing more exactly than just a bit:bne (i.e. 0-6 cycles) and then transferring many bytes in a row.
This synchronization can be accomplished by sending a pattern from the transmitting side and then using several single-cycle adjustment polls on the receiving side.
Check stuff like the AR turbo, Oliver Stiller's loaders and Graham's warpcopy for examples of this.
|
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
But how much is the overhead of tedious synchronisation compared to the gain of a burst transfer? On 1541 side the preparation of the byte to be transferred is also somewhat costly, so not too much to save here except the sync. Or is there also a faster way of splitting up a byte into 2 bit slices on 1541 side?
Also, needless to say, i want the screen turned on during transfer and thus have to cope with badlines, which should not interrupt the transfer. |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
According to my calculations you should be able to sync and transfer 8 bytes between two badlines, at 44 (drive) cycles per byte. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
There's code for this
http://codebase64.org/doku.php?id=base:drivecalc_vectors
And off the top of my head, you can get almost 256 bytes from precise sync, someone told me they had to do 1 cycle sync every once in a while to keep it going.
Really I doubt there's much advantage to drivecalc, there's a huge overhead to the communication, it has to be a huge calculation to data ratio, and the 1541 is quite slow due to low memory, where the c64 can use huge tables.
|
| |
Repose
Registered: Oct 2010 Posts: 225 |
Looks like:
; NTSC: 16 bytes * 45 cycles at 1022727 Hz = 704.0002 µs
; drive: 16 bytes * 44 cycles at 1000000 Hz = 704 µs
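Editor's aside: the two figures above can be re-derived in a few lines. This is only a sketch of the arithmetic, using the clock values quoted in the comment. The function name `burst_time_us` is made up for illustration.

```python
# Duration of a 16-byte burst as seen from each side, in microseconds.
NTSC_C64_HZ = 1_022_727   # NTSC C64 CPU clock, as quoted in the post
DRIVE_HZ = 1_000_000      # 1541 CPU clock, as quoted in the post

def burst_time_us(num_bytes: int, cycles_per_byte: int, clock_hz: int) -> float:
    """Return the burst duration in microseconds (hypothetical helper)."""
    return num_bytes * cycles_per_byte / clock_hz * 1e6

c64_us = burst_time_us(16, 45, NTSC_C64_HZ)
drive_us = burst_time_us(16, 44, DRIVE_HZ)
drift_us = c64_us - drive_us   # how far the two sides drift apart per burst

if __name__ == "__main__":
    print(f"C64:   {c64_us:.4f} us")
    print(f"drive: {drive_us:.4f} us")
    print(f"drift: {drift_us:.4f} us")
```

The drift over the whole burst is far below one cycle, which is why a 45/44 cycle ratio suits NTSC so well.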
|
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
That's with the screen closed, or in the border, which doesn't help here. It also doesn't work out so well in PAL, where you have to alternate between 43 and 44 cycles. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
@Repose:
The article from codebase64 is well known to me, but it uses a full handshake on each byte and even calls the transfer routine with a jsr. So i would not say it is the fastest way to transfer.
I am rather asking whether the preparation of the 2 nybbles on the 1541 side could be made faster, as well as the stuff before the sync on the c64 side. And whether the eor/lsr/lsr is the only way to go for transferring 2 bits, or if there are other (faster) ways to do so.
And yes, please refrain from telling me that drivecode makes no sense. That is not part of my question. My math is at least good enough to calc the gain/loss when using drivecode. I just want to give this a try and make it perform as fast as possible. If it is not of much use in the present case, it might come in handy for some future demo i am coding, or be used for a dedicated drivecode article on codebase64.
|
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
@MagerValp
Hmm, 44 cycles sounds tight, but with the ~63 cycles of the badline available on the 1541 side to prepare the next 8 bytes and set up the next burst, it could work out well. I'll give that a try and would then sync to the first good line only. |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
You have 63 * 7 + 20 = 461 cycles between two badlines. At 43.5 cycles per byte in PAL you have 461 - (43 * 4 + 44 * 4) = 113 cycles to sync. You have 8 µs between each bit pair, and a clock skew of about 1.2 µs, so you should be fine with a sync to within ±2 cycles. |
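Editor's aside: the budget in the post above can be checked with a short sketch. The PAL clock of about 985248 Hz is an assumption added here (the post does not give it), and the 1.2 figure comes out when the skew is expressed in microseconds over an 8-byte burst.

```python
# Cycle budget between two badlines and clock skew over one 8-byte burst.
PAL_C64_HZ = 985_248      # assumed PAL C64 CPU clock (approximate)
DRIVE_HZ = 1_000_000      # 1541 CPU clock

CYCLES_PER_LINE = 63
EXTRA_CYCLES = 20         # the "+ 20" term used in the post
window = CYCLES_PER_LINE * 7 + EXTRA_CYCLES   # cycles between two badlines

# 8 bytes, alternating 43 and 44 C64 cycles per byte (43.5 on average).
transfer = 43 * 4 + 44 * 4
sync_budget = window - transfer

# Skew: 43.5 C64 cycles versus 44 drive cycles per byte, over 8 bytes.
c64_byte_us = 43.5 / PAL_C64_HZ * 1e6
drive_byte_us = 44 / DRIVE_HZ * 1e6
skew_us = 8 * (c64_byte_us - drive_byte_us)

if __name__ == "__main__":
    print(f"window      = {window} cycles")
    print(f"transfer    = {transfer} cycles")
    print(f"sync budget = {sync_budget} cycles")
    print(f"skew        = {skew_us:.2f} us over 8 bytes")
```

With bit pairs 8 cycles apart, a skew of roughly one cycle over the burst leaves some margin, which matches the ±2 cycle tolerance mentioned in the post.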
| |
Dano
Registered: Jul 2004 Posts: 234 |
afaik krill did the drivecalls in the lower border. imho drivecalc makes sense where you can parallelize computing, like clearing the screen while the drive does the calc. so you can compensate slower code within the drive.
i'd be pretty much interested in some sources to try a little myself, yet i'm not into that stuff more than doing some bits of thinking. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Thanks for talking about the obvious and explaining the sense of drivecode again *sigh*. Now that we have discussed all the irrelevant stuff, i'd be happy to return to the core questions: Is it possible to save cycles within the code? Can we transfer only 7 bits in less time? At least the possibility of bursts is now clarified after the comments from MagerValp, so thanks for that! Seems i have to do more proper syncing, but can therefore burst for quite a while. But still, is there more to optimize?
And: Don't think code, write code. Quickly writing some code-uploader for the floppy and the transfer routines i presented here, was a piece of cake with all that information and documentation at hand. Now it is about optimizing, that's the fun part. |
| |
Fresh
Registered: Jan 2005 Posts: 101 |
Can't think of anything better than a 2-byte burst copy (as you may have already tried). I haven't tested the code, take it just as a suggestion. IIRC, the 1541's cpu is a bit faster than a pal c64, so you may need to wait a cycle to prevent jittering. The instructions commented with (*) can be switched with sta $1800,y (provided you put an ldy #$00 somewhere before the routine).
You said you only need bits 0-6, so you may even skip the last 2 bits by previously rolling one bit of val2 into val1.
My humble 2 cents.
(C64)
...
lda #%00001011
sta $dd00
nop
eor #%00001000
sta $dd00
lda #$ff
eor $dd00
lsr
lsr
eor $dd00
lsr
lsr
eor $dd00
lsr
asr #$fe ;lets carry be cleared after lsr!
eor $dd00
tay
lda #$ff
eor $dd00
lsr
lsr
eor $dd00
lsr
lsr
eor $dd00
lsr
asr #$fe ;lets carry be cleared after lsr!
eor $dd00
... ; 1st byte on Y, 2nd byte in A
(1541)
!align 255,0
bin2ser
!byte %1111, %0111, %1101, %0101, %1011, %0011, %1001, %0001
!byte %1110, %0110, %1100, %0100, %1010, %0010, %1000, %0000
...
lda val1
ldx #$0f
sbx #$00
stx .y0+1
lsr
lsr
lsr
lsr
sta .y1+1
lda val2
ldx #$0f
sbx #$00
stx .y2+1
lsr
lsr
lsr
lsr
sta .y3+1
.y0
lda bin2ser
-
ldx $1800
bpl -
sta $1800
asl
and #$0f
sta $1800
.y1
lda bin2ser
sta $1800
asl
and #$0f
sta $1800
.y2
lda bin2ser
sta $1800 ; (*)
asl
and #$0f
sta $1800 ; (*)
.y3
lda bin2ser
sta $1800 ; (*)
asl
and #$0f
sta $1800
...
|
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
@Freshness79:
I am afraid that this won't work out, as preparation of the data on the 1541 side then consumes too many cycles (need to prepare 2 bytes, while on the c64 side sync is only done once, rest of the transaction is 28 cycles per byte on both sides)
The aggregation of bit 6 of both bytes might however be an option, i'll think about that and see if it will be faster. |
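Editor's aside: the "28 cycles per byte on both sides" figure can be reproduced by adding up standard 6502 cycle counts for the two listings from the first post. The sketch below is only a bookkeeping model. It assumes the port access happens on the last cycle of each absolute-addressed instruction, and it ignores that C64 and drive cycles are not exactly the same length.

```python
# (mnemonic, cycles, touches_port) for the per-byte part of each routine.
C64_RX = [
    ("lda #$ff", 2, False),
    ("eor $dd00", 4, True),
    ("lsr", 2, False),
    ("lsr", 2, False),
    ("eor $dd00", 4, True),
    ("lsr", 2, False),
    ("lsr", 2, False),
    ("eor $dd00", 4, True),
    ("lsr", 2, False),
    ("asr #$fe", 2, False),   # illegal opcode, immediate, 2 cycles
    ("eor $dd00", 4, True),
]

DRIVE_TX = [
    ("sta $1800", 4, True),
    ("asl", 2, False),
    ("and #$0f", 2, False),
    ("sta $1800", 4, True),
    ("lda bin2ser", 4, False),
    ("sta $1800", 4, True),
    ("asl", 2, False),
    ("and #$0f", 2, False),
    ("sta $1800", 4, True),
]

def port_access_times(code):
    """Cycle count (from the start of the listing) at which each port access ends."""
    elapsed = 0
    times = []
    for _mnemonic, cycles, touches_port in code:
        elapsed += cycles
        if touches_port:
            times.append(elapsed)
    return times

def gaps(times):
    """Distances between consecutive port accesses."""
    return [later - earlier for earlier, later in zip(times, times[1:])]

c64_gaps = gaps(port_access_times(C64_RX))
drive_gaps = gaps(port_access_times(DRIVE_TX))
drive_total = sum(cycles for _m, cycles, _p in DRIVE_TX)
c64_total_without_lda = sum(cycles for _m, cycles, _p in C64_RX[1:])

if __name__ == "__main__":
    print("C64 read spacing:   ", c64_gaps)
    print("drive write spacing:", drive_gaps)
    print("drive cycles per byte:", drive_total)
```

Both sides touch the port every 8 cycles, so any trick that saves cycles on one side only helps if the other side can be shortened by the same amount.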
| |
Fresh
Registered: Jan 2005 Posts: 101 |
Ok, I've worked some more on the problem.
I've found a solution with.. ehm... some constraints:
- Only 7 bits supported, highest bit MUST be 0
- You have to live with scrambled bits (which however can be easily descrambled with a 256-byte table).
Beware, I haven't tested it!
(64)
...
lda #%00001011
sta $dd00
nop
eor #%00001000
sta $dd00
nop
lda $dd00 ; c=x A=hf000011
lsr ; c=1 A=0hf00001
ora $dd00 ; c=1 A=gef00011 (h must be 0!)
lsr ; c=1 A=0gef0001
ldx $dd00 ; X = db000000
lsr ; c=1 A=00gef000
ora $dd00 ; c=0 A=cagef000
ora table,x ; Translation table only for moving X on lower bits
...
(1541)
...
ldx #$0f ; 2 - abcdefgh
sbx #$00 ; 2 X = 0000efgh
lsr ; 2
lsr ; 2
lsr ; 2
tay ; 2 Y = 000abcde
txa ; 2
asl ; 2
ldx #%00001010 ; 2 A = 000efgh0 => 0000C0D0
loop
bit $1800
bpl loop
sax $1800 ; 4 fh
lsr ; 2
sax $1800 ; 4 eg
tya ; 2
sax $1800 ; 4 bd
lsr ; 2
sax $1800 ; 4 ac
...
|
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Kudos for the nice brainfuck :-) And nice to see sax in action, but still, takes more cycles on c64 than my first proposal :-) The additional
ora table,x
tax
lda descramble,x
just adds 10 additional cycles (though 3 times lsr is saved). |
| |
Fresh
Registered: Jan 2005 Posts: 101 |
Yep, ending with a scrambled byte may be considered cheating, I guess. :)
Anyway, I post a corrected version just for completeness:
(1541)
...
ldx #$0f ; 2 - abcdefgh
sbx #$00 ; 2 X = 0000efgh
lsr ; 2
lsr ; 2
lsr ; 2
tay ; 2 Y = 000abcde
txa ; 2 A = 0000efgh
ldx #%00001010 ; 2 X mask 0000C0D0
loop
bit $1800
bpl loop
sax $1800 ; 4 save (eg)
asl ; 2 A=000efgh0
sax $1800 ; 4 save (fh)
tya ; 2 A=000abcde
sax $1800 ; 4 save (bd)
lsr ; 2 A=0000abcd
sax $1800 ; 4 save (ac)
...
(64)
...
lda #%00001011
sta $dd00
nop
eor #%00001000
sta $dd00
nop
lda $dd00 ;4 A=ge000011
lsr ;2 A=0ge00001
ldx $dd00 ;4 X=hf000011
lsr ;2 A=00ge0000
ora $dd00 ;4 A=dbge0011
lsr ;2 A=0dbge001
ora $dd00 ;4 A=cdbge011
ora table,x ;4 A=cdbgehf1 (table moves bits 7,6 to 2,1)
...
Partially OT comment: this may not be useful in your case, but there are at least two situations in which this solution may become interesting:
- if the transmitted 7 bits will be used as an index, then it's just a matter of scrambling the indexed table.
- if the transmitted 7 bits need further calculations anyway, then you can build a table to descramble and do that calc in one go.
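Editor's aside: the 1541 side of the corrected routine above can be modelled in Python to check which data bits end up on which output line for each of the four `sax $1800` stores. The sketch assumes bit 3 of $1800 is CLK OUT and bit 1 is DATA OUT, names the bits of the value abcdefgh from bit 7 down to bit 0 as in the comments, and does not model the C64 side or the line inversion. The function names are made up for illustration.

```python
CLK_OUT = 0b1000             # bit 3 of $1800 (assumed pin assignment)
DATA_OUT = 0b0010            # bit 1 of $1800 (assumed pin assignment)
MASK = CLK_OUT | DATA_OUT    # ldx #%00001010

def drive_writes(value: int) -> list:
    """Values the four `sax $1800` stores would write for a 7-bit value."""
    if not 0 <= value <= 0x7F:
        raise ValueError("bit 7 must be 0 for this routine")
    low = value & 0x0F       # ldx #$0f / sbx #$00 -> X = 0000efgh
    high = value >> 3        # lsr, lsr, lsr       -> A = 000abcde, kept in Y
    a = low                  # txa
    writes = [a & MASK]      # sax stores A AND X: bits e (CLK) and g (DATA)
    a = (a << 1) & 0xFF      # asl -> 000efgh0
    writes.append(a & MASK)  # bits f and h
    a = high                 # tya -> 000abcde
    writes.append(a & MASK)  # bits b and d
    a >>= 1                  # lsr -> 0000abcd
    writes.append(a & MASK)  # bits a and c
    return writes

def bit_pairs(value: int) -> list:
    """(CLK OUT bit, DATA OUT bit) for each of the four stores."""
    return [((w >> 3) & 1, (w >> 1) & 1) for w in drive_writes(value)]

if __name__ == "__main__":
    print(bit_pairs(0b0101_0110))
```

A model like this also gives a cheap way to build the descramble table on a PC: run every 7-bit value through it, apply whatever the receive routine does to the pairs, and record the result.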
|
| |
Dano
Registered: Jul 2004 Posts: 234 |
don't know if this helps a little?
http://www.pagetable.com/?p=568 |
| |
The Human Code Machine
Registered: Sep 2005 Posts: 112 |
Scrambling the bytes in the floppy before sending the data should do the trick. Everything saving cycles on the c64 side should be highest priority. |
| |
Fresh
Registered: Jan 2005 Posts: 101 |
Two different proposals for 1541 side, both of them include scrambling.
In the former the only gain is an iny 'inside' the transfer. In the latter you gain more cycles but it's quite expensive in terms of memory.
Can't imagine anything faster on c64 side: those 2 adjacent bits are a nightmare.
I post a link to avoid flooding the thread.
http://pastebin.com/ZS8kUNKb |
| |
The Human Code Machine
Registered: Sep 2005 Posts: 112 |
I wouldn't do the scrambling inside the transfer loop. Just calculate your stuff using the floppy, scramble the data, sync with the c64 and then burst all data to the c64 as fast as possible. |
| |
Fresh
Registered: Jan 2005 Posts: 101 |
Quote:
Everything saving cycles on the c64 side should be highest priority.
@THCM
I didn't quite understand immediately what you meant.
You're definitely right: there's no point in wasting cycles during the transfer on something that can easily be done beforehand.
@Bitbreaker
Maybe the problem is the fact that you need to get little pieces of data at a constant (high?) pace. It mainly depends on the data size and frequency that you're expecting on c64 side.
If you think you can do a big burst on a per-frame basis, then follow THCM's advice. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Well, i could scramble the data to be transferred on the 1541, as the 1541 is done with its calculations well ahead of time. I'd do things in software however; using big tables is not an option, as i need every single byte on the 1541 for my calculations anyway. This way i could indeed save a bunch of cycles on a single transfer and also burst without long preparations.
|
| |
The Human Code Machine
Registered: Sep 2005 Posts: 112 |
Shouldn't it be possible to take over the whole 2kb of memory when using custom transfer routines? You could use the whole zeropage for selfmodifying code etc. and the complete stack. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
$100 squaretables
$100 sine/cosinetable
$100 table for perspective correction
$156 bytes for vertices (no indexes for faces yet)
$e4 bytes for the resulting data (stored in zeropage)
and then there is still some code for doing the calculations and transfer. So it's getting tight :-) |
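Editor's aside: adding up the list above against the 2 KB of drive RAM mentioned in the first post gives a rough idea of what is left for code. This is a sketch of the arithmetic only. It does not account for the stack or for anything else that has to stay resident in the drive.

```python
DRIVE_RAM = 0x800   # 2 KB of 1541 RAM, as stated in the first post

# Sizes copied from the list in the post.
usage = {
    "square tables": 0x100,
    "sine/cosine table": 0x100,
    "perspective correction table": 0x100,
    "vertices": 0x156,
    "result data (zeropage)": 0xE4,
}

used = sum(usage.values())
free = DRIVE_RAM - used   # upper bound on what remains for code and stack

if __name__ == "__main__":
    for name, size in usage.items():
        print(f"{name:30s} ${size:03x}")
    print(f"used ${used:03x} ({used} bytes), free ${free:03x} ({free} bytes)")
```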
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: $100 squaretables
$100 sine/cosinetable
$100 table for perspective correction
$156 bytes for vertices (no indexes for faces yet)
$e4 bytes for the resulting data (stored in zeropage)
and then there is still some code for doing the calculations and transfer. So it's getting tight :-)
$156 bytes for vertices for ONE object? If it's for many objects then simply upload new vertices when you switch object to save memory in the drive (to save max space, simply reset the drive to get the original transfer routines back and upload the whole shit again but with a different object).
|
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Yes, for ONE object :-) As you can read in my first post: "(assumed we do more complex stuff than a cube)"
If it is all just about a cube, i would not bother at all, done that already in my last demo, and optimized that for the codebase64 example.
However i just managed to get the transformation even faster (128 cycles per vertex), so now drivecode gains even less compared to doing all the stuff on the c64 :-) |
| |
Graham Account closed
Registered: Dec 2002 Posts: 990 |
Quoting JackAsser: simply reset the drive to get the original transfer routines back
That's a bad idea, since old 1541's will do a head bump on reset which is quite unhealthy + noisy and will cause a deadlock when the serial bus is accessed during that time. |
| |
tlr
Registered: Sep 2003 Posts: 1790 |
Quote: Quoting JackAsser: simply reset the drive to get the original transfer routines back
That's a bad idea, since old 1541's will do a head bump on reset which is quite unhealthy + noisy and will cause a deadlock when the serial bus is accessed during that time.
Which revision is that? I can't remember that from my long board 1541 but perhaps I've stuck a non-standard ROM in there. Can't remember. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
isn't it just the 1541-II that does that crap on reset? |