| | Mr SQL
Registered: Feb 2023 Posts: 158 |
Fastest time printing binary in BASIC or Assembly
Interesting Video on 8-bit show and tell:
https://www.youtube.com/watch?v=P8t6otqoz_E
What's the fastest time you can print binary to the screen in BASIC or Assembly?
I got it down to 26 seconds in BASIC without a pre-calc routine. |
|
| | chatGPZ
Registered: Dec 2001 Posts: 11523 |
That's probably correct - so just a faster scroll routine would speed this up significantly :) |
| | ws
Registered: Apr 2012 Posts: 251 |
nice challenge, tho |
| | JackAsser
Registered: Jun 2002 Posts: 2038 |
Quote: This probably does not count, but something like this seems obvious if speed is really the priority. This one assumes that you have a custom charset that has a "0" char at position 0 and a "1" char at position 1 in the charset, and it will always print the result at a specific fixed location on the screen.
;a register contains the byte to display as binary
ldx #1
sax $0407
lsr a
sax $0406
lsr a
sax $0405
lsr a
sax $0404
lsr a
sax $0403
lsr a
sax $0402
lsr a
sax $0401
lsr a
sax $0400
= 8*2 + 8*4 = 48 cycles (LDX and seven LSR at 2 cycles each, eight SAX at 4 cycles each), so you could do it thousands of times in a single second.
SAX is one of the illegal opcodes, in case someone happens to be unfamiliar with that:
http://unusedino.de/ec64/technical/aay/c64/bsax.htm
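As a rough check of the routine quoted above, here is a small Python model (not C64 code) of what the eight SAX stores write. It assumes SAX stores A AND X without changing either register, and that screen codes 0 and 1 show the custom "0" and "1" glyphs; the name `sax_print` is made up for this sketch.

```python
def sax_print(a):
    """Model of the unrolled routine: returns the 8 cells $0400..$0407."""
    screen = [0] * 8
    x = 1                          # ldx #1
    for cell in range(7, -1, -1):  # stores go to $0407 first, $0400 last
        screen[cell] = a & x       # sax: memory = A AND X, i.e. bit 0 of A
        a >>= 1                    # lsr a (the real code skips it after the last store)
    return screen


# Cycle total: LDX #imm and LSR A cost 2 cycles each, SAX absolute costs 4.
cycles = 2 + 7 * 2 + 8 * 4

print(sax_print(0b10110001), cycles)
```

For the input %10110001 the cells come out as 1,0,1,1,0,0,0,1 from left to right, so the most significant bit lands at $0400, and the total is the 48 cycles stated in the post.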
I find this a much more interesting challenge! Can this be improved?!
Rules: it should be printed on a normal text screen, any charset is ok. One digit per char. |
| | TWW
Registered: Jul 2009 Posts: 557 |
Quote: I find this a much more interesting challenge! Can this be improved?!
Rules: it should be printed on a normal text screen, any charset is ok. One digit per char.
14 bytes, 16 cycles excl. call/ret:
sta $2000
lda #$3b
sta $d011
lda #$18
sta $d018
rts
How about bitmap with a 1x1 pixel charset :D
Ahyeah normal text screen...
But seriously, are we still talking about peeking the CIA data port (in which case 5 bits would be enough) or a general hexToBin()?
EDIT:
2 variants without resorting to charset trickery;
// Variant #1:
ldx #$18 // 2
stx $0400 // 4
stx $0401 // 4
stx $0402 // 4
stx $0403 // 4
stx $0404 // 4
stx $0405 // 4
stx $0406 // 4
stx $0407 // 4
lsr // 2
rol $0400 // 6 -> $18 becomes $30 or $31 depending on carry
lsr // 2
rol $0401 // 6
lsr // 2
rol $0402 // 6
lsr // 2
rol $0403 // 6
lsr // 2
rol $0404 // 6
lsr // 2
rol $0405 // 6
lsr // 2
rol $0406 // 6
lsr // 2
rol $0407 // 6
// 8 x 4 + 8 x 6 + 9 x 2 = 98 cycles / 58 bytes
// Probably has some illegal OPC voodoo potential.
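The idea behind Variant #1 can be modelled in a few lines of Python. This is only a sketch: it assumes the shift on the screen cell is meant to rotate the carry in (ROL semantics), since that is what turns the seed value $18 into screen code $30 or $31. The names `rol_cell` and `variant1` are hypothetical.

```python
def rol_cell(cell, carry):
    """ROL on one byte: shift left, the old carry enters bit 0."""
    return ((cell << 1) | carry) & 0xFF


def variant1(a):
    """Returns the 8 screen cells $0400..$0407 after the routine."""
    screen = [0x18] * 8           # ldx #$18 followed by eight stx stores
    for i in range(8):
        carry = a & 1             # lsr: bit 0 of A drops into carry
        a >>= 1
        screen[i] = rol_cell(screen[i], carry)
    return screen


print([hex(c) for c in variant1(0b00000101)])
```

Note that, as written in the post, bit 0 ends up in $0400, so the digits appear least significant bit first; reversing the store addresses would give the usual order.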
// Variant #2
ldx #'0' // 2
ldy #'1' // 2 -> 4 cycles 'overhead'
lsr // 2
bcc !Next0+ // 2 / 3
sty $0400 // 4 -> 8/9 cycles dep. the branch
lsr
bcc !Next1+
!Prev0:
sty $0401
lsr
bcc !Next2+
!Prev1:
sty $0402
lsr
bcc !Next3+
!Prev2:
sty $0403
lsr
bcc !Next4+
!Prev3:
sty $0404
lsr
bcc !Next5+
!Prev4:
sty $0405
lsr
bcc !Next6+
!Prev5:
sty $0406
lsr
bcc !Next7+
!Prev6:
sty $0407
rts
!Next0:
stx $0400
lsr
bcs !Prev0-
!Next1:
stx $0401
lsr
bcs !Prev1-
!Next2:
stx $0402
lsr
bcs !Prev2-
!Next3:
stx $0403
lsr
bcs !Prev3-
!Next4:
stx $0404
lsr
bcs !Prev4-
!Next5:
stx $0405
lsr
bcs !Prev5-
!Next6:
stx $0406
lsr
bcs !Prev6-
!Next7:
stx $0407
rts // 8 x 8 (+8) + 4 = 68 to 76 cycles excl. rts / ~100 bytes
Edit 2: 2 variants with voodoo and shameless charset trickery:
// Variant #3 with voodoo
ldx #$18 // 2
stx $0400 // 4
stx $0401 // 4
stx $0402 // 4
stx $0403 // 4
stx $0404 // 4
stx $0405 // 4
stx $0406 // 4
stx $0407 // 4
slo $0400 // 4
slo $0401 // 4
slo $0402 // 4
slo $0403 // 4
slo $0404 // 4
slo $0405 // 4
slo $0406 // 4
slo $0407 // 4
// 16 x 4 + 2 = 66 cycles / 50 bytes
// Variant #4 with voodoo & charset (space = "0", ! = "1")
slo $0400 // 4
slo $0401 // 4
slo $0402 // 4
slo $0403 // 4
slo $0404 // 4
slo $0405 // 4
slo $0406 // 4
slo $0407 // 4
// 8 x 4 = 32 cycles / 24 bytes
|
| | Frantic
Registered: Mar 2003 Posts: 1661 |
slo $ffff is 6 cycles, not 4.
http://unusedino.de/ec64/technical/aay/c64/bslo.htm |
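Applying that correction to the totals of Variants #3 and #4 is simple arithmetic. The sketch below only recomputes the cycle sums, assuming 2 cycles for LDX immediate, 4 for STX absolute and 6 for SLO absolute; it says nothing about what the routines actually display.

```python
# Cycle costs assumed for this recount (absolute addressing unless noted).
CYCLES = {"ldx_imm": 2, "stx_abs": 4, "slo_abs": 6}

# Variant #3: one LDX, eight STX, eight SLO.
variant3 = CYCLES["ldx_imm"] + 8 * CYCLES["stx_abs"] + 8 * CYCLES["slo_abs"]

# Variant #4: eight SLO only.
variant4 = 8 * CYCLES["slo_abs"]

print(variant3, variant4)
```

With 6-cycle SLO the totals become 82 and 48 cycles instead of the 66 and 32 given in the listings.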
| | Mr SQL
Registered: Feb 2023 Posts: 158 |
Quoting spider-j: To be fair I also used the KERNAL output routine for printing the chars.
Small solution with loops / 78 Bytes PRG:
https://trans.jansalleine.com/c64/num2binary.prg
!cpu 6510
; ==============================================================================
save_num = 0x02
CHAROUT = 0xF1CA
; ==============================================================================
*= 0x0801
; basic TI$ timer wrapper program:
; --------------------------------
; 0 TI$="000000":SYS2092
; 1 PRINT"TIME:"TI/60
; --------------------------------
; RESULT: 8.91666667
!byte 0x18, 0x08, 0x00, 0x00
!byte 0x54, 0x49, 0x24, 0xB2
!byte 0x22, 0x30, 0x30, 0x30
!byte 0x30, 0x30, 0x30, 0x22
!byte 0x3A, 0x9E, 0x32, 0x30
!byte 0x39, 0x32, 0x00, 0x2A
!byte 0x08, 0x01, 0x00, 0x99
!byte 0x22, 0x54, 0x49, 0x4D
!byte 0x45, 0x3A, 0x22, 0x54
!byte 0x49, 0xAD, 0x36, 0x30
!byte 0x00, 0x00, 0x00
; ==============================================================================
*= 0x082C
ldy #0
-- sty save_num
ldx #0
- clc
rol save_num
bcc +
lda #'1'
!byte 0x2C
+ lda #'0'
jsr CHAROUT
inx
cpx #8
bne -
lda #0x0D
jsr CHAROUT
iny
bne --
rts
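The `!byte` rows in the listing above are a tokenized BASIC stub. A quick way to convince yourself that the line links and the SYS target are right is to walk the bytes in Python. This is a sketch only: the token table covers just the four tokens used here, and `list_stub` is a hypothetical helper name.

```python
STUB = bytes([
    0x18, 0x08, 0x00, 0x00, 0x54, 0x49, 0x24, 0xB2, 0x22, 0x30, 0x30, 0x30,
    0x30, 0x30, 0x30, 0x22, 0x3A, 0x9E, 0x32, 0x30, 0x39, 0x32, 0x00, 0x2A,
    0x08, 0x01, 0x00, 0x99, 0x22, 0x54, 0x49, 0x4D, 0x45, 0x3A, 0x22, 0x54,
    0x49, 0xAD, 0x36, 0x30, 0x00, 0x00, 0x00,
])
LOAD = 0x0801
TOKENS = {0xB2: "=", 0x9E: "SYS", 0x99: "PRINT", 0xAD: "/"}  # only what the stub uses


def list_stub(data, load):
    """Follow the line links; return (lines, address of first byte after the stub)."""
    lines, pos = [], 0
    while True:
        link = data[pos] | (data[pos + 1] << 8)
        if link == 0:                      # $0000 link marks end of program
            return lines, load + pos + 2
        number = data[pos + 2] | (data[pos + 3] << 8)
        end = data.index(0, pos + 4)       # each line ends with a zero byte
        if link != load + end + 1:
            raise ValueError("line link does not point at the next line")
        text = "".join(TOKENS.get(b, chr(b)) for b in data[pos + 4:end])
        lines.append((number, text))
        pos = link - load


lines, code_start = list_stub(STUB, LOAD)
print(lines, code_start)
```

The stub decodes to the two lines shown in the comment, and the first free address after it is 2092 ($082C), which matches both the `SYS2092` and the `*= 0x082C` in the listing.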
Slightly faster solution with unrolled loops / 28974 Bytes PRG:
https://trans.jansalleine.com/c64/num2binary_unrolled.prg
!cpu 6510
; ==============================================================================
save_num = 0x02
CHAROUT = 0xF1CA
; ==============================================================================
*= 0x0801
; basic TI$ timer wrapper program:
; --------------------------------
; 0 TI$="000000":SYS2092
; 1 PRINT"TIME:"TI/60
; --------------------------------
; RESULT: 8.9
!byte 0x18, 0x08, 0x00, 0x00
!byte 0x54, 0x49, 0x24, 0xB2
!byte 0x22, 0x30, 0x30, 0x30
!byte 0x30, 0x30, 0x30, 0x22
!byte 0x3A, 0x9E, 0x32, 0x30
!byte 0x39, 0x32, 0x00, 0x2A
!byte 0x08, 0x01, 0x00, 0x99
!byte 0x22, 0x54, 0x49, 0x4D
!byte 0x45, 0x3A, 0x22, 0x54
!byte 0x49, 0xAD, 0x36, 0x30
!byte 0x00, 0x00, 0x00
; ==============================================================================
*= 0x082C
!for i, 0, 255 {
lda #i
sta save_num
!for j, 0, 7 {
clc
rol save_num
bcc +
lda #'1'
!byte 0x2C
+ lda #'0'
jsr CHAROUT
}
lda #0x0D
jsr CHAROUT
}
rts
EDIT: corrected ror -> rol.
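Why the direction of the rotate matters can be seen in a small Python model of the inner loop (a sketch, with made-up function names). With `clc` followed by `rol`, bit 7 leaves the byte first, so the digits come out most significant bit first; with `ror` they would come out reversed.

```python
def bits_rol(n):
    """clc + rol eight times: carry receives bit 7 each round."""
    out = []
    for _ in range(8):
        carry = (n >> 7) & 1
        n = (n << 1) & 0xFF       # carry was cleared, so bit 0 becomes 0
        out.append("1" if carry else "0")
    return "".join(out)


def bits_ror(n):
    """clc + ror eight times: carry receives bit 0 each round."""
    out = []
    for _ in range(8):
        carry = n & 1
        n >>= 1                   # carry was cleared, so bit 7 becomes 0
        out.append("1" if carry else "0")
    return "".join(out)


print(bits_rol(6), bits_ror(6))
```

For the value 6 the `rol` model gives 00000110 and the `ror` model gives 01100000.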
Both good solutions! Keep in mind that for each number from 0 to 255 we also have to print the decimal number and a space before the binary string, followed by a carriage return. This will add to the time.
I put my BASIC solution in the comments on the video and noticed, as observed in this thread, that the KERNAL CHAROUT routine is consuming most of the time. Just printing 0's and 1's instead of "0" and "1" as strings increased the time of my BASIC solution by 12 seconds; the KERNAL routine handles BASIC strings faster than BASIC numbers.
My idea for an optimized assembly solution would be to use 8 page aligned 256 byte tables of 0 and 1 characters, one for each bit.
This would allow a shared index for 8 lda's of 4 cycles each without having to branch after each load, just pushing the values loaded to CHAROUT. |
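The table idea can be sketched in Python to show the data layout: one 256-entry table per bit, all indexed by the same value. This is an illustration under assumptions, not the final assembly; `build_bit_tables` and `lookup` are hypothetical names, and the characters are assumed to be the codes $30 and $31.

```python
def build_bit_tables():
    """tables[b][i] holds the character ($30 or $31) for bit b of value i."""
    return [bytes(0x30 | ((i >> b) & 1) for i in range(256)) for b in range(8)]


def lookup(tables, value):
    """Eight indexed loads with a shared index, most significant bit first."""
    return "".join(chr(tables[b][value]) for b in range(7, -1, -1))


tables = build_bit_tables()
print(lookup(tables, 0xA5), sum(len(t) for t in tables))
```

The eight tables take 2048 bytes in total, which is the price paid for removing the per-bit branch.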
| | spider-j
Registered: Oct 2004 Posts: 505 |
Quoting Mr SQL: My idea for an optimized assembly solution would be to use 8 page aligned 256 byte tables of 0 and 1 characters, one for each bit.
This would allow a shared index for 8 lda's of 4 cycles each without having to branch after each load, just pushing the values loaded to CHAROUT.
So I did it like you suggested, including the decimal number and a clear screen to get the "exact" output like in the video (I don't know what CHR$(5) does).
One could also unroll the main loop, but this won't do much because of the imprecise TI$ measuring.
https://trans.jansalleine.com/c64/num2binary_table.prg
3073 Bytes
!cpu 6510
; ==============================================================================
CRSRX = 0xD3
CHAROUT = 0xF1CA
CLRSCR = 0xE544
; ==============================================================================
*= 0x0801
; basic TI$ timer wrapper program:
; --------------------------------
; 0 TI$="000000":SYS2092
; 1 PRINT"TIME:"TI/60
; --------------------------------
; RESULT: ~9
!byte 0x18, 0x08, 0x00, 0x00
!byte 0x54, 0x49, 0x24, 0xB2
!byte 0x22, 0x30, 0x30, 0x30
!byte 0x30, 0x30, 0x30, 0x22
!byte 0x3A, 0x9E, 0x32, 0x30
!byte 0x39, 0x32, 0x00, 0x2A
!byte 0x08, 0x01, 0x00, 0x99
!byte 0x22, 0x54, 0x49, 0x4D
!byte 0x45, 0x3A, 0x22, 0x54
!byte 0x49, 0xAD, 0x36, 0x30
!byte 0x00, 0x00, 0x00
; ==============================================================================
*= 0x082C
jsr CLRSCR
ldx #0
- lda #' '
jsr CHAROUT
lda dec2,x
jsr CHAROUT
lda dec1,x
jsr CHAROUT
lda dec0,x
jsr CHAROUT
inc CRSRX
inc CRSRX
inc CRSRX
inc CRSRX
inc CRSRX
inc CRSRX
lda bit7,x
jsr CHAROUT
lda bit6,x
jsr CHAROUT
lda bit5,x
jsr CHAROUT
lda bit4,x
jsr CHAROUT
lda bit3,x
jsr CHAROUT
lda bit2,x
jsr CHAROUT
lda bit1,x
jsr CHAROUT
lda bit0,x
jsr CHAROUT
lda #0x0D
jsr CHAROUT
inx
bne -
rts
; ==============================================================================
!align 255, 0, 0
bit7: !for i, 0, 255 {
!byte ((i AND %10000000) >> 7) OR 0x30
}
bit6: !for i, 0, 255 {
!byte ((i AND %01000000) >> 6) OR 0x30
}
bit5: !for i, 0, 255 {
!byte ((i AND %00100000) >> 5) OR 0x30
}
bit4: !for i, 0, 255 {
!byte ((i AND %00010000) >> 4) OR 0x30
}
bit3: !for i, 0, 255 {
!byte ((i AND %00001000) >> 3) OR 0x30
}
bit2: !for i, 0, 255 {
!byte ((i AND %00000100) >> 2) OR 0x30
}
bit1: !for i, 0, 255 {
!byte ((i AND %00000010) >> 1) OR 0x30
}
bit0: !for i, 0, 255 {
!byte ((i AND %00000001) >> 0) OR 0x30
}
; ==============================================================================
dec2: !for i, 0, 255 {
!if i < 100 {
!byte 0x20
} else if i < 200 {
!byte 0x31
} else {
!byte 0x32
}
}
dec1: !for i, 0, 255 {
!if i < 10 {
!byte 0x20
} else {
!byte ((i / 10) - ((i / 100) * 10)) OR 0x30
}
}
dec0: !for i, 0, 255 {
!byte (i % 10) OR 0x30
}
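The three decimal tables built by the assembler loops above can be mirrored in Python to check the digit formulas. This is a sketch that follows the same rules as the listing (space padding for leading zeros, $30 OR digit otherwise); `build_dec_tables` and `row` are hypothetical names.

```python
def build_dec_tables():
    """Return (dec2, dec1, dec0): hundreds, tens and ones characters for 0..255."""
    dec2, dec1, dec0 = bytearray(), bytearray(), bytearray()
    for i in range(256):
        # Hundreds: blank below 100, otherwise '1' or '2'.
        dec2.append(0x20 if i < 100 else (0x31 if i < 200 else 0x32))
        # Tens: blank below 10, otherwise (i / 10) - ((i / 100) * 10) as in the listing.
        dec1.append(0x20 if i < 10 else ((i // 10) - (i // 100) * 10) | 0x30)
        # Ones: always a digit.
        dec0.append((i % 10) | 0x30)
    return dec2, dec1, dec0


def row(tables, i):
    """The three characters printed for value i."""
    return "".join(chr(t[i]) for t in tables)


dec_tables = build_dec_tables()
print(repr(row(dec_tables, 7)), repr(row(dec_tables, 42)), repr(row(dec_tables, 205)))
```

The result is a right-aligned three-character field, the same as formatting the number with width 3 and space padding.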
As already stated in this thread: KERNAL char out / print is the most expensive operation in this whole scenario anyway.
But without changing the "rules" like others suggested, there's no way around that other than implementing your own faster routines, which goes a little bit beyond what such small "exercises" usually want to accomplish.
That guy making those isn't a scener and has a very "oldskool" approach to everything – working with your C64 as "intended" by the user manual :-) It's sometimes still kind of fun, but I usually also fast forward a lot when watching videos from him. |
| | Mr SQL
Registered: Feb 2023 Posts: 158 |
Very cool, I tested this Assembly version at 8.75 seconds!
This is probably as fast as we can make it run in asm if we follow Robin's exercise strictly.
I like the way you had the Assembler create the tables! Which Assembler are you using?
I was motivated to try that in BASIC and got the BASIC version down to 12 seconds, including the pre-calc to load the array. It uses a clustered index on a single table, since BASIC is a high-level language that cannot handle narrow tables as efficiently.
The first part of the prg builds the data statements, which is already done; goto 1000 loads the array and iterates:
https://relationalframework.com/basic12seconds.prg |
| | Street Tuff Account closed
Registered: Feb 2002 Posts: 88 |
>> I like the way you had the Assembler create the tables! Which Assembler are you using?
That's the ACME assembler. https://sourceforge.net/projects/acme-crossass/ |
| | ChristopherJam
Registered: Aug 2004 Posts: 1423 |
This one takes 13.18 seconds (as compared to just doing print:return in the printing routine, which takes 8.91 - the binary conversion+print is hence around 4.27). So yeah, ws is correct. It's dominated by screen scroll time.
Those times are if you run after loading from reset - if you start at the bottom of the screen everything is slower, because you start scrolling immediately.
0 gosub9:ti$="000000":goto2
1 printd$(i/16)d$(iand15):return
2 fori=0to255:gosub1:next
3 print"time:"ti/60:end
7 data"0000","0001","0010","0011","0100","0101","0110","0111"
8 data"1000","1001","1010","1011","1100","1101","1110","1111"
9 dimd$(16):fori=0to15:readd$(i):next:return
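The BASIC listing above splits each byte into two nibbles and looks both up in a 16-entry string table. The same approach in Python, as a sketch with hypothetical names (`NIBBLES`, `to_binary`), also reproduces the timing subtraction from the post:

```python
# 16 precomputed 4-character strings, like the DATA lines in the listing.
NIBBLES = [format(n, "04b") for n in range(16)]


def to_binary(i):
    """High nibble string followed by low nibble string, as in d$(i/16)d$(iand15)."""
    return NIBBLES[i // 16] + NIBBLES[i & 15]


# Time attributed to conversion plus print: full run minus the print-only run.
conversion_seconds = round(13.18 - 8.91, 2)

print(to_binary(0xA5), conversion_seconds)
```

Two table lookups per number replace eight per-bit decisions, which is why the conversion cost stays small compared with the screen scrolling.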
|