[CSDb] - User Forums - Fastest time printing binary in BASIC or Assembly

2024-01-11 08:49

Mr SQL

Registered: Feb 2023
Posts: 159

Fastest time printing binary in BASIC or Assembly

Interesting Video on 8-bit show and tell:
https://www.youtube.com/watch?v=P8t6otqoz_E

What's the fastest time you can print binary to the screen in BASIC or Assembly?

I got it down to 26 seconds in BASIC without a pre-calc routine.

... 29 posts hidden ...

2024-01-11 21:14

chatGPZ

Registered: Dec 2001
Posts: 11540

That's probably correct - so just a faster scroll routine would speed this up significantly :)

2024-01-11 22:15

ws

Registered: Apr 2012
Posts: 251

nice challenge, tho

2024-01-12 00:08

JackAsser

Registered: Jun 2002
Posts: 2038

Quote: This probably does not count, but something like this seems obvious if speed is really the priority. This one assumes that you have a custom charset that has a "0" char at position 0 and a "1" char at position 1 in the charset, and it will always print the result at a specific fixed location on the screen.

;a register contains the byte to display as binary
ldx #1
sax $0407
lsr a
sax $0406
lsr a
sax $0405
lsr a
sax $0404
lsr a
sax $0403
lsr a
sax $0402
lsr a
sax $0401
lsr a
sax $0400

=8*2+8*4=48 cycles, so you could do it thousands of times in a single second.

SAX is one of the illegal opcodes, in case someone happens to be unfamiliar with that:
http://unusedino.de/ec64/technical/aay/c64/bsax.htm

I find this a much more interesting challenge! Can this be improved?!

Rules: it should be printed on a normal text screen, any charset is ok. One digit per char.
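
As a rough cross-check of the quoted SAX routine, here is a small Python model (not 6502 code) of what the eight stores compute. It assumes, as the quote does, a custom charset in which screen code 0 shows "0" and screen code 1 shows "1". The function name `sax_binary` and the list standing in for screen RAM are made up for illustration.

```python
def sax_binary(value):
    """Model of the unrolled SAX routine: the 8 screen codes for $0400-$0407."""
    screen = [0] * 8          # stands in for $0400..$0407
    a = value & 0xFF          # accumulator holds the byte to display
    x = 1                     # ldx #1
    for cell in range(7, -1, -1):
        screen[cell] = a & x  # sax: store (A AND X), i.e. the current bit 0
        a >>= 1               # lsr a (modelled after every store; harmless)
    return screen

print(sax_binary(0b10100101))  # -> [1, 0, 1, 0, 0, 1, 0, 1]
```

Bit 0 ends up in the rightmost cell, so the digits read most significant bit first from left to right.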

2024-01-12 06:46

TWW

Registered: Jul 2009
Posts: 557

Quote: I find this a much more interesting challenge! Can this be improved?!

Rules: it should be printed on a normal text screen, any charset is ok. One digit per char.

14 bytes, 16 cycles excl. call/ret:

    sta $2000
    lda #$3b
    sta $d011
    lda #$18
    sta $d018
    rts

How about bitmap with a 1x1 pixel charset :D

Ah yeah, normal text screen...

But seriously, are we still talking about peeking the CIA data port (in which case 5 bits would be enough) or a general hexToBin()?

EDIT:

2 variants without resorting to charset trickery;

    // Variant #1:

    ldx #$18        // 2   ($18 shifted left once gives $30 = "0")
    stx $0400       // 4
    stx $0401       // 4
    stx $0402       // 4
    stx $0403       // 4
    stx $0404       // 4
    stx $0405       // 4
    stx $0406       // 4
    stx $0407       // 4
    lsr             // 2   bit -> carry
    rol $0400       // 6   $18 -> $30 + carry ("0" or "1")
    lsr             // 2
    rol $0401       // 6
    lsr             // 2
    rol $0402       // 6
    lsr             // 2
    rol $0403       // 6
    lsr             // 2
    rol $0404       // 6
    lsr             // 2
    rol $0405       // 6
    lsr             // 2
    rol $0406       // 6
    lsr             // 2
    rol $0407       // 6
                    // 8 x 4 + 8 x 6 + 9 x 2 = 98 cycles / 58 bytes
                    // Note: bit 0 lands at $0400, so the digits appear LSB first.
    // Probably has some illegal OPC voodoo potential.


    // Variant #2
    ldx #'0'                        // 2
    ldy #'1'                        // 2 -> 4 cycles 'overhead'
    lsr                             // 2
    bcc !Next0+                     // 2 / 3
    sty $0400                       // 4 -> 8/9 cycles dep. the branch
    lsr
    bcc !Next1+
!Prev0:
    sty $0401
    lsr
    bcc !Next2+
!Prev1:
    sty $0402
    lsr
    bcc !Next3+
!Prev2:
    sty $0403
    lsr
    bcc !Next4+
!Prev3:
    sty $0404
    lsr
    bcc !Next5+
!Prev4:
    sty $0405
    lsr
    bcc !Next6+
!Prev5:
    sty $0406
    lsr
    bcc !Next7+
!Prev6:
    sty $0407
    rts
!Next0:
    stx $0400
    lsr
    bcs !Prev0-             // carry set: next digit is "1"
!Next1:
    stx $0401
    lsr
    bcs !Prev1-
!Next2:
    stx $0402
    lsr
    bcs !Prev2-
!Next3:
    stx $0403
    lsr
    bcs !Prev3-
!Next4:
    stx $0404
    lsr
    bcs !Prev4-
!Next5:
    stx $0405
    lsr
    bcs !Prev5-
!Next6:
    stx $0406
    lsr
    bcs !Prev6-
!Next7:
    stx $0407
    rts                     // 4 + 8 x 8 (+8) = ~68/76 cycles excl. rts / ~100 bytes

Edit 2: 2 variants with voodoo and shameless charset trickery:

    // Variant #3 with voodoo
    ldx #$18        // 2
    stx $0400       // 4
    stx $0401       // 4
    stx $0402       // 4
    stx $0403       // 4
    stx $0404       // 4
    stx $0405       // 4
    stx $0406       // 4
    stx $0407       // 4
    slo $0400       // 4
    slo $0401       // 4
    slo $0402       // 4
    slo $0403       // 4
    slo $0404       // 4
    slo $0405       // 4
    slo $0406       // 4
    slo $0407       // 4
                    // 16 x 4 + 2 = 66 cycles / 50 bytes

    // Variant #4 with voodoo & charset (space = "0", ! = "1")
    slo $0400       // 4
    slo $0401       // 4
    slo $0402       // 4
    slo $0403       // 4
    slo $0404       // 4
    slo $0405       // 4
    slo $0406       // 4
    slo $0407       // 4
                    // 8 x 4 = 32 cycles / 24 bytes

2024-01-12 16:59

Frantic

Registered: Mar 2003
Posts: 1661

slo $ffff is 6 cycles, not 4.

http://unusedino.de/ec64/technical/aay/c64/bslo.htm
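
To see what the 6-cycle figure does to the totals, the cycle counts can be re-tallied. The sketch below is only bookkeeping in Python: it counts the instructions as they appear in the listings and multiplies by the standard 6502 timings for each addressing mode. The dictionary names are made up for illustration, and nothing here checks whether the routines produce the right output.

```python
# Cycle counts per instruction/addressing mode as used in the listings
# (standard 6502 timings, with 6 cycles for SLO absolute as Frantic notes).
CYCLES = {
    "ldx #imm": 2,
    "lsr a": 2,
    "stx abs": 4,
    "sax abs": 4,
    "slo abs": 6,
}

def total_cycles(counts):
    """Sum cycles for a dict of {instruction: number of occurrences}."""
    return sum(CYCLES[name] * n for name, n in counts.items())

sax_routine = {"ldx #imm": 1, "sax abs": 8, "lsr a": 7}
variant_3 = {"ldx #imm": 1, "stx abs": 8, "slo abs": 8}
variant_4 = {"slo abs": 8}

for label, counts in [("SAX routine", sax_routine),
                      ("Variant #3", variant_3),
                      ("Variant #4", variant_4)]:
    print(label, total_cycles(counts))
```

With 6 cycles per SLO, Variant #3 comes to 82 cycles and Variant #4 to 48, the same count as the SAX routine.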

2024-01-12 20:49

Mr SQL

Registered: Feb 2023
Posts: 159

Quoting spider-j

To be fair I also used the KERNAL output routine for printing the chars.

Small solution with loops / 78 Bytes PRG:
https://trans.jansalleine.com/c64/num2binary.prg

                    !cpu 6510
; ==============================================================================
save_num    = 0x02
CHAROUT     = 0xF1CA
; ==============================================================================
                    *= 0x0801
                    ; basic TI$ timer wrapper program:
                    ; --------------------------------
                    ; 0 TI$="000000":SYS2092
                    ; 1 PRINT"TIME:"TI/60
                    ; --------------------------------
                    ; RESULT: 8.91666667
                    !byte 0x18, 0x08, 0x00, 0x00
                    !byte 0x54, 0x49, 0x24, 0xB2
                    !byte 0x22, 0x30, 0x30, 0x30
                    !byte 0x30, 0x30, 0x30, 0x22
                    !byte 0x3A, 0x9E, 0x32, 0x30
                    !byte 0x39, 0x32, 0x00, 0x2A
                    !byte 0x08, 0x01, 0x00, 0x99
                    !byte 0x22, 0x54, 0x49, 0x4D
                    !byte 0x45, 0x3A, 0x22, 0x54
                    !byte 0x49, 0xAD, 0x36, 0x30
                    !byte 0x00, 0x00, 0x00
; ==============================================================================
                    *= 0x082C
                    ldy #0
--                  sty save_num
                    ldx #0
-                   clc
                    rol save_num
                    bcc +
                    lda #'1'
                    !byte 0x2C
+                   lda #'0'
                    jsr CHAROUT
                    inx
                    cpx #8
                    bne -
                    lda #0x0D
                    jsr CHAROUT
                    iny
                    bne --
                    rts

Slightly faster solution with unrolled loops / 28974 Bytes PRG:
https://trans.jansalleine.com/c64/num2binary_unrolled.prg

                    !cpu 6510
; ==============================================================================
save_num    = 0x02
CHAROUT     = 0xF1CA
; ==============================================================================
                    *= 0x0801
                    ; basic TI$ timer wrapper program:
                    ; --------------------------------
                    ; 0 TI$="000000":SYS2092
                    ; 1 PRINT"TIME:"TI/60
                    ; --------------------------------
                    ; RESULT: 8.9
                    !byte 0x18, 0x08, 0x00, 0x00
                    !byte 0x54, 0x49, 0x24, 0xB2
                    !byte 0x22, 0x30, 0x30, 0x30
                    !byte 0x30, 0x30, 0x30, 0x22
                    !byte 0x3A, 0x9E, 0x32, 0x30
                    !byte 0x39, 0x32, 0x00, 0x2A
                    !byte 0x08, 0x01, 0x00, 0x99
                    !byte 0x22, 0x54, 0x49, 0x4D
                    !byte 0x45, 0x3A, 0x22, 0x54
                    !byte 0x49, 0xAD, 0x36, 0x30
                    !byte 0x00, 0x00, 0x00
; ==============================================================================
                    *= 0x082C
                    !for i, 0, 255 {
                        lda #i
                        sta save_num
                        !for j, 0, 7 {
                            clc
                            rol save_num
                            bcc +
                            lda #'1'
                            !byte 0x2C
+                           lda #'0'
                            jsr CHAROUT
                        }
                        lda #0x0D
                        jsr CHAROUT
                    }
                    rts

EDIT: corrected ror -> rol.
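
For readers who don't speak 6502, the core of both listings can be modelled in a few lines of Python. This is a sketch of the logic only: `rol save_num` moves bit 7 into the carry, and the carry picks '1' or '0' (the `!byte 0x2C` is a BIT opcode that swallows the following `lda #'0'`). The function names are invented for this illustration, and the KERNAL output call is replaced by building a string.

```python
def bits_msb_first(value):
    """Model of the inner loop: rol save_num, then choose '1' or '0' from carry."""
    save_num = value & 0xFF
    out = []
    for _ in range(8):
        carry = (save_num >> 7) & 1          # bit 7 moves into the carry
        save_num = (save_num << 1) & 0xFF    # clc + rol shifts a 0 into bit 0
        out.append("1" if carry else "0")
    return "".join(out)

def all_lines():
    """Model of the outer loop: one line of digits per value 0..255."""
    return [bits_msb_first(n) for n in range(256)]

print(bits_msb_first(5))  # -> 00000101
```

Rotating left yields the most significant bit first, which is presumably why the ror -> rol correction was needed.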

Both good solutions! Keep in mind we also have to print the decimal number with a space and then the binary string followed by a carriage return for each number from 0-255. This will add to the time.

I put my BASIC solution in the comments on the video and noticed, as observed in this thread, that the KERNAL CHAROUT routine is consuming most of the time. Just printing 0's and 1's as numbers instead of "0" and "1" as strings increased the time of my BASIC solution by 12 seconds; the KERNAL routine handles BASIC strings faster than BASIC numbers.

My idea for an optimized assembly solution would be to use 8 page aligned 256 byte tables of 0 and 1 characters, one for each bit.

This would allow a shared index for 8 lda's of 4 cycles each without having to branch after each load, just pushing the values loaded to CHAROUT.
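
A minimal Python sketch of that table idea, just to show the data layout: eight 256-entry tables, one per bit, all read with the same index. The names `BIT_TABLES` and `binary_via_tables` are made up, and joining characters stands in for pushing each loaded value to CHAROUT.

```python
# Eight 256-byte tables, one per bit, each entry being the character "0" or "1".
BIT_TABLES = [
    bytes(0x30 | ((i >> bit) & 1) for i in range(256))
    for bit in range(8)
]

def binary_via_tables(x):
    """Look up all eight digits with the same index x, MSB first, no branching."""
    return "".join(chr(BIT_TABLES[bit][x]) for bit in range(7, -1, -1))

print(binary_via_tables(0x81))  # -> 10000001
```

The price is 8 x 256 = 2048 bytes of table data in exchange for removing the per-bit branch.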

2024-01-12 23:52

spider-j

Registered: Oct 2004
Posts: 505

Quoting Mr SQL

My idea for an optimized assembly solution would be to use 8 page aligned 256 byte tables of 0 and 1 characters, one for each bit.

This would allow a shared index for 8 lda's of 4 cycles each without having to branch after each load, just pushing the values loaded to CHAROUT.

So I did it like you suggested, and also included the decimal number and a clear screen to get the "exact" output like in the video (I don't know what CHR$(5) does).

One could also unroll the main loop, but this won't do much because of the imprecise TI$ measuring.

https://trans.jansalleine.com/c64/num2binary_table.prg
3073 Bytes

                    !cpu 6510
; ==============================================================================
CRSRX       = 0xD3
CHAROUT     = 0xF1CA
CLRSCR      = 0xE544
; ==============================================================================
                    *= 0x0801
                    ; basic TI$ timer wrapper program:
                    ; --------------------------------
                    ; 0 TI$="000000":SYS2092
                    ; 1 PRINT"TIME:"TI/60
                    ; --------------------------------
                    ; RESULT: ~9
                    !byte 0x18, 0x08, 0x00, 0x00
                    !byte 0x54, 0x49, 0x24, 0xB2
                    !byte 0x22, 0x30, 0x30, 0x30
                    !byte 0x30, 0x30, 0x30, 0x22
                    !byte 0x3A, 0x9E, 0x32, 0x30
                    !byte 0x39, 0x32, 0x00, 0x2A
                    !byte 0x08, 0x01, 0x00, 0x99
                    !byte 0x22, 0x54, 0x49, 0x4D
                    !byte 0x45, 0x3A, 0x22, 0x54
                    !byte 0x49, 0xAD, 0x36, 0x30
                    !byte 0x00, 0x00, 0x00
; ==============================================================================
                    *= 0x082C
                    jsr CLRSCR
                    ldx #0
-                   lda #' '
                    jsr CHAROUT
                    lda dec2,x
                    jsr CHAROUT
                    lda dec1,x
                    jsr CHAROUT
                    lda dec0,x
                    jsr CHAROUT
                    inc CRSRX
                    inc CRSRX
                    inc CRSRX
                    inc CRSRX
                    inc CRSRX
                    inc CRSRX
                    lda bit7,x
                    jsr CHAROUT
                    lda bit6,x
                    jsr CHAROUT
                    lda bit5,x
                    jsr CHAROUT
                    lda bit4,x
                    jsr CHAROUT
                    lda bit3,x
                    jsr CHAROUT
                    lda bit2,x
                    jsr CHAROUT
                    lda bit1,x
                    jsr CHAROUT
                    lda bit0,x
                    jsr CHAROUT
                    lda #0x0D
                    jsr CHAROUT
                    inx
                    bne -
                    rts
; ==============================================================================
                    !align 255, 0, 0
bit7:               !for i, 0, 255 {
                         !byte ((i AND %10000000) >> 7) OR 0x30
                    }
bit6:               !for i, 0, 255 {
                         !byte ((i AND %01000000) >> 6) OR 0x30
                    }
bit5:               !for i, 0, 255 {
                         !byte ((i AND %00100000) >> 5) OR 0x30
                    }
bit4:               !for i, 0, 255 {
                         !byte ((i AND %00010000) >> 4) OR 0x30
                    }
bit3:               !for i, 0, 255 {
                         !byte ((i AND %00001000) >> 3) OR 0x30
                    }
bit2:               !for i, 0, 255 {
                         !byte ((i AND %00000100) >> 2) OR 0x30
                    }
bit1:               !for i, 0, 255 {
                         !byte ((i AND %00000010) >> 1) OR 0x30
                    }
bit0:               !for i, 0, 255 {
                         !byte ((i AND %00000001) >> 0) OR 0x30
                    }
; ==============================================================================
dec2:               !for i, 0, 255 {
                         !if i < 100 {
                              !byte 0x20
                         } else if i < 200 {
                              !byte 0x31
                         } else {
                              !byte 0x32
                         }
                    }
dec1:               !for i, 0, 255 {
                         !if i < 10 {
                              !byte 0x20
                         } else {
                              !byte ((i / 10) - ((i / 100) * 10)) OR 0x30
                         }
                    }
dec0:               !for i, 0, 255 {
                         !byte (i % 10) OR 0x30
                    }

As already stated in this thread: KERNAL char out / print is the most expensive operation in this whole scenario anyway.

But without changing the "rules" like others suggested, there's no way around that other than implementing your own faster routines, which goes a little bit beyond what such small "exercises" usually want to accomplish.

That guy making those isn't a scener and has a very "oldskool" approach to everything – working with your C64 as "intended" by the user manual :-) It's sometimes still kind of fun, but I usually also fast forward a lot when watching videos from him.
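
The dec2/dec1/dec0 tables in the listing can be sanity-checked with a short Python sketch that follows the same rules as the `!for` loops. It assumes the assembler's `/` truncates on integers (modelled here with `//`); the function and variable names are invented for this illustration.

```python
def dec_tables():
    """Rebuild dec2/dec1/dec0 using the same rules as the assembler loops."""
    dec2, dec1, dec0 = [], [], []
    for i in range(256):
        # hundreds digit: blank below 100, otherwise "1" or "2"
        dec2.append(" " if i < 100 else ("1" if i < 200 else "2"))
        # tens digit: blank below 10, otherwise (i / 10) - ((i / 100) * 10)
        dec1.append(" " if i < 10 else str((i // 10) - ((i // 100) * 10)))
        # ones digit: always printed
        dec0.append(str(i % 10))
    return dec2, dec1, dec0

DEC2, DEC1, DEC0 = dec_tables()

def decimal_field(i):
    """Three characters printed for value i, right-aligned with leading blanks."""
    return DEC2[i] + DEC1[i] + DEC0[i]

print(repr(decimal_field(7)), repr(decimal_field(105)))  # -> '  7' '105'
```

For every value from 0 to 255 the three lookups give the number right-aligned in a three-character field.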

2024-01-13 05:12

Mr SQL

Registered: Feb 2023
Posts: 159

Very cool! I timed this assembly version at 8.75 seconds.

This is probably as fast as we can make it run in asm if we follow Robin's exercise strictly.

I like the way you had the Assembler create the tables! Which Assembler are you using?

That motivated me to try the same in BASIC, and I got the BASIC version down to 12 seconds including the pre-calc that loads the array. It uses a clustered index on a single table, since BASIC is a high-level language that cannot handle narrow tables as efficiently.

The first part of the PRG builds the DATA statements, which is already done; GOTO 1000 loads the array and iterates:

https://relationalframework.com/basic12seconds.prg

2024-01-13 14:10

Street Tuff
Account closed

Registered: Feb 2002
Posts: 88

>> I like the way you had the Assembler create the tables! Which Assembler are you using?

That's the ACME assembler: https://sourceforge.net/projects/acme-crossass/

2024-01-13 17:02

ChristopherJam

Registered: Aug 2004
Posts: 1427

This one takes 13.18 seconds (as compared to just doing print:return in the printing routine, which takes 8.91 - the binary conversion+print is hence around 4.27). So yeah, ws is correct. It's dominated by screen scroll time.

Those times are if you run after loading from reset - if you start at the bottom of the screen everything is slower, because you start scrolling immediately.

0 gosub9:ti$="000000":goto2
1 printd$(i/16)d$(iand15):return
2 fori=0to255:gosub1:next
3 print"time:"ti/60:end
7 data"0000","0001","0010","0011","0100","0101","0110","0111"
8 data"1000","1001","1010","1011","1100","1101","1110","1111"
9 dimd$(16):fori=0to15:readd$(i):next:return
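
The trick in line 1 of the BASIC listing is a 16-entry nibble table: the high and low four bits of the number each select a ready-made four-digit string. A small Python model of the same lookup, assuming the BASIC array subscript `i/16` is truncated to an integer (names invented for this illustration):

```python
# The same 16 strings as the DATA lines 7 and 8.
NIBBLES = [format(n, "04b") for n in range(16)]

def binary_via_nibbles(i):
    """Model of line 1: d$(i/16) followed by d$(i and 15)."""
    return NIBBLES[i // 16] + NIBBLES[i & 15]

print(binary_via_nibbles(0xA5))  # -> 10100101
```

This needs two string lookups per number instead of eight single-digit prints.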
