[CSDb] - User Forums - Fastest time printing binary in BASIC or Assembly

2024-01-11 08:49

Mr SQL

Registered: Feb 2023
Posts: 116

Fastest time printing binary in BASIC or Assembly

Interesting Video on 8-bit show and tell:
https://www.youtube.com/watch?v=P8t6otqoz_E

What's the fastest time you can print binary to the screen in BASIC or Assembly?

I got it down to 26 seconds in BASIC without a pre-calc routine.

2024-01-11 11:42

spider-j

Registered: Oct 2004
Posts: 449

To be fair I also used the KERNAL output routine for printing the chars.

Small solution with loops / 78 Bytes PRG:
https://trans.jansalleine.com/c64/num2binary.prg

                    !cpu 6510
; ==============================================================================
save_num    = 0x02
CHAROUT     = 0xF1CA
; ==============================================================================
                    *= 0x0801
                    ; basic TI$ timer wrapper program:
                    ; --------------------------------
                    ; 0 TI$="000000":SYS2092
                    ; 1 PRINT"TIME:"TI/60
                    ; --------------------------------
                    ; RESULT: 8.91666667
                    !byte 0x18, 0x08, 0x00, 0x00
                    !byte 0x54, 0x49, 0x24, 0xB2
                    !byte 0x22, 0x30, 0x30, 0x30
                    !byte 0x30, 0x30, 0x30, 0x22
                    !byte 0x3A, 0x9E, 0x32, 0x30
                    !byte 0x39, 0x32, 0x00, 0x2A
                    !byte 0x08, 0x01, 0x00, 0x99
                    !byte 0x22, 0x54, 0x49, 0x4D
                    !byte 0x45, 0x3A, 0x22, 0x54
                    !byte 0x49, 0xAD, 0x36, 0x30
                    !byte 0x00, 0x00, 0x00
; ==============================================================================
                    *= 0x082C
                    ldy #0
--                  sty save_num
                    ldx #0
-                   clc
                    rol save_num
                    bcc +
                    lda #'1'
                    !byte 0x2C
+                   lda #'0'
                    jsr CHAROUT
                    inx
                    cpx #8
                    bne -
                    lda #0x0D
                    jsr CHAROUT
                    iny
                    bne --
                    rts

Slightly faster solution with unrolled loops / 28974 Bytes PRG:
https://trans.jansalleine.com/c64/num2binary_unrolled.prg

                    !cpu 6510
; ==============================================================================
save_num    = 0x02
CHAROUT     = 0xF1CA
; ==============================================================================
                    *= 0x0801
                    ; basic TI$ timer wrapper program:
                    ; --------------------------------
                    ; 0 TI$="000000":SYS2092
                    ; 1 PRINT"TIME:"TI/60
                    ; --------------------------------
                    ; RESULT: 8.9
                    !byte 0x18, 0x08, 0x00, 0x00
                    !byte 0x54, 0x49, 0x24, 0xB2
                    !byte 0x22, 0x30, 0x30, 0x30
                    !byte 0x30, 0x30, 0x30, 0x22
                    !byte 0x3A, 0x9E, 0x32, 0x30
                    !byte 0x39, 0x32, 0x00, 0x2A
                    !byte 0x08, 0x01, 0x00, 0x99
                    !byte 0x22, 0x54, 0x49, 0x4D
                    !byte 0x45, 0x3A, 0x22, 0x54
                    !byte 0x49, 0xAD, 0x36, 0x30
                    !byte 0x00, 0x00, 0x00
; ==============================================================================
                    *= 0x082C
                    !for i, 0, 255 {
                         lda #i
                         sta save_num
                         !for j, 0, 7 {
                              clc
                              rol save_num
                              bcc +
                              lda #'1'
                              !byte 0x2C
+                             lda #'0'
                              jsr CHAROUT
                         }
                         lda #0x0D
                         jsr CHAROUT
                    }
                    rts

EDIT: corrected ror -> rol.
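
The two PRG sizes quoted above can be cross-checked from the listings. This is a small arithmetic sketch, assuming the standard 6502 instruction lengths and a two-byte load address at the start of the PRG file; the dictionary keys are just illustrative names.

```python
# Byte sizes of the 6502 instructions used in the listings (opcode + operand).
SIZE = {"imm": 2, "zp": 2, "abs": 3, "implied": 1, "branch": 2, "bit_skip": 1}

LOAD_ADDRESS = 2               # two-byte load address at the start of a PRG
BASIC_STUB = 0x082C - 0x0801   # BASIC wrapper occupies $0801..$082B (43 bytes)


def print_bit(with_clc=True):
    """Bytes for one bit: [clc] / rol zp / bcc / lda # / !byte $2C / lda # / jsr."""
    n = (SIZE["zp"] + SIZE["branch"] + SIZE["imm"]
         + SIZE["bit_skip"] + SIZE["imm"] + SIZE["abs"])
    return n + (SIZE["implied"] if with_clc else 0)


def looped_prg_size():
    setup = SIZE["imm"] + SIZE["zp"] + SIZE["imm"]                        # ldy #0 / sty zp / ldx #0
    inner = print_bit() + SIZE["implied"] + SIZE["imm"] + SIZE["branch"]  # + inx / cpx #8 / bne
    outer = SIZE["imm"] + SIZE["abs"] + SIZE["implied"] + SIZE["branch"]  # lda #$0D / jsr / iny / bne
    return LOAD_ADDRESS + BASIC_STUB + setup + inner + outer + SIZE["implied"]  # + rts


def unrolled_prg_size():
    # lda #i / sta zp, eight unrolled bits, then lda #$0D / jsr
    per_number = SIZE["imm"] + SIZE["zp"] + 8 * print_bit() + SIZE["imm"] + SIZE["abs"]
    return LOAD_ADDRESS + BASIC_STUB + 256 * per_number + SIZE["implied"]


print(looped_prg_size(), unrolled_prg_size())
```

Both totals agree with the sizes given in the post, so roughly 29 KB of unrolled code buys about one jiffy in the measured time.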

2024-01-11 12:00

spider-j

Registered: Oct 2004
Posts: 449

Ah, no need to clear carry. Doesn't change anything in the loop version though. But unrolled version gets down to 8.88333334:

                    !cpu 6510
; ==============================================================================
save_num    = 0x02
CHAROUT     = 0xF1CA
; ==============================================================================
                    *= 0x0801
                    ; basic TI$ timer wrapper program:
                    ; --------------------------------
                    ; 0 TI$="000000":SYS2092
                    ; 1 PRINT"TIME:"TI/60
                    ; --------------------------------
                    ; RESULT: 8.88333334
                    !byte 0x18, 0x08, 0x00, 0x00
                    !byte 0x54, 0x49, 0x24, 0xB2
                    !byte 0x22, 0x30, 0x30, 0x30
                    !byte 0x30, 0x30, 0x30, 0x22
                    !byte 0x3A, 0x9E, 0x32, 0x30
                    !byte 0x39, 0x32, 0x00, 0x2A
                    !byte 0x08, 0x01, 0x00, 0x99
                    !byte 0x22, 0x54, 0x49, 0x4D
                    !byte 0x45, 0x3A, 0x22, 0x54
                    !byte 0x49, 0xAD, 0x36, 0x30
                    !byte 0x00, 0x00, 0x00
; ==============================================================================
                    *= 0x082C
                    !for i, 0, 255 {
                         lda #i
                         sta save_num
                         !for j, 0, 7 {
                              asl save_num
                              bcc +
                              lda #'1'
                              !byte 0x2C
+                             lda #'0'
                              jsr CHAROUT
                         }
                         lda #0x0D
                         jsr CHAROUT
                    }
                    rts

2024-01-11 12:13

spider-j

Registered: Oct 2004
Posts: 449

Hm. Okay, measuring time with TI$ seems very unreliable, at least when using VICE autostart and/or warp. Someone should test it on a real machine. But even there I'm not sure if TI$ is consistent (?) ...
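
One thing that can be said independently of the emulator: the wrapper prints TI/60, and TI counts jiffies (ticks of the KERNAL's regular interrupt, 60 per second), so the measurement cannot resolve anything finer than 1/60 s. A quick sketch converting the three results posted so far back to jiffy counts; the labels are only descriptive.

```python
JIFFIES_PER_SECOND = 60  # the divisor used by the wrapper's PRINT TI/60


def to_jiffies(seconds_printed):
    """Convert a printed TI/60 value back to the underlying jiffy count."""
    return round(seconds_printed * JIFFIES_PER_SECOND)


results = {"looped": 8.91666667, "unrolled": 8.9, "unrolled with asl": 8.88333334}
for name, seconds in results.items():
    print(name, to_jiffies(seconds))
```

The three results come out as consecutive jiffy counts, so each improvement is exactly one tick, which is at the limit of what this timing method can show.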

2024-01-11 13:43

Jetboy

Registered: Jul 2006
Posts: 227

TI$ is probably not good enough to time machine language code.

2024-01-11 16:16

Martin Piper

Registered: Nov 2007
Posts: 645

If you're using VICE, use the monitor command "stopwatch" (or "sw") to reset and display the stopwatch. By adding a couple of breakpoints it can cover the span from the start of the RUN command until it returns. Also remember to SEI early to avoid the regular IRQ.

2024-01-11 18:33

Frantic

Registered: Mar 2003
Posts: 1629

This probably does not count, but something like this seems obvious if speed is really the priority. This one assumes that you have a custom charset that has a "0" char at position 0 and a "1" char at position 1 in the charset, and it will always print the result at a specific fixed location on the screen.

;a register contains the byte to display as binary
ldx #1
sax $0407
lsr a
sax $0406
lsr a
sax $0405
lsr a
sax $0404
lsr a
sax $0403
lsr a
sax $0402
lsr a
sax $0401
lsr a
sax $0400

=8*2+8*4=48 cycles, so you could do it thousands of times in a single second.

SAX is one of the illegal opcodes, in case someone happens to be unfamiliar with that:
http://unusedino.de/ec64/technical/aay/c64/bsax.htm
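
For anyone who wants to check the routine without an emulator, here is a small model of it in Python. It assumes only what the post states: SAX stores A AND X, and the cycle counts are 2 for `ldx #` and `lsr a` and 4 for `sax` absolute. The function name is made up for the sketch.

```python
def sax_binary(a):
    """Model of the routine: returns (screen codes at $0400..$0407, cycles)."""
    screen = [None] * 8
    x = 1                           # ldx #1
    cycles = 2
    for offset in range(7, -1, -1):
        screen[offset] = a & x      # sax $0400+offset stores A AND X
        cycles += 4
        if offset:                  # lsr a between stores (7 times in total)
            a >>= 1
            cycles += 2
    return screen, cycles


codes, cycles = sax_binary(0b10110001)
print(codes, cycles)
```

The lowest bit lands at $0407 and the highest at $0400, so the digits read most significant bit first, and the cycle total matches the 48 quoted above.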

2024-01-11 19:02

TWW

Registered: Jul 2009
Posts: 541

I watched the video, and it seemed that the criterion was that it should be possible to invoke it from normal BASIC, and that printing the result to the screen should update the cursor etc. like the PRINT command does. I'm not too familiar with BASIC, but I figure 'unrolling' some of the loops he tested should help. Also, the assembler code snippet relied heavily on BASIC to fetch parameters and print. A specialized routine would be much faster, but hell, what a hassle :D

2024-01-11 20:48

spider-j

Registered: Oct 2004
Posts: 449

Quoting Martin Piper

If you're using VICE, use the monitor command "stopwatch" (or "sw") to reset and display the stopwatch. By adding a couple of breakpoints it can cover the span from the start of the RUN command until it returns. Also remember to SEI early to avoid the regular IRQ.

Well in the mentioned video the TI$ basic snippet is used to track the time. So I stuck to it, because otherwise it wouldn't be comparable at all.

2024-01-11 21:03

ws

Registered: Apr 2012
Posts: 229

Am I missing something here, or is most of the time actually wasted by the KERNAL scrolling the output? Why not poke the results to a fixed location on the screen? Or even use chr$(19), aka home?

2024-01-11 21:14

chatGPZ

Registered: Dec 2001
Posts: 11146

That's probably correct - so just a faster scroll routine would speed this up significantly :)

2024-01-11 22:15

ws

Registered: Apr 2012
Posts: 229

nice challenge, tho

2024-01-12 00:08

JackAsser

Registered: Jun 2002
Posts:

Quote: This probably does not count, but something like this seems obvious if speed is really the priority. This one assumes that you have a custom charset that has a "0" char at position 0 and a "1" char at position 1 in the charset, and it will always print the result at a specific fixed location on the screen.

;a register contains the byte to display as binary
ldx #1
sax $0407
lsr a
sax $0406
lsr a
sax $0405
lsr a
sax $0404
lsr a
sax $0403
lsr a
sax $0402
lsr a
sax $0401
lsr a
sax $0400

=8*2+8*4=48 cycles, so you could do it thousands of times in a single second.

SAX is one of the illegal opcodes, in case someone happens to be unfamiliar with that:
http://unusedino.de/ec64/technical/aay/c64/bsax.htm

I find this a much more interesting challenge! Can this be improved?!

Rules: it should be printed on a normal text screen, any charset is ok. One digit per char.

2024-01-12 06:46

TWW

Registered: Jul 2009
Posts: 541

Quote: I find this a much more interesting challenge! Can this be improved?!

Rules: it should be printed on a normal text screen, any charset is ok. One digit per char.

14 bytes, 16 cycles excl. call/ret:

    sta $2000
    lda #$3b
    sta $d011
    lda #$18
    sta $d018
    rts

How about bitmap with a 1x1 pixel charset :D

Ahyeah normal text screen...

But seriously, are we still talking about peeking the CIA data port (in which case 5 bits would be enough) or a general hexToBin()?

EDIT:

2 variants without resorting to charset trickery:

    // Variant #1:

    ldx #$18        // 2
    stx $0400       // 4
    stx $0401       // 4
    stx $0402       // 4
    stx $0403       // 4
    stx $0404       // 4
    stx $0405       // 4
    stx $0406       // 4
    stx $0407       // 4
    lsr             // 2
    rol $0400       // 6 ($18 << 1 plus carry gives $30 or $31)
    lsr             // 2
    rol $0401       // 6
    lsr             // 2
    rol $0402       // 6
    lsr             // 2
    rol $0403       // 6
    lsr             // 2
    rol $0404       // 6
    lsr             // 2
    rol $0405       // 6
    lsr             // 2
    rol $0406       // 6
    lsr             // 2
    rol $0407       // 6
                    // 8 x 4 + 8 x 6 + 9 x 2 = 98 cycles / 58 bytes
    // Probably has some illegal OPC voodoo potential.


    // Variant #2
    ldx #'0'                        // 2
    ldy #'1'                        // 2 -> 4 cycles 'overhead'
    lsr                             // 2
    bcc !Next0+                     // 2 / 3
    sty $0400                       // 4 -> 8/9 cycles dep. the branch
    lsr
    bcc !Next1+
!Prev0:
    sty $0401
    lsr
    bcc !Next2+
!Prev1:
    sty $0402
    lsr
    bcc !Next3+
!Prev2:
    sty $0403
    lsr
    bcc !Next4+
!Prev3:
    sty $0404
    lsr
    bcc !Next5+
!Prev4:
    sty $0405
    lsr
    bcc !Next6+
!Prev5:
    sty $0406
    lsr
    bcc !Next7+
!Prev6:
    sty $0407
    rts
!Next0:
    stx $0400
    lsr
    bcs !Prev0-
!Next1:
    stx $0401
    lsr
    bcs !Prev1-
!Next2:
    stx $0402
    lsr
    bcs !Prev2-
!Next3:
    stx $0403
    lsr
    bcs !Prev3-
!Next4:
    stx $0404
    lsr
    bcs !Prev4-
!Next5:
    stx $0405
    lsr
    bcs !Prev5-
!Next6:
    stx $0406
    lsr
    bcs !Prev6-
!Next7:
    stx $0407
    rts                     // 4 + 8 x 8 (+ up to 8) = 68 to 76 cycles excl. rts / 99 bytes

Edit 2: 2 variants with voodoo and shameless charset trickery:

    // Variant #3 with voodoo
    ldx #$18        // 2
    stx $0400       // 4
    stx $0401       // 4
    stx $0402       // 4
    stx $0403       // 4
    stx $0404       // 4
    stx $0405       // 4
    stx $0406       // 4
    stx $0407       // 4
    slo $0400       // 4
    slo $0401       // 4
    slo $0402       // 4
    slo $0403       // 4
    slo $0404       // 4
    slo $0405       // 4
    slo $0406       // 4
    slo $0407       // 4
                    // 16 x 4 + 2 = 66 cycles / 50 bytes

    // Variant #4 with voodoo & charset (space = "0", ! = "1")
    slo $0400       // 4
    slo $0401       // 4
    slo $0402       // 4
    slo $0403       // 4
    slo $0404       // 4
    slo $0405       // 4
    slo $0406       // 4
    slo $0407       // 4
                    // 8 x 4 = 32 cycles / 24 bytes
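
The cost of Variant #2 depends on the bit pattern, so a small model helps to pin down the range. This sketch assumes the control flow the listing is aiming for: an upper path that stores '1', a lower path that stores '0', and a branch that is taken (3 cycles) only when the next bit forces a switch of path. Page-crossing penalties on branches are ignored, and the function name is made up.

```python
def variant2_cycles(value):
    """Cycles for Variant #2 (excluding the final rts) for one input byte."""
    cycles = 4                      # ldx #'0' / ldy #'1'
    on_one_path = True              # execution starts on the "store '1'" path
    for bit in range(8):            # lsr shifts out bit 0 first
        b = (value >> bit) & 1
        cycles += 2                 # lsr
        # bcc on the upper path is taken for a 0 bit, bcs on the lower path for a 1 bit
        taken = (b == 0) if on_one_path else (b == 1)
        cycles += 3 if taken else 2
        on_one_path = (b == 1)
        cycles += 4                 # sty/stx absolute
    return cycles


all_costs = [variant2_cycles(v) for v in range(256)]
print(min(all_costs), max(all_costs))
```

The best case is $FF (no branch ever taken) and the worst case is $AA (every branch taken), so the spread between inputs is eight cycles.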

2024-01-12 16:59

Frantic

Registered: Mar 2003
Posts: 1629

slo $ffff is 6 cycles, not 4.

http://unusedino.de/ec64/technical/aay/c64/bslo.htm
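
Plugging the 6-cycle figure into the totals for Variants #3 and #4 is simple arithmetic; a sketch, with the other cycle counts taken from the comments in those listings. It says nothing about whether the routines produce the intended output, only what they cost.

```python
LDX_IMM = 2   # ldx #$18
STX_ABS = 4   # stx $04xx
SLO_ABS = 6   # slo $04xx, absolute addressing

variant3_cycles = LDX_IMM + 8 * STX_ABS + 8 * SLO_ABS
variant4_cycles = 8 * SLO_ABS

print(variant3_cycles, variant4_cycles)
```

With the corrected count, Variant #4 costs the same number of cycles as the SAX routine posted earlier in the thread.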

2024-01-12 20:49

Mr SQL

Registered: Feb 2023
Posts: 116

Quoting spider-j

To be fair I also used the KERNAL output routine for printing the chars.

Small solution with loops / 78 Bytes PRG:
https://trans.jansalleine.com/c64/num2binary.prg

Slightly faster solution with unrolled loops / 28974 Bytes PRG:
https://trans.jansalleine.com/c64/num2binary_unrolled.prg

EDIT: corrected ror -> rol.

Both good solutions! Keep in mind we also have to print the decimal number with a space and then the binary string followed by a carriage return for each number from 0-255. This will add to the time.

I put my BASIC solution in the comments on the video and noticed, as observed in this thread, that the KERNAL CHAROUT routine is consuming most of the time. Just printing 0's and 1's as numbers instead of "0" and "1" as strings increased the time of my BASIC solution by 12 seconds; the KERNAL routine handles BASIC strings faster than BASIC numbers.

My idea for an optimized assembly solution would be to use 8 page aligned 256 byte tables of 0 and 1 characters, one for each bit.

This would allow a shared index for 8 lda's of 4 cycles each without having to branch after each load, just pushing the values loaded to CHAROUT.
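
To make the table layout concrete, here is a minimal Python sketch of the idea, not an implementation for the C64. It assumes the tables hold the character codes $30 and $31 for "0" and "1"; the function names are made up for illustration.

```python
def build_bit_tables():
    """tables[b][n] is the character code for bit b of n ('0' = 0x30, '1' = 0x31)."""
    return [bytes(0x30 | ((n >> b) & 1) for n in range(256)) for b in range(8)]


def binary_string(n, tables):
    # One lookup per bit with the shared index n, most significant bit first.
    # No shifting or branching is needed at print time.
    return "".join(chr(tables[b][n]) for b in range(7, -1, -1))


tables = build_bit_tables()
print(binary_string(5, tables))
```

Each table is exactly 256 bytes, so the value being printed can serve directly as the index into all eight of them.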

2024-01-12 23:52

spider-j

Registered: Oct 2004
Posts: 449

Quoting Mr SQL

My idea for an optimized assembly solution would be to use 8 page aligned 256 byte tables of 0 and 1 characters, one for each bit.

This would allow a shared index for 8 lda's of 4 cycles each without having to branch after each load, just pushing the values loaded to CHAROUT.

So I did it like you suggested, plus the decimal number and a clear screen, to get the "exact" output like in the video (I don't know what CHR$(5) does).

One could also unroll the main loop, but this won't do much because of the imprecise TI$ measuring.

https://trans.jansalleine.com/c64/num2binary_table.prg
3073 Bytes

                    !cpu 6510
; ==============================================================================
CRSRX       = 0xD3
CHAROUT     = 0xF1CA
CLRSCR      = 0xE544
; ==============================================================================
                    *= 0x0801
                    ; basic TI$ timer wrapper program:
                    ; --------------------------------
                    ; 0 TI$="000000":SYS2092
                    ; 1 PRINT"TIME:"TI/60
                    ; --------------------------------
                    ; RESULT: ~9
                    !byte 0x18, 0x08, 0x00, 0x00
                    !byte 0x54, 0x49, 0x24, 0xB2
                    !byte 0x22, 0x30, 0x30, 0x30
                    !byte 0x30, 0x30, 0x30, 0x22
                    !byte 0x3A, 0x9E, 0x32, 0x30
                    !byte 0x39, 0x32, 0x00, 0x2A
                    !byte 0x08, 0x01, 0x00, 0x99
                    !byte 0x22, 0x54, 0x49, 0x4D
                    !byte 0x45, 0x3A, 0x22, 0x54
                    !byte 0x49, 0xAD, 0x36, 0x30
                    !byte 0x00, 0x00, 0x00
; ==============================================================================
                    *= 0x082C
                    jsr CLRSCR
                    ldx #0
-                   lda #' '
                    jsr CHAROUT
                    lda dec2,x
                    jsr CHAROUT
                    lda dec1,x
                    jsr CHAROUT
                    lda dec0,x
                    jsr CHAROUT
                    inc CRSRX
                    inc CRSRX
                    inc CRSRX
                    inc CRSRX
                    inc CRSRX
                    inc CRSRX
                    lda bit7,x
                    jsr CHAROUT
                    lda bit6,x
                    jsr CHAROUT
                    lda bit5,x
                    jsr CHAROUT
                    lda bit4,x
                    jsr CHAROUT
                    lda bit3,x
                    jsr CHAROUT
                    lda bit2,x
                    jsr CHAROUT
                    lda bit1,x
                    jsr CHAROUT
                    lda bit0,x
                    jsr CHAROUT
                    lda #0x0D
                    jsr CHAROUT
                    inx
                    bne -
                    rts
; ==============================================================================
                    !align 255, 0, 0
bit7:               !for i, 0, 255 {
                         !byte ((i AND %10000000) >> 7) OR 0x30
                    }
bit6:               !for i, 0, 255 {
                         !byte ((i AND %01000000) >> 6) OR 0x30
                    }
bit5:               !for i, 0, 255 {
                         !byte ((i AND %00100000) >> 5) OR 0x30
                    }
bit4:               !for i, 0, 255 {
                         !byte ((i AND %00010000) >> 4) OR 0x30
                    }
bit3:               !for i, 0, 255 {
                         !byte ((i AND %00001000) >> 3) OR 0x30
                    }
bit2:               !for i, 0, 255 {
                         !byte ((i AND %00000100) >> 2) OR 0x30
                    }
bit1:               !for i, 0, 255 {
                         !byte ((i AND %00000010) >> 1) OR 0x30
                    }
bit0:               !for i, 0, 255 {
                         !byte ((i AND %00000001) >> 0) OR 0x30
                    }
; ==============================================================================
dec2:               !for i, 0, 255 {
                         !if i < 100 {
                              !byte 0x20
                         } else if i < 200 {
                              !byte 0x31
                         } else {
                              !byte 0x32
                         }
                    }
dec1:               !for i, 0, 255 {
                         !if i < 10 {
                              !byte 0x20
                         } else {
                              !byte ((i / 10) - ((i / 100) * 10)) OR 0x30
                         }
                    }
dec0:               !for i, 0, 255 {
                         !byte (i % 10) OR 0x30
                    }
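
As a cross-check of the table expressions and the quoted file size, here is a small Python sketch. It mirrors the dec2/dec1/dec0 expressions from the listing and assumes standard 6502 instruction lengths, a two-byte load address, and that `!align 255, 0, 0` pads up to the next page boundary; the function names are made up.

```python
def dec_columns(i):
    """Mirror of the dec2/dec1/dec0 expressions: three characters per value."""
    dec2 = 0x20 if i < 100 else (0x31 if i < 200 else 0x32)
    dec1 = 0x20 if i < 10 else ((i // 10) - ((i // 100) * 10)) | 0x30
    dec0 = (i % 10) | 0x30
    return chr(dec2) + chr(dec1) + chr(dec0)


def table_prg_size():
    # jsr, ldx #, (lda # + jsr), 3 x (lda abs,x + jsr), 6 x inc zp,
    # 8 x (lda abs,x + jsr), (lda # + jsr), inx, bne, rts
    code = 3 + 2 + 5 + 3 * 6 + 6 * 2 + 8 * 6 + 5 + 1 + 2 + 1
    code_end = 0x082C + code
    tables_start = (code_end + 0xFF) & ~0xFF     # pad to the next page boundary
    tables_end = tables_start + 11 * 256         # 8 bit tables + 3 decimal tables
    return 2 + (tables_end - 0x0801)             # load address + file contents


print(repr(dec_columns(7)), repr(dec_columns(105)), table_prg_size())
```

The decimal tables give a right-aligned, space-padded three-character number for every value, and the computed size agrees with the 3073 bytes stated above.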

As already stated in this thread: KERNAL char out / print is the most expensive operation in this whole scenario anyway.

But without changing the "rules" like others suggested, there's no way around that other than implementing your own faster routines, which goes a little beyond what such small "exercises" usually want to accomplish.

That guy making those isn't a scener and has a very "oldskool" approach to everything – working with your C64 as "intended" by the user manual :-) It's sometimes still kind of fun, but I usually also fast forward a lot when watching videos from him.

2024-01-13 05:12

Mr SQL

Registered: Feb 2023
Posts: 116

Very cool! I tested this assembly version at 8.75 seconds.

This is probably as fast as we can make it run in asm if we follow Robin's exercise strictly.

I like the way you had the Assembler create the tables! Which Assembler are you using?

I was motivated to try that in BASIC and got the BASIC version down to 12 seconds, including the pre-calc that loads the array. It uses a clustered index on a single table, since BASIC is a high-level language that cannot handle narrow tables as efficiently.

The first part of the PRG builds the DATA statements, which is already done; GOTO 1000 loads the array and iterates:

https://relationalframework.com/basic12seconds.prg

2024-01-13 14:10

Street Tuff

Registered: Feb 2002
Posts: 88

>> I like the way you had the Assembler create the tables! Which Assembler are you using?

That's the ACME assembler. https://sourceforge.net/projects/acme-crossass/

2024-01-13 17:02

ChristopherJam

Registered: Aug 2004
Posts: 1381

This one takes 13.18 seconds (as compared to just doing print:return in the printing routine, which takes 8.91 - the binary conversion+print is hence around 4.27). So yeah, ws is correct. It's dominated by screen scroll time.

Those times are if you run after loading from reset - if you start at the bottom of the screen everything is slower, because you start scrolling immediately.

0 gosub9:ti$="000000":goto2
1 printd$(i/16)d$(iand15):return
2 fori=0to255:gosub1:next
3 print"time:"ti/60:end
7 data"0000","0001","0010","0011","0100","0101","0110","0111"
8 data"1000","1001","1010","1011","1100","1101","1110","1111"
9 dimd$(16):fori=0to15:readd$(i):next:return
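
For readers who don't speak BASIC V2, the lookup in line 1 amounts to the following; a Python sketch of the same idea, assuming that the fractional array subscript i/16 is truncated to an integer. The last line redoes the subtraction from the timing figures above.

```python
# The same sixteen strings as the DATA lines: "0000" .. "1111"
NIBBLES = [format(n, "04b") for n in range(16)]


def to_binary(i):
    # d$(i/16) d$(i and 15): high nibble string followed by low nibble string
    return NIBBLES[i // 16] + NIBBLES[i & 15]


print(to_binary(0xA5))

# Time attributable to conversion + print, from the two measurements above
conversion_share = round(13.18 - 8.91, 2)
print(conversion_share)
```

Working in nibbles keeps the table at 16 entries and needs only two lookups and one PRINT per number.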

2024-01-13 22:42

spider-j

Registered: Oct 2004
Posts: 449

Quoting Mr SQL

I like the way you had the Assembler create the tables! Which Assembler are you using?

As Street Tuff already said: it's ACME. But loops and arithmetic to generate / calculate tables etc. are quite a common feature in most (if not all) modern PC cross assemblers.

2024-01-16 00:19

Krill

Registered: Apr 2002
Posts: 2854

Quoting spider-j

But loops and arithmetic to generate / calculate tables etc. are quite a common feature in most (if not all) modern PC cross assemblers.

The hard part is knowing when to drop that hammer and search for another tool to beat that non-nail. =)

2024-01-16 15:32

chatGPZ

Registered: Dec 2001
Posts: 11146

As soon as JavaScript is involved - what's so hard about it? :-)

2024-01-16 23:21

spider-j

Registered: Oct 2004
Posts: 449

Quoting Krill

The hard part is knowing when to drop that hammer and search for another tool to beat that non-nail. =)

Don't get your comment. Do you mean one should generate tables elsewhere and then include a huge list of bytes in their sources? I find it much more readable when tables are generated to see the calculation directly in the assembly source.

2024-01-17 00:42

chatGPZ

Registered: Dec 2001
Posts: 11146

There are tables and then there are tables.

2024-01-17 09:53

spider-j

Registered: Oct 2004
Posts: 449

WAT? You guys seem to have fun confusing me :-P

2024-01-17 11:50

Mr SQL

Registered: Feb 2023
Posts: 116

X2 those seem like tantalizing hints to a more efficient solution, I would like to hear more.

2024-01-18 05:26

ChristopherJam

Registered: Aug 2004
Posts: 1381

Quote: Quoting Krill
The hard part is knowing when to drop that hammer and search for another tool to beat that non-nail. =)

Don't get your comment. Do you mean one should generate tables elsewhere and then include a huge list of bytes in their sources? I find it much more readable when tables are generated to see the calculation directly in the assembly source.

Yes to generate tables elsewhere most of the time (though for a case as simple as this I'd probably inline a loop plus expression too), but if you mean "do you generate the table once then copy and paste a huge list of .byte statements into your source code" then definitely not!

Any time there's a moderate amount of complexity to generating some data (e.g. optimising a koala), there's a Python script or Rust program that generates either a .prg to pass to the cruncher, or a separate source file for the assembly source to .include or .incbin (depending on whether it's labels or just binary data), or a combination of the two.

A Makefile then ensures the tables are regenerated any time the table generator changes (or any other dependencies, eg new artwork), and that that triggers reassembling anything that depends on the table contents.

Same applies to any generated code, eg unrolled loops, especially if the unrolled code is reordered depending on the input data.
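
A minimal sketch of such an external generator, using the eight bit tables from earlier in this thread as the data. The output file name `bittables.bin` and the function names are hypothetical; the assembly source would then pull the file in with its binary include directive.

```python
from pathlib import Path


def build_tables():
    """Eight 256-byte tables, bit 7 first, each holding '0' (0x30) or '1' (0x31)."""
    out = bytearray()
    for bit in range(7, -1, -1):
        out += bytes(0x30 | ((n >> bit) & 1) for n in range(256))
    return bytes(out)


def write_tables(path="bittables.bin"):   # hypothetical output name
    """Write the tables to disk and return the number of bytes written."""
    data = build_tables()
    Path(path).write_bytes(data)
    return len(data)


if __name__ == "__main__":
    print(write_tables(), "bytes written")
```

In a Makefile the binary would simply be a target that depends on this script, so the tables are rebuilt whenever the generator changes.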

2024-01-18 10:45

Martin Piper

Registered: Nov 2007
Posts: 645

If the data is very complex, like compressed audio or video, then I usually offline the conversion and include the binary data as part of the final format build process. So for a cartridge it gets added to cart banks, and a label file and file structure are generated that the assembly code can reference.

If the data is moderately complex, like XML map data from a GUI tool for a game level, I will probably offline convert it to generate assembly include files with some binary includes.

If the data is simple, like SpritePad or CharPad files, then I'll probably use binary inclusion with simple inline Python in the assembly.

It usually boils down to how complex is it to parse and how long does it take to generate, combined with if the assembly code needs a lot of labels to reference the data.

2024-01-18 10:49

Oswald

Registered: Apr 2002
Posts: 5027

It's like in Ghostbusters: don't cross the streams ;) Asm and data generators are two different things.

2024-01-18 13:38

Krill

Registered: Apr 2002
Posts: 2854

Quoting spider-j

Do you mean one should generate tables elsewhere

As pointed out by CJam and others, there's a magical border of complexity when you really should generate the data externally. (And then just incbin that binary, of course, not generate a wall of asm .byte statements.)

And on the other complexity end, using loop macros etc. is just a lazy way to code, producing bloat, as that stuff (simple tables and unrolled code) can most often be generated at init-run-time, producing smaller output.

So, there are good reasons to use loop macros etc., but not that many.

2024-01-18 13:51

chatGPZ

Registered: Dec 2001
Posts: 11146

Quote:

And then just incbin that binary, of course, not generate a wall of asm .byte statements.

Cracks me up every single time I see people doing this... wth

2024-01-18 14:35

spider-j

Registered: Oct 2004
Posts: 449

Well, while these may all be good ideas in theory, I actually don't know how to "incbin" into CSDb forums for posting a quick and dirty code snippet ;-)

2024-01-18 14:58

Krill

Registered: Apr 2002
Posts: 2854

Quoting spider-j

Well, while these may all be good ideas in theory, I actually don't know how to "incbin" into CSDb forums for posting a quick and dirty code snippet ;-)

Nobody said anything like that.

And code-golf or general prototyping is a good reason for loop macros. =)

2024-01-18 21:39

spider-j

Registered: Oct 2004
Posts: 449

Quoting Krill

Nobody said anything like that.

And code-golf or general prototyping is a good reason for loop macros. =)

Okay, now I see. I got you completely wrong. I thought it was about the code I posted here.

2024-01-18 22:49

Mr SQL

Registered: Feb 2023
Posts: 116

Quoting ChristopherJam

Yes to generate tables elsewhere most of the time (though for a case as simple as this I'd probably inline a loop plus expression too), but if you mean "do you generate the table once then copy and paste a huge list of .byte statements into your source code" then definitely not!

Who says you can't?

I would be more interested to see examples for optimizing Robin's coding exercise with any semantics.

2024-01-18 22:56

Mr SQL

Registered: Feb 2023
Posts: 116

Quoting ChristopherJam

This one takes 13.18 seconds (as compared to just doing print:return in the printing routine, which takes 8.91 - the binary conversion+print is hence around 4.27). So yeah, ws is correct. It's dominated by screen scroll time.

Those times are if you run after loading from reset - if you start at the bottom of the screen everything is slower, because you start scrolling immediately.

0 gosub9:ti$="000000":goto2
1 printd$(i/16)d$(iand15):return
2 fori=0to255:gosub1:next
3 print"time:"ti/60:end
7 data"0000","0001","0010","0011","0100","0101","0110","0111"
8 data"1000","1001","1010","1011","1100","1101","1110","1111"
9 dimd$(16):fori=0to15:readd$(i):next:return

This is another excellent design. Adding the decimal form and spacing to match Robin's output will add slightly to the print time.

2024-01-19 03:34

aeeben

Registered: May 2002
Posts: 42

10 TI$="000000":E=15:DIMB$(E):SYS58692:FORA=.TOE:READB$(A):NEXT:D=1/A
20 FORA=.TO255:PRINTB$(A*D)B$(AANDE):NEXT:PRINTTI/60:DATA0000,0001,0010
30 DATA0011,0100,0101,0110,0111,1000,1001,1010,1011,1100,1101,1110,1111

-> 11.8166667 sec.

Multiply by 1/16 is a bit faster than division. Constants in variables is a bit faster. I didn't see any gains from using a subroutine.
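For readers who don't speak tokenised BASIC, the nybble lookup can be sketched in Python roughly as follows. This is only an illustration of the technique, not a timing claim. The names NYBBLES and to_binary are made up for the sketch, and it assumes that BASIC truncates the fractional array index A*D in the same way the shift does here:

```python
# The same 16 four-character strings that the BASIC DATA lines hold.
NYBBLES = [format(n, "04b") for n in range(16)]


def to_binary(value: int) -> str:
    """Return the 8-character binary string for a byte via two table lookups."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one byte")
    high = value >> 4  # plays the role of B$(A*D), with D = 1/16
    low = value & 15   # plays the role of B$(A AND E), with E = 15
    return NYBBLES[high] + NYBBLES[low]
```

D ends up as 1/16 in the listing because A is 16 once the READ loop has finished.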

2024-01-19 04:33

aeeben

Registered: May 2002
Posts: 42

For laughs, I wanted to try this on Bunny Basic too :D

1 TI$="000000":SYS49152,2:PRINTTI/60:END
2 CLR:PRINT"{CLR}";:C=48:B=49:G=128:H=64:I=32:J=16:K=8:L=4:M=2:N=1:F=256:O=65490
3 A=C:D=VANDG:IFD=GTHENA=B
4 SYSO:A=C:D=VANDH:IFD=HTHENA=B
5 SYSO:A=C:D=VANDI:IFD=ITHENA=B
6 SYSO:A=C:D=VANDJ:IFD=JTHENA=B
7 SYSO:A=C:D=VANDK:IFD=KTHENA=B
8 SYSO:A=C:D=VANDL:IFD=LTHENA=B
9 SYSO:A=C:D=VANDM:IFD=MTHENA=B
10 SYSO:A=C:D=VANDN:IFD=NTHENA=B
11 PRINTCHR$(A)
12 V=V+1:IFV<>FTHENGOTO3
13 END

- Bunny Basic interpreter (bunnybasic64.prg) is loaded at $c000
- Type RUN to start from line 1 in slow BASIC mode
- SYS49152,2 calls Bunny Basic program at line 2 and returns to slow BASIC at line 13 (END), to print out the value TI/60
- CLR clears all Bunny Basic variables
- SYSO / SYS65490 is CHROUT. Bunny Basic puts low 8 bits of variable A in Accumulator at SYS

-> 11.6 seconds, but this is all bitwise operations without the nybble lookup, and KERNAL screen scrolling is taking most of the time
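The per-bit test in lines 3 to 10 boils down to the following, sketched here in Python purely to show the logic. MASKS and bits_msb_first are hypothetical names, and the sketch collects the digits into a string instead of calling CHROUT once per digit:

```python
# Same masks as the variables G, H, I, J, K, L, M, N in the listing.
MASKS = (128, 64, 32, 16, 8, 4, 2, 1)


def bits_msb_first(value: int) -> str:
    """Test each bit from the most significant down and collect '0'/'1'."""
    digits = []
    for mask in MASKS:
        # Mirrors A=C:D=V AND mask:IF D=mask THEN A=B
        digits.append("1" if (value & mask) == mask else "0")
    return "".join(digits)
```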

Let's try with nybbles:

1 TI$="000000":SYS49152,2:PRINTTI/60:END
2 CLR:PRINT"{CLR}";:B=16:F=256:E=15
3 A=V/B:A=A+B:GOSUBA:A=VANDE:A=A+B:GOSUBA:PRINT:V=V+1:IFV<>FTHENGOTO3
4 END
16 PRINT"0000";:RETURN
17 PRINT"0001";:RETURN
18 PRINT"0010";:RETURN
19 PRINT"0011";:RETURN
20 PRINT"0100";:RETURN
21 PRINT"0101";:RETURN
22 PRINT"0110";:RETURN
23 PRINT"0111";:RETURN
24 PRINT"1000";:RETURN
25 PRINT"1001";:RETURN
26 PRINT"1010";:RETURN
27 PRINT"1011";:RETURN
28 PRINT"1100";:RETURN
29 PRINT"1101";:RETURN
30 PRINT"1110";:RETURN
31 PRINT"1111";:RETURN

-> 10.3 seconds (~2.4 seconds if you eliminate screen scrolling)

2024-01-19 11:29

Mr SQL

Registered: Feb 2023
Posts: 116

Quoting aeeben

10 TI$="000000":E=15:DIMB$(E):SYS58692:FORA=.TOE:READB$(A):NEXT:D=1/A
20 FORA=.TO255:PRINTB$(A*D)B$(AANDE):NEXT:PRINTTI/60:DATA0000,0001,0010
30 DATA0011,0100,0101,0110,0111,1000,1001,1010,1011,1100,1101,1110,1111

-> 11.8166667 sec.

Multiply by 1/16 is a bit faster than division. Constants in variables is a bit faster. I didn't see any gains from using a subroutine.

Great solutions! I ran this at 11.5 seconds.

But when I added the decimal form and the spaces from Robin's example, it went up to 14.5 seconds:

10ti$="000000":e=15:dimb$(e):sys58692:fora=0toe:readb$(a):next:d=1/a
20 fora=0to255:? a;" ";b$(a*d)b$(a and e):next:printti/60:data0000,0001,0010
30data0011,0100,0101,0110,0111,1000,1001,1010,1011,1100,1101,1110,1111

Converting a to a string first would probably improve this time, since the print routine handles strings faster than numeric variables.

Bunny Basic is faster than CBM BASIC because it is integer-based.
