[CSDb] - User Forums - Long division/modulo with byte-size divisor

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Long division/modulo with byte-size divisor

2021-06-13 12:42

Krill

Registered: Apr 2002
Posts: 2982

Long division/modulo with byte-size divisor

So i needed something to divide a large integer (160 or more bits) by a small integer (5 bits) while also performing a modulo operation.

After a few rounds of optimisation, turns out the 6502 implementation is surprisingly compact and fast:

            ; dividend is in BIGNUM, little-endian
            ; divisor is in accu
            sta DIVISOR
            lda #0
            ldx #BIGNUMBYTES
longdivmod  ldy #8
            asl BIGNUM - 1,x
-           rol
            cmp DIVISOR
            bcc +
            sbc DIVISOR
+           rol BIGNUM - 1,x
            dey
            bne -
            dex
            bne longdivmod
            ; quotient is in BIGNUM, little-endian
            ; remainder is in accu

This runs in O(n) linear time. The routine can be executed continuously on the dividend/quotient to extract modulo values.

It's also possible to have a big-endian long integer argument/result by simply iterating over the byte array in reverse order.

Of course, can slightly optimise performance (2 cycles per output bit) by sacrificing a few bytes to self-modify the two DIVISOR arguments with immediate operands. Or unroll the entire thing.

This shall go to Codebase64 at some point, of course, but there might be some more optimisation opportunities.

Also i think there are some requirements on the arguments, such that their MSB must be clear or so.

2021-06-13 22:24

Oswald

Registered: Apr 2002
Posts: 5095

I guess most of us like me still dont understand how exactly diving like this works. I would need to invest like 20 minutes with google to get back to understanding. a little writeup on the basic idea and how did you end up on the routine would help.

2021-06-14 03:51

Quiss

Registered: Nov 2016
Posts: 43

For 5 bits, there are only 32 possibilities. If you're willing to store a piece of code for each, some of them can be made faster.
(Powers of two, obviously. Also, in the first iteration of x, the "sbc" won't get executed until 2^(8-y) >= divisor, so you can do a preloop that just rols)

That is, if you're doing this on the C64.
If this is some drive GCR thing (and the 5 bit provision makes me think it might be), then of course memory's too precious. But you might still be able to squeeze out a few cycles with a log2(x) lookup table, for x=0..31.

(EDIT: Actually, if memory's not an issue, you can have two 8kb lookup tables, for div and mod of 8 bit by 5 bit. You'll still have to shift, though.)

2021-06-14 10:06

NoobTracker
Account closed

Registered: Apr 2021
Posts: 14

How would such a table improve speed? As far as I can tell, it could only be used for the first iteration.

2021-06-14 10:10

Krill

Registered: Apr 2002
Posts: 2982

Quoting Oswald

a little writeup on the basic idea and how did you end up on the routine would help.

I started with a basic plain bitwise-division routine, expanded it for bignum dividends, then chipped away bits and pieces until i ended up with this. =) It was rather mechanical, really.
But a little later after posting, i noticed it's pretty much the same algorithm as Graham came up with for line slope divison (16 / 8 = 16): https://codebase64.org/doku.php?id=base:8bit_divide_8bit_product - filed somewhat wrongly or misleading there.
There are slight differences concerning the extra shift once per input byte, though.

Quoting Quiss

For 5 bits, there are only 32 possibilities. If you're willing to store a piece of code for each, some of them can be made faster.
(Powers of two, obviously. Also, in the first iteration of x, the "sbc" won't get executed until 2^(8-y) >= divisor, so you can do a preloop that just rols)

As i'm continuously extracting values, dividend and quotient are getting smaller and smaller. So the size of the byte array can be decreased whenever a most significant byte becomes zero. This should neatly halve overall execution time.
Prerolling for the first array byte comes with some additional cost, not sure when within those 8 bits that would amortise.

Quoting Quiss

That is, if you're doing this on the C64.
If this is some drive GCR thing (and the 5 bit provision makes me think it might be), then of course memory's too precious. But you might still be able to squeeze out a few cycles with a log2(x) lookup table, for x=0..31.

(EDIT: Actually, if memory's not an issue, you can have two 8kb lookup tables, for div and mod of 8 bit by 5 bit. You'll still have to shift, though.)

This is for C-64 itself, not the drive, not GCR. =)

It is "2 to 5 bits of divisor", actually. I'm extracting permutation indices from a large number, and the sets to permute have 3 to slightly over 16 items. Unranking the permutations can be done in O(n) as well, and in-place. :)

I'm rather size-constrained there, though, so cannot afford unrolling or big tables.
It's all not really time-critical, however, and i must strictly optimise for size.

2021-06-14 10:14

Krill

Registered: Apr 2002
Posts: 2982

Quoting NoobTracker

How would such a table improve speed? As far as I can tell, it could only be used for the first iteration.

Or the very last iteration? :) At least i don't see how this table would apply to larger arguments than a byte.

2021-06-14 10:27

Oswald

Registered: Apr 2002
Posts: 5095

Gunnar, "line slope divison" I'have discovered long ago that log/exp is accurate enough for that.

2021-06-14 10:27

NoobTracker
Account closed

Registered: Apr 2021
Posts: 14

Quote: Quoting NoobTracker
How would such a table improve speed? As far as I can tell, it could only be used for the first iteration.
Or the very last iteration? :) At least i don't see how this table would apply to larger arguments than a byte.

I actually mean the first one, the most significant byte, because that is the only byte where you have a clear accumulator. You store the division result and you load the modulo value into A. After that, you don't know how the previous byte influenced A and the lookup table only works for A=0, as far as I can tell. I might be wrong.

2021-06-14 10:31

Krill

Registered: Apr 2002
Posts: 2982

Quoting Oswald

Gunnar, "line slope divison" I'have discovered long ago that log/exp is accurate enough for that.

Try that on something similar, like slopes in a Gouraud shader. You'll find you really need the accuracy. And then there are line routines with subpixel accuracy, too. :) Anyways, different topic.

2021-06-14 11:37

ChristopherJam

Registered: Aug 2004
Posts: 1409

Nice work eliminating the 20 byte ROL present in the naive implementation :)

My first run looks like

div:
    ldy#160
iterl:
    clc
    ldx#20
:   rol acc,x
    dex
    bpl :-

    sec
    lda acc
    sbc den
    bcc nosub
    sta acc
    inc acc+20
nosub:
    dey
    bne iterl
    rts

and even then it feels like I should be able to lose the increment in favour of just shifting in the carry from the preceding comparison.

2021-06-14 16:26

Quiss

Registered: Nov 2016
Posts: 43

Quoting Krill

It is "2 to 5 bits of divisor", actually. I'm extracting permutation indices from a large number, and the sets to permute have 3 to slightly over 16 items. Unranking the permutations can be done in O(n) as well, and in-place. :)

Huh. I'll be curious to see what kind of 4k effect you need these kinds of cryptographic shenanigans for. :)

Quoting Krill

Or the very last iteration? :) At least i don't see how this table would apply to larger arguments than a byte.

It depends on the (number of bits in the) divisor. Take, for example, divisor = 3.

Let x = BIGNUM[0]. You look up divmod(x, 3), store the (temporary) result in RESULT[0]. That leaves you with a remainder in the range 0..2, so two bits. You read another 6 bits from BIGNUM[1] (so x is now mod << 6 | BIGNUM[1] >> 2). You look up divmod(x, 3) again for the new x. The div result of that you shift to the left by 2 and add it to RESULT[0]:RESULT[1]. Now you read another 6 bits from BIGNUM[1] and BIGNUM[2], and so on.

For all the shifting, it helps to have the usual shift-right-by-6, and-by-3-and-shift-left-by-4, shift-left-by-6 (etc.) lookup tables.

2021-06-14 19:09

Quiss

Registered: Nov 2016
Posts: 43

Actually, for small divisors like 3, you could invest a full 2kb and have a (here) 10 bit lookup for div and mod, since that brings it down to

    # [init zp_{mod,div} to point to {mod,div}_lookup]
loop:
    ldy BIGNUM, x
    lda (zp_div), y
    sta RESULT, x
    lda (zp_mod), y
    ora #(>div_lookup)
    sta zp_div+1
    eor #(>div_lookup) ^ (>mod_lookup)
    sta zp_mod+1
    inx
    cpx #BIGNUMBYTES
    bne loop

(Several optimizations are possible, of course. Like having the mod_lookup table already to the ora)

2021-06-14 19:35

chatGPZ

Registered: Dec 2001
Posts: 11391

All of you should shut up and make a demo about it :)

2021-06-29 21:21

Fred

Registered: Feb 2003
Posts: 287

Here's another division routine with byte-size divisor. It has one loop less compared to Krill's routine and is a few bytes shorter:

; div16
;   input:
;   - 16-bit number in dividend, little-endian
;   - 8-bit in divisor
;   output:
;   - 16-bit number in dividend
;   - AC: remainder
;   - XR: 0
;   - YR: unchanged
div16           ldx #16
                lda #0
-               asl dividend
                rol dividend+1
                rol a
                cmp divisor
                bcc +
                sbc divisor
                inc dividend
+               dex
                bne -
                rts
divisor         .byte 0
dividend        .byte 0, 0

2021-06-29 22:39

Krill

Registered: Apr 2002
Posts: 2982

Yeah cool, but the point was having arbitrarily large dividends to divide by an 8-bit divisor. =)

2021-06-30 08:45

Fred

Registered: Feb 2003
Posts: 287

Right, the routine I provided is 16 bits dividend only.

2021-06-30 21:05

Fred

Registered: Feb 2003
Posts: 287

Here is my version of a long division, specified in bytes in DIV_IN_BYTES:

; div
;   input:
;   - n-bytes dividend, little-endian
;   - 8-bit divisor
;   output:
;   - n-bytes result stored in dividend
;   - AC: remainder
;   - XR: 0
;   - YR: 0

DIV_IN_BYTES    = 20

div             ldy #DIV_IN_BYTES * 8
                lda #0
-               clc
                ldx #-DIV_IN_BYTES & $ff
-               rol dividend + DIV_IN_BYTES - $100,x
                inx
                bmi -

                rol
                cmp divisor
                bcc +
                sbc divisor
                inc dividend
+               dey
                bne --
                rts
                
divisor         .byte $00
dividend        .byte $00, $00, $00, $00, $00, $00, $00, $00
                .byte $00, $00, $00, $00, $00, $00, $00, $00
                .byte $00, $00, $00, $00

2021-06-30 22:33

Krill

Registered: Apr 2002
Posts: 2982

At a quick glance, this seems to run in O(n^2) quadratic time rather than O(n) linear.

With a growing input dividend, both the inner and the outer loops take more iterations.

So i guess this will run slower.

2021-07-01 13:20

Monte Carlos

Registered: Jun 2004
Posts: 364

Using both cmp as well as sbc seems subject to possible optimization.

2021-07-01 14:01

Krill

Registered: Apr 2002
Posts: 2982

Quoting Monte Carlos

Using both cmp as well as sbc seems subject to possible optimization.

If the X register can be spared, replacing

            cmp DIVISOR
            bcc +
            sbc DIVISOR
+

with something like

            tax
            sbx #DIVISOR
            bcc +
            txa
+

might be possible, but this hardly looks like it would save cycles on average. =)

2021-07-01 14:03

Krill

Registered: Apr 2002
Posts: 2982

That said, the long division as in the OP is quite swift, but the long multiplication i also need in the same context turned out to be considerably slower. :) (I guess mostly because it's executed a lot more often, though.)

2021-07-02 10:57

Fred

Registered: Feb 2003
Posts: 287

Quote: At a quick glance, this seems to run in O(n^2) quadratic time rather than O(n) linear.

With a growing input dividend, both the inner and the outer loops take more iterations.

So i guess this will run slower.

Correct. I now see the beauty of your version :-)

Some correction on your routine, I see that the BIGNUMBYTES isn't taken into account correctly. If you set it to e.g. 20, the number of bytes processed is 19. See my correction below.

To speed up the algorithm, I think it is best to skip zeros before going into the loop in case the number of bytes is always fixed and the value is low.

Replace:

            ldx #BIGNUMBYTES

with:

            ldx #BIGNUMBYTES + 1

-           ldy BIGNUM - 1,x
            bne longdivmod
            dex
            bne -

2021-07-02 13:50

Krill

Registered: Apr 2002
Posts: 2982

Quoting Fred

Some correction on your routine

With BIGNUMBYTES = 1

            ldx #2; BIGNUMBYTES + 1

-           ldy BIGNUM - 1,x; BIGNUM + 1, BIGNUM + 0
            bne longdivmod
            dex
            bne -

this now iterates over 2 bignum-bytes. Likewise one to many for all other settings.

So, no, i think my code was okay.

Quoting Fred

To speed up the algorithm, I think it is best to skip zeros before going into the loop in case the number of bytes is always fixed and the value is low.

Yes. =)
Quoting Krill

As i'm continuously extracting values, dividend and quotient are getting smaller and smaller. So the size of the byte array can be decreased whenever a most significant byte becomes zero. This should neatly halve overall execution time.

But i didn't add this optimisation, as i need the bytes elsewhere and it's already quick enough for my purposes. :)

2021-07-02 14:43

Fred

Registered: Feb 2003
Posts: 287

Quote: Quoting Fred
Some correction on your routine
With BIGNUMBYTES = 1
ldx #2; BIGNUMBYTES + 1 - ldy BIGNUM - 1,x; BIGNUM + 1, BIGNUM + 0 bne longdivmod dex bne -
this now iterates over 2 bignum-bytes. Likewise one to many for all other settings.

So, no, i think my code was okay.

Quoting Fred
To speed up the algorithm, I think it is best to skip zeros before going into the loop in case the number of bytes is always fixed and the value is low.
Yes. =)
Quoting Krill
As i'm continuously extracting values, dividend and quotient are getting smaller and smaller. So the size of the byte array can be decreased whenever a most significant byte becomes zero. This should neatly halve overall execution time.
But i didn't add this optimisation, as i need the bytes elsewhere and it's already quick enough for my purposes. :)

ok, my conclusion was too fast, sorry about that. I retested it and you're right, it works okay.

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

TheRyk/MYD!
sebalozlepsi
CA$H/TRiAD
Didi/Laxity
DKT/Samar/sidDivers
cercamon/Scene World..
crayon/Hokuto Force
kbs/Pht/Lxt
Guests online: 127

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Coma Light 13  (9.6)
4 Edge of Disgrace  (9.6)
5 Mojo  (9.6)
6 Uncensored  (9.6)
7 The Demo Coder  (9.6)
8 Comaland 100%  (9.6)
9 What Is The Matrix 2  (9.6)
10 Wonderland XIV  (9.5)

Top onefile Demos

1 Layers  (9.7)
2 Cubic Dream  (9.6)
3 Party Elk 2  (9.6)
4 Copper Booze  (9.6)
5 Dawnfall V1.1  (9.5)
6 Rainbow Connection  (9.5)
7 Morph  (9.5)
8 Libertongo  (9.5)
9 Onscreen 5k  (9.5)
10 It's More Fun to Com..  (9.5)

Top Groups

1 Booze Design  (9.3)
2 Oxyron  (9.3)
3 Performers  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)

Top Original Suppliers

1 Derbyshire Ram  (9.7)
2 Fungus  (9.3)
3 Black Beard  (9.2)
4 Baracuda  (9.2)
5 hedning  (9.1)

Page generated in: 0.065 sec.