Say my adds were 6 cycles each, I'm adding 4 numbers, I start timer at $124, then lda TimerA+1:bne means no carries. Don't know how it's useful either, just throwing out a lot of ideas. You gotta brainstorm and not hold yourself back, useful stuff will stick eventually.

There's also two timers, you can have them read different values.

You don't need the sbc #constant because you should set the timer to #constant + variation at the start, then at the end the timer reads exactly the high byte +1 (since Timer can't be zero)

so

ldx timerA

dex

stx sum+1]]>

@mahooney Nice, that's a very good use because it's the only way possible to create an effect. Will check out your demo thanks for the comment :)]]>

---

Have a noise night!

http://mahoney.c64.org]]>

Question, what use can this be put to? And can you use the flags as part of a calculation ,while using the accumulation of times the flags happened for a parallel computation using the timer (which has no sideeffects) to do double the work? (I mean, use the branch to do one thing, then at the end do another thing with the timer value).

In another direction, can you do some computation unrolled then interrupt it at a certain time to see the value and that's your result?

example,

rol

bcs +2

inx

... x8

interrupt after 24 cycles

X will be 4-(popcnt(A and $#f)/6) or something (the full formula is kinda long)

Another idea I've used to to make 'infinite' loops that can only be stopped by an interrupt

lda #0

sta $4000

sta $4001

..

decide which area of the bitmap you want to fill

jump into the code

stop it with an interrupt

of course you can 'plot' an RTS into the middle as well, then restore it.

Interrupt idea might turn out faster

This is getting off topic though.]]>

lda n

ror

bcs *+2

x8

;diff timer

sta popcnt

Population count in 4.5 cycles per bit. Of course a table would be much faster. What else useful could you do?]]>

;now do subtractions

sec

sbc n5

bcs *+2; 'subtract' one cycle from timing if there's a borrow

sbc n6

bcs *+2

...]]>

clc

lda n1

adc n2

bcs *+2;basically do nothing as quickly as possible to make carry consume one more cycle

adc n3

bcs *+2

...

sta sum

;diff timer

sta sum+1

My first method was 6/7, this one is 5/6. The probability of a carry is just under half, so basically approaches 5.5 cycles per number added.

I love to see someone's face try to decode that though, confuse the @#@ out of them :)]]>

The indexed addressing is doing the adds, and the timer is doing the carries, in parallel.

…which is exactly what Fresh is proposing \O/

@lft - all good, I missed that Repose was tracking the carry count in X in post #4…]]>

My own idea which just came to mind based on this concept, is

addtab is a 510 byte table, starting at a page boundary (very important!), that is just a count (0,1,1...255,0..)

ldx n1

lda n2

sta sm1+1

lda n3

sta sm2+1;or just imagine the numbers are already stored there

;initialize and start the timer to a high value

lda TimerA_Low

sta start

sm1 lda addtab,x; if addtab+x crosses page, add 1 cycle

tax; transfer running sum (n1+n2) to X

sm2 lda addtab,x;add n3 to running sum

...

sta r0;the low byte of sum

lda start

sec

sbc timerAlo

sbc #some constant

sta r1;the high byte of sum

It's not fast but, it's an interesting concept that could be applied to optimized code somehow.

If I store n2,n3... directly in sm1+1, sm2+1..., then it optimizes out to:

ldx n1

sm1 lda addtab,x;n1+n2

tax

sm2 lda addtab,x;n3+ (n1+n2)

tax

sm3 lda addtab,x;n4+ (n1+n2+n3)

...

carries are time of code minus the constant 6*n (the number of values added)

This is also quite faster so it's actually practical!

I just invented the missing add instruction. The indexed addressing is doing the adds, and the timer is doing the carries, in parallel.]]>

BTW, I'm not using the carry.

.const zp0=$fb .const zp1=$fc .const zp2=$fd .pc=$1000 start: sei !: bit $d011 bpl !- // No BLs! lda zp1 sta b+1 lda zp2 sta c+1 lda #$00 sta $dd0f sta $dd06 lda #$01 sta $dd07 sta $dd0f ldy zp0 // 3 b: ldx tab,y // 4/5 c: ldy tab,x // 4/5 // Here we could add more ldx/ldy lo: ldx $dd06 // 4 => Wait=15/17 lda carrytab,x // HI/LO in A/Y sty $63 sta $62 cli jmp $bdd1 .align $0100 tab: .fill 512,i&$ff carrytab: .fill 256,[$f3-i]&$ff]]>

We're all terrible at reading each other's comments.

Whoops, you're quite right. Sorry!]]>

Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.

I wonder if you could just get away with checking the Y value every N potential overflows, and just see if it has decreased since the last test... it'd be quicker than doing a branch every time.

N may well be as high as 255; I'd have to think a bit harder to be sure of that.]]>

I do like lft's contribution (addition?) of using the carry from the last addition as inverse borrow for the fixup, mind :)]]>

Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.]]>

; Assuming 8-bit unsigned input and 16-bit unsigned output ldy #0 ; Perform tmp1=a+c+e clc lda a adc c bcc :+ iny clc : adc e sta tmp1_lo bcc :+ iny clc : sty tmp1_hi [...]]]>

; Assuming 8-bit unsigned input and 16-bit unsigned output lda #0 sta tmp1_hi sta tmp2_hi ; Perform tmp1=a+c+e clc lda a adc c bcc :+ inc tmp1_hi clc : adc e sta tmp1_lo bcc :+ inc tmp1_hi clc : ; Perform tmp2=b+d+f lda b adc c bcc :+ inc tmp2_hi clc : adc f sta tmp2_lo bcc :+ inc tmp2_hi : ; Perform tmp1-=tmp2 sec lda tmp1_lo sbc tmp2_lo sta tmp1_lo lda tmp2_hi sbc tmp2_hi sta tmp2_hi ; Result in tmp1]]>

Then I realized I could prestuff the pointers and use many of them, then I was able to add the partials in order.

I'll have to try again with the "the position in code is that data" idea.

So here's the two variations:

1) as above, 3 setups, partials added in order

2) like codebase, 2 setups, partials added out of order.

Which one will reign victorious?

[1] http://www.codebase64.org/doku.php?id=base:seriously_fast_multi..

[2] http://www.6502.org/source/integers/fastmult.htm]]>

Quick reply for now,

- Yes, I'd thought of storing the values as 2's complement negative, the sign bit steals precision and reduces the domain of the multply. It could be useful to have an alternate and faster routine around, but for now I'm trying to solve a full unsigned 16x16.

-The final sum of a-f will be a positive number, since this is coming from an unsigned multiply partial product. Also, each a-b has a>=b.

I used that to construct the following 16x16 unsigned multiply where the partials are computed and added in order.

Conventions

f means f(x)=x^2/4, g means g(x)=(255-x)^2/4

x0/x1 are low/high of multiplier, x0/y1 multiplicand

Note there is a difference. I take the multiplier as meaning the one that requires setup, aka lda multiplier:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4

"setup x0" is a macro-like notation which means what I just stated above.

This is a bit confusing if you really try to follow it, but here's the order of partials:

y1 y0 x1 x0 ------ p0h p0l ;x0*y0 p1h p1l ;x0*y1 for l and y1*x0 for high p2h p2l ;x1*y0 p3h p3l ;y1*x1 f(a+b)-g(a-b) in pointers in Y p0l=f_low_x0-g_low_x0 * y0 p0h=f_high_x0-g_high_x0 * y0 p1l=f_low_x0-g_low_x0 * y1 p1h=f_high_y1-g_high_y1 * x0 p2l=f_low_x1-g_low_x1 * y0 p2h=f_high_x1-g_high_x1 * y0 p3l=f_low_y1-g_low_y1 * x1 p3h=f_high_y1-g_high_y1 * x1 If you do it in lsb to msb order to do the adds, x0*y0 low -> r0 (p0l) x0*y0 high (p0h) x0*y1 low (p1l) x1*y0 low -> r1 (p2l) x1*y0 high (p2h) y1*x0 high (p1h) this is different than p1l y1*x1 low -> r2 (p3l) y1*x1 high -> r4 (p3h)

And now the code:

setup x0;14 cycles setup x1 setup y1;42c ldy y0 sec lda (f_low_x0),y sbc (g_low_x0),y sta r0;p0l, 18c ;the adds of column 1 ldx #0;stores carries, reset every column with adds clc; Y=y0 lda (f_high_x0),y;A=p0h, (a+b) part ldy y1 adc (f_low_x0),y;+p1l (a+b) part bcc s1 inx clc s1 ldy y0 adc (f_low_x1),y;+p2l (a+b) part bcc s2; 31c inx clc ;the subs of column 1 s2 sec; Y=y0 ;A=the (a+b) parts added, carry in X sbc (g_high_x0),y bcs s3 dex sec s3 ldy y1 sbc (g_low_x0),y;-p1l (a-b) part bcs s4 dex sec s4 ldy y0 sbc (g_low_x1),y;-p2l (a-b) part bcs s5 dex s5 sta r1;35 or 65 per column ;the adds of column 2 txa; get the carries from column 1 clc ... ;column 3 txa clc adc (f_high_y1),y;p3h (a+b) part sec sbc (g_high_y1),y;p3h (a-b) part sta r3;19 ... 42+18+65+65+19=209

This beats the 16x16 posted on codebase, which is 224 by my estimate.

Times are minimal times, really there's a bit more added when there's overflows/carries.

It's good enough to make a fair comparison though.]]>

One problem, X could rollover to $FE.

Could also avoid that by computing a+(-b)+c+(-d)+e+(-f), if b d and f are coming from a table (as they would be for the multiplies I gather this is for) Just store the negated value in the table instead :)]]>

lda n1 adc n2 bcc n1n2_0 clc n1n2_1 adc n3 bcc n1n2n3_1 clc n1n2n3_2 adc n4 bcc n1n2n3n4_2 clc n1n2n3n4_3 adc n5 sta sum lda#3 adc n1+1 clc adc n2+1 clc adc n3+1 clc adc n4+1 clc adc n5+1 sta sum+1 rts n1n2_0 adc n3 bcc n1n2n3_0 clc n1n2n3_1 adc n4 bcc n1n2n3n4_1 clc n1n2n3n4_2 adc n5 sta sum lda#2 adc n1+1 clc adc n2+1 clc adc n3+1 clc adc n4+1 clc adc n5+1 sta sum+1 rts n1n2n3_0 adc n4 bcc n1n2n3n4_0 clc n1n2n3n4_1 adc n5 sta sum lda#1 adc n1+1 clc adc n2+1 clc adc n3+1 clc adc n4+1 clc adc n5+1 sta sum+1 rts n1n2n3n4_0 adc n5 sta sum lda#0 adc n1+1 clc adc n2+1 clc adc n3+1 clc adc n4+1 clc adc n5+1 sta sum+1 rts

Can also ditch the CLCs in the tails if you know there will be no overflow (eg if your numbers are unsigned, or represented in excess $2000). Would take you down to 6-7 cycles per additional low byte, 3 cycles per additional high byte. For enough addends it'd probably also be worth removing the CLCs from the low byte code sequences and just subtracting the misplaced carries from the final result instead.]]>