| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Profiling tool?
Has anyone come across or written a tool that lets you execute a chunk of 6502 that then reports back a disassembly where each instruction is annotated with the total number of cycles it consumed?
Something like unp64 but with added instrumentation, basically.
Closest a quick google found for me was a tool for generating such reports from logs produced by a hardware bus monitor.. |
|
| |
soci
Registered: Sep 2003 Posts: 480 |
This was discussed already:
https://csdb.dk/forums/?roomid=11&topicid=68967&firstpost=36
And mentioned later here:
https://csdb.dk/forums/?roomid=11&topicid=125026&firstpost=12
Otherwise who would want to do that for demo coding. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
I did to check min/max/avg raster time for sid tunes. Output is time spent on each instruction and a disassembly, to find hot spots (typically vibrato or pulse sweeps) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Ah excellent, thanks soci.
I somehow missed the trace command in the VICE monitor. I assume that's what you used for "Then I run the binary in VICE and start a trace at the place of interest for as long as necessary"?
seems to be hella slow for 700,000 cycles worth of execution, mind :( |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting JackAsserI did to check min/max/avg raster time for sid tunes. Output is time spent on each instruction and a disassembly, to find hot spots (typically vibrato or pulse sweeps) Off topic, but that reminds me of an old idea of mine:
At the beginning of the player routine called each frame, read this frame's SID register values from a ring buffer and write them to the actual SID registers. Then switch to a thread that would fill the ring buffer with new data until switching back to the calling thread is done with a timer or raster interrupt.
Result: player CPU load spikes are evened out, player code and structure is greatly simplified as no manual load balancing is required. Maximum player CPU time for every video frame can be minimised. |
| |
Digger
Registered: Mar 2005 Posts: 437 |
@Krill: genius! |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
i used that in some part (with an existing player of course) to get rid of spikes.... forgot where though :) |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Ergo; Groepaz is a genius |
| |
Perplex
Registered: Feb 2009 Posts: 255 |
I know Bjørn Røstøen has done this, as well as the guys in Booze Design. There are many geniuses out there. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
Ah yes, i think i stole the idea from HCL. There might be a thread about this here (from about a decade ago) :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
The scene is full of the greatest minds in the galaxy :D
Now, where's my profiling tool? :P |
| |
Slajerek
Registered: May 2015 Posts: 63 |
Actually, a profiling tool is messing my mind for some time already. I have a proof of concept view of such cycles counting and display for the C64 Debugger, but it is not in the master branch yet.
Brush/Elysium showed me at X a Champ 65C02 profiler written in Ruby that is able to generate nice profiling reports in HTML, it shows cycle % per detected procedures and utilizes GraphViz for the code flow diagrams.
I changed the source of Champ to be able to feed data generated by my debugger and just Today released the first alpha version of this, it is included in C64 Debugger v0.64.56.4. Although it is not perfect, as still I need to add labels and watches, but it is quite usable even now.
Go to Settings/Emulation, set profiler output file and start profiling for some time. It will create a *.pd file which then you need to feed to C64 Debugger's version of Champ: http://csdb.dk/getinternalfile.php/179301/c64d-champ-v064564.rb
It should generate a report.html which you can view in a web browser. It has some flaws though that have to be noted:
- the cycle counter is a global VIC counter, not the CPU cycle counter, so bad line cycles are also added, I'm going to add a setting to select which cycle counter to use for profiling
- Champ is completely not able to distinguish if we are in IRQ, thus, cycles taken by code run in IRQ are also counted as a procedure code.
Anyway, this is some starting point I am going to develop further. It is not giving cycles % spent per each code address, but the POC for such view is existing and I suspect should be added in some upcoming version of the C64 Debugger soon. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: The scene is full of the greatest minds in the galaxy :D
Now, where's my profiling tool? :P
You know very well that even if there was such a tool publicly available you'd find an excuse to code you own in Python. I bet you've already started. That's how we work. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
didnt micro64 have some profiling stuff in its "ide"? |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
VICE PDB Monitor 1.0 (get the extra files from github as per comments ) has a limited profiler, Martin did it so I'm not a 100% sure how it works. But I think it uses the memmap operations to count how often an address is executed. So its doesn't give you clocks but it does give hotspots. Its open source so if you wish to extend it to be better, feel free ;) |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
windows specific hack though, and requires a custom non supported acme hack. meh |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
hack? what hack? It makes no modifications to VICE at all and will even connect to a Linux instance if you set up a tunnel for it. To be clear it only uses the Remote Debugger Protocol to communicate to VICE.
It can use Martin's custom ACME, however ACME is for morons and 64Tass is also supported(with more features) for true sensible people ;) |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
Now i'd like to know what cybermodo has to say about this topic :) |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
I can't believe he uses ACME, that would mean he knows assembly... |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
Obviously he doesnt - but he'd sure have to say something about it =D |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
you could copy paste it for him to lemon |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
NO :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Ah, thanks Slajerek, that's an excellent start.
I'm not seeing any cyclecounts or timestamps in the log output though?
So yes, JackAsser, I am indeed currently writing a python script to parse c64debugger logs and the source PRG to add that information..
I'm sure CyberMojoJoJo could whip up a native profiling tool in C in no time at all, assuming someone else has provided some libraries to do all the heavy lifting. Maybe to scroll flamegraphs in ifli? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Whee, this is fun.
First bash at collecting stats from the output from c64 debugger; this is a run of tinycrunch (fast variant) decrunching 7.bin from the Pearls for Pigs corpus.
(well ok, second .prg profiled; started building this for a sekret project, output of that one's under wraps for now)
Columns are address, instruction, total cycles spent by that instruction, cumulative percentage of cycles expended, and an ASCII graph of the latter.
Clearly I'd save a fucktonne of cycles if I relocated the decruncher to zero page and inlined the pointers at 0a30 and 0a32 :)
0a00 stx $07 3 0.00%
0a02 ldy #$02 2 0.00%
0a04 sta $0004,y 15 0.00%
0a07 lda ($06),y 15 0.01%
0a09 dey 6 0.01%
0a0a bpl $0a04 8 0.01%
0a0c pha 3 0.01%
0a0d sec 2 0.01%
0a0e bcs $0a40 3 0.01%
0a10 pla 4 0.01%
0a11 iny 2 0.01%
0a12 sta ($04),y 6 0.01%
0a14 rts 6 0.01%
0a15 inc $05 255 0.06%
0a17 clc 102 0.08%
0a18 bcc $0a40 153 0.11%
0a1a beq $0a10 3577 0.80%
0a1c iny 3574 1.49% #
0a1d clc 3574 2.17% #
0a1e lda ($06),y 8935 3.89% ##
0a20 adc $04 5361 4.92% ##
0a22 sta $08 5361 5.95% ###
0a24 txa 3574 6.64% ###
0a25 ora #$f8 3574 7.33% ####
0a27 adc $05 5361 8.36% ####
0a29 sta $09 5361 9.39% #####
0a2b lda ($08),y 8935 11.11% ######
0a2d sta ($04),y 10722 13.17% #######
0a2f iny 22824 17.56% #########
0a30 lda ($08),y 57060 28.53% ##############
0a32 sta ($04),y 68472 41.69% #####################
0a34 txa 22824 46.08% #######################
0a35 sbx #$08 22824 50.47% #########################
0a37 bpl $0a2f 32449 56.71% ############################
0a39 tya 3574 57.40% #############################
0a3a adc $04 5361 58.43% #############################
0a3c sta $04 5361 59.46% ##############################
0a3e bcs $0a15 3625 60.16% ##############################
0a40 lda #$02 3576 60.84% ##############################
0a42 adc $06 5364 61.88% ###############################
0a44 sta $06 5364 62.91% ###############################
0a46 bcs $0a96 3586 63.60% ################################
0a48 ldy #$00 4884 64.54% ################################
0a4a lax ($06),y 12210 66.88% #################################
0a4c bpl $0a1a 6341 68.10% ##################################
0a4e asl 1970 68.48% ##################################
0a4f bmi $0a75 2466 68.96% ##################################
0a51 iny 11768 71.22% ####################################
0a52 lda ($06),y 29420 76.87% ######################################
0a54 sta ($04),y 35304 83.66% ##########################################
0a56 dex 11768 85.93% ###########################################
0a57 bmi $0a51 17155 89.22% #############################################
0a59 tya 994 89.42% #############################################
0a5a adc $06 1491 89.70% #############################################
0a5c sta $06 1491 89.99% #############################################
0a5e bcc $0a63 1463 90.27% #############################################
0a60 inc $07 140 90.30% #############################################
0a62 clc 56 90.31% #############################################
0a63 tya 994 90.50% #############################################
0a64 adc $04 1491 90.79% #############################################
0a66 sta $04 1491 91.07% ##############################################
0a68 bcc $0a6c 1464 91.35% ##############################################
0a6a inc $05 135 91.38% ##############################################
0a6c ldy #$00 994 91.57% ##############################################
0a6e lax ($06),y 2485 92.05% ##############################################
0a70 bpl $0a1a 1325 92.30% ##############################################
0a72 asl 332 92.37% ##############################################
0a73 bpl $0a51 340 92.43% ##############################################
0a75 txa 1308 92.68% ##############################################
0a76 adc $04 1962 93.06% ###############################################
0a78 sta $08 1962 93.44% ###############################################
0a7a lda #$ff 1308 93.69% ###############################################
0a7c adc $05 1962 94.07% ###############################################
0a7e sta $09 1962 94.44% ###############################################
0a80 iny 1308 94.70% ###############################################
0a81 lda ($08),y 3270 95.32% ################################################
0a83 sta ($04),y 3924 96.08% ################################################
0a85 iny 1308 96.33% ################################################
0a86 lda ($08),y 3270 96.96% ################################################
0a88 sta ($04),y 3924 97.71% #################################################
0a8a lda #$01 1308 97.97% #################################################
0a8c adc $04 1962 98.34% #################################################
0a8e sta $04 1962 98.72% #################################################
0a90 bcs $0a9b 1310 98.97% #################################################
0a92 inc $06 3270 99.60% ##################################################
0a94 bne $0a48 1959 99.98% ##################################################
0a96 inc $07 65 99.99% ##################################################
0a98 jmp $0a48 39 100.00% ##################################################
0a9b inc $05 10 100.00% ##################################################
0a9d bcs $0a92 6 100.00% ##################################################
|
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting ChristopherJamtotal cycles spent by that instruction, cumulative percentage of cycles expended, and an ASCII graph of the latter I wonder how log2(total cycles spent by that instruction) and the corresponding ASCII graph (with or without scaling to 64x "#" max.) would look like. I imagine "number of bits to represent number of cycles" to be quite an intuitive curve. :)
Edit: Hmm, that would basically be a scaled and left-justified version of the "total cycles spent by that instruction" (base 10) column. Maybe it should be linear‽ |
| |
Slajerek
Registered: May 2015 Posts: 63 |
Interesting result of the script :)
Yes, lacking of cycle stamp in each log line is a clear overlook by me, apologies :) there's a number of cycles in each jsr/rts marker, so Champ does not need a cycle count per line. Anyway, I definitely have to add better cycle counters handling and logging. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Krill: Yes, I graphed cumulative to make it easy to see at a glance how costly sections of code are, but the counts do indeed make it easy to see individual expensive instructions.
five digits bad, three or four digits less so :)
Slajerek: Thanks!
Oh, and no need to apologize; much as I'm looking forward to future enhancements this is already extraordinarily useful. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting ChristopherJamKrill: Yes, I graphed cumulative to make it easy to see at a glance how costly sections of code are, but the counts do indeed make it easy to see individual expensive instructions. So you're looking for big slopes (high increase of #) when head-parsing the graph line by line from top to bottom, adjusting for increasing "ratio skew"?
I mean the cumulative graph always starts at 0 and ends at 100% in a strictly ascending manner, no matter the program or per-line cycle distribution. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Yes, I look at the slope as a proportion of the final width, rather than current width. Perhaps I should add a right hand border, or a second/lighter fill character between the current total and 100%? (eg ####__)
But yes, even as is you can spot things like the graph going near horizontal at 0a32 and 0a54, and you can also see it going from 1.5% to 89% in the space of 40 or so instructions. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
does you log handle, badlines, sprite DMA, IRQs etc, does it just get tacked on to the cost of the instruction, are you able to filter out what is and isn't IRQ etc. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
This is purely CPU cycles at the moment; I don't have information about VIC bus accesses.
I suspect the only indication I would have that an IRQ has happened would be from parsing the preceding instruction and noting that the next one executed was in an unexpected location; not sure, as I've only tested this with screen blanked and interrupts disabled.
( Slajerek? :D ) |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Enough to profile "pure" algorithms, converting input to output without side-effects. Which is all you need for now. :) |
| |
Martin Piper
Registered: Nov 2007 Posts: 722 |
Quote: VICE PDB Monitor 1.0 (get the extra files from github as per comments ) has a limited profiler, Martin did it so I'm not a 100% sure how it works. But I think it uses the memmap operations to count how often an address is executed. So its doesn't give you clocks but it does give hotspots. Its open source so if you wish to extend it to be better, feel free ;)
Currently VICE PDB Monitor uses Vice's mmzap and mmshow between executions or breakpoints to gather execution information and generate a heat map. It's not entirely exact since Vice doesn't trap multiple executions from the same address.
I'm pondering adding some profiling syntax and command line arguments to BDD6502 to allow cycle exact profiling a source level debugging of the same.
One thing I would like to support is support for self modifying code where the opcode is changed. This would be useful for very optimised self modifying code. This does however have challenges for regular profile views because cycle counts can change when the opcode is changed. |
| |
Slajerek
Registered: May 2015 Posts: 63 |
@oziphantom noted :) all these are not logged yet, the Champ profile reports do not take these things into account. Will add soon, thanks for pointing this out. |