Thanks for looking into this. If I understand correctly, there is some ambiguity over what is going on here, and you are making an educated guess about what is happening.
It does sound likely that stack sampling activities are contributing to the measured instruction and cycle counts.
It would be great if stack sampling were disabled when the HW Counters option is selected rather than the Time Profiling option. Capturing HW counter (PMU) data is very lightweight and does not need to be done often.
As requested, I have re-run some captures for a heavier application with both low and high frequency sampling. This time I captured on an iPad.
I selected Safari Browser and loaded the Speedometer 2.1 benchmark in it as a repeatable test.
Here is a screenshot in low frequency sampling mode (other settings as per above post - i.e. 2 x counters CYCL and INSTR + deferred capture).
Here is the equivalent high frequency sampling screenshot:
As you can see, even though I captured all processes on the iPad in both runs, the INSTR and CYCL counts differ substantially between the low and high frequency captures.
Just for the top process (com.apple.WebKit.WebContent):
Low Freq - INSTR = 232B, CYCL = 84.5B ==> IPC = 2.75
High Freq - INSTR = 285B, CYCL = 110B ==> IPC = 2.59
Looking at all platform activity (to capture indirect work from Safari):
Low Freq - INSTR = 257B, CYCL = 99.3B ==> IPC = 2.59
High Freq - INSTR = 327B, CYCL = 137B ==> IPC = 2.39
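To put numbers on the gap, here is a small Python sketch of my own arithmetic on the totals above (it is not something Instruments produces). It recomputes IPC and the percentage increase from the low to the high frequency capture. The dictionary layout and function names are just my own choices for illustration.

```python
# Counter totals in billions, copied from the two captures above.
# Each entry is (INSTR, CYCL).
captures = {
    "WebContent": {"low": (232.0, 84.5), "high": (285.0, 110.0)},
    "All processes": {"low": (257.0, 99.3), "high": (327.0, 137.0)},
}


def ipc(instr, cycl):
    """Instructions per cycle."""
    return instr / cycl


def pct_increase(low, high):
    """Percentage increase going from the low to the high value."""
    return (high - low) / low * 100.0


def summarize(scope):
    """IPC for both captures plus the low -> high increase in each counter."""
    instr_lo, cycl_lo = captures[scope]["low"]
    instr_hi, cycl_hi = captures[scope]["high"]
    return {
        "ipc_low": ipc(instr_lo, cycl_lo),
        "ipc_high": ipc(instr_hi, cycl_hi),
        "instr_pct": pct_increase(instr_lo, instr_hi),
        "cycl_pct": pct_increase(cycl_lo, cycl_hi),
    }


for scope in captures:
    s = summarize(scope)
    print(
        f"{scope}: IPC low={s['ipc_low']:.2f}, high={s['ipc_high']:.2f}, "
        f"INSTR +{s['instr_pct']:.1f}%, CYCL +{s['cycl_pct']:.1f}%"
    )
```

By this arithmetic the high frequency capture reports roughly 23% more instructions and 30% more cycles for WebContent, and roughly 27% and 38% more across all processes.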
Now, since it is clear there is some sort of overhead in the HW counts here (perhaps due to stack sampling), it's hard to trust either set of numbers (low or high sampling frequency).
Is there any way to disable stack sampling?
If not, is there any way to calculate and remove the overhead of stack sampling accurately?
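To make the question concrete, this is the kind of correction I have in mind, as a rough Python sketch. It assumes (unverified) that the overhead grows linearly with the number of stack samples taken and that the benchmark does the same real work in both runs. The sample counts below are HYPOTHETICAL placeholders, since I do not know how many samples each capture actually took. The function name is my own.

```python
def extrapolate_to_zero_samples(count_low, count_high, samples_low, samples_high):
    """Fit measured = true + per_sample_cost * samples through two captures
    and return (true, per_sample_cost), i.e. the count at zero samples."""
    if samples_high == samples_low:
        raise ValueError("need two captures with different sample counts")
    per_sample_cost = (count_high - count_low) / (samples_high - samples_low)
    true_count = count_low - per_sample_cost * samples_low
    return true_count, per_sample_cost


# HYPOTHETICAL sample counts, for illustration only (not taken from my captures).
SAMPLES_LOW = 1_000
SAMPLES_HIGH = 10_000

# All-process totals from the captures above, in raw counts.
instr_true, instr_cost = extrapolate_to_zero_samples(
    257e9, 327e9, SAMPLES_LOW, SAMPLES_HIGH
)
cycl_true, cycl_cost = extrapolate_to_zero_samples(
    99.3e9, 137e9, SAMPLES_LOW, SAMPLES_HIGH
)

print(f"Estimated INSTR without sampling: {instr_true / 1e9:.1f}B")
print(f"Estimated CYCL without sampling:  {cycl_true / 1e9:.1f}B")
print(f"Estimated IPC without sampling:   {instr_true / cycl_true:.2f}")
```

With only two captures there is no way to check the linear assumption, so a third capture at an intermediate sampling rate would be needed before I would trust a correction like this.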