Maximize memory read bandwidth on M1 Ultra/M2 Ultra

I am in the process of developing a matrix-vector multiplication kernel. While conducting performance evaluations, I've noticed that on M1/M1 Pro/M1 Max, the kernel demonstrates an impressive memory bandwidth utilization of around 90%. However, when executed on the M1 Ultra/M2 Ultra, this figure drops to approximately 65%. My suspicion is that this discrepancy is attributed to the dual-die architecture of the M1 Ultra/M2 Ultra. It's plausible that the necessary data might be stored within the L2 cache of the alternate die.

Could you kindly provide any insights or recommendations for mitigating the occurrence of on-die L2 cache misses on the Ultra chips? Additionally, I would greatly appreciate any general advice aimed at enhancing memory load speeds on these particular chips.