a) make the code PC relative (that may cost even more cycles)
Hmm, unless I'm missing something here, in my experience PC-relative code usually executes faster. I suppose you're referring to writes to absolute addresses? Just sacrifice an address register and use it as a base pointer for the writes.
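
Something like this is what I have in mind. It's only a rough sketch, assuming a 68000-family CPU, where PC-relative addressing is allowed for source operands but not for destinations. The labels (vars, counter, flags, table) are made up for illustration and are not from your code.

```
; Sketch only: assumes a 68000-family CPU and a typical assembler syntax.
; All labels here are hypothetical.

        lea     vars(pc),a5          ; a5 = base of the variable block, set up once

        move.w  table(pc),d0         ; reads can stay PC-relative
        move.w  d0,counter-vars(a5)  ; writes go through the base register instead
        addq.w  #1,flags-vars(a5)    ; read-modify-write works the same way
        rts

table:   dc.w    $1234
vars:
counter: dc.w    0
flags:   dc.w    0
```

The lea is paid once, and after that every write is a plain register-indirect access with a displacement instead of an absolute long address, so the code stays position independent without extra cost per write.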