On the FPGA side, reading 2 U32s or 8 U8s shouldn't make a difference in terms of throughput. Some old info I found internally basically said that if they don't have the same throughput, it's a bug.
I also don't think the DMA throughput should be affected. If I remember correctly, the DMA engine will try to send multiple data items up at the same time to minimize the overhead of PCIe packet headers.
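
To put rough numbers on the header overhead point, here's a quick Python sketch. The 24 bytes of per-packet overhead is an assumed, illustrative figure (header plus framing and CRC), not something I measured on this hardware, and `tlp_efficiency` is just a made-up helper name. The real value depends on the link and the packet format.

```python
# Rough sketch of why batching data items into one PCIe packet helps.
# ASSUMPTION: ~24 bytes of overhead per packet (header + framing + CRC).
# This is an illustrative figure, not a measured or vendor-specified one.
ASSUMED_OVERHEAD_BYTES = 24


def tlp_efficiency(payload_bytes, overhead_bytes=ASSUMED_OVERHEAD_BYTES):
    """Fraction of the bytes on the link that are actual payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


# 2 U32s and 8 U8s carry the same number of bytes, so at the packet
# level they look identical.
two_u32_bytes = 2 * 4
eight_u8_bytes = 8 * 1

if __name__ == "__main__":
    print(two_u32_bytes == eight_u8_bytes)
    # One 8-byte item per packet versus many items batched together.
    for payload in (8, 64, 128, 256):
        print(payload, round(tlp_efficiency(payload), 3))
```

Under that assumption, sending 8 bytes per packet would leave only a quarter of the link carrying real data, which is why it makes sense for the engine to batch items before sending them up.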