Re gpu: 2048 16-bit ints is 4096 bytes, per 4usec is 1,024,000,000 bytes/sec or 976 MB/s. Except its both directions, so actually ~2 GB/s. If youre using a haswell for example (PCIe v3) thats 3 lanes already...without giving your GPU any processing time. A x16 card would give you 3.5 usec, assuming the cuda interface itself has no overhead.
As mentioned above it also depends on the rest of your budget -- whats the cycle time, how much time are you allocating for image capture itself, and what do you need to do with that fft (if greater than x write boolean out? send a message? etc)?