The cost I am worried about is speed. I realize I did not give you enough information. I am implementing the convolution layers of a neural network. These layers multiply each window element-wise by a filter and sum the result. Here is a naive implementation of a convolution. It is really slow. One way to make it faster is to move the multiplication and summation out of the loop and do them later. Instead of multiplying inside the loop, you flatten each window into a 1D array and get a big matrix with "batch_size*new