If it reproduces again, please post the VIs. I suspect the problem is with the nested loops in the subVI rather than with the calling VI.
The number of loop instances used at run time is the minimum of the number specified in the dialog and the number wired to P. If you don't wire anything to P, that input defaults to the number of logical processors in the machine. So if you have a quad-core machine (four logical processors), left P unwired, and specified at least four in the dialog for both loops, both loops will be four-way parallel. The outer loop will execute using four loop instances, and each of those will execute the inner loop using four loop instances of its own. That results in 4 x 4 = 16 inner loop instances eligible to run in parallel on a machine that can only run four at a time. SuperS_5 is right that this is probably not efficient.
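Since LabVIEW is graphical, here is the same arithmetic as a small Python sketch. It only models the rule described above; it is not how LabVIEW implements it. The function and parameter names (`effective_instances`, `dialog_count`, `wired_p`, `logical_processors`) are made up for illustration, and edge cases such as wiring zero or a negative value to P are not modeled.

```python
import os
from typing import Optional


def effective_instances(dialog_count: int,
                        wired_p: Optional[int] = None,
                        logical_processors: Optional[int] = None) -> int:
    """Model of the rule: run-time instances = min(dialog value, P).

    wired_p=None stands for "nothing wired to P", in which case P
    defaults to the number of logical processors.
    """
    if logical_processors is None:
        # Assumption: fall back to this machine's logical processor count.
        logical_processors = os.cpu_count() or 1
    p = wired_p if wired_p is not None else logical_processors
    return min(dialog_count, p)


def nested_total(outer_instances: int, inner_instances: int) -> int:
    """Each outer instance runs its own set of inner instances."""
    return outer_instances * inner_instances


# Quad-core machine, P unwired on both loops, at least four set in the dialog.
outer = effective_instances(dialog_count=4, logical_processors=4)
inner = effective_instances(dialog_count=8, logical_processors=4)
print(outer, inner, nested_total(outer, inner))  # prints: 4 4 16
```

In this model, wiring 1 to P on the inner loop (`wired_p=1`) brings the total back down to four instances, which matches the number of cores.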