This code was backported from the SSE4 code. The 2x4, 2x8, 2x16, 4x2, 4x4, 4x8, 4x16 and 4x32 block sizes are covered. The macros use only SSE2 instructions, but the primitives use movddup (SSE3). This could easily be replaced if SSE2-only primitives are needed.
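
As a rough illustration of that replacement, here is a minimal sketch in C intrinsics rather than the assembly used by the primitives. It assumes the movddup in question loads 64 bits from memory and duplicates them into both halves of an XMM register. The helper name load_dup64_sse2 is hypothetical and does not come from this code.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics only, no SSE3 header needed */

/* Hypothetical helper: an SSE2-only stand-in for "movddup xmm, m64".
 * It loads 8 bytes from p and returns them in both 64-bit halves. */
static __m128i load_dup64_sse2(const void *p)
{
    /* movq: load the low 64 bits, the high 64 bits are zeroed */
    __m128i lo = _mm_loadl_epi64((const __m128i *)p);

    /* punpcklqdq: interleave the low qwords, giving [lo, lo] */
    return _mm_unpacklo_epi64(lo, lo);
}
```

In assembly this corresponds to a movq load followed by punpcklqdq of the register with itself, so the cost is one extra instruction per movddup. For the register-to-register form, the punpcklqdq alone should be enough.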