Scott Hanselman asks for the fastest way to flip the bits of a number in order to change the Endianess. To make it more challenging, he doesn't just want straight conversion of a 64-bit integer to another 64-bit integer. Instead, he wants to specify the size of the number to reverse (i.e., only the lower x number of bits).

Note: It was brought up that it probably isn't accurate to describe this as an Endianess converter, because Endianess really only affects the individual bytes or words in memory, not the total order of those bytes or words. That is, if you have a string of memory containing "1234", then converting Endianess would be more akin to resulting in "2143" than "4321". Scott's challenge is really for the latter example, which is total reversal of the bits.

There were a couple of interesting solutions posted in his comments section. But, remembering back to the same time period as my previous post, I knew that one of the fastest approaches uses a tradeoff of utilizing memory instead of computation. That is, you perform some level of pre-computation, and then your function only has to do lookups from an array and simple addition/bit shifting.

Part of the problem is determining how much memory you can afford to use for this task. I thought that working with 8-bits at a time was a nice starting point, because 8-bits is at the boundary of an intrinsic type (the "byte" type), and because my table (an array) then only needs to be 256 elements in size.

The contents of this table is the flipped byte for each index value. The algorithm then is to loop through the original number, grab the lower 8-bits, look up the flipped byte value from the table, and then left-shift it into the output variable. At the end, right-shift the output variable to make the final number only have the specified number of significant bits.

byte[] flippedBytes = new byte[] {

0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0,

0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,

0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8,

0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,

0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4,

0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,

0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC,

0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,

0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2,

0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,

0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA,

0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,

0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6,

0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,

0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE,

0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,

0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1,

0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,

0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9,

0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,

0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5,

0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,

0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED,

0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,

0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3,

0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,

0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB,

0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,

0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7,

0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,

0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF,

0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF};

long Reverse(long x, int bits)

{

long z = 0;

// reverse the bytes

for (int i = 0; i < 8; i++)

{

z = (z << 8) + flippedBytes[x & 0xff];

x = x >> 8;

}

return z >> (64 - bits);

}

For a slight performance increase, get rid of the loop and just "manually" iterate the 8 times (this eliminates a comparison and branch operation).:

long Reverse(long x, int bits)

{

long z = 0;

z = (z << 8) | flippedBytes[x & 0xff];

x = x >> 8;

z = (z << 8) | flippedBytes[x & 0xff];

x = x >> 8;

z = (z << 8) | flippedBytes[x & 0xff];

x = x >> 8;

z = (z << 8) | flippedBytes[x & 0xff];

x = x >> 8;

z = (z << 8) | flippedBytes[x & 0xff];

x = x >> 8;

z = (z << 8) | flippedBytes[x & 0xff];

x = x >> 8;

z = (z << 8) | flippedBytes[x & 0xff];

x = x >> 8;

z = (z << 8) | flippedBytes[x & 0xff];

x = x >> 8;

return z >> (64 - bits);

}

The execution time can further be cut in half by storing flippedWords ("ushort" types) instead of flippedBytes ("byte" types). However, this requires 512 times the memory for storage (64K elements * 2 bytes a piece = 128KB instead of 256 bytes).

ushort[] flippedWords = new ushort[1 << 16];

public Class1() // Constructor

{

// Initialize the array

for (int i = 0; i < flippedWords.Length; i++)

{

flippedWords[i] = (ushort)((flippedBytes[i & 0xff] << 8)

| flippedBytes[(i >> 8)]);

}

}

public long Reverse(long x, int bits)

{

long z = 0;

z = (z << 16) | flippedWords[x & 0xffff];

x = x >> 16;

z = (z << 16) | flippedWords[x & 0xffff];

x = x >> 16;

z = (z << 16) | flippedWords[x & 0xffff];

x = x >> 16;

z = (z << 16) | flippedWords[x & 0xffff];

x = x >> 16;

return z >> (64 - bits);

}

And, to put this thing to bed, there's one more little thing that can improve execution time: control the iterations based on the number of bits desired. That is, if we only need 12 bits in the output, then there's no reason to execute the other 3 iterations. This method provides better performance only if the bits parameter is less than 48, otherwise the previous method proves faster (because there are no conditions):

public long Reverse(long x, int bits)

{

int iter = (bits >> 4) + 1;

long z = 0;

for (int i = 0; i < iter; i++)

{

z = (z << 16) | flippedWords[x & 0xffff];

x = x >> 16;

}

return z >> 16 - (bits & 0xf);

}

tags: Hanselman, C#, endian, code, .net