A few 'bits' of python

June 13, 2013

Bit twiddling has always mesmerized me, or rather, watching others manipulate raw data (at the lowest level possible) is what was so mesmerizing. This is mostly because it seems magical to me. Not coming from a CS degree background I’ve always been attracted to abstractions and libraries that shield me from dealing with system-level data and components.

A recent project required me to take a text file full of csv records and compress each one into a six byte structure.

Legend:
| = byte separator
. = one bit
<----> = space consumption of numeric component

Binary structure of one record:

 <--------------------><--------------------> <------>
|........|........|........|........|........|........|

As you can see, there are two numeric components that take up to 2.5 bytes (20 bits) of space and one 8-bit component. One example csv record looked something like this:

129,-42,54038

Each number (component) expressed in binary using the built-in python function bin() looks like this:

>>> bin(129)
'0b10000001'
>>> bin(-42)
'-0b101010'
>>> bin(54038)
'0b1101001100010110'

Now, try running this line of code in your interpreter:

>>> bin(42)

I’ll wait while you do that…

I’m still waiting.

Ok, did you notice how the result of bin(-42) from earlier looks just like the result for bin(42) just with a leading negative sign (-)? That’s actually really unhelpful in visualizing the end result. What would be really nice to have seen in the beginning is the number represented in “two’s compliment binary”. For reasons I don’t fully understand python differs in its visual representation from its actual implementation for this function’s output. An optional parameter to the bin function that specified “two’s compliment” would be awesome, but it doesn’t exist. So, I wrote some that would show the “two’s compliment” representation of negative numbers:

>>> bits(3, 8)
'00000011'
>>> bits(-3, 8)
'11111101'

Compare the result above with the built-in:

>>> bin(-3)
-0b11

Here’s the implementaion of my bits() method, complete with documentation:

def bits(number, size_in_bits):
    """
    The bin() function is *REALLY* unhelpful when working with negative numbers.
    It outputs the binary representation of the positive version of that number
    with a '-' at the beginning. Woop-di-do. Here's how to derive the two's-
    complement binary of a negative number:

        complement(bin(+n - 1))

    `complement` is a function that flips each bit. `+n` is the negative number
    made positive.

    """
    if number < 0:
        return compliment(bin(abs(number) - 1)[2:]).rjust(size_in_bits, '1')
    else:
        return bin(number)[2:].rjust(size_in_bits, '0')

def compliment(value):
    return ''.join(COMPLEMENT[x] for x in value)

COMPLEMENT = {'1': '0', '0': '1'}

So, now that I could see the bits as they really were I came up with the following description of the scenario. Take each number visualized using the available space:

>>> bits(129, 8)
'10000001'
>>> bits(-42, 20)
'11111111111111010110'
>>> bits(54038, 20)
'11111111111110010001'

No concatenate them in a 48-bit, or 6-byte structure, with bits(54038, 20) ocupying the left-most position, followed by bits(-42, 20), then finally bits(129, 8):

           <--------------------><--------------------> <------>
          |........|........|........|........|........|........|
Expected:  11111111 11111001 00011111 11111111 11010110 10000001

So now, what we need is a sequence of bitwise operations that will produce the 6-byte structure from the 8-bit component and the 2 20-bit components.

Let’s start simple.

We are going to need to do some bit shifting in order to get the 20-bit components to the left of the 8-bit component, which needs to be on the far right.

So, let’s just work with 129 (8-bits) and -42 (20-bits):

>>> bits(-42 << 8 | 129, 48)
'111111111111111111111111111111111101011010000001'

As you can hopefully see, the 129 takes up the right-most 8 bits. The 129 takes up the rest.

Now, the tricky part is getting the 20 bits from 54038 in there without messing up what we’ve already done. Without any more messing around, here’s the solution diagrammed:

Start with bits(54038):

>>> bits(54038)  # showing prettier result for the purpose of illustration
00000000 00000000 00000000 00001111 11111111 10010001

Now shift left 28 bits (<<):

>>> bits(54038 << 28, 48)
00001101 00110001 01100000 00000000 00000000 00000000

Now the tricky part, which makes the next step possible:

>>> bits(54038 << 28 ^ 268435455, 48)
00001101 00110001 01101111 11111111 11111111 11111111

What we just did was a “bitwise exclusive or (^)” using a mask. In this case the mask was 268435455. What’s so special about that number? bits(268435455, 21) shows that in binary, that number is 21 ones in a row. Try removing the bitwise exclusive or using the mask and you’ll see things go bonkers in a hurry.

Now, lets prepare the -42 component:

>>> bits(-42, 48)
11111111 11111111 11111111 11111111 11111111 11010110

Shift it over 8 to make room for the 129 later:

>>> bits(-42 << 8, 48)
11111111 11111111 11111111 11111111 11010110 00000000

Now, add that result into what we had previously using a “bitwise and (&)”:

>>> bits(54038 << 28 ^ 268435455 & -42 << 8, 48)
00001101 00110001 01101111 11111111 11010110 00000000

Can you see both numbers in that structure? Now, add in the 129 on the end (or the beginning if you’re thinking in binary) using a “bitwise or (|)”:

>>> bits(54038 << 28 ^ 268435455 & -42 << 8 | 129, 48)
00001101 00110001 01101111 11111111 11010110 10000001

Phew! Now, we really don’t need the bits function anymore:

>>> result = 54038 << 28 ^ 268435455 & -42 << 8 | 129

Now it was just a matter of creating a function that accepted the 3 numeric inputs and calculated the result using the above operators and mask. The last step was to output the raw bytes of the result:

>>> import struct
>>> b = bytearray(struct.pack('q', result))[:6]

And now, one last sanity check:

# working left to right:
# 00001101 00110001 01101111 11111111 11010110 10000001

>>> bits(b[0], 8)
'10000001'
>>> bits(b[1], 8)
'11010110'
>>> bits(b[2], 8)
'11111111'
>>> bits(b[3], 8)
'01101111'
>>> bits(b[4], 8)
'00110001'
>>> bits(b[5], 8)
'00001101'

As expected, we have 6 bytes that match up with our expected bit array. I must admit that it took quite the reading of the python bitwise operator docs as well as a lot of expirementation once I had created the bits function to find the right order of operations. But my understanding of bitwise operations is much more solid as a result. I’ve posted my bits function as a gist. Enjoy!

def bits(number, size_in_bits):
    """
    The bin() function is *REALLY* unhelpful when working with negative numbers.
    It outputs the binary representation of the positive version of that number
    with a '-' at the beginning. Woop-di-do. Here's how to derive the two's-
    complement binary of a negative number:

        complement(bin(+n - 1))

    `complement` is a function that flips each bit. `+n` is the negative number
    made positive.

    >>> bits(3, 4)
    '0011'
    >>> bits(129, 16)
    '0000000011111111'
    >>> bits(-42, 8)
    '11010110'

    """
    if number < 0:
        return compliment(bin(abs(number) - 1)[2:]).rjust(size_in_bits, '1')
    else:
        return bin(number)[2:].rjust(size_in_bits, '0')


def compliment(value):
    return ''.join(COMPLEMENT[x] for x in value)


COMPLEMENT = {'1': '0', '0': '1'}

-Michael Whatcott