Pointers and Other Data Types
Every once in a while, I come across something in someone’s code that is technically correct but odd in a way that makes me think many programmers out there are not really comfortable with pointers or the relationships between data types. An example I’ve run across a few times is something like:
char* myString;
int myFunction(char* text);
...
int blah = myFunction((char*)myString);
In case you’re wondering what’s odd, myString is already of type “char*”, so the typecast is redundant and misleading. Someone only looking at the function call might look at it and wonder “Why is myString not a char* in the first place?”.
That brings me to something I’ve heard several people say before: that typecasting can take a lot of time. Actually, typecasting one pointer type to another pointer type takes literally no time on mainstream platforms, because under the hood they are the same thing: an address. Typecasting a signed integer to an unsigned integer of the same size or vice versa also takes no time for the same reason. If the compiler is smart, typecasting an integer type to a smaller integer type should also take no time, and the opposite is at most one instruction (a sign or zero extension), typically taking one clock cycle or less. Typecasting between floating point types can take slightly longer, but typically still less than 10 clock cycles. Sometimes, though, you need to operate on the actual representation of a floating point number, which can be gotten like “*((DWORD*)&myFloat)”, which, as confusing as it looks, also takes no time (or almost no time in some situations). Strictly speaking, that cast breaks C’s aliasing rules, so an optimizing compiler is allowed to mishandle it; copying the bytes with memcpy is the portable way to say the same thing, and it usually compiles to the same code.
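To make the “no time” casts concrete, here is a minimal sketch. The helper names float_bits and as_unsigned are made up for illustration, and the code assumes a 32-bit IEEE 754 float, which is what nearly every current compiler provides. It reads the representation with memcpy instead of a pointer cast; that is a portable way to get the same bits, not something the original code necessarily did.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw bit pattern of a float.
   Assumes float is 32 bits wide; the assert documents that assumption. */
uint32_t float_bits(float value)
{
    uint32_t bits;
    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);  /* copies 4 bytes, no conversion */
    return bits;
}

/* Hypothetical helper: a same-size signed-to-unsigned cast.
   The value is reduced modulo 2^N, which on two's complement
   hardware means the bits are simply reused unchanged. */
unsigned int as_unsigned(int value)
{
    return (unsigned int)value;
}
```

With an IEEE 754 float, float_bits(1.0f) is 0x3F800000 (sign 0, biased exponent 127, mantissa 0), and as_unsigned(-1) is the largest unsigned int.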
A pointer is an unsigned integer variable with the same size as the current CPU address size, since it holds an address. If you’re making a 64-bit program, it’s 8 bytes (effectively a QWORD, or unsigned long long); if you’re making a 32-bit program, it’s 4 bytes (effectively a DWORD, or unsigned long). However, C syntax can make this fairly difficult to see. For example:
long* pArray;
long* pArray2 = pArray+16;
long* pArray3 = &pArray[16];
In this case, pArray2 is an unsigned integer with a value that is actually 64 more than pArray, not 16 more. That’s because the C compiler assumes that you mean to get the address of the long at index 16 into the array, not offset 16, and since a long is 4 bytes here (as on Windows and 32-bit systems; it is 8 bytes on most 64-bit Unix-like systems), index 16 is 64 bytes past the beginning of the array. As such, pArray3 has the same value as pArray2. This, though, leads to some more interesting examples:
long elementA = pArray[16];
long elementB = 16[pArray];
Both of these lines are valid, and, odd as the second one looks, they always have the same value. C defines x[y] as *(x + y), and the addition works the same way whichever operand is the pointer, so 16[pArray] is just *(16 + pArray), scaled by the size of a long exactly like pArray[16]. However, if you find yourself doing the latter, please step away from the keyboard and rethink what you’re doing. At least it’s well-defined, unlike:
a += a+++++a+a++;
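That line is meant to be read as “a += (a++)+(++a)+(a++);”. (Written without the parentheses, it won’t actually compile, because the compiler greedily splits “a+++++a” into “a ++ ++ + a”.) The parenthesized form is accepted by both C and Java. In C it modifies a several times with no defined ordering, which is undefined behavior, so different C compilers can and do give different results; Java defines the evaluation order, so it gives one specific result that need not match any of them. Now I’m just rambling, though, so let’s get back on track. I saw a line of code a couple of weeks ago that went something like: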
BYTE* data;
...
WORD a = (WORD)(data[6] & 0xFF) + (WORD)(((data[7] & 0xFF)<<8)&0xFFFF);
I actually didn’t realize immediately what the big line was doing (though the real one was even more complicated), because getting a WORD (16 bits) from an array of BYTEs (8 bits) is as simple as any one of the following three lines.
a = data[6] + (data[7]<<8);
a = ((WORD*)data)[3];
a = *((WORD*)(data+6));
The first one might be more familiar, and it’s the most similar to the original, but the latter two are actually more intuitive if you’re thinking about what actually needs to happen. In memory, there are 2 consecutive bytes, with the least significant one first (unless you’re on a big-endian system, which still exist, mostly in embedded and networking hardware). All you need to do is move those two bytes into variable “a”. That’s what either of the last two lines does. The first line gets the value of one byte, gets the value of the other byte and shifts it over a byte, adds the two together to get the number that you had in the first place, and then moves it into the variable. The trade-off is that the first line gives the same answer on any CPU, while the pointer casts depend on the byte order and assume the CPU tolerates a WORD at that address (x86 does). In C, the latter two may look a bit more confusing, but in assembly language, they are much simpler, whereas if you wrote assembly that does something like the first line, people might start to giggle.
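The two approaches can be put side by side in a short sketch. The BYTE and WORD typedefs and both helper names are assumptions made for illustration (the post does not show its own definitions), and memcpy stands in for the pointer cast so that the example does not depend on alignment or aliasing rules; it still just moves the two bytes as they sit in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t  BYTE;  /* assumed: 8-bit unsigned, as in the post  */
typedef uint16_t WORD;  /* assumed: 16-bit unsigned, as in the post */

/* Hypothetical helper: builds the WORD arithmetically.
   Always treats data[offset] as the least significant byte,
   whatever the CPU's own byte order is. */
WORD read_word_shift(const BYTE *data, size_t offset)
{
    assert(data != NULL);
    return (WORD)(data[offset] | (data[offset + 1] << 8));
}

/* Hypothetical helper: moves the two bytes as they sit in memory.
   The result follows the CPU's byte order, so it matches
   read_word_shift only on a little-endian machine. */
WORD read_word_copy(const BYTE *data, size_t offset)
{
    WORD value;
    assert(data != NULL);
    memcpy(&value, data + offset, sizeof value);
    return value;
}
```

If bytes 6 and 7 of a buffer hold 0x34 and 0x12, read_word_shift(buffer, 6) returns 0x1234 everywhere, and read_word_copy(buffer, 6) returns the same value on a little-endian CPU.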
Another area of confusion comes with complicated types and the use of “const”, because many are not familiar with how to read the type.
char const * * const myVariable[];
myVariable is an array of constant pointers to non-constant pointers to constant chars. The type is read starting at the variable name, going to the right (for the array brackets), then to the left (for the rest). Adding to the confusion, it is also valid to place the first “const” before “char” in the above line, and it would still be an “array of … pointers to constant chars”. What’s more interesting is that since the name of the array is interpreted as its address in most expressions, and its address can’t change, myVariable also behaves like a constant pointer to constant pointers to non-constant pointers to constant chars. Some compilers get very messed up in different ways when trying to deal with typecasts and “const” in different places in a type, but there are usually workarounds to those bugs. It gets weirder with multi-dimensional arrays, though. Take, for example:
char my2DArray[123][456];
my2DArray is an array of 123 arrays of 456 chars. There’s nothing particularly exciting about that. However, it’s partly equivalent to a constant pointer to constant pointers to chars. As such, my2DArray[3] is a constant pointer to chars. The key difference is that my2DArray[3] is not a pointer, in that it is not stored in memory; it is just an address. “my2DArray[3]” is actually equivalent to “((const char*)my2DArray)+3*456”. There is no array of pointers, only a single array of 123*456 chars with some fancy syntax. That’s why if you pass a multidimensional array as a parameter (which would probably be silly) or pass a pointer to one (which would be less silly), you need to specify the size of all dimensions except the first (though you can have the size be the value of another parameter). The compiler needs to know what to multiply by when indexing into it. To sidestep the whole issue, you could use an actual array of pointers to chars or use a single-dimensional array and do the indexing yourself.
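As a rough sketch of what the compiler does with a 123-by-456 array of chars, the two helpers below read the same element in both styles. The names ROWS, COLS, get_cell and get_cell_flat are invented for this example; only the sizes come from the text above.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 123, COLS = 456 };  /* sizes of my2DArray in the text */

/* Hypothetical helper: the parameter is really "pointer to an array of
   COLS chars". The first dimension may be left out; COLS may not, because
   the compiler multiplies the row index by it. */
char get_cell(char grid[][COLS], size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    return grid[row][col];
}

/* Hypothetical helper: the same lookup with the indexing done by hand
   on a pointer to the first char of the block. */
char get_cell_flat(const char *first, size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    return first[row * COLS + col];
}
```

Given char my2DArray[123][456], the calls get_cell(my2DArray, 3, 5) and get_cell_flat(&my2DArray[0][0], 3, 5) read the same char, because row 3 starts 3*456 bytes past the beginning of the block.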
I find that in most of these cases, it’s easiest for me to think of things in terms of how they will appear in memory, since then it’s just a matter of finding the right addresses and moving what needs to be moved, but that takes some time to get used to. Either way, there are a lot of interesting quirks with pointers and other data types.