Why are types always a certain size no matter the value?

asd 06/12/2018. 13 answers, 2.717 views
c++

Implementations might differ in the actual sizes of types, but on most, types like unsigned int and float are always 4 bytes. But why does a type always occupy a certain amount of memory no matter its value? For example, if I created the following integer with the value of 255

int myInt = 255;

Then myInt would occupy 4 bytes with my compiler. However, the actual value, 255, can be represented with only 1 byte, so why would myInt not just occupy 1 byte of memory? Or, to ask it in a more general way: why does a type have only one size associated with it when the space required to represent the value might be smaller than that size?
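For what it's worth, a quick check (just a small sketch; the exact number is implementation-defined, but 4 is what I see) shows that sizeof reports the same size no matter what value is stored:

#include <climits>
#include <iostream>

int main() {
    int small = 255;      // would fit in one byte
    int large = INT_MAX;  // needs the full width of the type
    std::cout << sizeof small << ' ' << sizeof large << '\n'; // prints "4 4" here
}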

13 Answers


SergeyA 06/12/2018.

Because types fundamentally represent storage, and they are defined in terms of the maximum value they can hold, not the current value.

A very simple analogy would be a house - a house has a fixed size, regardless of how many people live in it, and there is also a building code which stipulates the maximum number of people who can live in a house of a certain size.

However, even if a single person is living in a house which can accommodate 10, the size of the house is not going to be affected by the current number of occupants.


Useless 06/13/2018.

The compiler is supposed to produce assembler (and ultimately machine code) for some machine, and generally C++ tries to be sympathetic to that machine.

Being sympathetic to the underlying machine means roughly: making it easy to write C++ code which will map efficiently onto the operations the machine can execute quickly. So, we want to provide access to the data types and operations that are fast and "natural" on our hardware platform.

Concretely, consider a specific machine architecture. Let's take the current Intel x86 family.

The Intel® 64 and IA-32 Architectures Software Developer’s Manual vol 1 (link), section 3.4.1 says:

The 32-bit general-purpose registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:

• Operands for logical and arithmetic operations

• Operands for address calculations

• Memory pointers

So, we want the compiler to use these EAX, EBX etc. registers when it compiles simple C++ integer arithmetic. This means that when I declare an int, it should be something compatible with these registers, so that I can use them efficiently.

The registers are always the same size (here, 32 bits), so my int variables will always be 32 bits as well. I'll use the same layout (little-endian) so that I don't have to do a conversion every time I load a variable value into a register, or store a register back into a variable.

Using godbolt we can see exactly what the compiler does for some trivial code:

int square(int num) {
    return num * num;
}

compiles (with GCC 8.1 and -fomit-frame-pointer -O3 for simplicity) to:

square(int):
  imul edi, edi
  mov eax, edi
  ret

this means:

  1. the int num parameter was passed in register EDI, meaning it's exactly the size and layout Intel expect for a native register. The function doesn't have to convert anything
  2. the multiplication is a single instruction (imul), which is very fast
  3. returning the result is simply a matter of copying it to another register (the caller expects the result to be put in EAX)

Edit: we can add a relevant comparison to show the difference using a non-native layout makes. The simplest case is storing values in something other than native width.

Using godbolt again, we can compare a simple native multiplication

unsigned mult (unsigned x, unsigned y)
{
    return x*y;
}

mult(unsigned int, unsigned int):
  mov eax, edi
  imul eax, esi
  ret

with the equivalent code for a non-standard width

struct pair {
    unsigned x : 31;
    unsigned y : 31;
};

unsigned mult (pair p)
{
    return p.x*p.y;
}

mult(pair):
  mov eax, edi
  shr rdi, 32
  and eax, 2147483647
  and edi, 2147483647
  imul eax, edi
  ret

All the extra instructions are concerned with converting the input format (two 31-bit unsigned integers) into the format the processor can handle natively. If we wanted to store the result back into a 31-bit value, there would be another one or two instructions to do this.

This extra complexity means you'd only bother with this when the space saving is very important. In this case we're only saving two bits compared to using the native unsigned or uint32_t type, which would have generated much simpler code.


Martin York 06/12/2018.

It is an optimization and simplification.

You can either have fixed-size objects, storing just the value,
or variable-size objects, storing both the value and its size.

Fixed sized objects

The code that manipulates numbers does not need to worry about size. You can assume you always use 4 bytes, which makes the code very simple.

Dynamic sized objects

The code that manipulates numbers must understand, when reading a variable, that it has to read both the value and the size, and then use the size to make sure all the high bits in the register are zeroed out.

When placing the value back in memory, if the value has not exceeded its current size, you simply write it back. But if the value has shrunk or grown, you need to move the object to another location in memory to make sure it does not overflow. Now you have to track the position of that number (as it can move if it grows too large for its current size), and you also need to track all the unused locations so they can potentially be reused.
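As a rough illustration (using a hypothetical VarInt type invented for this answer, not anything the language provides), every store into a variable-size number would carry bookkeeping along these lines:

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical variable-width integer slot: the program has to remember
// where the bytes live and how many of them the current value uses.
struct VarInt {
    std::uint8_t* bytes = nullptr; // storage location (may move when the value grows)
    std::size_t   size  = 0;       // how many bytes the current value needs
};

std::size_t bytes_needed(std::uint64_t v) {
    std::size_t n = 1;
    while (v >>= 8) ++n;
    return n;
}

void assign(VarInt& x, std::uint64_t v) {
    std::size_t need = bytes_needed(v);
    if (need != x.size) {
        // The value grew or shrank: the storage has to be resized (and possibly
        // moved), and anything holding the old address would have to be updated.
        x.bytes = static_cast<std::uint8_t*>(std::realloc(x.bytes, need));
        x.size  = need;
    }
    std::memcpy(x.bytes, &v, need); // assumes a little-endian host; error
                                    // handling and freeing are left out
}

Every branch and allocation above is work that a plain fixed-size int never has to do.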

Summary

The code generated for fixed size objects is a lot simpler.

Note

Compression uses the fact that 255 will fit into one byte. There are compression schemes for storing large data sets that will actively use different size values for different numbers. But since this is not live data you don't have the complexities described above. You use less space to store the data at a cost of compressing/de-compressing the data for storage.


NO_NAME 06/12/2018.

Because it would be very complicated and computationally heavy to have simple types with dynamic sizes; I'm not sure it would even be possible.
The computer would have to check how many bits the number takes after every change of its value, which would mean quite a lot of additional operations. And it would be much harder to perform calculations when you don't know the sizes of variables at compile time.

To support dynamic sizes of variables, the computer would also have to remember how many bytes a variable currently has, which ... would require additional memory to store that information. And this information would have to be analyzed before every operation on the variable to choose the right processor instruction.

To better understand how the computer works and why variables have constant sizes, learn the basics of assembly language.

I suppose it would be possible to achieve something like this with constexpr values. However, it would make the code less predictable for the programmer. Some compiler optimizations may already do something like this, but they hide it from the programmer to keep things simple.

I have described here only the problems that concern the performance of a program; I have omitted all the problems that would have to be solved to actually save memory by reducing the sizes of variables. Honestly, I don't think it is even possible.


In conclusion, using variables smaller than declared makes sense only if their values are known at compile time, and it is quite probable that modern compilers do that. In other cases it would cause too many hard or even unsolvable problems.


Bill K 06/12/2018.

Java uses classes called "BigInteger" and "BigDecimal" to do exactly this. You can easily do it yourself in pretty much any language if you want.

The reason most don't? Performance. Your most highly performant languages can't afford to go expanding a variable in the middle of some looping operation--it would be very non-deterministic.

Also for some situations, packed values are pretty much the ONLY type of value you would use. For example, a music/video packet being streamed to your computer might spend a bit to specify if the next value is 2 bytes or 4 bytes as a size optimization.
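As a sketch of that idea (the packet layout here is made up purely for illustration), the reader has to branch on the width flag before it even knows how many bytes to consume:

#include <cstdint>

// Hypothetical header: bit 0 of the first byte says whether the size field
// that follows is 2 or 4 bytes, stored little-endian.
std::uint32_t read_size(const std::uint8_t* packet) {
    bool wide = packet[0] & 0x01;                  // the bit "spent" on the width flag
    std::uint32_t size = packet[1] | (std::uint32_t(packet[2]) << 8);
    if (wide)
        size |= (std::uint32_t(packet[3]) << 16) | (std::uint32_t(packet[4]) << 24);
    return size;
}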

Once it's on your computer, memory is cheap but the complication of variable size is not; that's really the only reason.


mtraceur 06/12/2018.

Because in a language like C++, a design goal is that simple operations compile down to simple machine instructions.

All mainstream CPU instruction sets work with fixed-width types, and if you want to do variable-width types, you have to do multiple machine instructions to handle them.

As for why the underlying computer hardware is that way: It's because it's simpler, and more efficient for many cases (but not all).

Imagine the computer as a piece of tape:

| xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | ...

If you simply tell the computer to look at the first byte on the tape, xx, how does it know whether or not the type stops there, or proceeds on to the next byte? If you have a number like 255 (hexadecimal FF) or a number like 65535 (hexadecimal FFFF) the first byte is always FF.

So how do you know? You have to add additional logic, and "overload" the meaning of at least one bit or byte value to indicate that the value continues to the next byte. That logic is never "free": either you emulate it in software, or you add a bunch of additional transistors to the CPU to do it.
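A minimal sketch of that "continuation bit" idea (LEB128-style, just one of several possible encodings) shows the extra logic involved:

#include <cstdint>
#include <vector>

// Encode an unsigned value 7 bits at a time; the high bit of each byte
// signals "the value continues in the next byte".
std::vector<std::uint8_t> encode_vlq(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    do {
        std::uint8_t byte = value & 0x7F; // low 7 bits of the value
        value >>= 7;
        if (value != 0)
            byte |= 0x80;                 // continuation bit: more bytes follow
        out.push_back(byte);
    } while (value != 0);
    return out;
}

// encode_vlq(127) takes 1 byte, encode_vlq(255) takes 2, encode_vlq(65535) takes 3.

A fixed-width store, by contrast, is a single move instruction.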

The fixed-width types of languages like C and C++ reflect that.

It doesn't have to be this way, and more abstract languages which are less concerned with mapping to maximally efficient code are free to use variable-width encodings (also known as "Variable Length Quantities" or VLQ) for numeric types.

Further reading: If you search for "variable length quantity" you can find some examples of where that kind of encoding is actually efficient and worth the additional logic. It's usually when you need to store a huge amount of values which might be anywhere within a large range, but most values tend towards some small sub-range.


Note that if a compiler can prove that it can get away with storing the value in a smaller amount of space without breaking any code (for example it's a variable only visible internally within a single translation unit), and its optimization heuristics suggest that it'll be more efficient on the target hardware, it's entirely allowed to optimize it accordingly and store it in a smaller amount of space, so long as the rest of the code works "as if" it did the standard thing.

But when the code has to inter-operate with other code that might be compiled separately, sizes have to stay consistent, or every piece of code has to follow the same convention.

Because if it's not consistent, there's this complication: What if I have int x = 255; but then later in the code I do x = y? If int could be variable-width, the compiler would have to know ahead of time to pre-allocate the maximum amount of space it'll need. That's not always possible, because what if y is an argument passed in from another piece of code that's compiled separately?


supercat 06/12/2018.

Computer memory is subdivided into consecutively-addressed chunks of a certain size (often 8 bits, and referred to as bytes), and most computers are designed to efficiently access sequences of bytes that have consecutive addresses.

If an object's address never changes within the object's lifetime, then code given its address can quickly access the object in question. An essential limitation with this approach, however, is that if an object is assigned address X, and then another object is assigned address Y which is N bytes away, then X will not be able to grow larger than N bytes within the lifetime of Y, unless either X or Y is moved. In order for X to move, everything in the universe that holds X's address would need to be updated to reflect the new one, and likewise for Y to move. While it's possible to design a system to facilitate such updates (both Java and .NET manage it pretty well), it's much more efficient to work with objects that stay in the same location throughout their lifetime.


Matthieu M. 06/13/2018.

Then myInt would occupy 4 bytes with my compiler. However, the actual value, 255 can be represented with only 1 byte, so why would myInt not just occupy 1 byte of memory?

This is known as variable-length encoding; there are various encodings defined, for example VLQ. One of the most famous, however, is probably UTF-8: UTF-8 encodes code points using a variable number of bytes, from 1 to 4.

Or the more generalized way of asking: Why does a type have only one size associated with it when the space required to represent the value might be smaller than that size?

As always in engineering, it's all about trade-offs. There is no solution which has only advantages, so you have to balance advantages and trade-offs when designing your solution.

The design which was settled on was to use fixed-size fundamental types, and the hardware/languages just flowed down from there.

So, what is the fundamental weakness of variable encoding, which caused it to be rejected in favor of more memory hungry schemes? No Random Addressing.

What is the index of the byte at which the 4th code point starts in a UTF-8 string?

It depends on the values of the previous code points; a linear scan is required.
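As a sketch (ignoring invalid UTF-8 entirely), answering that question means walking every byte that comes before:

#include <cstddef>
#include <string>

// Byte offset at which the n-th (0-based) code point of a UTF-8 string starts.
// Continuation bytes have the form 10xxxxxx, so only lead bytes are counted.
std::size_t code_point_offset(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) { // lead byte
            if (seen == n) return i;
            ++seen;
        }
    }
    return s.size(); // fewer than n+1 code points
}

With a fixed-size element type, the same lookup is just n * sizeof(element), no loop required.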

Surely there are variable-length encoding schemes which are better at random-addressing?

Yes, but they are also more complicated. If there's an ideal one, I've never seen it yet.

Does Random Addressing really matter anyway?

Oh YES!

The thing is, any kind of aggregate/array relies on fixed-size types:

  • Accessing the 3rd field of a struct? Random Addressing!
  • Accessing the 3rd element of an array? Random Addressing!

Which means you essentially have the following trade-off:

Fixed size types OR Linear memory scans


Davislor 06/12/2018.

There are objects in the C++ standard library that in some sense have variable size, such as std::vector. However, these all dynamically allocate the extra memory they will need. If you take sizeof(std::vector<int>), you will get a constant that has nothing to do with the memory managed by the object, and if you allocate an array or structure containing std::vector<int>, it will reserve this base size rather than putting the extra storage in the same array or structure. There are a few pieces of C syntax that support something like this, notably variable-length arrays and structures, but C++ did not choose to support them.
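A quick way to see this (the printed values are typical for a 64-bit implementation, not guaranteed by the standard):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    std::cout << sizeof v << '\n'; // often 24: roughly a pointer plus size and capacity
    v.resize(1'000'000);
    std::cout << sizeof v << '\n'; // the same number: the million ints live in
                                   // separately allocated heap storage
}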

The language standard defines object size that way so that compilers can generate efficient code. For example, if int happens to be 4 bytes long on some implementation, and you declare a as a pointer to or array of int values, then a[i] translates into the pseudocode, “dereference the address a + 4×i.” This can be done in constant time, and is such a common and important operation that many instruction-set architectures, including x86 and the DEC PDP machines on which C was originally developed, can do it in a single machine instruction.

One common real-world example of data stored consecutively as variable-length units is strings encoded as UTF-8. (However, the underlying type of a UTF-8 string to the compiler is still char and has width 1. This allows ASCII strings to be interpreted as valid UTF-8, and a lot of library code such as strlen() and strncpy() to continue to work.) The encoding of any UTF-8 codepoint can be one to four bytes long, and therefore, if you want the fifth UTF-8 codepoint in a string, it could begin anywhere from the fifth byte to the twentieth byte of the data. The only way to find it is to scan from the beginning of the string and check the size of each codepoint. If you want to find the fifth grapheme, you also need to check the character classes. If you wanted to find the millionth UTF-8 character in a string, you’d need to run this loop a million times! If you know you will need to work with indices often, you can traverse the string once and build an index of it—or you can convert to a fixed-width encoding, such as UCS-4. Finding the millionth UCS-4 character in a string is just a matter of adding four million to the address of the array.
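The "traverse the string once and build an index" option could look something like this sketch (invalid UTF-8 is not handled):

#include <cstddef>
#include <string>
#include <vector>

// Record the byte offset of every code point in one linear pass; afterwards,
// finding where code point k starts is a constant-time vector access.
std::vector<std::size_t> build_index(const std::string& utf8) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < utf8.size(); ++i)
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) // lead byte
            starts.push_back(i);
    return starts;
}

// auto idx = build_index(text);
// std::size_t millionth = idx[999999]; // O(1), paid for with the index's memory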

Another complication with variable-length data is that, when you allocate it, you either need to allocate as much memory as it could ever possibly use, or else dynamically reallocate as needed. Allocating for the worst case could be extremely wasteful. If you need a consecutive block of memory, reallocating could force you to copy all the data over to a different location, but allowing the memory to be stored in non-consecutive chunks complicates the program logic.

So, it’s possible to have variable-length bignums instead of fixed-width short int, int, long int and long long int, but it would be inefficient to allocate and use them. Additionally, all mainstream CPUs are designed to do arithmetic on fixed-width registers, and none have instructions that directly operate on some kind of variable-length bignum. Those would need to be implemented in software, much more slowly.

In the real world, most (but not all) programmers have decided that the benefits of UTF-8 encoding, especially compatibility, are important, and that we so rarely care about anything other than scanning a string from front to back or copying blocks of memory that the drawbacks of variable width are acceptable. We could use packed, variable-width elements similar to UTF-8 for other things. But we very rarely do, and they aren't in the standard library.


codekaizer 06/12/2018.

Why does a type have only one size associated with it when the space required to represent the value might be smaller than that size?

Primarily because of alignment requirements.

As per basic.align/1:

Object types have alignment requirements which place restrictions on the addresses at which an object of that type may be allocated.
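A small illustration of what this means in practice (the printed values are what common 64-bit platforms give; the standard only requires them to be consistent, not these exact numbers):

#include <iostream>

struct Padded {
    char c; // 1 byte...
    int  i; // ...but i must start at an address suitably aligned for int,
};          // so the compiler inserts padding after c

int main() {
    std::cout << alignof(int)    << '\n'; // commonly 4
    std::cout << sizeof(Padded)  << '\n'; // commonly 8, not 5, because of the padding
    std::cout << alignof(Padded) << '\n'; // commonly 4: the struct takes the
                                          // strictest alignment of its members
}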

Think of a building that has many floors, and each floor has many rooms.
Each room is a fixed size, capable of holding N people or objects.
With the room size known beforehand, the structural components of the building can be well-structured.

If the rooms are not aligned, then the building skeleton won't be well-structured.


The short answer is: Because the C++ standard says so.

The long answer is: what you can do on a computer is ultimately limited by hardware. It is, of course, possible to encode an integer into a variable number of bytes for storage, but reading it back would then either require special CPU instructions to be performant, or you could implement it in software, which would be awfully slow. Fixed-size operations are available in the CPU for loading values of predefined widths; there are none for variable widths.

Another point to consider is how computer memory works. Let's say your integer type could take up anywhere from 1 to 4 bytes of storage. Suppose you store the value 42 into your integer: it takes up 1 byte, and you place it at memory address X. Then you store your next variable at location X+1 (I'm not considering alignment at this point), and so on. Later you decide to change your value to 6424.

But this doesn't fit into a single byte! So what do you do? Where do you put the rest? You already have something at X+1, so can't place it there. Somewhere else? How will you know later where? Computer memory does not support insert semantics: you can't just place something at a location and push everything after it aside to make room!

Aside: What you're talking about is really the area of data compression. Compression algorithms exist to pack everything tighter, so at least some of them will consider not using more space for your integer than it needs. However, compressed data is not easy to modify (if possible at all) and just ends up being recompressed every time you make any changes to it.


Cort Ammon 06/13/2018.

There are pretty substantial runtime performance benefits from doing this. If you were to operate on variable size types, you would have to decode each number before doing the operation (machine code instructions are typically fixed width), do the operation, then find a space in memory big enough to hold the result. Those are very difficult operations. It's much easier to simply store all of the data slightly inefficiently.

This is not always how it is done. Consider Google's Protobuf protocol. Protobufs are designed to transmit data very efficiently. Decreasing the number of bytes transmitted is worth the cost of additional instructions when operating on the data. Accordingly, protobufs use an encoding which encodes integers in 1, 2, 3, 4, or 5 bytes, and smaller integers take fewer bytes. Once the message is received, however, it is unpacked into a more traditional fixed-size integer format which is easier to operate on. It's only during network transmission that they use such a space-efficient variable length integer.


Buurman 06/13/2018.

There are a few reasons. One is the added complexity for handling arbitrary-sized numbers and the performance hit this gives because the compiler can no longer optimize based on the assumption that every int is exactly X bytes long.

A second one is that storing simple types this way means they need an additional byte to hold the length. So, a value of 255 or less actually needs two bytes in this new system, not one, and in the worst case you now need 5 bytes instead of 4. This means that the performance win in terms of memory used is less than you might think and in some edge cases might actually be a net loss.

A third reason is that computer memory is generally addressable in words, not bytes (but see the footnote). Words are a multiple of bytes, usually 4 bytes on 32-bit systems and 8 on 64-bit systems. You usually can't read an individual byte; you read a word and extract the nth byte from that word. This means both that extracting individual bytes from a word takes a bit more effort than just reading the entire word, and that it is very efficient if the entire memory is evenly divided into word-sized (i.e., 4-byte) chunks. If you have arbitrarily sized integers floating around, you might end up with one part of an integer in one word and another part in the next word, necessitating two reads to get the full integer.
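As a sketch of that "extract the nth byte" step, the work amounts to a shift and a mask on top of the word load (illustrative only; real hardware does this in its load unit):

#include <cstdint>
#include <iostream>

int main() {
    std::uint32_t word = 0x11223344;                    // one 4-byte word
    for (unsigned n = 0; n < 4; ++n) {
        std::uint8_t byte = (word >> (8 * n)) & 0xFF;   // shift + mask out byte n
        std::cout << std::hex << unsigned(byte) << ' ';
    }
    std::cout << '\n';                                  // prints: 44 33 22 11
}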

Footnote: To be more precise, while memory is addressed in bytes, most systems ignore the 'uneven' byte addresses when fetching; i.e., addresses 0, 1, 2 and 3 all read the same word, addresses 4, 5, 6 and 7 read the next word, and so on.

On an unrelated note, this is also why 32-bit systems had a maximum of 4 GB of memory. The registers used to address locations in memory are usually large enough to hold a word, i.e. 4 bytes, which has a maximum value of (2^32)-1 = 4294967295; 4294967296 bytes is 4 GB.
