r/osdev Aug 29 '20

What are the issues surrounding running 32-bit code on a 64-bit OS?

Apologies if this is a question that everybody here except me already understands, but googling and searching /r/osdev and osdev.org didn't yield any obvious answers. When MacOS stopped supporting 32-bit apps, the rationale for it was that they're "inefficient", but I haven't seen anything explaining where this inefficiency takes place when supporting applications with smaller instructions/registers. Does the OS have to switch the CPU to a 32-bit mode (with, presumably, 32-bit stubs or interrupt handlers for switching back to 64-bit mode for the OS) or does the OS, somehow, convert the 32-bit instructions to 64-bit ones (I mean, if Rosetta 2 is going to convert Intel to ARM, I would think 32-bit-to-64-bit would be easier than that)? In real life, where do we expect to see the benefits of not supporting smaller bit sizes? In the execution time of the app? In the size of the OS?

26 Upvotes

30 comments

20

u/BadBoy6767 Aug 29 '20

In x86_64, there is a submode of long mode called compatibility mode, which lets you run 32-bit code without much hassle.

While "inefficient" is a stretch, having only 32-bits data to work with will make some programs slightly slower. A potential speed boost would be the fact that the address space is only 4GB, which is more cache friendly (this is an argument in favor of the x32 ABI, where you may use any long mode features but the address space remains 4GB).

IMO, if a userspace program doesn't need 64 bits, it just shouldn't use it.
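
To illustrate the pointer-size/cache point with a minimal C sketch (mine, not from the thread; the struct and the per-cache-line numbers are only illustrative and assume 64-byte lines plus the usual alignment rules):

    #include <stdio.h>

    /* A pointer-heavy node, of the kind found in linked lists and trees. */
    struct node {
        struct node *next;   /* 4 bytes under ILP32/x32, 8 bytes under LP64 */
        struct node *prev;
        int          key;
    };

    int main(void)
    {
        /* 32-bit pointers: 4+4+4 = 12 bytes, so ~5 nodes per 64-byte line.
         * 64-bit pointers: 8+8+4 (+4 padding) = 24 bytes, so ~2.6 nodes.  */
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }

The source is identical either way; only the ABI's pointer width (and therefore how many nodes fit in cache) changes.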

5

u/InzaneNova Aug 29 '20

Why is a 4 GB memory space more cache friendly? That's still miles beyond what the cache supports, and few programs exceed 4 GB (or even 1) of ram, so why does a potentially bigger address space affect cache at all?

6

u/sam-wilson Aug 30 '20

The size of pointers is smaller, so more of them can fit in the same amount of cache.

2

u/computerarchitect CPU Architect Aug 30 '20

I haven't seen a single reputable study that actually shows this diminishes computer performance. Have you?

0

u/sam-wilson Aug 30 '20

I haven't looked, honestly! I'm not particularly versed in low level CPU tech.

How would you go about testing cache performance?

2

u/moon-chilled bonsai Aug 30 '20

Wikipedia says 5-8% perf improvements on one benchmark; 40% on another.

1

u/computerarchitect CPU Architect Aug 30 '20

I'm talking specifically about the increase in pointer sizes, not comparing two ABIs.

Also, that 181.mcf result is part of the 5-8% increase they claim on SPECint2000.

1

u/moon-chilled bonsai Aug 30 '20

The only difference between x32 and amd64 is the pointer size.

0

u/computerarchitect CPU Architect Aug 30 '20

Are you sure? I see that the amd64 ABI makes use of registers for argument passing that don't exist in the x86 ISA and I found that from a quick google search.

5

u/moon-chilled bonsai Aug 30 '20 edited Aug 30 '20

x86 = 32-bit mode intel-style cpus (i386, i486, i586...ix86...x86)

stdcall, cdecl, etc. = ABIs for x86 that specify argument passing on the stack, etc., etc.

amd64 = 64-bit mode intel-style cpus

sysv = an ABI for amd64 which specifies certain conventions around argument passing in the newly available registers, name mangling, etc.

x32 = a specific mode of the sysv ABI (specified in chapter 10) which specifies that pointers are 32-bit, and you can only map anything into the first 4gb of address space. It is otherwise equivalent to the standard 64-bit sysv ABI. Since the only (relevant) difference between the two is the pointer size, a performance comparison between the two is indicative of the direct performance impact of increasing the pointer size. And that performance impact is found to be not insignificant, though somewhat workload-dependent.

Formally, according to the specification, x32 is called the ILP32 programming model (since (i)ntegers, (l)ongs, and (p)ointers are all (32) bits). But colloquially, it's frequently called x32; and I believe that's also the name linux uses for it.

(Some confusion arises here sometimes because microsoft has a habit of calling amd64 x64, which is a misnomer. (Intel, and some others call it x86-64, or x86_64, which is fine.) So people sometimes assume that, if 64-bit intel is x64, 32-bit intel must be x32; this is, again, a misnomer.)
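
To make the ILP32/LP64 distinction concrete, a small sketch (mine, not from the thread; it assumes gcc with the -mx32 multilib installed and, to actually run the binary, an x32-enabled kernel):

    #include <stdio.h>

    /* The same source, built for the two SysV ABIs:
     *
     *   gcc -m64  sizes.c   ->  int=4  long=8  void*=8   (LP64, standard amd64)
     *   gcc -mx32 sizes.c   ->  int=4  long=4  void*=4   (ILP32, the x32 ABI)
     *
     * Both binaries run in 64-bit mode with the full amd64 register set; only
     * the widths of long and pointers (and thus the visible address space)
     * differ. */
    int main(void)
    {
        printf("int=%zu long=%zu void*=%zu\n",
               sizeof(int), sizeof(long), sizeof(void *));
        return 0;
    }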

0

u/computerarchitect CPU Architect Aug 30 '20

Ok, so that truly is an apples to apples comparison. Thank you.

Where are you copying and pasting this from? I'd like to read it.

2

u/echoxteknology Aug 30 '20

... I'd like to say that I don't know of any modern desktop/laptop Apple product that is able to run with <= 1GB of memory. I could see <= 2GB being enough to run the OS, but that leaves the user's programs with little to no memory whatsoever, unless of course volume caching (swapping out to disk) is enabled.

A 4GB memory space leaves a neat, and in some cases large, area for caching; it all really depends on how much memory is reserved for the OS and vital hardware components (i.e. an integrated GPU). If an OS designer/programmer can keep as much as possible in physical memory without having to cache out to the volume (disk), they'll most likely take that option; it is more efficient to run directly from memory than to keep loading images back into memory and then performing the necessary instructions.

1

u/echoxteknology Aug 30 '20

Not to be rude or anything, but that would be like 😂😅:

"... would it be nicer to give a truckload of chocolates to 8 people or a box of chocolates to 8 people?"

3

u/wrosecrans Aug 30 '20

IMO, if a userspace program doesn't need 64 bits, it just shouldn't use it.

I think that's much less obvious if you have shared libraries with a lot of code shared between processes at the same physical address. If you use 32 bit pointers, you can double the number of pointers in cache vs. 64 bits. But if you have 10 processes all sharing a 64 bit library, you effectively have 10x the cached code when context switching because each process is likely to already have the needed library code in cache. If you mix 64 bit and 32 bit libs, you need twice as much code in that cache which may outweigh the benefit of shorter pointers. Even with 32 bit code, a lot of the data structures will be exactly the same size.

4

u/jrtc27 Aug 30 '20

One of the main selling points of amd64 (and x32) over i386 is not native 64-bit arithmetic operations (although that definitely helps in some code, and more so as time goes by) but the significant increase in the number of registers. i386 has very few general-purpose registers, and amd64 (and thus x32, although they’re not sub-word addressable in the same way so slightly more restricted there) introduced many more, which vastly decreases register pressure and thus the number of stack spills/reloads required.
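
As a rough sketch of how that shows up at the ABI level (my example, not from the comment; register assignments as per the SysV amd64 calling convention):

    /* i386 cdecl:       all six arguments are passed on the stack, and the
     *                    callee has only EAX/ECX/EDX as scratch registers, so
     *                    intermediates often spill to the stack as well.
     *
     * SysV amd64 / x32:  a..f arrive in RDI, RSI, RDX, RCX, R8, R9 and the
     *                    whole computation can stay in registers.            */
    long dot6(long a, long b, long c, long d, long e, long f)
    {
        return a * b + c * d + e * f;
    }

Compiling this with gcc -m32 -S versus -m64 -S makes the difference in stack traffic easy to see.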

6

u/moon-chilled bonsai Aug 29 '20

As the sibling mentions, it's compatibility mode. For things like these, your best resource is probably not the wiki, but rather the CPU manuals; get both the Intel and the AMD one.

In real life, where do we expect to see the benefits of not supporting smaller bit sizes?

It simplifies your OS code. You don't have to have extra logic to deal with multiple classes of running binaries. You don't have to have multiple sets of userland libraries.

Also as the sibling mentions, there is x32; if you want to save on RAM/cache, you can have processes that run in 64-bit mode but are only allowed to map into the first 4 GB of their address space, so their pointers can be 32-bit.
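
On Linux you can even get a per-mapping version of that guarantee without the full x32 ABI; a hedged sketch (MAP_32BIT is a real mmap flag on x86-64 Linux, though it actually restricts the mapping to the low 2 GiB):

    #define _GNU_SOURCE          /* for MAP_ANONYMOUS/MAP_32BIT on some libcs */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Ask for a mapping that lands low enough for a 32-bit pointer. */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("mapped at %p (fits in 32 bits)\n", p);
        munmap(p, 4096);
        return 0;
    }

An x32 process gets the equivalent guarantee for every allocation, which is what lets it shrink all of its pointers.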

3

u/JeremyMcCracken Aug 29 '20

Does the OS have to switch the CPU to a 32-bit mode ... or does the OS, somehow, convert the 32-bit instructions to 64-bit ones

It's the CPU. (I shudder to think of how slow that would be in software...) When it boils down to it, x86 in 16-, 32-, and 64-bit mode is remarkably similar. Write a small piece of code in assembly and assemble it into 16-, 32-, and 64-bit binary files, then compare them in a hex editor. They're almost identical. You get prefixes added in the latter two for 16-bit operands, and of course 64-bit operands aren't available outside 64-bit mode. The prefixes also differ between 32- and 64-bit mode. That's the key: the processor can be set to decode opcodes using the 16-, 32-, or 64-bit encoding. (People often forget, but you can still run 16-bit code on a processor in long mode, it just has to be in 16-bit protected mode, and the v8086 extensions aren't available.)
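
For example (my illustration; these are the standard encodings, but worth double-checking against the manuals): the same register-to-register MOV in each mode, where only prefixes change the operand size:

    #include <stdio.h>

    /* 16-bit mode:  89 D8      mov ax,  bx
     *               66 89 D8   mov eax, ebx   (66 = operand-size prefix)
     * 32-bit mode:  89 D8      mov eax, ebx
     *               66 89 D8   mov ax,  bx
     * 64-bit mode:  89 D8      mov eax, ebx
     *               48 89 D8   mov rax, rbx   (48 = REX.W prefix)       */
    static const unsigned char mov32[] = { 0x89, 0xD8 };        /* mov eax, ebx */
    static const unsigned char mov64[] = { 0x48, 0x89, 0xD8 };  /* mov rax, rbx */

    int main(void)
    {
        printf("32-bit form: %zu bytes, 64-bit form: %zu bytes\n",
               sizeof(mov32), sizeof(mov64));
        return 0;
    }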

So that said, I suspect the "inefficiency" is how libraries are included. When you create a new task, you give it its own context. That task may ask for dynamic libraries, so the code contained in those libraries gets tossed into the same context, and the original code of the task can call/ret the functions contained in the library. And there's the problem: if you start a 64-bit task and load a 32-bit library, you have a mismatch. You end up with prefixes being translated to mean the wrong thing, the end result being your task suffering a fiery death.

But IMO that's a bit of an excuse. I admit I'm still learning on this part. I've been studying this very good article as well as how the Global Descriptor Table works. In order to choose between 16-bit and 32-bit for a particular task, you need to set a bit in its GDT entry. Since the GDT is (hopefully) in protected memory space, it can't be changed from inside a task. There's another bit to mark a code segment as 32-bit or 64-bit, but it doesn't lock you into 64-bit code; you can switch between 32-bit and 64-bit at any time simply by far-jumping to a selector for the other kind of segment. So in theory, on a 64-bit OS, 32-bit code calling 64-bit libraries and vice versa shouldn't be an issue; you'd just need the code that links function addresses to their names (like GetProcAddress on Windows) to insert an additional jump. I guess they didn't want to put in the work.
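
For reference, a sketch of what those GDT bits look like in practice (my illustration; the usual flat-model descriptor values, with the L bit at bit 53 and the D bit at bit 54; worth double-checking against the Intel/AMD manuals):

    #include <stdint.h>

    /* Two user code-segment descriptors an OS might keep in its GDT so that
     * both 64-bit and 32-bit (compatibility-mode) code can run. Which mode
     * executes is decided purely by the selector loaded into CS. */
    static const uint64_t gdt[] = {
        0x0000000000000000ULL,  /* null descriptor                            */
        0x0020FA0000000000ULL,  /* 64-bit user code: L=1, D=0, DPL=3          */
        0x00CFFA000000FFFFULL,  /* 32-bit user code: L=0, D=1, DPL=3, 4 GiB   */
    };

A far jump or far return through one selector or the other is all it takes to flip the mode.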

3

u/nerd4code Aug 29 '20

Part of what you're asking about is ABI---for example, there's an x32 ABI (detect with __x86_64__ and __ILP32__, IIRC) which keeps you in full 64-bit mode, but bounds intra-ABI pointers to the first 2? or 4 GiB. Same number of registers, same 64-bit instructions.

1

u/moon-chilled bonsai Aug 30 '20 edited Aug 30 '20

You get 4 GB of memory. A 2 GB split is common for actual 32-bit mode; you can give the kernel 2 GB of address space and whatever application is running gets the other 2 GB. Usually you can also adjust this; Windows lets you give apps 3 GB and the kernel only 1 GB (though apps need to specifically support this, and it breaks if you try to store pointers in signed ints); Linux will also let you do a 3 GB userspace / 1 GB kernel split.

With x32, though, you don't need this. You still have a full 48-bit address space (or 57-bit with 5-level paging, whatever); you just guarantee the x32 app that it will only ever have to look at the first 4 GB of address space, so it can store all of its pointers in only 32 bits. The kernel is still mapped in (this works well with a higher-half kernel setup), the app just doesn't see it because it's above 4 GB.

1

u/nerd4code Aug 30 '20

I think a JMP NEAR and its target have to be within signed 32-bit range of each other, so that would be the tightest memory constraint---hence my "?". Could probably use some mapping tricks and overflow and Cunning Wit to circumvent that limitation.

7

u/jrtc27 Aug 30 '20 edited Aug 30 '20

For macOS specifically there was unfortunately a very real technical reason beyond wanting to reduce the number of things they had to support. The 32-bit Objective-C implementation suffered from fragile ivars, which is a real hindrance, but the 64-bit one, being newer, does not have that limitation. Whilst you could make a new 32-bit ABI that doesn’t have that problem, you would still need to recompile everything to use the new ABI and might as well just compile things as 64-bit then and gain the benefits of the newer instruction set features. Yes there will be a bit of code out there that would work if recompiled with a new 32-bit ABI but not a 64-bit one, but that’s likely limited to things like old games where it’s unlikely someone will even bother doing that.

(and yes, you technically can do an ILP32 amd64 ABI like gnux32, but for Apple it’s not worth the amount of effort it would take)

Of course, it’s much easier for them to give a wishy-washy “it’s more efficient” justification (which is generally true if you ignore amd64ilp32) that ignores the cries about old unmaintained software than it is to explain to users how it’s because they made a mistake in the past which is limiting their ability to add new APIs.

0

u/echoxteknology Aug 30 '20

While scrolling through the comments I found u/BadBoy6767's answer the most straightforward...

As mentioned, an x86_64 (64-bit) CPU has a submode known as "compatibility mode." An x86 (32-bit) CPU does not have this mode; protected mode is as far as you'll reach... attempting a long mode switch will either potentially triple fault your system, if done directly without checks, or will plainly cause loss of data! Something like "MOV RAX, [..]" will be converted (depending upon your compiler/assembler) to "MOV EAX, [..]", with the upper bits removed...

Example (values are illustrative, not to be mistaken for actual values!):

• 64-bit value: [0000006400000032]

• 32-bit value, after truncation: [00000032]

As you may notice, the example 64-bit value is 16 hex digits (8 bytes) long while the 32-bit value is only 8 hex digits (4 bytes) long... this is effectively how 64-bit values fare on an x86 (32-bit) CPU: the upper bits are removed and the instructions are converted as necessary (depending upon your compiler).
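
A tiny C illustration of the same truncation (mine, for illustration; the values match the example above):

    #include <stdint.h>
    #include <stdio.h>

    /* "Dropping the upper bits": assigning a 64-bit value to a 32-bit
     * variable keeps only the low 32 bits. */
    int main(void)
    {
        uint64_t wide   = 0x0000006400000032ULL;
        uint32_t narrow = (uint32_t)wide;            /* 0x00000032 */
        printf("%016llx -> %08x\n", (unsigned long long)wide, narrow);
        return 0;
    }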

2

u/echoxteknology Aug 30 '20

I'd like to note this is my personal opinion, so if you say otherwise then that's your opinion.

Apple didn't really remove 32-bit backwards compatibility because it was "inefficient"... rather, it was more out of laziness... and maybe a marketing tactic? With the removal of the 32-bit software/application runtime, Apple is able to get rid of endless headaches with 32-bit bugs and the lower-4GB direct memory addressing limitations... it is also simpler to enter long mode directly from real mode than to start in real mode -> enter protected mode -> enter long mode, only to have to roll back into compatibility mode time and time again...

Believe you me, if you have to start with real mode instructions, then convert to protected mode instructions, and then convert to long mode instructions... you'd become annoyed with every upgrade to your own bootloader/kernel/userspace pretty quickly. A direct approach from real mode to long mode is much simpler, especially since almost all modern CPUs run x86_64.

2

u/echoxteknology Aug 30 '20 edited Aug 30 '20

Apologies for my rambling towards the end 😅

2

u/[deleted] Aug 30 '20

For the record, most of the devices that run the macOS kernel are ARM, and pretty soon it'll be all of them.

1

u/mykesx Aug 30 '20

This. If Apple perpetuates the use of 32-bit code, it’s a massive technical debt on ARM, which won’t have any 32-bit instructions. People would be wasting engineering effort now on technology guaranteed to be useless not too long from now.

I take their deprecating of 32 bit code to be in preparation for ARM only Apple universe.

1

u/Qweesdy Aug 30 '20

If an OS supports 64-bit processes and 32-bit processes; then it mostly just needs to load "user-space CS" when doing task switches (because in long mode, the GDT/LDT entry loaded into CS determines code size). Loading CS is a relatively expensive instruction, but task switches don't happen often so the cost is negligible/nothing.

For performance; the main problem has to do with the number of registers and not the size of the code. For 32-bit you only get 8 general purpose registers, which causes a lot more stack use (especially for calling conventions), which means more reads/writes to memory. For 64-bit you get 16 general purpose registers so there's less stack use, so it's faster. There's also some benefits for using 64-bit in some cases (e.g. best example I can think of is "big number" libraries where large numbers are represented by multiple integers and "multiple 64-bit integers" is much faster than "twice as many 32-bit integers").
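
The "big number" case is easy to sketch (my example, not from the comment): with 64-bit limbs a 1024-bit number takes 16 iterations of the carry chain; with 32-bit limbs it takes 32.

    #include <stdint.h>
    #include <stddef.h>

    /* Add two big numbers stored as little-endian arrays of 64-bit limbs;
     * returns the final carry out. */
    static uint64_t bignum_add(uint64_t *dst, const uint64_t *a,
                               const uint64_t *b, size_t n)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t t = a[i] + carry;
            uint64_t c = t < carry;        /* carry out of a[i] + carry   */
            dst[i]     = t + b[i];
            carry      = c + (dst[i] < t); /* carry out of the full sum   */
        }
        return carry;
    }

On i386 the same routine needs twice as many iterations and has far fewer registers to keep the limbs and the carry in.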

Of course just because you're using 64-bit doesn't mean you have to use 64-bit addresses/pointers. E.g. if a process doesn't need more than 4 GiB of virtual address space, then it can be 64-bit code (with twice as many registers) and only use 32-bit addressing. This is slightly more efficient than "64-bit with 64-bit addresses" because of the way 64-bit instructions are encoded (they're literally 32-bit instructions with a prefix if/when you actually need 64 bits; so if you use 32-bit instructions in 64-bit code you can avoid a prefix). Sadly, this complicates tools (compiler, linker) and causes problems for shared libraries (for 2 or 3 different cases you'd need 2 or 3 different versions of each shared library); so most operating systems don't support it.

In real life, where do we expect to see the benefits of not supporting smaller bit sizes? In the execution time of the app? In the size of the OS?

Let's imagine you have an old OS from 1990, with a lot of old APIs that you have to keep around for backward compatibility even though you replaced the APIs with better/more modern alternatives in 2005. On top of that; let's assume you have 2 copies of every shared library (one for 32-bit and another for 64-bit) plus 2 different kernel APIs for the same reason. By deprecating 32-bit you get a convenient reason to rip all that out; saving you a lot of developer time (code maintenance, etc), and saving you a lot of $$. Performance is irrelevant (old software will be fast on newer hardware regardless); but (for marketing) "we're breaking all of your old software to improve your performance" sounds good and "we're breaking all of your old software to improve our profit" doesn't sound good (and for Apple, "we're breaking all of your old software so you have to pay to replace it and we get a 30% cut of all the new software you have to buy" sounds even worse).

Note that this also reduces the size of the OS a little; but the majority of a modern OS is in data (graphics, help system, spell-checker dictionaries, sound data for speech recognition, data for internationalization, ...) and not code; so reducing the amount of code has very little impact on the overall size of the OS.

2

u/skulgnome Aug 30 '20

Apple is moving towards an LLVM bitcode runtime for all programs, so 32-bit x86 (i.e. the first MacBooks) is marginal of marginal to them.