"Why does ACPI exist?" In the beforetimes, power management on x86 was done by jumping to an opaque BIOS entry point and hoping it would do the right thing. It frequently didn't. Failed to program your graphics card exactly the way the BIOS expected? Hurrah! Data corruption for you. ACPI made the reasonable decision that, well, maybe it should be up to the OS to set state and be able to recover it. But how should the OS deal with state that's fundamentally device-specific?
One way to do that would be to have the OS know about the device specific details. Unfortunately that means you can't ship the computer until you modify every single OS you want to support and get new releases out there. This, uh, was not an option the PC industry seriously considered. The alternative is that you ship something that abstracts the details of the specific hardware. This is what ACPI does, and it's also what things like Device Tree do.
The main distinction between Device Tree and ACPI is that Device Tree is purely a description of the hardware that exists, and so still requires the OS to know what's possible - if you add a new type of power controller, for instance, you need to add a driver for that to the OS before you can express that via Device Tree. ACPI decided to include an interpreted language to allow vendors to expose functionality to the OS without the OS needing to know about the underlying hardware.
So, for instance, ACPI allows you to associate a device with a function to power down that device. That function may, when executed, trigger a bunch of register accesses to a piece of hardware otherwise not exposed to the OS, and that hardware may then cut the power rail to the device to power it down entirely. And that can be done without the OS having to know anything about the control hardware.
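The split can be sketched with a toy example: the OS ships a generic interpreter, and the firmware ships a machine-specific recipe as data. (This is plain Python standing in for ACPI's actual AML bytecode; the register numbers and opcode set here are invented purely for illustration.)

```python
# Firmware-provided "method": a recipe of register writes that powers down
# a hypothetical device. The OS has no idea what the registers mean --
# it just executes the steps the firmware describes.
POWER_OFF_METHOD = [
    ("write", 0x62, 0x84),  # select a power-control register on a hidden controller
    ("write", 0x66, 0x01),  # assert the "cut power rail" bit
]

class FakeHardware:
    """Stands in for memory-mapped registers on some vendor-specific controller."""
    def __init__(self):
        self.regs = {}
    def write(self, addr, value):
        self.regs[addr] = value

def run_method(method, hw):
    """The OS-side interpreter: executes firmware-supplied steps without
    knowing anything about the underlying hardware."""
    for op, addr, value in method:
        if op == "write":
            hw.write(addr, value)
        else:
            raise ValueError(f"unknown op {op!r}")

hw = FakeHardware()
run_method(POWER_OFF_METHOD, hw)
```

A new machine with a completely different power controller just ships a different `POWER_OFF_METHOD`; the OS-side interpreter doesn't change. That's the whole trick.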
How is this better than just calling into the firmware to do it? Because the fact that ACPI declares that it's going to access these registers means the OS can figure out that it shouldn't, because it might otherwise collide with what the firmware is doing. With APM we had no visibility into that - if the OS tried to touch the hardware at the same time APM did, boom, almost-impossible-to-debug failures.
(This is why various hardware monitoring drivers refuse to load by default on Linux - the firmware declares that it's going to touch those registers itself, so Linux decides not to in order to avoid race conditions and potential hardware damage. In many cases the firmware offers a collaborative interface to obtain the same data, and a driver can be written to get that (https://bugzilla.kernel.org/show_bug.cgi?id=204807#c37 discusses this for a specific board))
@mjg59 Not to mention that many ACPI issues were/are caused by the motherboard-side implementation of ACPI being buggy, incomplete or self-contradictory. One thing that ACPI decidedly isn't: simple.
Historically there were a bunch of ACPI-related issues because the spec didn't define every single possible scenario and also there was no conformance suite (eg, should the interpreter be multi-threaded? Not defined by spec, but influences whether a specific implementation will work or not!). These days overall compatibility is pretty solid and the vast majority of systems work just fine, but we do still have some issues that are largely associated with System Management Mode.
Unfortunately ACPI doesn't entirely remove opaque firmware from the equation - ACPI methods can still trigger System Management Mode, which is basically a fancy way to say "Your computer stops running your OS, does something else for a while, and you have no idea what". This has all the same issues that APM did, in that if the hardware isn't in exactly the state the firmware expects, bad things can happen.
@klausman @mjg59 The real test is whether it makes the system simpler over all. And I'd argue the one-kernel-per-device model seen on Android phones is complicated in a different way, even if each individual kernel might be simpler.
@jamesh @mjg59 I agree. But then again, embedded devices, phones, desktop computers and data centre servers all have different parameters and benefit from different approaches.
Just as long as we don't go back to SET BLASTER="A220 I5 D1"
@klausman @mjg59 every smartphone I've owned so far has stopped receiving OS version upgrades before it became unusable.
In contrast, I've got a 10+ year old x86 server in my closet running a recent Linux distro. It just works because no one has to do hardware enablement for that specific system in the new OS release.
@hyc @klausman And the answer is just to claim to be Windows, because Windows has an established contract with the firmware in a way that Linux never has
@klausman @mjg59 and there were still plenty of issues where the ACPI tables expect the OS to be some flavor of Windows, e.g. "if win95 do X elseif win2k do Y", and do nothing on Linux, so some feature just doesn't work. Typically we'd fix these by dumping the DSDT and rewriting it, but nowadays dynamically loadable DSDTs are deprecated even though those types of problems are just as prevalent as ever.
@mjg59 @klausman That's the stock answer but it's inadequate. You have to know to claim to be a specific version of Windows, otherwise you still get breakage.
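The mechanism being described here is ACPI's `_OSI` method: the firmware asks which OS-interface strings the OS supports and branches on the answers, so an OS that answers honestly falls through every branch. A toy sketch (real firmware expresses this in AML; the handler labels are invented, though "Windows 2015" and "Windows 2001" are genuine `_OSI` argument strings):

```python
def firmware_init(osi):
    """Firmware-side logic: pick a code path based on the claimed OS.
    `osi` models the OS's _OSI handler: given a string, return True
    if the OS claims to support that OS interface."""
    if osi("Windows 2015"):
        return "configure-modern-path"
    elif osi("Windows 2001"):
        return "configure-legacy-path"
    # No branch for anything else: the feature silently does nothing.
    return "do-nothing"

# An OS that answers honestly gets the do-nothing path:
honest = firmware_init(lambda s: False)

# Linux's actual strategy: answer "yes" to the Windows _OSI strings,
# so it gets the same firmware path the vendor actually tested.
claiming_windows = firmware_init(lambda s: s.startswith("Windows"))
```

This also shows why "claim to be Windows" isn't quite enough on its own: claim the wrong *version* and you land on a legacy branch (or none at all).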
By and large ACPI has been a net improvement in Linux compatibility on x86 systems. It certainly didn't remove the "Everything is Windows" mentality that many vendors have, but it meant we largely only needed to ensure that Linux behaved the same way as Windows in a finite number of ways rather than in every single hardware driver, and so the chances that a new machine will work out of the box are much greater than they were in the pre-ACPI period.
This isn't something that ACPI enabled - in the absence of ACPI, firmware vendors would just be doing this unilaterally with even less OS involvement, and we'd probably have even more of these issues. Ideally we'd "simply" have hardware that didn't support transitioning back to opaque code, but we don't (ARM has basically the same issue with TrustZone).
One example is a recent Lenovo one, where the firmware appears to try to poke the NVMe drive on resume. There's some indication that this is intended to deal with transparently unlocking self-encrypting drives on resume, but it seems to do so without taking IOMMU configuration into account and so things explode. It's kind of understandable why a vendor would implement something like this, but it's also kind of understandable that doing so without OS cooperation may end badly.
(The alternative of teaching the kernel about every piece of hardware it should run on? We've seen that in the ARM world. Most code simply never reaches mainline, and most users are stuck running ancient kernels as a result. Imagine every x86 device vendor shipping their own kernel optimised for their hardware, and now imagine how well that works out given the quality of their firmware. Does that really seem better to you?)
What frustrates me is the sheer number of vendors that didn't see this as a problem (or saw it as a feature), combined with ARM refusing to even try to address this problem when AArch64 was released.
It's difficult to say that ARM is a viable competitor in non-embedded spaces when you can't say that you'll be able to install an update six months down the line (and this is sadly not even a hypothetical ...)
That seems like something that could have been standardized through data rather than code, though. For instance, a standard interface for power supply hardware, with enumerable power lines, and tables that say "this device is attached to this power line".
Having *tables* in firmware seems like a great thing, for everyone except vendors who think it'll destroy their ability to "differentiate" and "value add". Why give vendors a language to drive arbitrary non-standard functionality?
@stark That kind of thing ends up being *very* platform dependent. Say you have a GPU - the GPU driver has no idea how the power line to the GPU is controlled, because that's up to how it was wired up in the specific machine, and the control mechanism is likely also hardware-specific (is it controlled via the embedded controller? Is there some other power controller that needs to be spoken to? That sort of thing)
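A minimal sketch of why the pure-data approach (roughly the Device Tree model) still pushes the problem back into the OS. The table can say *which* power line a device hangs off, but the OS needs a driver for each *kind* of controller, so a new controller type means a new OS release. Controller and device names here are invented for illustration:

```python
# Firmware-provided data: device -> (controller type, power line).
POWER_TABLE = {
    "gpu0":  {"controller": "ec-v1", "line": 3},
    "nvme0": {"controller": "frobozz-pmic", "line": 0},  # some new vendor part
}

# Controller drivers the OS happens to ship:
DRIVERS = {
    "ec-v1": lambda line: f"EC: cut line {line}",
}

def power_off(device):
    entry = POWER_TABLE[device]
    driver = DRIVERS.get(entry["controller"])
    if driver is None:
        # The table fully describes the hardware, but the OS can't act on it
        # until someone writes and ships a driver for this controller type.
        return f"no driver for {entry['controller']!r}"
    return driver(entry["line"])

print(power_off("gpu0"))   # EC: cut line 3
print(power_off("nvme0"))  # no driver for 'frobozz-pmic'
```

ACPI's answer is to make the table entry itself executable, so the "driver" for the weird vendor controller ships with the machine instead of with the OS.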
But one thing I still don't get: the kernel needs a driver to talk to every device, and that driver needs to know how to do everything else anyway. Why is turning the device on and off so uniquely tricky that it would be a problem to do that in the driver too?
@josh @stark And, well, the fundamental problem is still that you need to identify all possible scenarios people might reasonably want to implement in advance, and it's clear the industry isn't interested in that
@mjg59 I'm thinking more about things like SMM (and ME, and even UEFI).
What does SMM do that the EC couldn't do better?
All of the hardware I design these days has a MCU that does management stuff (sometimes two, one really low level one for power/reset sequencing and one that comes up later to do everything else) and then talks to an FPGA that does most of the work of the board.
The FPGA doesn't care about polling sensors or controlling fan speeds or anything else that's needed to make the system work. It just lives off in its own temperature-controlled world and does its thing, and has a SPI interface to the MCU when it needs something.
@mjg59 What I don't understand is why all of this functionality is implemented on the CPU at all, rather than on the EC/BMC.
(and then have the EC/BMC expose a standardized, abstracted API to the OS for things like voltage/temperature sensors)
Power management also seems like the kind of thing that makes more sense to have been architected as an out of band feature that software was unaware of except for giving high level directives to the EC.
@mjg59 one issue with ACPI - though it's an issue with specific vendors rather than with ACPI in general - is that on Qualcomm ARM laptops, Qualcomm decided to move quite a lot of logic into their Windows drivers instead of into ACPI, because the Windows ASL parser is buggy compared to ACPICA. That makes running alternative operating systems purely via ACPI on these laptops a bit annoying, as one needs to reverse engineer the drivers to discover the interactions between devices 🙄