This is a writeup of what I know about the apparently correct settings for Xen power management.
I'm not putting it on the Xen wiki since it would be pointless. The devs changing / breaking / altering the code don't bother updating the docs. So I'd rather have it here, where I can update things easily and indicate what's unknown and what isn't.
If you think this is wrong, just go ahead and do proper documentation for every Xen release since 3.1. I will follow suit then.
1. Xen Command Line
You need the following settings:
- Enable power management from dom0. This isn't optional, since hypervisor-controlled power management seems to be fully broken.
- Pin the dom0 vCPUs. vCPU migration in dom0 should never be allowed, much less when trying to control per-core power from dom0. Since dom0 only sees the dom0 CPU usage, I suspect this is really inefficient and stupid.
- Anything that limits the number of cores dom0 sees must not be used. dom0 needs to see all cores; cores not seen by dom0 will otherwise run at full speed.
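As a sketch of how the three rules above could be checked, assuming they correspond to the Xen boot options `cpufreq=dom0-kernel` and `dom0_vcpus_pin`, and that the option to avoid is `dom0_max_vcpus`. Those option names are my reading of the descriptions, not something this writeup confirms, and `check_xen_cmdline` is a made-up helper name. Verify against the command line documentation of your Xen release.

```shell
# Sketch: sanity-check a Xen command line against the three rules above.
# ASSUMPTION: the rules map to cpufreq=dom0-kernel, dom0_vcpus_pin and
# dom0_max_vcpus; a form like dom0_vcpus_pin=true is not recognised here.
check_xen_cmdline() {
    cmdline=" $1 "
    rc=0
    case "$cmdline" in
        *" cpufreq=dom0-kernel "*) ;;
        *) echo "missing: cpufreq=dom0-kernel"; rc=1 ;;
    esac
    case "$cmdline" in
        *" dom0_vcpus_pin "*) ;;
        *) echo "missing: dom0_vcpus_pin"; rc=1 ;;
    esac
    case "$cmdline" in
        *" dom0_max_vcpus="*) echo "remove: dom0_max_vcpus"; rc=1 ;;
    esac
    return $rc
}

# Usage on a running host (xl info prints the hypervisor command line):
#   check_xen_cmdline "$(xl info | sed -n 's/^xen_commandline *: *//p')"
```

The function only reads a string, so it can be tried against a planned command line before touching the boot loader config.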
I think there's one more for setting the max c-state?
Note: if you set a core to "offline" using /sys, that unfortunately also means it will run at full speed, twiddling its thumbs. :(
2. dom0 OS
You need the following kernel module:
Without this module, the xenpm command won't work (error 22), and all ACPI CPU control features seem to be off too.
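A minimal sketch for loading the module, assuming it is `xen-acpi-processor` (shown as `xen_acpi_processor` in /proc/modules). That name is my assumption; on some kernels the driver may be built in instead. `module_loaded` and `ensure_pm_module` are hypothetical helper names.

```shell
# Sketch: make sure the ACPI processor driver for Xen is loaded in dom0.
# ASSUMPTION: the module is xen_acpi_processor; adjust if yours differs.
XEN_PM_MODULE="xen_acpi_processor"

module_loaded() {
    # $1 = module name, $2 = module list file (defaults to /proc/modules)
    grep -q "^$1 " "${2:-/proc/modules}"
}

ensure_pm_module() {
    # Only call modprobe when the module is not listed yet.
    module_loaded "$XEN_PM_MODULE" || modprobe "$XEN_PM_MODULE"
}

# Usage (as root in dom0):
#   ensure_pm_module && xenpm get-cpufreq-para 0
```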
xenpm set-scaling-governor powersave
This sets the powersave (max power saving) governor. If you don't run high-load compute jobs, start with this one.
This enables Turbo mode, meaning busy cores are allowed to clock above their nominal speed while there is thermal and power headroom. This needs to be run per core.
The CPU clock is often between 1.9 and 2.4 GHz at normal full speed, while the per-core turbo speed is 2.9-3.8 GHz, depending on the model.
(On my 8-core host I need to do it for cores 0 to 15 due to Intel's Hyper-Threading. I run mixed workloads, so for me it makes sense to have Hyper-Threading enabled.)
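The per-core step can be sketched as a loop, assuming the subcommand is `xenpm enable-turbo-mode <cpuid>` and that logical CPUs are numbered 0 to N-1. `enable_turbo_all` is a hypothetical helper name.

```shell
# Sketch: enable turbo mode on every logical CPU, one xenpm call per core.
# ASSUMPTION: logical CPUs are numbered 0..N-1 (0..15 on the 8-core
# hyperthreaded host described above).
enable_turbo_all() {
    ncpus=$1
    cpu=0
    while [ "$cpu" -lt "$ncpus" ]; do
        # Stop at the first failure instead of silently skipping a core.
        xenpm enable-turbo-mode "$cpu" || return 1
        cpu=$((cpu + 1))
    done
}

# Usage (as root in dom0), 8 cores with hyperthreading:
#   enable_turbo_all 16
```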
Now, to add insult to injury, you also need to manually enable any deep sleep modes. This is partially due to potential CPU and power supply bugs: a current Xeon can draw 0A on 12V when idle, which actually causes some power supplies to crash the PC. I say partially because, to be honest, the non-configuring of C-states predates any current CPU.
First, look up your CPU's deepest sleep state... then:
Use this with care. Since the hypervisor has no embedded info for your CPU in your system (with both Intel and AMD committing to Xen you'd think this would be possible..), expect to be the first tester. The command is a bit off, and I would not be surprised if I could set a max-cstate of 255.
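A cautious wrapper could look like the sketch below, assuming the subcommand is `xenpm set-max-cstate <num>`. Because of the suspicion that nonsense values might be accepted, it at least rejects non-numeric input. `set_max_cstate` is a hypothetical helper name, and the value 6 in the usage line is only an example.

```shell
# Sketch: cap the deepest C-state Xen may use, refusing non-numeric input.
# ASSUMPTION: the caller has looked up a sensible value for the CPU;
# this wrapper cannot know what the hardware really supports.
set_max_cstate() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: set_max_cstate <number>" >&2; return 2 ;;
    esac
    xenpm set-max-cstate "$1"
}

# Usage (as root in dom0); 6 is only an example value:
#   set_max_cstate 6
```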
The good news: you can see if it worked!
Here you can see that a good amount of time was spent in the pc6/cc6 states, which means in the deepest sleep modes.
To cross-reference I also check on the last (HT) core.
Same result, so it's really in effect. I think.
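The check can be sketched like this, assuming the residency numbers come from `xenpm get-cpuidle-states <cpuid>` and that the interesting lines contain "pc6" or "cc6". The exact output format differs between Xen releases, so the filter is a guess; `show_deep_states` is a hypothetical helper name.

```shell
# Sketch: show only the deep-sleep residency lines for one logical CPU.
# ASSUMPTION: those lines contain "pc6" or "cc6"; adjust the pattern to
# whatever your xenpm version prints.
show_deep_states() {
    xenpm get-cpuidle-states "$1" | grep -i -E 'pc6|cc6'
}

# Usage (as root in dom0), first and last logical CPU of a 16-thread host:
#   show_deep_states 0
#   show_deep_states 15
```

The function returns non-zero when no matching line is printed, which makes it usable in a script as a rough "did it work" test.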
One could be worried by the "total number of C-States = 4", which is a bit contradictory. Since I don't know enough about ACPI, I can imagine a few scenarios:
- XenPM is outdated and doesn't know how to show all (very likely)
- Xen is outdated and mocks us and we're not using all (very likely)
- It's just how the modes are named, and cc6 is a substate of C3 and is also called c-state 6 (likely, since computers)
- Everything is fine, move on (I like this one)
According to Wikipedia, the 3rd guess could be correct:
That's it for this topic! We have enabled all power-saving settings that exist (afaik) and seen how to check whether they did something.
What you now lack is a script that safely applies all those settings. I don't have one myself, so far. Share it if you make one.
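As a starting point, here is a rough sketch of such a script. It is not tested on real hardware. It assumes the xenpm subcommands `set-scaling-governor`, `enable-turbo-mode` and `set-max-cstate`, takes the CPU count and C-state number from the caller, and only prints what it would do unless `APPLY=1` is set. `run` and `apply_xen_pm` are hypothetical names.

```shell
# Sketch: apply governor, turbo mode and max C-state in one go.
# Dry run by default: set APPLY=1 to really call xenpm.
# ASSUMPTION: CPU count and C-state number are supplied by the caller.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

apply_xen_pm() {
    ncpus=$1
    max_cstate=$2
    # Abort on the first failing step so a half-applied state is noticed.
    run xenpm set-scaling-governor powersave || return 1
    cpu=0
    while [ "$cpu" -lt "$ncpus" ]; do
        run xenpm enable-turbo-mode "$cpu" || return 1
        cpu=$((cpu + 1))
    done
    run xenpm set-max-cstate "$max_cstate" || return 1
}

# Usage: review the dry run first, then apply as root in dom0:
#   apply_xen_pm 16 6
#   APPLY=1 apply_xen_pm 16 6
```

The dry-run default is the "safely" part: nothing touches the hypervisor until the printed command list has been reviewed.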
Also I don't have any numbers on this. My home server uses a Supermicro X9-SRL-F board. That board cannot track the power draw, which is pretty typical for low end servers with desktop power supplies. :>
A great tool to hunt down power-wasting processes is Intel's "powertop".
Topics for further research
- Loading microcode: does it do anything for this?
- Other hardware power saving, e.g. PCIe ASPM
When to not use power management:
If you need very low latency computing, you don't want C-states or any other power management. On heavily loaded Nagios servers the difference can be up to 30% in performance.
I'm attaching a PDF on this by HP: HP Tuning for low latency computing. Generally, you'll find such documentation with all high-performance/high-IO server hardware vendors (e.g. for Solarflare NICs or Infiniband vendors). The rule of thumb seems to be:
- Turn off Speedstep / EIST
- Turn off all power management above C-state 2.
- Turn off Turbo mode
- Turn off OS control of irq assignments (do it yourself)
- Turn off OS migration of high-cpu processes (desktop feature... bullshit to have in a server)
So the first 3 are about keeping the power profile constant, the last 2 are about taking away control from the OS to maximize your CPU cache use and avoid wasting QPI bandwidth.
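The "do it yourself" part of the last two rules can be sketched on a plain Linux system as follows, assuming the kernel exposes `/proc/irq/<n>/smp_affinity_list` and that `taskset` from util-linux is installed. The IRQ number, PID and CPU number in the usage lines are placeholders, and `pin_irq` and `pin_process` are hypothetical helper names. Whether this carries over unchanged to a Xen dom0 is something I have not verified.

```shell
# Sketch: pin an interrupt and a process to one CPU by hand.
# ASSUMPTION: irqbalance (or similar) is stopped, otherwise it may
# rewrite the IRQ affinity again.
IRQ_DIR="${IRQ_DIR:-/proc/irq}"

pin_irq() {
    # $1 = IRQ number, $2 = CPU number
    echo "$2" > "$IRQ_DIR/$1/smp_affinity_list"
}

pin_process() {
    # $1 = PID, $2 = CPU number
    taskset -cp "$2" "$1"
}

# Usage (as root; the numbers are placeholders):
#   pin_irq 42 3
#   pin_process 1234 3
```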