[x264-devel] Machine Check errors
Mark Nelson
markn at ieee.org
Thu Mar 12 14:31:23 CET 2015
>Both E5645 and X5680 are Westmere-EP CPUs, does it occur with other
>microarchitectures as well?
Hi Henrik,
I received an off-list email from another person who made the same
suggestion. He had personal experience with a bug in the Nehalem
microarchitecture which was caused by specific sequences of instructions,
including some in the SSE2 family.
This matches up with what I am seeing - we have never seen this problem on
Sandy Bridge, and we have only seen it when using x264 builds that use SSE2.
It's difficult to know what specific bug it is, but we are testing with
what I believe is the latest microcode, so it appears Intel has chosen not
to fix it.
All this adds up pretty well. What is truly annoying about it is that
neither our MB manufacturer nor Intel has been any help whatsoever in
chasing this. When you sell me motherboards, and I can generate Machine
Check errors with user mode code, I feel that the onus is on you to figure
out what is wrong. Our MB vendor was simply unable to do so, with or
without Intel's help.
Intel produces a document called "Debugging Machine Check Exceptions on
Embedded IA Platforms". It's 17 pages long but boils down to this: try
changing things until the problem goes away.
In a perfect world I would expect that if I said "STOP 0x9C" to my vendor
they would immediately have a reference from Intel that describes how this
can be caused by existing bugs.
Anyway, that's a load of complaining that is relatively off-topic. Based on
hearing from someone else who had a nearly identical problem, I am going to
believe that this characterizes the problem.
It's possible we could fix this by modding x264, but there are two big
issues there. First, we don't actually know what code sequence is breaking
things - the crashes are not conveniently pointing to x264 code. Second, we
are using this as a third party library in our product, and it would be
difficult to devote someone to becoming adept at x264 internals for the
sake of fixing this. So instead we work around it by just turning off SSE2.
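
As a rough illustration of that workaround, here is a minimal Python sketch
that builds an x264 command line with the --asm override quoted later in
this thread, and checks x264's "using cpu capabilities" log line for any
SSE2 entry. The function names and the file names in the usage section are
hypothetical, and the mask 0x1400EE is copied from the report as an opaque
value, since its bit meanings depend on the x264 revision.

```python
import re

# Mask taken verbatim from the report in this thread. Its bit layout depends
# on the x264 revision, so it is treated here as an opaque value.
ASM_MASK = "0x1400EE"


def build_x264_command(exe, source, output, asm_mask=ASM_MASK):
    """Return an argv list that runs x264 with CPU auto-detection overridden."""
    return [exe, "--asm", asm_mask, "-o", output, source]


def parse_cpu_capabilities(log_text):
    """Return the tokens from x264's 'using cpu capabilities' log line."""
    match = re.search(r"using cpu capabilities:\s*(.*)", log_text)
    if match is None:
        return []
    return match.group(1).split()


def uses_sse2(capabilities):
    """True if any reported capability starts with 'SSE2' (e.g. 'SSE2Fast')."""
    return any(cap.upper().startswith("SSE2") for cap in capabilities)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_x264_command("x264.exe", "input.y4m", "output.264"))
    line = "x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2"
    print(uses_sse2(parse_cpu_capabilities(line)))
```

Feeding the encoder's log output through parse_cpu_capabilities after
applying the override is one way to confirm that SSE2 is really off before
starting a long soak test.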
One final note. It took a long time to pin down x264 as the source of the
problem. One reason was that, despite the fact that we have NEVER seen the
problem on a system that wasn't encoding using x264, the machine check did
not occur in the x264 code. A typical stack dump is shown below. It's
almost as though hitting this defect required one core to be encoding while
another was calculating MD5s.
STACK_TEXT:
f65af248 0000ffff
f65af24c 00009200
f65af250 97908887
f65af254 e00398b8
f65af258 0000ffff
f65af25c 00009200
f65af260 3000ffff
f65af264 f6409331 Fips!TransformMD5+0x281
f65af268 3000ffff
f65af26c f6409331 Fips!TransformMD5+0x281
f65af270 3000ffff
f65af274 f6409331 Fips!TransformMD5+0x281
f65af278 e003f120
f65af27c 00000000
f65af280 e003f128
f65af284 00000000
f65af288 e003f130
f65af28c 00000000
f65af290 e003f138
f65af294 00000000
f65af298 e003f140
f65af29c 00000000
f65af2a0 e003f148
f65af2a4 00000000
f65af2a8 e003f150
f65af2ac 00000000
f65af2b0 e003f158
f65af2b4 00000000
f65af2b8 e003f160
f65af2bc 00000000
f65af2c0 e003f168
f65af2c4 00000000
FOLLOWUP_IP:
Fips!TransformMD5+281
f6409331 8bc2 mov eax,edx
SYMBOL_NAME: Fips!TransformMD5+281
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: Fips
IMAGE_NAME: Fips.SYS
DEBUG_FLR_IMAGE_TIMESTAMP: 480251f7
FAILURE_BUCKET_ID: 0x9C_GenuineIntel_Fips!TransformMD5+281
BUCKET_ID: 0x9C_GenuineIntel_Fips!TransformMD5+281
Followup: MachineOwner
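
For anyone triaging a batch of dumps like the one above, the following is a
small Python sketch (not part of any debugger tooling) that counts which
modules appear as symbols in a STACK_TEXT section. It assumes WinDbg-style
"Module!Function+offset" symbols as seen in this dump; the function name
modules_in_stack is made up for the example.

```python
import re
from collections import Counter

# Matches WinDbg-style symbols such as "Fips!TransformMD5+0x281".
SYMBOL_RE = re.compile(r"\b(\w+)!(\w+)\+(?:0x)?([0-9a-fA-F]+)")


def modules_in_stack(stack_text):
    """Count how many stack lines resolve to a symbol in each module."""
    counts = Counter()
    for line in stack_text.splitlines():
        match = SYMBOL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the dump above.
    sample = """\
f65af260 3000ffff
f65af264 f6409331 Fips!TransformMD5+0x281
f65af268 3000ffff
f65af26c f6409331 Fips!TransformMD5+0x281
f65af270 3000ffff
f65af274 f6409331 Fips!TransformMD5+0x281
f65af278 e003f120
"""
    for module, hits in modules_in_stack(sample).most_common():
        print(module, hits)
```

Run over many crash dumps, a tally like this makes it easier to see that the
faulting frames land in unrelated modules (here Fips) rather than in x264.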
------------------------------------------------------------------------------
Mark Nelson - markn at ieee.org - http://marknelson.us
On Wed, Mar 11, 2015 at 5:54 PM, Henrik Gramner <henrik at gramner.com> wrote:
> On Tue, Mar 10, 2015 at 7:45 PM, Mark Nelson <markn at ieee.org> wrote:
> > Using recent videolan builds of the x264 windows command line executable
> > (x264-r2491-24e4fed.exe), I have some hardware that experiences BSOD errors
> > due to Machine Check 9C. This is seen when using the default auto-detect
> > CPU flags.
> >
> > The BSODs are very rare. On a machine that is using close to 100% of its
> > cycles on encoding, the average rate of failure is perhaps 1/week.
> >
> > The error has been seen on Xeon E5645 @ 2.4 GHz CPUs running XP, and on Xeon
> > X5680 @ 3.33 GHz CPUs running Server 2008 R2. The crash is not associated with
> > specific machines; it seems to occur on any machine of a specific model and
> > CPU type.
> >
> > On both types of system, running the encoders with --asm 0x1400EE
> > eliminates the problem - thousands and thousands of hours with no crashes.
> >
> > Getting to the bottom of Machine Check errors on Intel CPUs seems very
> > problematic. It doesn't seem like our MB manufacturer or Intel has a good
> > way to actually catch this in the act and explain why it happens. All
> > the advice for fixing this error is along the lines of eliminating possible
> > problems, mostly by pointing fingers at things that can go bad on the MB,
> > faulty memory, bad BIOS settings etc.
> >
> > All of that is fine, but these same machines never experience that BSOD
> > error when running other types of software at the same high rates - close to
> > 100% CPU utilization. There is something about the default CPU options being
> > selected by x264 that is causing the unique event:
> >
> > x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2
> >
> >
> > I realize it is *way* outside the scope of this mailer to debug CPU, MB, and
> > chipset defects, but it would be interesting to know if anyone has ever seen
> > this, either in the context of x264 or elsewhere.
> >
> > I don't think there is any way a Machine Check 9C can be generated by user
> > mode code, so I have all along been working on the theory that this is a
> > result of either a hardware defect or configuration error. To no avail.
> >
> >
> > ------------------------------------------------------------------------------
> >
> > Mark Nelson - markn at ieee.org - http://marknelson.us
> >
> >
> > _______________________________________________
> > x264-devel mailing list
> > x264-devel at videolan.org
> > https://mailman.videolan.org/listinfo/x264-devel
> >
>
> That indeed sounds like a hardware issue since a user space
> application shouldn't be able to cause a BSOD.
>
> Both E5645 and X5680 are Westmere-EP CPUs, does it occur with other
> microarchitectures as well? If not it could possibly be a CPU bug
> (those exist in a much larger number than you'd expect), see
>
> http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
> for an errata summary of the 5600 series.
>
>
> Henrik