[x264-devel] Machine Check errors

Mark Nelson markn at ieee.org
Thu Mar 12 14:31:23 CET 2015


>Both E5645 and X5680 are Westmere-EP CPUs, does it occur with other
>microarchitectures as well?

Hi Henrik,

I received an off-list email from another person who made the same
suggestion. He had personal experience with a bug in the Nehalem
microarchitecture that was triggered by specific sequences of
instructions, including some in the SSE2 family.

This matches up with what I am seeing - we have never seen this problem on
Sandy Bridge, and we have only seen it when using x264 builds that use SSE2.

It's difficult to know which specific bug it is, but we are testing with
what I believe is the latest microcode, so if this is a known erratum,
Intel has apparently chosen not to fix it.

All this adds up pretty well. What is truly annoying about it is that
neither our MB manufacturer nor Intel has been any help whatsoever in
chasing this. When you sell me motherboards, and I can generate Machine
Check errors with user mode code, I feel that the onus is on you to figure
out what is wrong. Our MB vendor was simply unable to do so, with or
without Intel's help.

Intel produces a document called "Debugging Machine Check Exceptions on
Embedded IA Platforms". It's 17 pages long but boils down to this: try
changing things until the problem goes away.

In a perfect world I would expect that if I said "STOP 0x9C" to my vendor
they would immediately have a reference from Intel that describes how this
can be caused by existing bugs.

Anyway, that's a load of complaining that is relatively off-topic. Based on
hearing from someone else who had a nearly identical problem, I am going to
believe that this characterizes the problem.
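
As a rough sanity check on that belief, here is a back-of-the-envelope
sketch. It assumes crashes on a fully loaded machine arrive independently
at about one per week (the rate from my original report, quoted further
down) and asks how likely a long crash-free run would be if turning off
SSE2 had changed nothing. The Poisson model, the function name, and the
2000-hour figure are my illustrative assumptions, not measurements.

```python
import math

HOURS_PER_WEEK = 7 * 24  # 168


def prob_no_crash(hours, mean_hours_between_crashes=HOURS_PER_WEEK):
    """Probability of zero crashes in `hours` of run time, assuming
    crashes follow a Poisson process with the given mean spacing.
    (Assumption: independent failures at a constant rate.)"""
    if hours < 0 or mean_hours_between_crashes <= 0:
        raise ValueError("hours must be >= 0 and the mean spacing > 0")
    return math.exp(-hours / mean_hours_between_crashes)


if __name__ == "__main__":
    # 2000 hours is a stand-in for "thousands of hours", not a logged value.
    for hours in (168, 1000, 2000):
        print(f"{hours:5d} h crash-free by chance: {prob_no_crash(hours):.2e}")
```

Under those assumptions, a crash-free run of a couple of thousand hours is
very unlikely to be luck, which is consistent with the workaround actually
avoiding the trigger rather than us just not having waited long enough.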

It's possible we could fix this by modding x264, but there are two big
issues there. First, we don't actually know what code sequence is breaking
things - the crashes are not conveniently pointing to x264 code. Second, we
are using this as a third party library in our product, and it would be
difficult to devote someone to becoming adept at x264 internals for the
sake of fixing this. So instead we work around it by just turning off SSE2.
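
For anyone driving the command line encoder from a script, the workaround
can be sketched roughly as follows. The only things taken from this thread
are the --asm option and the 0x1400EE mask from my original report; the
helper name, the executable path, and the file names are hypothetical, and
the sketch does not try to interpret the individual bits of the mask.

```python
def build_x264_command(infile, outfile, asm_mask=0x1400EE, exe="x264"):
    """Build an argv list that runs x264 with CPU auto-detection
    overridden by an explicit --asm mask.

    asm_mask defaults to the value reported in this thread; exe, infile
    and outfile are placeholders for illustration only.
    """
    if asm_mask < 0:
        raise ValueError("asm_mask must be a non-negative integer")
    # Format the mask as upper-case hex, e.g. 0x1400EE.
    return [exe, "--asm", f"0x{asm_mask:X}", "-o", outfile, infile]


if __name__ == "__main__":
    # The list can be handed to subprocess.run(); here it is only printed.
    print(" ".join(build_x264_command("input.y4m", "output.264")))
```

Passing the arguments as a list rather than a single shell string avoids
quoting problems with file names that contain spaces.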

One final note. It took a long time to pin down x264 as the source of the
problem. One reason was that, despite the fact that we have NEVER seen the
problem on a system that wasn't encoding using x264, the machine check did
not occur in the x264 code. A typical stack dump is shown below. It's
almost as though hitting this defect required one core to be encoding while
another was calculating MD5s.

STACK_TEXT:
f65af248  0000ffff
f65af24c  00009200
f65af250  97908887
f65af254  e00398b8
f65af258  0000ffff
f65af25c  00009200
f65af260  3000ffff
f65af264  f6409331 Fips!TransformMD5+0x281
f65af268  3000ffff
f65af26c  f6409331 Fips!TransformMD5+0x281
f65af270  3000ffff
f65af274  f6409331 Fips!TransformMD5+0x281
f65af278  e003f120
f65af27c  00000000
f65af280  e003f128
f65af284  00000000
f65af288  e003f130
f65af28c  00000000
f65af290  e003f138
f65af294  00000000
f65af298  e003f140
f65af29c  00000000
f65af2a0  e003f148
f65af2a4  00000000
f65af2a8  e003f150
f65af2ac  00000000
f65af2b0  e003f158
f65af2b4  00000000
f65af2b8  e003f160
f65af2bc  00000000
f65af2c0  e003f168
f65af2c4  00000000

FOLLOWUP_IP:
Fips!TransformMD5+281
f6409331 8bc2            mov     eax,edx

SYMBOL_NAME:  Fips!TransformMD5+281
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: Fips
IMAGE_NAME:  Fips.SYS
DEBUG_FLR_IMAGE_TIMESTAMP:  480251f7
FAILURE_BUCKET_ID:  0x9C_GenuineIntel_Fips!TransformMD5+281
BUCKET_ID:  0x9C_GenuineIntel_Fips!TransformMD5+281

Followup: MachineOwner
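
If you have a pile of these dumps, a small script makes it easier to see
which module the machine check gets attributed to. The following is only a
sketch: it assumes the two-column "address  value [symbol]" layout and the
bucket ID shape shown in the dump above, and the function names are mine,
not part of any debugger tooling.

```python
import re
from collections import Counter

# One STACK_TEXT row: stack address, stored value, optional symbol.
STACK_LINE = re.compile(r"^([0-9a-f]{8})\s+([0-9a-f]{8})(?:\s+(\S+))?$")


def count_symbols(dump_text):
    """Count how often each resolved symbol appears in STACK_TEXT rows."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STACK_LINE.match(line.strip())
        if match and match.group(3):
            counts[match.group(3)] += 1
    return counts


def parse_bucket_id(bucket):
    """Split an ID such as 0x9C_GenuineIntel_Fips!TransformMD5+281.

    Assumes the code_vendor_module!function+offset shape seen above,
    with the offset written in hex without a 0x prefix.
    """
    code, vendor, location = bucket.split("_", 2)
    module, rest = location.split("!", 1)
    function, _, offset = rest.partition("+")
    return {
        "bugcheck": int(code, 16),
        "vendor": vendor,
        "module": module,
        "function": function,
        "offset": int(offset or "0", 16),
    }


if __name__ == "__main__":
    sample = """f65af260  3000ffff
f65af264  f6409331 Fips!TransformMD5+0x281
f65af268  3000ffff
f65af26c  f6409331 Fips!TransformMD5+0x281"""
    print(count_symbols(sample))
    print(parse_bucket_id("0x9C_GenuineIntel_Fips!TransformMD5+281"))
```

Run over many dumps, a tally like this would show whether the fault always
lands in the same unrelated module or moves around, which is the kind of
evidence that points away from the module named in the dump.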

------------------------------------------------------------------------------

Mark Nelson - markn at ieee.org - http://marknelson.us

On Wed, Mar 11, 2015 at 5:54 PM, Henrik Gramner <henrik at gramner.com> wrote:

> On Tue, Mar 10, 2015 at 7:45 PM, Mark Nelson <markn at ieee.org> wrote:
> > Using recent videolan builds of the x264 Windows command line
> > executable (x264-r2491-24e4fed.exe), I have some hardware that
> > experiences BSOD errors due to Machine Check 9C. This is seen when
> > using the default auto-detect CPU flags.
> >
> > The BSODs are very rare. On a machine that is using close to 100% of its
> > cycles on encoding, the average rate of failure is perhaps 1/week.
> >
> > The error has been seen on Xeon E5645 @ 2.4 GHz CPUs running XP, and on
> > Xeon X5680 @ 3.33 GHz CPUs running Server 2008 R2. The crash is not
> > associated with specific machines; it seems to occur on any machine of
> > a specific model and CPU type.
> >
> > On both types of system, running the encoders with --asm 0x1400EE
> > eliminates the problem - thousands and thousands of hours with no
> > crashes.
> >
> > Getting to the bottom of Machine Check errors on Intel CPUs seems very
> > problematic. It doesn't seem like our MB manufacturer or Intel has a
> > good way to actually catch this in the act and explain why it happens.
> > All the advice for fixing this error is along the lines of eliminating
> > possible problems, mostly by pointing fingers at things that can go bad
> > on the MB, faulty memory, bad BIOS settings etc.
> >
> > All of that is fine, but these same machines never experience that BSOD
> > error when running other types of software at the same high rates -
> > close to 100% CPU utilization. There is something about the default CPU
> > options being selected by x264 that is causing the unique event:
> >
> > x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2
> >
> >
> > I realize it is *way* outside the scope of this mailer to debug CPU, MB,
> > and chipset defects, but it would be interesting to know if anyone has
> > ever seen this, either in the context of x264 or elsewhere.
> >
> > I don't think there is any way a Machine Check 9C can be generated by
> > user mode code, so I have all along been working on the theory that
> > this is a result of either a hardware defect or configuration error.
> > To no avail.
> >
> >
> > ------------------------------------------------------------------------------
> >
> > Mark Nelson - markn at ieee.org - http://marknelson.us
> >
> >
> > _______________________________________________
> > x264-devel mailing list
> > x264-devel at videolan.org
> > https://mailman.videolan.org/listinfo/x264-devel
> >
>
> That indeed sounds like a hardware issue since a user space
> application shouldn't be able to cause a BSOD.
>
> Both E5645 and X5680 are Westmere-EP CPUs, does it occur with other
> microarchitectures as well? If not it could possibly be a CPU bug
> (those exist in a much larger number than you'd expect), see
>
> http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
> for an errata summary of the 5600 series.
>
>
> Henrik