> Looks like we may finally have it.
>
> The technical details:
>
> The generation of the NaN was traced down to a single assembly
> instruction. The point at which the NaN occurs is when a value is
> loaded from memory into one of the Floating Point Unit's 8 registers.
>
> ~~~~~~~~~~~~~~~~
>
> The really technical details:
>
> The FPU treats these registers as if they were a stack, maintaining a
> pointer to the register currently considered the stack top and a
> status bit for each register denoting whether the register is empty or
> contains data. Each time a load operation occurs, the stack pointer
> is decremented, and the value is loaded into the new stack top
> register. When a value is popped from the stack top register, that
> register is marked empty and the stack pointer is incremented.
> It is also possible to mark a register as empty manually through code.
>
>
> If a load operation occurs and the next register in the stack is
> marked as containing data, a stack overflow exception is generated.
> Intel's technical docs specify that when a stack overflow occurs, the
> "undefined value" is loaded into the register in question. We've
> assumed that the "undefined value" is NaN.
>
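
To make the overflow scenario concrete, here is a minimal C++ sketch of
a toy model of the register stack. It is not real FPU code and is not
taken from the code under investigation: the class ToyFpuStack and the
helper load_without_popping are hypothetical names, and storing a quiet
NaN as the "undefined value" simply mirrors the assumption stated above.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <limits>

// Toy model of the eight-register FPU stack (hypothetical, for illustration).
class ToyFpuStack {
public:
    // Load: advance TOP to the next register (the x87 does this by
    // decrementing TOP modulo 8), then write the value there. If that
    // register is still tagged as in use, record a stack overflow and
    // store a quiet NaN instead (assumed stand-in for the "undefined value").
    void load(double value) {
        top_ = (top_ + 7) % 8;
        if (in_use_[top_]) {
            overflowed_ = true;
            regs_[top_] = std::numeric_limits<double>::quiet_NaN();
        } else {
            regs_[top_] = value;
            in_use_[top_] = true;
        }
    }

    // Pop: read the stack top, tag that register empty, move TOP back.
    double pop() {
        double value = regs_[top_];
        in_use_[top_] = false;
        top_ = (top_ + 1) % 8;
        return value;
    }

    double top() const { return regs_[top_]; }
    bool overflowed() const { return overflowed_; }

private:
    std::array<double, 8> regs_{};   // register contents
    std::array<bool, 8> in_use_{};   // per-register "contains data" tag
    int top_ = 0;                    // index of the current stack top
    bool overflowed_ = false;        // set once a load hits a full register
};

// Simulates code that loads `count` values and never pops them, then
// returns whatever ends up in the stack top register.
double load_without_popping(ToyFpuStack& fpu, int count) {
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        fpu.load(1.0 + i);
    }
    return fpu.top();
}
```

In this model, eight unmatched loads fill the stack without trouble. A
ninth finds its target register still marked as in use and yields NaN,
which is the pattern that a missing pop in generated code would produce.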
> ~~~~~~~~~~~~~~~~
>
> The situation described above is what our trace through the assembly
> code revealed to be happening. The code in question is in
> GParticle::scatterBy(...) called by Interactor::multipleScatter(...).
> Since this only occurs in optimized code, we are assuming that the
> egcs people were lax in generating FPU stack maintenance code with
> optimization turned on.
>
> ~~~~~~~~~~~~~~~~
>
> Enough of all the technical stuff. This is what we're doing about it:
>
> We've tried linking an executable with all code optimized except for
> GParticle. This prevented some of the NaNs we were getting, but not
> all. This may mean that other (optimized) modules contain code with
> the same problem. If this is caused by egcs 1.1, then it is likely
> to occur in more places than the one we found, and tracking them
> all down could be a nightmare.
>
> Therefore our intent is to obtain a copy of egcs 1.1.2, which claims
> to have repaired some problems with floating point code generation
> for Pentium Pro, II, and III architectures. The descriptions of these
> fixes are somewhat vague, and we cannot be certain whether they
> pertain to our problem.
>
> If the new compiler is not the solution, then I will inform the egcs
> people of our dilemma -- and see if I can convince them that a bug we
> claim exists, but which we cannot reliably reproduce, is worth
> investigating.
>
>
> P.S. My thanks to Tony Waite and Richard Dubois for their invaluable
> help in this matter.
>
> Daniel Flath
> LCD Group
> (650) 926-8794
>