Source code of Windows XP (NT5)
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

267 lines
10 KiB

  1. subttl emfsqrt.asm - FSQRT instruction
  2. page
  3. ;*******************************************************************************
  4. ;emfsqrt.asm - FSQRT instruction
  5. ; by Tim Paterson
  6. ;
  7. ; Microsoft Confidential
  8. ;
  9. ; Copyright (c) Microsoft Corporation 1991
  10. ; All Rights Reserved
  11. ;
  12. ;Inputs:
  13. ; edi = [CURstk]
  14. ;
  15. ;Revision History:
  16. ;
  17. ; [] 09/05/91 TP Initial 32-bit version.
  18. ;
  19. ;*******************************************************************************
  20. ;A linear approximation of the square root function is used to get the
  21. ;intial value for Newton-Raphson iteration. This approximation gives
  22. ;nearly 5-bit accuracy over the required input interval, [1,4). The
  23. ;equation for the linear approximation of y = sqrt(x) is y = mx + b,
  24. ;where m is the slope (named SQRT_COEF) and b is the y-intercept (named
  25. ;SQRT_INTERCEPT).
  26. ;
  27. ;(The values for m and b were computed with Excel Solver in two passes:
  28. ;the first pass computed them full precision, minimizing absolute error;
  29. ;the second computed only b after m was rounded to an 8-bit value.)
  30. ;
  31. ;The resulting values have the following maximum error:
  32. ;
  33. ;inp. value --> 1 2.18972 3.82505
  34. ;----------------------------------------------------------------
  35. ;abs. err., full prec. 0.04544 -0.03233 0.04423
  36. ;abs. err., truncated 0.04544 -0.04546 0.04423
  37. ;
  38. ;The three input values shown represent the left end point, the maximum
  39. ;error (derivative of absolute error == 0), and the right end point.
  40. ;The right end point is not 4 because the approximation reaches 2.000
  41. ;at the value given--we abandon the linear approximation at that point
  42. ;and use that same value for all greater input values. This linear
  43. ;approximation is computed with 8-bit operations, so truncations can
  44. ;add a negative error. This increases maximum error only when it is
  45. ;already negative, as shown in the table.
  46. ;
  47. ;Each iteration of Newton-Raphson approximation more than doubles the
  48. ;number of bits of accuracy. Suppose the current guess is A, and it has
  49. ;an absolute error of e (i.e., A+e or A-e is the root). Then the absolute
  50. ;error after the next iteration is e^2/2A. This error is always positive.
  51. ;However, the divide instruction truncates, which introduces an error
  52. ;that is always negative. Sometimes a constant or rounding bit is added
  53. ;to balance the positive and negative errors. The maximum possible error
  54. ;is given in comments below for each iteration. (Note that when we compute
  55. ;the error from e^2/2A, A could be in the range 1 to 2--we use 1 to get
  56. ;max error.) Remember that the binary point is to the RIGHT of the MSB
  57. ;when looking at these error numbers.
  58. ;SQRT_INTERCEPT is used when the binary point is to the right of the MSB.
  59. ;Multiplying it by 64K would put the binary point to the left of the MSB,
  60. ;so it must be divided by two to be aligned.
  61. SQRT_INTERCEPT equ 23185 ; 0.70755 * 65536 / 2
  62. ;SQRT_COEF would have the binary point to the left of the MSB if multiplied
  63. ;by 256. However, this would leave it with a leading zero, so we multiply
  64. ;it by two more to normalize it.
  65. SQRT_COEF equ 173 ; 0.33789 * 256 * 2
  66. SqrtSpcl:
  67. cmp al,bTAG_DEN
  68. jz SqrtDen
  69. cmp al,bTAG_INF
  70. jnz SpclDestNotDen
  71. ;Have infinity
  72. or ah,ah ;Is it negative?
  73. js ReturnIndefinite
  74. SqrtRet:
  75. ret
  76. MaxStartRoot:
  77. ;The first iteration is calculated as (ax / bh) * 100H + bx. The first
  78. ;trial root in bx should be 10000H (which is too big). But it's very
  79. ;easy to calculate (ax / 100H) * 100H + 10000H = ax.
  80. mov bx,ax
  81. cmp ax,-1 ;Would subsequent DIV overflow?
  82. jb FirstTrialRoot
  83. ;The reduced argument is so close to 4.0 that the 16-bit DIV instruction
  84. ;used in the next iteration would overflow. If the argument is 4-A
  85. ;then a guess of 2.0 is in error by approximately A/4. [This is not
  86. ;an upper bound. The error is a little by more than this by an
  87. ;addition with the magnitude of A^2. This is an insignificant amount
  88. ;when A is small.] This means that the first guess of 2.0 is quite
  89. ;accurate, and we'll use it to bypass some of the iteration steps.
  90. ;This will eliminate the DIV overflow by skipping the DIV.
  91. ;
  92. ;One iteration is performed by: (Arg/Guess + Guess)/2. When Guess = 2,
  93. ;this becomes (Arg/2 + 2)/2 = Arg/4 + 1. We get Arg/2 just by assuming
  94. ;the binary point is one bit further left; then a single right shift is
  95. ;needed to get Arg/4. By shifting in a 1 bit on the left, we account for
  96. ;adding 1 at the same time. [Note that if Arg = 4 - A, then Arg/4 + 1
  97. ; = (4 - A)/4 + 1 = 1 - A/4 + 1 = 2 - A/4. In other words, we just
  98. ;subtract out exactly what we estimate our error to be, A/4.]
  99. ;
  100. ;Since the upper 16 bits are 0FFFFH, A <= 2^-14, so error <= 2^-16 =
  101. ; +0.00001526, -0.
  102. mov ebx,esi ;Return root in ebx
  103. sar ebx,1 ;Trial root = arg/2
  104. cmp esi,ebx ;Will 32-bit division overflow?
  105. jb StartThirdIteration ;No, our 32-bit guess is good
  106. ;Argument is really, really close to 4.0: with an initial trial root of
  107. ;2.0, max absolute error is 2^-32 = +2.328E-10, -0. One trivial
  108. ;iteration will get us 65-bit accuracy, max abs. error = +2.71E-20, -0.
  109. mov ebx,esi
  110. mov eax,ecx ;65-bit root*2 in ebx:eax (MSB implied)
  111. shl ecx,2 ;ecx = low half*4
  112. jmp RoundRoot
  113. SqrtDen:
  114. mov EMSEG:[CURerr],Denormal
  115. test EMSEG:[CWmask],Denormal ;Is denormal exception masked?
  116. jnz SqrtRet ;If not, quit
  117. ;******
  118. EM_ENTRY eFSQRT
  119. eFSQRT:
  120. ;******
  121. mov eax,EMSEG:[edi].ExpSgn
  122. cmp al,bTAG_ZERO
  123. jz SqrtRet
  124. ja SqrtSpcl
  125. or ah,ah
  126. js ReturnIndefinite
  127. mov esi,EMSEG:[edi].lManHi
  128. mov ecx,EMSEG:[edi].lManLo
  129. sar EMSEG:[edi].wExp,1 ;Divide exponent by two
  130. mov edi,0 ;Extend mantissa
  131. jc RootAligned ;If odd exponent, leave it normalized
  132. shrd edi,ecx,1
  133. shrd ecx,esi,1
  134. shr esi,1 ;Denormalize, extending into edi
  135. RootAligned:
  136. ;esi:ecx:edi has mantissa, 2 MSBs are left of binary point. Range is [1,4).
  137. shld eax,esi,16 ;Get high word of mantissa
  138. movzx ebx,ah ;High byte to bl
  139. ;UNDONE: MASM 6 bug!!
  140. ;UNDONE: SQRT_COEF (=0AEH) get sign extended!!
  141. mov dx,SQRT_COEF ;UNDONE
  142. imul bx,dx ;UNDONE
  143. ;UNDONE imul bx,SQRT_COEF ;Product in bx
  144. ;Multiply by SQRT_COEF causes binary point to shift left 1 bit.
  145. add bx,SQRT_INTERCEPT ;5-bit approx. square root in bh
  146. jc MaxStartRoot
  147. ;Max absolute error is +/- 0.04546
  148. div bh ;See how close we are
  149. add bh,al ;quotient + divisor (always sets CY)
  150. FirstTrialRoot:
  151. ;Avoid RCR because it takes 9 clocks on 386. Use SHRD (3 clocks) instead.
  152. mov dl,1 ;Need bit set
  153. shrd bx,dx,1 ;(quotient + divisor)/2
  154. ;bx has 9-bit approx. square root, normalized
  155. ;Max absolute error is +0.001033, -0.003906
  156. movzx eax,si
  157. shld edx,esi,16 ;dx:ax has high half mantissa
  158. div bx ;Test our approximation
  159. add ebx,eax ;quotient + divisor
  160. shl ebx,15 ;Normalize (quotient + divisor)/2
  161. ;ebx has 17-bit approx. square root, normalized
  162. ;Max absolute error is +0.000007629, -0.00001526
  163. ;Add adjustment factor to center the error range at +/-0.00001144
  164. or bh,20H ;Add in 0.000003815
  165. StartThirdIteration:
  166. mov edx,esi
  167. mov eax,ecx
  168. div ebx ;Test approximation
  169. stc ;Set bit for rounding (= 2.328E-10)
  170. adc ebx,eax ;quotient + divisor + round bit
  171. ;Avoid RCR because it takes 9 clocks on 386. Use SHRD (3 clocks) instead.
  172. mov dl,1 ;Need bit set
  173. shrd ebx,edx,1 ;(quotient + divisor)/2, rounded
  174. ;ebx has 32-bit approx. square root, normalized
  175. ;Max absolute error is +2.983E-10, -2.328E-10
  176. mov edx,esi ;Last time we need high half
  177. mov eax,ecx
  178. shld ecx,edi,2 ;ecx = low half*4, w/extension back in
  179. div ebx ;Test approximation
  180. xchg edi,eax ;Save 1st quotient, get extension
  181. mov esi,eax
  182. or esi,edx ;Any remainder?
  183. jz HaveRoot ;Result is ebx:esi
  184. div ebx ;edi:eax is 64-bit quotient
  185. add ebx,edi ;quotient + divisor (always sets CY)
  186. RoundRoot:
  187. mov esi,eax ;Save low half root*2
  188. ;We have 65-bit root*2 in ebx:esi (eax==esi) (MSB is implied one).
  189. ;Max absolute error is +4.450E-20, -5.421E-20. This maximum error
  190. ;corresponds to just less than +/- 1 in the last (65th) bit.
  191. ;
  192. ;We have to determine if this error is positive or negative so
  193. ;we can tell if we rounded up or down (and set the status bit
  194. ;accordingly). This is done by squaring the root and comparing the
  195. ;that result with the input.
  196. ;
  197. ;Squaring the sample root requires summing partial products:
  198. ; lo*lo + lo*hi + hi*lo + hi*hi. lo*hi == hi*lo, so only one multiply
  199. ;is needed there. The low half of lo*lo isn't relevant, we know it
  200. ;is non-zero. Only the low few bits of hi*hi are needed, so we can use
  201. ;an 8-bit multiply there. Since the MSB is implied, we need to add in
  202. ;two 1*lo products (shifted up 64 bits). We only need bits 64 - 71 of
  203. ;the 130-bit product (the action happens near bit 65). What we're
  204. ;squaring is root*2, so the result is square*4. ecx already has arg*4.
  205. mul eax ;Low partial product of square
  206. mov edi,edx ;Only high half counts
  207. mov eax,ebx
  208. mul esi ;Middle partial product of square
  209. add eax,eax ;There are two of these
  210. adc edx,edx
  211. add edi,eax
  212. adc edx,0 ;edx:edi = lo*lo + lo*hi + hi*lo
  213. add edx,esi ;lo*implied msb
  214. add edx,esi ;lo*implied msb again
  215. mov al,bl
  216. mul al ;hi*hi - only low 8 bits are valid
  217. add al,dl ;Bits 64 - 71 of product
  218. or al,1 ;Account for sticky bits 0 - 63
  219. sub cl,al ;Compare product with argument
  220. ;Sign flag set if product is larger. In this case, subtract 1 from root.
  221. add cl,cl ;Set CY if sign is set
  222. SubOneFromRoot:
  223. sbb esi,0 ;Reduce root if product was too big
  224. sbb ebx,0
  225. ShiftRoot:
  226. ;ebx:esi = root*2
  227. ;Absolute error is in the range (0, -5.421E-20). This is equivalent to
  228. ;less than +1, -0 in last bit. Thus LSB is correct rounding bit as
  229. ;long as we set a sticky bit below it.
  230. ;
  231. ;Now divide root*2 by 2, preserving LSB as rounding bit and filling
  232. ;eax with 1's as sticky bits.
  233. ;
  234. ;Avoid RCR because it takes 9 clocks on 386. Use SHRD (3 clocks) instead.
  235. mov eax,-1
  236. shrd eax,esi,1 ;Move round bit to MSB of eax
  237. shrd esi,ebx,1
  238. shrd ebx,eax,1 ;Shift 1 into MSB of ebx
  239. StoreRoot:
  240. mov edi,EMSEG:[CURstk]
  241. mov EMSEG:[Result],edi
  242. mov ecx,EMSEG:[edi].ExpSgn
  243. ;mantissa in ebx:esi:eax, exponent in high ebx, sign in bh bit 7
  244. jmp EMSEG:[RoundMode]
  245. HaveRoot:
  246. ;esi = eax = edx = 0
  247. cmp edi,ebx ;Does quotient == divisor?
  248. jz StoreRoot ;If so, we're done
  249. ;Quotient != divisor, so answer is not exact. Since remainder is zero,
  250. ;the division was exact. The only error in the result is e^2/2A, which
  251. ;is always positive. We need the error to be only negative so that
  252. ;the rounding routine can properly tell if it rounded up.
  253. add ebx,edi ;quotient + divisor (always sets CY)
  254. jmp SubOneFromRoot ;Reduce root to ensure negative error