Leaked source code of windows server 2003
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

3967 lines
147 KiB

;/* *************************************************************************
;** INTEL Corporation Proprietary Information
;**
;** This listing is supplied under the terms of a license
;** agreement with INTEL Corporation and may not be copied
;** nor disclosed except in accordance with the terms of
;** that agreement.
;**
;** Copyright (c) 1995 Intel Corporation.
;** All Rights Reserved.
;**
;** *************************************************************************
;*/
;////////////////////////////////////////////////////////////////////////////
;//
;// $Header: R:\h26x\h26x\src\enc\ex5me.asv 1.17 24 Sep 1996 11:27:00 BNICKERS $
;//
;// $Log: R:\h26x\h26x\src\enc\ex5me.asv $
;//
;// Rev 1.17 24 Sep 1996 11:27:00 BNICKERS
;//
;// Fix register colision.
;//
;// Rev 1.16 24 Sep 1996 10:40:32 BNICKERS
;// For H261, zero out motion vectors when classifying MB as Intra.
;//
;// Rev 1.13 19 Aug 1996 13:48:26 BNICKERS
;// Provide threshold and differential variables for spatial filtering.
;//
;// Rev 1.12 17 Jun 1996 15:19:34 BNICKERS
;// Fix recording of block and MB SWDs for Spatial Loop Filtering case in H261.
;//
;// Rev 1.11 30 May 1996 16:40:14 BNICKERS
;// Fix order of arguments.
;//
;// Rev 1.10 30 May 1996 15:08:36 BNICKERS
;// Fixed minor error in recent IA ME speed improvements.
;//
;// Rev 1.9 29 May 1996 15:37:58 BNICKERS
;// Acceleration of IA version of ME.
;//
;// Rev 1.8 15 Apr 1996 10:48:48 AKASAI
;// Fixed bug in Spatial loop filter code. Code had been unrolled and
;// the second case had not been updated in the fix put in place of
;// (for) the first case. Basically an ebx instead of bl that cased
;// and overflow from 7F to 3F.
;//
;// Rev 1.7 15 Feb 1996 15:39:26 BNICKERS
;// No change.
;//
;// Rev 1.6 15 Feb 1996 14:39:00 BNICKERS
;// Fix bug wherein access to area outside stack frame was occurring.
;//
;// Rev 1.5 15 Jan 1996 14:31:40 BNICKERS
;// Fix decrement of ref area addr when half pel upward is best in block ME.
;// Broadcast macroblock level MV when block gets classified as Intra.
;//
;// Rev 1.4 12 Jan 1996 13:16:08 BNICKERS
;// Fix SLF so that 3 7F pels doesn't overflow, and result in 3F instead of 7F.
;//
;// Rev 1.3 27 Dec 1995 15:32:46 RMCKENZX
;// Added copyright notice
;//
;// Rev 1.2 19 Dec 1995 17:11:16 RMCKENZX
;// fixed 2 bugs:
;// 1. do +-15 pel search if central and NOT 4 mv / macroblock
;// (was doing when central AND 4 mv / macroblock)
;// 2. correctly compute motion vectors when doing 4 motion
;// vectors per block.
;//
;// Rev 1.1 28 Nov 1995 15:25:48 AKASAI
;// Added white space so that will complie with the long lines.
;//
;// Rev 1.0 28 Nov 1995 14:37:00 BECHOLS
;// Initial revision.
;//
;//
;// Rev 1.13 22 Nov 1995 15:32:42 DBRUCKS
;// Brian made this change on my system.
;// Increased a value to simplify debugging
;//
;//
;//
;// Rev 1.12 17 Nov 1995 10:43:58 BNICKERS
;// Fix problems with B-Frame ME.
;//
;//
;//
;// Rev 1.11 31 Oct 1995 11:44:26 BNICKERS
;// Save/restore ebx.
;//
;////////////////////////////////////////////////////////////////////////////
;
; MotionEstimation -- This function performs motion estimation for the macroblocks identified
; in the input list.
; Conditional assembly selects either the H263 or H261 version.
;
; Input Arguments:
;
; MBlockActionStream
;
; The list of macroblocks for which we need to perform motion estimation.
;
; Upon input, the following fields must be defined:
;
; CodedBlocks -- Bit 6 must be set for the last macroblock to be processed.
;
; FirstMEState -- must be 0 for macroblocks that are forced to be Intracoded. An
; IntraSWD will be calculated.
; Other macroblocks must have the following values:
; 1: upper left, without advanced prediction. (Advanced prediction
; only applies to H263.)
; 2: upper edge, without advanced prediction.
; 3: upper right, without advanced prediction.
; 4: left edge, without advanced prediction.
; 5: central block, or any block if advanced prediction is being done.
; 6: right edge, without advanced prediction.
; 7: lower left, without advanced prediction.
; 8: lower edge, without advanced prediction.
; 9: lower right, without advanced prediction.
; If vertical motion is NOT allowed:
; 10: left edge, without advanced prediction.
; 11: central block, or any block if advanced prediction is being done.
; 12: right edge, without advanced prediction.
; *** Note that with advanced prediction, only initial states 0, 4, or
; 11 can be specified. Doing block level motion vectors mandates
; advanced prediction, but in that case, only initial
; states 0 and 4 are allowed.
;
; BlkOffset -- must be defined for each of the blocks in the macroblocks.
;
; TargetFrameBaseAddress -- Address of upper left viewable pel in the target Y plane.
;
; PreviousFrameBaseAddress -- Address of upper left viewable pel in the previous Y plane. Whether this is the
; reconstructed previous frame, or the original, is up to the caller to decide.
;
; FilteredFrameBaseAddress -- Address of upper left viewable pel in the scratch area that this function can record
; the spatially filtered prediction for each block, so that frame differencing can
; utilize it rather than have to recompute it. (H261 only)
;
; DoRadius15Search -- TRUE if central macroblocks should search a distance of 15 from center. Else searches 7 out.
;
; DoHalfPelEstimation -- TRUE if we should do ME to half pel resolution. This is only applicable for H263 and must
; be FALSE for H261. (Note: TRUE must be 1; FALSE must be 0).
;
; DoBlockLevelVectors -- TRUE if we should do ME at block level. This is only applicable for H263 and must be FALSE
; for H261. (Note: TRUE must be 1; FALSE must be 0).
; DoSpatialFiltering -- TRUE if we should determine if spatially filtering the prediction reduces the SWD. Only
; applicable for H261 and must be FALSE for H263. (Note: TRUE must be 1; FALSE must be 0).
;
; ZeroVectorThreshold -- If the SWD for a macroblock is less than this threshold, we do not bother searching for a
; better motion vector. Compute as follows, where D is the average tolerable pel difference
; to satisfy this threshold. (Initial recommendation: D=2 ==> ZVT=384)
; ZVT = (128 * ((int)((D**1.6)+.5)))
;
; NonZeroDifferential -- After searching for the best motion vector (or individual block motion vectors, if enabled),
; if the macroblock's SWD is not better than it was for the zero vector -- not better by at
; least this amount -- then we revert to the zero vector. We are comparing two macroblock
; SWDs, both calculated as follows: (Initial recommendation: NZD=128)
; For each of 128 match points, where D is its Abs Diff, accumulate ((int)(M**1.6)+.5)))
;
; BlockMVDifferential -- The amount by which the sum of four block level SWDs must be better than a single macroblock
; level SWD to cause us to choose block level motion vectors. See NonZeroDifferential for
; how the SWDs are calculated. Only applicable for H261. (Initial recommendation: BMVD=128)
;
; EmptyThreshold -- If the SWD for a block is less than this, the block is forced empty. Compute as follows, where D
; is the average tolerable pel diff to satisfy threshold. (Initial recommendation: D=3 ==> ET=96)
; ET = (32 * ((int)((D**1.6)+.5)))
;
; InterCodingThreshold -- If any of the blocks are forced empty, we can simply skip calculating the INTRASWD for the
; macroblock. If none of the blocks are forced empty, we will compare the macroblock's SWD
; against this threshold. If below the threshold, we will likewise skip calculating the
; INTRASWD. Otherwise, we will calculate the INTRASWD, and if it is less than the [Inter]SWD,
; we will classify the block as INTRA-coded. Compute as follows, where D is the average
; tolerable pel difference to satisfy threshold. (Initial recommendation: D=4 ==> ICT=1152)
; ICT = (128 * ((int)((D**1.6)+.5)))
;
; IntraCodingDifferential -- For INTRA coding to occur, the INTRASWD must be better than the INTERSWD by at least
; this amount.
;
; Output Arguments
;
; MBlockActionStream
;
; These fields are defined as follows upon return:
;
; BlockType -- Set to INTRA, INTER1MV, or (H263 only) INTER4MV.
;
; PHMV and PVMV -- The horizontal and vertical motion vectors, in units of a half pel.
;
; BHMV and BVMV -- These fields get clobbered.
;
; PastRef -- If BlockType != INTRA, set to the address of the reference block.
;
; If Horizontal MV indicates a half pel position, the prediction for the upper left pel of the block
; is the average of the pel at PastRef and the one at PastRef+1.
;
; If Vertical MV indicates a half pel position, the prediction for the upper left pel of the block
; is the average of the pel at PastRef and the one at PastRef+PITCH.
;
; If both MVs indicate half pel positions, the prediction for the upper left pel of the block is the
; average of the pels at PastRef, PastRef+1, PastRef+PITCH, and PastRef+PITCH+1.
;
; Indications of a half pel position can only happen for H263.
;
; In H261, when spatial filtering is done, the address will be in the SpatiallyFilteredFrame, where
; this function stashes the spatially filtered prediction for subsequent reuse by frame differencing.
;
; CodedBlocks -- Bits 4 and 5 are turned on, indicating that the U and V blocks should be processed. (If the
; FDCT function finds them to quantize to empty, it will mark them as empty.)
;
; Bits 0 thru 3 are cleared for each of blocks 1 thru 4 that MotionEstimation forces empty;
; they are set otherwise.
;
; Bits 6 and 7 are left unchanged.
;
; SWD -- Set to the sum of the SWDs for the four luma blocks in the macroblock. The SWD for any block that is
; forced empty, is NOT included in the sum.
;
;
;
; IntraSWDTotal -- The sum of the block SWDs for all Intracoded macroblocks.
;
; IntraSWDBlocks -- The number of blocks that make up the IntraSWDTotal.
;
; InterSWDTotal -- The sum of the block SWDs for all Intercoded macroblocks.
; None of the blocks forced empty are included in this.
;
; InterSWDBlocks -- The number of blocks that make up the InterSWDTotal.
;
;
; Other assumptions:
;
; For performance reasons, it is assumed that the layout of current and previous frames (and spatially filtered
; frame for H261) rigourously conforms to the following guide.
;
; The spatially filtered frame (only present and applicable for H261) is an output frame into which MotionEstimation
; places spatially filtered macroblocks as it determines if filtering is good for a macroblock. If it determines
; such, frame differencing will be able to re-use the spatially filtered macroblock, rather than recomputing it.
;
; Cache
; Alignment
; Points: v v v v v v v v v v v v v
; 16 | 352 (narrower pictures are left justified) | 16
; +---+---------------------------------------------------------------------------------------+---+
; | D | Current Frame Y Plane | D |
; | u | | u |
; Frame | m | | m |
; Height | m | | m |
; Lines | y | | y |
; | | | |
; +---+---------------------------------------------------------------------------------------+---+
; | |
; | |
; | |
; 24 lines | Dummy Space (24 lines plus 8 bytes. Can be reduced to 8 bytes if unrestricted motion |
; | vectors is NOT selected.) |
; | |
; | 8 176 16 176 |8
; | +-+-------------------------------------------------------------------------------------------+-+
; +-+D| Current Frame U Plane | D | Current Frame V Plane |D|
; Frame |u| | u | |u|
; Height |m| | m | |m|
; Div By 2 |m| | m | |m|
; Lines |y| | y | |y|
; +-+-------------------------------------------+---+-------------------------------------------+-+
; 72 dummy bytes. I.e. enough dummy space to assure that MOD ((Previous_Frame - Current_Frame), 128) == 80
; +-----------------------------------------------------------------------------------------------+
; | |
; 16 lines | If Unrestricted Motion Vectors selected, 16 lines must appear above and below previous frame, |
; | and these lines plus the 16 columns to the left and 16 columns to the right of the previous |
; | frame must be initialized to the values at the edges and corners, propagated outward. If |
; | Unrestricted Motion Vectors is off, these lines don't have to be allocated. |
; | |
; | +---------------------------------------------------------------------------------------+ +
; Frame | | Previous Frame Y Plane | |
; Height | | | |
; Lines | | | |
; | | | |
; | | | |
; | +---------------------------------------------------------------------------------------+ +
; | |
; 16 lines | See comment above Previous Y Plane |
; | |
; |+--- 8 bytes of dummy space. Must be there, whether unrestricted MV or not. |
; || |
; |v+-----------------------------------------------+---------------------------------------------+-+
; +-+ | |
; | See comment above Previous Y Plane. | See comment above Previous Y Plane. |
; 8 lines | Same idea here, but 8 lines are needed above | Same idea here, but 8 lines are needed |
; | and below U plane, and 8 columns on each side.| and below V plane, and 8 columns on each side.|
; | | |
; |8 176 8|8 176 8|
; | +-------------------------------------------+ | +-------------------------------------------+ |
; | | Previous Frame U Plane | | | Previous Frame V Plane | |
; Frame | | | | | | |
; Height | | | | | | |
; Div By 2 | | | | | | |
; Lines | | | | | | |
; | +-------------------------------------------+ | +-------------------------------------------+ |
; | | |
; 8 lines | See comment above Previous U Plane | See comment above Previous V Plane |
; | | |
; | | |
; | | |
; +-----------------------------------------------+---------------------------------------------+-+
; Enough dummy space to assure that MOD ((Spatial_Frame - Previous_Frame), 4096) == 2032
; +---+---------------------------------------------------------------------------------------+---+
; | D | Spatially Filtered Y Plane (present only for H261) | D |
; | u | | u |
; Frame | m | | m |
; Height | m | | m |
; Lines | y | | y |
; | | | |
; +---+---------------------------------------------------------------------------------------+---+
; | |
; | |
; | |
; 24 lines | Dummy Space (24 lines plus 8 bytes. Can be reduced to 8 bytes if unrestricted motion |
; | vectors is NOT selected, which is certainly the case for H261.) |
; | |
; | 8 176 16 176 |8
; | +-+-------------------------------------------------------------------------------------------+-+
; +-+D| Spatially Filtered U plane (H261 only) | D | Spatially Filtered V plane (H261 only) |D|
; Frame |u| | u | |u|
; Height |m| | m | |m|
; Div By 2 |m| | m | |m|
; Lines |y| | y | |y|
; +-+-------------------------------------------+---+-------------------------------------------+-+
;
; Cache layout of the target block and the full range for the reference area (as restricted to +/- 7 in vertical,
; and +/- 7 (expandable to +/- 15) in horizontal, is as shown here. Each box represents a cache line (32 bytes),
; increasing incrementally from left to right, and then to the next row (like reading a book). The 128 boxes taken
; as a whole represent 4Kbytes. The boxes are populated as follows:
;
; R -- Data from the reference area. Each box contains 23 of the pels belonging to a line of the reference area.
; The remaining 7 pels of the line is either in the box to the left (for reference areas used to provide
; predictions for target macroblocks that begin at an address 0-mod-32), or to the right (for target MBs that
; begin at an address 16-mod-32). There are 30 R's corresponding to the 30-line limit on the vertical distance
; we might search.
;
; T -- Data from the target macroblock. Each box contains a full line (16 pels) for each of two adjacent
; macroblocks. There are 16 T's corresponding to the 16 lines of the macroblocks.
;
; S -- Space for the spatially filtered macroblock (H261 only).
;
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | T | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | T | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | T | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | T | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | T | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | S | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | S | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | S | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | S | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | S | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+---+---+---+---+
; | T | | R | | S | | R | |
; +---+---+---+---+---+---+---+---+
;
; Thus, in a logical sense, the above data fits into one of the 4K data cache pages, leaving the other for all other
; data. Care has been taken to assure that the tables and the stack space needed by this function fit nicely into
; the other data cache page. Only the MBlockActionStream remains to conflict with the above data structures. That
; is both unavoidable, and of minimal consequence.
; An algorithm has been selected that calculates fewer SWDs (Sum of Weighted Differences) than the typical log search.
; In the typical log search, a three level search is done, in which the SWDs are compared for the center point and a
; point at each 45 degrees, initially 4 pels away, then 2, then 1. This requires a total of 25 SWDs for each
; macroblock (except those near edges or corners).
;
; In this algorithm, six levels are performed, with each odd level being a horizontal search, and each even level being
; a vertical search. Each search compares the SWD for the center point with that of a point in each direction on the
; applicable axis. This requires 13 SWDs, and a lot simpler control structure. Here is an example picture of a
; search, in which "0" represents the initial center point (the 0,0 motion vector), "A", and "a" represent the first
; search points, etc. In this example, the "winner" of each level of the search proceeds as follows: a, B, C, C, E, F,
; arriving at a motion vector of -1 horizontal, 5 vertical.
;
; ...............
; ...............
; ...............
; ...b...........
; ...............
; ...............
; ...............
; ...a...0...A...
; ...............
; .....d.........
; ......f........
; .c.BeCE........
; ......F........
; .....D.........
; ...............
;
;
; A word about data cache performance. Conceptually, the tables and local variables used by this function are placed
; in memory such that they will fit in one 4K page of the on-chip data cache. For the Pentium (tm) microprocessor,
; this leaves the other 4K page for other purposes. The other data structures consist of:
;
; The current frame, from which we need to access the lines of the 16*16 macroblock. Since cache lines are 32 bytes
; wide, the cache fill operations that fetch one target macroblock will serve to fetch the macroblock to the right,
; so an average of 8 cache lines are fetched for each macroblock.
;
; The previous frame, from which we need to access a reference area of 30*30 pels. For each macroblock for which we
; need to search for a motion vector, we will typically need to access no more than about 25 of these, but in general
; these lines span the 30 lines of the search area. Since cache lines are 32 bytes wide, the cache fill operations
; that fetch reference data for one macroblock, will tend to fetch data that is useful as reference data for the
; macroblock to the right, so an average of about 15 (rounded up to be safe) cache lines are fetched for each
; macroblock.
;
; The MBlockActionStream, which controls the searching (since we don't need to motion estimate blocks that are
; legislated to be intra) will disrupt cache behaviour of the other data structures, but not to a significant degree.
;
; By setting the pitch to a constant of 384, and by allocating the frames as described above, the one available 4K page
; of data cache will be able to contain the 30 lines of the reference area, the 16 lines of the target area, and the
; 16 lines of the spatially filtered area (H261 only) without any collisions.
;
;
; Here is a flowchart of the major sections of this function:
;
; +-- Execute once for Y part of each macroblock that is NOT Intra By Decree --+
; | |
; | +---------------------------------------------------------------+ |
; | | 1) Compute average value for target match points. | |
; | | 2) Prepare match points in target MB for easier matching. | |
; | | 3) Compute the SWD for (0,0) motion vector. | |
; | +---------------------------------------------------------------+ |
; | | |
; | v |
; | /---------------------------------\ Yes |
; | < 4) Is 0-motion SWD good enough? >-------------------------+ |
; | \---------------------------------/ | |
; | | | |
; | |No | |
; | v | |
; | +--- 5) While state engine has more motion vectors to check ---+ | |
; | | | | |
; | | | | |
; | | +---------------------------------------------------+ | | |
; | | | 5) Compute SWDs for 2 ref MBs and pick best of 3. |----->| | |
; | | +---------------------------------------------------+ | | |
; | | | | |
; | +--------------------------------------------------------------+ | |
; | | | |
; | v | |
; | /-----------------------------------------\ | |
; | < 6) Is best motion vector the 0-vector? > | |
; | \-----------------------------------------/ | |
; | | | | |
; | |No |Yes | |
; | v v | |
; | +-----------------+ +-------------------------------------------+ | |
; | | Mark all blocks | | 6) Identify as empty block any in which: |<-+ |
; | +--| non-empty. | | --> 0-motion SWD < EmptyThresh, and | |
; | | +-----------------+ +-------------------------------------------+ |
; | | | |
; | | v |
; | | /--------------------------------\ Yes +--------------------------+ |
; | | < 6) Are all blocks marked empty? >--->| 6) Classify FORCEDEMPTY |-->|
; | | \--------------------------------/ +--------------------------+ |
; | | | |
; | | |No |
; | | v |
; | | /--------------------------------------------\ |
; | | < 7) Are any non-phantom blocks marked empty? > |
; | | \--------------------------------------------/ |
; | | | | |
; | | |No |Yes |
; | v v v |
; | +---------------------+ +--------------------------------+ |
; | | 8) Compute IntraSWD | | Set IntraSWD artificially high | |
; | +---------------------+ +--------------------------------+ |
; | | | |
; | v v |
; | +-------------------------------+ |
; | | 10) Classify block as one of: | |
; | | INTRA |--------------------------------->|
; | | INTER | |
; | +-------------------------------+ |
; | |
; +----------------------------------------------------------------------------+
;
;
OPTION PROLOGUE:None
OPTION EPILOGUE:ReturnAndRelieveEpilogueMacro
OPTION M510
include e3inst.inc
include e3mbad.inc
.xlist
include memmodel.inc
.list
.DATA
; Storage for tables and temps used by Motion Estimation function. Fit into
; 4Kbytes contiguous memory so that it uses one cache page, leaving other
; for reference area of previous frame and target macroblock of current frame.
PickPoint DB 0,4,?,4,0,?,2,2 ; Map CF accum to new central pt selector.
PickPoint_BLS DB 6,4,?,4,6,?,2,2 ; Same, for when doing block level search.
OffsetToRef LABEL DWORD ; Linearized adjustments to affect horz/vert motion.
DD ? ; This index used when zero-valued motion vector is good enough.
DD 0 ; Best fit of 3 SWDs is previous center.
DD 1 ; Best fit of 3 SWDs is the ref block 1 pel to the right.
DD -1 ; Best fit of 3 SWDs is the ref block 1 pel to the left.
DD 1*PITCH ; Best fit of 3 SWDs is the ref block 1 pel above.
DD -1*PITCH ; Best fit of 3 SWDs is the ref block 1 pel below.
DD 2 ; Best fit of 3 SWDs is the ref block 2 pels to the right.
DD -2 ; Best fit of 3 SWDs is the ref block 2 pels to the left.
DD 2*PITCH ; Best fit of 3 SWDs is the ref block 2 pel above.
DD -2*PITCH ; Best fit of 3 SWDs is the ref block 2 pel below.
DD 4 ; Best fit of 3 SWDs is the ref block 4 pels to the right.
DD -4 ; Best fit of 3 SWDs is the ref block 4 pels to the left.
DD 4*PITCH ; Best fit of 3 SWDs is the ref block 4 pel above.
DD -4*PITCH ; Best fit of 3 SWDs is the ref block 4 pel below.
DD 7 ; Best fit of 3 SWDs is the ref block 7 pels to the right.
DD -7 ; Best fit of 3 SWDs is the ref block 7 pels to the left.
DD 7*PITCH ; Best fit of 3 SWDs is the ref block 7 pel above.
DD -7*PITCH ; Best fit of 3 SWDs is the ref block 7 pel below.
M0 = 4 ; Define symbolic indices into OffsetToRef lookup table.
MHP1 = 8
MHN1 = 12
MVP1 = 16
MVN1 = 20
MHP2 = 24
MHN2 = 28
MVP2 = 32
MVN2 = 36
MHP4 = 40
MHN4 = 44
MVP4 = 48
MVN4 = 52
MHP7 = 56
MHN7 = 60
MVP7 = 64
MVN7 = 68
; Map linearized motion vector to vertical part.
; (Mask bottom byte of linearized MV to zero, then use result
; as index into this array to get vertical MV.)
IF PITCH-384
*** error: The magic of this table assumes a pitch of 384.
ENDIF
DB -32, -32
DB -30
DB -28, -28
DB -26
DB -24, -24
DB -22
DB -20, -20
DB -18
DB -16, -16
DB -14
DB -12, -12
DB -10
DB -8, -8
DB -6
DB -4, -4
DB -2
DB 0
UnlinearizedVertMV DB 0
DB 2
DB 4, 4
DB 6
DB 8, 8
DB 10
DB 12, 12
DB 14
DB 16, 16
DB 18
DB 20, 20
DB 22
DB 24, 24
DB 26
DB 28, 28
DB 30
; Map initial states to initializers for half pel search. Where search would
; illegally take us off edge of picture, set initializer artificially high.
InitHalfPelSearchHorz LABEL DWORD
DD 040000000H, 000000000H, 000004000H
DD 040000000H, 000000000H, 000004000H
DD 040000000H, 000000000H, 000004000H
DD 040000000H, 000000000H, 000004000H
InitHalfPelSearchVert LABEL DWORD
DD 040000000H, 040000000H, 040000000H
DD 000000000H, 000000000H, 000000000H
DD 000004000H, 000004000H, 000004000H
DD 040004000H, 040004000H, 040004000H
SWDState LABEL BYTE ; Rules that govern state engine of motion estimator.
DB 8 DUP (?) ; 0: not used.
; 1: Upper Left Corner. Explore 4 right and 4 down.
DB 21, M0 ; (0,0)
DB 22, MHP4 ; (0,4)
DB 23, MVP4, ?, ? ; (4,0)
; 2: Upper Edge. Explore 4 left and 4 right.
DB 22, M0 ; (0, 0)
DB 22, MHN4 ; (0,-4)
DB 22, MHP4, ?, ? ; (0, 4)
; 3: Upper Right Corner. Explore 4 right and 4 down.
DB 31, M0 ; (0, 0)
DB 22, MHN4 ; (0,-4)
DB 32, MVP4, ?, ? ; (4, 0)
; 4: Left Edge. Explore 4 up and 4 down.
DB 23, M0 ; ( 0,0)
DB 23, MVN4 ; (-4,0)
DB 23, MVP4, ?, ? ; ( 4,0)
; 5: Interior Macroblock. Explore 4 up and 4 down.
DB 37, M0 ; ( 0,0)
DB 37, MVN4 ; (-4,0)
DB 37, MVP4, ?, ? ; ( 4,0)
; 6: Right Edge. Explore 4 up and 4 down.
DB 32, M0 ; ( 0,0)
DB 32, MVN4 ; (-4,0)
DB 32, MVP4, ?, ? ; ( 4,0)
; 7: Lower Left Corner. Explore 4 up and 4 right.
DB 38, M0 ; ( 0,0)
DB 39, MHP4 ; ( 0,4)
DB 23, MVN4, ?, ? ; (-4,0)
; 8: Lower Edge. Explore 4 left and 4 right.
DB 39, M0 ; (0, 0)
DB 39, MHN4 ; (0,-4)
DB 39, MHP4, ?, ? ; (0, 4)
; 9: Lower Right Corner. Explore 4 up and 4 left.
DB 44, M0 ; ( 0, 0)
DB 39, MHN4 ; ( 0,-4)
DB 32, MVN4, ?, ? ; (-4, 0)
; 10: Left Edge, No Vertical Motion Allowed.
DB 46, M0 ; (0,0)
DB 48, MHP2 ; (0,2)
DB 47, MHP4, ?, ? ; (0,4)
; 11: Interior Macroblock, No Vertical Motion Allowed.
DB 47, M0 ; (0, 0)
DB 47, MHN4 ; (0,-4)
DB 47, MHP4, ?, ? ; (0, 4)
; 12: Right Edge, No Vertical Motion Allowed.
DB 49, M0 ; (0, 0)
DB 48, MHN2 ; (0,-2)
DB 47, MHN4, ?, ? ; (0,-4)
; 13: Horz by 2, Vert by 2, Horz by 1, Vert by 1.
DB 14, M0
DB 14, MHP2
DB 14, MHN2, ?, ?
; 14: Vert by 2, Horz by 1, Vert by 1.
DB 15, M0
DB 15, MVP2
DB 15, MVN2, ?, ?
; 15: Horz by 1, Vert by 1.
DB 16, M0
DB 16, MHP1
DB 16, MHN1, ?, ?
; 16: Vert by 1.
DB 0, M0
DB 0, MVP1
DB 0, MVN1, ?, ?
; 17: Vert by 2, Horz by 2, Vert by 1, Horz by 1.
DB 18, M0
DB 18, MVP2
DB 18, MVN2, ?, ?
; 18: Horz by 2, Vert by 1, Horz by 1.
DB 19, M0
DB 19, MHP2
DB 19, MHN2, ?, ?
; 19: Vert by 1, Horz by 1.
DB 20, M0
DB 20, MVP1
DB 20, MVN1, ?, ?
; 20: Horz by 1.
DB 0, M0
DB 0, MHP1
DB 0, MHN1, ?, ?
; 21: From 1A. Upper Left. Try 2 right and 2 down.
DB 24, M0 ; (0, 0)
DB 25, MHP2 ; (0, 2)
DB 26, MVP2, ?, ? ; (2, 0)
; 22: From 1B.
; From 2 center point would be (0,-4/0/4).
; From 3B center point would be (0,-4).
DB 27, M0 ; (0, 4)
DB 18, MVP2 ; (2, 4) Next: Horz 2, Vert 1, Horz 1. (1:3,1:7)
DB 13, MVP4, ?, ? ; (4, 4) Next: Horz 2, Vert 2, Horz 1, Vert 1. (1:7,1:7)
; 23: From 1C.
; From 4 center point would be (-4/0/4,0).
; From 7C center point would be (-4,0).
DB 29, M0 ; (4, 0)
DB 14, MHP2 ; (4, 2) Next: Vert 2, Horz 1, Vert 1. (1:7,1:3)
DB 17, MHP4, ?, ? ; (4, 4) Next: Vert 2, Horz 2, Vert 1, Horz 1. (1:7,1:7)
; 24: From 21A. Upper Left. Try 1 right and 1 down.
DB 0, M0 ; (0, 0)
DB 0, MHP1 ; (1, 0)
DB 0, MVP1, ?, ? ; (0, 1)
; 25: From 21B.
; From 31B center point would be (0,-2).
DB 20, M0 ; (0, 2) Next: Horz 1 (0,1:3)
DB 20, MVP1 ; (1, 2) Next: Horz 1 (1,1:3)
DB 15, MVP2, ?, ? ; (2, 2) Next: Horz 1, Vert 1 (1:3,1:3)
; 26: From 21C.
; From 38C center point would be (-2,0).
DB 16, M0 ; (2, 0) Next: Vert 1 (1:3,0)
DB 16, MHP1 ; (2, 1) Next: Vert 1 (1:3,1)
DB 19, MHP2, ?, ? ; (2, 2) Next: Vert 1, Horz 1 (1:3,1:3)
; 27: From 22A.
DB 28, M0 ; (0, 4)
DB 28, MHN2 ; (0, 2)
DB 28, MHP2, ?, ? ; (0, 6)
; 28: From 27.
DB 20, M0 ; (0, 2/4/6) Next: Horz 1. (0,1:7)
DB 20, MVP1 ; (1, 2/4/6) Next: Horz 1. (1,1:7)
DB 20, MVP2, ?, ? ; (2, 2/4/6) Next: Horz 1. (2,1:7)
; 29: From 23A.
DB 30, M0 ; (4, 0)
DB 30, MVN2 ; (2, 0)
DB 30, MVP2, ?, ? ; (6, 0)
; 30: From 29.
DB 16, M0 ; (2/4/6, 0) Next: Vert 1. (1:7,0)
DB 16, MHP1 ; (2/4/6, 1) Next: Vert 1. (1:7,1)
DB 16, MHP2, ?, ? ; (2/4/6, 2) Next: Vert 1. (1:7,2)
; 31: From 3A. Upper Right. Try 2 left and 2 down.
DB 33, M0 ; (0, 0)
DB 25, MHN2 ; (0,-2)
DB 34, MVP2, ?, ? ; (2, 0)
; 32: From 3C.
; From 6 center point would be (-4/0/4, 0)
; From 9C center point would be (-4, 0)
DB 35, M0 ; (4, 0)
DB 14, MHN2 ; (4,-2) Next: Vert2,Horz1,Vert1. (1:7,-1:-3)
DB 17, MHN4, ?, ? ; (4,-4) Next: Vert2,Horz2,Vert1,Horz1. (1:7,-1:-7)
; 33: From 31A. Upper Right. Try 1 left and 1 down.
DB 0, M0 ; (0, 0)
DB 0, MHN1 ; (0,-1)
DB 0, MVP1, ?, ? ; (1, 0)
; 34: From 31C.
; From 44C center point would be (-2, 0)
DB 16, M0 ; (2, 0) Next: Vert 1 (1:3, 0)
DB 16, MHN1 ; (2,-1) Next: Vert 1 (1:3,-1)
DB 19, MHN2, ?, ? ; (2,-2) Next: Vert 1, Horz 1 (1:3,-1:-3)
; 35: From 32A.
DB 36, M0 ; (4, 0)
DB 36, MVN2 ; (2, 0)
DB 36, MVP2, ?, ? ; (6, 0)
; 36: From 35.
DB 16, M0 ; (2/4/6, 0) Next: Vert 1. (1:7, 0)
DB 16, MHN1 ; (2/4/6,-1) Next: Vert 1. (1:7,-1)
DB 16, MHN2, ?, ? ; (2/4/6,-2) Next: Vert 1. (1:7,-2)
; 37: From 5.
DB 17, M0 ; (-4/0/4, 0) Next: Vert2,Horz2,Vert1,Horz1 (-7:7,-3: 3)
DB 17, MHP4 ; (-4/0/4,-4) Next: Vert2,Horz2,Vert1,Horz1 (-7:7, 1: 7)
DB 17, MHN4, ?, ? ; (-4/0/4, 4) Next: Vert2,Horz2,Vert1,Horz1 (-7:7,-7:-1)
; 38: From 7A. Lower Left. Try 2 right and 2 up.
DB 42, M0 ; ( 0,0)
DB 43, MHP2 ; ( 0,2)
DB 26, MVN2, ?, ? ; (-2,0)
; 39: From 13B.
; From 14 center point would be (0,-4/0/4)
; From 16B center point would be (0,-4)
DB 40, M0 ; ( 0,4)
DB 18, MVN2 ; (-2,4) Next: Horz2,Vert1,Horz1. (-3:-1,1:7)
DB 13, MVN4, ?, ? ; (-4,4) Next: Horz2,Vert2,Horz1,Vert1. (-7:-1,1:7)
; 40: From 39A.
DB 41, M0 ; (0, 4)
DB 41, MHN2 ; (0, 2)
DB 41, MHP2, ?, ? ; (0, 6)
; 41: From 40.
DB 20, M0 ; ( 0,2/4/6) Next: Horz 1. ( 0,1:7)
DB 20, MVN1 ; (-1,2/4/6) Next: Horz 1. (-1,1:7)
DB 20, MVN2, ?, ? ; (-2,2/4/6) Next: Horz 1. (-2,1:7)
; 42: From 38A. Lower Left. Try 1 right and 1 up.
DB 0, M0 ; ( 0,0)
DB 0, MHP1 ; ( 0,1)
DB 0, MVN1, ?, ? ; (-1,0)
; 43: From 38B.
; From 44B center point would be (0,-2)
DB 20, M0 ; ( 0,2) Next: Horz 1 ( 0,1:3)
DB 20, MVN1 ; (-1,2) Next: Horz 1 (-1,1:3)
DB 15, MVN2, ?, ? ; (-2,2) Next: Horz 1, Vert 1 (-1:-3,1:3)
; 44: From 9A. Lower Right. Try 2 left and 2 up.
DB 45, M0 ; ( 0, 0)
DB 43, MHN2 ; ( 0,-2)
DB 34, MVN2, ?, ? ; (-2, 0)
; 45: From 44A. Lower Right. Try 1 left and 1 up.
DB 0, M0 ; ( 0, 0)
DB 0, MHN1 ; ( 0,-1)
DB 0, MVN1, ?, ? ; (-1, 0)
; 46: From 17A.
DB 0, M0 ; (0,0)
DB 0, MHP1 ; (0,1)
DB 0, MHP1, ?, ? ; (0,1)
; 47: From 10C.
; From 11 center point would be (0,4/0/-4)
; From 12C center point would be (0,-4)
DB 48, M0 ; (0,4)
DB 48, MHN2 ; (0,2)
DB 48, MHP2, ?, ? ; (0,6)
; 48 From 10B.
; From 47 center point would be (0,2/4/6)
; From 12B center point would be (0,-2)
DB 0, M0 ; (0,2)
DB 0, MHN1 ; (0,1)
DB 0, MHP1, ?, ? ; (0,3)
; 49 From 12A.
DB 0, M0 ; (0, 0)
DB 0, MHN1 ; (0,-1)
DB 0, MHN1, ?, ? ; (0,-1)
; 50: Interior Macroblock. Explore 7 up and 7 down.
DB 51, M0 ; ( 0,0)
DB 51, MVN7 ; (-7,0)
DB 51, MVP7, ?, ? ; ( 7,0)
; 51: Explore 7 left and 7 right.
DB 5, M0 ; (-7|0|7, 0)
DB 5, MHN7 ; (-7|0|7,-7)
DB 5, MHP7, ?, ? ; (-7|0|7, 7)
MulByNeg8 LABEL DWORD
CNT = 0
REPEAT 128
DD WeightedDiff+CNT
CNT = CNT - 8
ENDM
; The following treachery puts the numbers into byte 2 of each aligned DWORD.
DB 0, 0
DD 193 DUP (255)
DD 250,243,237,231,225,219,213,207,201,195,189,184,178,172,167,162,156
DD 151,146,141,135,130,126,121,116,111,107,102, 97, 93, 89, 84, 80, 76
DD 72, 68, 64, 61, 57, 53, 50, 46, 43, 40, 37, 34, 31, 28, 25, 22, 20
DD 18, 15, 13, 11, 9, 7, 6, 4, 3, 2, 1
DB 0, 0
WeightedDiff LABEL DWORD
DB 0, 0
DD 0, 0, 1, 2, 3, 4, 6, 7, 9, 11, 13, 15, 18
DD 20, 22, 25, 28, 31, 34, 37, 40, 43, 46, 50, 53, 57, 61, 64, 68, 72
DD 76, 80, 84, 89, 93, 97,102,107,111,116,121,126,130,135,141,146,151
DD 156,162,167,172,178,184,189,195,201,207,213,219,225,231,237,243,250
DD 191 DUP (255)
DB 255, 0
MotionOffsets DD 1*PITCH,0,?,?
RemnantOfCacheLine DB 8 DUP (?)
LocalStorage LABEL DWORD ; Local storage goes on the stack at addresses
; whose lower 12 bits match this address.
.CODE
ASSUME cs : FLAT
ASSUME ds : FLAT
ASSUME es : FLAT
ASSUME fs : FLAT
ASSUME gs : FLAT
ASSUME ss : FLAT
MOTIONESTIMATION proc C AMBAS: DWORD,
ATargFrmBase: DWORD,
APrevFrmBase: DWORD,
AFiltFrmBase: DWORD,
ADo15Search: DWORD,
ADoHalfPelEst: DWORD,
ADoBlkLvlVec: DWORD,
ADoSpatialFilt: DWORD,
AZeroVectorThresh: DWORD,
ANonZeroMVDiff: DWORD,
ABlockMVDiff: DWORD,
AEmptyThresh: DWORD,
AInterCodThresh: DWORD,
AIntraCodDiff: DWORD,
ASpatialFiltThresh: DWORD,
ASpatialFiltDiff: DWORD,
AIntraSWDTot: DWORD,
AIntraSWDBlks: DWORD,
AInterSWDTot: DWORD,
AInterSWDBlks: DWORD
LocalFrameSize = 128 + 168*4 + 32 ; 128 for locals; 168*4 for blocks; 32 for dummy block.
RegStoSize = 16
; Arguments:
MBlockActionStream_arg = RegStoSize + 4
TargetFrameBaseAddress_arg = RegStoSize + 8
PreviousFrameBaseAddress_arg = RegStoSize + 12
FilteredFrameBaseAddress_arg = RegStoSize + 16
DoRadius15Search_arg = RegStoSize + 20
DoHalfPelEstimation_arg = RegStoSize + 24
DoBlockLevelVectors_arg = RegStoSize + 28
DoSpatialFiltering_arg = RegStoSize + 32
ZeroVectorThreshold_arg = RegStoSize + 36
NonZeroMVDifferential_arg = RegStoSize + 40
BlockMVDifferential_arg = RegStoSize + 44
EmptyThreshold_arg = RegStoSize + 48
InterCodingThreshold_arg = RegStoSize + 52
IntraCodingDifferential_arg = RegStoSize + 56
SpatialFiltThreshold_arg = RegStoSize + 60
SpatialFiltDifferential_arg = RegStoSize + 64
IntraSWDTotal_arg = RegStoSize + 68
IntraSWDBlocks_arg = RegStoSize + 72
InterSWDTotal_arg = RegStoSize + 76
InterSWDBlocks_arg = RegStoSize + 80
EndOfArgList = RegStoSize + 84
; Locals (on local stack frame)
MBlockActionStream EQU [esp+ 0]
CurrSWDState EQU [esp+ 4]
MotionOffsetsCursor EQU CurrSWDState
HalfPelHorzSavings EQU CurrSWDState
VertFilterDoneAddr EQU CurrSWDState
IntraSWDTotal EQU [esp+ 8]
IntraSWDBlocks EQU [esp+ 12]
InterSWDTotal EQU [esp+ 16]
InterSWDBlocks EQU [esp+ 20]
MBCentralInterSWD EQU [esp+ 24]
MBRef1InterSWD EQU [esp+ 28]
MBRef2InterSWD EQU [esp+ 32]
MBCentralInterSWD_BLS EQU [esp+ 36]
MB0MVInterSWD EQU [esp+ 40]
MBAddrCentralPoint EQU [esp+ 44]
MBMotionVectors EQU [esp+ 48]
DoHalfPelEstimation EQU [esp+ 52]
DoBlockLevelVectors EQU [esp+ 56]
DoSpatialFiltering EQU [esp+ 60]
ZeroVectorThreshold EQU [esp+ 64]
NonZeroMVDifferential EQU [esp+ 68]
BlockMVDifferential EQU [esp+ 72]
EmptyThreshold EQU [esp+ 76]
InterCodingThreshold EQU [esp+ 80]
IntraCodingDifferential EQU [esp+ 84]
SpatialFiltThreshold EQU [esp+ 88]
SpatialFiltDifferential EQU [esp+ 92]
TargetMBAddr EQU [esp+ 96]
TargetFrameBaseAddress EQU [esp+ 100]
PreviousFrameBaseAddress EQU [esp+ 104]
TargToRef EQU [esp+ 108]
TargToSLF EQU [esp+ 112]
DoRadius15Search EQU [esp+ 116]
StashESP EQU [esp+ 120]
BlockLen EQU 168
Block1 EQU [esp+ 128+40] ; "128" is for locals. "40" is so offsets range from -40 to 124.
Block2 EQU Block1 + BlockLen
Block3 EQU Block2 + BlockLen
Block4 EQU Block3 + BlockLen
BlockN EQU Block4 + BlockLen
BlockNM1 EQU Block4
BlockNM2 EQU Block3
BlockNP1 EQU Block4 + BlockLen + BlockLen
DummyBlock EQU Block4 + BlockLen
Ref1Addr EQU -40
Ref2Addr EQU -36
AddrCentralPoint EQU -32
CentralInterSWD EQU -28
Ref1InterSWD EQU -24
Ref2InterSWD EQU -20
CentralInterSWD_BLS EQU -16 ; CentralInterSWD, when doing blk level search.
CentralInterSWD_SLF EQU -16 ; CentralInterSWD, when doing spatial filter.
HalfPelSavings EQU Ref2Addr
ZeroMVInterSWD EQU -12
BlkHMV EQU -8
BlkVMV EQU -7
BlkMVs EQU -8
AccumTargetPels EQU -4
; Offsets for Negated Quadrupled Target Pels:
N8T00 EQU 0
N8T04 EQU 4
N8T02 EQU 8
N8T06 EQU 12
N8T20 EQU 16
N8T24 EQU 20
N8T22 EQU 24
N8T26 EQU 28
N8T40 EQU 32
N8T44 EQU 36
N8T42 EQU 40
N8T46 EQU 44
N8T60 EQU 48
N8T64 EQU 52
N8T62 EQU 56
N8T66 EQU 60
N8T11 EQU 64
N8T15 EQU 68
N8T13 EQU 72
N8T17 EQU 76
N8T31 EQU 80
N8T35 EQU 84
N8T33 EQU 88
N8T37 EQU 92
N8T51 EQU 96
N8T55 EQU 100
N8T53 EQU 104
N8T57 EQU 108
N8T71 EQU 112
N8T75 EQU 116
N8T73 EQU 120
N8T77 EQU 124
push esi
push edi
push ebp
push ebx
; Adjust stack ptr so that local frame fits nicely in cache w.r.t. other data.
mov esi,esp
sub esp,000001000H
mov eax,[esp] ; Cause system to commit page.
sub esp,000001000H
and esp,0FFFFF000H
mov ebx,OFFSET LocalStorage+31
and ebx,000000FE0H
mov edx,PD [esi+MBlockActionStream_arg]
or esp,ebx
mov eax,PD [esi+TargetFrameBaseAddress_arg]
mov TargetFrameBaseAddress,eax
mov ebx,PD [esi+PreviousFrameBaseAddress_arg]
mov PreviousFrameBaseAddress,ebx
sub ebx,eax
mov ecx,PD [esi+FilteredFrameBaseAddress_arg]
sub ecx,eax
mov TargToRef,ebx
mov TargToSLF,ecx
mov eax,PD [esi+EmptyThreshold_arg]
mov EmptyThreshold,eax
mov eax,PD [esi+DoHalfPelEstimation_arg]
mov DoHalfPelEstimation,eax
mov eax,PD [esi+DoBlockLevelVectors_arg]
mov DoBlockLevelVectors,eax
mov eax,PD [esi+DoRadius15Search_arg]
mov DoRadius15Search,eax
mov eax,PD [esi+DoSpatialFiltering_arg]
mov DoSpatialFiltering,eax
mov eax,PD [esi+ZeroVectorThreshold_arg]
mov ZeroVectorThreshold,eax
mov eax,PD [esi+NonZeroMVDifferential_arg]
mov NonZeroMVDifferential,eax
mov eax,PD [esi+BlockMVDifferential_arg]
mov BlockMVDifferential,eax
mov eax,PD [esi+InterCodingThreshold_arg]
mov InterCodingThreshold,eax
mov eax,PD [esi+IntraCodingDifferential_arg]
mov IntraCodingDifferential,eax
mov eax,PD [esi+SpatialFiltThreshold_arg]
mov SpatialFiltThreshold,eax
mov eax,PD [esi+SpatialFiltDifferential_arg]
mov SpatialFiltDifferential,eax
xor ebx,ebx
mov IntraSWDBlocks,ebx
mov InterSWDBlocks,ebx
mov IntraSWDTotal,ebx
mov InterSWDTotal,ebx
mov Block1.BlkMVs,ebx
mov Block2.BlkMVs,ebx
mov Block3.BlkMVs,ebx
mov Block4.BlkMVs,ebx
mov DummyBlock.Ref1Addr,esp
mov DummyBlock.Ref2Addr,esp
mov StashESP,esi
jmp FirstMacroBlock
; Activity Details for this section of code (refer to flow diagram above):
;
; 1) To calculate an average value for the target match points of each
; block, we sum the 32 match points. The totals for each of the 4
; blocks is output seperately.
;
; 2) Define each prepared match point in the target macroblock as the
; real match point times negative 8, with the base address of the
; WeightedDiff lookup table added. I.e.
;
; for (i = 0; i < 16; i += 2)
; for (j = 0; j < 16; j += 2)
; N8T[i][j] = ( -8 * Target[i][j]) + ((U32) WeightedDiff);
;
; Both the multiply and the add of the WeightedDiff array base are
; effected by a table lookup into the array MulByNeg8.
;
; Then the SWD of a reference macroblock can be calculated as follows:
;
; SWD = 0;
; for each match point (i,j)
; SWD += *((U32 *) (N8T[i][j] + 8 * Ref[i][j]));
;
; In assembly, the fetch of WeightedDiff array element amounts to this:
;
; mov edi,DWORD PTR N8T[i][j] ; Fetch N8T[i][j]
; mov dl,BYTE PTR Ref[i][j] ; Fetch Ref[i][j]
; mov edi,DWORD PTR[edi+edx*8] ; Fetch WeithtedDiff of target & ref.
;
; 3) We calculate the 0-motion SWD, as described just above. We use 32
; match points per block, and write the result seperately for each
; block. The result is accumulated into the high half of ebp.
;
; 4) If the SWD for the 0-motion vector is below a threshold, we don't
; bother searching for other possibly better motion vectors. Presently,
; this threshold is set such that an average difference of less than
; three per match point causes the 0-motion vector to be accepted.
;
; Register usage for this section:
;
; Input of this section:
;
; edx -- MBlockActionStream
;
; Predominate usage for body of this section:
;
; esi -- Target block address.
; edi -- 0-motion reference block address.
; ebp[ 0:12] -- Accumulator for target pels.
; ebp[13:15] -- Loop control
; ebp[16:31] -- Accumulator for weighted diff between target and 0-MV ref.
; edx -- Address at which to store -8 times pels.
; ecx -- A reference pel.
; ebx -- A target pel.
; eax -- A target pel times -8; and a weighted difference.
;
; Expected Pentium (tm) microprocessor performance for section:
;
; Executed once per macroblock.
;
; 520 clocks for instruction execution
; 8 clocks for bank conflicts (64 dual mem ops with 1/8 chance of conflict)
; 80 clocks generously estimated for an average of 8 cache line fills for
; the target macroblock and 8 cache line fills for the reference area.
; ----
; 608 clocks total time for this section.
;
NextMacroBlock:
mov bl,[edx].CodedBlocks
add edx,SIZEOF T_MacroBlockActionDescr
and ebx,000000040H ; Check for end-of-stream
jne Done
FirstMacroBlock:
mov cl,[edx].CodedBlocks ; Init CBP for macroblock.
mov ebp,TargetFrameBaseAddress
mov bl,[edx].FirstMEState ; First State
mov eax,DoRadius15Search ; Searching 15 full pels out, or just 7?
neg al ; doing blk lvl => al=0, not => al=-1
or cl,03FH ; Indicate all 6 blocks are coded.
and al,bl
mov esi,[edx].BlkY1.BlkOffset ; Get address of next macroblock to do.
cmp al,5
jne @f
mov bl,50 ; Cause us to search +/- 15 if central
; ; block and willing to go that far.
@@:
mov edi,TargToRef
add esi,ebp
mov CurrSWDState,ebx ; Stash First State Number as current.
add edi,esi
xor ebp,ebp
mov TargetMBAddr,esi ; Stash address of target macroblock.
mov MBlockActionStream,edx ; Stash list ptr.
mov [edx].CodedBlocks,cl
mov ecx,INTER1MV ; Speculate INTER-coding, 1 motion vector.
mov [edx].BlockType,cl
lea edx,Block1
PrepMatchPointsNextBlock:
mov bl,PB [esi+6] ; 06A -- Target Pel 00.
add ebp,ebx ; 06B -- Accumulate target pels.
mov cl,PB [edi+6] ; 06C -- Reference Pel 00.
mov eax,MulByNeg8[ebx*4] ; 06D -- Target Pel 00 * -8.
mov bl,PB [esi+4] ; 04A
mov [edx].N8T06,eax ; 06E -- Store negated quadrupled Pel 00.
add ebp,ebx ; 04B
mov eax,PD [eax+ecx*8] ; 06F -- Weighted difference for Pel 00.
mov cl,PB [edi+4] ; 04C
add ebp,eax ; 06G -- Accumulate weighted difference.
mov eax,MulByNeg8[ebx*4] ; 04D
mov bl,PB [esi+2] ; 02A
mov [edx].N8T04,eax ; 04E
add ebp,ebx ; 02B
mov eax,PD [eax+ecx*8] ; 04F
mov cl,PB [edi+2] ; 02C
add ebp,eax ; 04G
mov eax,MulByNeg8[ebx*4] ; 02D
mov bl,PB [esi] ; 00A
mov [edx].N8T02,eax ; 02E
add ebp,ebx ; 00B
mov eax,PD [eax+ecx*8] ; 02F
add esi,PITCH+1
mov cl,PB [edi] ; 00C
add edi,PITCH+1
lea ebp,[ebp+eax+000004000H] ; 02G (plus loop control)
mov eax,MulByNeg8[ebx*4] ; 00D
mov bl,PB [esi+6] ; 17A
mov [edx].N8T00,eax ; 00E
add ebp,ebx ; 17B
mov eax,PD [eax+ecx*8] ; 00F
mov cl,PB [edi+6] ; 17C
add ebp,eax ; 00G
mov eax,MulByNeg8[ebx*4] ; 17D
mov bl,PB [esi+4] ; 15A
mov [edx].N8T17,eax ; 17E
add ebp,ebx ; 15B
mov eax,PD [eax+ecx*8] ; 17F
mov cl,PB [edi+4] ; 15C
add ebp,eax ; 17G
mov eax,MulByNeg8[ebx*4] ; 15D
mov bl,PB [esi+2] ; 13A
mov [edx].N8T15,eax ; 15E
add ebp,ebx ; 13B
mov eax,PD [eax+ecx*8] ; 15F
mov cl,PB [edi+2] ; 13C
add ebp,eax ; 15G
mov eax,MulByNeg8[ebx*4] ; 13D
mov bl,PB [esi] ; 11A
mov [edx].N8T13,eax ; 13E
add ebp,ebx ; 11B
mov eax,PD [eax+ecx*8] ; 13F
add esi,PITCH-1
mov cl,PB [edi] ; 11C
add edi,PITCH-1
add ebp,eax ; 13G
mov eax,MulByNeg8[ebx*4] ; 11D
mov bl,PB [esi+6] ; 26A
mov [edx].N8T11,eax ; 11E
add ebp,ebx ; 26B
mov eax,PD [eax+ecx*8] ; 11F
mov cl,PB [edi+6] ; 26C
add ebp,eax ; 11G
mov eax,MulByNeg8[ebx*4] ; 26D
mov bl,PB [esi+4] ; 24A
mov [edx].N8T26,eax ; 26E
add ebp,ebx ; 24B
mov eax,PD [eax+ecx*8] ; 26F
mov cl,PB [edi+4] ; 24C
add ebp,eax ; 26G
mov eax,MulByNeg8[ebx*4] ; 24D
mov bl,PB [esi+2] ; 22A
mov [edx].N8T24,eax ; 24E
add ebp,ebx ; 22B
mov eax,PD [eax+ecx*8] ; 24F
mov cl,PB [edi+2] ; 22C
add ebp,eax ; 24G
mov eax,MulByNeg8[ebx*4] ; 22D
mov bl,PB [esi] ; 20A
mov [edx].N8T22,eax ; 22E
add ebp,ebx ; 20B
mov eax,PD [eax+ecx*8] ; 22F
add esi,PITCH+1
mov cl,PB [edi] ; 20C
add edi,PITCH+1
add ebp,eax ; 22G
mov eax,MulByNeg8[ebx*4] ; 20D
mov bl,PB [esi+6] ; 37A
mov [edx].N8T20,eax ; 20E
add ebp,ebx ; 37B
mov eax,PD [eax+ecx*8] ; 20F
mov cl,PB [edi+6] ; 37C
add ebp,eax ; 20G
mov eax,MulByNeg8[ebx*4] ; 37D
mov bl,PB [esi+4] ; 35A
mov [edx].N8T37,eax ; 37E
add ebp,ebx ; 35B
mov eax,PD [eax+ecx*8] ; 37F
mov cl,PB [edi+4] ; 35C
add ebp,eax ; 37G
mov eax,MulByNeg8[ebx*4] ; 35D
mov bl,PB [esi+2] ; 33A
mov [edx].N8T35,eax ; 35E
add ebp,ebx ; 33B
mov eax,PD [eax+ecx*8] ; 35F
mov cl,PB [edi+2] ; 33C
add ebp,eax ; 35G
mov eax,MulByNeg8[ebx*4] ; 33D
mov bl,PB [esi] ; 31A
mov [edx].N8T33,eax ; 33E
add ebp,ebx ; 31B
mov eax,PD [eax+ecx*8] ; 33F
add esi,PITCH-1
mov cl,PB [edi] ; 31C
add edi,PITCH-1
add ebp,eax ; 33G
mov eax,MulByNeg8[ebx*4] ; 31D
mov bl,PB [esi+6] ; 46A
mov [edx].N8T31,eax ; 31E
add ebp,ebx ; 46B
mov eax,PD [eax+ecx*8] ; 31F
mov cl,PB [edi+6] ; 46C
add ebp,eax ; 31G
mov eax,MulByNeg8[ebx*4] ; 46D
mov bl,PB [esi+4] ; 44A
mov [edx].N8T46,eax ; 46E
add ebp,ebx ; 44B
mov eax,PD [eax+ecx*8] ; 46F
mov cl,PB [edi+4] ; 44C
add ebp,eax ; 46G
mov eax,MulByNeg8[ebx*4] ; 44D
mov bl,PB [esi+2] ; 42A
mov [edx].N8T44,eax ; 44E
add ebp,ebx ; 42B
mov eax,PD [eax+ecx*8] ; 44F
mov cl,PB [edi+2] ; 42C
add ebp,eax ; 44G
mov eax,MulByNeg8[ebx*4] ; 42D
mov bl,PB [esi] ; 40A
mov [edx].N8T42,eax ; 42E
add ebp,ebx ; 40B
mov eax,PD [eax+ecx*8] ; 42F
add esi,PITCH+1
mov cl,PB [edi] ; 40C
add edi,PITCH+1
add ebp,eax ; 42G
mov eax,MulByNeg8[ebx*4] ; 40D
mov bl,PB [esi+6] ; 57A
mov [edx].N8T40,eax ; 40E
add ebp,ebx ; 57B
mov eax,PD [eax+ecx*8] ; 40F
mov cl,PB [edi+6] ; 57C
add ebp,eax ; 40G
mov eax,MulByNeg8[ebx*4] ; 57D
mov bl,PB [esi+4] ; 55A
mov [edx].N8T57,eax ; 57E
add ebp,ebx ; 55B
mov eax,PD [eax+ecx*8] ; 57F
mov cl,PB [edi+4] ; 55C
add ebp,eax ; 57G
mov eax,MulByNeg8[ebx*4] ; 55D
mov bl,PB [esi+2] ; 53A
mov [edx].N8T55,eax ; 55E
add ebp,ebx ; 53B
mov eax,PD [eax+ecx*8] ; 55F
mov cl,PB [edi+2] ; 53C
add ebp,eax ; 55G
mov eax,MulByNeg8[ebx*4] ; 53D
mov bl,PB [esi] ; 51A
mov [edx].N8T53,eax ; 53E
add ebp,ebx ; 51B
mov eax,PD [eax+ecx*8] ; 53F
add esi,PITCH-1
mov cl,PB [edi] ; 51C
add edi,PITCH-1
add ebp,eax ; 53G
mov eax,MulByNeg8[ebx*4] ; 51D
mov bl,PB [esi+6] ; 66A
mov [edx].N8T51,eax ; 51E
add ebp,ebx ; 66B
mov eax,PD [eax+ecx*8] ; 51F
mov cl,PB [edi+6] ; 66C
add ebp,eax ; 51G
mov eax,MulByNeg8[ebx*4] ; 66D
mov bl,PB [esi+4] ; 64A
mov [edx].N8T66,eax ; 66E
add ebp,ebx ; 64B
mov eax,PD [eax+ecx*8] ; 66F
mov cl,PB [edi+4] ; 64C
add ebp,eax ; 66G
mov eax,MulByNeg8[ebx*4] ; 64D
mov bl,PB [esi+2] ; 62A
mov [edx].N8T64,eax ; 64E
add ebp,ebx ; 62B
mov eax,PD [eax+ecx*8] ; 64F
mov cl,PB [edi+2] ; 62C
add ebp,eax ; 64G
mov eax,MulByNeg8[ebx*4] ; 62D
mov bl,PB [esi] ; 60A
mov [edx].N8T62,eax ; 62E
add ebp,ebx ; 60B
mov eax,PD [eax+ecx*8] ; 62F
add esi,PITCH+1
mov cl,PB [edi] ; 60C
add edi,PITCH+1
add ebp,eax ; 62G
mov eax,MulByNeg8[ebx*4] ; 60D
mov bl,PB [esi+6] ; 77A
mov [edx].N8T60,eax ; 60E
add ebp,ebx ; 77B
mov eax,PD [eax+ecx*8] ; 60F
mov cl,PB [edi+6] ; 77C
add ebp,eax ; 60G
mov eax,MulByNeg8[ebx*4] ; 77D
mov bl,PB [esi+4] ; 75A
mov [edx].N8T77,eax ; 77E
add ebp,ebx ; 75B
mov eax,PD [eax+ecx*8] ; 77F
mov cl,PB [edi+4] ; 75C
add ebp,eax ; 77G
mov eax,MulByNeg8[ebx*4] ; 75D
mov bl,PB [esi+2] ; 73A
mov [edx].N8T75,eax ; 75E
add ebp,ebx ; 73B
mov eax,PD [eax+ecx*8] ; 75F
mov cl,PB [edi+2] ; 73C
add ebp,eax ; 75G
mov eax,MulByNeg8[ebx*4] ; 73D
mov bl,PB [esi] ; 71A
mov [edx].N8T73,eax ; 73E
add ebp,ebx ; 71B
mov eax,PD [eax+ecx*8] ; 73F
mov cl,PB [edi] ; 71C
add esi,PITCH-1-PITCH*8+8
add edi,PITCH-1-PITCH*8+8
add ebp,eax ; 73G
mov eax,MulByNeg8[ebx*4] ; 71D
mov ebx,ebp
mov [edx].N8T71,eax ; 71E
and ebx,000001FFFH ; Extract sum of target pels.
add edx,BlockLen ; Move to next output block
mov eax,PD [eax+ecx*8] ; 71F
mov [edx-BlockLen].AccumTargetPels,ebx ; Store acc of target pels for block.
add eax,ebp ; 71G
and ebp,000006000H ; Extract loop control
shr eax,16 ; Extract SWD; CF == 1 every second iter.
mov ebx,ecx
mov [edx-BlockLen].CentralInterSWD,eax ; Store SWD for 0-motion vector.
jnc PrepMatchPointsNextBlock
add esi,PITCH*8-16 ; Advance to block 3, or off end.
add edi,PITCH*8-16 ; Advance to block 3, or off end.
xor ebp,000002000H
jne PrepMatchPointsNextBlock ; Jump if advancing to block 3.
mov ebx,CurrSWDState ; Fetch First State Number for engine.
mov edi,Block1.CentralInterSWD
test bl,bl ; Test for INTRA-BY-DECREE.
je IntraByDecree
add eax,Block2.CentralInterSWD
add edi,Block3.CentralInterSWD
add eax,edi
mov edx,ZeroVectorThreshold
cmp eax,edx ; Compare 0-MV against ZeroVectorThresh
jle BelowZeroThresh ; Jump if 0-MV is good enough.
mov cl,PB SWDState[ebx*8+3] ; cl == Index of inc to apply to central
; ; point to get to ref1.
mov bl,PB SWDState[ebx*8+5] ; bl == Same as cl, but for ref2.
mov edx,TargToRef
mov MB0MVInterSWD,eax ; Stash SWD for zero motion vector.
mov edi,PD OffsetToRef[ebx] ; Get inc to apply to ctr to get to ref2.
mov ebp,PD OffsetToRef[ecx] ; Get inc to apply to ctr to get to ref1.
lea esi,[esi+edx-PITCH*16] ; Calculate address of 0-MV ref block.
;
mov MBAddrCentralPoint,esi ; Set central point to 0-MV.
mov MBCentralInterSWD,eax
mov eax,Block1.CentralInterSWD ; Stash Zero MV SWD, in case we decide
mov edx,Block2.CentralInterSWD ; the best non-zero MV isn't enough
mov Block1.ZeroMVInterSWD,eax ; better than the zero MV.
mov Block2.ZeroMVInterSWD,edx
mov eax,Block3.CentralInterSWD
mov edx,Block4.CentralInterSWD
mov Block3.ZeroMVInterSWD,eax
mov Block4.ZeroMVInterSWD,edx
; Activity Details for this section of code (refer to flow diagram above):
;
; 5) The SWD for two different reference macroblocks is calculated; ref1
; into the high order 16 bits of ebp, and ref2 into the low 16 bits.
; This is performed for each iteration of the state engine. A normal,
; internal macroblock will perform 6 iterations, searching +/- 4
; horizontally, then +/- 4 vertically, then +/- 2 horizontally, then
; +/- 2 vertically, then +/- 1 horizontally, then +/- 1 vertically.
;
; Register usage for this section:
;
; Input:
;
; esi -- Addr of 0-motion macroblock in ref frame.
; ebp -- Increment to apply to get to first ref1 macroblock.
; edi -- Increment to apply to get to first ref2 macroblock.
; ebx, ecx -- High order 24 bits are zero.
;
; Output:
;
; ebp -- SWD for the best-fit reference macroblock.
; ebx -- Index of increment to apply to get to best-fit reference MB.
; MBAddrCentralPoint -- the best-fit of the previous iteration; it is the
; value to which OffsetToRef[ebx] must be added.
;
;
; Expected performance for SWDLoop code:
;
; Execution frequency: Six times per block for which motion analysis is done
; beyond the 0-motion vector.
;
; Pentium (tm) microprocessor times per six iterations:
; 180 clocks for instruction execution setup to DoSWDLoop
; 2520 clocks for DoSWDLoop procedure, instruction execution.
; 192 clocks for bank conflicts in DoSWDLoop
; 30 clocks generously estimated for an average of 6 cache line fills for
; the reference area.
; ----
; 2922 clocks total time for this section.
MBFullPelMotionSearchLoop:
lea edi,[esi+edi+PITCH*8+8]
lea esi,[esi+ebp+PITCH*8+8]
mov Block4.Ref1Addr,esi
mov Block4.Ref2Addr,edi
sub esi,8
sub edi,8
mov Block3.Ref1Addr,esi
mov Block3.Ref2Addr,edi
sub esi,PITCH*8-8
sub edi,PITCH*8-8
mov Block2.Ref1Addr,esi
mov Block2.Ref2Addr,edi
sub esi,8
sub edi,8
mov Block1.Ref1Addr,esi
mov Block1.Ref2Addr,edi
; esi -- Points to ref1
; edi -- Points to ref2
; ecx -- Upper 24 bits zero
; ebx -- Upper 24 bits zero
call DoSWDLoop
; ebp -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
; ecx -- Upper 24 bits zero
; ebx -- Upper 24 bits zero
mov esi,MBCentralInterSWD ; Get SWD for central point of these 3 refs
xor eax,eax
add ebp,Block1.Ref1InterSWD
add edx,Block1.Ref2InterSWD
add ebp,Block2.Ref1InterSWD
add edx,Block2.Ref2InterSWD
add ebp,Block3.Ref1InterSWD
add edx,Block3.Ref2InterSWD
cmp ebp,edx ; Carry flag == 1 iff ref1 SWD < ref2 SWD.
mov edi,CurrSWDState ; Restore current state number.
adc eax,eax ; eax == 1 iff ref1 SWD < ref2 SWD.
cmp ebp,esi ; Carry flag == 1 iff ref1 SWD < central SWD.
adc eax,eax ;
cmp edx,esi ; Carry flag == 1 iff ref2 SWD < central SWD.
adc eax,eax ; 0 --> Pick central point.
; ; 1 --> Pick ref2.
; ; 2 --> Not possible.
; ; 3 --> Pick ref2.
; ; 4 --> Pick central point.
; ; 5 --> Not possible.
; ; 6 --> Pick ref1.
; ; 7 --> Pick ref1.
mov MBRef2InterSWD,edx
mov MBRef1InterSWD,ebp
xor edx,edx
mov dl,PB PickPoint[eax] ; dl == 0: central pt; 2: ref1; 4: ref2
mov esi,MBAddrCentralPoint ; Reload address of central ref block.
;
;
mov ebp,Block1.CentralInterSWD[edx*2] ; Get SWD for each block, picked pt.
mov al,PB SWDState[edx+edi*8+1] ; al == Index of inc to apply to old central
; ; point to get new central point.
mov Block1.CentralInterSWD,ebp ; Stash SWD for new central point.
mov ebp,Block2.CentralInterSWD[edx*2]
mov Block2.CentralInterSWD,ebp
mov ebp,Block3.CentralInterSWD[edx*2]
mov Block3.CentralInterSWD,ebp
mov ebp,Block4.CentralInterSWD[edx*2]
mov Block4.CentralInterSWD,ebp
mov ebp,MBCentralInterSWD[edx*2]; Get the SWD for the point we picked.
mov dl,PB SWDState[edx+edi*8] ; dl == New state number.
mov MBCentralInterSWD,ebp ; Stash SWD for new central point.
mov edi,PD OffsetToRef[eax] ; Get inc to apply to get to new central pt.
mov CurrSWDState,edx ; Stash current state number.
mov bl,PB SWDState[edx*8+3] ; bl == Index of inc to apply to central
; ; point to get to next ref1.
mov cl,PB SWDState[edx*8+5] ; cl == Same as bl, but for ref2.
add esi,edi ; Move to new central point.
test dl,dl
mov ebp,PD OffsetToRef[ebx] ; Get inc to apply to ctr to get to ref1.
mov edi,PD OffsetToRef[ecx] ; Get inc to apply to ctr to get to ref2.
mov MBAddrCentralPoint,esi ; Stash address of new central ref block.
jne MBFullPelMotionSearchLoop ; Jump if not done searching.
;Done searching for integer motion vector for full macroblock
IF PITCH-384
*** Error: The magic leaks out of the following code if PITCH isn't 384.
ENDIF
mov ecx,TargToRef ; To Linearize MV for winning ref blk.
mov eax,esi ; Copy of ref macroblock addr.
sub eax,ecx ; To Linearize MV for winning ref blk.
mov ecx,TargetMBAddr
sub eax,ecx
mov edx,MBlockActionStream ; Fetch list ptr.
mov ebx,eax
mov ebp,DoHalfPelEstimation ; Are we doing half pel motion estimation?
shl eax,25 ; Extract horz motion component.
mov [edx].BlkY1.PastRef,esi ; Save address of reference MB selected.
sar ebx,8 ; Hi 24 bits of linearized MV lookup vert MV.
mov ecx,MBCentralInterSWD
sar eax,24 ; Finish extract horz motion component.
test ebp,ebp
mov bl,PB UnlinearizedVertMV[ebx] ; Look up proper vert motion vector.
mov [edx].BlkY1.PHMV,al ; Save winning horz motion vector.
mov [edx].BlkY1.PVMV,bl ; Save winning vert motion vector.
IFDEF H261
ELSE
je SkipHalfPelSearch_1MV
;Search for half pel motion vector for full macroblock.
mov Block1.AddrCentralPoint,esi
lea ebp,[esi+8]
mov Block2.AddrCentralPoint,ebp
add ebp,PITCH*8-8
mov Block3.AddrCentralPoint,ebp
xor ecx,ecx
mov cl,[edx].FirstMEState
add ebp,8
mov edi,esi
mov Block4.AddrCentralPoint,ebp
mov ebp,InitHalfPelSearchHorz[ecx*4-4]
; ebp -- Initialized to 0, except when can't search off left or right edge.
; edi -- Ref addr for block 1. Ref1 is .5 pel to left. Ref2 is .5 to right.
call DoSWDHalfPelHorzLoop
; ebp, ebx -- Zero
; ecx -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
mov esi,MBlockActionStream
xor eax,eax ; Keep pairing happy
add ecx,Block1.Ref1InterSWD
add edx,Block1.Ref2InterSWD
add ecx,Block2.Ref1InterSWD
add edx,Block2.Ref2InterSWD
add ecx,Block3.Ref1InterSWD
add edx,Block3.Ref2InterSWD
mov bl,[esi].FirstMEState
mov edi,Block1.AddrCentralPoint
cmp ecx,edx
jl MBHorz_Ref1LTRef2
mov ebp,MBCentralInterSWD
mov esi,MBlockActionStream
sub ebp,edx
jle MBHorz_CenterBest
mov al,[esi].BlkY1.PHMV ; Half pel to the right is best.
mov ecx,Block1.Ref2InterSWD
mov Block1.CentralInterSWD_BLS,ecx
mov ecx,Block3.Ref2InterSWD
mov Block3.CentralInterSWD_BLS,ecx
mov ecx,Block2.Ref2InterSWD
mov Block2.CentralInterSWD_BLS,ecx
mov ecx,Block4.Ref2InterSWD
mov Block4.CentralInterSWD_BLS,ecx
inc al
mov [esi].BlkY1.PHMV,al
jmp MBHorz_Done
MBHorz_CenterBest:
mov ecx,Block1.CentralInterSWD
xor ebp,ebp
mov Block1.CentralInterSWD_BLS,ecx
mov ecx,Block2.CentralInterSWD
mov Block2.CentralInterSWD_BLS,ecx
mov ecx,Block3.CentralInterSWD
mov Block3.CentralInterSWD_BLS,ecx
mov ecx,Block4.CentralInterSWD
mov Block4.CentralInterSWD_BLS,ecx
jmp MBHorz_Done
MBHorz_Ref1LTRef2:
mov ebp,MBCentralInterSWD
mov esi,MBlockActionStream
sub ebp,ecx
jle MBHorz_CenterBest
mov al,[esi].BlkY1.PHMV ; Half pel to the left is best.
mov edx,[esi].BlkY1.PastRef
dec al
mov ecx,Block1.Ref1InterSWD
mov Block1.CentralInterSWD_BLS,ecx
mov ecx,Block3.Ref1InterSWD
mov Block3.CentralInterSWD_BLS,ecx
mov ecx,Block2.Ref1InterSWD
mov Block2.CentralInterSWD_BLS,ecx
mov ecx,Block4.Ref1InterSWD
mov Block4.CentralInterSWD_BLS,ecx
dec edx
mov [esi].BlkY1.PHMV,al
mov [esi].BlkY1.PastRef,edx
MBHorz_Done:
mov HalfPelHorzSavings,ebp
mov ebp,InitHalfPelSearchVert[ebx*4-4]
; ebp -- Initialized to 0, except when can't search off left or right edge.
; edi -- Ref addr for block 1. Ref1 is .5 pel above. Ref2 is .5 below.
call DoSWDHalfPelVertLoop
; ebp, ebx -- Zero
; ecx -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
add ecx,Block1.Ref1InterSWD
add edx,Block1.Ref2InterSWD
add ecx,Block2.Ref1InterSWD
add edx,Block2.Ref2InterSWD
add ecx,Block3.Ref1InterSWD
add edx,Block3.Ref2InterSWD
cmp ecx,edx
jl MBVert_Ref1LTRef2
mov ebp,MBCentralInterSWD
mov esi,MBlockActionStream
sub ebp,edx
jle MBVert_CenterBest
mov ecx,Block1.CentralInterSWD
mov edx,Block1.Ref2InterSWD
sub ecx,edx
mov edx,Block1.CentralInterSWD_BLS
sub edx,ecx
mov al,[esi].BlkY1.PVMV ; Half pel below is best.
mov Block1.CentralInterSWD,edx
inc al
mov ecx,Block3.CentralInterSWD
mov edx,Block3.Ref2InterSWD
sub ecx,edx
mov edx,Block3.CentralInterSWD_BLS
sub edx,ecx
mov ecx,Block2.CentralInterSWD
mov Block3.CentralInterSWD,edx
mov edx,Block2.Ref2InterSWD
sub ecx,edx
mov edx,Block2.CentralInterSWD_BLS
sub edx,ecx
mov ecx,Block4.CentralInterSWD
mov Block2.CentralInterSWD,edx
mov edx,Block4.Ref2InterSWD
sub ecx,edx
mov edx,Block4.CentralInterSWD_BLS
sub edx,ecx
mov [esi].BlkY1.PVMV,al
mov Block4.CentralInterSWD,edx
jmp MBVert_Done
MBVert_CenterBest:
mov ecx,Block1.CentralInterSWD_BLS
xor ebp,ebp
mov Block1.CentralInterSWD,ecx
mov ecx,Block2.CentralInterSWD_BLS
mov Block2.CentralInterSWD,ecx
mov ecx,Block3.CentralInterSWD_BLS
mov Block3.CentralInterSWD,ecx
mov ecx,Block4.CentralInterSWD_BLS
mov Block4.CentralInterSWD,ecx
jmp MBVert_Done
MBVert_Ref1LTRef2:
mov ebp,MBCentralInterSWD
mov esi,MBlockActionStream
sub ebp,ecx
jle MBVert_CenterBest
mov ecx,Block1.CentralInterSWD
mov edx,Block1.Ref1InterSWD
sub ecx,edx
mov edx,Block1.CentralInterSWD_BLS
sub edx,ecx
mov al,[esi].BlkY1.PVMV ; Half pel above is best.
mov Block1.CentralInterSWD,edx
dec al
mov ecx,Block3.CentralInterSWD
mov edx,Block3.Ref1InterSWD
sub ecx,edx
mov edx,Block3.CentralInterSWD_BLS
sub edx,ecx
mov ecx,Block2.CentralInterSWD
mov Block3.CentralInterSWD,edx
mov edx,Block2.Ref1InterSWD
sub ecx,edx
mov edx,Block2.CentralInterSWD_BLS
sub edx,ecx
mov ecx,Block4.CentralInterSWD
mov Block2.CentralInterSWD,edx
mov edx,Block4.Ref1InterSWD
sub ecx,edx
mov edx,Block4.CentralInterSWD_BLS
sub edx,ecx
mov ecx,[esi].BlkY1.PastRef
mov Block4.CentralInterSWD,edx
sub ecx,PITCH
mov [esi].BlkY1.PVMV,al
mov [esi].BlkY1.PastRef,ecx
MBVert_Done:
mov ecx,HalfPelHorzSavings
mov edx,esi
add ebp,ecx ; Savings for horz and vert half pel motion.
mov ecx,MBCentralInterSWD ; Reload SWD for new central point.
sub ecx,ebp ; Approx SWD for prescribed half pel motion.
mov esi,[edx].BlkY1.PastRef ; Reload address of reference MB selected.
mov MBCentralInterSWD,ecx
SkipHalfPelSearch_1MV:
ENDIF ; H263
mov ebp,[edx].BlkY1.MVs ; Load Motion Vectors
add esi,8
mov [edx].BlkY2.PastRef,esi
mov [edx].BlkY2.MVs,ebp
lea edi,[esi+PITCH*8]
add esi,PITCH*8-8
mov [edx].BlkY3.PastRef,esi
mov [edx].BlkY3.MVs,ebp
mov [edx].BlkY4.PastRef,edi
mov [edx].BlkY4.MVs,ebp
IFDEF H261
ELSE ; H263
mov MBMotionVectors,ebp ; Stash macroblock level motion vectors.
mov ebp,640 ; ??? BlockMVDifferential
cmp ecx,ebp
jl NoBlockMotionVectors
mov ecx,DoBlockLevelVectors
test ecx,ecx ; Are we doing block level motion vectors?
je NoBlockMotionVectors
; Activity Details for this section of code (refer to flow diagram above):
;
; The following search is done similarly to the searches done above, except
; these are block searches, instead of macroblock searches.
;
; Expected performance:
;
; Execution frequency: Six times per block for which motion analysis is done
; beyond the 0-motion vector.
;
; Pentium (tm) microprocessor times per six iterations:
; 180 clocks for instruction execution setup to DoSWDLoop
; 2520 clocks for DoSWDLoop procedure, instruction execution.
; 192 clocks for bank conflicts in DoSWDLoop
; 30 clocks generously estimated for an average of 6 cache line fills for
; the reference area.
; ----
; 2922 clocks total time for this section.
;
; Set up for the "BlkFullPelSWDLoop_4blks" loop to follow.
; - Store the SWD values for blocks 4, 3, 2, 1.
; - Compute and store the address of the central reference
; point for blocks 1, 2, 3, 4.
; - Compute and store the first address for ref 1 (minus 4
; pels horizontally) and ref 2 (plus 4 pels horizontally)
; for blocks 4, 3, 2, 1 (in that order).
; - Initialize MotionOffsetsCursor
; - On exit:
; esi = ref 1 address for block 1
; edi = ref 2 address for block 1
;
mov esi,Block4.CentralInterSWD
mov edi,Block3.CentralInterSWD
mov Block4.CentralInterSWD_BLS,esi
mov Block3.CentralInterSWD_BLS,edi
mov esi,Block2.CentralInterSWD
mov edi,Block1.CentralInterSWD
mov Block2.CentralInterSWD_BLS,esi
mov eax,MBAddrCentralPoint ; Reload addr of central, integer pel ref MB.
mov Block1.CentralInterSWD_BLS,edi
mov Block1.AddrCentralPoint,eax
lea edi,[eax+PITCH*8+8+1]
lea esi,[eax+PITCH*8+8-1]
mov Block4.Ref1Addr,esi
mov Block4.Ref2Addr,edi
sub esi,8
add eax,8
mov Block2.AddrCentralPoint,eax
add eax,PITCH*8-8
mov Block3.AddrCentralPoint,eax
add eax,8
mov Block4.AddrCentralPoint,eax
sub edi,8
mov Block3.Ref1Addr,esi
mov Block3.Ref2Addr,edi
sub esi,PITCH*8-8
sub edi,PITCH*8-8
mov Block2.Ref1Addr,esi
mov Block2.Ref2Addr,edi
sub esi,8
mov eax,OFFSET MotionOffsets
mov MotionOffsetsCursor,eax
sub edi,8
mov Block1.Ref1Addr,esi
mov Block1.Ref2Addr,edi
;
; This loop will execute 6 times:
; +- 4 pels horizontally
; +- 4 pels vertically
; +- 2 pels horizontally
; +- 2 pels vertically
; +- 1 pel horizontally
; +- 1 pel vertically
; It terminates when ref1 = ref2. This simple termination
; condition is what forces unrestricted motion vectors (UMV)
; to be ON when advanced prediction (4MV) is ON. Otherwise
; we would need a state engine as above to distinguish edge
; pels.
;
BlkFullPelSWDLoop_4blks:
; esi -- Points to ref1
; edi -- Points to ref2
; ecx -- Upper 24 bits zero
; ebx -- Upper 24 bits zero
call DoSWDLoop
; ebp -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
; ecx -- Upper 24 bits zero
; ebx -- Upper 24 bits zero
mov eax,MotionOffsetsCursor
BlkFullPelSWDLoop_1blk:
xor esi,esi
cmp ebp,edx ; CF == 1 iff ref1 SWD < ref2 SWD.
mov edi,BlockNM1.CentralInterSWD_BLS; Get SWD for central pt of these 3 refs
adc esi,esi ; esi == 1 iff ref1 SWD < ref2 SWD.
cmp ebp,edi ; CF == 1 iff ref1 SWD < central SWD.
mov ebp,BlockNM2.Ref1InterSWD ; Fetch next block's Ref1 SWD.
adc esi,esi
cmp edx,edi ; CF == 1 iff ref2 SWD < central SWD.
adc esi,esi ; 0 --> Pick central point.
; ; 1 --> Pick ref2.
; ; 2 --> Not possible.
; ; 3 --> Pick ref2.
; ; 4 --> Pick central point.
; ; 5 --> Not possible.
; ; 6 --> Pick ref1.
; ; 7 --> Pick ref1.
mov edx,BlockNM2.Ref2InterSWD ; Fetch next block's Ref2 SWD.
sub esp,BlockLen ; Move ahead to next block.
mov edi,[eax] ; Next ref2 motion vector offset.
mov cl,PickPoint_BLS[esi] ; cl == 6: central pt; 2: ref1; 4: ref2
mov ebx,esp ; For testing completion.
;
;
mov esi,BlockN.AddrCentralPoint[ecx*2-12] ; Get the addr for pt we picked.
mov ecx,BlockN.CentralInterSWD[ecx*2] ; Get the SWD for point we picked.
mov BlockN.AddrCentralPoint,esi ; Stash addr for new central point.
sub esi,edi ; Compute next ref1 addr.
mov BlockN.Ref1Addr,esi ; Stash next ref1 addr.
mov BlockN.CentralInterSWD_BLS,ecx ; Stash the SWD for central point.
lea edi,[esi+edi*2] ; Compute next ref2 addr.
xor ecx,ecx
mov BlockN.Ref2Addr,edi ; Stash next ref2 addr.
and ebx,00000001FH ; Done when esp at 32-byte bound.
jne BlkFullPelSWDLoop_1blk
add esp,BlockLen*4
add eax,4 ; Advance MotionOffsets pointer.
mov MotionOffsetsCursor,eax
cmp esi,edi
jne BlkFullPelSWDLoop_4blks
IF PITCH-384
*** Error: The magic leaks out of the following code if PITCH isn't 384.
ENDIF
;
; The following code has been modified to correctly decode the motion vectors
; The previous code was simply subtracting the target frame base address
; from the chosen (central) reference block address.
; What is now done is the begining reference macroblock address computed
; in ebp, then subtracted from the chosen (central) reference block address.
; Then, for blocks 2, 3, and 4, the distance from block 1 to that block
; is subtracted. Care was taken to preserve the original pairing.
;
mov esi,Block1.AddrCentralPoint ; B1a Reload address of central ref block.
mov ebp,TargetMBAddr ; **** CHANGE **** addr. of target MB
mov edi,Block2.AddrCentralPoint ; B2a
add ebp,TargToRef ; **** CHANGE **** add Reference - Target
; mov ebp,PreviousFrameBaseAddress **** CHANGE **** DELETED
mov Block1.Ref1Addr,esi ; B1b Stash addr central ref block.
sub esi,ebp ; B1c Addr of ref blk, but in target frame.
mov Block2.Ref1Addr,edi ; B2b
sub edi,ebp ; B2c
sub edi,8 ; **** CHANGE **** Correct for block 2
mov eax,esi ; B1e Copy linearized MV.
sar esi,8 ; B1f High 24 bits of lin MV lookup vert MV.
mov ebx,edi ; B2e
sar edi,8 ; B2f
add eax,eax ; B1g Sign extend HMV; *2 (# of half pels).
mov Block1.BlkHMV,al ; B1h Save winning horz motion vector.
add ebx,ebx ; B2g
mov Block2.BlkHMV,bl ; B2h
mov al,UnlinearizedVertMV[esi] ; B1i Look up proper vert motion vector.
mov Block1.BlkVMV,al ; B1j Save winning vert motion vector.
mov al,UnlinearizedVertMV[edi] ; B2i
mov esi,Block3.AddrCentralPoint ; B3a
mov edi,Block4.AddrCentralPoint ; B4a
mov Block3.Ref1Addr,esi ; B3b
mov Block4.Ref1Addr,edi ; B4b
mov Block2.BlkVMV,al ; B2j
sub esi,ebp ; B3c
sub esi,8*PITCH ; **** CHANGE **** Correct for block 3
sub edi,ebp ; B4c
sub edi,8*PITCH+8 ; **** CHANGE **** Correct for block 4
mov eax,esi ; B3e
sar esi,8 ; B3f
mov ebx,edi ; B4e
sar edi,8 ; B4f
add eax,eax ; B3g
mov Block3.BlkHMV,al ; B3h
add ebx,ebx ; B4g
mov Block4.BlkHMV,bl ; B4h
mov al,UnlinearizedVertMV[esi] ; B3i
mov Block3.BlkVMV,al ; B3j
mov al,UnlinearizedVertMV[edi] ; B4i
mov ebp,Block1.CentralInterSWD_BLS
mov ebx,Block2.CentralInterSWD_BLS
add ebp,Block3.CentralInterSWD_BLS
add ebx,Block4.CentralInterSWD_BLS
add ebx,ebp
mov Block4.BlkVMV,al ; B4j
mov ecx,DoHalfPelEstimation
mov MBCentralInterSWD_BLS,ebx
test ecx,ecx
je NoHalfPelBlockLevelMVs
HalfPelBlockLevelMotionSearch:
mov edi,Block1.AddrCentralPoint
xor ebp,ebp
; ebp -- Initialized to 0, implying can search both left and right.
; edi -- Ref addr for block 1. Ref1 is .5 pel to left. Ref2 is .5 to right.
call DoSWDHalfPelHorzLoop
; ebp, ebx -- Zero
; ecx -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
NextBlkHorz:
mov ebx,BlockNM1.CentralInterSWD_BLS
cmp ecx,edx
mov BlockNM1.HalfPelSavings,ebp
jl BlkHorz_Ref1LTRef2
mov al,BlockNM1.BlkHMV
sub esp,BlockLen
sub ebx,edx
jle BlkHorz_CenterBest
inc al
mov BlockN.HalfPelSavings,ebx
mov BlockN.BlkHMV,al
jmp BlkHorz_Done
BlkHorz_Ref1LTRef2:
mov al,BlockNM1.BlkHMV
sub esp,BlockLen
sub ebx,ecx
jle BlkHorz_CenterBest
mov ecx,BlockN.Ref1Addr
dec al
mov BlockN.HalfPelSavings,ebx
dec ecx
mov BlockN.BlkHMV,al
mov BlockN.Ref1Addr,ecx
BlkHorz_CenterBest:
BlkHorz_Done:
mov ecx,BlockNM1.Ref1InterSWD
mov edx,BlockNM1.Ref2InterSWD
test esp,000000018H
jne NextBlkHorz
mov edi,BlockN.AddrCentralPoint
add esp,BlockLen*4
; ebp -- Initialized to 0, implying search both up and down is okay.
; edi -- Ref addr for block 1. Ref1 is .5 pel above. Ref2 is .5 below.
call DoSWDHalfPelVertLoop
; ebp, ebx -- Zero
; ecx -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
NextBlkVert:
mov ebx,BlockNM1.CentralInterSWD_BLS
cmp ecx,edx
mov edi,BlockNM1.HalfPelSavings
jl BlkVert_Ref1LTRef2
mov al,BlockNM1.BlkVMV
sub esp,BlockLen
sub edx,ebx
jge BlkVert_CenterBest
inc al
sub edi,edx
mov BlockN.BlkVMV,al
jmp BlkVert_Done
BlkVert_Ref1LTRef2:
mov al,BlockNM1.BlkVMV
sub esp,BlockLen
sub ecx,ebx
jge BlkVert_CenterBest
sub edi,ecx
mov ecx,BlockN.Ref1Addr
dec al
sub ecx,PITCH
mov BlockN.BlkVMV,al
mov BlockN.Ref1Addr,ecx
BlkVert_CenterBest:
BlkVert_Done:
mov ecx,BlockNM1.Ref1InterSWD
sub ebx,edi
mov BlockN.CentralInterSWD_BLS,ebx
mov edx,BlockNM1.Ref2InterSWD
test esp,000000018H
lea ebp,[ebp+edi]
jne NextBlkVert
mov ebx,MBCentralInterSWD_BLS+BlockLen*4
add esp,BlockLen*4
sub ebx,ebp
xor eax,eax ; ??? Keep pairing happy
NoHalfPelBlockLevelMVs:
mov eax,MBCentralInterSWD
mov ecx,BlockMVDifferential
sub eax,ebx
mov edi,MB0MVInterSWD
cmp eax,ecx
jle BlockMVNotBigEnoughGain
sub edi,ebx
mov ecx,NonZeroMVDifferential
cmp edi,ecx
jle NonZeroMVNotBigEnoughGain
; Block motion vectors are best.
mov MBCentralInterSWD,ebx ; Set MBlock's SWD to sum of 4 blocks.
mov edx,MBlockActionStream
mov eax,Block1.CentralInterSWD_BLS ; Set each block's SWD.
mov ebx,Block2.CentralInterSWD_BLS
mov Block1.CentralInterSWD,eax
mov Block2.CentralInterSWD,ebx
mov eax,Block3.CentralInterSWD_BLS
mov ebx,Block4.CentralInterSWD_BLS
mov Block3.CentralInterSWD,eax
mov Block4.CentralInterSWD,ebx
mov eax,Block1.BlkMVs ; Set each block's motion vector.
mov ebx,Block2.BlkMVs
mov [edx].BlkY1.MVs,eax
mov [edx].BlkY2.MVs,ebx
mov eax,Block3.BlkMVs
mov ebx,Block4.BlkMVs
mov [edx].BlkY3.MVs,eax
mov [edx].BlkY4.MVs,ebx
mov eax,Block1.Ref1Addr ; Set each block's reference blk addr.
mov ebx,Block2.Ref1Addr
mov [edx].BlkY1.PastRef,eax
mov [edx].BlkY2.PastRef,ebx
mov eax,Block3.Ref1Addr
mov ebx,Block4.Ref1Addr
mov [edx].BlkY3.PastRef,eax
mov eax,INTER4MV ; Set type for MB to INTER-coded, 4 MVs.
mov [edx].BlkY4.PastRef,ebx
mov [edx].BlockType,al
jmp MotionVectorSettled
NoBlockMotionVectors:
ENDIF ; H263
mov edi,MB0MVInterSWD
BlockMVNotBigEnoughGain: ; Try MB-level motion vector.
mov eax,MBCentralInterSWD
mov ecx,NonZeroMVDifferential
sub edi,eax
mov edx,MBlockActionStream
cmp edi,ecx
jg MotionVectorSettled
NonZeroMVNotBigEnoughGain: ; Settle on zero MV.
mov eax,Block1.ZeroMVInterSWD ; Restore Zero MV SWD.
mov edx,Block2.ZeroMVInterSWD
mov Block1.CentralInterSWD,eax
mov Block2.CentralInterSWD,edx
mov eax,Block3.ZeroMVInterSWD
mov edx,Block4.ZeroMVInterSWD
mov Block3.CentralInterSWD,eax
mov Block4.CentralInterSWD,edx
mov eax,MB0MVInterSWD ; Restore SWD for zero motion vector.
BelowZeroThresh:
mov edx,MBlockActionStream
mov ebx,TargetMBAddr ; Get address of this target macroblock.
mov MBCentralInterSWD,eax ; Save SWD.
xor ebp,ebp
add ebx,TargToRef
mov [edx].BlkY1.MVs,ebp ; Set horz and vert MVs to 0 in all blks.
mov [edx].BlkY1.PastRef,ebx ; Save address of ref block, all blks.
add ebx,8
mov [edx].BlkY2.PastRef,ebx
mov [edx].BlkY2.MVs,ebp
lea ecx,[ebx+PITCH*8]
add ebx,PITCH*8-8
mov [edx].BlkY3.PastRef,ebx
mov [edx].BlkY3.MVs,ebp
mov [edx].BlkY4.PastRef,ecx
mov [edx].BlkY4.MVs,ebp
; Activity Details for this section of code (refer to flow diagram above):
;
; 6) We've settled on the motion vector that will be used if we do indeed
; code the macroblock with inter-coding. We need to determine if some
; or all of the blocks can be forced as empty (copy).
; blocks. If all the blocks can be forced empty, we force the whole
; macroblock to be empty.
;
; Expected Pentium (tm) microprocessor performance for this section:
;
; Execution frequency: Once per macroblock.
;
; 23 clocks.
;
MotionVectorSettled:
IFDEF H261
mov edi,MBCentralInterSWD
mov eax,DoSpatialFiltering ; Are we doing spatial filtering?
mov edi,TargetMBAddr
test eax,eax
je SkipSpatialFiltering
mov ebx,MBCentralInterSWD
mov esi,SpatialFiltThreshold
cmp ebx,esi
jle SkipSpatialFiltering
add edi,TargToSLF ; Compute addr at which to put SLF prediction.
xor ebx,ebx
mov esi,[edx].BlkY1.PastRef
xor edx,edx
mov ebp,16
xor ecx,ecx
SpatialFilterHorzLoop:
mov dl,[edi] ; Pre-load cache line for output.
mov bl,[esi+6] ; p6
mov al,[esi+7] ; p7
inc bl ; p6+1
mov cl,[esi+5] ; p5
mov [edi+7],al ; p7' = p7
add al,bl ; p7 + p6 + 1
add bl,cl ; p6 + p5 + 1
mov dl,[esi+4] ; p4
add eax,ebx ; p7 + 2p6 + p5 + 2
shr eax,2 ; p6' = (p7 + 2p6 + p5 + 2) / 4
inc dl ; p4 + 1
add cl,dl ; p5 + p4 + 1
mov [edi+6],al ; p6'
mov al,[esi+3] ; p3
add ebx,ecx ; p6 + 2p5 + p4 + 2
shr ebx,2 ; p5' = (p6 + 2p5 + p4 + 2) / 4
add dl,al ; p4 + p3 + 1
mov [edi+5],bl ; p5'
mov bl,[esi+2] ; p2
add ecx,edx ; p5 + 2p4 + p3 + 2
inc bl ; p2 + 1
shr ecx,2 ; p4' = (p5 + 2p4 + p3 + 2) / 4
add al,bl ; p3 + p2 + 1
mov [edi+4],cl ; p4'
add edx,eax ; p4 + 2p3 + p2 + 2
shr edx,2 ; p3' = (p4 + 2p3 + p2 + 2) / 4
mov cl,[esi+1] ; p1
add bl,cl ; p2 + p1 + 1
mov [edi+3],dl ; p3'
add eax,ebx ; p3 + 2p2 + p1 + 2
mov dl,[esi] ; p0
shr eax,2 ; p2' = (p3 + 2p2 + p1 + 2) / 4
inc ebx ; p2 + p1 + 2
mov [edi+2],al ; p2'
add ebx,ecx ; p2 + 2p1 + 2
mov [edi],dl ; p0' = p0
add ebx,edx ; p2 + 2p1 + p0 + 2
shr ebx,2 ; p1' = (p2 + 2p1 + p0 + 2) / 4
mov al,[esi+7+8]
mov [edi+1],bl ; p1'
mov bl,[esi+6+8]
inc bl
mov cl,[esi+5+8]
mov [edi+7+8],al
add al,bl
add bl,cl
mov dl,[esi+4+8]
add eax,ebx
;
shr eax,2
inc dl
add cl,dl
mov [edi+6+8],al
mov al,[esi+3+8]
add ebx,ecx
shr ebx,2
add dl,al
mov [edi+5+8],bl
mov bl,[esi+2+8]
add ecx,edx
inc bl
shr ecx,2
add al,bl
mov [edi+4+8],cl
add edx,eax
shr edx,2
mov cl,[esi+1+8]
add bl,cl
mov [edi+3+8],dl
add eax,ebx
mov dl,[esi+8]
shr eax,2
inc ebx
mov [edi+2+8],al
add ebx,ecx
mov [edi+8],dl
add ebx,edx
shr ebx,2
add esi,PITCH
mov [edi+1+8],bl
add edi,PITCH
dec ebp ; Done?
jne SpatialFilterHorzLoop
mov VertFilterDoneAddr,edi
sub edi,PITCH*16
SpatialFilterVertLoop:
mov eax,[edi] ; p0
; ; Bank conflict for sure.
;
mov ebx,[edi+PITCH] ; p1
add eax,ebx ; p0+p1
mov ecx,[edi+PITCH*2] ; p2
add ebx,ecx ; p1+p2
mov edx,[edi+PITCH*3] ; p3
shr eax,1 ; (p0+p1)/2 dirty
mov esi,[edi+PITCH*4] ; p4
add ecx,edx ; p2+p3
mov ebp,[edi+PITCH*5] ; p5
shr ebx,1 ; (p1+p2)/2 dirty
add edx,esi ; p3+p4
and eax,07F7F7F7FH ; (p0+p1)/2 clean
and ebx,07F7F7F7FH ; (p1+p2)/2 clean
and ecx,0FEFEFEFEH ; p2+p3 pre-cleaned
and edx,0FEFEFEFEH ; p3+p4 pre-cleaned
shr ecx,1 ; (p2+p3)/2 clean
add esi,ebp ; p4+p5
shr edx,1 ; (p3+p4)/2 clean
lea eax,[eax+ebx+001010101H] ; (p0+p1)/2+(p1+p2)/2+1
shr esi,1 ; (p4+p5)/2 dirty
;
and esi,07F7F7F7FH ; (p4+p5)/2 clean
lea ebx,[ebx+ecx+001010101H] ; (p1+p2)/2+(p2+p3)/2+1
shr eax,1 ; p1' = ((p0+p1)/2+(p1+p2)/2+1)/2 dirty
lea ecx,[ecx+edx+001010101H] ; (p2+p3)/2+(p3+p4)/2+1
shr ebx,1 ; p2' = ((p1+p2)/2+(p2+p3)/2+1)/2 dirty
lea edx,[edx+esi+001010101H] ; (p3+p4)/2+(p4+p5)/2+1
and eax,07F7F7F7FH ; p1' clean
and ebx,07F7F7F7FH ; p2' clean
shr ecx,1 ; p3' = ((p2+p3)/2+(p3+p4)/2+1)/2 dirty
mov [edi+PITCH],eax ; p1'
shr edx,1 ; p4' = ((p3+p4)/2+(p4+p5)/2+1)/2 dirty
mov eax,[edi+PITCH*6] ; p6
and ecx,07F7F7F7FH ; p3' clean
and edx,07F7F7F7FH ; p4' clean
mov [edi+PITCH*2],ebx ; p2'
add ebp,eax ; p5+p6
shr ebp,1 ; (p5+p6)/2 dirty
mov ebx,[edi+PITCH*7] ; p7
add eax,ebx ; p6+p7
and ebp,07F7F7F7FH ; (p5+p6)/2 clean
mov [edi+PITCH*3],ecx ; p3'
and eax,0FEFEFEFEH ; (p6+p7)/2 pre-cleaned
shr eax,1 ; (p6+p7)/2 clean
lea esi,[esi+ebp+001010101H] ; (p4+p5)/2+(p5+p6)/2+1
shr esi,1 ; p5' = ((p4+p5)/2+(p5+p6)/2+1)/2 dirty
mov [edi+PITCH*4],edx ; p4'
lea ebp,[ebp+eax+001010101H] ; (p5+p6)/2+(p6+p7)/2+1
and esi,07F7F7F7FH ; p5' clean
shr ebp,1 ; p6' = ((p5+p6)/2+(p6+p7)/2+1)/2 dirty
mov [edi+PITCH*5],esi ; p5'
and ebp,07F7F7F7FH ; p6' clean
add edi,4
test edi,00000000FH
mov [edi+PITCH*6-4],ebp ; p6'
jne SpatialFilterVertLoop
add edi,PITCH*8-16
mov eax,VertFilterDoneAddr
cmp eax,edi
jne SpatialFilterVertLoop
; Activity Details for this section of code (refer to flow diagram above):
;
; 9) The SAD for the spatially filtered reference macroblock is calculated
; with half the pel differences accumulating into the low order half
; of ebp, and the other half into the high order half.
;
; Register usage for this section:
;
; Input of this section:
;
; edi -- Address of pel 0,0 of spatially filtered reference macroblock.
;
; Predominate usage for body of this section:
;
; edi -- Address of pel 0,0 of spatially filtered reference macroblock.
; esi, eax -- -8 times pel values from target macroblock.
; ebp[ 0:15] -- SAD Accumulator for half of the match points.
; ebp[16:31] -- SAD Accumulator for other half of the match points.
; edx[ 0: 7] -- Weighted difference for one pel.
; edx[ 8:15] -- Zero.
; edx[16:23] -- Weighted difference for another pel.
; edx[24:31] -- Zero.
; bl, cl -- Pel values from the spatially filtered reference macroblock.
;
; Expected Pentium (tm) microprocessor performance for this section:
;
; Execution frequency: Once per block for which motion analysis is done
; beyond the 0-motion vector.
;
; 146 clocks instruction execution (typically).
; 6 clocks for bank conflicts (1/8 chance with 48 dual mem ops).
; 0 clocks for new cache line fills.
; ----
; 152 clocks total time for this section.
;
SpatialFilterDone:
sub edi,PITCH*8-8 ; Get to block 4.
xor ebp,ebp
xor ebx,ebx
xor ecx,ecx
SLFSWDLoop:
mov eax,BlockNM1.N8T00 ; Get -8 times target Pel00.
mov bl,[edi] ; Get Pel00 in spatially filtered reference.
mov esi,BlockNM1.N8T04
mov cl,[edi+4]
mov edx,[eax+ebx*8] ; Get abs diff for spatial filtered ref pel00.
mov eax,BlockNM1.N8T02
mov dl,[esi+ecx*8+2] ; Get abs diff for spatial filtered ref pel04.
mov bl,[edi+2]
mov esi,BlockNM1.N8T06
mov cl,[edi+6]
mov ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T11
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*1+1]
mov cl,[edi+PITCH*1+5]
mov esi,BlockNM1.N8T15
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T13
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*1+3]
mov cl,[edi+PITCH*1+7]
mov esi,BlockNM1.N8T17
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T20
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*2+0]
mov cl,[edi+PITCH*2+4]
mov esi,BlockNM1.N8T24
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T22
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*2+2]
mov cl,[edi+PITCH*2+6]
mov esi,BlockNM1.N8T26
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T31
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*3+1]
mov cl,[edi+PITCH*3+5]
mov esi,BlockNM1.N8T35
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T33
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*3+3]
mov cl,[edi+PITCH*3+7]
mov esi,BlockNM1.N8T37
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T40
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*4+0]
mov cl,[edi+PITCH*4+4]
mov esi,BlockNM1.N8T44
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T42
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*4+2]
mov cl,[edi+PITCH*4+6]
mov esi,BlockNM1.N8T46
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T51
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*5+1]
mov cl,[edi+PITCH*5+5]
mov esi,BlockNM1.N8T55
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T53
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*5+3]
mov cl,[edi+PITCH*5+7]
mov esi,BlockNM1.N8T57
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T60
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*6+0]
mov cl,[edi+PITCH*6+4]
mov esi,BlockNM1.N8T64
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T62
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*6+2]
mov cl,[edi+PITCH*6+6]
mov esi,BlockNM1.N8T66
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T71
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*7+1]
mov cl,[edi+PITCH*7+5]
mov esi,BlockNM1.N8T75
add ebp,edx
mov edx,[eax+ebx*8]
mov eax,BlockNM1.N8T73
mov dl,[esi+ecx*8+2]
mov bl,[edi+PITCH*7+3]
mov cl,[edi+PITCH*7+7]
mov esi,BlockNM1.N8T77
add ebp,edx
mov edx,[eax+ebx*8]
add edx,ebp
mov cl,[esi+ecx*8+2]
shr edx,16
add ebp,ecx
and ebp,0FFFFH
sub esp,BlockLen
add ebp,edx
sub edi,8
test esp,000000008H
mov BlockN.CentralInterSWD_SLF,ebp
jne SLFSWDLoop
test esp,000000010H
lea edi,[edi-PITCH*8+16]
jne SLFSWDLoop
mov eax,Block2.CentralInterSWD_SLF+BlockLen*4
mov ebx,Block3.CentralInterSWD_SLF+BlockLen*4
mov ecx,Block4.CentralInterSWD_SLF+BlockLen*4
add esp,BlockLen*4
add ebp,ecx
lea edx,[eax+ebx]
add ebp,edx
mov edx,SpatialFiltDifferential
lea esi,[edi+PITCH*8-8]
mov edi,MBCentralInterSWD
sub edi,edx
mov edx,MBlockActionStream
cmp ebp,edi
jge SpatialFilterNotAsGood
mov MBCentralInterSWD,ebp ; Spatial filter was better. Stash
mov ebp,Block1.CentralInterSWD_SLF ; pertinent calculations.
mov Block2.CentralInterSWD,eax
mov Block3.CentralInterSWD,ebx
mov Block4.CentralInterSWD,ecx
mov Block1.CentralInterSWD,ebp
mov [edx].BlkY1.PastRef,esi
mov al,INTERSLF
mov [edx].BlockType,al
SkipSpatialFiltering:
SpatialFilterNotAsGood:
ENDIF ; H261
mov al,[edx].CodedBlocks ; Fetch coded block pattern.
mov edi,EmptyThreshold ; Get threshold for forcing block empty?
mov ebp,MBCentralInterSWD
mov esi,InterSWDBlocks
mov ebx,Block4.CentralInterSWD ; Is SWD > threshold?
cmp ebx,edi
jg @f
and al,0F7H ; If not, indicate block 4 is NOT coded.
dec esi
sub ebp,ebx
@@:
mov ebx,Block3.CentralInterSWD
cmp ebx,edi
jg @f
and al,0FBH
dec esi
sub ebp,ebx
@@:
mov ebx,Block2.CentralInterSWD
cmp ebx,edi
jg @f
and al,0FDH
dec esi
sub ebp,ebx
@@:
mov ebx,Block1.CentralInterSWD
cmp ebx,edi
jg @f
and al,0FEH
dec esi
sub ebp,ebx
@@:
mov [edx].CodedBlocks,al ; Store coded block pattern.
add esi,4
mov InterSWDBlocks,esi
xor ebx,ebx
and eax,00FH
mov MBCentralInterSWD,ebp
cmp al,00FH ; Are any blocks marked empty?
jne InterBest ; If some blocks are empty, can't code as Intra
cmp ebp,InterCodingThreshold ; Is InterSWD below inter-coding threshhold.
lea esi,Block1+128
mov ebp,0
jae CalculateIntraSWD
InterBest:
mov ecx,InterSWDTotal
mov ebp,MBCentralInterSWD
add ecx,ebp ; Add to total for this macroblock class.
mov PD [edx].SWD,ebp
mov InterSWDTotal,ecx
jmp NextMacroBlock
; Activity Details for this section of code (refer to flow diagram above):
;
; 11) The IntraSWD is calculated as two partial sums, one in the low order
; 16 bits of ebp and one in the high order 16 bits. An average pel
; value for each block will be calculated to the nearest half.
;
; Register usage for this section:
;
; Input of this section:
;
; None
;
; Predominate usage for body of this section:
;
; esi -- Address of target block 1 (3), plus 128.
; ebp[ 0:15] -- IntraSWD Accumulator for block 1 (3).
; ebp[16:31] -- IntraSWD Accumulator for block 2 (4).
; edi -- Block 2 (4) target pel, times -8, and with WeightedDiff added.
; edx -- Block 1 (3) target pel, times -8, and with WeightedDiff added.
; ecx[ 0: 7] -- Weighted difference for one pel in block 2 (4).
; ecx[ 8:15] -- Zero.
; ecx[16:23] -- Weighted difference for one pel in block 1 (3).
; ecx[24:31] -- Zero.
; ebx -- Average block 2 (4) target pel to nearest .5.
; eax -- Average block 1 (3) target pel to nearest .5.
;
; Output of this section:
;
; edi -- Scratch.
; ebp[ 0:15] -- IntraSWD. (Also written to MBlockActionStream.)
; ebp[16:31] -- garbage.
; ebx -- Zero.
; eax -- MBlockActionStream.
;
; Expected Pentium (tm) microprocessor performance for this section:
;
; Executed once per macroblock, (except for those for which one of more blocks
; are marked empty, or where the InterSWD is less than a threshold).
;
; 183 clocks for instruction execution
; 12 clocks for bank conflicts (94 dual mem ops with 1/8 chance of conflict)
; ----
; 195 clocks total time for this section.
IntraByDecree:
mov eax,InterSWDBlocks ; Inc by 4, because we will undo it below.
xor ebp,ebp
mov MBMotionVectors,ebp ; Stash zero for MB level motion vectors.
mov ebp,040000000H ; Set Inter SWD artificially high.
lea esi,Block1+128
add eax,4
mov MBCentralInterSWD,ebp
mov InterSWDBlocks,eax
CalculateIntraSWD:
CalculateIntraSWDLoop:
mov eax,[esi-128].AccumTargetPels ; Fetch acc of target pels for 1st block.
mov edx,[esi-128].N8T00
add eax,8
mov ebx,[esi-128+BlockLen].AccumTargetPels
shr eax,4 ; Average block 1 target pel rounded to nearest .5.
add ebx,8
shr ebx,4
mov edi,[esi-128+BlockLen].N8T00
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T02
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T02
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T04
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T04
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T06
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T06
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T11
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T11
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T13
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T13
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T15
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T15
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T17
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T17
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T20
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T20
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T22
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T22
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T24
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T24
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T26
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T26
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T31
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T31
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T33
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T33
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T35
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T35
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T37
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T37
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T40
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T40
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T42
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T42
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T44
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T44
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T46
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T46
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T51
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T51
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T53
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T53
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T55
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T55
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T57
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T57
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T60
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T60
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T62
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T62
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T64
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T64
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T66
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T66
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T71
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T71
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T73
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T73
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T75
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T75
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov edx,[esi-128].N8T77
mov cl,PB [edi+ebx*4+2]
mov edi,[esi-128+BlockLen].N8T77
add ebp,ecx
mov ecx,PD [edx+eax*4]
mov cl,PB [edi+ebx*4+2]
mov eax,000007FFFH
add ebp,ecx
add esi,BlockLen*2
and eax,ebp
mov ecx,MBCentralInterSWD
shr ebp,16
sub ecx,IntraCodingDifferential
add ebp,eax
mov edx,MBlockActionStream ; Reload list ptr.
cmp ecx,ebp ; Is IntraSWD > InterSWD - differential?
jl InterBest
lea ecx,Block1+128+BlockLen*2
cmp ecx,esi
je CalculateIntraSWDLoop
; ebp -- IntraSWD
; edx -- MBlockActionStream
DoneCalcIntraSWD:
IntraBest:
mov ecx,IntraSWDTotal
mov edi,IntraSWDBlocks
add ecx,ebp ; Add to total for this macroblock class.
add edi,4 ; Accumulate # of blocks for this type.
mov IntraSWDBlocks,edi
mov edi,InterSWDBlocks
sub edi,4
mov IntraSWDTotal,ecx
mov InterSWDBlocks,edi
mov bl,INTRA
mov PB [edx].BlockType,bl ; Indicate macroblock handling decision.
IFDEF H261
xor ebx,ebx
ELSE ; H263
mov ebx,MBMotionVectors ; Set MVs to best MB level motion vectors.
ENDIF
mov PD [edx].BlkY1.MVs,ebx
mov PD [edx].BlkY2.MVs,ebx
mov PD [edx].BlkY3.MVs,ebx
mov PD [edx].BlkY4.MVs,ebx
xor ebx,ebx
mov PD [edx].SWD,ebp
jmp NextMacroBlock
;==============================================================================
; Internal functions
;==============================================================================
DoSWDLoop:
; Upon entry:
; esi -- Points to ref1
; edi -- Points to ref2
; ecx -- Upper 24 bits zero
; ebx -- Upper 24 bits zero
mov bl,PB [esi] ; 00A -- Get Pel 00 in reference ref1.
mov eax,Block1.N8T00+4 ; 00B -- Get -8 times target pel 00.
mov cl,PB [edi] ; 00C -- Get Pel 00 in reference ref2.
sub esp,BlockLen*4+28
SWDLoop:
mov edx,PD [eax+ebx*8] ; 00D -- Get weighted diff for ref1 pel 00.
mov bl,PB [esi+2] ; 02A
mov dl,PB [eax+ecx*8+2] ; 00E -- Get weighted diff for ref2 pel 00.
mov eax,BlockN.N8T02+32 ; 02B
mov ebp,edx ; 00F -- Accum weighted diffs for pel 00.
mov cl,PB [edi+2] ; 02C
mov edx,PD [eax+ebx*8] ; 02D
mov bl,PB [esi+4] ; 04A
mov dl,PB [eax+ecx*8+2] ; 02E
mov eax,BlockN.N8T04+32 ; 04B
mov cl,PB [edi+4] ; 04C
add ebp,edx ; 02F
mov edx,PD [eax+ebx*8] ; 04D
mov bl,PB [esi+6]
mov dl,PB [eax+ecx*8+2] ; 04E
mov eax,BlockN.N8T06+32
mov cl,PB [edi+6]
add ebp,edx ; 04F
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*1+1]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T11+32
mov cl,PB [edi+PITCH*1+1]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*1+3]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T13+32
mov cl,PB [edi+PITCH*1+3]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*1+5]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T15+32
mov cl,PB [edi+PITCH*1+5]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*1+7]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T17+32
mov cl,PB [edi+PITCH*1+7]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*2+0]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T20+32
mov cl,PB [edi+PITCH*2+0]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*2+2]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T22+32
mov cl,PB [edi+PITCH*2+2]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*2+4]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T24+32
mov cl,PB [edi+PITCH*2+4]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*2+6]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T26+32
mov cl,PB [edi+PITCH*2+6]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*3+1]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T31+32
mov cl,PB [edi+PITCH*3+1]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*3+3]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T33+32
mov cl,PB [edi+PITCH*3+3]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*3+5]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T35+32
mov cl,PB [edi+PITCH*3+5]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*3+7]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T37+32
mov cl,PB [edi+PITCH*3+7]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*4+0]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T40+32
mov cl,PB [edi+PITCH*4+0]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*4+2]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T42+32
mov cl,PB [edi+PITCH*4+2]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*4+4]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T44+32
mov cl,PB [edi+PITCH*4+4]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*4+6]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T46+32
mov cl,PB [edi+PITCH*4+6]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*5+1]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T51+32
mov cl,PB [edi+PITCH*5+1]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*5+3]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T53+32
mov cl,PB [edi+PITCH*5+3]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*5+5]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T55+32
mov cl,PB [edi+PITCH*5+5]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*5+7]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T57+32
mov cl,PB [edi+PITCH*5+7]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*6+0]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T60+32
mov cl,PB [edi+PITCH*6+0]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*6+2]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T62+32
mov cl,PB [edi+PITCH*6+2]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*6+4]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T64+32
mov cl,PB [edi+PITCH*6+4]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*6+6]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T66+32
mov cl,PB [edi+PITCH*6+6]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*7+1]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T71+32
mov cl,PB [edi+PITCH*7+1]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*7+3]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T73+32
mov cl,PB [edi+PITCH*7+3]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*7+5]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T75+32
mov cl,PB [edi+PITCH*7+5]
add ebp,edx
mov edx,PD [eax+ebx*8]
mov bl,PB [esi+PITCH*7+7]
mov dl,PB [eax+ecx*8+2]
mov eax,BlockN.N8T77+32
mov cl,PB [edi+PITCH*7+7]
add ebp,edx
mov edx,PD [eax+ebx*8]
add esp,BlockLen
mov dl,PB [eax+ecx*8+2]
mov eax,ebp
add ebp,edx
add edx,eax
shr ebp,16 ; Extract SWD for ref1.
and edx,00000FFFFH ; Extract SWD for ref2.
mov esi,BlockN.Ref1Addr+32 ; Get address of next ref1 block.
mov edi,BlockN.Ref2Addr+32 ; Get address of next ref2 block.
mov BlockNM1.Ref1InterSWD+32,ebp ; Store SWD for ref1.
mov BlockNM1.Ref2InterSWD+32,edx ; Store SWD for ref2.
mov bl,PB [esi] ; 00A -- Get Pel 02 in reference ref1.
mov eax,BlockN.N8T00+32 ; 00B -- Get -8 times target pel 00.
test esp,000000018H ; Done when esp is 32-byte aligned.
mov cl,PB [edi] ; 00C -- Get Pel 02 in reference ref2.
jne SWDLoop
; Output:
; ebp -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
; ecx -- Upper 24 bits zero
; ebx -- Upper 24 bits zero
add esp,28
ret
IFDEF H261
ELSE ; H263
DoSWDHalfPelHorzLoop:
; ebp -- Initialized to 0, except when can't search off left or right edge.
; edi -- Ref addr for block 1. Ref1 is .5 pel to left. Ref2 is .5 to right.
xor ecx,ecx
sub esp,BlockLen*4+28
xor eax,eax
xor ebx,ebx
SWDHalfPelHorzLoop:
mov al,[edi] ; 00A -- Fetch center ref pel 00.
mov esi,BlockN.N8T00+32; 00B -- Target pel 00 (times -8).
mov bl,[edi+2] ; 02A -- Fetch center ref pel 02.
mov edx,BlockN.N8T02+32; 02B -- Target pel 02 (times -8).
lea esi,[esi+eax*4] ; 00C -- Combine target pel 00 and center ref pel 00.
mov al,[edi-1] ; 00D -- Get pel to left for match against pel 00.
lea edx,[edx+ebx*4] ; 02C -- Combine target pel 02 and center ref pel 02.
mov bl,[edi+1] ; 00E -- Get pel to right for match against pel 00,
; ; 02D -- and pel to left for match against pel 02.
mov ecx,[esi+eax*4] ; 00F -- [16:23] weighted diff for left ref pel 00.
mov al,[edi+3] ; 02E -- Get pel to right for match against pel 02.
add ebp,ecx ; 00G -- Accumulate left ref pel 00.
mov ecx,[edx+ebx*4] ; 02F -- [16:23] weighted diff for left ref pel 02.
mov cl,[edx+eax*4+2] ; 02H -- [0:7] is weighted diff for right ref pel 02.
mov al,[edi+4] ; 04A
add ebp,ecx ; 02I -- Accumulate right ref pel 02,
; ; 02G -- Accumulate left ref pel 02.
mov bl,[esi+ebx*4+2] ; 00H -- [0:7] is weighted diff for right ref pel 00.
add ebp,ebx ; 00I -- Accumulate right ref pel 00.
mov esi,BlockN.N8T04+32; 04B
mov bl,[edi+6] ; 06A
mov edx,BlockN.N8T06+32; 06B
lea esi,[esi+eax*4] ; 04C
mov al,[edi+3] ; 04D
lea edx,[edx+ebx*4] ; 06C
mov bl,[edi+5] ; 04E & 06D
mov ecx,[esi+eax*4] ; 04F
mov al,[edi+7] ; 06E
add ebp,ecx ; 04G
mov ecx,[edx+ebx*4] ; 06F
mov cl,[edx+eax*4+2] ; 06H
mov al,[edi+PITCH*1+1] ; 11A
add ebp,ecx ; 04I & 06G
mov bl,[esi+ebx*4+2] ; 04H
add ebp,ebx ; 04I
mov esi,BlockN.N8T11+32; 11B
mov bl,[edi+PITCH*1+3] ; 13A
mov edx,BlockN.N8T13+32; 13B
lea esi,[esi+eax*4] ; 11C
mov al,[edi+PITCH*1+0] ; 11D
lea edx,[edx+ebx*4] ; 13C
mov bl,[edi+PITCH*1+2] ; 11E & 13D
mov ecx,[esi+eax*4] ; 11F
mov al,[edi+PITCH*1+4] ; 13E
add ebp,ecx ; 11G
mov ecx,[edx+ebx*4] ; 13F
mov cl,[edx+eax*4+2] ; 13H
mov al,[edi+PITCH*1+5] ; 15A
add ebp,ecx ; 11I & 13G
mov bl,[esi+ebx*4+2] ; 11H
add ebp,ebx ; 11I
mov esi,BlockN.N8T15+32; 15B
mov bl,[edi+PITCH*1+7] ; 17A
mov edx,BlockN.N8T17+32; 17B
lea esi,[esi+eax*4] ; 15C
mov al,[edi+PITCH*1+4] ; 15D
lea edx,[edx+ebx*4] ; 17C
mov bl,[edi+PITCH*1+6] ; 15E & 17D
mov ecx,[esi+eax*4] ; 15F
mov al,[edi+PITCH*1+8] ; 17E
add ebp,ecx ; 15G
mov ecx,[edx+ebx*4] ; 17F
mov cl,[edx+eax*4+2] ; 17H
mov al,[edi+PITCH*2+0] ; 20A
add ebp,ecx ; 15I & 17G
mov bl,[esi+ebx*4+2] ; 15H
add ebp,ebx ; 15I
mov esi,BlockN.N8T20+32; 20B
mov bl,[edi+PITCH*2+2] ; 22A
mov edx,BlockN.N8T22+32; 22B
lea esi,[esi+eax*4] ; 20C
mov al,[edi+PITCH*2-1] ; 20D
lea edx,[edx+ebx*4] ; 22C
mov bl,[edi+PITCH*2+1] ; 20E & 22D
mov ecx,[esi+eax*4] ; 20F
mov al,[edi+PITCH*2+3] ; 22E
add ebp,ecx ; 20G
mov ecx,[edx+ebx*4] ; 22F
mov cl,[edx+eax*4+2] ; 22H
mov al,[edi+PITCH*2+4] ; 24A
add ebp,ecx ; 20I & 22G
mov bl,[esi+ebx*4+2] ; 20H
add ebp,ebx ; 20I
mov esi,BlockN.N8T24+32; 24B
mov bl,[edi+PITCH*2+6] ; 26A
mov edx,BlockN.N8T26+32; 26B
lea esi,[esi+eax*4] ; 24C
mov al,[edi+PITCH*2+3] ; 24D
lea edx,[edx+ebx*4] ; 26C
mov bl,[edi+PITCH*2+5] ; 24E & 26D
mov ecx,[esi+eax*4] ; 24F
mov al,[edi+PITCH*2+7] ; 26E
add ebp,ecx ; 24G
mov ecx,[edx+ebx*4] ; 26F
mov cl,[edx+eax*4+2] ; 26H
mov al,[edi+PITCH*3+1] ; 31A
add ebp,ecx ; 24I & 26G
mov bl,[esi+ebx*4+2] ; 24H
add ebp,ebx ; 24I
mov esi,BlockN.N8T31+32; 31B
mov bl,[edi+PITCH*3+3] ; 33A
mov edx,BlockN.N8T33+32; 33B
lea esi,[esi+eax*4] ; 31C
mov al,[edi+PITCH*3+0] ; 31D
lea edx,[edx+ebx*4] ; 33C
mov bl,[edi+PITCH*3+2] ; 31E & 33D
mov ecx,[esi+eax*4] ; 31F
mov al,[edi+PITCH*3+4] ; 33E
add ebp,ecx ; 31G
mov ecx,[edx+ebx*4] ; 33F
mov cl,[edx+eax*4+2] ; 33H
mov al,[edi+PITCH*3+5] ; 35A
add ebp,ecx ; 31I & 33G
mov bl,[esi+ebx*4+2] ; 31H
add ebp,ebx ; 31I
mov esi,BlockN.N8T35+32; 35B
mov bl,[edi+PITCH*3+7] ; 37A
mov edx,BlockN.N8T37+32; 37B
lea esi,[esi+eax*4] ; 35C
mov al,[edi+PITCH*3+4] ; 35D
lea edx,[edx+ebx*4] ; 37C
mov bl,[edi+PITCH*3+6] ; 35E & 37D
mov ecx,[esi+eax*4] ; 35F
mov al,[edi+PITCH*3+8] ; 37E
add ebp,ecx ; 35G
mov ecx,[edx+ebx*4] ; 37F
mov cl,[edx+eax*4+2] ; 37H
mov al,[edi+PITCH*4+0] ; 40A
add ebp,ecx ; 35I & 37G
mov bl,[esi+ebx*4+2] ; 35H
add ebp,ebx ; 35I
mov esi,BlockN.N8T40+32; 40B
mov bl,[edi+PITCH*4+2] ; 42A
mov edx,BlockN.N8T42+32; 42B
lea esi,[esi+eax*4] ; 40C
mov al,[edi+PITCH*4-1] ; 40D
lea edx,[edx+ebx*4] ; 42C
mov bl,[edi+PITCH*4+1] ; 40E & 42D
mov ecx,[esi+eax*4] ; 40F
mov al,[edi+PITCH*4+3] ; 42E
add ebp,ecx ; 40G
mov ecx,[edx+ebx*4] ; 42F
mov cl,[edx+eax*4+2] ; 42H
mov al,[edi+PITCH*4+4] ; 44A
add ebp,ecx ; 40I & 42G
mov bl,[esi+ebx*4+2] ; 40H
add ebp,ebx ; 40I
mov esi,BlockN.N8T44+32; 44B
mov bl,[edi+PITCH*4+6] ; 46A
mov edx,BlockN.N8T46+32; 46B
lea esi,[esi+eax*4] ; 44C
mov al,[edi+PITCH*4+3] ; 44D
lea edx,[edx+ebx*4] ; 46C
mov bl,[edi+PITCH*4+5] ; 44E & 46D
mov ecx,[esi+eax*4] ; 44F
mov al,[edi+PITCH*4+7] ; 46E
add ebp,ecx ; 44G
mov ecx,[edx+ebx*4] ; 46F
mov cl,[edx+eax*4+2] ; 46H
mov al,[edi+PITCH*5+1] ; 51A
add ebp,ecx ; 44I & 46G
mov bl,[esi+ebx*4+2] ; 44H
add ebp,ebx ; 44I
mov esi,BlockN.N8T51+32; 51B
mov bl,[edi+PITCH*5+3] ; 53A
mov edx,BlockN.N8T53+32; 53B
lea esi,[esi+eax*4] ; 51C
mov al,[edi+PITCH*5+0] ; 51D
lea edx,[edx+ebx*4] ; 53C
mov bl,[edi+PITCH*5+2] ; 51E & 53D
mov ecx,[esi+eax*4] ; 51F
mov al,[edi+PITCH*5+4] ; 53E
add ebp,ecx ; 51G
mov ecx,[edx+ebx*4] ; 53F
mov cl,[edx+eax*4+2] ; 53H
mov al,[edi+PITCH*5+5] ; 55A
add ebp,ecx ; 51I & 53G
mov bl,[esi+ebx*4+2] ; 51H
add ebp,ebx ; 51I
mov esi,BlockN.N8T55+32; 55B
mov bl,[edi+PITCH*5+7] ; 57A
mov edx,BlockN.N8T57+32; 57B
lea esi,[esi+eax*4] ; 55C
mov al,[edi+PITCH*5+4] ; 55D
lea edx,[edx+ebx*4] ; 57C
mov bl,[edi+PITCH*5+6] ; 55E & 57D
mov ecx,[esi+eax*4] ; 55F
mov al,[edi+PITCH*5+8] ; 57E
add ebp,ecx ; 55G
mov ecx,[edx+ebx*4] ; 57F
mov cl,[edx+eax*4+2] ; 57H
mov al,[edi+PITCH*6+0] ; 60A
add ebp,ecx ; 55I & 57G
mov bl,[esi+ebx*4+2] ; 55H
add ebp,ebx ; 55I
mov esi,BlockN.N8T60+32; 60B
mov bl,[edi+PITCH*6+2] ; 62A
mov edx,BlockN.N8T62+32; 62B
lea esi,[esi+eax*4] ; 60C
mov al,[edi+PITCH*6-1] ; 60D
lea edx,[edx+ebx*4] ; 62C
mov bl,[edi+PITCH*6+1] ; 60E & 62D
mov ecx,[esi+eax*4] ; 60F
mov al,[edi+PITCH*6+3] ; 62E
add ebp,ecx ; 60G
mov ecx,[edx+ebx*4] ; 62F
mov cl,[edx+eax*4+2] ; 62H
mov al,[edi+PITCH*6+4] ; 64A
add ebp,ecx ; 60I & 62G
mov bl,[esi+ebx*4+2] ; 60H
add ebp,ebx ; 60I
mov esi,BlockN.N8T64+32; 64B
mov bl,[edi+PITCH*6+6] ; 66A
mov edx,BlockN.N8T66+32; 66B
lea esi,[esi+eax*4] ; 64C
mov al,[edi+PITCH*6+3] ; 64D
lea edx,[edx+ebx*4] ; 66C
mov bl,[edi+PITCH*6+5] ; 64E & 66D
mov ecx,[esi+eax*4] ; 64F
mov al,[edi+PITCH*6+7] ; 66E
add ebp,ecx ; 64G
mov ecx,[edx+ebx*4] ; 66F
mov cl,[edx+eax*4+2] ; 66H
mov al,[edi+PITCH*7+1] ; 71A
add ebp,ecx ; 64I & 66G
mov bl,[esi+ebx*4+2] ; 64H
add ebp,ebx ; 64I
mov esi,BlockN.N8T71+32; 71B
mov bl,[edi+PITCH*7+3] ; 73A
mov edx,BlockN.N8T73+32; 73B
lea esi,[esi+eax*4] ; 71C
mov al,[edi+PITCH*7+0] ; 71D
lea edx,[edx+ebx*4] ; 73C
mov bl,[edi+PITCH*7+2] ; 71E & 73D
mov ecx,[esi+eax*4] ; 71F
mov al,[edi+PITCH*7+4] ; 73E
add ebp,ecx ; 71G
mov ecx,[edx+ebx*4] ; 73F
mov cl,[edx+eax*4+2] ; 73H
mov al,[edi+PITCH*7+5] ; 75A
add ebp,ecx ; 71I & 73G
mov bl,[esi+ebx*4+2] ; 71H
add ebp,ebx ; 71I
mov esi,BlockN.N8T75+32; 75B
mov bl,[edi+PITCH*7+7] ; 77A
mov edx,BlockN.N8T77+32; 77B
lea esi,[esi+eax*4] ; 75C
mov al,[edi+PITCH*7+4] ; 75D
lea edx,[edx+ebx*4] ; 77C
mov bl,[edi+PITCH*7+6] ; 75E & 77D
mov ecx,[esi+eax*4] ; 75F
mov al,[edi+PITCH*7+8] ; 77E
add ebp,ecx ; 75G
mov ecx,[edx+ebx*4] ; 77F
mov cl,[edx+eax*4+2] ; 77H
add esp,BlockLen
add ecx,ebp ; 75I & 77G
mov bl,[esi+ebx*4+2] ; 75H
add ebx,ecx ; 75I
mov edi,BlockN.AddrCentralPoint+32 ; Get address of next ref1 block.
shr ecx,16 ; Extract SWD for ref1.
and ebx,00000FFFFH ; Extract SWD for ref2.
mov BlockNM1.Ref1InterSWD+32,ecx ; Store SWD for ref1.
mov BlockNM1.Ref2InterSWD+32,ebx ; Store SWD for ref2.
xor ebp,ebp
mov edx,ebx
test esp,000000018H
mov ebx,ebp
jne SWDHalfPelHorzLoop
; Output:
; ebp, ebx -- Zero
; ecx -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
add esp,28
ret
DoSWDHalfPelVertLoop:
; ebp -- Initialized to 0, except when can't search off left or right edge.
; edi -- Ref addr for block 1. Ref1 is .5 pel up. Ref2 is .5 down.
xor ecx,ecx
sub esp,BlockLen*4+28
xor eax,eax
xor ebx,ebx
SWDHalfPelVertLoop:
mov al,[edi]
mov esi,BlockN.N8T00+32
mov bl,[edi+2*PITCH]
mov edx,BlockN.N8T20+32
lea esi,[esi+eax*4]
mov al,[edi-1*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+1*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+3*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+4*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T40+32
mov bl,[edi+6*PITCH]
mov edx,BlockN.N8T60+32
lea esi,[esi+eax*4]
mov al,[edi+3*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+5*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+7*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+1+1*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T11+32
mov bl,[edi+1+3*PITCH]
mov edx,BlockN.N8T31+32
lea esi,[esi+eax*4]
mov al,[edi+1+0*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+1+2*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+1+4*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+1+5*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T51+32
mov bl,[edi+1+7*PITCH]
mov edx,BlockN.N8T71+32
lea esi,[esi+eax*4]
mov al,[edi+1+4*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+1+6*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+1+8*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+2+0*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T02+32
mov bl,[edi+2+2*PITCH]
mov edx,BlockN.N8T22+32
lea esi,[esi+eax*4]
mov al,[edi+2-1*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+2+1*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+2+3*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+2+4*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T42+32
mov bl,[edi+2+6*PITCH]
mov edx,BlockN.N8T62+32
lea esi,[esi+eax*4]
mov al,[edi+2+3*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+2+5*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+2+7*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+3+1*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T13+32
mov bl,[edi+3+3*PITCH]
mov edx,BlockN.N8T33+32
lea esi,[esi+eax*4]
mov al,[edi+3+0*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+3+2*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+3+4*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+3+5*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T53+32
mov bl,[edi+3+7*PITCH]
mov edx,BlockN.N8T73+32
lea esi,[esi+eax*4]
mov al,[edi+3+4*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+3+6*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+3+8*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+4+0*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T04+32
mov bl,[edi+4+2*PITCH]
mov edx,BlockN.N8T24+32
lea esi,[esi+eax*4]
mov al,[edi+4-1*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+4+1*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+4+3*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+4+4*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T44+32
mov bl,[edi+4+6*PITCH]
mov edx,BlockN.N8T64+32
lea esi,[esi+eax*4]
mov al,[edi+4+3*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+4+5*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+4+7*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+5+1*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T15+32
mov bl,[edi+5+3*PITCH]
mov edx,BlockN.N8T35+32
lea esi,[esi+eax*4]
mov al,[edi+5+0*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+5+2*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+5+4*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+5+5*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T55+32
mov bl,[edi+5+7*PITCH]
mov edx,BlockN.N8T75+32
lea esi,[esi+eax*4]
mov al,[edi+5+4*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+5+6*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+5+8*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+6+0*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T06+32
mov bl,[edi+6+2*PITCH]
mov edx,BlockN.N8T26+32
lea esi,[esi+eax*4]
mov al,[edi+6-1*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+6+1*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+6+3*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+6+4*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T46+32
mov bl,[edi+6+6*PITCH]
mov edx,BlockN.N8T66+32
lea esi,[esi+eax*4]
mov al,[edi+6+3*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+6+5*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+6+7*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+7+1*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T17+32
mov bl,[edi+7+3*PITCH]
mov edx,BlockN.N8T37+32
lea esi,[esi+eax*4]
mov al,[edi+7+0*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+7+2*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+7+4*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
mov al,[edi+7+5*PITCH]
add ebp,ecx
mov bl,[esi+ebx*4+2]
add ebp,ebx
mov esi,BlockN.N8T57+32
mov bl,[edi+7+7*PITCH]
mov edx,BlockN.N8T77+32
lea esi,[esi+eax*4]
mov al,[edi+7+4*PITCH]
lea edx,[edx+ebx*4]
mov bl,[edi+7+6*PITCH]
mov ecx,[esi+eax*4]
mov al,[edi+7+8*PITCH]
add ebp,ecx
mov ecx,[edx+ebx*4]
mov cl,[edx+eax*4+2]
add esp,BlockLen
add ecx,ebp
mov bl,[esi+ebx*4+2]
add ebx,ecx
mov edi,BlockN.AddrCentralPoint+32
shr ecx,16
and ebx,00000FFFFH
mov BlockNM1.Ref1InterSWD+32,ecx
mov BlockNM1.Ref2InterSWD+32,ebx
xor ebp,ebp
mov edx,ebx
test esp,000000018H
mov ebx,ebp
jne SWDHalfPelVertLoop
; Output:
; ebp, ebx -- Zero
; ecx -- Ref1 SWD for block 4
; edx -- Ref2 SWD for block 4
add esp,28
ret
ENDIF ; H263
; Performance for common macroblocks:
; 298 clocks: prepare target pels, compute avg target pel, compute 0-MV SWD.
; 90 clocks: compute IntraSWD.
; 1412 clocks: 6-level search for best SWD.
; 16 clocks: record best fit.
; 945 clocks: calculate spatial loop filtered prediction.
; 152 clocks: calculate SWD for spatially filtered prediction and classify.
; ----
; 2913 clocks total
;
; Performance for macroblocks in which 0-motion vector is "good enough":
; 298 clocks: prepare target pels, compute avg target pel, compute 0-MV SWD.
; 90 clocks: compute IntraSWD.
; 16 clocks: record best fit.
; 58 clocks: extra cache fill burden on adjacent MB if SWD-search not done.
; 945 clocks: calculate spatial loop filtered prediction.
; 152 clocks: calculate SWD for spatially filtered prediction and classify.
; ----
; 1559 clocks total
;
; Performance for macroblocks marked as intrablock by decree of caller:
; 298 clocks: prepare target pels, compute avg target pel, compute 0-MV SWD.
; 90 clocks: compute IntraSWD.
; 58 clocks: extra cache fill burden on adjacent MB if SWD-search not done.
; 20 clocks: classify (just weight the SWD for # of match points).
; ----
; 476 clocks total
;
; 160*120 performance, generously estimated (assuming lots of motion):
;
; 2913 * 80 = 233000 clocks for luma.
; 2913 * 12 = 35000 clocks for chroma.
; 268000 clocks per frame * 15 = 4,020,000 clocks/sec.
;
; 160*120 performance, assuming typical motion:
;
; 2913 * 40 + 1559 * 40 = 179000 clocks for luma.
; 2913 * 8 + 1559 * 4 = 30000 clocks for chroma.
; 209000 clocks per frame * 15 = 3,135,000 clocks/sec.
;
; Add 10-20% to allow for initial cache-filling, and unfortunate cases where
; cache-filling policy preempts areas of the tables that are not locally "hot",
; instead of preempting macroblocks upon which the processing was just finished.
Done:
mov eax,IntraSWDTotal
mov ebx,IntraSWDBlocks
mov ecx,InterSWDTotal
mov edx,InterSWDBlocks
mov esp,StashESP
mov edi,[esp+IntraSWDTotal_arg]
mov [edi],eax
mov edi,[esp+IntraSWDBlocks_arg]
mov [edi],ebx
mov edi,[esp+InterSWDTotal_arg]
mov [edi],ecx
mov edi,[esp+InterSWDBlocks_arg]
mov [edi],edx
pop ebx
pop ebp
pop edi
pop esi
rturn
MOTIONESTIMATION endp
END