Windows NT 4.0 source code leak

  1. // TxDBase.h -- CTextDatabase class definition
  2. #ifndef __TXDBASE_H__
  3. #define __TXDBASE_H__
  4. #include "SegHash.h"
  5. #include "VMBuffer.h"
  6. #include "Indicate.h"
  7. #include "Classify.h"
  8. #include "Defines.h"
  9. #include "TextMat.h"
  10. #include "UnbuffIO.h"
  11. #include "IOList.h"
  12. #include "IOStream.h"
  13. #include "FTSIFace.h"
  14. #include "Compress.h"
  15. #include "Util.h"
  16. #include "Sorting.h"
  17. #include "dict.h"
  18. #include "vector.h"
  19. // !!! BugBug !!! Parts of the long comment below are now incorrect and need revision.
  20. // Term Tags Strategies
  21. //
  22. // The term tag data structures go with term entries in the segmented
  23. // hash tables. We use two hash tables - the Global table and the Galactic
  24. // table. The term tags for the two tables are designed for the actions we
  25. // take with the hash table.
  26. //
  27. // Our index work proceeds in four phases --
  28. //
  29. // 1. Constructing Local Dictionaries
  30. // During this phase we build a local term dictionary with very restricted
  31. // context. The dictionary strictly covers a range of up to 65536 tokens.
  32. // The local dictionary is an unsegmented hash table biased for speed and very
  33. // low collision rates.
  34. //
  35. // 2. Linking Local Dictionaries with the Global Dictionary
  36. // When a local dictionary reaches a capacity limit or when we must force
  37. // our text database to a searchable state, we link local dictionary entries
  38. // with corresponding global hash table entries. This has two effects --
  39. //
  40. // A. We merge the local terms with the global terms, adding new unique terms
  41. // to the global list.
  42. //
  43. // B. We now have a global linked list which traverses all the references for
  44. // each term. This was the original searchable format for the database. It
  45. // works well so long as the database fits entirely within RAM and degrades
  46. // when our working set significantly exceeds RAM space. The current code
  47. // doesn't construct these global links, but relies instead on the flattening
  48. // phase below.
  49. //
  50. // 3. Flattening Linked Lists
  51. // When the collection of reference links in the local dictionaries reaches
  52. // a memory size threshold, we traverse the linked lists and construct a
  53. // collection of flattened vectors of reference indices. At the same time we
  54. // compress the streams of reference indices. The compression algorithm relies
  55. // on three fields maintained for each term in the term tag --
  56. //
  57. // A. iNewRefFirst -- the index of the first instance in the linked stream
  58. // B. iNewRefLast -- the index of the last instance in the linked stream
  59. // C. cRefsNew -- the number of instances in the linked stream
  60. //
  61. // To merge new flattened vectors with previously accumulated vectors we
  62. // maintain four additional term tag fields --
  63. //
  64. // B. iRefListBase -- index to the stream of flattened references for this term.
  65. // C. cdwRefs -- size of the flattened reference lists in DWords.
  66. // D. iRefSequence -- ranking order of this term relative to the galactic table
  67. // A -1 value indicates no ranking.
  68. //
  69. // Note that we simply catenate the compressed reference lists from sequential
  70. // flattening passes. Thus the stream denoted by iRefListBase has the format --
  71. //
  72. // {cdw, cRefsStream, iRefFirst, <<basis>, <compressed ref>...>} ...
  73. //
  74. // where --
  75. //
  76. // cdw is the size of the reference list segment in DWords
  77. // cRefsStream is the number of references in the list.
  78. // iRefFirst is the first reference in the list
  79. // <basis> is a five-bit value which drives the compression
  80. // algorithm.
  81. // <compressed ref> values are variable length bit strings which
  82. // represent the delta between successive reference
  83. // indices.
  84. //
  85. // Note that cdw could be derived from cRefsStream and <basis>. We include
  86. // it in the reference stream to allow a fast traversal of the stream when
  87. // we're looking for a particular indexing range.
  88. //
  89. // The iRefSequence field is an ordering value maintained in the galactic hash
  90. // table for each unique term. It's used to speed up the process of merging new
  91. // reference vector segments with previously accumulated segments. The strategy
  92. // is to keep the reference segments in an incrementally committed memory address
  93. // range and to insert segments by copying reference streams upward in memory,
  94. // inserting the new segments as we go. The iRefSequence field gives us the memory
  95. // order for the reference streams.
  96. //
  97. // Note that for terms with fewer than four references we keep the reference
  98. // information entirely in the term tag. That situation is denoted by negative
  99. // values in cRefsTotal, iRefListBase, and cdwRefs. Zero values are used to
  100. // mark the one- and two-reference cases. The actual index values are the bitwise
  101. // complement (~) of those fields in the order mentioned above. When a new vector
  102. // segment would push the global total beyond three, we merge the old vector
  103. // with the new one and create an external list.
  104. //
  105. // Note that the iRefSequence field may be undefined (-1) or defined >= 0 for
  106. // a term with fewer than four references. This is because the galactic table
  107. // may have other references that push us over the limit.
  108. //
  109. // Why do we bother with this complicated scheme? Its value lies in reducing
  110. // the number of items in the reference stream. During the merge work this
  111. // reduces the number of items that must be slid upward in memory and it
  112. // reduces the number of iRefListBase fields that must be adjusted during
  113. // the merge operation. Approximately 45% of all unique terms are used only
  114. // once. By keeping small lists in the tag, we reduce the number of external
  115. // lists by 75%.
  116. //
  117. // 4. Galactic Merges
  118. // When the global table reaches a memory size threshold, we merge its reference
  119. // information with the galactic hash table and restart our indexing work with
  120. // an empty global table. The issue here is keeping the global table small enough
  121. // so that it fits completely within RAM during phase 2 work.
  122. //
  123. // The galactic term tags contain only the accumulation fields --
  124. //
  125. // B. iRefListBase -- index to the stream of flattened references for this term.
  126. // C. cdwRefs -- size of the flattened reference lists in DWords.
  127. // D. iRefSequence -- ranking order of this term relative to the galactic table.
  128. // A -1 value indicates no ranking.
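// Illustrative sketch, not part of the original header: the comment above says
// that terms with fewer than four references keep their reference indices in
// the tag itself, stored as the complement (~) of the index, with zero marking
// an unused slot. The struct SmallRefTag and the helpers PackSmallRefs,
// IsInlineList and UnpackSmallRefs are hypothetical names invented for this
// example. It assumes signed 32-bit fields and reference indices below 2^31,
// so that every stored complement is negative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the three tag fields named in the comment.
struct SmallRefTag {
    std::int32_t cRefsTotal;
    std::int32_t iRefListBase;
    std::int32_t cdwRefs;
};

// Store one to three reference indices as bitwise complements.
// Unused slots stay zero.
inline SmallRefTag PackSmallRefs(const std::uint32_t *refs, unsigned count) {
    assert(count >= 1 && count <= 3);
    std::int32_t slot[3] = {0, 0, 0};
    for (unsigned i = 0; i < count; ++i) {
        assert(refs[i] < 0x80000000u);  // complement must come out negative
        slot[i] = ~static_cast<std::int32_t>(refs[i]);
    }
    return SmallRefTag{slot[0], slot[1], slot[2]};
}

// A negative first field marks the "list lives in the tag" situation.
inline bool IsInlineList(const SmallRefTag &t) { return t.cRefsTotal < 0; }

// Recover the indices; returns how many were stored.
inline unsigned UnpackSmallRefs(const SmallRefTag &t, std::uint32_t out[3]) {
    const std::int32_t slot[3] = {t.cRefsTotal, t.iRefListBase, t.cdwRefs};
    unsigned n = 0;
    for (int i = 0; i < 3; ++i) {
        if (slot[i] < 0) out[n++] = static_cast<std::uint32_t>(~slot[i]);
    }
    return n;
}
```

// Because ~0 is -1, reference index 0 is still distinguishable from an empty
// slot, which is one plausible reason for storing complements rather than the
// raw indices.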
  129. typedef struct _TermTagGlobal
  130. {
  131. UINT iGlobalDesc; // Global sequence # for term.
  132. UINT iGalacticDesc; // Galactic sequence # for term.
  133. // UINT iNewRefFirst; // First linked global ref.
  134. // UINT iNewRefLast; // Last linked global ref.
  135. UINT cRefsNew; // # of linked global refs.
  136. UINT cRefsGlobal;
  137. } TermTagGlobal;
  138. typedef TermTagGlobal *PTermTagGlobal;
  139. typedef struct _TermTagGalactic
  140. {
  141. UINT iGalacticDesc; // Galactic sequence # for term.
  142. } TermTagGalactic;
  143. typedef TermTagGalactic *PTermTagGalactic;
  144. typedef struct _DESCRIPTOR
  145. {
  146. PWCHAR pwDisplay; // pbImage is Sort Key, pwDisplay is Display Image.
  147. union
  148. {
  149. PWCHAR pbImage; // Length given by delta with following pd->pbImage.
  150. UINT iGalactic;
  151. };
  152. union
  153. {
  154. UINT cReferences; // Used while building a CTextDatabase
  155. UINT iTokenInfo; // Used in CTokenCollection
  156. UINT iTextSet; // Used in CTitleCollection
  157. };
  158. WORD cwDisplay;
  159. BYTE bCharset;
  160. BYTE fImageFlags;
  161. } DESCRIPTOR;
  162. typedef DESCRIPTOR *PDESCRIPTOR;
  163. inline UINT CbImage(PDESCRIPTOR pd)
  164. {
  165. #ifdef MESSAGEBOXES
  166. if (256 < ((pd+1)->pbImage - pd->pbImage))
  167. {
  168. char ac[256], acToken[101];
  169. wsprintf(ac, "Token length: %d", ((pd+1)->pbImage - pd->pbImage));
  170. ::MessageBox(NULL, ac, "Very Large Token!", MB_OK);
  171. CopyMemory(acToken, pd->pbImage, 50);
  172. acToken[50]= 0;
  173. wsprintf(ac, "Token Image: \"%s...\"", acToken);
  174. ::MessageBox(NULL, ac, "Part of the token image!", MB_OK);
  175. }
  176. #else // MESSAGEBOXES
  177. ASSERT(1024 > ((pd+1)->pbImage - pd->pbImage));
  178. #endif // MESSAGEBOXES
  179. return (pd+1)->pbImage - pd->pbImage;
  180. }
  181. inline UINT CwDisplay(PDESCRIPTOR pd)
  182. {
  183. ASSERT(1024 > ((pd+1)->pwDisplay - pd->pwDisplay));
  184. return (pd+1)->pwDisplay - pd->pwDisplay;
  185. }
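// Illustrative sketch, not part of the original header: CbImage and CwDisplay
// compute a length as the distance to the next descriptor's start pointer, so
// no per-item length is stored and the array needs one trailing sentinel entry.
// The ImageIndex type below is a hypothetical stand-in that uses offsets into
// a std::wstring pool instead of raw PWCHAR pointers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Images are stored back to back. start[i] is the offset of image i, and the
// final element of start is a sentinel marking the end of the last image.
struct ImageIndex {
    std::wstring pool;
    std::vector<std::size_t> start;

    ImageIndex() : start(1, 0) {}

    void Add(const std::wstring &image) {
        pool += image;
        start.push_back(pool.size());  // new sentinel
    }

    std::size_t Count() const { return start.size() - 1; }

    // Same idea as (pd+1)->pbImage - pd->pbImage in CbImage.
    std::size_t Length(std::size_t i) const { return start[i + 1] - start[i]; }

    std::wstring Get(std::size_t i) const {
        return pool.substr(start[i], Length(i));
    }
};
```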
  186. // Flag definitions for DESCRIPTOR.fImageFlags:
  187. // #define LETTER_CHAR 0x0001
  188. // #define CONTAINS_A_TAB 0x0002
  189. // #define TOKEN_FLAGS_MASK 0x0003
  190. // #define REF_TYPE_MASK 0x000C
  191. // #define BASIS_MASK 0xF800
  192. // #define REFS_LINKED 0x0010
  193. // #define BASIS_SHIFT 11
  194. // Reference types for REF_TYPE_MASK:
  195. // #define SingleRef
  196. // #define PairRef
  197. // #define TripleRef
  198. UINT CBitsToRepresent(UINT ui);
  199. UINT FormatAToken(PDESCRIPTOR pd, int cbOffset, int iColStart, int iColLimit, PWCHAR pbLine);
  200. void SortTokenImages(PDESCRIPTOR pdBase, PDESCRIPTOR **pppdSorted, PDESCRIPTOR **pppdTailSorted,
  201. PUINT pcdSorted, UINT cd
  202. );
  203. // #define BUILD_LOCAL_HASH(hv,c) hv= ((hv << 5) | (hv >> 27)) - c
  204. // #define BUILD_GLOBAL_HASH(hv,c) hv= ((hv >> 5) | (hv << 27)) - c
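// Illustrative sketch, not part of the original header: the two commented-out
// macros above rotate the running hash by five bits (left for the local hash,
// right for the global hash) and then subtract the next character. The function
// names below are hypothetical. The zero seed, the per-character loop, and the
// use of the 0x7FFF mask (the value of LOCAL_HASH_MASK) to pick a hash class
// are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>

// One step of BUILD_LOCAL_HASH: rotate left by 5, subtract the character.
inline std::uint32_t LocalHashStep(std::uint32_t hv, std::uint32_t c) {
    return ((hv << 5) | (hv >> 27)) - c;
}

// One step of BUILD_GLOBAL_HASH: rotate right by 5, subtract the character.
inline std::uint32_t GlobalHashStep(std::uint32_t hv, std::uint32_t c) {
    return ((hv >> 5) | (hv << 27)) - c;
}

// Assumed usage: fold every character of a token into the hash, seed 0.
inline std::uint32_t LocalHash(const wchar_t *token, std::size_t cch) {
    std::uint32_t hv = 0;
    for (std::size_t i = 0; i < cch; ++i) {
        hv = LocalHashStep(hv, static_cast<std::uint32_t>(token[i]));
    }
    return hv;
}

// Assumed usage: reduce the hash to one of 0x8000 local hash classes.
inline std::uint32_t LocalHashClass(std::uint32_t hv) { return hv & 0x7FFFu; }
```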
  205. typedef struct _LocalToken
  206. {
  207. unsigned short iLocalDescriptorEntry;
  208. unsigned short iLocalReferenceNext;
  209. } LocalToken;
  210. typedef LocalToken *PLocalToken;
  211. // Descriptor reference tokens are processed in three phases. Tokens are
  212. // initially created with iLocalDescriptorEntry set and iLocalReferenceNext
  213. // zeroed.
  214. //
  215. // Later when we bind a local dictionary to the global dictionary, the
  216. // iLocalReferenceNext field is used to link together every instance of
  217. // each unique term in the local dictionary.
  218. //
  219. // Finally when we reach a specific memory limit, we flatten the linked lists
  220. // for all local dictionaries to create a vector of reference indices for
  221. // each unique term in the global dictionary. At this point we also map the
  222. // LocalToken structure shown above into GlobalToken values (See below).
  223. //
  224. // A GlobalToken is a 16-bit value which refers uniquely to a particular
  225. // global DESCRIPTOR. Since we can easily have more than 64K unique global
  226. // terms, we provide an indirection mechanism which maps some 16-bit values
  227. // into 32-bit values.
  228. //
  229. // Here's how it works. We divide GlobalToken values into two ranges.
  230. // Values between 0..58,983 are absolute indices into the global vector of
  231. // unique DESCRIPTORs. Values between 58,984 and 65,535 are mapped to 32-bit
  232. // indices via a local indirection vector (58,984 = 0x10000 - 6552).
  233. typedef USHORT GlobalToken;
  234. typedef GlobalToken *PGlobalToken;
  235. #define LOCAL_HASH_CLASSES 0x8000
  236. #define LOCAL_HASH_MASK 0x7FFF
  237. #define ENTRIES_PER_LOCAL_DICT 6552
  238. #define MAX_REFS_PER_LDICT 0x10000
  239. #define MAX_GLOBAL_TOKENS (0x10000 - ENTRIES_PER_LOCAL_DICT)
  240. // Note: The constant ENTRIES_PER_LOCAL_DICT is chosen to make the
  241. // LocalDictionary structure exactly 64K bytes.
  242. //
  243. // MAX_GLOBAL_TOKENS is a constant which allows streams of token
  244. // references to fit in 2-byte granules. The first MAX_GLOBAL_TOKENS
  245. // unique tokens we encounter are considered global. References to
  246. // those tokens are encoded in the value range [0..MAX_GLOBAL_TOKENS-1]
  247. // while references to tokens outside that set are denoted by values
  248. // in the range [MAX_GLOBAL_TOKENS .. 0xFFFF]. The latter values can
  249. // be trivially mapped into indices into the local dictionary which
  250. // corresponds to the token reference. One effect of this coding is
  251. // that most local dictionaries will collapse to empty when we convert
  252. // to the vector representation from the linked token representation.
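// Illustrative sketch, not part of the original header: the split of the 16-bit
// GlobalToken range described above can be written as a small resolver. With
// ENTRIES_PER_LOCAL_DICT = 6552, MAX_GLOBAL_TOKENS is 0x10000 - 6552 = 58984.
// The names IsDirectToken and ResolveToken, and the localMap parameter (one
// 32-bit index per local dictionary slot), are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint32_t kEntriesPerLocalDict = 6552;  // ENTRIES_PER_LOCAL_DICT
constexpr std::uint32_t kMaxGlobalTokens = 0x10000 - kEntriesPerLocalDict;
static_assert(kMaxGlobalTokens == 58984, "0x10000 - 6552");

// Tokens below the limit are absolute indices into the global vector.
inline bool IsDirectToken(std::uint16_t token) {
    return token < kMaxGlobalTokens;
}

// Tokens at or above the limit select a slot in the indirection vector of
// the local dictionary that the token reference belongs to.
inline std::uint32_t ResolveToken(std::uint16_t token,
                                  const std::uint32_t *localMap) {
    if (IsDirectToken(token)) return token;
    return localMap[token - kMaxGlobalTokens];
}
```

// The highest token value, 0xFFFF, maps to slot 6551, so the indirect range
// covers exactly one local dictionary's worth of entries.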
  253. typedef struct _LocalDictionary
  254. {
  255. PLocalToken pltFirst; // address of first token for this local dictionary
  256. UINT clt; // count of local tokens which refer to this Local dict
  257. PDESCRIPTOR *ppdNext; // next unused slot in apdLocal.
  258. union
  259. {
  260. PDESCRIPTOR apdLocal[ENTRIES_PER_LOCAL_DICT]; // Refs to descriptors used locally
  261. UINT aiGalactic[ENTRIES_PER_LOCAL_DICT]; // Galactic indices for local terms
  262. };
  263. USHORT aiTokenInstFirst[ENTRIES_PER_LOCAL_DICT]; // List heads for each local
  264. // descriptor.
  265. } LocalDictionary;
  266. typedef LocalDictionary *PLocalDictionary;
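// Illustrative sketch, not part of the original header: a compile-time check of
// the LocalDictionary footprint. It models the struct with fixed-width integers,
// assuming 32-bit pointers and a 32-bit UINT as on the original target. Under
// those assumptions the fields declared above add up to 39,324 bytes, so the
// "exactly 64K bytes" note may describe an earlier layout with more per-entry
// data. That reading is an inference, not something the source states.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kEntriesPerLocalDict = 6552;  // ENTRIES_PER_LOCAL_DICT

// Hypothetical 32-bit model of LocalDictionary: pointers become uint32_t.
struct LocalDictionaryModel32 {
    std::uint32_t pltFirst;
    std::uint32_t clt;
    std::uint32_t ppdNext;
    union {
        std::uint32_t apdLocal[kEntriesPerLocalDict];
        std::uint32_t aiGalactic[kEntriesPerLocalDict];
    };
    std::uint16_t aiTokenInstFirst[kEntriesPerLocalDict];
};

// Three 4-byte header fields, one 4-byte slot and one 2-byte slot per entry.
constexpr std::size_t kExpectedBytes =
    3 * 4 + kEntriesPerLocalDict * 4 + kEntriesPerLocalDict * 2;

static_assert(kExpectedBytes == 39324, "12 + 26208 + 13104");
static_assert(sizeof(LocalDictionaryModel32) == kExpectedBytes,
              "no padding expected in the 32-bit model");
```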
  267. #define IVB_TOKEN_STREAM 0
  268. #define IVB_TOKEN_IMAGES 1
  269. #define IVB_IMAGE_DESCRIPTORS 2
  270. #define IVB_DISPLAY_IMAGES 3
  271. #define COUNT_OF_VIRTUAL_BUFFERS 4
  272. #define vbTokenStream m_avb[IVB_TOKEN_STREAM ]
  273. #define vbTokenImages m_avb[IVB_TOKEN_IMAGES ]
  274. #define vbImageDescriptors m_avb[IVB_IMAGE_DESCRIPTORS]
  275. #define vbDisplayImages m_avb[IVB_DISPLAY_IMAGES]
  276. // Commit and Reservation constants for the virtual buffers
  277. // in the TextDatabaseControl object. These reservations are
  278. // based on an upper limit of 100,000,000 bytes scanned.
  279. #define INIT_TOKEN_REF_COMMIT 0x00010000 // 0x00430000
  280. #define INIT_TOKEN_REF_RESERVATION 0x08000000
  281. #define INIT_TOKEN_IMAGE_COMMIT 0x00010000 // 0x000A0000
  282. #define INIT_TOKEN_IMAGE_RESERVATION 0x03700000
  283. #define INIT_IMAGE_DESCRIPTOR_COMMIT 0x00010000 // 0x00160000
  284. #define INIT_IMAGE_DESCRIPTOR_RESERVATION 0x02A00000
  285. #define INIT_DISPLAY_IMAGE_COMMIT 0x00010000
  286. #define INIT_DISPLAY_IMAGE_RESERVATION 0x03700000
  287. #define BUFFER_INCREMENT 0x2FFFF
  288. #define CB_TEMP_BLOCKS 0x10000 // Approximate block size for unbuffered I/O
  289. #define CB_TRANSACTION_LIMIT 0x40000 // Approximate limit for unbuffered I/O transactions.
  290. const double MEMORY_FACTOR = 0.4; // Fraction of total memory which we're
  291. // allowed to use.
  292. #define CBITS_BASIS_MASK 5
  293. #define BASIS_MASK (~((~0) << CBITS_BASIS_MASK))
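// Illustrative sketch, not part of the original header: BASIS_MASK evaluates to
// 0x1F, matching the five-bit <basis> field of the reference stream. The header
// does not say how <basis> drives the variable-length delta codes. The example
// below substitutes Rice coding, where each delta is split into a unary
// quotient and a basis-bit remainder, purely to show how a small parameter can
// shape such a stream. BitWriter, BitReader, CompressRefs and DecompressRefs
// are hypothetical names, and the bit layout is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned kCBitsBasis = 5;  // mirrors CBITS_BASIS_MASK
constexpr std::uint32_t kBasisMask = ~((~0u) << kCBitsBasis);  // 0x1F

class BitWriter {
public:
    // Append the low cbits of value, least significant bit first.
    void Put(std::uint32_t value, unsigned cbits) {
        for (unsigned i = 0; i < cbits; ++i) {
            if (ibit_ % 32 == 0) dw_.push_back(0);
            if ((value >> i) & 1u) dw_.back() |= 1u << (ibit_ % 32);
            ++ibit_;
        }
    }
    const std::vector<std::uint32_t> &Data() const { return dw_; }
private:
    std::vector<std::uint32_t> dw_;
    std::size_t ibit_ = 0;
};

class BitReader {
public:
    explicit BitReader(const std::vector<std::uint32_t> &dw) : dw_(dw) {}
    std::uint32_t Get(unsigned cbits) {
        std::uint32_t v = 0;
        for (unsigned i = 0; i < cbits; ++i) {
            v |= ((dw_[ibit_ / 32] >> (ibit_ % 32)) & 1u) << i;
            ++ibit_;
        }
        return v;
    }
private:
    const std::vector<std::uint32_t> &dw_;
    std::size_t ibit_ = 0;
};

// Encode an ascending list: basis first, then one Rice code per delta.
// The first reference is kept outside the bit stream, as iRefFirst is.
inline std::vector<std::uint32_t> CompressRefs(
        const std::vector<std::uint32_t> &refs, unsigned basis) {
    basis &= kBasisMask;
    BitWriter w;
    w.Put(basis, kCBitsBasis);
    for (std::size_t i = 1; i < refs.size(); ++i) {
        std::uint32_t delta = refs[i] - refs[i - 1];
        for (std::uint32_t q = delta >> basis; q > 0; --q) w.Put(1, 1);
        w.Put(0, 1);                                   // unary terminator
        w.Put(delta & ~((~0u) << basis), basis);       // remainder bits
    }
    return w.Data();
}

inline std::vector<std::uint32_t> DecompressRefs(
        const std::vector<std::uint32_t> &dw, std::uint32_t first,
        std::size_t count) {
    std::vector<std::uint32_t> out;
    if (count == 0) return out;
    BitReader r(dw);
    unsigned basis = r.Get(kCBitsBasis);
    out.push_back(first);
    for (std::size_t i = 1; i < count; ++i) {
        std::uint32_t q = 0;
        while (r.Get(1)) ++q;
        out.push_back(out.back() + ((q << basis) | r.Get(basis)));
    }
    return out;
}
```

// A larger basis shortens the unary part for widely spaced references, while a
// smaller basis wastes fewer remainder bits on dense ones.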
  294. typedef struct _ReferenceDescriptor
  295. {
  296. UINT iSerialGalactic;
  297. UINT idwRefList;
  298. UINT cdwRefs;
  299. UINT iLastRef;
  300. } ReferenceDescriptor;
  301. typedef ReferenceDescriptor *PReferenceDescriptor;
  302. typedef struct _RefClusterDescriptor
  303. {
  304. UINT iFilePosLow;
  305. UINT iFilePosHigh;
  306. UINT cdw;
  307. UINT cTerms;
  308. } RefClusterDescriptor;
  309. typedef RefClusterDescriptor *PRefClusterDescriptor;
  310. enum {
  311. MAX_LOCAL_DICTS = 4096,
  312. MAX_REF_SETS = 256,
  313. MAX_REF_CLUSTERS = 512,
  314. CB_MERGE_BUFFER = 262144,
  315. SPARE_FILE_BLOCKS = 6
  316. };
  317. // Note: alde and aiTokenRefFirst logically go together. They've been
  318. // split apart to maintain DWord alignment for the alte items.
  319. typedef struct _UnlinkedState
  320. {
  321. PDESCRIPTOR *appdLocalClasses [LOCAL_HASH_CLASSES];
  322. PDESCRIPTOR *appdCollisionChains[ENTRIES_PER_LOCAL_DICT];
  323. UINT cReferences [ENTRIES_PER_LOCAL_DICT];
  324. // USHORT aiTokenInstLast [ENTRIES_PER_LOCAL_DICT]; // List tails for each local
  325. // descriptor.
  326. PLocalDictionary pld;
  327. #ifdef _DEBUG
  328. UINT cCollisions;
  329. #endif // _DEBUG
  330. PWCHAR pbBuffer;
  331. PWCHAR pbCurrentLine;
  332. int cbLineAdjustment;
  333. // The following items are not used to construct local dictionaries.
  334. // They are placed here so that they will be allocated only when
  335. // the current text database is indexing text rather than processing
  336. // queries.
  337. RefClusterDescriptor m_rcd[MAX_REF_CLUSTERS];
  338. PLocalDictionary m_apLocalDict [MAX_LOCAL_DICTS ]; // Need a different upper
  339. #ifdef _DEBUG
  340. UINT m_acLocalCollisions[MAX_LOCAL_DICTS ];
  341. #endif // _DEBUG
  342. UINT m_aiBaseToken [MAX_LOCAL_DICTS+1];
  343. UINT m_aiBaseCByte [MAX_LOCAL_DICTS+1];
  344. } UnlinkedState;
  345. typedef struct _LOCAL_CONTEXT_1
  346. {
  347. CTextDatabase *ptdb;
  348. DESCRIPTOR **ppde;
  349. UINT iDescLimit;
  350. UINT iLTBase;
  351. UINT cAdded;
  352. USHORT ild;
  353. } LOCAL_CONTEXT_1;
  354. typedef struct _LOCAL_CONTEXT_2
  355. {
  356. UINT iSerialNext;
  357. PUINT paiSerial;
  358. } LOCAL_CONTEXT_2;
  359. typedef struct _CompressionState
  360. {
  361. // UINT iRef;
  362. UINT cRefs;
  363. // UINT cbitsBasis;
  364. // union
  365. // {
  366. // UINT ibitNext;
  367. // UINT cbits;
  368. // };
  369. } CompressionState;
  370. typedef struct _LOCAL_CONTEXT_3
  371. {
  372. PUINT puiMap;
  373. CompressionState *paCS;
  374. UINT idBase;
  375. UINT cdw;
  376. UINT cNewRefLists;
  377. } LOCAL_CONTEXT_3;
  378. typedef struct _LOCAL_CONTEXT_4
  379. {
  380. PDESCRIPTOR *ppd;
  381. PDESCRIPTOR pdBase;
  382. } LOCAL_CONTEXT_4;
  383. class CTextDatabase;
  384. class CTokenList;
  385. void MergeLocalEntries(UINT iValue, PVOID pvTag, PVOID pvEnvironment);
  386. void AddLocalEntries (UINT iValue, PVOID pvTag, PVOID pvEnvironment);
  387. class CTextDatabase : public CTextMatrix
  388. {
  389. friend class CTokenList;
  390. friend class CTokenCollection;
  391. friend class CHiliterTokenList;
  392. friend void MergeLocalEntries(UINT iValue, PVOID pvTag, PVOID pvEnvironment);
  393. friend void AddLocalEntries (UINT iValue, PVOID pvTag, PVOID pvEnvironment);
  394. public:
  395. // static CTextDatabase *NewTextDatabase();
  396. virtual ~CTextDatabase();
  397. virtual const BYTE *GetSourceName() {ASSERT(0);return NULL;} // Provide this function
  398. DECLARE_REF_COUNTERS(CTextDatabase)
  399. // Save/Load Interface --
  400. void StoreImage(CPersist *pDiskImage);
  401. int AppendText(PWCHAR pbText, int cbText, BOOL fArticleEnd, UINT iCharset= ANSI_CHARSET, UINT lcid= 0x409);
  402. void SyncForQueries();
  403. UINT CharacterCount ();
  404. UINT TokenCount ();
  405. UINT DescriptorCount();
  406. UINT MaxTokenWidth ();
  407. VOID GetTextMatrix(int iRowStart, int iColStart,
  408. int cRows, int cCols, PWCHAR pbDest);
  409. UINT TextLength(PDESCRIPTOR *ppdSorted, PUINT puiTokenMap, UINT iTokenStart, UINT iTokenLimit);
  410. UINT CopyText (PDESCRIPTOR *ppdSorted, PUINT puiTokenMap, UINT iTokenStart, UINT iTokenLimit, PWCHAR pbBuffer, UINT cbBuffer);
  411. void IndicateVocabularyRefs(CIndicatorSet *pisVocabulary, UINT iPartition, const UINT *piMap);
  412. void IndicateVocabularyRefs(CIndicatorSet *pisVocabulary, CIndicatorSet *pisTokens, const UINT *piMap);
  413. void IndicateArticleRefs (CIndicatorSet *pisArticles, UINT iDescriptor, const UINT *piMap);
  414. void IndicateTokenRefs (CIndicatorSet *pisTokens , UINT iDescriptor);
  415. CIndicatorSet *TopicInstancesFor (CTokenList *ptl);
  416. CIndicatorSet *TokenInstancesFor (CTokenList *ptl);
  417. UINT TokenInstanceCountFor(CTokenList *ptl);
  418. CIndicatorSet *SymbolLocations();
  419. CIndicatorSet *VocabularyFor(CIndicatorSet *pisArticles, BOOL fRemovePervasiveTerms= FALSE);
  420. CIndicatorSet *ValidTokens(CTokenList *ptl);
  421. inline BOOL FPhrases () { return m_fdwOptions & PHRASE_SEARCH; }
  422. inline BOOL FPhraseFeedback() { return m_fdwOptions & PHRASE_FEEDBACK; }
  423. inline BOOL FVectorSearch () { return m_fdwOptions & VECTOR_SEARCH; }
  424. inline UINT IndexOptions () { return m_fdwOptions; }
  425. CDictionary *PDict();
  426. CCollection *PColl();
  427. LCID SortingLCID();
  428. protected:
  429. #ifdef _DEBUG
  430. CTextDatabase(PSZ pszTypeName= "TextDatabase");
  431. #else // _DEBUG
  432. CTextDatabase();
  433. #endif // _DEBUG
  434. void InitTextDatabase(BOOL fFromFile= FALSE);
  435. void ConnectImage(CPersist *pDiskImage, BOOL fUnpackDisplayForm= TRUE);
  436. inline int Data_cRows() { return 1; }
  437. inline int Data_cCols() { return m_cbScanned; }
  438. inline void Data_GetTextMatrix(int rowTop, int colLeft,
  439. int rows, int cols, PWCHAR lpb, PUINT charsets
  440. )
  441. {
  442. GetTextMatrix(rowTop, colLeft, rows, cols, lpb);
  443. }
  444. const UINT * TermRanks();
  445. PUINT TokenBase();
  446. UINT m_fdwOptions;
  447. private:
  448. #ifdef _DEBUG
  449. BOOL m_fInitialized;
  450. #endif // _DEBUG
  451. UINT m_fFromFileImage;
  452. UINT m_cbScanned;
  453. UINT m_cTokensIndexed;
  454. USHORT m_cLocalDicts;
  455. USHORT m_iLocalDictBase;
  456. MY_VIRTUAL_BUFFER m_avb[COUNT_OF_VIRTUAL_BUFFERS];
  457. CSegHashTable *m_pshtGalactic;
  458. CSegHashTable *m_pshtGlobal;
  459. PUINT m_pwHash; // Working storage for the AppendSlave routine...
  460. PBYTE m_pbType;
  461. PWCHAR *m_paStart;
  462. PWCHAR *m_paEnd;
  463. CIndicatorSet *m_pisSymbols;
  464. PLocalToken m_pltNext;
  465. PUINT m_puiTokenNext;
  466. PDESCRIPTOR m_pdNext, m_pdNextGlobal, m_pdNextGalactic, m_pdNextBound;
  467. PWCHAR m_pbNext, m_pbNextGlobal, m_pbNextGalactic, m_pbLastGalactic;
  468. PWCHAR m_pwDispNext, m_pwDispNextGlobal, m_pwDispNextGalactic, m_pwDispLastGalactic;
  469. UINT m_iSerialNumberNext;
  470. PUINT m_paiGlobalToRefList;
  471. CUnbufferedIO *m_puioRefTemp;
  472. CUnbufferedIO *m_puioCompressedRefs;
  473. PRefListDescriptor m_prldTokenRefs;
  474. UINT m_cdwCompressedRefs;
  475. PUINT m_pdwCompressedRefs;
  476. CUnbufferedIO *m_puioCompressedArticleRefs;
  477. PRefListDescriptor m_prldArticleRefs;
  478. UINT m_cdwArticleRefs;
  479. PUINT m_pdwArticleRefs;
  480. CUnbufferedIO *m_puioCompressedVocabularyRefs;
  481. PRefListDescriptor m_prldVocabularyRefs;
  482. UINT m_cdwVocabularyRefs;
  483. PUINT m_pdwVocabularyRefs;
  484. UINT m_cbBlockSize;
  485. UINT m_cbTransactionLimit;
  486. UINT m_iNextRefSet;
  487. UINT m_ibNextFileBlockLow;
  488. UINT m_ibNextFileBlockHigh;
  489. PFileBlockLink m_pFirstFreeFileBlock;
  490. PFileBlockLink m_papFileBlockLinks;
  491. CIOList *m_piolLeft;
  492. CIOList *m_piolRight;
  493. CIOList *m_piolResult;
  494. LCID m_lcidSorting;
  495. PDESCRIPTOR *m_ppdSorted; // left-to-right sorting vector
  496. PDESCRIPTOR *m_ppdTailSorted; // right-to-left sorting vector
  497. UINT m_cdSorted; // number of sorted terms
  498. UINT m_cwDisplayMax;
  499. UINT m_cTermRanks;
  500. PUINT m_pTermRanks;
  501. CClassifier m_clsfTokens;
  502. PUINT m_pafClassifications;
  503. CDictionary *m_pDict;
  504. CCollection *m_pColl;
  505. // BugBug! The private members below are used only during index creation.
  506. // Convert them to external allocations so we don't pay the price
  507. // when we're loading an index.
  508. UnlinkedState *m_pulstate;
  509. virtual UINT GetPartitionInfo(const UINT **ppaiPartitions, const UINT **ppaiRanks= NULL, const UINT **ppaiMap= NULL) = 0;
  510. virtual UINT ArticleCount() = 0;
  511. PDESCRIPTOR DescriptorBase ();
  512. PWCHAR ImageBase ();
  513. PWCHAR DisplayBase ();
  514. int AppendSlave(PWCHAR pbText, int cbText, BOOL fArticleEnd, UINT iCharset, UINT lcid);
  515. int ExceptionFilter(IN DWORD ExceptionCode, IN PEXCEPTION_POINTERS ExceptionInfo);
  516. USHORT SearchLocalTable(PWCHAR pbToken, UINT cbToken, UINT hv, BYTE bType, UINT iCharset, UINT lcid);
  517. CAValRef *DescriptorList(PDESCRIPTOR pd, UINT cd);
  518. void ExtendClassifications(PDESCRIPTOR pdSuffix);
  519. void IndicateMappedRefs(PRefListDescriptor prld, PUINT pdwRefBase, CIndicatorSet *pisArticles, const UINT *piMap);
  520. int IndicateRefs(PRefListDescriptor prld, PUINT pdwRefLists, CIndicatorSet *pis, BOOL fCountOnly, PUINT paiCountArray= NULL);
  521. void WriteLargeBuff(PVOID pvBuffer, UINT iPosLow, UINT iPosHigh, UINT cbBuffer);
  522. PLocalDictionary AllocateLocalDictionary();
  523. PLocalDictionary MoveToNextLocalDict (PWCHAR pbScanLimit);
  524. PDESCRIPTOR *FindTokens(CTokenList *ptl, PUINT pcd= NULL);
  525. void BindToGlobalDict(PWCHAR pbScanLimit);
  526. void FlattenAndMergeLinks ();
  527. void GalacticMerge ();
  528. void CoalesceReferenceLists();
  529. void MergeRefLists(PRefStream prsResult, PRefStream pars, UINT cRefStreams);
  530. void ConstructVocabularyLists();
  531. void CompressVocabularyLists(CIOList *piolSource, UINT cdw);
  532. void CompressArticleRefLists(CIOList *piolSource, UINT cdw);
  533. void CompressRefLists (CIOList *piorSource, UINT cdw);
  534. void CopyRefStreamSegment(CIOList *piolSource, CIOList *piolDestination, UINT cdw);
  535. };
  536. inline PDESCRIPTOR CTextDatabase::DescriptorBase() { return (PDESCRIPTOR) (vbImageDescriptors.Base); }
  537. inline PWCHAR CTextDatabase::ImageBase () { return (PWCHAR ) (vbTokenImages .Base); }
  538. inline PWCHAR CTextDatabase::DisplayBase () { return (PWCHAR ) (vbDisplayImages .Base); }
  539. inline UINT CTextDatabase::CharacterCount () { return m_cbScanned; }
  540. inline UINT CTextDatabase::DescriptorCount() { return m_iSerialNumberNext; }
  541. inline UINT CTextDatabase::MaxTokenWidth () { return m_cwDisplayMax; }
  542. inline PUINT CTextDatabase::TokenBase()
  543. {
  544. return (PUINT) (vbTokenStream.Base);
  545. }
  546. inline UINT CTextDatabase::TokenCount()
  547. {
  548. PLocalDictionary pld;
  549. if (m_pulstate && (pld= m_pulstate->pld))
  550. return (pld->pltFirst + pld->clt) - (PLocalToken) TokenBase();
  551. else return m_pltNext - (PLocalToken) TokenBase();
  552. }
  553. inline CIndicatorSet *CTextDatabase::SymbolLocations() { return m_pisSymbols; }
  554. inline void CTextDatabase::IndicateTokenRefs(CIndicatorSet *pisTokens, UINT iDescriptor)
  555. {
  556. IndicateRefs(m_prldTokenRefs + iDescriptor, m_pdwCompressedRefs, pisTokens, FALSE);
  557. }
  558. inline void CTextDatabase::IndicateArticleRefs(CIndicatorSet *pisArticles, UINT iDescriptor, const UINT *piMap)
  559. {
  560. IndicateMappedRefs(m_prldArticleRefs + iDescriptor, m_pdwArticleRefs, pisArticles, piMap);
  561. }
  562. inline CDictionary *CTextDatabase::PDict() {ASSERT(FVectorSearch()); return m_pDict;}
  563. inline CCollection *CTextDatabase::PColl() {ASSERT(FVectorSearch()); return m_pColl;}
  564. inline LCID CTextDatabase::SortingLCID() { return m_lcidSorting; }
  565. #endif // __TXDBASE_H__