  1. <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
  2. <html>
  3. <head>
  4. <title>Microsoft Index Server Guide: Indexing</title>
  5. <meta name="FORMATTER" content="Microsoft FrontPage 1.1">
  6. <meta name="FORMATTER" content="Microsoft FrontPage 1.1">
  7. <meta name="GENERATOR" content="Microsoft FrontPage 1.1">
  8. </head>
  9. <body bgcolor="#FFFFFF">
  10. <!--Headerbegin--><p align=center><a name="TOP"><img src="onepix.gif" alt="Space" align=middle width=1 height=1></a> <a href="default.htm#Top"><img src="toc.gif" alt=" Contents" align=middle border=0 width=89 height=31></a> <a href="qrylang.htm"><img src="previous.gif" alt="Previous" align=middle border=0 width=32 height=31></a> <a href="filtrhlp.htm"><img src="next.gif" alt="Next" align=middle border=0 width=32 height=31></a> </p>
  11. <hr>
  12. <!--Headerend--><p><a name="Indexing"><font size=6><strong>Indexing</strong></font></a></p>
  13. <p align=left><!--Chaptoc--></p>
  14. <blockquote>
  15. <p><a href="indexhlp.htm#TypesofIndexes">Types of Indexes</a> <br>
  16. <a href="indexhlp.htm#Merging">Types of Merges</a> <br>
  17. <a href="indexhlp.htm#Indexing-RelatedPerformanceCounters">Indexing-Related Performance Counters</a> <br>
  18. <a href="indexhlp.htm#FailureHandling">Failure Handling</a> <br>
  19. <a href="indexhlp.htm#PropertyCache">Property Cache</a> <br>
  20. </p>
  21. </blockquote>
  22. <hr>
  23. <!--ChaptocEnd--><p>Storing the words and properties extracted by the <a href="filtrhlp.htm#CiDaemon"><em>CiDaemon</em></a> process in indexes is referred to as <em>indexing</em>. An index is a
  24. special data structure that is used to satisfy queries efficiently.</p>
  25. <p>As documents in the corpus (Web site) are modified, the indexing program is notified of the updates and those documents
  26. enter a change queue. The <a href="filtrhlp.htm#CiDaemon">CiDaemon</a> process retrieves documents from the <em>change queue</em> in &#8220;first-in-first-out&#8221; order and
  27. <a href="filtrhlp.htm#Filtering">filters</a> them. The resulting words and properties are then added to the index. If there are many documents waiting in the change
  28. queue, there will be a delay before the index has up-to-date information about those documents.</p>
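<p>The change queue described above can be pictured with a short Python sketch. The names <em>ChangeQueue</em>, <em>notify</em>, <em>drain</em>, and <em>split_words</em> are hypothetical and do not come from Index Server; the whitespace tokenizer only stands in for a real content filter.</p>

```python
from collections import deque


class ChangeQueue:
    """Hypothetical first-in-first-out queue of modified document paths."""

    def __init__(self):
        self._pending = deque()

    def notify(self, path):
        # A document changed: append it at the tail of the queue.
        self._pending.append(path)

    def drain(self, filter_document, index):
        # Filter documents oldest-first and add their words to the index.
        while self._pending:
            path = self._pending.popleft()
            for word in filter_document(path):
                index.setdefault(word, set()).add(path)


def split_words(path, corpus):
    # Stand-in for a real content filter: lowercase whitespace tokens.
    return corpus[path].lower().split()
```

<p>A document that is still waiting in the queue contributes nothing to the index yet, which is the delay mentioned above.</p>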
  29. <hr>
  30. <h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="TypesofIndexes">Types of Indexes</a></h1>
  31. <p>There are three types of indexes: <a href="#WordLists"><em>word lists</em></a>, <a href="#ShadowIndex"><em>shadow indexes</em></a>, and a <a href="#MasterIndex"><em>master index</em></a>. Words and properties extracted from a
  32. document first appear in a word list, then move to a shadow index, and finally move to the master index. This organization is
  33. optimized for query responsiveness and performance. It also ensures optimal resource usage. Even though there are multiple
  34. indexes internally, these details are completely hidden from the user. The user sees only a list of documents that satisfy the
  35. query that was posted.</p>
  36. <h2><a name="WordLists">Word Lists</a></h2>
  37. <p>Word lists are small, in-memory indexes. Each word list contains data for a small number of documents. As soon as a
  38. document is <a href="filtrhlp.htm#Filtering">filtered</a>, its data is stored in a word list. Creation of a word list is very quick and does not require updating any
  39. on-disk data. It is used as a temporary staging area during indexing.</p>
  40. <p>There are several registry parameters that control word list behavior. All of them are values under the following registry key:</p>
  41. <blockquote>
  42. <pre>HKEY_LOCAL_MACHINE
  43. \SYSTEM
  44. &#160;\CurrentControlSet
  45. &#160;&#160;\Control
  46. &#160;&#160;&#160;\contentindex</pre>
  47. </blockquote>
  48. <p>The following table shows the registry parameters and explanations.</p>
  49. <table border=1 cellpadding=5 cellspacing=0 width=100%>
  50. <tr><th align=left valign=bottom width=40%><font size=2><strong>Parameter</strong></font></th><th align=left valign=bottom width=60%><font size=2><strong>Explanation</strong></font></th></tr>
  51. <tr><td valign=top width=40%><a href="reghelp.htm#MaxWordlists"><font size=2>MaxWordLists</font></a><font size=2> </font></td><td valign=top width=60%><font size=2>Maximum number of word lists. If the number exceeds this, a </font><a href="#ShadowMerge"><font size=2><em>shadow merge</em></font></a><font size=2> will be
  52. performed.</font></td></tr>
  53. <tr><td valign=top width=40%><a href="reghelp.htm#MaxWordlistSize"><font size=2>MaxWordlistSize</font></a></td><td valign=top width=60%><font size=2>Maximum recommended size of a single word list. If the size of a word list exceeds
  54. this value, a new one will be created. </font><font color="#FF0000"><font size=2><em>This is an internal value and must not be
  55. changed.</em></font></font></td></tr>
  56. <tr><td valign=top width=40%><a href="reghelp.htm#MinSizeMergeWordlists"><font size=2>MinSizeMergeWordlists</font></a></td><td valign=top width=60%><font size=2>If the combined size of all word lists exceeds this number, a </font><a href="#ShadowMerge"><font size=2><em>shadow merge</em></font></a><font size=2> will be
  57. performed.</font></td></tr>
  58. </table>
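<p>As a rough illustration, the three values could be inspected read-only with Python's standard <em>winreg</em> module. This is a sketch under assumptions: the function name <em>read_word_list_parameters</em> is made up here, the code only reads and never writes, and it returns empty results when run off Windows or when the key is absent.</p>

```python
try:
    import winreg  # available only on Windows
except ImportError:
    winreg = None

# Key path taken from the text above; value names taken from the table.
CONTENT_INDEX_KEY = r"SYSTEM\CurrentControlSet\Control\contentindex"
WORD_LIST_PARAMETERS = ("MaxWordLists", "MaxWordlistSize", "MinSizeMergeWordlists")


def read_word_list_parameters():
    """Return {name: value or None} for the word list parameters."""
    values = dict.fromkeys(WORD_LIST_PARAMETERS)
    if winreg is None:
        return values  # not on Windows: nothing to read
    try:
        key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CONTENT_INDEX_KEY)
    except OSError:
        return values  # key absent, for example Index Server not installed
    with key:
        for name in WORD_LIST_PARAMETERS:
            try:
                values[name], _type = winreg.QueryValueEx(key, name)
            except OSError:
                pass  # value not set explicitly; leave it as None
    return values
```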
  59. <p>Once the number of word lists exceeds the MaxWordLists parameter, the word lists are <a href="#Merging"><em>merged</em></a> into a <a href="#ShadowIndex"><em>shadow index</em></a><em>. </em>This
  60. merge process is called the <a href="#ShadowMerge">shadow merge</a>. Although the data in word lists is compressed to some extent, the compression is
  61. not very high because word lists are temporary structures. Because word lists are in-memory structures, documents in a word
  62. list must be refiltered whenever IIS is restarted. The refiltering is automatically detected and performed by the Microsoft Index
  63. Server engine.</p>
  64. <h2><a name="PersistentIndex">Persistent Index</a></h2>
  65. <p>When data for an index is stored on disk, it is called a <em>persistent index</em>. Unlike word lists, which are in-memory indexes, a
  66. persistent index survives shutdowns and restarts. Persistent-index data is stored in a highly compressed format. There are two
  67. types of persistent indexes:</p>
  68. <ul>
  69. <li><a href="#ShadowIndex">Shadow index</a></li>
  70. <li><p align=left><a href="#MasterIndex">Master index</a></p>
  71. </li>
  72. </ul>
  73. <h3><a name="ShadowIndex">Shadow Index</a></h3>
  74. <p>A <em>shadow index</em> is a persistent index created by merging word lists and sometimes other shadow indexes into a single index.
  75. There can be multiple shadow indexes in the catalog.</p>
  76. <h3><a name="MasterIndex">Master Index</a></h3>
  77. <p>A <em>master index</em> is a persistent index that contains the indexed data for a large number of documents. This is usually the largest
  78. persistent data structure. In an ideal state, this is the only index present, because all the indexed data is stored in the master
  79. index and there are no <a href="#ShadowIndex">shadow indexes</a> or <a href="#WordLists">word lists</a>. The data is highly compressed.</p>
  80. <p>A master index is created by <a href="#MasterMerge">master merge</a>, which merges all the shadow indexes and the current master index (if any) into a
  81. new master index. After the master merge, all the source indexes are deleted and only the new master index will be left. In this
  82. state, queries are resolved most efficiently.</p>
  83. <p>The total number of persistent indexes (shadow indexes and master index) in a catalog cannot exceed 255.</p>
  84. <hr>
  85. <h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="Merging">Types of Merges</a></h1>
  86. <p>The process of combining data from multiple indexes into a single index is called <em>merging</em>. Merging removes
  87. redundant data and frees up resources. Queries are also resolved faster when there are fewer indexes. There are three types
  88. of merges:</p>
  89. <ul>
  90. <li><a href="#ShadowMerge">Shadow merge</a></li>
  91. <li><a href="#MasterMerge">Master merge</a></li>
  92. <li><a href="#AnnealingMerge">Annealing merge</a></li>
  93. </ul>
  94. <p>After a merge completes, the multiple<em> source indexes</em> are replaced by a single <em>target index.</em> </p>
  95. <h2><a name="ShadowMerge">Shadow Merge</a></h2>
  96. <p>Combining multiple <a href="#WordLists">word lists</a> and <a href="#ShadowIndex">shadow indexes</a> into a single shadow index is called a <em>shadow merge</em>. A shadow merge is
  97. performed to free up memory used by word lists and also to make the filtered data persistent; it is usually a quick operation.</p>
  98. <p>In the most common case, the source indexes for a shadow merge are word lists. However, if the total number of shadow
  99. indexes exceeds <a href="reghelp.htm#MaxIndexes"><em>MaxIndexes</em></a>, some of the shadow indexes are also used as <em>source indexes</em>. Shadow indexes are also used
  100. as source indexes during an <a href="#AnnealingMerge"><em>annealing merge</em></a>.</p>
  101. <p>A shadow merge is triggered by one of the following conditions:</p>
  102. <ul>
  103. <li><p align=left>The number of word lists exceeds <a href="reghelp.htm#MaxWordLists">MaxWordLists</a>.</p>
  104. </li>
  105. <li><p align=left>The combined size of all <a href="#WordLists">word lists</a> exceeds <a href="reghelp.htm#MinSizeMergeWordLists">MinSizeMergeWordLists</a>.</p>
  106. </li>
  107. <li><p align=left>As a precursor to a master merge. Before starting a master merge, a shadow merge is performed to merge all existing
  108. word lists into a shadow index.</p>
  109. </li>
  110. <li><p align=left>For an annealing merge.</p>
  111. </li>
  112. </ul>
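<p>The trigger conditions listed above might be sketched as a single decision function. Everything here is illustrative: <em>WordListState</em> and <em>shadow_merge_reason</em> are hypothetical names, the order in which the conditions are tested is an assumption, and the thresholds are passed in by the caller because their actual values and units are not given here.</p>

```python
from dataclasses import dataclass


@dataclass
class WordListState:
    count: int           # number of in-memory word lists
    combined_size: int   # combined size of all word lists (unit assumed)


def shadow_merge_reason(state, max_word_lists, min_size_merge_word_lists,
                        master_merge_pending=False, annealing=False):
    """Return a short reason if a shadow merge should run, else None."""
    if state.count > max_word_lists:
        return "word list count exceeds MaxWordLists"
    if state.combined_size > min_size_merge_word_lists:
        return "combined size exceeds MinSizeMergeWordLists"
    if master_merge_pending:
        # Word lists are merged into a shadow index before a master merge.
        return "precursor to master merge"
    if annealing:
        return "annealing merge"
    return None
```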
  113. <h2><a name="MasterMerge">Master Merge</a></h2>
  114. <p>For a master merge, the <em>source indexes </em>are all of the existing <a href="#ShadowIndex">shadow indexes</a> and the current <a href="#MasterIndex">master index</a><em> </em>(if any). At the end
  115. of a master merge, all the source indexes are replaced by a single target master index<em>. </em>Although the master merge itself is
  116. very resource-intensive in both CPU time and disk space, it frees up resources once it completes:
  117. much of the redundant data is deleted and queries run faster. </p>
  118. <p>Depending upon the size of the source indexes, a master merge can be a very long-running operation. However, it is fully
  119. restartable after failures and shutdowns. A master merge will continue from where it left off.</p>
  120. <p>Whenever a master merge is started, restarted, or paused, an <a href="errorhlp.htm#MasterMergeEvents">event</a> is written to the event log. A master merge
  121. is started for any of the following reasons:</p>
  122. <ul>
  123. <li><p align=left>Nightly maintenance master merge. This can be done at a specified time every day. The registry value
  124. <a href="reghelp.htm#MasterMergeTime">MasterMergeTime</a> is the number of minutes after midnight when the merge should happen. By default, the nightly master
  125. merge happens at midnight. This value should be adjusted to reflect the time when the load on the server is lowest.</p>
  126. </li>
  127. <li><p align=left>When the number of changed documents since the last master merge exceeds <a href="reghelp.htm#MaxFreshCount">MaxFreshCount</a>, a master merge is
  128. performed to reduce the number of changed documents. If the number of changed documents is too high, it puts an
  129. extra load on memory usage. A master merge reduces the FreshCount to zero.</p>
  130. </li>
  131. <li><p align=left>When the disk space remaining on the catalog drive is less than <a href="reghelp.htm#MinDiskFreeForceMerge">MinDiskFreeForceMerge</a> and the cumulative space
  132. occupied by <a href="#ShadowIndex">shadow indexes</a> exceeds <a href="reghelp.htm#MaxShadowFreeForceMerge">MaxShadowFreeForceMerge</a>, a master merge is started to combine the shadow
  133. indexes and free up disk space.</p>
  134. </li>
  135. <li><p align=left>When the total disk space occupied by <a href="#ShadowIndex">shadow indexes</a> exceeds <a href="reghelp.htm#MaxShadowIndexSize">MaxShadowIndexSize</a>, a master merge is started to
  136. combine the shadow indexes. This condition has higher precedence than the previous condition.</p>
  137. </li>
  138. <li>Finally, a master merge can be <a href="adminhlp.htm#ForcingaMerge">forced</a> by an administrator by using the administrative Web pages. Because
  139. a master merge makes queries run faster after it completes, the administrator may want to force a merge even
  140. before one of the preceding conditions triggers it.</li>
  141. </ul>
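<p>The reasons listed above can be summarized in a hedged Python sketch. The function names and the <em>limits</em> dictionary are hypothetical, sizes are in whatever unit the registry values use, and the ordering of the checks is an assumption except where the text states that the <em>MaxShadowIndexSize</em> condition takes precedence over the low-disk condition.</p>

```python
from datetime import time


def nightly_merge_time(master_merge_time):
    """Convert MasterMergeTime (minutes after midnight) to a clock time."""
    hours, minutes = divmod(master_merge_time % (24 * 60), 60)
    return time(hour=hours, minute=minutes)


def master_merge_reason(fresh_count, free_disk, shadow_size, limits, forced=False):
    """Return why a master merge should start now, or None.

    limits maps the registry value names used in the text to thresholds.
    """
    if forced:
        return "forced by administrator"
    if fresh_count > limits["MaxFreshCount"]:
        return "MaxFreshCount exceeded"
    # Checked before the low-disk rule, which the text gives lower precedence.
    if shadow_size > limits["MaxShadowIndexSize"]:
        return "MaxShadowIndexSize exceeded"
    low_disk = limits["MinDiskFreeForceMerge"] > free_disk
    if low_disk and shadow_size > limits["MaxShadowFreeForceMerge"]:
        return "low disk space"
    return None
```

<p>For example, a <em>MasterMergeTime</em> of 150 corresponds to 2:30 A.M., and the default of 0 corresponds to midnight.</p>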
  142. <h2><a name="AnnealingMerge">Annealing Merge</a></h2>
  143. <p>An annealing merge is a special kind of <a href="#ShadowMerge">shadow merge</a> performed when the system is idle for a certain length of time and the
  144. total number of <a href="#PersistentIndex">persistent indexes</a> exceeds <a href="reghelp.htm#MaxIdealIndexes">MaxIdealIndexes</a>. The registry parameter <a href="reghelp.htm#MinMergeIdleTime">MinMergeIdleTime</a> specifies the
  145. percentage of CPU time that must be idle during a <a href="reghelp.htm#MaxMergeInterval">time period</a> to trigger an annealing merge. An annealing merge improves
  146. query performance and disk space usage by reducing the number of shadow indexes.</p>
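<p>A minimal sketch of the annealing condition follows. The function name <em>should_anneal</em> is hypothetical, and treating <em>MinMergeIdleTime</em> as an inclusive lower bound on the idle percentage is an assumption; how the idle percentage is measured over the time period is left to the caller.</p>

```python
def should_anneal(idle_cpu_percent, persistent_index_count,
                  min_merge_idle_time, max_ideal_indexes):
    """Return True when an annealing merge would be triggered."""
    # Enough of the CPU was idle during the measured period (assumed inclusive).
    idle_enough = idle_cpu_percent >= min_merge_idle_time
    # More persistent indexes exist than the ideal number.
    too_many_indexes = persistent_index_count > max_ideal_indexes
    return idle_enough and too_many_indexes
```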
  147. <hr>
  148. <h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="Indexing-RelatedPerformanceCounters">Indexing-Related Performance Counters</a></h1>
  149. <p>The following performance counters are related to indexing and merging. They are all present under the Content Index object.</p>
  150. <table border=1 cellpadding=5 cellspacing=0 width=100%>
  151. <tr><th align=left valign=bottom width=40%><font size=2><strong>Counter Name </strong></font></th><th align=left valign=bottom width=60%><font size=2><strong>Explanation</strong></font></th></tr>
  152. <tr><td valign=top width=40%><font size=2>Index Size</font></td><td valign=top width=60%><font size=2>Total size of all the </font><a href="#PersistentIndex"><font size=2>persistent indexes</font></a><font size=2> in megabytes.</font></td></tr>
  153. <tr><td valign=top width=40%><font size=2>Persistent Indexes </font></td><td valign=top width=60%><font size=2>Total number of </font><a href="#PersistentIndex"><font size=2>persistent indexes</font></a><font size=2>.</font></td></tr>
  154. <tr><td valign=top width=40%><font size=2>Merge Progress </font></td><td valign=top width=60%><font size=2>Percentage of </font><a href="#Merging"><font size=2>merge</font></a><font size=2> completed.</font></td></tr>
  155. <tr><td valign=top width=40%><font size=2>Word lists</font></td><td valign=top width=60%><font size=2>Total number of </font><a href="#WordLists"><font size=2>word lists</font></a><font size=2>.</font></td></tr>
  156. </table>
  157. <hr>
  158. <h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="FailureHandling">Failure Handling</a></h1>
  159. <p>There can be several kinds of failures during filtering, indexing, and merging. Except after a hardware failure, indexes are fully
  160. recoverable. Indexing and merging are temporarily paused if memory load is very high. Failed operations are retried later.</p>
  161. <h2>Disk Full Condition</h2>
  162. <p>A merge is not started if the free disk space is very low on the <a href="glossary.htm#CatalogDrive">catalog drive</a>. However, it is possible that the drive may run out
  163. of free space while the merge is happening. A <a href="#ShadowMerge">shadow merge</a> will be aborted and retried after disk space is freed up. A <a href="#MasterMerge">master
  164. merge</a> is not ended but temporarily paused. An <a href="Errorhlp.htm#MasterMergeEvents">event</a> is written to the Windows NT event log stating that the master merge is
  165. paused. A <a href="errorhlp.htm#LowDiskEvent">disk full</a> event is also written. The administrator should free up disk space by moving or deleting data files from
  166. the corpus. <font color="#FF0000"><em>Do not delete any files under the Index Server Catalog Directory.</em></font> The system will automatically detect when
  167. enough disk space has been freed and restart the operations.</p>
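<p>The different handling of the two merge types when the disk fills up could be modeled as follows. This is only a sketch: <em>MergeKind</em> and <em>on_disk_full</em> are hypothetical names, the event strings are placeholders rather than real event log text, and logging a disk full event for an aborted shadow merge as well is an assumption.</p>

```python
from enum import Enum


class MergeKind(Enum):
    SHADOW = "shadow"
    MASTER = "master"


def on_disk_full(kind):
    """Return (action, events) for a merge interrupted by a full disk."""
    events = ["disk full"]
    if kind is MergeKind.MASTER:
        # A master merge keeps its progress and resumes later.
        events.append("master merge paused")
        return "pause", events
    # A shadow merge is abandoned and retried once space is available.
    return "abort and retry", events
```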
  168. <hr>
  169. <h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="PropertyCache">Property Cache</a></h1>
  170. <p>The property cache is an on-disk store optimized to speed up the retrieval of frequently retrieved values such as Path,
  171. Abstract, Title, Attributes, Last Write time stamp, File Size, and some values for internal use only. In a future release,
  172. administrators will be able to configure the property cache for storing custom properties. </p>
  173. <p>The property cache is also a large data structure, its size being comparable to that of the <a href="#MasterIndex">master index</a>. The registry parameter
  174. <a href="reghelp.htm#PropertyStoreMappedCache">PropertyStoreMappedCache</a> controls how much of the property cache is always kept in memory. On large index servers,
  175. setting this value higher will yield better performance. If the physical memory is not adequate, the performance might suffer.</p>
  176. <h2>Failure Recovery</h2>
  177. <p>During a <a href="glossary.htm#DirtyShutdown">dirty shutdown</a>, the property cache may become corrupted. During startup after a dirty shutdown, a consistency
  178. check is performed on the property cache and any problems that are detected are fixed. However, some irreparable
  179. inconsistencies may occur. In that case, all existing index data is discarded and the documents are automatically re-indexed. Please see
  180. the <a href="errhandl.htm">Error Detection and Recovery</a> page for more information.</p>
  181. <p><a href="errorhlp.htm#PropStoreRecovery">Events</a> are written to the event log when a recovery operation is performed on the property cache.</p>
  182. <!--Footerbegin--><hr>
  183. <p align=center><a href="default.htm#Top"><img src="toc.gif" alt=" Contents" align=middle border=0 width=89 height=31></a> <a href="qrylang.htm"><img src="previous.gif" alt="Previous" align=middle border=0 width=32 height=31></a> <a href="#TOP"><img src="up_end.gif" alt="To Top" align=middle border=0 width=32 height=31></a> <a href="filtrhlp.htm"><img src="next.gif" alt="Next" align=middle border=0 width=32 height=31></a> </p>
  184. <hr>
  185. <p align=center><em>&#169; 1996 by Microsoft Corporation. All rights reserved.<!--Footerend--></em></p>
  186. </body>
  187. </html>