mirror of https://github.com/tongzx/nt5src
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
191 lines
18 KiB
191 lines
18 KiB
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
|
|
|
|
<html>
|
|
|
|
<head>
|
|
<title>Microsoft Index Server Guide: Indexing</title>
|
|
<meta name="FORMATTER" content="Microsoft FrontPage 1.1">
|
|
<meta name="FORMATTER" content="Microsoft FrontPage 1.1">
|
|
<meta name="GENERATOR" content="Microsoft FrontPage 1.1">
|
|
</head>
|
|
|
|
<body bgcolor="#FFFFFF">
|
|
<!--Headerbegin--><p align=center><a name="TOP"><img src="onepix.gif" alt="Space" align=middle width=1 height=1></a> <a href="default.htm#Top"><img src="toc.gif" alt=" Contents" align=middle border=0 width=89 height=31></a> <a href="qrylang.htm"><img src="previous.gif" alt="Previous" align=middle border=0 width=32 height=31></a> <a href="filtrhlp.htm"><img src="next.gif" alt="Next" align=middle border=0 width=32 height=31></a> </p>
|
|
<hr>
|
|
<!--Headerend--><p><a name="Indexing"><font size=6><strong>Indexing</strong></font></a></p>
|
|
<p align=left><!--Chaptoc--></p>
|
|
<blockquote>
|
|
<p><a href="indexhlp.htm#TypesofIndexes">Types of Indexes</a> <br>
|
|
<a href="indexhlp.htm#Merging">Types of Merges</a> <br>
|
|
<a href="indexhlp.htm#Indexing-RelatedPerformanceCounters">Indexing-Related Performance Counters</a> <br>
|
|
<a href="indexhlp.htm#FailureHandling">Failure Handling</a> <br>
|
|
<a href="indexhlp.htm#PropertyCache">Property Cache</a> <br>
|
|
</p>
|
|
</blockquote>
|
|
<hr>
|
|
<!--ChaptocEnd--><p>Storing the words and properties extracted by the <a href="filtrhlp.htm#CiDaemon"><em>CiDaemon</em></a> process in indexes is referred to as <em>indexing</em>. An index is a
|
|
special data structure that is used to satisfy queries efficiently.</p>
|
|
<p>As documents in the corpus (Web site) are modified, the indexing program is notified of the updates and those documents
|
|
enter a change queue. The <a href="filtrhlp.htm#CiDaemon">CiDaemon</a> process retrieves documents from the <em>change queue</em> in “first-in-first-out” order and
|
|
<a href="filtrhlp.htm#Filtering">filters</a> them. The resulting words and properties are then added to the index. If there are many documents waiting in the change
|
|
queue, there will be a delay before the index has up-to-date information about those documents.</p>
|
|
<hr>
|
|
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="TypesofIndexes">Types of Indexes</a></h1>
|
|
<p>There are three types of indexes: <a href="#WordLists"><em>word lists</em></a>, <a href="#ShadowIndex"><em>shadow indexes</em></a>, and a <a href="#MasterIndex"><em>master index</em></a>. Words and properties extracted from a
|
|
document first appear in a word list, then move to a shadow index, and finally move to the master index. This organization is
|
|
optimized for query responsiveness and performance. It also ensures optimal resource usage. Even though there are multiple
|
|
indexes internally, these details are completely hidden from the user. The user sees only a list of documents that satisfy the
|
|
query that was posted.</p>
|
|
<h2><a name="WordLists">Word Lists</a></h2>
|
|
<p>Word lists are small, in-memory indexes. Each word list contains data for a small number of documents. As soon as a
|
|
document is <a href="filtrhlp.htm#Filtering">filtered</a>, its data is stored in a word list. Creation of a word list is very quick and does not require updating any
|
|
on-disk data. It is used as a temporary staging area during indexing.</p>
|
|
<p>There are several registry parameters that control word list behavior. All the keys are under the registry path </p>
|
|
<blockquote>
|
|
<pre>HKEY_LOCAL_MACHINE
|
|
\SYSTEM
|
|
 \CurrentControlSet
|
|
  \Control
|
|
   \contentindex</pre>
|
|
</blockquote>
|
|
<p>The following table shows the registry parameters and explanations.</p>
|
|
<table border=1 cellpadding=5 cellspacing=0 width=100%>
|
|
<tr><th align=left valign=bottom width=40%><font size=2><strong>Parameter</strong></font></th><th align=left valign=bottom width=60%><font size=2><strong>Explanation</strong></font></th></tr>
|
|
<tr><td valign=top width=40%><a href="reghelp.htm#MaxWordlists"><font size=2>MaxWordLists</font></a><font size=2> </font></td><td valign=top width=60%><font size=2>Maximum number of word lists. If the number exceeds this, a </font><a href="#ShadowMerge"><font size=2><em>shadow merge</em></font></a><font size=2> will be
|
|
performed.</font></td></tr>
|
|
<tr><td valign=top width=40%><a href="reghelp.htm#MaxWordlistSize"><font size=2>MaxWordlistSize</font></a></td><td valign=top width=60%><font size=2>Maximum recommended size of a single word list. If the size of a word list exceeds
|
|
this value, a new one will be created. </font><font color="#FF0000"><font size=2><em>This is an internal value and must not be
|
|
changed.</em></font></font></td></tr>
|
|
<tr><td valign=top width=40%><a href="reghelp.htm#MinSizeMergeWordlists"><font size=2>MinSizeMergeWordlists</font></a></td><td valign=top width=60%><font size=2>If the combined size of all word lists exceeds this number, a </font><a href="#ShadowMerge"><font size=2><em>shadow merge</em></font></a><font size=2> will be
|
|
performed.</font></td></tr>
|
|
</table>
|
|
<p>Once the number of word lists exceeds the MaxWordLists parameter, the word lists are <a href="#Merging"><em>merged</em></a> into a <a href="#ShadowIndex"><em>shadow index</em></a><em>. </em>This
|
|
merge process is called the <a href="#ShadowMerge">shadow merge</a>. Although the data in word lists is compressed to some extent, the compression is
|
|
not very high because word lists are temporary structures. Because word lists are in-memory strucures, documents in a word
|
|
list must be refiltered whenever IIS is restarted. The refiltering is automatically detected and performed by the Microsoft Index
|
|
Server engine.</p>
|
|
<h2><a name="PersistentIndex">Persistent Index</a></h2>
|
|
<p>When data for an index is stored on disk, it is called a <em>persistent index</em>. Unlike word lists, which are in-memory indexes, a
|
|
persistent index survives shutdowns and restarts. Persistent-index data is stored in a highly compressed format. There are two
|
|
types of persistent indexes:</p>
|
|
<ul>
|
|
<li><a href="#ShadowIndex">Shadow index</a></li>
|
|
<li><p align=left><a href="#MasterIndex">Master index</a></p>
|
|
</li>
|
|
</ul>
|
|
<h3><a name="ShadowIndex">Shadow Index</a></h3>
|
|
<p>A <em>shadow index</em> is a persistent index created by merging word lists and sometimes other shadow indexes into a single index.
|
|
There can be multiple shadow indexes in the catalog.</p>
|
|
<h3><a name="MasterIndex">Master Index</a></h3>
|
|
<p>A <em>master index</em> is a persistent index that contains the indexed data for a large number of documents. This is usually the largest
|
|
persistent data structure. In an ideal state, this is the only index present, because all the indexed data is stored in the master
|
|
index and there are no <a href="#ShadowIndex">shadow indexes</a> or <a href="#WordLists">word lists</a>. The data is highly compressed.</p>
|
|
<p>A master index is created by <a href="#MasterMerge">master merge</a>, which merges all the shadow indexes and the current master index (if any) into a
|
|
new master index. After the master merge, all the source indexes are deleted and only the new master index will be left. In this
|
|
state, queries are resolved most efficiently.</p>
|
|
<p>The total number of persistent indexes (shadow indexes and master index) in a catalog cannot exceed 255.</p>
|
|
<hr>
|
|
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="Merging">Types of Merges</a></h1>
|
|
<p>The process of combining data from multiple indexes into a single index is called <em>merging</em>. Merging results in getting rid of
|
|
some redundant data and also freeing up resources. Queries are also resolved faster with fewer indexes. There are three types
|
|
of merges:</p>
|
|
<ul>
|
|
<li><a href="#ShadowMerge">Shadow merge</a></li>
|
|
<li><a href="#MasterMerge">Master merge</a></li>
|
|
<li><a href="#AnnealingMerge">Annealing merge</a></li>
|
|
</ul>
|
|
<p>After a merge completes, the multiple<em> source indexes</em> are replaced by a single <em>target index.</em> </p>
|
|
<h2><a name="ShadowMerge">Shadow Merge</a></h2>
|
|
<p>Combining multiple <a href="#WordLists">word lists</a> and <a href="#ShadowIndex">shadow indexes</a> into a single shadow index is called a <em>shadow merge</em>. A shadow merge is
|
|
performed to free up memory used by word lists and also to make the filtered data persistent; it is usually a quick operation.</p>
|
|
<p>In the most common case, the source indexes for a shadow merge are word lists. However, if the total number of shadow
|
|
indexes exceeds <a href="reghelp.htm#MaxIndexes"><em>MaxIndexes</em></a>, some of the shadow indexes are also used as <em>source indexes</em>. Shadow indexes are also used
|
|
as source indexes during an <a href="#AnnealingMerge"><em>annealing merge</em></a>.</p>
|
|
<p>A shadow merge is triggered by one of the following conditions:</p>
|
|
<ul>
|
|
<li><p align=left>The number of word lists exceeds <a href="reghelp.htm#MaxWordLists">MaxWordLists</a>.</p>
|
|
</li>
|
|
<li><p align=left>The combined size of <a href="#WordLists">WordLists</a> exceeds <a href="reghelp.htm#MinSizeMergeWordLists">MinSizeMergeWordLists</a>.</p>
|
|
</li>
|
|
<li><p align=left>As a precursor to a master merge. Before starting a master merge, a shadow merge is performed to merge all existing
|
|
word lists into a shadow index.</p>
|
|
</li>
|
|
<li><p align=left>For an annealing merge.</p>
|
|
</li>
|
|
</ul>
|
|
<h2><a name="MasterMerge">Master Merge</a></h2>
|
|
<p>For a master merge, the <em>source indexes </em>are all of the existing <a href="#ShadowIndex">shadow indexes</a> and the current <a href="#MasterIndex">master index</a><em> </em>(if any). At the end
|
|
of a master merge, all the source indexes are replaced by a single target master index<em>. </em>Although the master merge itself is a
|
|
very resource-intensive (both for CPU and disk space) operation, after the completion of a master merge, resources are freed
|
|
up. A lot of the redundant data is deleted and queries run faster. </p>
|
|
<p>Depending upon the size of the source indexes, a master merge can be a very long-running operation. However, it is fully
|
|
restartable after failures and shutdowns. A master merge will continue from where it left off.</p>
|
|
<p>Whenever a master merge is started, restarted, or paused, an <a href="errorhlp.htm#MasterMergeEvents">event</a> is written to the event log. There are several reasons for
|
|
starting a master merge. Some reasons for starting a master merge follow.</p>
|
|
<ul>
|
|
<li><p align=left>Nightly maintenance master merge. This can be done at a specified time every day. The registry value
|
|
<a href="reghelp.htm#MasterMergeTime">MasterMergeTime</a> is the number of minutes after midnight when the merge should happen. By default, the nightly master
|
|
merge happens at midnight. This value should be adjusted to reflect the time when the load on the server is lowest.</p>
|
|
</li>
|
|
<li><p align=left>When the number of changed documents since the last master merge exceed <a href="reghelp.htm#MaxFreshCount">MaxFreshCount</a>, a master merge is
|
|
performed to reduce the number of changed documents. If the number of changed documents is too high, it puts an
|
|
extra load on memory usage. A master merge reduces the FreshCount to zero.</p>
|
|
</li>
|
|
<li><p align=left>When the disk space remaining on the catalog drive is less than <a href="reghelp.htm#MinDiskFreeForceMerge">MinDiskFreeForceMerge</a> and the cumulative space
|
|
occupied by <a href="#ShadowIndex">shadow indexes</a> exceeds <a href="reghelp.htm#MaxShadowFreeForceMerge">MaxShadowFreeForceMerge</a>, a master merge is started to combine the shadow
|
|
indexes and free up disk space.</p>
|
|
</li>
|
|
<li><p align=left>When the total disk space occupied by <a href="#ShadowIndex">shadow indexes</a> exceeds <a href="reghelp.htm#MaxShadowIndexSize">MaxShadowIndexSize</a>, a master merge is started to
|
|
combine the shadow indexes. This condition has higher precedence than the previous condition.</p>
|
|
</li>
|
|
<li>Finally, a master merge can be <a href="adminhlp.htm#ForcingaMerge">forced</a> by an administrator by using the adminstrative Web pages. Because of the fact
|
|
that a master merge will make queries run faster after it completes, the administrator may want to force a merge even
|
|
before one of the preceding conditions triggers it.</li>
|
|
</ul>
|
|
<h2><a name="AnnealingMerge">Annealing Merge</a></h2>
|
|
<p>An annealing merge is a special kind of <a href="#ShadowMerge">shadow merge</a> performed when the system is idle for a cerain length of time and the
|
|
total number of <a href="#PersistentIndex">persistent indexes </a>exceed <a href="reghelp.htm#MaxIdealIndexes">MaxIdealIndexes</a>. The registry parameter <a href="reghelp.htm#MinMergeIdleTime">MinMergeIdleTime</a> specifies the
|
|
percentage of CPU time that must be idle during a <a href="reghelp.htm#MaxMergeInterval">time period</a> to trigger an annealing merge. An annealing merge improves
|
|
query performance and disk space usage by reducing the number of shadow indexes.</p>
|
|
<hr>
|
|
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="Indexing-RelatedPerformanceCounters">Indexing-Related Performance Counters</a></h1>
|
|
<p>The following performance counters are related to indexing and merging. They are all present under the Content Index object.</p>
|
|
<table border=1 cellpadding=5 cellspacing=0 width=100%>
|
|
<tr><th align=left valign=bottom width=40%><font size=2><strong>Counter Name </strong></font></th><th align=left valign=bottom width=60%><font size=2><strong>Explanation</strong></font></th></tr>
|
|
<tr><td valign=top width=40%><font size=2>Index Size</font></td><td valign=top width=60%><font size=2>Total size of all the </font><a href="#PersistentIndex"><font size=2>persistent indexes</font></a><font size=2> in megabytes.</font></td></tr>
|
|
<tr><td valign=top width=40%><font size=2>Persistent Indexes </font></td><td valign=top width=60%><font size=2>Total number of </font><a href="#PersistentIndex"><font size=2>persistent indexes</font></a><font size=2>.</font></td></tr>
|
|
<tr><td valign=top width=40%><font size=2>Merge Progress </font></td><td valign=top width=60%><font size=2>Percentage of </font><a href="#Merging"><font size=2>merge</font></a><font size=2> completed.</font></td></tr>
|
|
<tr><td valign=top width=40%><font size=2>Word lists</font></td><td valign=top width=60%><font size=2>Total number of </font><a href="#WordLists"><font size=2>word lists</font></a><font size=2>.</font></td></tr>
|
|
</table>
|
|
<hr>
|
|
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="FailureHandling">Failure Handling</a></h1>
|
|
<p>There can be several kinds of failures during filtering, indexing, and merging. Except after a hardware failure, indexes are fully
|
|
recoverable. Indexing and merging are temporarily paused if memory load is very high. Failed operations are retried later.</p>
|
|
<h2>Disk Full Condition</h2>
|
|
<p>A merge is not started if the free disk space is very low on the <a href="glossary.htm#CatalogDrive">catalog drive</a>. However, it is possible that the drive may run out
|
|
of free space while the merge is happening. A <a href="#ShadowMerge">shadow merge</a> will be aborted and retried after disk space is freed up. A <a href="#MasterMerge">master
|
|
merge</a> is not ended but temporarily paused. An <a href="Errorhlp.htm#MasterMergeEvents">event</a> is written to the Windows NT event log that the Master Merge is
|
|
paused. A <a href="errorhlp.htm#LowDiskEvent">disk full</a> event will also be written. The administrator should free up disk space by moving or deleting data files from
|
|
the corpus. <font color="#FF0000"><em>Do not delete any files under the Index Server Catalog Directory.</em></font> The system will automatically detect when
|
|
enough disk space has been freed and restart the operations.</p>
|
|
<hr>
|
|
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="PropertyCache">Property Cache</a></h1>
|
|
<p>The property cache is an on-disk store optimized to speed up the retrieval of frequently retrieved values such as Path,
|
|
Abstract, Title, Attributes, Last Write time stamp, File Size, and some values for internal use only. In a future release,
|
|
administrators will be able to configure the property cache for storing custom properties. </p>
|
|
<p>The property cache is also a large data structure, its size being comparable to that of the <a href="#MasterIndex">master index</a>. The registry parameter
|
|
<a href="reghelp.htm#PropertyStoreMappedCache">PropertyStoreMappedCache</a> controls how much of the property cache is always kept in memory. On large index servers,
|
|
setting this value higher will yield better performance. If the physical memory is not adequate, the performance might suffer.</p>
|
|
<h2>Failure Recovery</h2>
|
|
<p>During a <a href="glossary.htm#DirtyShutdown">dirty shutdown</a>, the property cache may become corrupted. During startup after a dirty shutdown, a consistency
|
|
check is performed on the property cache and if any problems are detected, they are fixed. However, some irrepairable
|
|
inconsistencies may occur. In that case, all existing index data is thrown out and documents automatically re-index. Please see
|
|
the <a href="errhandl.htm">Error Detection and Recovery</a> page for more information.</p>
|
|
<p><a href="errorhlp.htm#PropStoreRecovery">Events</a> are written to the event log when a recovery operation is performed on the property cache.</p>
|
|
<!--Footerbegin--><hr>
|
|
<p align=center><a href="default.htm#Top"><img src="toc.gif" alt=" Contents" align=middle border=0 width=89 height=31></a> <a href="qrylang.htm"><img src="previous.gif" alt="Previous" align=middle border=0 width=32 height=31></a> <a href="#TOP"><img src="up_end.gif" alt="To Top" align=middle border=0 width=32 height=31></a> <a href="filtrhlp.htm"><img src="next.gif" alt="Next" align=middle border=0 width=32 height=31></a> </p>
|
|
<hr>
|
|
<p align=center><em>© 1996 by Microsoft Corporation. All rights reserved.<!--Footerend--></em></p>
|
|
</body>
|
|
|
|
</html>
|