Source code of Windows XP (NT5)
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

191 lines
18 KiB

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Microsoft Index Server Guide: Indexing</title>
<meta name="FORMATTER" content="Microsoft FrontPage 1.1">
<meta name="FORMATTER" content="Microsoft FrontPage 1.1">
<meta name="GENERATOR" content="Microsoft FrontPage 1.1">
</head>
<body bgcolor="#FFFFFF">
<!--Headerbegin--><p align=center><a name="TOP"><img src="onepix.gif" alt="Space" align=middle width=1 height=1></a> <a href="default.htm#Top"><img src="toc.gif" alt=" Contents" align=middle border=0 width=89 height=31></a> <a href="qrylang.htm"><img src="previous.gif" alt="Previous" align=middle border=0 width=32 height=31></a> <a href="filtrhlp.htm"><img src="next.gif" alt="Next" align=middle border=0 width=32 height=31></a> </p>
<hr>
<!--Headerend--><p><a name="Indexing"><font size=6><strong>Indexing</strong></font></a></p>
<p align=left><!--Chaptoc--></p>
<blockquote>
<p><a href="indexhlp.htm#TypesofIndexes">Types of Indexes</a> <br>
<a href="indexhlp.htm#Merging">Types of Merges</a> <br>
<a href="indexhlp.htm#Indexing-RelatedPerformanceCounters">Indexing-Related Performance Counters</a> <br>
<a href="indexhlp.htm#FailureHandling">Failure Handling</a> <br>
<a href="indexhlp.htm#PropertyCache">Property Cache</a> <br>
</p>
</blockquote>
<hr>
<!--ChaptocEnd--><p>Storing the words and properties extracted by the <a href="filtrhlp.htm#CiDaemon"><em>CiDaemon</em></a> process in indexes is referred to as <em>indexing</em>. An index is a
special data structure that is used to satisfy queries efficiently.</p>
<p>As documents in the corpus (Web site) are modified, the indexing program is notified of the updates and those documents
enter a change queue. The <a href="filtrhlp.htm#CiDaemon">CiDaemon</a> process retrieves documents from the <em>change queue</em> in &#147;first-in-first-out&#148; order and
<a href="filtrhlp.htm#Filtering">filters</a> them. The resulting words and properties are then added to the index. If there are many documents waiting in the change
queue, there will be a delay before the index has up-to-date information about those documents.</p>
<hr>
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="TypesofIndexes">Types of Indexes</a></h1>
<p>There are three types of indexes: <a href="#WordLists"><em>word lists</em></a>, <a href="#ShadowIndex"><em>shadow indexes</em></a>, and a <a href="#MasterIndex"><em>master index</em></a>. Words and properties extracted from a
document first appear in a word list, then move to a shadow index, and finally move to the master index. This organization is
optimized for query responsiveness and performance. It also ensures optimal resource usage. Even though there are multiple
indexes internally, these details are completely hidden from the user. The user sees only a list of documents that satisfy the
query that was posted.</p>
<h2><a name="WordLists">Word Lists</a></h2>
<p>Word lists are small, in-memory indexes. Each word list contains data for a small number of documents. As soon as a
document is <a href="filtrhlp.htm#Filtering">filtered</a>, its data is stored in a word list. Creation of a word list is very quick and does not require updating any
on-disk data. It is used as a temporary staging area during indexing.</p>
<p>There are several registry parameters that control word list behavior. All the keys are under the registry path </p>
<blockquote>
<pre>HKEY_LOCAL_MACHINE
\SYSTEM
&#160;\CurrentControlSet
&#160;&#160;\Control
&#160;&#160;&#160;\contentindex</pre>
</blockquote>
<p>The following table shows the registry parameters and explanations.</p>
<table border=1 cellpadding=5 cellspacing=0 width=100%>
<tr><th align=left valign=bottom width=40%><font size=2><strong>Parameter</strong></font></th><th align=left valign=bottom width=60%><font size=2><strong>Explanation</strong></font></th></tr>
<tr><td valign=top width=40%><a href="reghelp.htm#MaxWordlists"><font size=2>MaxWordLists</font></a><font size=2> </font></td><td valign=top width=60%><font size=2>Maximum number of word lists. If the number exceeds this, a </font><a href="#ShadowMerge"><font size=2><em>shadow merge</em></font></a><font size=2> will be
performed.</font></td></tr>
<tr><td valign=top width=40%><a href="reghelp.htm#MaxWordlistSize"><font size=2>MaxWordlistSize</font></a></td><td valign=top width=60%><font size=2>Maximum recommended size of a single word list. If the size of a word list exceeds
this value, a new one will be created. </font><font color="#FF0000"><font size=2><em>This is an internal value and must not be
changed.</em></font></font></td></tr>
<tr><td valign=top width=40%><a href="reghelp.htm#MinSizeMergeWordlists"><font size=2>MinSizeMergeWordlists</font></a></td><td valign=top width=60%><font size=2>If the combined size of all word lists exceeds this number, a </font><a href="#ShadowMerge"><font size=2><em>shadow merge</em></font></a><font size=2> will be
performed.</font></td></tr>
</table>
<p>Once the number of word lists exceeds the MaxWordLists parameter, the word lists are <a href="#Merging"><em>merged</em></a> into a <a href="#ShadowIndex"><em>shadow index</em></a><em>. </em>This
merge process is called the <a href="#ShadowMerge">shadow merge</a>. Although the data in word lists is compressed to some extent, the compression is
not very high because word lists are temporary structures. Because word lists are in-memory strucures, documents in a word
list must be refiltered whenever IIS is restarted. The refiltering is automatically detected and performed by the Microsoft Index
Server engine.</p>
<h2><a name="PersistentIndex">Persistent Index</a></h2>
<p>When data for an index is stored on disk, it is called a <em>persistent index</em>. Unlike word lists, which are in-memory indexes, a
persistent index survives shutdowns and restarts. Persistent-index data is stored in a highly compressed format. There are two
types of persistent indexes:</p>
<ul>
<li><a href="#ShadowIndex">Shadow index</a></li>
<li><p align=left><a href="#MasterIndex">Master index</a></p>
</li>
</ul>
<h3><a name="ShadowIndex">Shadow Index</a></h3>
<p>A <em>shadow index</em> is a persistent index created by merging word lists and sometimes other shadow indexes into a single index.
There can be multiple shadow indexes in the catalog.</p>
<h3><a name="MasterIndex">Master Index</a></h3>
<p>A <em>master index</em> is a persistent index that contains the indexed data for a large number of documents. This is usually the largest
persistent data structure. In an ideal state, this is the only index present, because all the indexed data is stored in the master
index and there are no <a href="#ShadowIndex">shadow indexes</a> or <a href="#WordLists">word lists</a>. The data is highly compressed.</p>
<p>A master index is created by <a href="#MasterMerge">master merge</a>, which merges all the shadow indexes and the current master index (if any) into a
new master index. After the master merge, all the source indexes are deleted and only the new master index will be left. In this
state, queries are resolved most efficiently.</p>
<p>The total number of persistent indexes (shadow indexes and master index) in a catalog cannot exceed 255.</p>
<hr>
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="Merging">Types of Merges</a></h1>
<p>The process of combining data from multiple indexes into a single index is called <em>merging</em>. Merging results in getting rid of
some redundant data and also freeing up resources. Queries are also resolved faster with fewer indexes. There are three types
of merges:</p>
<ul>
<li><a href="#ShadowMerge">Shadow merge</a></li>
<li><a href="#MasterMerge">Master merge</a></li>
<li><a href="#AnnealingMerge">Annealing merge</a></li>
</ul>
<p>After a merge completes, the multiple<em> source indexes</em> are replaced by a single <em>target index.</em> </p>
<h2><a name="ShadowMerge">Shadow Merge</a></h2>
<p>Combining multiple <a href="#WordLists">word lists</a> and <a href="#ShadowIndex">shadow indexes</a> into a single shadow index is called a <em>shadow merge</em>. A shadow merge is
performed to free up memory used by word lists and also to make the filtered data persistent; it is usually a quick operation.</p>
<p>In the most common case, the source indexes for a shadow merge are word lists. However, if the total number of shadow
indexes exceeds <a href="reghelp.htm#MaxIndexes"><em>MaxIndexes</em></a>, some of the shadow indexes are also used as <em>source indexes</em>. Shadow indexes are also used
as source indexes during an <a href="#AnnealingMerge"><em>annealing merge</em></a>.</p>
<p>A shadow merge is triggered by one of the following conditions:</p>
<ul>
<li><p align=left>The number of word lists exceeds <a href="reghelp.htm#MaxWordLists">MaxWordLists</a>.</p>
</li>
<li><p align=left>The combined size of <a href="#WordLists">WordLists</a> exceeds <a href="reghelp.htm#MinSizeMergeWordLists">MinSizeMergeWordLists</a>.</p>
</li>
<li><p align=left>As a precursor to a master merge. Before starting a master merge, a shadow merge is performed to merge all existing
word lists into a shadow index.</p>
</li>
<li><p align=left>For an annealing merge.</p>
</li>
</ul>
<h2><a name="MasterMerge">Master Merge</a></h2>
<p>For a master merge, the <em>source indexes </em>are all of the existing <a href="#ShadowIndex">shadow indexes</a> and the current <a href="#MasterIndex">master index</a><em> </em>(if any). At the end
of a master merge, all the source indexes are replaced by a single target master index<em>. </em>Although the master merge itself is a
very resource-intensive (both for CPU and disk space) operation, after the completion of a master merge, resources are freed
up. A lot of the redundant data is deleted and queries run faster. </p>
<p>Depending upon the size of the source indexes, a master merge can be a very long-running operation. However, it is fully
restartable after failures and shutdowns. A master merge will continue from where it left off.</p>
<p>Whenever a master merge is started, restarted, or paused, an <a href="errorhlp.htm#MasterMergeEvents">event</a> is written to the event log. There are several reasons for
starting a master merge. Some reasons for starting a master merge follow.</p>
<ul>
<li><p align=left>Nightly maintenance master merge. This can be done at a specified time every day. The registry value
<a href="reghelp.htm#MasterMergeTime">MasterMergeTime</a> is the number of minutes after midnight when the merge should happen. By default, the nightly master
merge happens at midnight. This value should be adjusted to reflect the time when the load on the server is lowest.</p>
</li>
<li><p align=left>When the number of changed documents since the last master merge exceed <a href="reghelp.htm#MaxFreshCount">MaxFreshCount</a>, a master merge is
performed to reduce the number of changed documents. If the number of changed documents is too high, it puts an
extra load on memory usage. A master merge reduces the FreshCount to zero.</p>
</li>
<li><p align=left>When the disk space remaining on the catalog drive is less than <a href="reghelp.htm#MinDiskFreeForceMerge">MinDiskFreeForceMerge</a> and the cumulative space
occupied by <a href="#ShadowIndex">shadow indexes</a> exceeds <a href="reghelp.htm#MaxShadowFreeForceMerge">MaxShadowFreeForceMerge</a>, a master merge is started to combine the shadow
indexes and free up disk space.</p>
</li>
<li><p align=left>When the total disk space occupied by <a href="#ShadowIndex">shadow indexes</a> exceeds <a href="reghelp.htm#MaxShadowIndexSize">MaxShadowIndexSize</a>, a master merge is started to
combine the shadow indexes. This condition has higher precedence than the previous condition.</p>
</li>
<li>Finally, a master merge can be <a href="adminhlp.htm#ForcingaMerge">forced</a> by an administrator by using the adminstrative Web pages. Because of the fact
that a master merge will make queries run faster after it completes, the administrator may want to force a merge even
before one of the preceding conditions triggers it.</li>
</ul>
<h2><a name="AnnealingMerge">Annealing Merge</a></h2>
<p>An annealing merge is a special kind of <a href="#ShadowMerge">shadow merge</a> performed when the system is idle for a cerain length of time and the
total number of <a href="#PersistentIndex">persistent indexes </a>exceed <a href="reghelp.htm#MaxIdealIndexes">MaxIdealIndexes</a>. The registry parameter <a href="reghelp.htm#MinMergeIdleTime">MinMergeIdleTime</a> specifies the
percentage of CPU time that must be idle during a <a href="reghelp.htm#MaxMergeInterval">time period</a> to trigger an annealing merge. An annealing merge improves
query performance and disk space usage by reducing the number of shadow indexes.</p>
<hr>
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="Indexing-RelatedPerformanceCounters">Indexing-Related Performance Counters</a></h1>
<p>The following performance counters are related to indexing and merging. They are all present under the Content Index object.</p>
<table border=1 cellpadding=5 cellspacing=0 width=100%>
<tr><th align=left valign=bottom width=40%><font size=2><strong>Counter Name </strong></font></th><th align=left valign=bottom width=60%><font size=2><strong>Explanation</strong></font></th></tr>
<tr><td valign=top width=40%><font size=2>Index Size</font></td><td valign=top width=60%><font size=2>Total size of all the </font><a href="#PersistentIndex"><font size=2>persistent indexes</font></a><font size=2> in megabytes.</font></td></tr>
<tr><td valign=top width=40%><font size=2>Persistent Indexes </font></td><td valign=top width=60%><font size=2>Total number of </font><a href="#PersistentIndex"><font size=2>persistent indexes</font></a><font size=2>.</font></td></tr>
<tr><td valign=top width=40%><font size=2>Merge Progress </font></td><td valign=top width=60%><font size=2>Percentage of </font><a href="#Merging"><font size=2>merge</font></a><font size=2> completed.</font></td></tr>
<tr><td valign=top width=40%><font size=2>Word lists</font></td><td valign=top width=60%><font size=2>Total number of </font><a href="#WordLists"><font size=2>word lists</font></a><font size=2>.</font></td></tr>
</table>
<hr>
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="FailureHandling">Failure Handling</a></h1>
<p>There can be several kinds of failures during filtering, indexing, and merging. Except after a hardware failure, indexes are fully
recoverable. Indexing and merging are temporarily paused if memory load is very high. Failed operations are retried later.</p>
<h2>Disk Full Condition</h2>
<p>A merge is not started if the free disk space is very low on the <a href="glossary.htm#CatalogDrive">catalog drive</a>. However, it is possible that the drive may run out
of free space while the merge is happening. A <a href="#ShadowMerge">shadow merge</a> will be aborted and retried after disk space is freed up. A <a href="#MasterMerge">master
merge</a> is not ended but temporarily paused. An <a href="Errorhlp.htm#MasterMergeEvents">event</a> is written to the Windows NT event log that the Master Merge is
paused. A <a href="errorhlp.htm#LowDiskEvent">disk full</a> event will also be written. The administrator should free up disk space by moving or deleting data files from
the corpus. <font color="#FF0000"><em>Do not delete any files under the Index Server Catalog Directory.</em></font> The system will automatically detect when
enough disk space has been freed and restart the operations.</p>
<hr>
<h1><a href="#TOP"><img src="up.gif" alt="To Top" align=middle border=0 width=14 height=11></a><a name="PropertyCache">Property Cache</a></h1>
<p>The property cache is an on-disk store optimized to speed up the retrieval of frequently retrieved values such as Path,
Abstract, Title, Attributes, Last Write time stamp, File Size, and some values for internal use only. In a future release,
administrators will be able to configure the property cache for storing custom properties. </p>
<p>The property cache is also a large data structure, its size being comparable to that of the <a href="#MasterIndex">master index</a>. The registry parameter
<a href="reghelp.htm#PropertyStoreMappedCache">PropertyStoreMappedCache</a> controls how much of the property cache is always kept in memory. On large index servers,
setting this value higher will yield better performance. If the physical memory is not adequate, the performance might suffer.</p>
<h2>Failure Recovery</h2>
<p>During a <a href="glossary.htm#DirtyShutdown">dirty shutdown</a>, the property cache may become corrupted. During startup after a dirty shutdown, a consistency
check is performed on the property cache and if any problems are detected, they are fixed. However, some irrepairable
inconsistencies may occur. In that case, all existing index data is thrown out and documents automatically re-index. Please see
the <a href="errhandl.htm">Error Detection and Recovery</a> page for more information.</p>
<p><a href="errorhlp.htm#PropStoreRecovery">Events</a> are written to the event log when a recovery operation is performed on the property cache.</p>
<!--Footerbegin--><hr>
<p align=center><a href="default.htm#Top"><img src="toc.gif" alt=" Contents" align=middle border=0 width=89 height=31></a> <a href="qrylang.htm"><img src="previous.gif" alt="Previous" align=middle border=0 width=32 height=31></a> <a href="#TOP"><img src="up_end.gif" alt="To Top" align=middle border=0 width=32 height=31></a> <a href="filtrhlp.htm"><img src="next.gif" alt="Next" align=middle border=0 width=32 height=31></a> </p>
<hr>
<p align=center><em>&#169; 1996 by Microsoft Corporation. All rights reserved.<!--Footerend--></em></p>
</body>
</html>