Leaked source code of Windows Server 2003
12/11/00 JosephJ Fix for #23727
23727 wlbs drain all command should return an error message
if no port rules exist.
The problem (if you can call it that) is that if there are NO user-specified
port rules, we treat port-specific operations directed to "ALL" ports as
successful. These commands are start, stop, drain and set (adjust weights).
Fix is for Load_port_change to return IOCTL_CVY_NOT_FOUND in this case.
Note that Load_port_change does some special casing for
IOCTL_CVY_CLUSTER_DRAIN and IOCTL_CVY_CLUSTER_PLUG -- it includes
the default port rule.
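A minimal sketch of the intended check follows; only the Load_port_change, IOCTL_CVY_NOT_FOUND,
IOCTL_CVY_CLUSTER_DRAIN and IOCTL_CVY_CLUSTER_PLUG names come from this note, while the
signature, fields and constant values are placeholders, not the driver's real definitions.
/* Placeholder values for illustration only. */
#define IOCTL_CVY_OK            0
#define IOCTL_CVY_NOT_FOUND     1
#define IOCTL_CVY_CLUSTER_DRAIN 2
#define IOCTL_CVY_CLUSTER_PLUG  3
typedef struct {
    unsigned int num_user_rules;   /* user-specified port rules only */
    /* ... */
} LOAD_CTXT, *PLOAD_CTXT;
unsigned int Load_port_change(PLOAD_CTXT lp, unsigned int port, unsigned int cmd)
{
    (void)port;   /* port matching omitted in this sketch */
    /* DRAIN/PLUG implicitly include the default port rule, so they are not
     * subject to the "no user-specified rules" check. */
    if (cmd != IOCTL_CVY_CLUSTER_DRAIN && cmd != IOCTL_CVY_CLUSTER_PLUG &&
        lp->num_user_rules == 0)
    {
        /* No user-specified port rules: start/stop/drain/set on "ALL" ports
         * has nothing to act on, so report an error instead of success. */
        return IOCTL_CVY_NOT_FOUND;
    }
    /* ... locate the matching rule(s) and apply the requested operation ... */
    return IOCTL_CVY_OK;
}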
07.17.01 shouse
Due to a change in user-space where we no longer disable and re-enable the
adapter when the MAC address changes, the ded_mac_addr will now ALWAYS be
the burnt-in MAC address of the adapter, whereas it has been the NLB 02-bf
MAC address because by the time NLB bound to the adapter, it had already
picked up the new MAC address. Now, that is no longer the case, which
should not be a problem because all indications are that this was the way
that it was in win2k until we started disabling/enabling the adapters in
SP 1. However, an alignment issue resulted in a bug fix that appears to
rely on the fact that in unicast mode, the ded_mac_addr is the cl_mac_addr.
This fix was a hack, and doesn't seem to have really been thought out
anyway, because the code added was guaranteed to always be a no-op; it
amounted to "if (foo == 2) { foo = 2; }". Anyway, this "fix" was also only
applied in one of the three places where the exact same code resided, so the
fixed "fix" has also been propagated to all three places. The fix involves
spoofing source MAC addresses in unicast mode to prevent network switches
from learning the cluster MAC address. Rather than simply casting a
pointer to a PULONG and dereferencing it to set a ULONG, which may cause
an alignment fault, we set each byte of the ULONG individually to avoid
the alignment issue.
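A minimal sketch of the alignment-safe store described above, using generic names rather than
the driver's actual variables:
#include <stdint.h>
/* Casting an arbitrary byte pointer to a 4-byte type and storing through it
 * can raise an alignment fault on strict-alignment architectures:
 *     *(uint32_t *)dst = val;    // may fault if dst is not 4-byte aligned
 * Storing one byte at a time has no alignment requirement: */
static void store_ulong_bytewise(uint8_t *dst, uint32_t val)
{
    dst[0] = (uint8_t)(val & 0xFF);
    dst[1] = (uint8_t)((val >> 8) & 0xFF);
    dst[2] = (uint8_t)((val >> 16) & 0xFF);
    dst[3] = (uint8_t)((val >> 24) & 0xFF);
}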
10.21.01 shouse
Amendment to the above statement concerning the dedicated MAC address. It
appears that since sending a property change notification to the NIC results
in NDIS tearing down and rebuilding all bindings, by the time the adapter
is back up and running and NLB queries for the dedicated MAC address, the
adapter will have already picked up the 02-bf MAC address, so the statement
that the dedicated MAC address would now be the burnt-in MAC is not entirely
accurate.
10.21.01 shouse
Some lingering issues and their resolutions from a conversation with Bill Bain:
Dirty connections: The real question has been, "Why the seemingly arbitrary
five minute timeout?" Well, it turns out that the value is not arbitrary,
but rather was measured and based on empirical evidence. If a large number
of connections were left dangling by NLB when a "stop" was performed, this
would result in a reset "storm" if the host was quickly added back into the
cluster. It was observed that if NLB could block this traffic to the host
with the stale data, NLB could _significantly_ reduce the reset problems. So,
while it's true that this five minutes is no silver bullet, it was based on
real, measurable data and solved the problem for a significant number of the
stale connections.
PPTP: Of course, PPTP was supposed to be supported in Windows 2000, but a
cursory look at the source code shows that tracking the calls, which are
GRE packets, did NOT work in Windows 2000. GRE packets were supposed to be
treated like TCP data packets on the PPTP tunnel (TCP connection), and since
no port numbers from the PPTP tunnel are recoverable in a GRE packet, NLB
hard-coded the source and destination ports to zero and 1723, respectively.
The 1723 corresponds to the server port number of the PPTP tunnel, and the zero
is arbitrary and as good a choice for a source port as any. So, GRE packets
would be hashed the same as the TCP tunnel in single affinity, sticking the
GRE traffic to the correct host. However, when ambiguity arose (unoptimized
mode), GRE packets were looking for a descriptor with a source port of zero
and a destination port of 1723. Because the tunnel was established with the
ephemeral port assigned by TCP on the client machine, no descriptor would
EVER be found, and the packets were discarded. What was _intended_ was to
create the descriptor for the PPTP tunnel using the same hard-coded source
port of zero. In that case, GRE packets would find a matching descriptor
when necessary. This was the small piece of logic missing in Windows 2000,
which will be added in an upcoming service pack. However, this fix eliminates
any method by which NLB could distinguish multiple PPTP tunnels from the same
client IP address (since the client ports are masked). So, a limitation of
this implementation is that clients may NOT establish multiple tunnels (which
they won't by default) and clients from behind a NAT are not supported, as
multiple clients from behind a NAT would look like the same client to NLB,
differentiated only by source port, which NLB cannot distinguish.
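In rough terms, the missing Win2k logic looks like the hedged sketch below; the CONN_KEY type,
helper and function names are invented for illustration and are not the driver's actual
structures, while the hard-coded ports 0 and 1723 come from the description above.
#include <stdint.h>
#define PPTP_SERVER_PORT  1723   /* destination port used for GRE tracking */
#define GRE_FAKE_SRC_PORT 0      /* arbitrary hard-coded source port       */
typedef struct {
    uint32_t client_ip;
    uint32_t server_ip;
    uint16_t client_port;
    uint16_t server_port;
} CONN_KEY;
/* Stand-in for inserting a connection descriptor into NLB's state table. */
static void create_descriptor(const CONN_KEY *key)
{
    (void)key;   /* table insertion omitted in this sketch */
}
/* Called when this host accepts the PPTP control tunnel (TCP 1723 SYN).
 * The ephemeral client port is not recoverable from GRE packets, so a
 * descriptor is also created under the hard-coded (0, 1723) ports that GRE
 * lookups will later use.  This is also why multiple tunnels from one
 * client IP, or several clients behind a NAT, cannot be told apart. */
static void on_pptp_tunnel_accepted(uint32_t client_ip, uint32_t server_ip)
{
    CONN_KEY gre_key = { client_ip, server_ip,
                         GRE_FAKE_SRC_PORT, PPTP_SERVER_PORT };
    create_descriptor(&gre_key);
}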
Fragmentation: NLB has had an "optimized" fragmentation mode in it that
didn't seem to make sense. The problem is that subsequent packets in a
fragmented segment will not have the TCP/UDP ports, which NLB needs in order
to properly filter them. The "unoptimized" mode said that if the packet in
question was the first packet of a fragment, then NLB can get to the port
numbers, so it will be treated normally and passed up only on the correct host.
Subsequent packets in the fragmented segment will not have the port numbers,
so NLB would pass them up on _all_ hosts in the cluster. The IP layer would
simply drop the fragments on the hosts that did not pass up the first packet
in the fragmented segment. So, other than a bit of extra stress on the IP
layer in the stack, this method should be guaranteed to work. The "optimized"
mode was a method by which to let NLB do the filtering in the limited cases
that it could. Basically, this mode asserted that if you have a single port
rule that covers all ports (0-65535), then the server port is essentially
irrelevant - you'd look up the same port rule regardless of what the port
actually was. Further, if that port rule was configured in single affinity,
then the client port was also irrelevant - it's not used in the hashing
algorithm. If the cluster is configured as such (which happens to be the
default), then NLB need not know the actual source ports to pass the packet
up ONLY on the correct host. Well, that is almost correct. It is true that
the client and server ports then become irrelevant insofar as port rule
lookup and hashing, but they ARE needed for descriptor lookup - if we're
hoping to find a matching connection descriptor in order to know which host
owns a particular connection, we need to know the _actual_ client and server
ports to match a descriptor. So, this "optimized" mode doesn't really work
after all. However, as it turns out, in Windows 2000, where it was introduced,
it DID actually work. Assuming you discount TCP, for which fragmentation is
_highly_ discouraged by setting maximum segment sizes appropriately, it DID
work for UDP/GRE/IPSEC because those protocols did not utilize descriptors at
all - their ownership was based solely on who currently owned the bucket to
which the packet mapped. So, it's a bit muddled, but did "work" in Windows
2000. In .NET Server, however, this "optimized" mode
has been removed because it no longer works. This is because some UDP traffic,
namely IPSec (port 500) is now tracked through the use of descriptors. This
failure was actually found through IPSec testing in which the initial fragment
went up on the correct server, but the subsequent fragment went up on the
_wrong_ server (not all servers, as it would have in "unoptimized" mode). GRE
and IPSec protocol traffic use hard-coded ports in connection tracking, so they
continue to be indifferent to fragmentation.
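The condition that the "optimized" mode relied on can be sketched as follows; the types and
names are invented for illustration, and the caveat above still applies (ports are still needed
for descriptor lookup, which is why the mode was ultimately removed).
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
typedef enum { AFFINITY_NONE, AFFINITY_SINGLE, AFFINITY_CLASS_C } AFFINITY;
typedef struct {
    uint16_t start_port;
    uint16_t end_port;
    AFFINITY affinity;
} PORT_RULE;
/* True if the ports cannot influence port rule lookup or hashing: a single
 * rule covering 0-65535, configured for single affinity (the default). */
static bool ports_irrelevant(const PORT_RULE *rules, size_t count)
{
    return count == 1 &&
           rules[0].start_port == 0 && rules[0].end_port == 65535 &&
           rules[0].affinity == AFFINITY_SINGLE;
}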
12.05.01 chrisdar
BUG 482284 NLB: stores its private state in wrong Ndis packet causes break
during standby
When there is no packet stack available in an NDIS packet for NLB to store
information, NLB needs to allocate an NDIS packet for its own use, copy the
information from the original packet into it, then deallocate it when we are
finished using it. One place where this happens is in a rarely executed code
path of Prot_recv_indicate. The bug was that in this code path, we subsequently
used the original packet and tried to access packet stack that wasn't available.
The packet we allocated to get packet stack wasn't used. The fix is to use the
allocated packet instead of the original.
While testing a private fix in the lab, I also made temporary changes to force
Prot_recv_indicate to use this code path for every received non-remote control
packet.
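In rough terms, the bug/fix pattern looks like the hedged sketch below; the types and helpers
are stand-ins invented for illustration and are not the actual NDIS or NLB routines.
typedef struct _PACKET {
    int has_packet_stack;   /* stand-in for NDIS packet-stack availability */
    /* ... payload, out-of-band data, etc. ... */
} PACKET;
/* Stand-in helpers; the real code uses NDIS packet allocation and copying. */
static PACKET *CopyPacketWithStack(PACKET *orig) { (void)orig; return 0; }
static void    StoreNlbState(PACKET *pkt)        { (void)pkt; }
static void    IndicateUp(PACKET *pkt)           { (void)pkt; }
static void recv_path(PACKET *orig)
{
    PACKET *pkt = orig;
    if (!orig->has_packet_stack) {
        /* No packet stack available: allocate a private copy that has one. */
        pkt = CopyPacketWithStack(orig);
        if (pkt == 0)
            return;   /* drop on allocation failure */
    }
    /* BUG 482284: the old code kept using 'orig' from here on and touched a
     * packet stack it did not have; the fix is to use 'pkt' throughout. */
    StoreNlbState(pkt);
    IndicateUp(pkt);
}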
1.21.02, shouse
Note: Due to recent changes in the GRE virtual descriptor tracking mechanism in
the driver, SINGLE affinity is now REQUIRED for PPTP. In general, single affinity
has always been "required" for VPN, but until this change was made, no affinity
would still have basically worked for PPTP. No affinity WILL STILL WORK for IPsec,
but only helps in the case that clients come from behind a NAT device; if they do
not come from behind a NAT, the source and destination ports are ALWAYS UDP 500
anyway, which defeats any advantage no affinity might provide.
Why did no affinity previously work for PPTP?
When a PPTP tunnel is created, NLB hashes the TCP control tunnel just like any
other TCP connection. If the affinity is set to none, then it uses the TCP port
numbers during the hashing process. If the host owns the bucket to which the
TCP SYN hashes, it accepts the connection and creates state to track the PPTP
tunnel. When a PPTP tunnel is accepted, it is also necessary to create a virtual
GRE descriptor to track the GRE call data for this tunnel. When this descriptor
is created, since no ports exist in the GRE protocol, it used the hard-coded ports
of 0 (source) and 1723 (destination). Since GRE is treated like TCP for the
purposes of port rule lookup and state maintenance, the GRE state creation in the
load module would certainly find the same port rule that the PPTP tunnel did; TCP
1723. However, if no affinity is set, it will NOT derive the same hashing result
that the PPTP tunnel did because the source (client) ports are different; an
arbitrary port number in the PPTP SYN packet and a hardcoded port number of 0 in
the GRE "virtual connection". Therefore, the load module would end up "injecting"
a descriptor into a port rule and "bucket" that it MIGHT NOT EVEN OWN (because bucket
ownership is not considered when creating these virtual descriptors that correspond
to a real connection being serviced by a host). In general, that's fine, and by
the next heartbeat, the host that DOES own that bucket will notice and stop blindly
accepting traffic that hashes to that bucket (it moves in non-optimized mode). So,
while it SHOULD work in no affinity, this runs the risk of unnecessarily shifting
the cluster into non-optimized mode because hosts that are not the bucket owners
may handle connections on those buckets.
Why won't no affinity work any more?
Basically, because the second hash performed on the GRE "connection" has been removed.
Up-going PPTP tunnels used to require at least 3, and as many as 4, calls to the NLB
hash function. Because the hash function is a LARGE portion of the NLB overhead, this
is non-optimal, and, as it happens, unnecessary. By moving the virtual descriptor
and descriptor cleanup intelligence from main.c to load.c, these multiple calls to the
hash function were eliminated. A single hash is now performed on all packets. However,
when GRE virtual descriptors are created now, they use the hash value already computed
as part of the PPTP TCP SYN processing. This is a better solution, as it ensures that
both the PPTP TCP tunnel and the GRE virtual "connection" belong to the same bucket,
and therefore the same host. This prevents us from unnecessarily putting the cluster
into a non-optimized state. However, when GRE data packets do arrive and need to hash
and perform a state lookup, there is no way that it can regenerate the same hash value
that was computed by the PPTP TCP tunnel setup if the affinity is set to none. That,
of course, is because the TCP source port of the PPTP tunnel is not recoverable from the
GRE packets. Therefore, to ensure that GRE packet lookup can re-calculate the necessary
hash value, single affinity is REQUIRED.
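A toy hash below illustrates why; it is invented for this sketch (it is NOT NLB's hashing
algorithm, and the bucket count is an assumption), but it shows how GRE lookups cannot
reproduce the tunnel's result once the client port feeds the hash.
#include <stdint.h>
/* Toy stand-in: single affinity keys on the client IP only, while no
 * affinity also folds in the client port. */
static uint32_t toy_hash(uint32_t client_ip, uint16_t client_port, int single_affinity)
{
    uint32_t h = client_ip * 2654435761u;
    if (!single_affinity)
        h ^= (uint32_t)client_port * 40503u;
    return h % 60;   /* bucket count is an assumption for this sketch */
}
/* Single affinity: the PPTP tunnel (ephemeral client port) and GRE data
 * (no port recoverable, so 0 is used) land in the same bucket:
 *     toy_hash(ip, 49152, 1) == toy_hash(ip, 0, 1)
 * No affinity: the two results differ, so GRE packets can never regenerate
 * the hash computed during tunnel setup, and the state lookup fails. */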
02/14/2002 JosephJ Location of fake ndis usermode code...
\\winsefre\nt5src\private\ntos\tdi\tcpipmerge\1394\arp1394\tests
04/15/2002 JosephJ To temporarily build the um ndis stuff (needs cleaning up)
#ifdef TESTPROGRAM
#include "rmtest.h"
#define KERNEL_MODE
#else
#include <ndis.h>
/* For querying TCP about the state of a TCP connection. */
#include "ntddtcp.h"
#include "ntddip.h"
#endif // !TESTPROGRAM
04/24/2002 JosephJ diplist: Added skeleton diplist code
diplist.c, diplist.h
Also added code under .\test to component test the diplist code.
04/24/2002 JosephJ diplist: Added the fast lookup functionality.
04/25/2002 JosephJ diplist: Changed internal constants to "production" values.
#define MAX_ITEMS 32 // TODO: replace by appropriate CVY constant.
#define HASH1_SIZE 257 // size (in bits) of bit-vector (make it a prime)
#define HASH2_SIZE 59 // size of hashtable (make it a prime)
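The constants above suggest a two-stage membership test; the following is a hedged sketch of
that idea only (the real diplist.c layout and hash functions are not reproduced here, and the
confirmation pass below uses a linear scan where the real code presumably uses the
HASH2_SIZE-slot table).
#include <stdint.h>
#define MAX_ITEMS  32
#define HASH1_SIZE 257   /* bits in the quick-reject bit vector (prime)  */
#define HASH2_SIZE 59    /* slots in the second-stage hash table (prime) */
typedef struct {
    uint8_t  bitvec[(HASH1_SIZE + 7) / 8];   /* stage 1: cheap reject     */
    uint32_t items[MAX_ITEMS];               /* dedicated IPs in the list */
    uint32_t count;
} DIPLIST;
/* Fast membership check: most misses are rejected by a single bit test;
 * only bit-vector hits fall through to a confirmation pass. */
static int diplist_check(const DIPLIST *l, uint32_t ip)
{
    uint32_t bit = ip % HASH1_SIZE;
    if (!(l->bitvec[bit / 8] & (1u << (bit % 8))))
        return 0;                      /* definitely not present */
    for (uint32_t i = 0; i < l->count; i++)
        if (l->items[i] == ip)
            return 1;                  /* confirmed hit */
    return 0;                          /* bit-vector collision */
}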
08.16.02, shouse
The driver no longer fills in the pg_rsvd array in the heartbeat because it was
discovered that it routinely produces a Wake On LAN pattern in the heartbeat that
causes BroadCom NICs to panic. Although this is NOT an NLB issue, but rather a
firmware issue in BroadCom NICs, it was decided to remove the information from the
heartbeat to alleviate the problem for customers with BroadCom NICs upgrading to
.NET. This array is UNUSED by NLB, so there is no harm in not filling it in; it
was added a long time ago for debugging purposes as part of the now-defunct FIN-
counting fix that was part of Win2k SP1.
For future reference, should we need to use this space in the heartbeat at some
future point in time, it appears that we will need to be careful to avoid potential
WOL patterns in our heartbeats where we can avoid it. A WOL pattern is:
6 bytes of 0xFF, followed by 16 identical instances of a "MAC address" that can
appear ANYWHERE in ANY frame type, including our very own NLB heartbeats. E.g.:
FF FF FF FF FF FF 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06
The MAC address need not be valid, however. In NLB heartbeats, the "MAC address"
in the mistaken WOL pattern is "00 00 00 00 00 00". NLB routinely fills heartbeats
with FF and 00 bytes, but it seems that by "luck" no other place in the heartbeat
is this vulnerable. For instance, in the load_amt array, each entry has a
maximum value of 100 (decimal), so there is no possibility of generating the initial
6 bytes of FF to start the WOL pattern. All of the "map" arrays seem to be saved
by two strokes of fortune; (i) little endian and (ii) the bin distribution algorithm.
(i) Since we don't use the 4 most significant bits of the ULONGLONGs used to store
each map, the most significant byte is NEVER FF. Because Intel is little endian, the
most significant byte appears last. For example:
0F FF FF FF FF FF FF FF appears in the packet as FF FF FF FF FF FF FF 0F
This breaks the FF sequence in many scenarios.
(ii) The way the bin distribution algorithm distributes buckets to hosts seems to
discourage other possibilities. For instance, a current map of:
00 FF FF FF FF FF FF 00
just isn't likely. However, it IS STILL POSSIBLE! So, it is important to note that:
REMOVING THIS LINE OF CODE DOES NOT, IN ANY WAY, GUARANTEE THAT AN NLB HEARTBEAT
CANNOT STILL CONTAIN A VALID WAKE ON LAN PATTERN SOMEWHERE ELSE IN THE FRAME!!!
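For reference, a minimal sketch of scanning a frame for the WOL pattern described above
(6 bytes of 0xFF followed by 16 identical 6-byte values, at any offset); this is illustrative
only and is not code from the driver.
#include <stddef.h>
#include <stdint.h>
#include <string.h>
static int frame_contains_wol_pattern(const uint8_t *buf, size_t len)
{
    static const uint8_t sync[6] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
    const size_t pattern_len = 6 + 16 * 6;   /* sync + 16 copies of the MAC */
    for (size_t off = 0; off + pattern_len <= len; off++) {
        if (memcmp(buf + off, sync, 6) != 0)
            continue;
        /* The repeated value need not be a valid MAC; all zeros qualifies. */
        const uint8_t *mac = buf + off + 6;
        int match = 1;
        for (int i = 1; i < 16 && match; i++)
            match = (memcmp(mac, mac + i * 6, 6) == 0);
        if (match)
            return 1;   /* a NIC scanning for WOL could trigger on this frame */
    }
    return 0;
}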