Leaked source code of Windows Server 2003

12/11/00 JosephJ Fix for #23727
23727: the "wlbs drain all" command should return an error message
if no port rules exist.
The problem (if you can call it that) is that if there are NO user-specified
port rules, we treat port-specific operations directed to "ALL" ports as
successful. These commands are start, stop, drain, and set (adjust weights).
The fix is for Load_port_change to return IOCTL_CVY_NOT_FOUND in this case.
Note that Load_port_change does some special casing for
IOCTL_CVY_CLUSTER_DRAIN and IOCTL_CVY_CLUSTER_PLUG -- it includes
the default port rule.
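
The shape of that fix, roughly (a sketch only; the field and parameter names
below are illustrative, not the actual load.c structures):

    /* Sketch: if the user configured no port rules at all, a port-specific
       operation directed at "ALL" ports has nothing to act on, so report
       IOCTL_CVY_NOT_FOUND instead of silently succeeding. */
    if (lp->params->num_rules == 0)
        return IOCTL_CVY_NOT_FOUND;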

07.17.01 shouse
Due to a change in user-space where we no longer disable and re-enable the
adapter when the MAC address changes, the ded_mac_addr will now ALWAYS be
the burnt-in MAC address of the adapter, whereas it used to be the NLB 02-bf
MAC address, because by the time NLB bound to the adapter, it had already
picked up the new MAC address. Now that is no longer the case, which
should not be a problem, because all indications are that this was the way
it was in Win2k until we started disabling/enabling the adapters in
SP1. However, an alignment issue resulted in a bug fix that appears to
rely on the fact that in unicast mode, the ded_mac_addr is the cl_mac_addr.
That fix was a hack, and doesn't seem to have really been thought out
anyway, because the code added was guaranteed to always be a no-op; it
amounted to "if (foo == 2) { foo = 2; }". That "fix" was also applied in
only one of the three places where the exact same code resided, so the
corrected "fix" has now been propagated to all three places. The fix involves
spoofing source MAC addresses in unicast mode to prevent network switches
from learning the cluster MAC address. Rather than simply casting a
pointer to a PULONG and dereferencing it to set a ULONG, which may cause
an alignment fault, we set each byte of the ULONG individually to avoid
the alignment issue.
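
In other words, instead of the cast-and-dereference form, the value is stored
byte by byte. A minimal illustration (the variable names here are made up, not
the driver's):

    UCHAR * ptr = frame + offset;   /* not guaranteed to be 4-byte aligned */
    ULONG   val = spoofed_mac_bits;

    /* Unsafe on alignment-sensitive architectures:
       *(PULONG)ptr = val;                           */

    /* Alignment-safe: write the ULONG one byte at a time. */
    ptr[0] = (UCHAR)(val & 0xff);
    ptr[1] = (UCHAR)((val >> 8) & 0xff);
    ptr[2] = (UCHAR)((val >> 16) & 0xff);
    ptr[3] = (UCHAR)((val >> 24) & 0xff);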

10.21.01 shouse
Amendment to the above statement concerning the dedicated MAC address. It
appears that since sending a property change notification to the NIC results
in NDIS tearing down and rebuilding all bindings, by the time the adapter
is back up and running and NLB queries for the dedicated MAC address, the
adapter will have already picked up the 02-bf MAC address, so the statement
that the dedicated MAC address would now be the burnt-in MAC is not entirely
accurate.

10.21.01 shouse
Some lingering issues and their resolutions from a conversation with Bill Bain:

Dirty connections: The real question has been, "Why the seemingly arbitrary
five-minute timeout?" Well, it turns out that the value is not arbitrary,
but rather was measured and based on empirical evidence. If a large number
of connections were left dangling by NLB when a "stop" was performed, this
would result in a reset "storm" if the host was quickly added back into the
cluster. It was observed that if NLB could block this traffic to the host
with the stale data, NLB could _significantly_ reduce the reset problems. So,
while it's true that this five minutes is no silver bullet, it was based on
real, measurable data and solved the problem for a significant number of the
stale connections.

PPTP: Of course, PPTP was supposed to be supported in Windows 2000, but a
cursory look at the source code shows that tracking the calls, which are
GRE packets, did NOT work in Windows 2000. GRE packets were supposed to be
treated like TCP data packets on the PPTP tunnel (TCP connection), and since
no port numbers from the PPTP tunnel are recoverable in a GRE packet, NLB
hard-coded the source and destination ports to zero and 1723, respectively.
The 1723 corresponds to the server port number of the PPTP tunnel, and the zero
is arbitrary and as good a choice for a source port as any. So, GRE packets
would be hashed the same as the TCP tunnel in single affinity, sticking the
GRE traffic to the correct host. However, when ambiguity arose (unoptimized
mode), GRE packets were looking for a descriptor with a source port of zero
and a destination port of 1723. Because the tunnel was established with the
ephemeral port assigned by TCP on the client machine, no descriptor would
EVER be found, and the packets were discarded. What was _intended_ was to
create the descriptor for the PPTP tunnel using the same hard-coded source
port of zero. In that case, GRE packets would find a matching descriptor
when necessary. This was the small piece of logic missing in Windows 2000,
which will be added in an upcoming service pack. However, this fix eliminates
any method by which NLB could distinguish multiple PPTP tunnels from the same
client IP address (since the client ports are masked). So, a limitation of
this implementation is that clients may NOT establish multiple tunnels (which
they won't by default), and clients from behind a NAT are not supported, as
multiple clients behind a NAT would look like the same client to NLB,
differentiated only by source port, which NLB cannot distinguish.
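
Conceptually, the Windows 2000 behavior and the fix look like this (a sketch
with made-up helper names, not the actual driver code):

    /* GRE carries no port numbers, so NLB substitutes fixed values. */
    #define GRE_SRC_PORT    0       /* arbitrary stand-in for the client port */
    #define GRE_DST_PORT    1723    /* PPTP control connection's server port  */

    /* Win2k: GRE lookups used the hard-coded ports ...                       */
    desc = find_descriptor(client_ip, server_ip, GRE_SRC_PORT, GRE_DST_PORT);

    /* ... but the tunnel's descriptor had been created with the client's real
       ephemeral port, so the lookup could never match and GRE was dropped.
       The fix: create the tunnel's descriptor with GRE_SRC_PORT (0) as well,
       so the GRE lookup above finds it. */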

Fragmentation: NLB has had an "optimized" fragmentation mode in it that
didn't seem to make sense. The problem is that subsequent packets in a
fragmented segment will not have the TCP/UDP ports, which NLB needs in order
to properly filter them. The "unoptimized" mode said that if the packet in
question was the first packet of a fragmented segment, then NLB can get to the port
numbers, so it will be treated normally and passed up only on the correct host.
Subsequent packets in the fragmented segment will not have the port numbers,
so NLB would pass them up on _all_ hosts in the cluster. The IP layer would
simply drop the fragments on the hosts that did not pass up the first packet
in the fragmented segment. So, other than a bit of extra stress on the IP
layer in the stack, this method should be guaranteed to work. The "optimized"
mode was a method by which to let NLB do the filtering in the limited cases
that it could. Basically, this mode asserted that if you have a single port
rule that covers all ports (0-65535), then the server port is essentially
irrelevant - you'd look up the same port rule regardless of what the port
actually was. Further, if that port rule was configured in single affinity,
then the client port was also irrelevant - it's not used in the hashing
algorithm. If the cluster is configured as such (which happens to be the
default), then NLB need not know the actual source ports to pass the packet
up ONLY on the correct host. Well, that is almost correct. It is true that
the client and server ports then become irrelevant insofar as port rule
lookup and hashing are concerned, but they ARE needed for descriptor lookup - if we're
hoping to find a matching connection descriptor in order to know which host
owns a particular connection, we need to know the _actual_ client and server
ports to match a descriptor. So, this "optimized" mode doesn't really work
after all. However, as it turns out, in Windows 2000, where it was introduced,
it DID actually work. Discounting TCP, for which fragmentation is _highly_
discouraged by setting maximum segment sizes appropriately, it DID work for
UDP/GRE/IPSec because those protocols did not utilize descriptors at all -
their ownership was based solely on who currently owned the bucket to which
the packet mapped. So, it's a bit muddled, but it did "work" in Windows 2000.
In .NET Server, however, this "optimized" mode has been removed because it no
longer works. This is because some UDP traffic, namely IPSec (port 500), is
now tracked through the use of descriptors. This failure was actually found
through IPSec testing in which the initial fragment went up on the correct
server, but a subsequent fragment went up on the _wrong_ server (not on all
servers, as it would have in "unoptimized" mode). GRE and IPSec protocol
traffic use hard-coded ports in connection tracking, so they remain
indifferent to fragmentation.
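
The test the old "optimized" mode effectively relied on was something like the
following (illustrative names; CVY_RULE and the affinity constant are meant to
suggest the params structures, not quote them). The flaw described above is
that even when this predicate holds, descriptor-tracked traffic still needs
the real ports:

    /* Ports can only be ignored when they cannot influence the outcome:
       exactly one rule, spanning all ports, configured for single affinity. */
    BOOLEAN Fragments_can_skip_ports (const CVY_RULE * rules, ULONG num_rules)
    {
        return (num_rules == 1 &&
                rules[0].start_port == 0 &&
                rules[0].end_port == 65535 &&
                rules[0].affinity == CVY_AFFINITY_SINGLE);
    }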

12.05.01 chrisdar
BUG 482284 NLB: stores its private state in wrong Ndis packet causes break
during standby
When there is no packet stack available in an NDIS packet for NLB to store
information, NLB needs to allocate an NDIS packet for its own use, copy the
information from the original packet into it, then deallocate it when we are
finished using it. One place where this happens is in a rarely executed code
path of Prot_recv_indicate. The bug was that in this code path, we subsequently
used the original packet and tried to access a packet stack that wasn't available.
The packet we allocated to get a packet stack wasn't used. The fix is to use the
allocated packet instead of the original.
While testing a private fix in the lab, I also made temporary changes to force
Prot_recv_indicate to use this code path for every received non-remote-control
packet.
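
The bug pattern, in outline (a sketch of the logic, not the literal
Prot_recv_indicate code; the allocation helper name is invented):

    PNDIS_PACKET_STACK  pktstack;
    BOOLEAN             stack_available;
    PNDIS_PACKET        use_packet = original_packet;

    pktstack = NdisIMGetCurrentPacketStack(original_packet, &stack_available);

    if (!stack_available)
    {
        /* No packet stack to stash NLB state in: allocate our own packet
           and copy the relevant information from the original into it. */
        use_packet = alloc_and_copy_packet(original_packet);
    }

    /* BUG: the code continued to use original_packet here, touching a packet
       stack that did not exist.  FIX: use use_packet (the allocated copy). */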

1.21.02, shouse
Note: Due to recent changes in the GRE virtual descriptor tracking mechanism in
the driver, SINGLE affinity is now REQUIRED for PPTP. In general, single affinity
has always been "required" for VPN, but until this change was made, no affinity
would still have basically worked for PPTP. No affinity WILL STILL WORK for IPsec,
but it only helps in the case that clients come from behind a NAT device; if they do
not come from behind a NAT, the source and destination ports are ALWAYS UDP 500
anyway, which defeats any advantage no affinity might provide.

Why did no affinity previously work for PPTP?
When a PPTP tunnel is created, NLB hashes the TCP control tunnel just like any
other TCP connection. If the affinity is set to none, then it uses the TCP port
numbers during the hashing process. If the host owns the bucket to which the
TCP SYN hashes, it accepts the connection and creates state to track the PPTP
tunnel. When a PPTP tunnel is accepted, it is also necessary to create a virtual
GRE descriptor to track the GRE call data for this tunnel. When this descriptor
is created, since no ports exist in the GRE protocol, it uses the hard-coded ports
of 0 (source) and 1723 (destination). Since GRE is treated like TCP for the
purposes of port rule lookup and state maintenance, the GRE state creation in the
load module would certainly find the same port rule that the PPTP tunnel did: TCP
1723. However, if no affinity is set, it will NOT derive the same hashing result
that the PPTP tunnel did, because the source (client) ports are different; an
arbitrary port number in the PPTP SYN packet and a hard-coded port number of 0 in
the GRE "virtual connection". Therefore, the load module would end up "injecting"
a descriptor into a port rule and "bucket" that it MIGHT NOT EVEN OWN (because bucket
ownership is not considered when creating these virtual descriptors that correspond
to a real connection being serviced by a host). In general, that's fine, and by
the next heartbeat, the host that DOES own that bucket will notice and stop blindly
accepting traffic that hashes to that bucket (it moves into non-optimized mode). So,
while it SHOULD work in no affinity, this runs the risk of unnecessarily shifting
the cluster into non-optimized mode because hosts that are not the bucket owners
may handle connections on those buckets.

Why won't no affinity work any more?
Basically, because the second hash performed on the GRE "connection" has been removed.
Up-going PPTP tunnels used to require at least 3, and as many as 4, calls to the NLB
hash function. Because the hash function is a LARGE portion of the NLB overhead, this
was non-optimal and, as it happens, unnecessary. By moving the virtual descriptor
and descriptor cleanup intelligence from main.c to load.c, these multiple calls to the
hash function were eliminated. A single hash is now performed on all packets. However,
when GRE virtual descriptors are created now, they use the hash value already computed
as part of the PPTP TCP SYN processing. This is a better solution, as it ensures that
the PPTP TCP tunnel and the GRE virtual "connection" both belong to the same bucket,
and therefore to the same host. This prevents us from unnecessarily putting the cluster
into a non-optimized state. However, when GRE data packets do arrive and need to hash
and perform a state lookup, there is no way they can regenerate the same hash value
that was computed by the PPTP TCP tunnel setup if the affinity is set to none. That,
of course, is because the TCP source port of the PPTP tunnel is not recoverable from the
GRE packets. Therefore, to ensure that GRE packet lookup can re-calculate the necessary
hash value, single affinity is REQUIRED.
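
Schematically (illustrative names, not the actual load.c code):

    /* Old scheme: a second hash call just for the GRE "connection", which in
       no affinity used the hard-coded ports (0, 1723) and so could map to a
       bucket this host does not even own:                                   */
    gre_bucket = MAP(Hash(svr_ip, clt_ip, 1723, 0), map);

    /* New scheme: reuse the hash/bucket already computed while processing the
       PPTP TCP SYN, so the tunnel and its GRE virtual descriptor always share
       one bucket and therefore one owning host:                             */
    gre_desc->bucket = tcp_desc->bucket;

    /* When a GRE data packet later arrives, the lookup re-hashes using only
       the IPs (single affinity), which reproduces the same bucket.  With no
       affinity the client port would be needed, and it is not recoverable
       from GRE, hence the new single-affinity requirement.                  */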

02/14/2002 JosephJ Location of fake ndis usermode code...
\\winsefre\nt5src\private\ntos\tdi\tcpipmerge\1394\arp1394\tests

04/15/2002 JosephJ To temporarily build the um ndis stuff (needs cleaning up)
#ifdef TESTPROGRAM
#include "rmtest.h"
#define KERNEL_MODE
#else
#include <ndis.h>
/* For querying TCP about the state of a TCP connection. */
#include "ntddtcp.h"
#include "ntddip.h"
#endif // !TESTPROGRAM

04/24/2002 JosephJ diplist: Added skeleton diplist code
diplist.c, diplist.h
Also added code under .\test to component-test the diplist code.

04/24/2002 JosephJ diplist: Added the fast lookup functionality.

04/25/2002 JosephJ diplist: Changed internal constants to "production" values.
#define MAX_ITEMS 32 // TODO: replace by appropriate CVY constant.
#define HASH1_SIZE 257 // size (in bits) of bit-vector (make it a prime)
#define HASH2_SIZE 59 // size of hashtable (make it a prime)
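
Those constants suggest a two-level lookup: a 257-bit vector as a cheap
"definitely not present" filter, backed by a 59-slot hash table holding at most
32 dedicated IPs. A sketch of that idea (the real diplist.c layout and function
names may differ; 0 is assumed here to mean an empty slot):

    typedef struct {
        ULONG bitvec[(HASH1_SIZE + 31) / 32];  /* level 1: membership pre-filter */
        ULONG table[HASH2_SIZE];               /* level 2: the DIPs themselves   */
    } DIPLIST_SKETCH;

    BOOLEAN diplist_contains (const DIPLIST_SKETCH * dl, ULONG dip)
    {
        ULONG i, h1 = dip % HASH1_SIZE, h2 = dip % HASH2_SIZE;

        /* Fast path: bit clear means the DIP is certainly not in the list. */
        if (!(dl->bitvec[h1 / 32] & (1UL << (h1 % 32))))
            return FALSE;

        /* Slow path: confirm against the small hash table (linear probing). */
        for (i = 0; i < HASH2_SIZE; i++)
        {
            ULONG slot = (h2 + i) % HASH2_SIZE;
            if (dl->table[slot] == dip) return TRUE;
            if (dl->table[slot] == 0)   return FALSE;  /* empty slot ends probe */
        }
        return FALSE;
    }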

08.16.02, shouse
The driver no longer fills in the pg_rsvd array in the heartbeat because it was
discovered that it routinely produces a Wake On LAN pattern in the heartbeat that
causes Broadcom NICs to panic. Although this is NOT an NLB issue but rather a
firmware issue in Broadcom NICs, it was decided to remove the information from the
heartbeat to alleviate the problem for customers with Broadcom NICs upgrading to
.NET. This array is UNUSED by NLB, so there is no harm in not filling it in; it
was added a long time ago for debugging purposes as part of the now-defunct FIN-
counting fix that was part of Win2k SP1.

For future reference, should we need to use this space in the heartbeat at some
future point in time, it appears that we will need to be careful to avoid potential
WOL patterns in our heartbeats where we can avoid it. A WOL pattern is:
6 bytes of 0xFF, followed by 16 identical instances of a "MAC address", and it can
appear ANYWHERE in ANY frame type, including our very own NLB heartbeats. E.g.:
FF FF FF FF FF FF 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06 01 02 03 04 05 06
01 02 03 04 05 06
The MAC address need not be valid, however. In NLB heartbeats, the "MAC address"
in the mistaken WOL pattern is "00 00 00 00 00 00". NLB routinely fills heartbeats
with FF and 00 bytes, but it seems that, by "luck", no other place in the heartbeat
is this vulnerable. For instance, in the load_amt array, each entry has a
maximum value of 100 (decimal), so there is no possibility of generating the initial
6 bytes of FF to start the WOL pattern. All of the "map" arrays seem to be saved
by two strokes of fortune: (i) little-endianness and (ii) the bin distribution algorithm.
(i) Since we don't use the 4 most significant bits of the ULONGLONGs used to store
each map, the most significant byte is NEVER FF. Because Intel is little-endian, the
most significant byte appears last. For example:
0F FF FF FF FF FF FF FF appears in the packet as FF FF FF FF FF FF FF 0F
This breaks the FF sequence in many scenarios.
(ii) The way the bin distribution algorithm distributes buckets to hosts seems to
discourage other possibilities. For instance, a current map of:
00 FF FF FF FF FF FF 00
just isn't likely. However, it IS STILL POSSIBLE! So, it is important to note that:
REMOVING THIS LINE OF CODE DOES NOT, IN ANY WAY, GUARANTEE THAT AN NLB HEARTBEAT
CANNOT STILL CONTAIN A VALID WAKE ON LAN PATTERN SOMEWHERE ELSE IN THE FRAME!!!
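
For reference, a self-contained check for the pattern described above (6 bytes
of 0xFF followed immediately by 16 identical 6-byte groups, anywhere in the
frame) might look like this; it is an illustration, not code from the driver:

    /* Returns TRUE if buf[0..len-1] contains a Wake On LAN magic pattern. */
    BOOLEAN contains_wol_pattern (const UCHAR * buf, ULONG len)
    {
        const ULONG need = 6 + 16 * 6;   /* FF sync + 16 copies of the "MAC" */
        ULONG i, j, k;

        for (i = 0; need <= len && i <= len - need; i++)
        {
            /* Look for 6 consecutive bytes of 0xFF starting at offset i. */
            for (j = 0; j < 6 && buf[i + j] == 0xFF; j++);
            if (j < 6)
                continue;

            /* Verify the 16 six-byte groups after the sync are identical. */
            for (k = 1; k < 16; k++)
                if (memcmp(buf + i + 6, buf + i + 6 + k * 6, 6) != 0)
                    break;
            if (k == 16)
                return TRUE;
        }
        return FALSE;
    }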