Leaked source code of windows server 2003
/* ---------------------- MMapi.c ----------------------- */
/* This module contains cluster Membership Manager (MM) functions.
 *
 * These functions are for the sole use of the ClusterManager (CM).
 * All are privileged and local; no user can call them. Security is
 * not checked. The module is not thread-aware; only a single thread
 * can use these functions at a time. Higher levels must ensure this
 * before the calls.
 *
 *
 * All nodes of the cluster must know their own unique node number
 * within that cluster (a small int in the range 1..some_max). This
 * number is defined for the node at configuration time (either by the
 * user or by the setup code; this module doesn't care which) and is
 * essentially permanent. (The node number allows easy indexing and
 * bitmask operations, where names and non-small ints don't.)
 * There is no code in MM to detect illegal use of a node number,
 * staleness of a node number, etc.
 *
 * Clusters may also be named and/or numbered. Nodes are named. This
 * module makes no use of such facilities; it is based entirely on
 * node number.
 *
 * It is assumed that all use of the routines here is done on nodes
 * which agree to be members of the same cluster. This module does not
 * check such things.
 *
 * Cluster network connectivity must also be provided:
 *
 * - A node N must specify the various paths by which it can
 *   communicate with every other node; each other node must define
 *   its communication paths back to N. Full connectivity must be
 *   guaranteed; each node must be able to talk directly to every
 *   other node (and the reverse); for fault tolerance, communication
 *   paths must not only be replicated (minimally, duplicated) but
 *   must also use entirely independent wiring and drivers. TCP/IP
 *   LANs and async connections are suggested. Heartbeat traffic
 *   (which establishes cluster membership) may travel on any or all
 *   of the connectivity paths. [Cluster management traffic may
 *   travel on any or all of the connectivity paths, but may be
 *   restricted to high-performance paths (eg, TCP/IP).]
 *
 * - A node must know the address of the cluster as a whole. This is
 *   an IP address which fails over (or a NetBIOS name which fails
 *   over.. TBD) such that connecting to that cluster address provides
 *   a way to talk to a valid active member of the cluster, here
 *   called the PCM.
 *
 * Note that cluster connectivity is not defined by this interface;
 * it is assumed to be in a separate module. This module deals only in
 * communication to the cluster or communication to a node number
 * within that cluster; it does not care about the details of how such
 * communication is done.
 *
 * Cluster connectivity must be known to all nodes in the cluster,
 * and to a joining node, before the join attempt is made.
 *
 */
#ifdef __cplusplus
extern "C" {
#endif /* __cplusplus */

#if defined (TDM_DEBUG)
#include <mmapi.h>
#else // WOLFPACK
#include <service.h>
#endif

//#include <windows.h>
#include <wrgp.h>
#include <clmsg.h>

// #define INCONSISTENT_REGROUP_MONITOR_FAILED
// #define INCONSISTENT_REGROUP_ADD_FAILED
// #define INCONSISTENT_REGROUP_IGNORE_JOINER 3

void
rgp_receive_events( rgp_msgbuf *rgpbuf );

void
MMiNodeDownCallback(IN cluster_t failed_nodes);
/************************************************************************
 *
 * MMInit
 * ======
 *
 * Description:
 *
 * This initialises various local MM data structures. It should be
 * called at CM startup time on every node. It sends no messages; the
 * node need not have connectivity defined yet.
 *
 * Parameters:
 *
 * mynode -
 *   is the node# of this node within the cluster. This is
 *   assumed to be unique (but cannot be checked here to be so).
 *
 * MaxNodes -
 *   is the maximum number of nodes the cluster may contain; node
 *   numbers are bounded by this value.
 *
 * UpDownCallback -
 *   a function which will be called in this Up node
 *   whenever the MM declares another node Up or Down. The CM may then
 *   initiate failovers, device ownership changes, user node status
 *   events, etc. This routine must be quick and must not block
 *   (acceptable time TBD). Note that this will happen on all nodes
 *   of the cluster; it is up to the CM design to decide whether to
 *   issue events from only the PCM or from each CM node.
 *
 * QuorumCallback -
 *   This is a callback to deal with the special case where only
 *   2 members of the cluster existed, and a Regroup incident occurred
 *   such that only one member now survives OR there is a partition
 *   and both members survive (but cannot know that). The intent of the
 *   Quorum function is to determine whether the other node is alive
 *   or not, using mechanisms other than the normal heartbeating over
 *   the normal comm links (eg, to do so by using non-heartbeat
 *   communication paths, such as SCSI reservations). This function is
 *   called only in the case where cluster membership was previously
 *   exactly two nodes; and is called on any surviving node of these
 *   two (which might mean it is called on one node or on both
 *   partitioned nodes).
 *
 *   If this routine returns TRUE, then the calling node stays in the
 *   cluster. If the quorum algorithm determines that this node must
 *   die (because the other cluster member exists), then this function
 *   should return FALSE; this will initiate an orderly shutdown of the
 *   cluster services.
 *
 *   In the case of a true partition, exactly one node should
 *   return TRUE.
 *
 *   This routine may block and take a long time to execute (>2 secs).
 *
 * HoldIOCallback -
 *   This routine is called early (prior to Stage 1) in a Regroup
 *   incident. It suspends all cluster IO (to all cluster-owned
 *   devices), and any relevant intra-cluster messages, until resumed
 *   (or until this node dies).
 *
 * ResumeIOCallback -
 *   This is called during Regroup after the new cluster membership
 *   has been determined, when it is known that this node will remain
 *   a member of the cluster (early in Stage 4). All IO previously
 *   suspended by MMHoldAllIO should be resumed.
 *
 * MsgCleanup1Callback -
 *   This is called as the first part of intra-cluster message system
 *   cleanup (in stage 4). It cancels all incoming messages from a
 *   failed node. In the case where multiple nodes are evicted from
 *   the cluster, this function is called repeatedly, once for each node.
 *
 *   This routine is synchronous and Regroup is suspended until it
 *   returns. It must execute quickly.
 *
 * MsgCleanup2Callback -
 *   This is the second phase of message system cleanup (in stage 5). It
 *   cancels all outgoing messages to dead nodes. Characteristics are
 *   as for Cleanup1.
 *
 * HaltCallback -
 *   This function is called whenever the MM detects that this node
 *   should immediately leave the cluster (eg, on receipt of a poison
 *   packet or at some impossible error situation). The HALT function
 *   should immediately initiate Cluster Management shutdown. No MM
 *   functions should be called after this, other than MMShutdown.
 *
 *   The parameter "haltcode" is a number identifying the halt reason.
 *
 * JoinFailedCallback -
 *   This is called on a node being joined into the cluster when the
 *   join attempt in the PCM fails. Following this callback, the node
 *   may petition to join again, after cleaning up via a call to
 *   MMLeave.
 *
 * NodesDownCallback -
 *   This is called with the set of nodes that a Regroup incident
 *   declared Down.
 *
 * Returns:
 *
 * MM_OK    Success.
 *
 * MM_FAULT Something impossible happened.
 *
 ************************************************************************/
DWORD MMInit(
    IN DWORD mynode,
    IN DWORD MaxNodes,
    IN MMNodeChange UpDownCallback,
    IN MMQuorumSelect QuorumCallback,
    IN MMHoldAllIO HoldIOCallback,
    IN MMResumeAllIO ResumeIOCallback,
    IN MMMsgCleanup1 MsgCleanup1Callback,
    IN MMMsgCleanup2 MsgCleanup2Callback,
    IN MMHalt HaltCallback,
    IN MMJoinFailed JoinFailedCallback,
    IN MMNodesDown NodesDownCallback
    )
{
#if !defined (TDM_DEBUG)
    DWORD status;
#endif
    DWORD dwValue;  // used by the DmQueryDword calls below in all builds
    rgp_msgsys_t *rgp_msgsys_ptr;
    rgp_control_t *rgp_buffer_p;
    int rgp_buffer_len;

    //
    // allocate/clear storage for the message system area
    //
    rgp_msgsys_ptr = (rgp_msgsys_t *) calloc( 1, sizeof(rgp_msgsys_t) );
    if ( rgp_msgsys_ptr == NULL ) {
        ClRtlLogPrint(LOG_UNUSUAL,
            "[MM] Unable to allocate msgsys_ptr.\n");
        return(ERROR_NOT_ENOUGH_MEMORY);
    }
    memset( rgp_msgsys_ptr, 0, sizeof(rgp_msgsys_t) );

    //
    // ask regroup how much memory it needs and then allocate/clear it.
    //
    rgp_buffer_len = rgp_estimate_memory();
    rgp_buffer_p = (rgp_control_t *) calloc( 1, rgp_buffer_len );
    if ( rgp_buffer_p == NULL ) {
        ClRtlLogPrint(LOG_UNUSUAL,
            "[MM] Unable to allocate buffer_p.\n");
        free( rgp_msgsys_ptr ); // don't leak the message system area
        return(ERROR_NOT_ENOUGH_MEMORY);
    }
    memset( rgp_buffer_p, 0, rgp_buffer_len );

    //
    // let the regroup engine allocate and initialize its data structures.
    //
    rgp_init( (node_t)mynode,
              MaxNodes,
              (void *)rgp_buffer_p,
              rgp_buffer_len,
              rgp_msgsys_ptr );

#if !defined (TDM_DEBUG)
    //
    // Initialize message system
    //
    status = ClMsgInit(mynode);
    if (status != ERROR_SUCCESS) {
        ClRtlLogPrint(LOG_UNUSUAL,
            "[MM] Unable to initialize comm interface, status %1!u!.\n",
            status
            );
        return(status);
    }
#endif // TDM_DEBUG

    if( ERROR_SUCCESS == DmQueryDword(
            DmClusterParametersKey,
            CLUSREG_NAME_QUORUM_ARBITRATION_TIMEOUT,
            &dwValue, NULL) )
    {
        MmQuorumArbitrationTimeout = dwValue;
        ClRtlLogPrint(LOG_UNUSUAL,
            "[MM] MmQuorumArbitrationTimeout %1!d!.\n", dwValue);
    }
    if( ERROR_SUCCESS == DmQueryDword(
            DmClusterParametersKey,
            CLUSREG_NAME_QUORUM_ARBITRATION_EQUALIZER,
            &dwValue, NULL) )
    {
        MmQuorumArbitrationEqualizer = dwValue;
        ClRtlLogPrint(LOG_UNUSUAL,
            "[MM] MmQuorumArbitrationEqualizer %1!d!.\n", dwValue);
    }

    //
    // Save the user's callback entrypoints
    //
    rgp->OS_specific_control.UpDownCallback      = UpDownCallback;
    rgp->OS_specific_control.QuorumCallback      = QuorumCallback;
    rgp->OS_specific_control.HoldIOCallback      = HoldIOCallback;
    rgp->OS_specific_control.ResumeIOCallback    = ResumeIOCallback;
    rgp->OS_specific_control.MsgCleanup1Callback = MsgCleanup1Callback;
    rgp->OS_specific_control.MsgCleanup2Callback = MsgCleanup2Callback;
    rgp->OS_specific_control.HaltCallback        = HaltCallback;
    rgp->OS_specific_control.JoinFailedCallback  = JoinFailedCallback;
    rgp->OS_specific_control.NodesDownCallback   = NodesDownCallback;
    return MM_OK;
}
/************************************************************************
 * JoinNodeDelete
 * ==============
 *
 * Internal MM procedure to assist in Join failure recovery.
 *
 * Parameters:
 *
 * joinNode -
 *   node which failed to join.
 *
 * Returns:
 *   none.
 *
 ************************************************************************/
void JoinNodeDelete( DWORD joinNode )
{
    rgp_msgbuf rgpbuf;
    node_t i;
    int status;
#if 1
    RGP_LOCK;
    rgp_event_handler( RGP_EVT_BANISH_NODE, (node_t) joinNode );
    RGP_UNLOCK;
#else
    // [HACKHACK] Remove this when you feel confident that
    // banishing is much better than the following code
    rgpbuf.event = RGP_EVT_REMOVE_NODE;
    rgpbuf.data.node = (node_t)joinNode;
    for ( i = 0; i < (node_t) rgp->num_nodes; i++ )
    {
        if ( rgp->node_states[i].status == RGP_NODE_ALIVE )
        {
            if ( i == rgp->mynode )
                rgp_receive_events( &rgpbuf ); // take the quick route
            else
            {
                status = ClSend( EXT_NODE(i),
                                 (void *)&rgpbuf,
                                 sizeof(rgp_msgbuf),
                                 RGP_ACKMSG_TIMEOUT );
                if ( status )
                    RGP_TRACE( "ClSend failed to send Remove Node msg",
                               rgp->rgppkt.stage,
                               (uint32) EXT_NODE(i),
                               (uint32) status,
                               0 );
            }
        }
    }
#endif
}
/************************************************************************
 *
 * MMJoin
 * ======
 *
 * Description:
 *
 * This causes the specified node to join the active cluster.
 *
 * This routine should be issued by only one node of the cluster (the
 * PCM); all join attempts must be single-threaded (by code outside
 * this module).
 *
 * [Prior to this being called:
 *   - joiningNode has communicated to the PCM of the cluster
 *     that it wants to join.
 *   - checks on validity of clustername, nodenumber, etc have been
 *     made; any security checks have been done;
 *   - connectivity paths have been established to/from the cluster
 *     and joiningNode.
 *   - the Registry etc has been downloaded.
 * ]
 *
 * Parameters:
 *
 * joiningNode
 *   is the node number of the node being brought into
 *   the cluster.
 *
 *   If joiningNode = self (as passed in via MMInit), then the node
 *   will become the first member of a new cluster; if not, the node
 *   will be brought into the existing cluster.
 *
 * clockPeriod, sendHBRate, and rcvHBRate
 *   can only be set by the first call (ie
 *   when the cluster is formed); later calls (from joining members)
 *   inherit the original cluster values. The entire cluster therefore
 *   operates with the same values.
 *
 * clockPeriod
 *   is the basic clock interval which drives all internal
 *   MM activities, such as the various stages
 *   of membership reconfiguration, and eventually user-perceived
 *   recovery time. Unit= ms. This must be between the min and max
 *   allowed (values TBD; current best setting = 300ms). Note that
 *   clockPeriod is path independent and node independent. All
 *   cluster members regroup at the same rate over any/all available
 *   paths; all periods are identical in all nodes.
 *   A value of 0 implies the default setting (currently 300ms).
 *
 * sendHBRate
 *   is the multiple of clockPeriod at which heartbeats are sent. This
 *   must be between the min and max allowed (values TBD; current best
 *   setting = 4).
 *   A value of 0 implies the default setting (currently 4).
 *
 * rcvHBRate
 *   is the multiple of the heartbeat send interval during which a
 *   heartbeat must arrive, or the node initiates a Regroup (probably
 *   resulting in some node leaving the cluster).
 *   This must be between min and max; (values TBD; current best
 *   setting = 2).
 *   A value of 0 implies the default setting (currently 2).
 *
 * joinTimeout
 *   is an overall timer in milliseconds on the entire Join attempt. If
 *   the node has not achieved full cluster membership in this time, the
 *   attempt is abandoned.
 *
 * Returns:
 *
 * MM_OK        Success; cluster joined. The CM is then safe to
 *              assign ownership to cluster-owned devices on this
 *              node, and to start failover/failback processing.
 *
 *              Note: this routine establishes cluster membership.
 *              However, it is usually inadvisable to start high
 *              level CM failbacks immediately, because other
 *              cluster members are often still joining. The CM
 *              should typically wait a while to see whether other
 *              nodes arrive in the cluster soon.
 *
 * MM_ALREADY   The node is already a cluster member. This can
 *              happen if a node reboots (or a CM is restarted)
 *              and rejoins even before the cluster determines
 *              that it has disappeared. The CM should Leave and
 *              reJoin.
 *
 * MM_FAULT     Permanent failure; something is very bad: the
 *              node# is duplicated; some parameter is some
 *              entirely illegal value. The CM is in deep weeds.
 *
 * MM_TRANSIENT Transient failure. The cluster state changed
 *              during the operation (eg a node left the cluster).
 *              The operation should be retried.
 *
 * MM_TIMEOUT   Timeout; cluster membership not achieved in time.
 *
 *
 * more
 * TBD
 *
 ************************************************************************/
DWORD MMJoin(
    IN DWORD joiningNode,
    IN DWORD clockPeriod,
    IN DWORD sendHBRate,
    IN DWORD rcvHBRate,
    IN DWORD joinTimeout
    )
{
    node_t my_reloadee_num = INT_NODE(joiningNode); // internal node #
    rgp_msgbuf rgpbuf;                              // buffer to send messages
    node_t i;
    rgpinfo_t rgpinfo;
    int status;
    BOOL joinfailed = FALSE;
    uint32 myseqnum;
#if defined(TDM_DEBUG)
    int randNode1, randNode2;
#endif
#if defined(INCONSISTENT_REGROUP_IGNORE_JOINER)
    extern int IgnoreJoinerNodeUp;
#endif

    if ( my_reloadee_num >= (node_t) rgp->num_nodes )
        return MM_FAULT;

    //
    // If the caller is the joining node then we assume this is the
    // first member of the cluster.
    //
    if ( my_reloadee_num == rgp->mynode )
    {
        //
        // Set clockPeriod into the regroup information.
        //
        do {
            status = rgp_getrgpinfo( &rgpinfo );
        } while ( status == -1 /* regroup is perturbed */ );
        rgpinfo.a_tick = (uint16) clockPeriod;
        rgpinfo.iamalive_ticks = (uint16) sendHBRate;
        rgpinfo.check_ticks = (uint16) rcvHBRate;
        rgpinfo.Min_Stage1_ticks = (uint16) (sendHBRate * rcvHBRate);
        if ( rgp_setrgpinfo( &rgpinfo ) == -1 )
            RGP_ERROR( RGP_INTERNAL_ERROR ); // for now??
        //
        // Regroup can now start monitoring
        //
        rgp_start( MMiNodeDownCallback, RGP_NULL_PTR );
        MmSetRegroupAllowed(TRUE);
        return MM_OK;
    }

    //
    // Not the first system up.
    //
    if ( (rgp->node_states[my_reloadee_num].status == RGP_NODE_ALIVE) ||
         (rgp->node_states[my_reloadee_num].status == RGP_NODE_COMING_UP) )
        return MM_ALREADY;

    RGP_LOCK;
    myseqnum = rgp->rgppkt.seqno; // save rgp seq number to check for new rgp incident
    //
    // If regroup is perturbed wait until it stabilizes.
    //
    while ( rgp_is_perturbed() )
    {
        RGP_UNLOCK;
        Sleep( 1 ); // wait a millisecond
        if ( joinTimeout == 0 ) // joinTimeout is unsigned; test before decrementing
            return MM_TIMEOUT;
        joinTimeout--;
        RGP_LOCK;
        myseqnum = rgp->rgppkt.seqno;
    }
    RGP_UNLOCK;

    //
    // First, we must tell all running nodes about the reloadee.
    //
    rgpbuf.event = RGP_EVT_ADD_NODE;
    rgpbuf.data.node = (node_t)joiningNode;
#if defined(TDM_DEBUG)
    randNode1 = rand() % MAX_CLUSTER_SIZE;
    randNode2 = rand() % MAX_CLUSTER_SIZE;
#endif
    for ( i = 0; i < (node_t) rgp->num_nodes; i++ )
    {
#if defined(TDM_DEBUG)
        if (rgp->OS_specific_control.debug.MyTestPoints.TestPointBits.joinfailADD)
        {
            if ((node_t) randNode1 == i)
                rgp_event_handler(RGP_EVT_LATEPOLLPACKET, (node_t) randNode2);
        }
#endif
        if (myseqnum != rgp->rgppkt.seqno)
        {
            joinfailed = TRUE;
            break;
        }
        else if ( rgp->node_states[i].status == RGP_NODE_ALIVE )
        {
            if ( i == rgp->mynode )
                rgp_receive_events( &rgpbuf ); // take the quick route
            else
            {
#if defined(INCONSISTENT_REGROUP_ADD_FAILED)
                if (i != my_reloadee_num) {
                    joinfailed = TRUE;
                    break;
                }
#endif
                status = ClSend( EXT_NODE(i), (void *)&rgpbuf, sizeof(rgp_msgbuf), joinTimeout );
                if ( status )
                {
                    RGP_TRACE( "ClSend failed to send Add Node msg",
                               rgp->rgppkt.stage,
                               (uint32) EXT_NODE(i),
                               (uint32) status,
                               0 );
                    joinfailed = TRUE;
                    break;
                }
            }
        }
    }
    if (joinfailed)
    {
        JoinNodeDelete( joiningNode );
        return MM_TRANSIENT;
    }

    //
    // Next, we must tell the reloadee to come up.
    //
    rgpbuf.event = RGP_EVT_SETRGPINFO;
    do {
        status = rgp_getrgpinfo( &rgpbuf.data.rgpinfo );
    } while ( status == -1 /* regroup is perturbed */ );
#if defined(INCONSISTENT_REGROUP_IGNORE_JOINER)
    IgnoreJoinerNodeUp = INCONSISTENT_REGROUP_IGNORE_JOINER;
#endif
    status = ClSend( EXT_NODE(my_reloadee_num), (void *)&rgpbuf, sizeof(rgp_msgbuf), joinTimeout );
    if ( status )
    {
        RGP_TRACE( "ClSend failed to send Set Regroup Info msg",
                   rgp->rgppkt.stage,
                   (uint32) EXT_NODE(my_reloadee_num),
                   (uint32) status,
                   0 );
        JoinNodeDelete( joiningNode );
        return MM_FAULT;
    }

    // Wait until the reloadee has sent us the first IamAlive message,
    // which changes the reloadee state to RGP_NODE_ALIVE.
    while (rgp->node_states[my_reloadee_num].status != RGP_NODE_ALIVE)
    {
        // The regroup messages will be handled by the message thread. This
        // thread has nothing to do until the reloadee comes alive.
        Sleep( 1 ); // snooze for 1 millisecond
        // Check if timeout exceeded
        if ( joinTimeout == 0 ) // unsigned; test before decrementing
        {
            // Reloadee hasn't started sending IamAlives. Tell all the nodes
            // to remove it.
            JoinNodeDelete( joiningNode );
            return MM_TIMEOUT;
        }
        joinTimeout--;
        if (myseqnum != rgp->rgppkt.seqno)
        {
            JoinNodeDelete( joiningNode );
            return MM_TRANSIENT;
        }
    }

    //
    // Next, we must tell all running nodes that the reloadee is up.
    //
    rgpbuf.event = RGP_EVT_MONITOR_NODE;
    rgpbuf.data.node = (node_t)joiningNode;
    for ( i = 0; i < (node_t) rgp->num_nodes; i++ )
    {
#if defined(TDM_DEBUG)
        if (rgp->OS_specific_control.debug.MyTestPoints.TestPointBits.joinfailMON)
        {
            if ((node_t) randNode1 == i)
                rgp_event_handler(RGP_EVT_LATEPOLLPACKET, (node_t) randNode2);
        }
#endif
        if (myseqnum != rgp->rgppkt.seqno)
        {
            joinfailed = TRUE;
            break;
        }
        else if ( rgp->node_states[i].status == RGP_NODE_ALIVE )
        {
            if ( i == rgp->mynode )
                rgp_receive_events( &rgpbuf ); // take the quick route
            else
            {
#if defined(INCONSISTENT_REGROUP_MONITOR_FAILED)
                if (i != my_reloadee_num) {
                    joinfailed = TRUE;
                    break;
                }
#endif
                status = ClSend( EXT_NODE(i), (void *)&rgpbuf, sizeof(rgp_msgbuf), joinTimeout );
                if ( status )
                {
                    RGP_TRACE( "ClSend failed to send Monitor Node msg",
                               rgp->rgppkt.stage,
                               (uint32) EXT_NODE(i),
                               (uint32) status,
                               0 );
                    joinfailed = TRUE;
                    break;
                }
            }
        }
    }
    if (joinfailed)
    {
        JoinNodeDelete( joiningNode );
        return MM_TRANSIENT;
    }

    //
    // Finally, we must tell the reloadee that reload is complete.
    //
    rgpbuf.event = RGP_EVT_START;
    rgpbuf.data.node = (node_t)joiningNode;
    status = ClSend( EXT_NODE(my_reloadee_num), (void *)&rgpbuf, sizeof(rgp_msgbuf), joinTimeout );
    if ( status )
    {
        RGP_TRACE( "ClSend failed to send Start msg",
                   rgp->rgppkt.stage,
                   (uint32) EXT_NODE(my_reloadee_num),
                   (uint32) status,
                   0 );
        JoinNodeDelete( joiningNode );
        return MM_FAULT;
    }
    return MM_OK;
}
/************************************************************************
 *
 * MMLeave
 * =======
 *
 * Description:
 * This function causes the current node to leave the active cluster (go to
 * the Down state). The node no longer sends Regroup or heartbeat packets to
 * other cluster members. A NodeDown event will not be generated in this
 * node. A Regroup is triggered in the remaining nodes (if this node was a
 * member of the cluster). A node-down callback will occur on all remaining
 * cluster members.
 *
 * This initiates a clean, voluntary, leave operation. For safety, prior to
 * this, the calling node's CM should arrange to lose ownership of all
 * cluster-owned devices assigned to this node (and so cause failovers, etc).
 *
 * This routine returns normally. The caller (the CM) should then shutdown
 * the cluster. MMShutdown or MMHalt may occur after this call, or
 * the node may be re-joined to the cluster. All apply-to-the-PCM-to-join
 * attempts by a node must be preceded by a call to MMLeave().
 *
 * This routine may block.
 *
 *
 * Parameters:
 *   -
 *
 * Returns:
 *
 * MM_OK : Elvis has left the cluster (but has been reportedly
 *         sighted on numerous occasions).
 *
 * MM_NOTMEMBER : the node is not currently a cluster member.
 *
 ************************************************************************/
DWORD MMLeave( void )
{
    if (!rgp) {
        ClRtlLogPrint(LOG_UNUSUAL,
            "[MM] MMLeave is called when rgp=NULL.\n");
        return MM_FAULT;
    }
    if (! ClusterMember (rgp->OS_specific_control.CPUUPMASK, rgp->mynode) )
        return MM_NOTMEMBER;
    RGP_LOCK; // to ensure that we don't send in response to an incoming pkt
    rgp_event_handler (MM_EVT_LEAVE, EXT_NODE(rgp->mynode));
    rgp_cleanup();
    rgp_cleanup_OS();
    RGP_UNLOCK;
    return MM_OK;
}
DWORD MMForceRegroup( IN DWORD NodeId )
{
    if (! ClusterMember (rgp->OS_specific_control.CPUUPMASK, (node_t)NodeId) )
    {
        ClRtlLogPrint(LOG_CRITICAL,
            "[MM] MMForceRegroup: NodeId %1!u! is not a clustermember\r\n",
            NodeId);
        return MM_NOTMEMBER;
    }
    rgp_event_handler(RGP_EVT_LATEPOLLPACKET, (node_t)NodeId);
    return MM_OK;
}
/************************************************************************
 *
 * MMNodeUnreachable
 * =================
 *
 * Description:
 *
 * This should be called by the CM's messaging module when a node
 * becomes unreachable from this node via all paths. It allows quicker
 * detection of failures, but is otherwise equivalent to discovering
 * that the node has disappeared as a result of lost heartbeats.
 *
 * Parameters:
 *
 * node -
 *   specifies the node that is unreachable.
 *
 * Returns:
 *
 * Always MM_OK
 *
 ************************************************************************/
DWORD MMNodeUnreachable( DWORD node )
{
    rgp_event_handler( RGP_EVT_NODE_UNREACHABLE, (node_t) node );
    return MM_OK;
}

/************************************************************************
 *
 * MMPowerOn
 * =========
 *
 * Description:
 *
 * This routine is used on systems which support power-fail
 * ride-throughs. When power is restored, this function should be
 * called by the CM (on each node).
 *
 * Power-on normally occurs on multiple nodes at about the same time.
 * This routine temporarily changes the cluster integrity handling so
 * that the cluster can better survive transient loss of heartbeats
 * which accompany power-fail; in normal cases, the cluster will
 * survive power-fails without cluster members being
 * evicted because of lack of timely response.
 *
 * Parameters:
 *
 * None.
 *
 * Returns:
 *
 * Always MM_OK
 *
 ************************************************************************/
DWORD MMPowerOn( void )
{
    rgp_event_handler( RGP_EVT_POWERFAIL, EXT_NODE(rgp->mynode) );
    return MM_OK;
}
/************************************************************************
 *
 * MMClusterInfo
 * =============
 *
 * Description:
 *
 * Returns the current cluster information.
 *
 * This can be called in nodes which are not members of the cluster;
 * such calls always return NumActiveNodes = 0, because Down nodes
 * have no knowledge of current cluster membership.
 *
 * Parameters:
 *
 * clinfo
 *   pointer to CLUSTERINFO structure that receives the cluster
 *   information.
 *
 * Returns:
 *
 * Always MM_OK
 *
 ************************************************************************/
DWORD
MMClusterInfo(
    OUT LPCLUSTERINFO clinfo
    )
{
    node_t i, j;
    cluster_t MyCluster;

    RGP_LOCK;
    clinfo->clockPeriod = rgp->rgpinfo.a_tick;
    clinfo->sendHBRate = rgp->rgpinfo.iamalive_ticks;
    clinfo->rcvHBRate = rgp->rgpinfo.check_ticks;
    ClusterCopy(MyCluster, rgp->OS_specific_control.CPUUPMASK);
    RGP_UNLOCK;
    for ( i = 0, j = 0; i < MAX_CLUSTER_SIZE; i++ )
    {
        if ( ClusterMember (MyCluster, i) )
        {
            if (clinfo->UpNodeList != RGP_NULL_PTR)
                clinfo->UpNodeList[j] = (DWORD)i;
            j++;
        }
    }
    clinfo->NumActiveNodes = j;
    return MM_OK;
}
/************************************************************************
 *
 * MMShutdown
 * ==========
 *
 * Description:
 * This shuts down the MM and Regroup services. Prior to this, the node should
 * voluntarily have left the cluster. Following this, all membership services
 * are non-functional; no further MM call may occur.
 *
 * THIS CALL MUST BE PRECEDED BY INCOMING MESSAGE CALLBACK SHUTDOWN.
 *
 * Parameters:
 *   None.
 *
 * Returns:
 *   None.
 *
 ************************************************************************/
void MMShutdown (void)
{
    rgp_cleanup();
    rgp_cleanup_OS();

    // terminate timer thread
    rgp->rgpinfo.a_tick = 0; // special value indicates exit request
    SetEvent( rgp->OS_specific_control.TimerSignal ); // wake up timer thread

    // wait for timer thread to exit; clean up associated handles for good measure
    WaitForSingleObject( rgp->OS_specific_control.TimerThread, INFINITE );
    rgp->OS_specific_control.TimerThread = 0;
    if ( rgp->OS_specific_control.RGPTimer ) {
        CloseHandle ( rgp->OS_specific_control.RGPTimer );
        rgp->OS_specific_control.RGPTimer = 0;
    }
    if ( rgp->OS_specific_control.TimerSignal ) {
        CloseHandle ( rgp->OS_specific_control.TimerSignal );
        rgp->OS_specific_control.TimerSignal = 0;
    }

#if !defined (TDM_DEBUG)
    //
    // Uninitialize message system
    //
    ClMsgCleanup();
#endif // TDM_DEBUG

    // delete regroup's critical section object
    DeleteCriticalSection( &rgp->OS_specific_control.RgpCriticalSection );

    // delete calloc'd space
    free (rgp->rgp_msgsys_p);
    free (rgp);
    rgp = NULL;
}
  882. /************************************************************************
  883. *
  884. * MMEject
  885. * =======
  886. *
  887. * Description:
  888. *
  889. * This function causes the specified node to be ejected from the
  890. * active cluster. The targeted node will be sent a poison packet and
  891. * will enter its MMHalt code. A Regroup incident will be initiated. A
  892. * node-down callback will occur on all remaining cluster members.
  893. *
  894. * Note that the targeted node is Downed before that node has
  895. * a chance to call any remove-ownership or voluntary failover code. As
  896. * such, this is very dangerous. This call is provided only as a last
  897. * resort in removing an insane node from the cluster; normal removal
  898. * of a node from the cluster should occur by CM-CM communication,
  899. * followed by the node itself doing a voluntary Leave on itself.
  900. *
  901. * This routine returns when the node has been told to die. Completion
  902. * of the removal occurs asynchronously, and a NodeDown event will be
  903. * generated when successful.
  904. *
  905. * This routine may block.
  906. *
  907. * Parameters:
  908. *
  909. * Node Number.
  910. *
  911. * Returns:
  912. *
  913. * MM_OK : The node has been told to leave the cluster.
  914. *
  915. * MM_NOTMEMBER : the node is not currently a cluster member.
  916. *
  917. * MM_TRANSIENT : My node state is in transition. OK to retry.
  918. *
  919. ************************************************************************/
  920. DWORD MMEject( IN DWORD node )
  921. {
  922. int i;
  923. RGP_LOCK;
  924. if (! ClusterMember (
  925. rgp->OS_specific_control.CPUUPMASK,
  926. (node_t) INT_NODE(node))
  927. )
  928. {
  929. RGP_UNLOCK;
  930. ClRtlLogPrint(LOG_UNUSUAL,
  931. "[MM] MmEject failed. %1!u! is not a member of %2!04X!.\n",
  932. node, rgp->OS_specific_control.CPUUPMASK
  933. );
  934. return MM_NOTMEMBER;
  935. }
  936. if ( !ClusterMember (
  937. rgp->outerscreen,
  938. INT_NODE(node) )
  939. || ClusterMember(rgp->OS_specific_control.Banished, INT_NODE(node) )
  940. )
  941. {
  942. int perturbed = rgp_is_perturbed();
  943. RGP_UNLOCK;
  944. if (perturbed) {
  945. ClRtlLogPrint(LOG_UNUSUAL,
  946. "[MM] MMEject: %1!u!, banishing is already in progress.\n",
  947. node
  948. );
  949. } else {
  950. ClRtlLogPrint(LOG_UNUSUAL,
  951. "[MM] MmEject: %1!u! is already banished.\n",
  952. node
  953. );
  954. }
  955. return MM_OK;
  956. }
  957. //
  958. // Adding a node to a rgp->OS_specific_control.Banished mask
  959. // will cause us to send a poison packet as a reply to any
  960. // regroup packet coming from the banishee
  961. //
  962. ClusterInsert(rgp->OS_specific_control.Banished, (node_t)INT_NODE(node));
  963. if ( !ClusterMember(rgp->ignorescreen, (node_t)INT_NODE(node)) ) {
  964. //
  965. // It doesn't matter in what stage of the regroup
  966. // we are. If the node needs to be banished we have to
  967. // initiate a new regroup
  968. //
  969. rgp_event_handler( RGP_EVT_BANISH_NODE, (node_t) node );
  970. RGP_UNLOCK;
  971. } else {
  972. RGP_UNLOCK;
  973. ClRtlLogPrint(LOG_UNUSUAL,
  974. "[MM] MmEject: %1!u! is already being ignored.\n",
  975. node
  976. );
  977. }
  978. RGP_TRACE( "RGP Poison sent ", node, 0, 0, 0 );
  979. fflush( stdout );
  980. //
  981. // Send 3 poison packets with half a second interval in between.
  982. // We hope that at least one of them will get through
  983. //
  984. ClusnetSendPoisonPacket( NmClusnetHandle, node );
  985. Sleep(500);
  986. ClusnetSendPoisonPacket( NmClusnetHandle, node );
  987. Sleep(500);
  988. ClusnetSendPoisonPacket( NmClusnetHandle, node );
  989. return MM_OK;
  990. }
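MMEject converts the caller's node number through INT_NODE before any mask operation, and rgp_send converts back with EXT_NODE. A plausible sketch of that conversion, assuming external node numbers are 1-based and internal indices 0-based (the real macros live in the MM headers; the fixed offset of 1 is an assumption):

```c
#include <assert.h>

/* Hypothetical restatement of the INT_NODE/EXT_NODE macros, assuming
 * a fixed offset of 1 between the two numbering schemes. */
#define INT_NODE(ext) ((ext) - 1)   /* external (1-based) -> internal index */
#define EXT_NODE(in)  ((in) + 1)    /* internal index -> external number    */
```

Keeping all mask operations in internal (0-based) space lets node numbers double as bit positions, which is exactly why the module is built on small node numbers.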
  991. /************************************************************************
  992. * MMIsNodeUp
  993. * ==========
  994. *
  995. *
  996. * Returns true iff the node is a member of the current cluster.
  997. *
  998. * *** debugging and test only.
  999. *
  1000. * Parameters:
  1001. * Node Number of interest.
  1002. *
  1003. * Returns:
  1004. * TRUE if Node is member of cluster else FALSE.
  1005. *
  1006. ************************************************************************/
  1007. BOOL MMIsNodeUp(IN DWORD node)
  1008. {
  1009. return (ClusterMember(
  1010. rgp->OS_specific_control.CPUUPMASK,
  1011. (node_t) INT_NODE(node)
  1012. )
  1013. );
  1014. }
  1015. /************************************************************************
  1016. *
  1017. * MMDiag
  1018. * ======
  1019. *
  1020. * Description:
  1021. *
  1022. * Handles "diagnostic" messages. Some of these messages will
  1023. * have responses that are returned. This function is typically
  1024. * called by the Cluster Manager with connection oriented
  1025. * messages from CLI.
  1026. *
  1027. *
  1028. * Parameters:
  1029. *
  1030. * messageBuffer
  1031. * (IN) pointer to a buffer that contains the diagnostic message.
  1032. * (OUT) response to the diagnostic message
  1033. *
  1034. * maximumLength
  1035. * maximum number of bytes to return in messageBuffer
  1036. *
  1037. * ActualLength
  1038. * (IN) length of diagnostic message
  1039. * (OUT) length of response
  1040. *
  1041. * Returns:
  1042. *
  1043. * Always MM_OK
  1044. *
  1045. ************************************************************************/
  1046. DWORD
  1047. MMDiag(
  1048. IN OUT LPCSTR messageBuffer, // Diagnostic message
  1049. IN DWORD maximumLength, // maximum size of buffer to return
  1050. IN OUT LPDWORD ActualLength // length of messageBuffer going in and coming out
  1051. )
  1052. {
  1053. // ??? need to return info in the future
  1054. rgp_receive_events( (rgp_msgbuf *)messageBuffer );
  1055. return MM_OK;
  1056. }
  1057. /************************************************************************
  1058. *
  1059. * rgp_receive_events
  1060. * ==================
  1061. *
  1062. * Description:
  1063. *
  1064. * This routine is called from MMDiag and from the Cluster Manager
  1065. * message thread (via our callback) to handle regroup messages
  1066. * and diagnostic messages.
  1067. *
  1068. * Parameters:
  1069. *
  1070. * rgpbuf
  1071. * the message that needs to be handled.
  1072. *
  1073. * Returns:
  1074. *
  1075. * none
  1076. *
  1077. ************************************************************************/
  1078. void
  1079. rgp_receive_events(
  1080. IN rgp_msgbuf *rgpbuf
  1081. )
  1082. {
  1083. int event;
  1084. rgpinfo_t rgpinfo;
  1085. poison_pkt_t poison_pkt; /* poison packet sent from stack */
  1086. DWORD status;
  1087. #if defined(TDM_DEBUG)
  1088. extern BOOL GUIfirstTime;
  1089. extern HANDLE gGUIEvent;
  1090. #endif
  1091. event = rgpbuf->event;
  1092. #if defined(TDM_DEBUG)
  1093. if ( (rgp->OS_specific_control.debug.frozen) && (event != RGP_EVT_THAW) )
  1094. return; /* don't do anything if the node is frozen */
  1095. #endif
  1096. if ( event == RGP_EVT_RECEIVED_PACKET )
  1097. {
  1098. //
  1099. // Go handle the regroup packet.
  1100. //
  1101. rgp_received_packet(rgpbuf->data.node,
  1102. (void *) &(rgpbuf->unseq_pkt),
  1103. sizeof(rgpbuf->unseq_pkt) );
  1104. }
  1105. else if (event < RGP_EVT_FIRST_DEBUG_EVENT)
  1106. {
  1107. //
  1108. // "regular" regroup message
  1109. //
  1110. rgp_event_handler(event, rgpbuf->data.node);
  1111. }
  1112. else
  1113. {
  1114. //
  1115. // Debugging message
  1116. //
  1117. RGP_TRACE( "RGP Debug event ", event, rgpbuf->data.node, 0, 0 );
  1118. switch (event)
  1119. {
  1120. case RGP_EVT_START :
  1121. {
  1122. rgp_start( MMiNodeDownCallback, RGP_NULL_PTR );
  1123. break;
  1124. }
  1125. case RGP_EVT_ADD_NODE :
  1126. {
  1127. rgp_add_node( rgpbuf->data.node );
  1128. break;
  1129. }
  1130. case RGP_EVT_MONITOR_NODE :
  1131. {
  1132. rgp_monitor_node( rgpbuf->data.node );
  1133. break;
  1134. }
  1135. case RGP_EVT_REMOVE_NODE :
  1136. {
  1137. rgp_remove_node( rgpbuf->data.node );
  1138. break;
  1139. }
  1140. case RGP_EVT_GETRGPINFO :
  1141. {
  1142. rgp_getrgpinfo( &rgpinfo );
  1143. RGP_TRACE( "RGP GetRGPInfo ",
  1144. rgpinfo.version, /* TRACE */
  1145. rgpinfo.seqnum, /* TRACE */
  1146. rgpinfo.iamalive_ticks, /* TRACE */
  1147. GetCluster( rgpinfo.cluster ) ); /* TRACE */
  1148. break;
  1149. }
  1150. case RGP_EVT_SETRGPINFO :
  1151. {
  1152. rgp_setrgpinfo( &(rgpbuf->data.rgpinfo) );
  1153. /* This event is traced in rgp_setrgpinfo(). */
  1154. break;
  1155. }
  1156. case RGP_EVT_HALT :
  1157. {
  1158. exit( 1 );
  1159. break;
  1160. }
  1161. #if defined(TDM_DEBUG)
  1162. case RGP_EVT_FREEZE :
  1163. {
  1164. rgp->OS_specific_control.debug.frozen = 1;
  1165. break;
  1166. }
  1167. case RGP_EVT_THAW :
  1168. {
  1169. rgp->OS_specific_control.debug.frozen = 0;
  1170. break;
  1171. }
  1172. case RGP_EVT_STOP_SENDING :
  1173. {
  1174. ClusterInsert( rgp->OS_specific_control.debug.stop_sending,
  1175. INT_NODE(rgpbuf->data.node) );
  1176. /* Generate a node unreachable event to indicate that
  1177. * we cannot send to this node.
  1178. */
  1179. rgp_event_handler( RGP_EVT_NODE_UNREACHABLE, rgpbuf->data.node );
  1180. break;
  1181. }
  1182. case RGP_EVT_RESUME_SENDING :
  1183. {
  1184. ClusterDelete(rgp->OS_specific_control.debug.stop_sending,
  1185. INT_NODE(rgpbuf->data.node));
  1186. break;
  1187. }
  1188. case RGP_EVT_STOP_RECEIVING :
  1189. {
  1190. ClusterInsert(rgp->OS_specific_control.debug.stop_receiving,
  1191. INT_NODE(rgpbuf->data.node));
  1192. break;
  1193. }
  1194. case RGP_EVT_RESUME_RECEIVING :
  1195. {
  1196. ClusterDelete(rgp->OS_specific_control.debug.stop_receiving,
  1197. INT_NODE(rgpbuf->data.node));
  1198. break;
  1199. }
  1200. case RGP_EVT_SEND_POISON :
  1201. {
  1202. poison_pkt.pktsubtype = RGP_UNACK_POISON;
  1203. poison_pkt.seqno = rgp->rgppkt.seqno;
  1204. poison_pkt.reason = rgp->rgppkt.reason;
  1205. poison_pkt.activatingnode = rgp->rgppkt.activatingnode;
  1206. poison_pkt.causingnode = rgp->rgppkt.causingnode;
  1207. ClusterCopy(poison_pkt.initnodes, rgp->initnodes);
  1208. ClusterCopy(poison_pkt.endnodes, rgp->endnodes);
  1209. rgp_send( rgpbuf->data.node, (char *)&poison_pkt, POISONPKTLEN );
  1210. break;
  1211. }
  1212. case RGP_EVT_STOP_TIMER_POPS :
  1213. {
  1214. rgp->OS_specific_control.debug.timer_frozen = 1;
  1215. break;
  1216. }
  1217. case RGP_EVT_RESUME_TIMER_POPS :
  1218. {
  1219. rgp->OS_specific_control.debug.timer_frozen = 0;
  1220. break;
  1221. }
  1222. case RGP_EVT_RELOAD :
  1223. {
  1224. if (rgp->OS_specific_control.debug.reload_in_progress)
  1225. {
  1226. RGP_TRACE( "RGP Rld in prog ", 0, 0, 0, 0 );
  1227. return;
  1228. }
  1229. rgp->OS_specific_control.debug.reload_in_progress = 1;
  1230. if (rgpbuf->data.node == RGP_NULL_NODE)
  1231. {
  1232. RGP_TRACE( "RGP Invalid join parms ", -1, 0, 0, 0 );
  1233. return;
  1234. // Not supported since this server doesn't know which ones
  1235. // are currently running.
  1236. /* Reload all down nodes */
  1237. //for (i = 0; i < rgp->num_nodes; i++)
  1238. //MMJoin( EXT_NODE(i), 0 /*use default*/, -1 /*???*/ );
  1239. }
  1240. else
  1241. {
  1242. /* Reload the specified node */
  1243. status = MMJoin( rgpbuf->data.node /* joiningNode */,
  1244. 0 /* use default clockPeriod */,
  1245. 0 /* use default sendHBRate */,
  1246. 0 /* use default rcvHBRate */,
  1247. 500 /*millisecond timeout*/ );
  1248. if ( status != MM_OK )
  1249. {
  1250. RGP_TRACE( "RGP Join Failed ",
  1251. rgpbuf->data.node,
  1252. status, 0, 0 );
  1253. Sleep( 1000 ); // stabilize regroup for reload case - testing purposes
  1254. }
  1255. }
  1256. rgp->OS_specific_control.debug.reload_in_progress = 0;
  1257. break;
  1258. }
  1259. case RGP_EVT_TRACING :
  1260. {
  1261. rgp->OS_specific_control.debug.doing_tracing =
  1262. ( rgpbuf->data.node ? 1 : 0 );
  1263. if (!rgp->OS_specific_control.debug.doing_tracing)
  1264. {
  1265. GUIfirstTime = TRUE;
  1266. SetEvent( gGUIEvent );
  1267. }
  1268. break;
  1269. }
  1270. #endif // TDM_DEBUG
  1271. case RGP_EVT_INFO:
  1272. // nop for now
  1273. break;
  1274. case MM_EVT_LEAVE:
  1275. status = MMLeave(); // (self) leave cluster
  1276. break;
  1277. case MM_EVT_EJECT:
  1278. status = MMEject (rgpbuf->data.node); // eject other node
  1279. break;
  1280. #if defined(TDM_DEBUG)
  1281. case MM_EVT_INSERT_TESTPOINTS:
  1282. rgp->OS_specific_control.debug.MyTestPoints.TestPointWord =
  1283. rgpbuf->data.node;
  1284. break;
  1285. #endif
  1286. default :
  1287. {
  1288. RGP_TRACE( "RGP Unknown evt ", event, 0, 0, 0 );
  1289. break;
  1290. }
  1291. } /* end switch */
  1292. }
  1293. }
  1294. /************************************************************************
  1295. *
  1296. * rgp_send
  1297. * ========
  1298. *
  1299. * Description:
  1300. *
  1301. * This routine is called to send an unacknowledged message to
  1302. * the specified node.
  1303. *
  1304. * Parameters:
  1305. *
  1306. * node
  1307. * node number to send the message to.
  1308. *
  1309. * data
  1310. * pointer to the data to send
  1311. *
  1312. * datasize
  1313. * number of bytes to send
  1314. *
  1315. * Returns:
  1316. *
  1317. * none.
  1318. *
  1319. ************************************************************************/
  1320. void
  1321. rgp_send(
  1322. IN node_t node,
  1323. IN void *data,
  1324. IN int datasize
  1325. )
  1326. {
  1327. rgp_msgbuf rgpbuf;
  1328. DWORD status;
  1329. if (rgp->node_states[rgp->mynode].status != RGP_NODE_ALIVE)
  1330. return; // suppress sending if we're not alive
  1331. #if defined(TDM_DEBUG)
  1332. if ( ClusterMember( rgp->OS_specific_control.debug.stop_sending,
  1333. INT_NODE(node) ) )
  1334. return; /* don't send to this node */
  1335. #endif
  1336. rgpbuf.event = RGP_EVT_RECEIVED_PACKET;
  1337. rgpbuf.data.node = EXT_NODE(rgp->mynode);
  1338. memmove( &(rgpbuf.unseq_pkt), data, datasize);
  1339. switch (rgpbuf.unseq_pkt.pktsubtype) {
  1340. case RGP_UNACK_REGROUP :
  1341. status = ClMsgSendUnack( node, (void *)&rgpbuf, sizeof(rgp_msgbuf) );
  1342. if ( status && (status != WSAENOTSOCK) )
  1343. {
  1344. RGP_TRACE( "ClMsgSendUnack failed",
  1345. rgp->rgppkt.stage,
  1346. (uint32) node,
  1347. (uint32) status,
  1348. 0 );
  1349. fflush(stdout);
  1350. }
  1351. break;
  1352. case RGP_UNACK_IAMALIVE :
  1353. break;
  1354. case RGP_UNACK_POISON :
  1355. RGP_TRACE( "RGP Poison sent ", node, 0, 0, 0 );
  1356. fflush( stdout );
  1357. ClusnetSendPoisonPacket( NmClusnetHandle, node );
  1358. break;
  1359. default :
  1360. break;
  1361. }
  1362. }
  1363. /************************************************************************
  1364. *
  1365. * SetMulticastReachable
  1366. * =====================
  1367. *
  1368. * Description:
  1369. *
  1370. * This routine is called by message.c to update
  1371. * the set of nodes that are reachable through multicast.
  1372. *
  1373. * Parameters:
  1374. *
  1375. * none
  1376. *
  1377. * Returns:
  1378. *
  1379. * none
  1380. *
  1381. ************************************************************************/
  1382. void SetMulticastReachable(uint32 mask)
  1383. {
  1384. *(PUSHORT)rgp->OS_specific_control.MulticastReachable = (USHORT)mask;
  1385. }
  1386. /************************************************************************
  1387. *
  1388. * rgp_msgsys_work
  1389. * ===============
  1390. *
  1391. * Description:
  1392. *
  1393. * This routine is called by the regroup engine to broadcast
  1394. * messages.
  1395. *
  1396. * Parameters:
  1397. *
  1398. * none
  1399. *
  1400. * Returns:
  1401. *
  1402. * none
  1403. *
  1404. ************************************************************************/
  1405. void
  1406. rgp_msgsys_work( )
  1407. {
  1408. node_t i;
  1409. do /* do while more regroup work to do */
  1410. {
  1411. if (rgp->rgp_msgsys_p->sendrgppkts)
  1412. { /* broadcast regroup packets */
  1413. rgp->rgp_msgsys_p->sendrgppkts = 0;
  1414. if ( ClusterNumMembers(rgp->OS_specific_control.MulticastReachable) >= 1)
  1415. {
  1416. cluster_t tmp;
  1417. ClusterCopy(tmp, rgp->rgp_msgsys_p->regroup_nodes);
  1418. ClusterDifference(rgp->rgp_msgsys_p->regroup_nodes,
  1419. rgp->rgp_msgsys_p->regroup_nodes,
  1420. rgp->OS_specific_control.MulticastReachable);
  1421. RGP_TRACE( "RGP Multicast",
  1422. GetCluster(rgp->OS_specific_control.MulticastReachable),
  1423. GetCluster(tmp),
  1424. GetCluster(rgp->rgp_msgsys_p->regroup_nodes),
  1425. 0);
  1426. rgp_send( 0,
  1427. rgp->rgp_msgsys_p->regroup_data,
  1428. rgp->rgp_msgsys_p->regroup_datalen
  1429. );
  1430. }
  1431. for (i = 0; i < (node_t) rgp->num_nodes; i++)
  1432. if (ClusterMember(rgp->rgp_msgsys_p->regroup_nodes, i))
  1433. {
  1434. ClusterDelete(rgp->rgp_msgsys_p->regroup_nodes, i);
  1435. RGP_TRACE( "RGP Unicast", EXT_NODE(i), 0,0,0);
  1436. rgp_send( EXT_NODE(i),
  1437. rgp->rgp_msgsys_p->regroup_data,
  1438. rgp->rgp_msgsys_p->regroup_datalen
  1439. );
  1440. }
  1441. } /* broadcast regroup packets */
  1442. if (rgp->rgp_msgsys_p->sendiamalives)
  1443. { /* broadcast iamalive packets */
  1444. rgp->rgp_msgsys_p->sendiamalives = 0;
  1445. for (i = 0; i < (node_t) rgp->num_nodes; i++)
  1446. if (ClusterMember(rgp->rgp_msgsys_p->iamalive_nodes, i))
  1447. {
  1448. ClusterDelete(rgp->rgp_msgsys_p->iamalive_nodes, i);
  1449. rgp_send( EXT_NODE(i),
  1450. rgp->rgp_msgsys_p->iamalive_data,
  1451. rgp->rgp_msgsys_p->iamalive_datalen
  1452. );
  1453. }
  1454. } /* broadcast iamalive packets */
  1455. if (rgp->rgp_msgsys_p->sendpoisons)
  1456. { /* send poison packets */
  1457. rgp->rgp_msgsys_p->sendpoisons = 0;
  1458. for (i = 0; i < (node_t) rgp->num_nodes; i++)
  1459. if (ClusterMember(rgp->rgp_msgsys_p->poison_nodes, i))
  1460. {
  1461. ClusterDelete(rgp->rgp_msgsys_p->poison_nodes, i);
  1462. rgp_send( EXT_NODE(i),
  1463. rgp->rgp_msgsys_p->poison_data,
  1464. rgp->rgp_msgsys_p->poison_datalen
  1465. );
  1466. }
  1467. } /* send poison packets */
  1468. } while ((rgp->rgp_msgsys_p->sendrgppkts) ||
  1469. (rgp->rgp_msgsys_p->sendiamalives) ||
  1470. (rgp->rgp_msgsys_p->sendpoisons)
  1471. );
  1472. }
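The multicast branch above removes the multicast-reachable nodes from the target set and then unicasts only to the remainder. A sketch of that set arithmetic with a 16-bit stand-in mask (the real code does this with `ClusterDifference` on `cluster_t`; the names here are stand-ins):

```c
#include <assert.h>

typedef unsigned short mask_t;   /* one bit per node; stand-in for cluster_t */

/* Equivalent of ClusterDifference(targets, targets, mcast_reachable):
 * nodes covered by the multicast get no unicast; return the mask of
 * nodes that still need one. */
static mask_t unicast_remainder(mask_t targets, mask_t mcast_reachable)
{
    return (mask_t)(targets & ~mcast_reachable);
}
```

This is why the loop after the multicast only touches nodes still present in `regroup_nodes`: the reachable ones were already deleted from the mask.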
  1473. DWORD
  1474. MMMapStatusToDosError(
  1475. IN DWORD MMStatus
  1476. )
  1477. {
  1478. DWORD dosStatus;
  1479. switch(MMStatus) {
  1480. case MM_OK:
  1481. dosStatus = ERROR_SUCCESS;
  1482. break;
  1483. case MM_TIMEOUT:
  1484. dosStatus = ERROR_TIMEOUT;
  1485. break;
  1486. case MM_TRANSIENT:
  1487. dosStatus = ERROR_RETRY;
  1488. break;
  1489. case MM_FAULT:
  1490. dosStatus = ERROR_INVALID_PARAMETER;
  1491. break;
  1492. case MM_ALREADY:
  1493. dosStatus = ERROR_SUCCESS;
  1494. break;
  1495. case MM_NOTMEMBER:
  1496. dosStatus = ERROR_CLUSTER_NODE_NOT_MEMBER;
  1497. break;
  default:
  // Previously dosStatus was returned uninitialized for unknown codes.
  dosStatus = ERROR_INVALID_PARAMETER;
  break;
  1498. }
  1499. return(dosStatus);
  1500. } // MMMapStatusToDosError
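The same status mapping can be expressed as a lookup table, which makes the "unknown status" fallback explicit. A self-contained sketch; the numeric `ERROR_*` values below are illustrative stand-ins, not necessarily the winerror.h values, and the enum names mirror but do not replace the real headers:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in status codes; the real MM_* and ERROR_* constants come
 * from the MM headers and winerror.h. */
enum { MM_OK, MM_TIMEOUT, MM_TRANSIENT, MM_FAULT, MM_ALREADY, MM_NOTMEMBER };
enum { ERROR_SUCCESS = 0, ERROR_INVALID_PARAMETER = 87,
       ERROR_RETRY = 1237, ERROR_TIMEOUT = 1460,
       ERROR_CLUSTER_NODE_NOT_MEMBER = 5900 /* stand-in value */ };

/* Table-driven equivalent of the switch in MMMapStatusToDosError. */
static const struct { int mm; int dos; } status_map[] = {
    { MM_OK,        ERROR_SUCCESS },
    { MM_TIMEOUT,   ERROR_TIMEOUT },
    { MM_TRANSIENT, ERROR_RETRY },
    { MM_FAULT,     ERROR_INVALID_PARAMETER },
    { MM_ALREADY,   ERROR_SUCCESS },
    { MM_NOTMEMBER, ERROR_CLUSTER_NODE_NOT_MEMBER },
};

static int map_status(int mm_status)
{
    size_t i;
    for (i = 0; i < sizeof(status_map) / sizeof(status_map[0]); i++)
        if (status_map[i].mm == mm_status)
            return status_map[i].dos;
    return ERROR_INVALID_PARAMETER;   /* unknown status */
}
```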
  1501. DWORD
  1502. MMMapHaltCodeToDosError(
  1503. IN DWORD HaltCode
  1504. )
  1505. {
  1506. DWORD dosStatus;
  1507. switch(HaltCode) {
  1508. case RGP_SHUTDOWN_DURING_RGP:
  1509. case RGP_RELOADFAILED:
  1510. dosStatus = ERROR_CLUSTER_MEMBERSHIP_INVALID_STATE;
  1511. break;
  1512. default:
  1513. dosStatus = ERROR_CLUSTER_MEMBERSHIP_HALT;
  1514. }
  1515. return(dosStatus);
  1516. } // MMMapHaltCodeToDosError
  1517. /* ---------------------------- */
  1518. DWORD MmSetRegroupAllowed( IN BOOL allowed )
  1519. /* This function can be used to allow/disallow regroup participation
  1520. * for the current node.
  1521. *
  1522. * Originally, regroup was allowed immediately after receiving the RGP_START
  1523. * event. Since this happens before the join is complete,
  1524. * the joiner can arbitrate and win, leaving
  1525. * the other side without a quorum device.
  1526. *
  1527. * It is required to add MmSetRegroupAllowed(TRUE) at the very end
  1528. * of the ClusterJoin. The node doesn't need to call MmSetRegroupAllowed(TRUE)
  1529. * for ClusterForm, since MMJoin will call
  1530. * MmSetRegroupAllowed(TRUE) for the cluster forming node
  1531. *
  1532. * MmSetRegroupAllowed(FALSE) can be used to disable regroup
  1533. * participation during shutdown.
  1534. *
  1535. *
  1536. * Errors:
  1537. *
  1538. * MM_OK : successful completion
  1539. *
  1540. * MM_TRANSIENT : disallowing regroup when regroup is in progress
  1541. *
  1542. * MM_ALREADY : node is already in the desired condition
  1543. *
  1544. *
  1545. */
  1546. {
  1547. DWORD status;
  1548. if (rgp) {
  1549. RGP_LOCK;
  1550. if (allowed) {
  1551. if (rgp->rgppkt.stage == RGP_COLDLOADED) {
  1552. rgp->rgppkt.stage = RGP_STABILIZED;
  1553. status = MM_OK;
  1554. } else {
  1555. status = MM_ALREADY;
  1556. }
  1557. } else {
  1558. if (rgp->rgppkt.stage == RGP_STABILIZED) {
  1559. rgp->rgppkt.stage = RGP_COLDLOADED;
  1560. status = MM_OK;
  1561. } else if (rgp->rgppkt.stage == RGP_COLDLOADED) {
  1562. status = MM_ALREADY;
  1563. } else {
  1564. //
  1565. // Regroup is already in progress. Kill this node.
  1566. //
  1567. RGP_ERROR(RGP_SHUTDOWN_DURING_RGP);
  1568. }
  1569. }
  1570. RGP_UNLOCK;
  1571. } else if (allowed) {
  1572. ClRtlLogPrint(LOG_UNUSUAL,
  1573. "[MM] SetRegroupAllowed(%1!u!) is called when rgp=NULL.\n",
  1574. allowed
  1575. );
  1576. status = MM_FAULT;
  1577. } else {
  1578. // if rgp is null and the caller wants to disable regroup.
  1579. status = MM_ALREADY;
  1580. }
  1581. return status;
  1582. }
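The allow/disallow logic above is a small two-state gate between RGP_COLDLOADED and RGP_STABILIZED. A sketch of just the transition rules, ignoring locking and the halt-on-shutdown-during-regroup path (stand-in enums; the real stage values come from the regroup headers):

```c
#include <assert.h>

enum stage { RGP_COLDLOADED, RGP_STABILIZED, RGP_PERTURBED };
enum { MM_OK, MM_ALREADY, MM_FAULT };

/* Sketch of the transition in MmSetRegroupAllowed: allow moves
 * COLDLOADED -> STABILIZED, disallow moves STABILIZED -> COLDLOADED,
 * and a no-op request reports MM_ALREADY. Any other stage means a
 * regroup is in progress (the real code halts the node there). */
static int set_regroup_allowed(enum stage *st, int allowed)
{
    if (allowed) {
        if (*st == RGP_COLDLOADED) { *st = RGP_STABILIZED; return MM_OK; }
        return MM_ALREADY;
    }
    if (*st == RGP_STABILIZED)  { *st = RGP_COLDLOADED; return MM_OK; }
    if (*st == RGP_COLDLOADED)  return MM_ALREADY;
    return MM_FAULT;   /* regroup in progress */
}
```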
  1583. DWORD MMSetQuorumOwner(
  1584. IN DWORD NodeId,
  1585. IN BOOL Block,
  1586. OUT PDWORD pdwSelQuoOwnerId
  1587. )
  1588. /*++
  1589. Routine Description:
  1590. Inform Membership engine about changes in ownership of
  1591. the quorum resource.
  1592. Arguments:
  1593. NodeId - Node number to be set as a quorum owner.
  1594. The code assumes that NodeId is either equal to MyNodeId,
  1595. in which case the current node is about to become the
  1596. quorum owner, or has the value MM_INVALID_NODE, when
  1597. the owner decides to relinquish quorum ownership.
  1598. Block - if the quorum owner needs to relinquish the
  1599. quorum immediately no matter what (RmTerminate, RmFail),
  1600. this parameter should be set to FALSE and to TRUE otherwise.
  1601. pdwSelQuoOwnerId - if this was invoked while a regroup was in progress
  1602. then this contains the id of the node that was chosen to
  1603. arbitrate for the quorum in that last regroup else it contains
  1604. MM_INVALID_NODE.
  1605. Return Value:
  1606. ERROR_SUCCESS - QuorumOwner variable is set to specified value
  1607. ERROR_RETRY - Regroup was in progress when this function
  1608. was called and regroup engine decision conflicts with current assignment.
  1609. Comments:
  1610. This function needs to be called before calls to
  1611. RmArbitrate, RmOnline, RmOffline, RmTerminate, RmFailResource
  1612. Depending on the result, the caller should either proceed with
  1613. Arbitrate/Online or Offline or return an error if MM_TRANSIENT is returned.
  1614. If Block is set to TRUE, the call will block until the end of the regroup if
  1615. the regroup was in progress at the moment of the call
  1616. */
  1617. {
  1618. DWORD MyNode;
  1619. if (pdwSelQuoOwnerId)
  1620. {
  1621. *pdwSelQuoOwnerId = MM_INVALID_NODE;
  1622. }
  1623. ClRtlLogPrint(LOG_NOISE,
  1624. "[MM] MmSetQuorumOwner(%1!u!,%2!u!), old owner %3!u!.\n", NodeId, Block, QuorumOwner
  1625. );
  1626. if (!rgp) {
  1627. // we are called on the form path before MM was initialized
  1628. QuorumOwner = NodeId;
  1629. return ERROR_SUCCESS;
  1630. }
  1631. MyNode = (DWORD)EXT_NODE(rgp->mynode);
  1632. RGP_LOCK
  1633. if ( !rgp_is_perturbed() ) {
  1634. QuorumOwner = NodeId;
  1635. RGP_UNLOCK;
  1636. return ERROR_SUCCESS;
  1637. }
  1638. //
  1639. // we have a regroup in progress
  1640. if (!Block) {
  1641. // caller doesn't want to wait //
  1642. ClRtlLogPrint(LOG_UNUSUAL,
  1643. "[MM] MmSetQuorumOwner: regroup is in progress, forcing the new value in.\n"
  1644. );
  1645. QuorumOwner = NodeId;
  1646. RGP_UNLOCK;
  1647. return ERROR_RETRY;
  1648. }
  1649. do {
  1650. if(rgp->OS_specific_control.ArbitrationInProgress && NodeId == MyNode ) {
  1651. // This is when MmSetQuorumOwner is called from within the regroup Arbitrate //
  1652. QuorumOwner = MyNode;
  1653. RGP_UNLOCK;
  1654. return ERROR_SUCCESS;
  1655. }
  1656. RGP_UNLOCK
  1657. ClRtlLogPrint(LOG_UNUSUAL,
  1658. "[MM] MmSetQuorumOwner: regroup is in progress, wait until it ends\n"
  1659. );
  1660. WaitForSingleObject(rgp->OS_specific_control.Stabilized, INFINITE);
  1661. RGP_LOCK
  1662. } while ( rgp_is_perturbed() );
  1663. // Now we are in the stabilized state with RGP_LOCK held //
  1664. // And we were blocked while regroup was in progress //
  1665. // somebody else might become an owner of the quorum //
  1666. // ArbitratingNode variable contains this information //
  1667. // or it has MM_INVALID_NODE if there was no arbitration during the regroup //
  1668. if (pdwSelQuoOwnerId)
  1669. {
  1670. *pdwSelQuoOwnerId = rgp->OS_specific_control.ArbitratingNode;
  1671. }
  1672. if (rgp->OS_specific_control.ArbitratingNode == MM_INVALID_NODE) {
  1673. // No arbitration was done during the last regroup
  1674. QuorumOwner = NodeId;
  1675. RGP_UNLOCK;
  1676. ClRtlLogPrint(LOG_UNUSUAL,
  1677. "[MM] MmSetQuorumOwner: no arbitration was done\n"
  1678. );
  1679. return ERROR_SUCCESS;
  1680. }
  1681. // Somebody arbitrated for the quorum
  1682. if (rgp->OS_specific_control.ArbitratingNode == MyNode
  1683. && NodeId == MM_INVALID_NODE) {
  1684. // We were asked to bring the quorum offline,
  1685. // but during the regroup, we were arbitrating and won the quorum.
  1686. // Let's fail offline request
  1687. RGP_UNLOCK;
  1688. ClRtlLogPrint(LOG_UNUSUAL,
  1689. "[MM] MmSetQuorumOwner: offline request denied\n"
  1690. );
  1691. return ERROR_RETRY;
  1692. } else if (rgp->OS_specific_control.ArbitratingNode != MyNode
  1693. && NodeId == MyNode ) {
  1694. // We were going to bring the quorum online, but
  1695. // during the regroup somebody else brought the disk
  1696. // online. Fail the online call in this case.
  1697. RGP_UNLOCK;
  1698. ClRtlLogPrint(LOG_UNUSUAL,
  1699. "[MM] MmSetQuorumOwner: online request denied, %1!u! has the quorum.\n",
  1700. rgp->OS_specific_control.ArbitratingNode
  1701. );
  1702. return ERROR_RETRY;
  1703. }
  1704. QuorumOwner = NodeId;
  1705. RGP_UNLOCK;
  1706. ClRtlLogPrint(LOG_UNUSUAL,
  1707. "[MM] MmSetQuorumOwner: new quorum owner is %1!u!.\n",
  1708. NodeId
  1709. );
  1710. return ERROR_SUCCESS;
  1711. }
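After the wait loop, MMSetQuorumOwner resolves the request against whichever node arbitrated during the regroup. That decision reduces to a small table; a sketch under stated assumptions (stand-in error values; the real ones come from winerror.h):

```c
#include <assert.h>

enum { ERROR_SUCCESS, ERROR_RETRY };   /* stand-in values */
#define MM_INVALID_NODE 0xFFFFFFFFu

/* Sketch of the post-regroup decision in MMSetQuorumOwner: given the
 * node that arbitrated (or MM_INVALID_NODE if none did), my node id,
 * and the requested new owner, decide whether the request stands. */
static int quorum_decision(unsigned arbitrator, unsigned mynode,
                           unsigned requested)
{
    if (arbitrator == MM_INVALID_NODE)
        return ERROR_SUCCESS;          /* no arbitration happened */
    if (arbitrator == mynode && requested == MM_INVALID_NODE)
        return ERROR_RETRY;            /* we won the quorum; deny offline */
    if (arbitrator != mynode && requested == mynode)
        return ERROR_RETRY;            /* someone else has it; deny online */
    return ERROR_SUCCESS;              /* request agrees with arbitration */
}
```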
  1712. DWORD MMGetArbitrationWinner(
  1713. OUT PDWORD NodeId
  1714. )
  1715. /*++
  1716. Routine Description:
  1717. Returns the node that won the arbitration during the last regroup
  1718. or MM_INVALID_NODE if there was no arbitration performed.
  1719. Arguments:
  1720. NodeId - a pointer to a variable that receives nodeid of
  1721. arbitration winner.
  1722. Return Value:
  1723. ERROR_SUCCESS - success
  1724. ERROR_RETRY - Regroup was in progress when this function
  1725. was called.
  1726. */
  1727. {
  1728. DWORD status;
  1729. CL_ASSERT(NodeId != 0);
  1730. RGP_LOCK
  1731. *NodeId = rgp->OS_specific_control.ArbitratingNode;
  1732. status = rgp_is_perturbed() ? ERROR_RETRY : ERROR_SUCCESS;
  1733. RGP_UNLOCK;
  1734. return status;
  1735. }
  1736. VOID MMBlockIfRegroupIsInProgress(
  1737. VOID
  1738. )
  1739. /*++
  1740. Routine Description:
  1741. The call will block if the regroup is in progress.
  1742. */
  1743. {
  1744. RGP_LOCK;
  1745. while ( rgp_is_perturbed() ) {
  1746. RGP_UNLOCK
  1747. ClRtlLogPrint(LOG_UNUSUAL,
  1748. "[MM] MMBlockIfRegroupIsInProgress: regroup is in progress, wait until it ends\n"
  1749. );
  1750. WaitForSingleObject(rgp->OS_specific_control.Stabilized, INFINITE);
  1751. RGP_LOCK;
  1752. }
  1753. RGP_UNLOCK;
  1754. }
  1755. VOID MMApproxArbitrationWinner(
  1756. OUT PDWORD NodeId
  1757. )
  1758. /*++
  1759. Routine Description:
  1760. Returns the node that won the arbitration during the last regroup
  1761. that was doing arbitration.
  1762. The call will block if the regroup is in progress.
  1763. Arguments:
  1764. NodeId - a pointer to a variable that receives nodeid of
  1765. arbitration winner.
  1766. Return Value:
  1767. none
  1768. */
  1769. {
  1770. if (!rgp) {
  1771. // we are called on the form path before MM was initialized
  1772. *NodeId = MM_INVALID_NODE;
  1773. return;
  1774. }
  1775. RGP_LOCK;
  1776. while ( rgp_is_perturbed() ) {
  1777. RGP_UNLOCK
  1778. ClRtlLogPrint(LOG_UNUSUAL,
  1779. "[MM] MMApproxArbitrationWinner: regroup is in progress, wait until it ends\n"
  1780. );
  1781. WaitForSingleObject(rgp->OS_specific_control.Stabilized, INFINITE);
  1782. RGP_LOCK;
  1783. }
  1784. // Now we are in the stabilized state with RGP_LOCK held //
  1785. *NodeId = rgp->OS_specific_control.ApproxArbitrationWinner;
  1786. RGP_UNLOCK;
  1787. }
  1788. VOID MMStartClussvcClusnetHb(
  1789. VOID
  1790. )
  1791. /*++
  1792. Routine Description:
  1793. This routine starts clussvc-to-clusnet heartbeating.
  1794. Arguments:
  1795. Return Value:
  1796. none
  1797. */
  1798. {
  1799. MmStartClussvcToClusnetHeartbeat = TRUE;
  1800. }
  1801. VOID MMStopClussvcClusnetHb(
  1802. VOID
  1803. )
  1804. /*++
  1805. Routine Description:
  1806. This routine stops clussvc-to-clusnet heartbeating.
  1807. Arguments:
  1808. Return Value:
  1809. none
  1810. */
  1811. {
  1812. MmStartClussvcToClusnetHeartbeat = FALSE;
  1813. }
  1814. #ifdef __cplusplus
  1815. }
  1816. #endif /* __cplusplus */
  1817. /* -------------------------- end ------------------------------- */