High Availability (Clustering) forum

Cluster Network Unavailable on one node

Hi,

We have a 5-node cluster. Each node has 4 NICs. NICs 1 and 4 are teamed and presented to Hyper-V, NIC 2 is for CSV and is in a separate VLAN, and NIC 3 is for Live Migration and is also in a separate VLAN (the cluster name and cluster IP are in the same VLAN as NIC 3, and the nodes communicate with them through this network).

OS: 2008 R2 Datacenter.

Before we started testing, it worked perfectly. Validation passed multiple times.

On the switch the nodes are connected to, we disabled the ports for NIC 3 on all nodes (the cluster name and cluster IP automatically went offline), and then we disabled the one switch port that was mapped to NIC 2 on Node 1.

As we expected, the cluster service on Node 1 went down.

Then we re-enabled the switch ports for NIC 2 and NIC 3 on Node 1 and tried to start the cluster service on Node 1, but it failed.

At the same time, in Failover Cluster Manager, the networks that represent NIC 2 and NIC 3 on Node 1 went from Failed to Unavailable state.

We re-enabled all the switch ports that had been disabled and started the cluster name and cluster IP; in Failover Cluster Manager, all networks on all nodes besides Node 1 went to Up state.

Ping and RDP to Node 1 through NIC 2 and NIC 3 worked. At that time we noticed that the cluster service on Node 1 was crashing.

The cluster node command on Node 1 stated that all other nodes were down and that Node 1 was joining.

The cluster node command on all other nodes stated that Node 1 was joining and all other nodes were Up.

We tried to start the cluster service on Node 1 with the /forcecluster and /ips switches, but that didn't solve the problem.
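
For reference, a minimal sketch of the equivalent checks in the FailoverClusters PowerShell module on 2008 R2 (the node name is a placeholder, and the exact service switches above are as we typed them, so treat them as unverified):

# Show how this node currently sees cluster membership (same data as the "cluster node" command):
Get-ClusterNode | Format-Table Name, State

# Force-start the cluster service on the stuck node, repairing quorum state (use with care):
Start-ClusterNode -Name Node1 -FixQuorum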

Node 1 reported an IP address conflict with the cluster IP (I guess it was trying to take control of this resource, which was online at that time?).

Then, about 2 hours after the problems started, NIC 2 and NIC 3 on Node 1 suddenly went to Up state in Failover Cluster Manager, without any intervention from our side.

Does anyone have any idea what happened? Is there some kind of timeout, after which Node 1 tried to communicate with the cluster resources again?

Any help would be appreciated. Thanks in advance.


Three-node SQL cluster across 2 sites

I would like to create a 3-node SQL cluster: one active and one passive node at one site, and a second passive node at another site. The second passive server will be on a different subnet, and the heartbeat will also be on a different network. What are the recommendations for this?
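
Not an answer, but for reference, a minimal sketch of the multi-subnet tuning usually discussed for this layout (the resource name and values are placeholders, not recommendations):

# Lower the DNS TTL on the SQL network name so clients re-resolve sooner after a cross-site failover:
Get-ClusterResource "SQL Network Name (SQLCLUS)" | Set-ClusterParameter -Name HostRecordTTL -Value 300

# Review the cross-subnet heartbeat settings:
Get-Cluster | Format-List *Subnet*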

2008R2 cluster to 2012 cluster

I asked this in the migration forum, but they pointed me to you.

I have read through the guide on migrating from 2003 or 2008 to 2008 R2. I assume the same applies for 2008 R2 to 2012.

http://technet.microsoft.com/en-us/library/ff182312%28v=ws.10%29.aspx

We would like to upgrade our file share cluster to Server 2012. We will be using the same physical hardware and the same storage. This file server has several shares, each with unique permissions.

The issue I have is that I need to keep the same cluster name, same IPs, same virtual server names, and same machine names, the biggest issue being the cluster name and IPs. I cannot do this with the migration path.

Can I do it this way?

1. Evict one node from the cluster and upgrade it to Server 2012. Now I have a single-node cluster on 2008 R2 and a stand-alone server running 2012.

2. For the remaining node, do an in-place upgrade to Server 2012. Now I have a single-node cluster running Server 2012 and a stand-alone server running 2012.

3. Add the second node back into the cluster. Now I have a 2-node Server 2012 cluster.

Is this supported? Can I do an in-place OS upgrade of a stand-alone server running 2008 R2 cluster services with file shares? Is there any documentation stating that it is supported?
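
For reference, a minimal sketch of the evict/re-add steps in PowerShell (the node name is a placeholder; this says nothing about whether the in-place OS upgrade itself is supported):

# On the 2008 R2 cluster, evict the node that will be upgraded:
Remove-ClusterNode -Name NODE2

# Later, on the upgraded 2012 cluster, join the rebuilt node:
Add-ClusterNode -Name NODE2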

2008 R2 CSV Acting like Redirected Access but not in Redirected Access Mode

We've been experiencing disk latency issues with some of our Hyper-V guests lately. On one of our busy clusters, a VM can have an average read or write latency of 300-400 ms, with spikes as high as 2 seconds.

While testing, I noticed that traffic to a CSV from a node that does not own that CSV causes one of the network adapters to spike to about 11-12% of the 1 Gbps link and stay there until the transfer finishes. Benchmarking showed much lower throughput when the node didn't own the CSV: about 125-140 MB/s when the node owns the CSV and about 9-11 MB/s when it doesn't.

None of the CSVs are in redirected access mode, and they behave normally apart from this. This is a 2008 R2 setup, and it is happening on both of the 2008 R2 Hyper-V clusters we have running. However, our 2012 Hyper-V cluster (our testing one) does not have this issue and reads/writes at (almost) the same speed whether or not the node owns the CSV. Unfortunately, due to workplace restrictions we cannot upgrade our other clusters to 2012 just yet, so we'd like to get this fixed. We've always had poor performance, but only recently have we added more machines that show just how bad the performance really is.
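
For anyone reproducing this, a quick sketch for checking CSV ownership and state around a benchmark (2008 R2 FailoverClusters module; the CSV and node names are placeholders):

# Show which node currently owns each CSV and whether any is redirected:
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State

# Move ownership to the node being tested from:
Move-ClusterSharedVolume -Name "Cluster Disk 1" -Node NODE1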

If anyone has any input it would be greatly appreciated.  Thanks in advance.

Server 2012 Enterprise Cluster Crashing

I have a two-node failover cluster of DL380 G7 servers (vs1 and vs2) with a P2000 between them to hold all the files. The problem we are having is that vs2 will somewhat randomly (every 1 week to ~1 month) stop communicating with the cluster. At that point the cluster management console on vs1 fails to connect. When I try to RDP into vs2, it just says "Securing remote connection" and fails. At this point the VMs are still running on vs2 and accessible through RDP. Then, within a few hours of me trying to RDP into vs2, it reboots and all the VMs move to vs1 and restart.

Our outsourced IT company and I have tried just about everything to get this to work and have had no luck. They have worked extensively with Microsoft directly, without any improvement. I am hoping someone has run into this issue and resolved it. I am out of ideas, so any help would be appreciated.
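
For whoever picks this up, a minimal sketch for pulling the cluster debug logs from both nodes right after a hang (the time window and folder are arbitrary examples):

# Generate cluster.log from every node, covering the last 4 hours, into one folder:
Get-ClusterLog -TimeSpan 240 -Destination C:\ClusterLogs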

Windows Storage Server 2008 R2 cluster network issue

Hello everybody,

I have an HP X5520 G2 Network Attached Storage appliance with Windows Storage Server 2008 R2 and five clustered file servers with several issues; every day at 6:00 AM and 12:00 PM there are many Network Manager errors like this:

Event 1129:

Cluster network 'Public' is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Or Event 1126:

Cluster network interface 'ClusterNode2 - 10 GbE Public 1' for cluster node 'ClusterNode2' on network 'Public' is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

These errors cause cluster network downtime on both cluster nodes.

After a first analysis, I found that at 6:00 AM and 12:00 PM a VSS snapshot of a very large file server (with a 40 TB LUN) occurs. Could it be due to this? How can I solve this issue?
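
In case it helps the discussion, a sketch for inspecting and cautiously relaxing the heartbeat tolerances behind events 1126/1129 (the value shown is an example, not a recommendation):

# Current heartbeat settings (delay in milliseconds, threshold in missed heartbeats):
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold

# Tolerate a longer interruption during the 6:00/12:00 VSS window:
(Get-Cluster).SameSubnetThreshold = 10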

Thank you very much

Daniele

CSVs assigned to the Cluster-Aware Updating DNN

Dear all,

I have noticed that in our cluster our two CSVs are "assigned" to the Cluster-Aware Updating Distributed Network Name; i.e., browsing to Server Manager\File and Storage Services\Volumes, the CSVs belong to the CAU DNN, while the cluster name has "only" the quorum witness...

Looking at the following PowerShell output, it seems to be a ClusterStorage$ share issue:

[hyperv1]: PS C:\Users\genovese\Documents> Get-SmbShare cl*

Name             ScopeName    Path
----             ---------    ----
ClusterStorage$  CAUHYPERfpq  C:\ClusterStorage


[hyperv1]: PS C:\Users\genovese\Documents> Get-SmbShare cl* | select *


PresetPathAcl         : System.Security.AccessControl.DirectorySecurity
ShareState            : Online
AvailabilityType      : ScaleOut
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           : Cluster Shared Volumes Default Share
EncryptData           : False
Name                  : ClusterStorage$
Path                  : C:\ClusterStorage
Scoped                : True
ScopeName             : CAUHYPERfpq
SecurityDescriptor    : D:(A;;FA;;;BA)
ShadowCopy            : False
Special               : True
Temporary             : True
Volume                : \\?\Volume{95a908ad-9327-11e2-93e8-806e6f6e6963}\
PSComputerName        :
CimClass              : ROOT/Microsoft/Windows/SMB:MSFT_SmbShare
CimInstanceProperties : {AvailabilityType, CachingMode, CATimeout, ConcurrentUserLimit...}
CimSystemProperties   : Microsoft.Management.Infrastructure.CimSystemProperties

Is that OK? Should I change the scope name to our cluster DNN?

The cluster works properly... Wise man says "If it ain't broken, don't try to fix it"... :)
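
For comparison, a small sketch that lists every scoped share and the scope (DNN or CNO) it hangs off, which makes the ClusterStorage$ placement easy to see:

# All scoped SMB shares and their owning scope name:
Get-SmbShare | Where-Object { $_.Scoped } | Format-Table Name, ScopeName, Path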

Thanks for your kind attention

F

"Consider adding additional networks or network interfaces to the cluster" warning, although having redundant network connections

Dear all,

I would like to submit for your attention a warning that I can't fix in our two-node Windows Server 2012 failover cluster:

We have 6 "NICs" per node:

- 2 NICs for iSCSI (PT0/1; 10.0.0.x and 10.0.2.x; intracluster communication not allowed)

- 2 teamed NICs and their associated vSwitch for intradomain communication (CIGS; internal allowed; clients can connect through this network);

- 1 teamed NIC (for possible future expansion) connected to the other node with a crossover cable for intracluster communication (Cluster; 192.168.1.x, IPv6 enabled; internal allowed; clients cannot connect through this network);

- 1 teamed NIC and its associated vSwitch for heartbeat and Hyper-V Replica traffic towards a 3rd hypervisor which is not clustered (HB; 192.168.2.x, IPv6 enabled; internal allowed; clients can connect through this network).

Name     State  Role  AutoMetric  Metric
----     -----  ----  ----------  ------
CIGS     Up     3     False        21000
Cluster  Up     1     False         1000
HB       Up     3     False         1100
iSCSI_0  Up     0     False         1200
iSCSI_2  Up     0     False         1300

According to cluster validation, the current configuration has a single point of failure. Actually, AFAICT, we granted cluster communication on all the networks except the iSCSI ones, although the validation report states that the Cluster and Heartbeat networks are disabled... and all ping requests completed properly.

Thanks for your kind attention; please find the validation report enclosed:

Validate Network Communication

    Description: Validate that servers can communicate, with acceptable latency, on all networks.
    Analyzing connectivity results ...
    Node HYPERV3.xxx and Node HYPERV1.xxx are connected by one or more communication paths that use disabled networks. These paths will not be used for cluster communication and will be ignored. This is because interfaces on these networks are connected to an iSCSI target. Consider adding additional networks to the cluster, or change the role of one or more cluster networks after the cluster has been created to ensure redundancy of cluster communication.
    The communication path between network interface HYPERV1.xxx - PT1 and network interface HYPERV3.xxx - PT1 is on a disabled network.
    The communication path between network interface HYPERV1.xxx - PT0 and network interface HYPERV3.xxx - PT0 is on a disabled network.
    The communication path between network interface HYPERV1.xxx - Cluster and network interface HYPERV3.xxx - Cluster is on a disabled network.
    The communication path between network interface HYPERV1.xxx - vEthernet (vSwitch_HB) and network interface HYPERV3.xxx - vEthernet (vSwitch_HB) is on a disabled network.
    Node HYPERV3.xxx is reachable from Node HYPERV1.xxx by multiple communication paths, but each path includes network interface HYPERV1.xxx - vEthernet (vSwitch_HB). This network interface may be a single point of failure for communication within the cluster. Please verify that this network interface is highly available or consider adding additional networks or network interfaces to the cluster.
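
For reference, a sketch of how the per-network cluster role can be inspected and adjusted after cluster creation (network names as in the table above; role 0 = none, 1 = cluster only, 3 = cluster and client):

# Review roles and metrics:
Get-ClusterNetwork | Format-Table Name, Role, Metric

# Example only: restrict the HB network to cluster communication:
(Get-ClusterNetwork "HB").Role = 1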


Adding a printer to a cluster

I have just added print services to a cluster, but I am having issues adding the printer. I connect to the cluster as \\clustername, select View remote printers, then Add a printer, but I get Access denied. I am using Domain Admin credentials. I have confirmed the credentials are in the Administrators group on each node and in the Print Operators group, and that they have Full Control on the Security tab of the node in the Print Servers properties under the Print and Document Services role. What am I missing?

Thanks,


- Gymmbo

Domain controller able to communicate with the member servers, but member servers not able to communicate with each other... what to do?

I am trying to set up failover clustering, and it reported that the two member servers are not able to communicate with each other. When I tried to ping, I was not able to establish a successful ping. Please help me soon.

cluster nodes deleted from AD

Recently we were applying Windows updates to the nodes in our cluster. When we attempted to move an instance of SQL back to an updated node, it failed. I found & fixed a problem with the domain account that SQL runs as on the node we tried to move to.

When we tried to move SQL again, we got a different error, an Event ID 1207 with 'Unable to obtain the Primary Cluster Name Identity token.' While researching this I came across this forum entry http://social.technet.microsoft.com/Forums/windowsserver/en-US/9e21b6e4-a5ca-4ee9-bcd6-681a8f32c824/cant-bring-network-name-on-line, which led me to this as a possible solution https://blogs.technet.com/b/askcore/archive/2009/04/27/recovering-a-deleted-cluster-name-object-cno-in-a-windows-server-2008-failover-cluster.aspx?Redirected=true

So I troop off to my domain controllers. DNS has the server & virtual server entries. However, it appears someone deleted the 3 servers (nodes) of my cluster from the AD Computers OU.

I restored AD, but my oldest backup doesn't contain the servers, so now I'm stuck trying to figure out the easiest solution to the problem. These are production servers, so rebuilding the cluster could be beyond scope.

Can I just re-add the servers to the AD Computers OU? Or are there additional tasks that I need to undertake?
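
Before restoring anything else, a minimal sketch for verifying which computer objects actually survive in AD (ActiveDirectory module assumed; all names are placeholders):

# Check the node accounts and the cluster name object (CNO):
Import-Module ActiveDirectory
"NODE1","NODE2","NODE3","MYCLUSTER" | ForEach-Object { Get-ADComputer -Filter "Name -eq '$_'" }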

Thanks,


BigSam

Validating File and Print Sharing network binding error

Hello,

Hopefully somebody can assist me.

Problem: I am trying to create a 2-node failover cluster using StarWind and Windows Server 2012. While running validation, I get a failure in the storage area with the message "Unexpected error while validating File and Print Sharing network binding on node (Server 1 and 2): The server threw an exception."

Setup: 2 Dell R720s with 2 software-mirrored drives for the OS and 14 additional drives (hardware RAID 0) for data (11 TB total). StarWind is installed and running correctly, with the iSCSI drives recognized and syncing. 4 NICs (1 for the local network, 1 for heartbeat, and 2 for iSCSI traffic). When validating or trying to create the cluster, the failure above is reported.

Steps taken so far: rebooted several times. Since it was a storage failure, I first contacted StarWind, who worked with me for 2 hours; after verifying that their software was working, they indicated they had not seen the error before. I have verified and played with the network binding order and toggled Microsoft File and Printer Sharing on the NICs, but I still have not been able to get past this error.
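
One more diagnostic that may help: on Server 2012, the File and Printer Sharing binding that this validation step exercises can be listed per adapter (ms_server is the component involved; this is a check, not a fix):

# Show which adapters have File and Printer Sharing for Microsoft Networks bound:
Get-NetAdapterBinding -ComponentID ms_server | Format-Table Name, Enabled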

Any help in resolving this would be greatly appreciated.

Thank you,

Frank


Frank Snell

Hardware for Clustering

Hi All,

Would like to look at clustering and have a couple of questions to clear up.

We have currently;

2x HP DL380 G5 Server

2x HP DL380 G4 Server

We would like to run 3-4 virtual machines using Hyper-V on the DL380s, with clustering between the 2 servers and load balancing if available.

If we use 2x 300 GB SAS drives in each server for the host OS, then we would like to either use our current DL380 G4, or purchase a DL380 G5 or a QNAP NAS, for the shared storage.

HP appears to support Server 2012 only on G6 servers, except for the G5 Storage Server, which offers 2012 drivers:

http://h20566.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdHome/?sp4ts.oid=3239518&spf_p.tpst=swdMain&spf_p.prp_swdMain=wsrp-navigationalState%3DswEnvOID%253D4138%257CswLang%253D%257Caction%253DlistDriver&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken

Assuming we can source the extra DL380 G5 for a fair amount cheaper than a NAS capable of 4+ virtual machines, could we please check through the required hardware?

So far I know that 2012 would allow the use of SMB 3.0 for the CSV, which is a better option than iSCSI on 2008 R2. Can you confirm this, please?

Given that all 3 servers would be G5 HP servers, I am thinking Server 2008 R2 is the newest we can go.

1x Server 2008 R2 VM for the database

1x Server 2008 R2 VM for RDS

1x Server 2008 R2 VM for various industry software

2x VMs for future growth

I am thinking 2x DL380 G5 with 2x 300 GB SAS in RAID 1, 16-24 GB RAM, Server 2008 R2 Enterprise.

1x DL380 G5 with 4x 1 TB or 8x 300 GB SAS drives running RAID 10, 12 GB RAM, Storage Server 2008 R2.

Besides a redundant network, have I got the basics covered?

If we sourced a DL380 G6 so we could use Storage Server 2012, would a 2008 R2 host be able to use SMB 3.0?

Thank you!

2003 cluster: add node wizard reports wrong IP info

Hello all,

I am trying to add a 4th node to a 2003 cluster. When I run the Add Node wizard, the summary shows the public and private IP addresses of an existing node in the cluster (the one currently hosting the cluster resource) instead of the addresses of the server I am trying to add. I cancelled the wizard at this point for fear of breaking the cluster.

I have run "ipconfig /all" on both servers and everything looks correct. The private NICs are connected to a separate switch and use 192.x addresses; the public NICs are teamed and use 10.x addresses. The cluster name and server names all resolve properly through DNS.

Does anyone have any suggestions on where to go from here?

Thanks,
Darryl

Is NetShareAdd still valid for Windows Server 2012 within a Failover Cluster?

We are trying to use NetShareAdd to create a share on the system and have it recognized in the clustered resource. The clustered resource name is "ExileOnMainSt", and thus:

// Create a share scoped to the clustered network name "EXILEONMAINST".
SHARE_INFO_503 si503 = { 0 };
si503.shi503_servername = L"EXILEONMAINST"; // scope the share to the cluster resource
si503.shi503_netname = L"TSV";
si503.shi503_type = STYPE_DISKTREE;
si503.shi503_remark = L"";
si503.shi503_permissions = ACCESS_READ;
si503.shi503_max_uses = 0xFFFFFFFF; // no limit on concurrent uses
si503.shi503_path = L"S:\\TSV";
DWORD dwParmErr = 0;
NET_API_STATUS eNetErr = NetShareAdd(L"EXILEONMAINST", 503, (LPBYTE) &si503, &dwParmErr);

I'm receiving ERROR_INVALID_PARAMETER, but dwParmErr is zero, when the spec says it should be one of the following:

#define SHARE_NETNAME_PARMNUM         1
#define SHARE_TYPE_PARMNUM            3
#define SHARE_REMARK_PARMNUM          4
#define SHARE_PERMISSIONS_PARMNUM     5
#define SHARE_MAX_USES_PARMNUM        6
#define SHARE_CURRENT_USES_PARMNUM    7
#define SHARE_PATH_PARMNUM            8
#define SHARE_PASSWD_PARMNUM          9
#define SHARE_FILE_SD_PARMNUM       501
#define SHARE_SERVER_PARMNUM        503

This works in Windows Server 2008 R2. Has the functionality changed?


Adam



Host in 2008 R2 NLB Cluster randomly switches to "Converging" state

I have two SharePoint 2010 WFE servers set up in an NLB cluster in multicast mode. The servers are 2008 R2 running in a VMware environment, and both servers have only 1 NIC. Several times per week, one of the servers goes into a "Converging" state, at which point some users are unable to access our SharePoint site. I have to reboot the server to get it to successfully converge again.

Microsoft has a hotfix available in the following KB: http://support.microsoft.com/kb/978943/en-us, but I'm not sure if I should apply it, because it specifically mentions that the server should be recording an entry in the System event log with Event ID 19. Neither of my 2 servers is logging such an entry, so I am not sure this hotfix applies. Has anyone else seen this before, or have any suggestions on other things to check or try?
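
For monitoring between reboots, a small sketch using the NLB PowerShell module that ships with 2008 R2:

# Poll the convergence state of each NLB node:
Import-Module NetworkLoadBalancingClusters
Get-NlbClusterNode | Format-Table Name, State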

Thank You

Failover cluster BSOD

We just got our second BSOD on a failover cluster box in 2 weeks. The setup is 3 Server 2012 boxes (Dell R610), each with a 10-gigabit NIC for the SAN and the onboard NICs for the public network.

The minidump shows netft.sys as the culprit:

ADDITIONAL_DEBUG_TEXT:  
You can run '.symfix; .reload' to try to fix the symbol path and load symbols.

MODULE_NAME: netft

FAULTING_MODULE: fffff802f5683000 nt

DEBUG_FLR_IMAGE_TIMESTAMP:  5010aa07

PROCESS_OBJECT: fffffa80ac6e46c0

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT_SERVER

BUGCHECK_STR:  0x9E

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from fffff8800660d845 to fffff802f56dd440

STACK_TEXT:  
fffff880`0316a938 fffff880`0660d845 : 00000000`0000009e fffffa80`ac6e46c0 00000000`0000003c 00000000`00000000 : nt+0x5a440
fffff880`0316a940 00000000`0000009e : fffffa80`ac6e46c0 00000000`0000003c 00000000`00000000 00000000`00000000 : netft+0x2845
fffff880`0316a948 fffffa80`ac6e46c0 : 00000000`0000003c 00000000`00000000 00000000`00000000 fffff802`f5636dd4 : 0x9e
fffff880`0316a950 00000000`0000003c : 00000000`00000000 00000000`00000000 fffff802`f5636dd4 fffffa80`4cdecc30 : 0xfffffa80`ac6e46c0
fffff880`0316a958 00000000`00000000 : 00000000`00000000 fffff802`f5636dd4 fffffa80`4cdecc30 fffff880`0660d516 : 0x3c


STACK_COMMAND:  kb

FOLLOWUP_IP: 
netft+2845
fffff880`0660d845 ??              ???

SYMBOL_STACK_INDEX:  1

SYMBOL_NAME:  netft+2845

FOLLOWUP_NAME:  MachineOwner

IMAGE_NAME:  netft.sys

Any thoughts on what I should do?

HDisk for clustering

How can I make HDisk1 on each node acceptable for clustering?

I created a cluster with 2 nodes; each has 2 disks, HDisk0 and HDisk1. HDisk0 is the boot disk with the OS and all data.

I want to use HDisk1 for the CA failover clustering.

Cluster validation of the storage gave the warning "Storport miniport driver is the only driver certified by Microsoft", yet it said I can use the SCSIport miniport driver on HDisk1 for clustering. But when navigating to Storage to "Add disk" to the cluster, no disk is seen that can be clustered.

Any suggestion on how to make use of the internal HDisk1 drives for clustering would be appreciated.
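
As a quick check from PowerShell, this sketch lists what the cluster itself considers clusterable; internal disks that are not visible to both nodes will simply not appear:

# Disks eligible to be added to the cluster (must be presented to all nodes):
Get-ClusterAvailableDisk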

Can't access NLB cluster file share by name

I set up an NLB cluster which contains 2 servers. Both of them have a file share to serve static files. I can access the file share on each node by using the node's name or IP. However, when I try to access the file share using the cluster name and IP, I can only do so successfully via the cluster IP. When using the cluster name, I get a pop-up window saying my credentials are not good. I retype my user name and password, which is the local admin on each node, but it just does not go through.

Did I miss anything?
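
Not necessarily the cause here, but one setting commonly checked when a share is reached through a name other than the machine's own computer name (sketched as an assumption; the Server service must be restarted afterwards):

# Allow the Server service to accept SMB connections addressed to alternate names:
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters' -Name DisableStrictNameChecking -Value 1 -Type DWord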

Thanks!

Node of 2008 R2 cluster fails to join its cluster and causes bogus share witness errors and loss of protection

Hello.

I'm trying to get a 2008 R2 Enterprise cluster running as part of an Exchange setup with a two-node DAG and a witness server. The cluster constantly loses one node at random intervals (30 minutes to 1-2 days), and it can stay that way from a few minutes to several days. Stopping and restarting the cluster (the whole cluster) usually helps, but that is not a solution, as stability and automatic failover are essential for a usable e-mail service. I have dug through the problem for quite a while, I have found a way to replicate the situation, and I have reached the point where I cannot find any more clues as to what breaks the process.

The first symptoms are failure-to-arbitrate and witness-unreachable events:

Event 1564: "File share witness resource 'File Share Witness (\\witness01.company.com\DAG01.company.com)' failed to arbitrate for the file share '\\witness01.company.com\DAG01.company.com'."

Event 1069: "Cluster resource 'File Share Witness (\\witness01.company.com\DAG01.company.com)' in clustered service or application 'Cluster Group' failed."

Event 1573: "Node 'node002' failed to form a cluster. This was because the witness was not accessible. Please ensure that the witness resource is online and available."


I monitored all traffic between the nodes and the witness, and I found that NOTHING IS WRONG with the witness. Node002 was trying to access and create folders and lock files in the witness share, while the folders were actually present and the files were still locked by node001. Node001 was holding majority and the locks on the file witness and running the remaining part of the cluster. HENCE THE KEY QUESTION BECAME: WHY DOES node002 TRY TO TAKE OVER THE WITNESS INSTEAD OF JOINING THE CLUSTER?

I downloaded the cluster log and found that, without any error, node001 and node002 generate respectively "DBG [CHANNEL 192.168.2.22:~51189~] Close()" and "[CHANNEL 192.168.1.11:~3343~] Close()". Later node002 registered an event "INFO Shutdown lock acquired, proceeding with shutdown", and then after some activity node002 starts knocking on the witness. Maybe the event at node001, "WARN [FTI][Initiator] Ignoring duplicate connection: usable route already exists", is somehow related to the problem. This behavior can be replicated with some probability by stopping and starting the cluster service on node002. Sometimes node002 joins the cluster in a few seconds, but in many cases it takes longer (hours, and a few times even days), and then I know that node002 has just gone hunting for the file witness.

I found many discussions regarding the "CHANNEL ... graceful close, status" event, but those cases usually report some errors, and most of them are about security and duplicate account names. My log does not have any errors before the event, and there are no duplicated names. Actually, the fact that the cluster periodically works fine suggests that there is no permanent permission or name-duplication problem.

This is a single-network Exchange DAG cluster. It has two nodes located in two different sites connected over VPN, and a file witness located in a third site. Logs from node002 and a piece from node001 (after ----NODE001-----) are below. The log from node001 shows some inactivity periods around the moment of the "closing-the-channel" events.

I marked some key events with *** (including: 15:52:15.651 INFO*** Shutdown lock acquired, proceeding with shutdown).

Thanks  for any help.
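
For completeness, a small sketch of the witness checks worth running from each node while reproducing this (the share path is taken from the events above):

# Confirm the quorum model and witness resource the cluster thinks it has:
Get-ClusterQuorum

# Confirm the witness share is reachable from this node:
Test-Path '\\witness01.company.com\DAG01.company.com'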

 

 

-------------------------------------------------------NODE002-----------------------------------------------------------

000010d4.000015e4::2013/07/15-15:52:12.480 INFO  [IM] Route from 192.168.2.22:~3343~ to 192.168.1.11:~3343~ is already up, not sending report

000010d4.000016f4::2013/07/15-15:52:12.480 INFO  [NODE] Node 2: New join with n1: stage: 'Wait for Heartbeats on Initial NetFT Route'

000010d4.000016f4::2013/07/15-15:52:12.495 DBG   [FTW] NetFT address fe80::71f7:22a3:89fb:ab11:~3343~ is ready.

000010d4.000016f4::2013/07/15-15:52:12.495 INFO  [FTW] NetFT is ready after 0 msecs wait.

000010d4.000016f4::2013/07/15-15:52:12.495 INFO  [NODE] Node 2: New join with n1: stage: 'Wait for NetFT Duplicate Address Detection'

000010d4.000016f4::2013/07/15-15:52:12.511 DBG   [NETFTAPI] received NsiParameterNotification for fe80::71f7:22a3:89fb:ab11 (IpDadStatePreferred )

000010d4.000016f4::2013/07/15-15:52:12.511 DBG   [NETFTAPI] Signaled NetftLocalAdd event for fe80::71f7:22a3:89fb:ab11

000010d4.000016f4::2013/07/15-15:52:12.511 DBG   [NETFTEVM] FTI NetFT event handler got event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ added

000010d4.000016f4::2013/07/15-15:52:12.511 DBG   [NETFTEVM] TM NetFT event handler got event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ added

000010d4.000016f4::2013/07/15-15:52:12.511 DBG   [NETFTEVM] IM NetFT event handler got event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ added

000010d4.000016f4::2013/07/15-15:52:12.511 DBG   [WM] Filtering event NETFT_LOCAL_ADD? 1

000010d4.0000149c::2013/07/15-15:52:12.511 DBG   [NETFTEVM] FTI NetFT event dispatcher pushing event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ added

000010d4.00000d34::2013/07/15-15:52:12.511 DBG   [NETFTEVM] TM NetFT event dispatcher pushing event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ added

000010d4.000015e4::2013/07/15-15:52:12.511 DBG   [NETFTEVM] IM NetFT event dispatcher pushing event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ added

000010d4.000015e4::2013/07/15-15:52:12.511 INFO  [IM] got event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ added

000010d4.000016f4::2013/07/15-15:52:12.526 DBG   [NETFTAPI] Signaled NetftLocalConnect event for fe80::71f7:22a3:89fb:ab11

000010d4.000016f4::2013/07/15-15:52:12.526 DBG   [NETFTEVM] FTI NetFT event handler got event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ connected

000010d4.000016f4::2013/07/15-15:52:12.526 DBG   [NETFTEVM] TM NetFT event handler got event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ connected

000010d4.000016f4::2013/07/15-15:52:12.526 DBG   [NETFTEVM] IM NetFT event handler got event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ connected

000010d4.000016f4::2013/07/15-15:52:12.526 DBG   [WM] Filtering event NETFT_LOCAL_CONNECT? 1

000010d4.0000149c::2013/07/15-15:52:12.526 DBG   [NETFTEVM] FTI NetFT event dispatcher pushing event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ connected

000010d4.00000d34::2013/07/15-15:52:12.526 DBG   [NETFTEVM] TM NetFT event dispatcher pushing event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ connected

000010d4.000015e4::2013/07/15-15:52:12.526 DBG   [NETFTEVM] IM NetFT event dispatcher pushing event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ connected

000010d4.000015e4::2013/07/15-15:52:12.526 INFO  [IM] got event: Local endpoint fe80::71f7:22a3:89fb:ab11:~0~ connected

000010d4.0000163c::2013/07/15-15:52:12.901 INFO  [ACCEPT] :::~3343~: Accepted inbound connection from remote endpoint fe80::7964:c5c5:3b5:7833%13:~37758~.

000010d4.000016f4::2013/07/15-15:52:12.901 INFO  [SV] Route local (fe80::71f7:22a3:89fb:ab11%13:~3343~) to remote (fe80::7964:c5c5:3b5:7833%13:~37758~) exists. Forwarding to alternate path.

000010d4.000016f4::2013/07/15-15:52:12.901 INFO  [SV] Securing route from (fe80::71f7:22a3:89fb:ab11%13:~3343~) to remote (fe80::7964:c5c5:3b5:7833%13:~37758~).

000010d4.000016f4::2013/07/15-15:52:12.901 INFO  [SV] Got a new incoming stream from fe80::7964:c5c5:3b5:7833%13:~37758~

000010d4.000016f4::2013/07/15-15:52:12.901 DBG   [SM] SrvCtxt initialized with package Kerberos, MaxTokenSize = 12000, RequiredCtxAttrib = 165910, HandShakeTimeout = 30000

000010d4.00000a84::2013/07/15-15:52:12.901 DBG   [SM] Handling auth handshake posted by thread id 5876

000010d4.0000154c::2013/07/15-15:52:13.042 WARN  [API] s_ApiOpenGroupEx: Group Cluster Group failed, status = 70

000010d4.0000147c::2013/07/15-15:52:13.651 DBG   [JPM] Node 2: contacts size for node node001 is 1, current index 0

000010d4.0000147c::2013/07/15-15:52:13.651 DBG   [JPM] Node 2: Trying to connect to node node001 (IP: 192.168.1.11:~0~)

000010d4.0000147c::2013/07/15-15:52:13.651 DBG   [HM] Trying to connect to node001 at 192.168.1.11:~3343~

000010d4.0000147c::2013/07/15-15:52:13.776 INFO  [CONNECT] 192.168.1.11:~3343~: Established connection to remote endpoint 192.168.1.11:~3343~.

000010d4.0000147c::2013/07/15-15:52:13.776 INFO  [SV] Securing route from (192.168.2.22:~51189~) to remote node001 (192.168.1.11:~3343~).

000010d4.0000147c::2013/07/15-15:52:13.776 INFO  [SV] Got a new outgoing stream to node001 at 192.168.1.11:~3343~

000010d4.0000147c::2013/07/15-15:52:13.776 DBG   [SM] Joiner: Initialized with SPN = node001, Package = Kerberos, RequiredCtxAttrib = 83990, HandShakeTimeout = 30000

000010d4.0000127c::2013/07/15-15:52:13.776 DBG   [SM] Handling auth handshake posted by thread id 5244

000010d4.0000127c::2013/07/15-15:52:13.776 DBG   [SM] Joiner: ISC returned status = 590610 output Blob size 1578

000010d4.0000127c::2013/07/15-15:52:13.917 DBG   [SM] Joiner: Received SSPI blob from the Sponsor of size 156

000010d4.0000127c::2013/07/15-15:52:13.917 DBG   [SM] Joiner: ISC returned status = 0 output Blob size 0

000010d4.0000147c::2013/07/15-15:52:13.933 INFO  [SV] Authentication and authorization were successful

000010d4.0000147c::2013/07/15-15:52:13.933 DBG   [SM] Joiner: Initialized with SPN = node001, Package = Kerberos, RequiredCtxAttrib = 67586, HandShakeTimeout = 30000

000010d4.0000127c::2013/07/15-15:52:13.933 DBG   [SM] Handling auth handshake posted by thread id 5244

000010d4.0000127c::2013/07/15-15:52:13.933 DBG   [SM] Joiner: ISC returned status = 590610 output Blob size 1578

000010d4.0000127c::2013/07/15-15:52:14.073 DBG   [SM] Joiner: Received SSPI blob from the Sponsor of size 156

000010d4.0000127c::2013/07/15-15:52:14.073 DBG   [SM] Joiner: ISC returned status = 0 output Blob size 0

000010d4.0000147c::2013/07/15-15:52:14.073 INFO  [SV] Security Handshake successful while obtaining SecurityContext for NetFT driver

000010d4.0000147c::2013/07/15-15:52:14.073 INFO  [VER] Got new TCP connection. Exchanging version data.

000010d4.0000147c::2013/07/15-15:52:14.073 DBG   [VER] Calculated cluster versions: highest [Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1], lowest [Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1] with exclude node list: (1)

000010d4.0000147c::2013/07/15-15:52:14.073 INFO  [VER] Checking version compatibility for node node001 id 1 with following versions: highest [Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1], lowest [Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1].

000010d4.0000147c::2013/07/15-15:52:14.073 INFO  [VER] Version check passed: node and cluster highest supported versions match.

000010d4.0000147c::2013/07/15-15:52:14.198 INFO  [SV] Negotiating message security level.

000010d4.0000147c::2013/07/15-15:52:14.198 INFO  [SV] Already protecting connection with message security level 'Sign'.

000010d4.0000147c::2013/07/15-15:52:14.198 INFO  [FTI] Got new raw TCP/IP connection.

000010d4.0000147c::2013/07/15-15:52:14.339 INFO  [FTI][Follower] This node (2) is not the initiator

000010d4.0000147c::2013/07/15-15:52:14.339 DBG   [FTI] Stream already exists to node 1: false

000010d4.0000147c::2013/07/15-15:52:14.339 DBG***   [CHANNEL 192.168.1.11:~3343~] Close().

000010d4.0000147c::2013/07/15-15:52:14.339 INFO***  [CHANNEL 192.168.1.11:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)

000010d4.0000147c::2013/07/15-15:52:14.339 INFO***  [CORE] Node 2: Clearing cookie bfc27345-e777-418e-b69d-1f1b8fe89bcf

000010d4.0000147c::2013/07/15-15:52:14.339 DBG***   [CHANNEL 192.168.1.11:~3343~] Not closing handle because it is invalid.

 

000010d4.0000147c::2013/07/15-15:52:14.339 WARN***  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.168.1.11:~3343~ is closed'

 

000010d4.0000147c::2013/07/15-15:52:14.511 DBG   [NETFTAPI] received NsiParameterNotification for 169.254.171.17 (IpDadStateInvalid )

000010d4.0000147c::2013/07/15-15:52:14.511 DBG   [NETFTAPI] received NsiDeleteInstance for 169.254.171.17

000010d4.0000147c::2013/07/15-15:52:14.511 WARN  [NETFTAPI] Failed to query parameters for 169.254.171.17 (status 80070490)

000010d4.0000147c::2013/07/15-15:52:14.511 DBG   [NETFTAPI] Signaled NetftLocalAdd event for 169.254.171.17

000010d4.0000147c::2013/07/15-15:52:14.511 DBG   [NETFTEVM] FTI NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.171.17:~0~

000010d4.0000147c::2013/07/15-15:52:14.511 DBG   [NETFTEVM] TM NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.171.17:~0~

000010d4.0000147c::2013/07/15-15:52:14.511 DBG   [NETFTEVM] IM NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.171.17:~0~

000010d4.0000147c::2013/07/15-15:52:14.511 DBG   [WM] Filtering event NETFT_LOCAL_ADD? 1

000010d4.0000147c::2013/07/15-15:52:14.526 WARN  [NETFTAPI] Failed to query parameters for 169.254.171.17 (status 80070490)

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTAPI] Signaled NetftLocalRemove event for 169.254.171.17

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] FTI NetFT event handler ignoring PnP remove event for IPv4 LinkLocal address 169.254.171.17:~0~

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] TM NetFT event handler ignoring PnP remove event for IPv4 LinkLocal address 169.254.171.17:~0~

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] IM NetFT event handler ignoring PnP remove event for IPv4 LinkLocal address 169.254.171.17:~0~

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [WM] Filtering event NETFT_LOCAL_REMOVE? 1

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTAPI] received NsiParameterNotification for 169.254.2.47 (IpDadStatePreferred )

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTAPI] Signaled NetftLocalAdd event for 169.254.2.47

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] FTI NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.2.47:~0~

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] TM NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.2.47:~0~

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] IM NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.2.47:~0~

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [WM] Filtering event NETFT_LOCAL_ADD? 1

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTAPI] Signaled NetftLocalConnect event for 169.254.2.47

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] FTI NetFT event handler got event: Local endpoint 169.254.2.47:~0~ connected

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] TM NetFT event handler got event: Local endpoint 169.254.2.47:~0~ connected

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] IM NetFT event handler got event: Local endpoint 169.254.2.47:~0~ connected

000010d4.00000d34::2013/07/15-15:52:14.526 DBG   [NETFTEVM] TM NetFT event dispatcher pushing event: Local endpoint 169.254.2.47:~0~ connected

000010d4.0000147c::2013/07/15-15:52:14.526 DBG   [WM] Filtering event NETFT_LOCAL_CONNECT? 1

000010d4.000015e4::2013/07/15-15:52:14.526 DBG   [NETFTEVM] IM NetFT event dispatcher pushing event: Local endpoint 169.254.2.47:~0~ connected

000010d4.000015e4::2013/07/15-15:52:14.526 INFO  [IM] got event: Local endpoint 169.254.2.47:~0~ connected

000010d4.0000149c::2013/07/15-15:52:14.526 DBG   [NETFTEVM] FTI NetFT event dispatcher pushing event: Local endpoint 169.254.2.47:~0~ connected

000010d4.00000a84::2013/07/15-15:52:15.261 DBG   [SM] Sponsor: Received SSPI blob from the Joiner of size 1577

000010d4.00000a84::2013/07/15-15:52:15.261 DBG   [SM] Sponsor: SSPI ASC returned status = 0

000010d4.00000a84::2013/07/15-15:52:15.261 DBG   [SM] Sponsor: Sending SSPI blob of size 155 to Joiner

000010d4.00000a84::2013/07/15-15:52:15.261 DBG   [SM] Sponsor: Authentication handshake final Status 0

000010d4.000016f4::2013/07/15-15:52:15.401 INFO  [SV] Authentication and authorization were successful

000010d4.000016f4::2013/07/15-15:52:15.401 DBG   [SM] SrvCtxt initialized with package Kerberos, MaxTokenSize = 12000, RequiredCtxAttrib = 133122, HandShakeTimeout = 30000

000010d4.00000a84::2013/07/15-15:52:15.401 DBG   [SM] Handling auth handshake posted by thread id 5876

000010d4.00000a84::2013/07/15-15:52:15.401 DBG   [SM] Sponsor: Received SSPI blob from the Joiner of size 1577

000010d4.00000a84::2013/07/15-15:52:15.401 DBG   [SM] Sponsor: SSPI ASC returned status = 0

000010d4.00000a84::2013/07/15-15:52:15.401 DBG   [SM] Sponsor: Sending SSPI blob of size 155 to Joiner

000010d4.00000a84::2013/07/15-15:52:15.401 DBG   [SM] Sponsor: Authentication handshake final Status 0

000010d4.000016f4::2013/07/15-15:52:15.401 INFO  [SV] Security Handshake successful while obtaining SecurityContext for NetFT driver

000010d4.000016f4::2013/07/15-15:52:15.542 DBG   [SV] Incoming (second) connection from node001 is secure

000010d4.000016f4::2013/07/15-15:52:15.542 INFO  [ReM] Got stream info from fe80::71f7:22a3:89fb:ab11%13:~3343~ to fe80::7964:c5c5:3b5:7833%13:~37758~.

000010d4.000016f4::2013/07/15-15:52:15.542 DBG   [ReM] Exchanging local info.

000010d4.000016f4::2013/07/15-15:52:15.542 DBG   [ReM] Sending local info.

000010d4.000016f4::2013/07/15-15:52:15.542 DBG   [ReM] Local info sent, receiving remote info.

000010d4.000016f4::2013/07/15-15:52:15.542 DBG   [ReM] Remote info received from 1:node001.

000010d4.000016f4::2013/07/15-15:52:15.542 DBG   [ReM][Follower] I am the follower with n1.

000010d4.000013d4::2013/07/15-15:52:15.651 INFO  [DM] Node 2: Loaded

000010d4.000013d4::2013/07/15-15:52:15.651 DBG   [RCM] Form is called (lightweight form = true)

000010d4.000013d4::2013/07/15-15:52:15.651 DBG   [RCM] rcm::RcmAgent::Unload()

000010d4.000013d4::2013/07/15-15:52:15.651 INFO***  Shutdown lock acquired, proceeding with shutdown

000010d4.000013d4::2013/07/15-15:52:15.651 DBG   [RCM] orphan group handlers: requested=0, started=0, finished=0

000010d4.000013d4::2013/07/15-15:52:15.651 INFO  [GUM] Node 2: shutting down gum handling

000010d4.000013d4::2013/07/15-15:52:15.651 DBG   [RCM] Disabling API calls

000010d4.000013d4::2013/07/15-15:52:15.651 DBG   [RCM] Enabled API calls

000010d4.000013d4::2013/07/15-15:52:15.651 INFO  [GUM] Node 2: reenabling gum handling

000010d4.000013d4::2013/07/15-15:52:15.651 DBG   [RCM] rcm::RcmResType::InitializeFromDb()

000010d4.000013d4::2013/07/15-15:52:15.651 DBG   [RCM] Deleting stale key SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters\Rhs\3b8c4a5e-220e-4753-859c-967862ba62d4

000010d4.000013d4::2013/07/15-15:52:15.651 DBG   [RCM] Deleting stale key SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters\Rhs\b1750551-cf79-41c0-bfa7-b383cb5e40ec

000010d4.000013d4::2013/07/15-15:52:15.651 INFO  [RCM] Created monitor process 4464 / 0x1170

00001170.00000f74::2013/07/15-15:52:15.651 INFO  [RHS] Initializing.

000010d4.000013d4::2013/07/15-15:52:15.667 DBG   [RCM] Scheduling wait callback for monitor process 4464

00001170.00000da0::2013/07/15-15:52:15.667 DBG   [RHS] s_RhsRpcCreateResType(DFS Replicated Folder, dfsrclus.dll)

000010d4.000016f4::2013/07/15-15:52:15.683 INFO  [ReM][Follower] Got remote data from n1, epoch: 0, sn: 0, Fault Tolerant Session Id: 00000000-0000-0000-0000-000000000000

000010d4.000016f4::2013/07/15-15:52:15.683 DBG   [NODE] Node 2: To n1 getting epoch (currently 0)

000010d4.000016f4::2013/07/15-15:52:15.683 DBG   [ReM][Follower] Current state with n1, epoch: 0, sn: 0

000010d4.000016f4::2013/07/15-15:52:15.683 DBG   [ReM][Follower] Successfully sent current state to 1.

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(DHCP Service, clnetres.dll)

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(Distributed File System, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(Distributed Transaction Coordinator, mtxclu.dll)

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(File Server, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(File Share Witness, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(Generic Application, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(Generic Script, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(Generic Service, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.714 DBG   [RHS] s_RhsRpcCreateResType(IP Address, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(IPv6 Address, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(IPv6 Tunnel Address, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(Microsoft iSNS, isnsclusres.dll)

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(MSMQ, mqclus.dll)

00001170.00000da0::2013/07/15-15:52:15.730 ERR   [RHS] s_RhsRpcCreateResType: ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQ returned 21.'

000010d4.000013d4::2013/07/15-15:52:15.730 WARN  [RCM] Failed to load restype 'MSMQ': error 21.

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(MSMQTriggers, mqtgclus.dll)

00001170.00000da0::2013/07/15-15:52:15.730 ERR   [RHS] s_RhsRpcCreateResType: ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQTriggers returned 21.'

000010d4.000013d4::2013/07/15-15:52:15.730 WARN  [RCM] Failed to load restype 'MSMQTriggers': error 21.

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(Network Name, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(NFS Share, nfssh.dll)

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(Physical Disk, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(Print Spooler, clusres.dll)

00001170.00000da0::2013/07/15-15:52:15.730 DBG   [RHS] s_RhsRpcCreateResType(Virtual Machine, vmclusres.dll)

00001170.00000da0::2013/07/15-15:52:15.745 DBG   [RHS] s_RhsRpcCreateResType(Virtual Machine Configuration, vmclusres.dll)

00001170.00000da0::2013/07/15-15:52:15.745 DBG   [RHS] s_RhsRpcCreateResType(Volume Shadow Copy Service Task, vsstask.dll)

00001170.00000da0::2013/07/15-15:52:15.745 DBG   [RHS] s_RhsRpcCreateResType(WINS Service, clnetres.dll)

000010d4.000013d4::2013/07/15-15:52:15.745 DBG   [RCM] rcm::RcmGroup::InitializeFromDb()

000010d4.000013d4::2013/07/15-15:52:15.745 DBG   [RCM] rcm::RcmDependency::InitializeFromDb()

000010d4.000013d4::2013/07/15-15:52:15.745 DBG   [RCM] rcm::RcmResource::AddDependency(Cluster Name, 79eb390e-4ac4-4088-8057-aa778415e0b5)

000010d4.000013d4::2013/07/15-15:52:15.745 INFO  [API] Online read only

000010d4.000013d4::2013/07/15-15:52:15.745 DBG   RcmGroup::TakeOwnershipOfAllGroups

000010d4.000013d4::2013/07/15-15:52:15.745 DBG   [RCM] rcm::RcmGroup::TransitionToState: Available Storage: Offline->ClusterGroupChoosingOwner.

000010d4.000013d4::2013/07/15-15:52:15.745 DBG   [RCM] rcm::RcmGroup::TransitionToState: Cluster Group: Offline->ClusterGroupChoosingOwner.

000010d4.000013d4::2013/07/15-15:52:15.745 INFO  [RCM] Created monitor process 4168 / 0x1048

00001048.00001214::2013/07/15-15:52:15.761 INFO  [RHS] Initializing.

000010d4.000013d4::2013/07/15-15:52:15.776 DBG   [RCM] Scheduling wait callback for monitor process 4168

000010d4.000013d4::2013/07/15-15:52:15.776 DBG   [RCM] rpc binding handle for File Share Witness (\\witness01.company.com\DAG01.company.com): HDL(18e9e60)

000010d4.000013d4::2013/07/15-15:52:15.776 DBG   Sending control 1

000010d4.0000147c::2013/07/15-15:52:15.808 INFO  [NM] Received request from client address node002.

000010d4.0000147c::2013/07/15-15:52:15.808 DBG   [API] Authenticated client--Client: NT AUTHORITY\SYSTEM Interface: b97db8b2-4c63-11cf-bff6-08002be23f2f Server: (null) Level: RPC_C_AUTHN_LEVEL_PKT_PRIVACY Service: RPC_C_AUTHN_WINNT Protocol Sequence: ncalrpc Client Address: node002 Network Option: .

000010d4.0000147c::2013/07/15-15:52:15.808 DBG   [API] s_ApiClusterControl(GET_COMMON_PROPERTIES)

000010d4.0000147c::2013/07/15-15:52:15.808 DBG   [API] s_ApiClusterControl(GET_COMMON_PROPERTIES)

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [ReM][Follower] Got direction from 1. Epoch is now 1, will resume from SN 0.  Fault Tolerant Session ID is 58d685e0-1c0b-4027-959f-04b2fb2f8ecc

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [ReM] Sending connection down normal path.

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [NODE] Node 2: New join with n1: stage: 'Update NetFT Route'

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [JPM] Received a new stream from node001

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [NODE] Node 2: New join with n1: stage: 'Send Current Membership Status for Join Policy'

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [MM] Node 2: Adding a stream to existing node 1

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [NODE] Node 2: n1 node object adding stream

000010d4.000016f4::2013/07/15-15:52:15.808 DBG   [NODE] Node 2: n1 node object got a channel

000010d4.000016f4::2013/07/15-15:52:15.808 DBG   [NODE] Node 2: Using new stream to n1, setting epoch to 1

000010d4.000016f4::2013/07/15-15:52:15.808 DBG   [NODE] Node 2: Done closing stream to n1

000010d4.000016f4::2013/07/15-15:52:15.808 DBG   [NODE] Node 2: My Fault Tolerant Session Id is now 58d685e0-1c0b-4027-959f-04b2fb2f8ecc

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [NODE] Node 2: No reconnect in progress to n1, updating send queue based on new stream.

000010d4.000016f4::2013/07/15-15:52:15.808 DBG   [NODE] Node 2: Treating stream with n1 as new connection because epoch (1) is <= 1.

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [MQ-node001] Clearing 0 unsent and 0 unacknowledged messages.

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [NODE] Node 2: Highest version with n1 = Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1, lowest = Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [NODE] Node 2: Done processing new stream to n1.

000010d4.0000147c::2013/07/15-15:52:15.808 INFO  [PULLER node001] Just about to start reading from <refcounted count='2' typeid='.?AVBufferedStream@cxl@@'/>

000010d4.000016f4::2013/07/15-15:52:15.808 DBG   [CORE] Node 2: sending jpm/welcome to JPMA at node001

000010d4.000016f4::2013/07/15-15:52:15.808 INFO  [JPM] Node 2: Selected partition 802(1) as a target for join

000010d4.000016f4::2013/07/15-15:52:15.808 DBG   [JPM] Node 2: join attempt 802(1) was vetoed. Will retry

000010d4.000016f4::2013/07/15-15:52:15.808 DBG   [CORE] Veto Cancel Requested

000010d4.0000123c::2013/07/15-15:52:15.808 DBG   [NODE] Node 2: just about to send a message of size 226 to 1

000010d4.0000123c::2013/07/15-15:52:15.808 DBG   [NODE] Node 2: message to node 1 sent

000010d4.0000154c::2013/07/15-15:52:15.823 DBG   [RCM] rpc binding handle for Cluster Name: HDL(18e9fe0)

000010d4.0000154c::2013/07/15-15:52:15.823 DBG   Sending control 3

00001048.00000eec::2013/07/15-15:52:15.823 INFO  [RES] Network Name <Cluster Name>: NetNameOpen Invoked

00001048.00000eec::2013/07/15-15:52:15.823 INFO  [RES] Network Name <Cluster Name>: Successful open of resid 3244336

000010d4.00000e30::2013/07/15-15:52:15.823 INFO  [NM] Received request from client address node002.

000010d4.00000e30::2013/07/15-15:52:15.823 DBG   [API] Authenticated client--Client: NT AUTHORITY\SYSTEM Interface: 299bc84a-de09-49e9-a240-8a1042d5d60a Server: (null) Level: RPC_C_AUTHN_LEVEL_PKT_PRIVACY Service: RPC_C_AUTHN_WINNT Protocol Sequence: ncalrpc Client Address: node002 Network Option: .

000010d4.00000e30::2013/07/15-15:52:15.823 INFO  [RCM] HandleMonitorReply: OPENRESOURCE for 'Cluster Name', gen(0) result 0.

00001048.00000eec::2013/07/15-15:52:15.823 INFO  [RES] Network Name <Cluster Name>: Getting a virtual computer account token.

00001048.00000eec::2013/07/15-15:52:15.839 INFO  [RES] Network Name <Cluster Name>: Resource object did not contain the cached AD Domain. Obtaining.

000010d4.0000147c::2013/07/15-15:52:15.948 INFO  [JPM] Node 2: Node 1 is in view 802(1) and hasQuorum = true

00001048.00000eec::2013/07/15-15:52:15.995 INFO  [RES] Network Name <Cluster Name>: Got new Logon Session.

000010d4.0000154c::2013/07/15-15:52:15.995 INFO  [RCM] HandleMonitorReply: OPENRESOURCE for 'File Share Witness (\\witness01.company.com\DAG01.company.com)', gen(0) result 0.

000010d4.000013d4::2013/07/15-15:52:15.995 INFO  [QUORUM] Node 2: setting quorum id to bd0940e0-a02c-4622-86e0-d9418f055e02 (storage-capable: false)

000010d4.000013d4::2013/07/15-15:52:15.995 DBG   [RCM] rcm::RcmAgent::SetQuorumResource(bd0940e0-a02c-4622-86e0-d9418f055e02)

000010d4.000013d4::2013/07/15-15:52:15.995 INFO  [QUORUM] Node 2: online quorum bd0940e0-a02c-4622-86e0-d9418f055e02

000010d4.000013d4::2013/07/15-15:52:15.995 DBG   [RCM] rcm::RcmAgent::Online(bd0940e0-a02c-4622-86e0-d9418f055e02)

000010d4.000013d4::2013/07/15-15:52:15.995 DBG   [RCM] rcm::RcmResource::IsReadyToGoOnline=> (File Share Witness (\\witness01.company.com\DAG01.company.com), true)

000010d4.000013d4::2013/07/15-15:52:15.995 INFO  [RCM] TransitionToState(File Share Witness (\\witness01.company.com\DAG01.company.com)) Offline-->OnlineCallIssued.

000010d4.000013d4::2013/07/15-15:52:15.995 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, ClusterGroupChoosingOwner --> Pending)

000010d4.000013d4::2013/07/15-15:52:15.995 DBG   [RCM] rcm::RcmResource::WaitForState(File Share Witness (\\witness01.company.com\DAG01.company.com), Online)

000010d4.00000e30::2013/07/15-15:52:15.995 DBG   [CM] mscs::CheckpointManager::PreOnline: File Share Witness (\\witness01.company.com\DAG01.company.com)

000010d4.00000e30::2013/07/15-15:52:15.995 DBG   [RCM] Issuing Arbitrate(File Share Witness (\\witness01.company.com\DAG01.company.com)) to RHS.

00001048.00000eec::2013/07/15-15:52:15.995 INFO  [RES] File Share Witness <File Share Witness (\\witness01.company.com\DAG01.company.com)>: Beginning arbitration ...

000010d4.0000154c::2013/07/15-15:52:16.042 DBG   [API] s_ApiMoveGroupToNode(Cluster Group, 1)

000010d4.0000154c::2013/07/15-15:52:16.042 INFO  [RCM] rcm::RcmApi::MoveGroup: (Cluster Group, 1)

000010d4.0000154c::2013/07/15-15:52:16.042 DBG   [RCM] rcm::RcmGroup::WaitForStableState(Cluster Group, MustBeOfflineOrFailed::No)

000010d4.0000154c::2013/07/15-15:52:16.042 DBG   [RCM] rcm::RcmGroup::WaitForStableState: Group Cluster Group is Pending; group is not moving.

000010d4.0000154c::2013/07/15-15:52:16.042 DBG   [RCM] rcm::RcmGroup::WaitForStableState: Resources which are not in stable state:

000010d4.0000154c::2013/07/15-15:52:16.042 DBG   [RCM] File Share Witness (\\witness01.company.com\DAG01.company.com): OnlineCallIssued,

000010d4.00000ddc::2013/07/15-15:52:16.808 INFO  [JPM] Node 2: Selected partition 802(1) as a target for join

000010d4.00000ddc::2013/07/15-15:52:16.808 DBG   [JPM] Node 2: join attempt 802(1) was vetoed. Will retry

00001048.00000eec::2013/07/15-15:52:17.261 INFO  [RES] File Share Witness <File Share Witness (\\witness01.company.com\DAG01.company.com)>: Opening file \\witness01.company.com\DAG01.company.com\bd0940e0-a02c-4622-86e0-d9418f055e02\Witness.log.

00001048.00000eec::2013/07/15-15:52:17.526 INFO  [RES] File Share Witness <File Share Witness (\\witness01.company.com\DAG01.company.com)>: Attempting to lock file \\witness01.company.com\DAG01.company.com\bd0940e0-a02c-4622-86e0-d9418f055e02\Witness.log, try 1 of 30.

000010d4.00000ddc::2013/07/15-15:52:17.808 INFO  [JPM] Node 2: Selected partition 802(1) as a target for join

000010d4.00000ddc::2013/07/15-15:52:17.808 DBG   [JPM] Node 2: join attempt 802(1) was vetoed. Will retry

000010d4.00000ddc::2013/07/15-15:52:18.808 INFO  [JPM] Node 2: Selected partition 802(1) as a target for join

000010d4.00000ddc::2013/07/15-15:52:18.808 DBG   [JPM] Node 2: join attempt 802(1) was vetoed. Will retry

000010d4.0000154c::2013/07/15-15:52:19.167 DBG   [RCM] File Share Witness (\\witness01.company.com\DAG01.company.com): OnlineCallIssued,

 

 

 

-------------------------------------------------------------NODE001-----------------------------------------------------------------

000019a0.00001924::2013/07/15-15:52:14.203 INFO  [VER] Got new TCP connection. Exchanging version data.

000019a0.00001924::2013/07/15-15:52:14.343 DBG   [VER] Calculated cluster versions: highest [Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1], lowest [Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1] with exclude node list: (2)

000019a0.00001924::2013/07/15-15:52:14.343 INFO  [VER] Checking version compatibility for node node002 id 2 with following versions: highest [Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1], lowest [Major 6 Minor 7601 Upgrade 7 ClusterVersion 0x00061DB1].

000019a0.00001924::2013/07/15-15:52:14.343 INFO  [VER] Version check passed: node and cluster highest supported versions match.

000019a0.00001924::2013/07/15-15:52:14.343 INFO  [SV] Negotiating message security level.

000019a0.00001924::2013/07/15-15:52:14.484 INFO  [SV] Already protecting connection with message security level 'Sign'.

000019a0.00001924::2013/07/15-15:52:14.484 INFO  [FTI] Got new raw TCP/IP connection.

000019a0.00001924::2013/07/15-15:52:14.484 INFO  [FTI][Initiator] This node (1) is initiator

 

000019a0.00001924::2013/07/15-15:52:14.484 WARN  [FTI][Initiator] Ignoring duplicate connection: usable route already exists

000019a0.00001924::2013/07/15-15:52:14.484 DBG   [CHANNEL 192.168.2.22:~51189~] Close().

000019a0.00001924::2013/07/15-15:52:14.484 DBG   [CHANNEL 192.168.2.22:~51189~]/send: Attempting to perform I/O on closed stream.

000019a0.00001924::2013/07/15-15:52:14.484 DBG   [CHANNEL 192.168.2.22:~51189~] Not closing handle because it is invalid.

000019a0.00001924::2013/07/15-15:52:14.484 INFO  [CHANNEL 192.168.2.22:~51189~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)

 

000019a0.00001924::2013/07/15-15:52:14.484 DBG   [CHANNEL 192.168.2.22:~51189~] Not closing handle because it is invalid.

 

000019a0.00001924::2013/07/15-15:52:14.484 WARN  mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.168.2.22:~51189~ is closed'

000019a0.00001924::2013/07/15-15:52:15.062 DBG   [NETFTAPI] received NsiParameterNotification for 169.254.1.204 (IpDadStatePreferred )

000019a0.00001924::2013/07/15-15:52:15.062 DBG   [NETFTAPI] Signaled NetftLocalConnect event for 169.254.1.204

000019a0.00001924::2013/07/15-15:52:15.062 DBG   [NETFTEVM] FTI NetFT event handler got event: Local endpoint 169.254.1.204:~0~ connected

000019a0.00001924::2013/07/15-15:52:15.062 DBG   [NETFTEVM] TM NetFT event handler got event: Local endpoint 169.254.1.204:~0~ connected

000019a0.00001924::2013/07/15-15:52:15.062 DBG   [NETFTEVM] IM NetFT event handler got event: Local endpoint 169.254.1.204:~0~ connected

000019a0.00001924::2013/07/15-15:52:15.062 DBG   [WM] Filtering event NETFT_LOCAL_CONNECT? 1

000019a0.00001820::2013/07/15-15:52:15.062 DBG   [NETFTEVM] TM NetFT event dispatcher pushing event: Local endpoint 169.254.1.204:~0~ connected

000019a0.000013a0::2013/07/15-15:52:15.062 DBG   [NETFTEVM] FTI NetFT event dispatcher pushing event: Local endpoint 169.254.1.204:~0~ connected

000019a0.000019d4::2013/07/15-15:52:15.062 DBG   [NETFTEVM] IM NetFT event dispatcher pushing event: Local endpoint 169.254.1.204:~0~ connected

000019a0.000019d4::2013/07/15-15:52:15.062 INFO  [IM] got event: Local endpoint 169.254.1.204:~0~ connected

000019a0.00001bfc::2013/07/15-15:52:15.390 DBG   [SM] Joiner: ISC returned status = 590610 output Blob size 1577

000019a0.00001bfc::2013/07/15-15:52:15.531 DBG   [SM] Joiner: Received SSPI blob from the Sponsor of size 155

000019a0.00001bfc::2013/07/15-15:52:15.531 DBG   [SM] Joiner: ISC returned status = 0 output Blob size 0

000019a0.000019a8::2013/07/15-15:52:15.546 INFO  [SV] Authentication and authorization were successful

000019a0.000019a8::2013/07/15-15:52:15.546 DBG   [SM] Joiner: Initialized with SPN = node002, Package = Kerberos, RequiredCtxAttrib = 67586, HandShakeTimeout = 30000

000019a0.00001bfc::2013/07/15-15:52:15.546 DBG   [SM] Handling auth handshake posted by thread id 6568

000019a0.00001bfc::2013/07/15-15:52:15.546 DBG   [SM] Joiner: ISC returned status = 590610 output Blob size 1577

000019a0.00001bfc::2013/07/15-15:52:15.671 DBG   [SM] Joiner: Received SSPI blob from the Sponsor of size 155

000019a0.00001bfc::2013/07/15-15:52:15.671 DBG   [SM] Joiner: ISC returned status = 0 output Blob size 0

000019a0.000019a8::2013/07/15-15:52:15.671 INFO  [SV] Security Handshake successful while obtaining SecurityContext for NetFT driver

000019a0.000019a8::2013/07/15-15:52:15.671 DBG   [SV] Incoming (second) connection from node002 is secure

000019a0.000019a8::2013/07/15-15:52:15.671 INFO  [ReM] Got stream info from fe80::7964:c5c5:3b5:7833%12:~37758~ to fe80::71f7:22a3:89fb:ab11%12:~3343~.

000019a0.000019a8::2013/07/15-15:52:15.671 DBG   [ReM] Exchanging local info.

000019a0.000019a8::2013/07/15-15:52:15.671 DBG   [ReM] Sending local info.

000019a0.000019a8::2013/07/15-15:52:15.671 DBG   [ReM] Local info sent, receiving remote info.

000019a0.000019a8::2013/07/15-15:52:15.812 DBG   [ReM] Remote info received from 2:node002.

000019a0.000019a8::2013/07/15-15:52:15.812 DBG   [ReM][Leader] I did not initiate connection, getting epoch from stream NodeObject.

000019a0.000019a8::2013/07/15-15:52:15.812 DBG   [NODE] Node 1: To n2 getting epoch (currently 0)

000019a0.000019a8::2013/07/15-15:52:15.812 DBG   [ReM][Leader] I am the leader, my epoch = 0, sn = 0

000019a0.000019a8::2013/07/15-15:52:15.953 DBG   [ReM][Leader] The follower's epoch = 0, SN = 0, Fault Tolerant Session ID = 00000000-0000-0000-0000-000000000000

000019a0.000019a8::2013/07/15-15:52:15.953 DBG   [ReM][Leader] My node did not initiate the connection.

000019a0.000019a8::2013/07/15-15:52:15.953 INFO  [ReM][Leader] Allowing new connection through to n2 (initiatorEpoch <0>, receiverEpoch <0>).

000019a0.000019a8::2013/07/15-15:52:15.953 INFO  [ReM] Sending connection down normal path.

000019a0.000019a8::2013/07/15-15:52:15.953 INFO  [JPM] Received a new stream from node002


