Channel: High Availability (Clustering) forum

Partition information lost on cluster shared disk


Hi everyone,


We've got a cluster virtual disk whose partition table and volume name were lost. Has anyone experienced a similar problem and can give us some hints on how to recover?


The problem occurred last Friday. I restarted node3 for Windows updates. During the restart, node1 had a bluescreen and also restarted. Failover Cluster Manager tried to bring the cluster resources online but failed several times. Finally the resources settled on node1, which came back up soon after the crash. Many virtual disks were in an unhealthy state, but the repair process managed to repair all of them, so they are now in a healthy state. We aren't able to explain why node1 crashed. Since the storage pool is in dual parity mode, the disks should be able to keep working even if only 2 nodes are running.

One virtual disk, however, lost its partition information.


Network config:

Hardware: 2x Emulex OneConnect OCe14102-NT, 2x Intel(R) Ethernet Connection X722 for 10GBASE-T

Backbone network: on the "right" Emulex network card (the only members of this subnet are the 4 nodes)

Client-access teaming network: the Emulex "left" and Intel "left" cards in one team; 1 untagged network and 2 tagged networks


Software Specs:

    • Windows Server 2016
    • Cluster with 4 cluster nodes
    • Failover Cluster Manager + File Server roles running on the cluster
    • 1 storage pool with 36 HDDs / 12 SSDs (9 HDDs / 3 SSDs on each node)
    • Virtual disks are configured to use dual parity:

Get-VirtualDisk Archiv | Get-StorageTier | fl

       FriendlyName           : Archiv_capacity
       MediaType              : HDD
       NumberOfColumns        : 4
       NumberOfDataCopies     : 1
       NumberOfGroups         : 1
       ParityLayout           : Non-rotated Parity
       PhysicalDiskRedundancy : 2
       ProvisioningType       : Fixed
       ResiliencySettingName  : Parity

Hardware Specs per Node:

  • 2x Intel Xeon Silver 4110
  • 9 HDDs at 4 TB each and 3 SSDs at 1 TB each
  • 32GB RAM on each node

Additional information:

The virtual disk is currently in a Healthy state:

Get-VirtualDisk -FriendlyName Archiv

FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach   Size
------------ --------------------- ----------------- ------------ --------------   ----
Archiv                             OK                Healthy      True           500 GB


The storage pool is also healthy:

PS C:\Windows\system32> Get-StoragePool
FriendlyName   OperationalStatus HealthStatus IsPrimordial IsReadOnly
------------   ----------------- ------------ ------------ ----------
Primordial     OK                Healthy      True         False
Primordial     OK                Healthy      True         False
tn-sof-cluster OK                Healthy      False        False


Since the incident, the event log on the current master (Node2) has shown various errors for this disk, such as:

[RES] Physical Disk <Cluster Virtual Disk (Archiv)>: VolumeIsNtfs: Failed to get volume information for \\?\GLOBALROOT\Device\Harddisk13\ClusterPartition2\. Error: 1005.
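
The 1005 at the end of this message is a plain Win32 error code (ERROR_UNRECOGNIZED_VOLUME), which fits a volume whose file system can no longer be identified. Below is a minimal, read-only sketch for looking up the message text and checking which partition layout Windows still sees. The disk number 13 is taken from the log path above and is an assumption: disk numbers differ between nodes and can change after a reboot.

```powershell
# Translate the Win32 error code from the cluster log into its message text.
$err = [System.ComponentModel.Win32Exception]::new(1005)
$err.Message

# Assumption: disk 13 is the Archiv virtual disk on the node that currently owns it.
# Show the partition style and the partitions Windows can still see (read-only).
Get-Disk -Number 13 |
    Format-List Number, FriendlyName, PartitionStyle, OperationalStatus
Get-Partition -DiskNumber 13 -ErrorAction SilentlyContinue |
    Format-Table PartitionNumber, Type, Size, DriveLetter
```

If Get-Partition returns nothing, or PartitionStyle shows RAW, that would suggest the partition table itself is gone rather than only the NTFS metadata.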


Before the incident we also had errors that might indicate a problem:

[API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.
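
The log prints this error as an unsigned decimal, which is hard to search for. A small sketch for converting it to the usual hexadecimal form follows; Format-StatusCode is a hypothetical helper name, not a built-in cmdlet.

```powershell
# Format a decimal status code from the cluster log as 32-bit hexadecimal.
# Format-StatusCode is a made-up helper for illustration.
function Format-StatusCode {
    param([uint32]$Code)
    '0x{0:X8}' -f $Code
}

Format-StatusCode 3221356570   # 0xC002001A
```

As far as we can tell, the 0xC002 prefix falls in the RPC runtime range of NTSTATUS values, which would match the failing RpcBindingInqLocalClientPID call, but we have not confirmed what this specific code means here.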


Our suspicions so far:

We made registry changes under SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0001 (through 0009) and set the value PnPCapabilities to 280, which clears the checkbox "Allow the computer to turn off this device to save power". Not all network adapters support this checkbox, so this may have had some side effects.
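
To see which adapters actually carry the changed value, the subkeys can be listed with a read-only sketch like the one below. It assumes the subkeys 0001 through 0009 mentioned above; the numbering differs from machine to machine, so the range may need adjusting.

```powershell
# Network adapter class key; the per-adapter subkeys are named 0000, 0001, ...
$class = 'HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}'

# Assumption: subkeys 0001..0009 are the ones that were changed.
1..9 | ForEach-Object {
    $name  = '{0:D4}' -f $_
    $props = Get-ItemProperty -LiteralPath (Join-Path $class $name) -ErrorAction SilentlyContinue
    [pscustomobject]@{
        Subkey          = $name
        DriverDesc      = $props.DriverDesc
        PnPCapabilities = $props.PnPCapabilities   # empty if the value is not set
    }
} | Format-Table -AutoSize
```

Running this on each node would show whether the value ended up on adapters (for example virtual or team adapters) where it was never intended.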



One curiosity: after the error we noticed that one of the 2 tagged networks had the wrong subnet on two nodes. This may have caused some of the failover role switches that occurred on Friday, but we're unsure how it happened, since the networks had been configured correctly some time before.
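
A sketch for comparing the subnets as the cluster sees them with the addresses configured on each node. It assumes the FailoverClusters module is available and that PowerShell remoting works between the nodes; both commands only read configuration.

```powershell
# Subnets as the cluster sees them (requires the FailoverClusters module).
Get-ClusterNetwork | Format-Table Name, Address, AddressMask, Role, State

# IPv4 address and prefix length per interface on every node, for comparison.
# Assumption: PowerShell remoting is enabled between the cluster nodes.
Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock {
    Get-NetIPAddress -AddressFamily IPv4 |
        Select-Object InterfaceAlias, IPAddress, PrefixLength
} | Sort-Object InterfaceAlias, PSComputerName |
    Format-Table PSComputerName, InterfaceAlias, IPAddress, PrefixLength
```

Sorting by interface first puts the same network on all nodes side by side, so a node with a deviating prefix length or address range stands out.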

We've had a similar problem in our test environment after activating jumbo frames on the network interfaces. In that case we lost more and more filesystems after moving the file server role to another server. In the end all filesystems were lost and we reinstalled the whole cluster without enabling jumbo frames.

We now suspect that combining two different network cards (Emulex and Intel) in the same network team may cause this problem.
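
To document the team composition and the jumbo frame setting per adapter, something like the following could be run on each node. This is a sketch that assumes the team was created with the built-in LBFO teaming of Windows Server 2016; a Switch Embedded Team would need different cmdlets.

```powershell
# Assumption: the team uses Windows LBFO teaming (NetLbfo cmdlets).
Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm, Status

# Team members with their hardware description, to spot mixed vendors in one team.
Get-NetLbfoTeamMember | Format-Table Name, InterfaceDescription, Team

# Driver versions and link speed of the physical adapters.
Get-NetAdapter -Physical |
    Format-Table Name, InterfaceDescription, DriverVersion, LinkSpeed

# Jumbo frame setting per adapter, where the driver exposes the standard keyword.
Get-NetAdapterAdvancedProperty -Name * -RegistryKeyword '*JumboPacket' -ErrorAction SilentlyContinue |
    Format-Table Name, DisplayName, DisplayValue
```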

What are your ideas? What may have caused the problem and how can we prevent this from happening again?

We could live with the loss of this virtual disk, since it held only archive data and we have a backup, but we'd still like to be able to fix this problem.

Best regards

Tobias Kolkmann


