Windows 2016 GPO settings

September 16, 2019, 1:39 pm

I've been searching and just can't come up with much. Is there a site that lists which group policies need to be set? I'm by no means an expert in all these settings. I use the IRS SCSEM for Windows Server 2016, and it blocks failover clustering. I was hoping to get pointed in the right direction to fix this problem. Thx

↧

SMB Signing breaks CSV access cross-node

September 16, 2019, 7:29 pm

Hey all, couldn't find an article that answers my problem, so starting my own :).
Hopefully I put in enough detail.

Server 2012 R2 Hyper-V Failover Cluster environment.
2 nodes. 1 SAN via SAS.
Disks added as CSV. Hyper-V config and vhds on CSVs.
Each node has 12 NICs.
NIC 1 - Mgmt - Gateway IP, DNS IP - 192.168.0.X/24
NIC 2 - Live Migration - IP only, no Gateway, no DNS - 10.20.30.X/24
NIC 3 to 10 - Windows Teamed Interface - LACP on Switch, added as Virtual Switch, External network, does not share mgmt
NIC 12 - DMZ - added as Virtual Switch, External network, does not share mgmt

Everything is fine. Cluster works, live migration works.

Recently we've been going through a security exercise, running Tenable.io and remediating the results found.
One of them is SMB Signing. I have been enabling the Group Policy "Microsoft network server: Digitally sign communications (always)" across various servers, testing along the way.

That worked until I applied it to my nodes. My CSVs don't appear to like it. After a few days, when I look at a CSV in C:\ClusterStorage that is owned by another node, I can't see the space used, and when I try to open it, I get "you have been denied permission to access this folder".
Removing "Microsoft network server: Digitally sign communications (always)" on both instantly restores this communication.

After googling around, I have been witnessing a few Event Log errors in SMBClient, Event 30803 and 31010, but I'm not yet sure if it's related. I am still trying to monitor it without the policy change. This is an example:

[Event ID 30803]

The network connection failed.

Error: {Device Timeout}
The specified I/O operation on %hs was not completed before the time-out period expired.

Server name: fe80::e0a9:e45:5b2b:f594%25
Server address: 10.20.30.2:445
Connection type: Wsk

Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks port 445 or 5445 can also cause this issue.

[Event ID 31010]

The SMB client failed to connect to the share.

Error: {Access Denied}
A process has requested access to an object, but has not been granted those access rights.

Path: \fe80::e0a9:e45:5b2b:f594%25\454b7f2d-4e6c-4332-ae29-5e4befc5ce5b-135266304$

So what am I missing? Is it something to do with SMB Signing trying to verify an identity, and CSVs are using SMB across the Live Migration network, 10.20.30.2, but these errors are showing IPv6 address as a server name?
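
One way to check whether every failing connection targets the Live Migration network is to pull the labelled fields out of each SMBClient event message and test the address against that subnet. The following is only a sketch: the field labels are copied from the events quoted above, the 10.20.30.0/24 network comes from the NIC list, and the function names are made up for illustration.

```python
import ipaddress
import re

# Field labels as they appear in the SMBClient event text quoted above.
FIELD_RE = re.compile(
    r"^(Server name|Server address|Connection type|Path):\s*(.+)$",
    re.MULTILINE,
)


def parse_smb_event(message: str) -> dict:
    """Return the labelled fields found in one SMBClient event message."""
    return {label: value.strip() for label, value in FIELD_RE.findall(message)}


def address_in_network(server_address: str, cidr: str) -> bool:
    """True if an IPv4 'ip:port' value falls inside the given network."""
    host, _, _port = server_address.rpartition(":")
    return ipaddress.ip_address(host) in ipaddress.ip_network(cidr)


sample = """The network connection failed.
Server name: fe80::e0a9:e45:5b2b:f594%25
Server address: 10.20.30.2:445
Connection type: Wsk"""

fields = parse_smb_event(sample)
print(fields["Server name"])
print(address_in_network(fields["Server address"], "10.20.30.0/24"))
```

If the server name is always a link-local IPv6 address while the server address is on the Live Migration subnet, that at least confirms which path the CSV traffic is taking when the failures occur.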

↧

Adding node to cluster that is on different vlan

September 17, 2019, 6:49 am

I am trying to add a node to my cluster that is located in a different VLAN. I have created firewall rules to allow communication: UDP 3343, 137, a random port between 1024-65535, and a random port between 49152-65535; and TCP 3343, 135. Am I missing any ports? I am still unable to add the node to the cluster; the message says, "The node cannot be contacted. Ensure that the node is powered on and is connected to the network." I can confirm the server is up and running and connected to the network.
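
A quick way to see which TCP ports actually get through the firewall is to attempt a connection to each one from an existing cluster node. This is a rough sketch, not a validation tool: ports 3343 and 135 come from the post, port 445 (SMB) is added as an assumption, and UDP 3343 and the dynamic RPC range cannot be verified with a simple TCP connect.

```python
import socket

# TCP 3343 and 135 are from the post; 445 (SMB) is an assumption.
TCP_PORTS = (135, 445, 3343)


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_node(host: str, ports=TCP_PORTS) -> dict:
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: tcp_port_open(host, port) for port in ports}

# Example call with a hypothetical host name:
#   check_node("newnode.example.local")
```

A port that reports False here is either blocked on the path or has nothing listening on the target, so the result narrows the search rather than proving a firewall fault.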

↧

Microsoft Network Load Balancing not working as expected

September 18, 2019, 2:45 am

I wish to have a failover cluster for an IIS site in my domain.
I have configured the cluster on port 80. However, the cluster only detects a node as down once that node's network is down.
If I stop the site through IIS Manager, that node is still considered healthy.
What am I doing wrong? Is this what the product is supposed to do? If not, what other product can help me?
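
NLB tracks host heartbeats rather than the state of an individual IIS site, which would match the behaviour described. One common workaround is an external probe that tests the site itself and then takes some action when it fails. Below is a minimal sketch of such a probe only; the function name is illustrative, the remediation step is left out, and redirects are treated as unhealthy.

```python
import http.client
from urllib.parse import urlsplit


def site_is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True only if a GET on the URL answers with an HTTP 2xx status."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    # A port of None makes http.client use the scheme's default port.
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        return 200 <= conn.getresponse().status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed response.
        return False
    finally:
        conn.close()
```

A scheduled task on each node could call this against the local site and react when it returns False; what that reaction should be depends on the environment.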

↧

Failover Cluster Manager bug on Server 2019 after .NET 4.8 installed - unable to type more than two characters in to the IP fields

September 20, 2019, 8:46 am

We ran into a nasty bug on Windows Server 2019 and I can't find any KB articles on it. It's really easy to replicate.

1. Install Windows Server 2019 Standard with Desktop Experience from an ISO.

2. Install Failover Cluster Services.

3. Create new cluster, on the 4th screen, add the current server name. This is what it shows:

cluster services working correctly before .NET 4.8 is installed

4. Install .NET 4.8 from an offline installer. (KB4486153) and reboot.

5. After the reboot, go back to the same screen of the same Create Cluster Wizard and now it looks different:

cluster services broken after .NET 4.8 is installed - unable to put in a 3-digit IP

Now we are unable to type in a 3 digit IP in any of the octet fields. It accepts a maximum of two characters.

Has anyone else encountered this? It should be really easy to reproduce.

↧

VMs located on one of CSV volumes stopped migrating on one of cluster nodes

September 24, 2019, 7:53 am

We have a 3-node Windows 2016 cluster with many VMs on 3 CSV volumes. At some point (I'm not sure when), VMs located on the first CSV volume stopped migrating (both live and quick) to the first node (only to the first node). The 1st volume is still visible from the 1st node. Cluster validation didn't show any problem.
In event log Microsoft-Windows-Hyper-V-VMMS/Admin on 1st node:
EventID:16300
Cannot load a virtual machine configuration: The system cannot find file specified. (0x80070002) (Virtual machine ID ....)
EventID:21002
'VM name' Failed to create Planned Virtual Machine at migration destination:The system cannot find file specified. (0x80070002) (Virtual machine ID ....)

Any ideas how to fix this problem?

I would appreciate any help.

Thanks.

↧

How to expand vhdx disk VM on failover cluster

September 24, 2019, 7:56 am

Hi,

I want to know the right way to expand a VHDX disk of a VM running on a 2-node failover cluster. The nodes and the guest OS are running Windows Server 2012 R2.

I know that it is possible to expand it online (with the VM running), but when I open the VM settings page from Hyper-V Manager, it says "some settings cannot be modified because the virtual machine was running".

Thanks in advance.

Cristian L Ruiz

↧

Migrate Sql 2008 Cluster

September 26, 2019, 12:46 pm

Hi Folks

I am kind of stuck with the below,

1) Procedure to migrate 2008 SQL Cluster VM (connected to Dell Equallogic 4120E ISCSI 1TB LUN) to SQL 2017 Hyper-v Cluster.

2) Procedure to migrate a 2008 VM cluster to a 2019 Hyper-V cluster: would we use the Microsoft migration tool to migrate to 2012 and then perform an online migration to 2019 from there?

I would appreciate it if someone could help with this.

↧

Different amounts of RAM in Hyper-V Hosts

October 2, 2019, 1:22 am

Hi there,

I have a client with a Windows Server 2016 Hyper-V failover cluster consisting of 2 DL 380's with 1024 GB RAM on each.

The client is running out of CPU resources and is considering buying a new server (another DL 380) to join to the cluster. Is it necessary to have the same amount of RAM (1024 GB) on the new host or can we install less RAM?

Will installing less RAM on the additional host cause the cluster validation wizard to fail, and will this configuration be supported by Microsoft? I cannot seem to find any official guidance.

Thanks.

↧

Cluster Aware Update

October 2, 2019, 5:29 am

Hi,

I have a Windows Server 2012 R2 cluster with 3 nodes and 15 to 15 VMs on the Hyper-V cluster. Normally we use a local WSUS for Windows updates. First we download the updates on each cluster machine, install them, and reboot if required, then repeat the same procedure step by step for each cluster node.

Can I use the Cluster-Aware Updating mechanism to update my cluster nodes? Please note that I install security updates, update roll-ups, etc.

Please comment.

↧

if resources fails, attempt restart on current node

October 3, 2019, 5:38 am

Period For Restarts

Maximum Restarts in the specified period

I am struggling to find anything that explains what this functionality means.

If I set the maximum restarts to 3, then does the cluster try to start the affected service 3 times before failing over? Do these 3 restarts happen immediately after each other, or is there some wait time built in?

How does the Period for restarts impact on the activities?
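
To make the two settings concrete, here is a simplified model of how a restart limit within a period could behave: each failure leads to a restart on the current node until the number of restarts inside the period reaches the maximum, after which the next failure escalates to a failover. This is a sketch under assumptions, not the cluster service's actual algorithm: the sliding-window interpretation, the function name, and the omission of any delay between restarts are all mine.

```python
from collections import deque


def decide_actions(failure_times, max_restarts, period_seconds):
    """Return 'restart' or 'failover' for each failure timestamp (in seconds)."""
    recent_restarts = deque()  # timestamps of restarts still inside the period
    actions = []
    for t in failure_times:
        # Forget restarts that happened more than one period ago.
        while recent_restarts and t - recent_restarts[0] >= period_seconds:
            recent_restarts.popleft()
        if len(recent_restarts) < max_restarts:
            recent_restarts.append(t)
            actions.append("restart")
        else:
            actions.append("failover")
    return actions


# Four failures ten seconds apart, 3 restarts allowed per 15 minutes.
print(decide_actions([0, 10, 20, 30], max_restarts=3, period_seconds=900))
# Four failures spaced further apart than the period.
print(decide_actions([0, 1000, 2000, 3000], max_restarts=3, period_seconds=900))
```

In this model the first run ends in a failover on the fourth failure, while the second run only ever restarts, because earlier restarts have aged out of the period by the time the next failure arrives.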

↧

WSSD vs. Azure Stack HCI certification

October 4, 2019, 11:28 am

A team member and I are having a debate. We want to know if it is "safe" to use the very recently released Lenovo SR635 or SR655 EPYC based servers, for building our own Win2019 Storage Spaces Direct cluster (all cluster components will be Windows certified).

The servers are listed in the Windows Server Catalog as Win2019 with Software-Defined Data Center (SDDC) Premium certification (SR635, SR655).

They are not listed in the Azure Stack HCI Catalog.

He firmly believes that the systems need to be in the Azure Stack HCI catalog in order to proceed

based on this PDF from Lenovo Certified Config for Microsoft S2D

I believe that we can use the servers

The S2D Hardware Requirements page used to state that only Software-Defined Data Center (SDDC) certification is required (this changed in August ;-[).
I look at the Lenovo doc as a list of configs that Lenovo will support (FYI, these servers were released after the PDF was published).
The PDF is not a list of systems that can be used for S2D if we are the ones supporting the cluster/solution.

So, which of us is "right"?

Regardless of "right", would you proceed anyways?

↧

Rebooting Server 2016 SQL Failover Cluster Node results in Blue Screen 0x0D1 after trying to recover cluster state upon booting up

October 5, 2019, 10:16 am

Hello,

I have an odd one. While a node is live, without draining or removing from the cluster we do the following:

1. Reboot it

2. Upon coming back up, sign in

3. Within a minute it'll bluescreen

4. Boot back up, sign in, everything is fine

The dump shows ntoskrnl.exe DRIVER_IRQL_LESS_THAN_OR_EQUAL 0x000000D1

If you check the cluster operational log, you'll see it start some GUM process with GrantLock and RequestLock. This happens over and over until it bluescreens. The reboot following the bluescreen shows GUM too, but it only does the "processing locally". Events below:

Preceding the bluescreen (these repeated over and over and were even suppressed, per the application log):

[GUM] Node 2: Processing RequestLock 4:595
[GUM] Node 2: Processing GrantLock to 4 (sent by 5 gumid: 20121)

Post Bluescreen (note these still showed pre-bluescreen above but rarely):

[GUM]Node 2: Executing locally gumId: 20121, updates: 1, first action: /dm/update

Before the bluescreen, the event viewer shows the following happening with the NIC. Keep in mind this NIC is a part of a team. 2 of the 4 team members are down (waiting to be plugged in if the others die) and 2 are live. This team is handled by the OS in Server Manager. We are using the latest Intel drivers, not system drivers.

Reboot - 9:51
Kernel Power Hardware Notifications upon boot up
Connectivity state in standby: Disconnected, Reason: NIC compliance - 9:54

both adapters come online - 9:54
Intel® Ethernet 10G 4P X520/I350 rNDC
Network link has been established at 10Gbps full duplex.

and

Intel® Ethernet 10G 2P X520 Adapter #2
Network link has been established at 10Gbps full duplex.

===============================

NIC report disconnected

Intel® Ethernet 10G 2P X520 Adapter
Network link is disconnected.

Intel® Ethernet 10G 4P X520/I350 rNDC #2
Network link is disconnected.

MsLbfoSys

Member Nic {30793b81-07bd-4afe-85f6-6dd873581384} Connected.

NIC Disconnects again

Intel® Ethernet 10G 4P X520/I350 rNDC
Network link is disconnected.

NICs reconnect

Intel® Ethernet 10G 4P X520/I350 rNDC
Network link has been established at 10Gbps full duplex.

MsLbfoSys

Member Nic {7947a925-563e-4bf8-b3c6-73c46ef2d4ed} Connected.

DNS Resolution and Domain Resolution fail - 9:55

lphplsvc reports that network is coming up - 9:55

At this point you can sign in to the server, and shortly thereafter it'll bluescreen. I have not yet tested it, but I believe it will also bluescreen without signing in (as was reported to me); I'm just relaying the most recent event. This doesn't happen every time; it's about 50/50. I'll test in my lab this coming week to reproduce it. Is there anything additional I should capture? As a note, this is reproducible across 10 similar physical servers, split into 2 clusters of 5 each.

I see a hotfix for this 0x0D1 issue for Server 2012, but this is 2016. I have a feeling that the network coming up causes Windows or the cluster to grab the address space for the driver, and then the other one tries for it upon the network recovery above, but the first fails to release the address space. I am assuming the cluster is snagging it and then Windows is trying afterwards, hence ntoskrnl.exe being at fault.

Any input would be great; this is an odd one and I'm hoping to track it down. I understand that delaying the startup of SQL services might be a suggestion, but I've seen mixed reviews on doing that, and since this seems to be cluster activity rather than SQL, I'm wondering if that is even an option here.

↧

NLB - only one hosts gets hit

October 8, 2019, 3:07 am

Hello,

I'm struggling very hard to set up a VPN solution with NLB enabled using 2x Server 2k16.

For some reason, after I create the cluster, the VPN Client can only connect to one of the servers. If I stop it, the connection is not possible. The cluster is in Multicast mode and I have confirmed with the network team that on the perimeter firewall the MAC address of the cluster is in the ARP table with the correct cluster IP.

I've already tried stripping down and re-creating the cluster but always get the same results.

Does anyone have any ideas what I need to check?

Kind regards,

Wojciech

↧

SCOM2016 monitoring S2D General File Server

October 10, 2019, 8:42 am

We are running a SOFS and a GFS on a S2D cluster. We have SCOM2016 running with Storage Spaces Direct Management Pack and SCOM sees the SOFS shares with a lot of great information. I am not seeing the GFS in SCOM. Any ideas?

↧

Storage Spaces Direct on Server 2016 : SMB connections impossible from node

October 12, 2019, 12:51 pm

We've got a 4-node Storage Spaces Direct (S2D) cluster that has been working for more than 1.5 years without any issue. The OS is Windows Server 2016.

Firewall down for all profiles
No antivirus installed, Windows Defender OFF
Active Directory delegations untouched
No change in the network infrastructure has been reported
RDMA was disabled 1 year ago, as we found out the NIC didn't fully support it

Two days ago, we noticed a lot of error messages in the cluster event log, and the backup jobs (made via VEEAM) of all Hyper-V VMs hosted on the cluster failed.

Investigation quickly showed there are many issues with the SMB connections.

Any of the 4 hosts :

can ping other resources in the network
can't connect any shared folders
NTP sync fails (net time \\server fails, and so does w32tm /monitor)

Obviously, the File Share Witness fails as well, and some issues with Domain services are reported...

We tried rebooting the nodes separately, and after a reboot the SMB connections are just fine... for a few minutes/hours, and then the issue arises again.

The impact on the cluster, along with the File Share Witness being offline, is that we can't reliably perform a Live Migration of the VMs between the nodes (it succeeds randomly). A Quick Migration works like a charm, though. As SMB connections are not possible, we can't move the VMs to another cluster or standalone host.

We fear the cluster will go haywire if a node fails uncontrollably. Even though the VMs are stable, we still can't perform a backup (we could perform an export).

Have any of you heard about this issue with S2D or the Microsoft Failover Cluster role? It might also be unrelated to the cluster itself...

What can be done to find the root cause of this issue? Any ideas to help me troubleshoot this?

Here are samples of the logs found in the cluster role, and in the event logs for SMBClient :

From the Cluster console:

Cluster network name resource 'Cluster Name' encountered an error enabling the network name on this node. The reason for the failure was: 'Unable to obtain a logon token'.
The error code was '1311'.
You may take the network name resource offline and online again to retry.

Event with ID 30803 :

Failed to establish a network connection.
Error: {Device Timeout} The specified I/O operation on %hs was not completed before the time-out period expired.
Server name: server.domain.com
Server address: x.x.x.x:445 Connection type: Wsk
Guidance: This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when using an iWARP RDMA adapter, can also cause this issue.

Another one, ID 30804 :

A network connection was disconnected.
Server name: \server.domain.com Server address: x.x.x.x:445 Connection type: Wsk
Guidance: This indicates that the client's connection to the server was disconnected.
Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects and poor performance.

↧

When we fail-over from Node1 to Node2. fail-over cluster manager shows Node2 as down. but SQL cluster services are running fine

October 13, 2019, 10:31 pm

When we fail over from Node1 to Node2, Failover Cluster Manager shows Node2 as down, but the SQL cluster services are running fine. This is happening on the Windows Server 2012 R2 operating system.

Once I restart the cluster service in services.msc, the node comes up and joins the cluster.

Please check and help with this.

↧

Precarious Situation Hyper-V Guest O/S

October 15, 2019, 11:47 am

I have 22 virtual machines; each has its own LUN, and they are clustered among 4 nodes.

Recently there was a power failure and we had to reboot the hosts.

Afterwards the VMs couldn't start, because the drive letters were no longer assigned and each VM was sitting on a drive that had gone down.

I had to manually assign drive letters from diskmgmt.msc to make it all work.

Figuring out which drive letter was assigned to which VM was a challenge, since the config was a UNICODE file.

How can I prevent this situation in future?

Thanks
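
On the narrower problem of working out which drive letter each VM expects: assuming the VM configuration files are UTF-16 encoded text (which is what "UNICODE file" usually means on Windows), a short script can list the drive letters each file references. This is a sketch only; the encoding, the path pattern, and the function name are assumptions, and it does not address how to keep drive letters from being lost.

```python
import re
from pathlib import Path

# Matches a Windows path that starts with a drive letter, e.g. E:\VMs\disk.vhdx
DRIVE_PATH_RE = re.compile(r"\b([A-Za-z]):\\[^<>\"\r\n]*")


def drive_letters_in_config(path, encoding="utf-16"):
    """Return the sorted, upper-cased drive letters referenced in one config file."""
    text = Path(path).read_text(encoding=encoding)
    return sorted({m.group(1).upper() for m in DRIVE_PATH_RE.finditer(text)})
```

Running this over every VM configuration file once, while everything is healthy, and saving the result would give a reference table to restore from after the next outage.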

↧

Can we configure Windows Failover Clustering without Central Storage(iSCSI)

October 16, 2019, 3:50 am

Hi,

I am completely new to the subject of Windows Server Failover Clustering and have a few doubts; please clarify them for me.

I set up a virtual lab with a DC+iSCSI server and 2 file server nodes, configured Failover Clustering on both nodes, and verified that it works.

When I shut down the 1st node, the 2nd node takes over dynamically. My question is: can't we create a failover cluster for the existing file server shared folder, instead of having to configure a SAN (iSCSI) for storage? As it stands, the HDD space on both nodes goes unused, since all the data is saved on the SAN storage.

The same goes for SQL failover: when we have a SQL DB server running on a server and I install Failover Clustering on that same server, I again have to select a SAN for data storage. Here failover clustering only helps keep the server roles up and running. What happens when the SAN storage device goes down?

I understand that a SAN (iSCSI) is required so that both nodes have access to the storage device; only then, when one goes down, can the 2nd one act on its behalf and take requests to the SAN server.

Any help please.

Mohammed...

↧

S2D quorum question

October 16, 2019, 3:58 am

Hi All

I'd truly appreciate your help in clarifying how quorum operates in an S2D environment. I've read a lot about cluster and pool quorum, and I know how these two quorums work, but I don't clearly see which one applies to Storage Spaces Direct.

The quorum scheme should apply, so every disk should have a vote, and the owner should have another.

But I see a failover cluster supporting this S2D cluster, and it leverages a File Witness (for example).

Which one takes effect?

Best Regards
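
For reference, both cluster quorum (node and witness votes) and pool quorum (drive votes, with the pool resource owner able to act as a tie-breaker) come down to the same majority rule, applied to different sets of voters. Below is a minimal sketch of that rule, ignoring dynamic quorum adjustments; the function name and the example vote counts are illustrative only.

```python
def has_majority(votes_online: int, votes_total: int) -> bool:
    """Return True if strictly more than half of all votes are online."""
    if not 0 <= votes_online <= votes_total:
        raise ValueError("votes_online must be between 0 and votes_total")
    return 2 * votes_online > votes_total


# Cluster level: 4 nodes + 1 witness = 5 votes.
print(has_majority(3, 5))  # True
print(has_majority(2, 5))  # False
# An even split is not a majority.
print(has_majority(2, 4))  # False
```

Seen this way, the two quorums do not compete: the same check is evaluated once over the cluster's voters and once over the pool's voters.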

↧