T-SQL Tuesday is back around again and this time it’s the first one of 2018!
The topic Arun has chosen is:
"Write about and share with the world a time when you faced a technical challenge that you overcame. You can go really technical with both the issue and the solution if you like."
Over the years I have faced lots of technical challenges, and so have many other DBAs with me. But the one that still sticks with me the most is the story of the failing cluster setup.
It was back in 2008/2009, and I was a consultant at a hospital with the job of setting up a fail-over cluster with SQL Server 2005. If you’re wondering why SQL Server 2005: SQL Server 2008 had only just been released, and the medical industry is not keen on being an early adopter when it comes to technology.
My task was to set up a fail-over cluster on a couple of HP servers. They had already been prepared by the system administrators and installed with Windows Server. Seems simple enough; I had done it many times before and it shouldn’t have given me any trouble. Oh boy, was I wrong about that!
Having set up all the different parts of Windows Server, I started the installation of SQL Server 2005 on the first node. Everything went smoothly and the installation was successful.
Then came the installation of the second node. I started the installation to add the node to the cluster; it ran to about 90% and stopped. It gave me a generic error that something in the installation had gone wrong and that I should check the installation logs to find out what.
I checked the log files and didn’t find anything that could explain why the installation failed. Let’s go through all the settings and try again. Had I entered the wrong credentials or gotten some other setting wrong?
After I started the installer, it again failed at the 90% mark and rolled back the entire installation, and I still had no idea what had happened.
Let’s check the installation logs again; hopefully this time they would show me what went wrong. Unfortunately, they showed me nothing.
I wasn’t going to give up. I tried different variations of the installation settings to see if I had made some sort of mistake.
I checked the hardware for defects, replaced memory, and swapped the CPUs between the machines.
Nothing pointed to why the setup failed.
At that point it was a good idea to get a higher power involved, so we created a ticket with Microsoft support. It got escalated a few levels, and even they weren’t able to figure out why the installation failed. Their answer was: “Just use SQL Server 2008, that should work.”
Besides the fact that this was not the answer I was looking for, it was not possible for me to install SQL Server 2008 because it was not yet proven technology.
So as the consultant I was in a bit of a pickle. How was I going to get the job done? My reputation was on the line. I was supposed to be the one with the knowledge, and there I was, struggling for over a week to get a cluster set up.
At this point I had to start thinking outside the box, so I looked at the hardware again. The hardware was all approved and supported by Microsoft; we didn’t need any third-party drivers or anything like that.
Looking at the hardware, aside from the motherboard, the power supply, and some other parts, we were only able to control the CPU, the memory, and the hard disks. The servers were identical, with the following configuration:
CPU: 4 x 6 cores
Memory: 64 GB
System disk: 40 GB
Application disk: 30 GB
Data disk: 300 GB
Log disk: 100 GB
I first examined the disks. There was nothing special about them: the system and application disks were the only local disks, and the data and log disks were LUNs.
The memory was nothing special either and was set up as 4 x 16 GB modules.
The CPUs were something I had not seen before, and what caught my attention was the six-core specification. In total we had 24 cores, and for some reason that number seemed odd to me.
Let me explain why 24 was an odd number to me. When you’re used to working with computers, you’re used to working with powers of 2: 1, 2, 4, 8, 16, 32, 64, and so on, but not 24. That’s right between 16 and 32.
It’s possible to make Windows believe there is a different number of processors in your server. You can change this in the MSCONFIG utility by setting the number of processors on the BOOT.INI tab.
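Under the hood, that setting just adds the /NUMPROC switch to the OS entry in BOOT.INI. A sketch of what the edited file looks like, assuming Windows Server 2003; the ARC path and OS description are examples and will differ per machine:

```ini
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
; /numproc limits how many processors Windows will use at boot
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect /numproc=2
```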
At this point I was open to any new idea, so I changed the number of processors to 2. Two is a power of 2, so let’s see what happens.
I restarted the server, looked in Task Manager, and saw two processors. That all worked; now let’s run the setup of the SQL Server node again.
I would not have believed it if it hadn’t happened to me, but for some reason the setup was successful. It could still have been a fluke, and I wanted to be sure. I removed the node from the cluster, reset the number of processors to the original value, and ran the setup again. This time it failed, just as before.
That still wasn’t good enough for me, but I was getting closer to a conclusion. Let’s see what happens if we set the number of processors to 8. I changed the number, rebooted the server, and started the setup again. This time it succeeded as well.
My conclusion was that, for some reason, the setup couldn’t handle a number of processors that was not a power of 2. In my mind this was insane, because I had never encountered such a problem before.
Now that everything was installed, I could set the number of processors back to 24 and get on with installing the updates, such as Service Pack 3. I changed the number of processors and rebooted the server, and as soon as the server came up, my cluster installation failed. It couldn’t be the number of processors, right? Again there was no clear error, and because I now knew what was going wrong, I changed the number of processors back to 8 and rebooted. This time the server and the cluster came up fine.
I contacted Microsoft support and told them about my findings. Their reply was that indeed something odd had been fixed in SQL Server, and that the fix had been released in Service Pack 3.
Back with SQL Server 2005 you weren’t able to slipstream your service packs. You first had to install the initial release and then apply the service pack.
The next step was to install SP3 with the 8 cores. I started the setup and, of course, things didn’t go as expected. The installation failed and gave no clear reason why.
Again I was stuck. I had an incomplete cluster installation because we couldn’t use the full amount of resources, and I couldn’t install the very update that would fix it.
I started to dig into the updates that would be applied in SP3. I noticed some changes to the system databases, especially the mssqlresource database.
The mssqlresource database (its actual name is mssqlsystemresource, the Resource database) is a hidden, read-only system database that is not visible to the user and is only used internally by SQL Server.
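It doesn’t show up in sys.databases, but since SQL Server 2005 you can at least see which version of it is installed through SERVERPROPERTY:

```sql
-- The Resource database is hidden from sys.databases, but its version
-- and last update time are exposed through SERVERPROPERTY.
SELECT SERVERPROPERTY('ResourceVersion')            AS ResourceVersion,
       SERVERPROPERTY('ResourceLastUpdateDateTime') AS ResourceLastUpdated;
```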
I ran the setup again, and this time I watched what happened to the system databases during the install. I saw the master database being replaced by a new version, and the msdb database as well, but as soon as the installation tried to finish the update of the mssqlresource database, the new version disappeared, the original was put back, and the installation failed.
That was it! That was the reason the service pack was failing. With a little research and some gut feeling I had seen what happened. But why did it happen? Was the installer unable to rename the database?
To be sure, I did something drastic, and don’t ever do this in production! I went into the SQL Server settings and updated the physical file name of the mssqlresource database. Then I shut down the instance and renamed the database’s files on disk.
I started the instance successfully, checked the database, and it showed the renamed files.
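Repointing the Resource database files in SQL Server 2005 looked roughly like the statements below. This is a sketch, not the exact commands from back then: the instance first has to be started in a minimal configuration (startup options -f with trace flag 3608), the paths are made-up examples, and, as the story says, never do this in production.

```sql
-- Sketch only: repointing the Resource database in SQL Server 2005.
-- Requires a minimal-configuration start (-f -T3608); example paths.
ALTER DATABASE mssqlsystemresource
    MODIFY FILE (NAME = data, FILENAME = N'C:\SQLData\mssqlsystemresource_renamed.mdf');
ALTER DATABASE mssqlsystemresource
    MODIFY FILE (NAME = log,  FILENAME = N'C:\SQLData\mssqlsystemresource_renamed.ldf');
```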
I started the installation and watched the system database files being changed: first the master database, then msdb, and finally the mssqlresource database. This time I saw a new mssqlresource database being created next to the original one, the old one being removed, and the installation finished!
That was the problem with SP3: it was not able to do an in-place upgrade of the mssqlresource database.
We were almost done. The only thing left to change back was the number of processors, followed by a reboot. I changed the setting, rebooted the server, and everything came up successfully.
This was one hell of a journey, and it gave me the grey hairs I have now. It took about two weeks of my life for what could’ve been an afternoon.
It was an experience that changed my way of thinking as a DBA, and I learned a lot along the way.
The moral of the story is that there is always a solution. Go and dig into the material. Think outside the box and do not let any challenge get the best of you.