Category Archives: Windows Server

ThinIO facts and figures, Part 3: RDS and RAM caching.

Welcome back to the third instalment of this blog series focusing on our new technology, ThinIO!

To recap, below you will find the previous articles:

Off topic note:

Two years ago at an E2EVC event, the concept behind ThinIO was born from just a mad-scientist idea amongst peers.

If you are lucky enough to be attending E2EVC this weekend, David and I will be there presenting ThinIO and maybe, just maybe, there will be an announcement. Our session is on Saturday at 15:30, so pop by; you won’t be disappointed.

Back on topic:

So here’s a really interesting blog post: Remote Desktop Services (XenApp / XenDesktop hosted shared), or whatever you like to call it. RDS presents a really fun caching platform for us, as it allows us to deal with a much higher IO volume and achieve deeper savings.

We’ve really tested the heck out of this platform for how we perform on Microsoft RDS, Horizon View RDS integration and Citrix XenSplitPersonality with Machine Creation Services.

The figures we are sharing today are based on the following configuration and load test:

  • Citrix XenDesktop 7.6
  • Windows Server 2012 R2
  • Citrix User Profile Manager
  • 16 GB of RAM
  • 4 vCPUs
  • Login VSI 4.1 medium workload, 1-hour test
  • 10 users
  • VMFS 5 volume

Fun figures!

Diving straight in, let’s start by looking at the volume of savings across the three cache types.

[Chart: volume of savings across the three cache types]


The curious case of the major XenApp outage.


Here’s a really strange and interesting issue I faced late last week that resulted in a few head-scratching moments and late nights.

An issue began two weeks ago in a XenApp 4.5 farm used for hosted desktops: intermittently, NonPagedPool bytes would shoot through the roof, the event logs would become inundated with Event ID 333 errors, and the servers would lock up completely.

The running sessions could no longer open new applications, performance was extremely poor, and RDP’ing to the XenApp server would result in an RPC error. Disconnecting the sessions remotely would also fail with an RPC error, or TSAdmin would be completely incapable of connecting to the server. We had no choice but to dump the servers using an NMI and pray for a crash dump.

No changes had been made to the environment in a number of weeks; the last change was a “Citrix Ready” printer driver from Lexmark. As the days progressed the issue got worse and worse, with more servers going offline day by day. Although we did initially catch a number of crash dumps, we hit a bad run of luck, with most of them being corrupt on restart.

By day six, nine servers had gone offline throughout the day, and I was pulled in to help resolve this massive issue.

 

 

I fired up the Windows debugging tools and managed to get a look at a crash dump fresh from a locked-up server.

Using !vm, I pulled the virtual memory statistics at the point of the crash:


[Screenshot: !vm output]


So we had a serious non-paged pool leak, as we suspected, but what exactly was chewing up all that non-paged pool?

Running !poolused 2, I was able to drill down into the drivers using non-paged pool and see which pool tag was consuming the largest amount, as below:


[Screenshot: !poolused 2 output]


Reviewing the list, I was immediately alarmed by the amount of RAM in use by the “Ica” pool tag. Having reviewed hundreds of memory dumps, I had never seen the Ica pool tag listed in the top 20, never mind using 99,721,328 bytes (~95 MB).

It’s fairly obvious who owns the Ica pool tag, but just to be on the safe side, and to drill down to the owning driver, I ran the following command against the drivers folder to find references to the “Ica” pool tag.

findstr /m /l Ica *.sys

[Screenshot: findstr output]


So we got quite a few hits off the Ica pool tag. Quite a number of the above drivers are Microsoft’s, which is not surprising in the grand scheme of things, as we all know the origins of the RDP protocol.
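As an aside, Select-String can do the same job from PowerShell; a rough equivalent of the findstr check above (just a sketch, run from the drivers folder) would be:

[sourcecode language="powershell"]
# Sketch: list driver binaries in the current folder that contain the "Ica" pool tag string,
# roughly equivalent to: findstr /m /l Ica *.sys
Get-ChildItem -Filter *.sys |
    Select-String -Pattern "Ica" -SimpleMatch -CaseSensitive -List |
    Select-Object -ExpandProperty Filename
[/sourcecode]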

So with a little more information to hand, I set about googling this chain of events to see if it was either documented or hotfixed. A search yielded quite a lot of articles, including a private hotfix and a hotfix rollup pack.

Drilling down into the technotes to see if I could find a potential cause for this issue, I was left a little wanting with the lack of information available:

Servers with Hotfix Rollup Pack 6 installed can run out of non-paged pool memory in the ICA Pool tag. The issue occurs when users log on to the server and its frequency increases with the amount of time since the most recent restart and the number of logons since the restart.

What irked me here was the lack of information, and the fact that these servers had been running HFRP 6 for roughly 18 months with no similar issues.

Why all of a sudden are we losing servers all over the place to such an issue?



I dug further into the hotfix notes with help from my good friend and all-round cool Citrite James Denne; the hotfix specifically noted:

When a server is in low memory condition the <Redacted>() spins in an infinite loop by constantly allocating an OUTBUF and trying to write it on the stack. This problem happen when system is going in and out in low memory condition.

So there’s a great explanation of the issue from the horse’s mouth, but again there was a niggling problem in the back of my head…

These servers weren’t spinning in and out of low memory; surely our pool usage reporting would have caught this?



I was satisfied to see a hotfix was available, but in the back of my head I was concerned about the change that might have caused this issue; it was still unclear what was causing this low memory condition to spin the infinite loop, and why we couldn’t see the low memory scenario before it happened. This being a massive issue, we had to make a quick turnaround. We had a choice of going to HFRP 7 or using the private hotfix available. I chose the private hotfix, for two reasons:

  • Mass-deploying a rollup pack to fix one problem is like tapping in a nail with a sledgehammer.
  • My experience with hotfix rollup packs is that they fix your issues, but introduce at least one new one.

We took all the servers offline for emergency maintenance that night and cautiously waited for the morning to come and see if our issue was resolved.

And so we patiently waited…



Once hotfixed and rebooted, we arrived at the office early to watch as the user sessions began to filter into the farm. All was quiet for the first hour or so, but then the phones started.

Once the user load hit 15-16 users per XenApp server, a number of servers again began to log Event ID 333 errors, as below:


[Screenshot: Event ID 333 errors filling the event log]

Dammit.



Frantically, we connected to the console of a server to check the paged pool state, but again there were no alerts on pool size. As below, the Ica pool tag was nowhere to be seen:


[Screenshot: PoolMon showing healthy pool usage]


And the Ica tag was at a much more respectable / expected value, as below:


[Screenshot: the Ica pool tag at an expected value]


Our next clue came when users logging in began receiving the following error:


[Screenshot: profile error shown to users at logon]

So we’d fixed our Ica memory leak; now what the hell was happening?



If memory usage for the pools was OK, but we were still getting errors about flushing to the registry, and new users couldn’t load their profiles, my hunch was that there had to be something wrong with the registry hives…

I opened a command prompt in the “Documents and Settings” folder and ran the following command:

dir /s /a ntuser.dat

With a quick glance, I found the following:


[Screenshot: dir output showing ntuser.dat sizes]


The “Citrix Print Manager Service” user account had a registry hive of over 50 MB? What in the name of Superman was hiding in that registry hive?
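As an aside, a quick PowerShell sketch to surface the largest hives without eyeballing dir output (it assumes the Server 2003 profile root of “C:\Documents and Settings”; adjust for newer operating systems):

[sourcecode language="powershell"]
# Sketch: list the ten largest ntuser.dat hives under the profile root.
Get-ChildItem "C:\Documents and Settings" -Recurse -Force -Filter ntuser.dat -ErrorAction SilentlyContinue |
    Sort-Object Length -Descending |
    Select-Object -First 10 FullName, @{Name="SizeMB";Expression={[math]::Round($_.Length / 1MB, 2)}}
[/sourcecode]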

To rectify the issue immediately, we stopped the above print manager service and forced the hive to be unloaded with delprof. Once we had done this, user profiles began to load again on each affected server. But we were now unable to use client-redirected printing.

To regedit!



I mounted the registry hive of a profile that had failed to delete and drilled down to see what all the fuss was about. As this was now firmly in printing land, I went looking for keys to match the Lexmark driver change from a number of weeks ago.

What I found was extremely interesting: for each client-redirected printer ever mapped with the Lexmark driver, there was an entry under both the PCLPlugin and PSPlugin keys:


[Screenshot: per-printer entries under the PCLPlugin key]


Although this was a really questionable practice from Lexmark, I examined the entries under the PCLPlugin key, and they consisted of just a few small binary values of no more than a KB or two.

Upon looking at the same keys under PSPlugin, I found a new key, called GDL. This GDL key was absolutely massive and there was one for each and every time a client printer had been redirected using the Lexmark V2 driver.


[Screenshot: the GDL key under PSPlugin]


I exported both the user’s hive and the PSPlugin key to text; the comparison is below:


[Screenshot: size comparison of the user hive and the PSPlugin key]


The GDL key itself was over 3 MB per redirected printer!?!


[Screenshot: the GDL key contents]

So there we have it.



The root cause was as follows:

This Lexmark driver has a weird tendency to store an unbelievable amount of crap in the registry for users.

The Citrix Print Manager service also suffers this fate when it maps a redirected printer.

As more and more users were testing in production (GRRRR!) / beginning to use a new printing solution on a customer site, this registry hive began to grow and grow, ultimately flooding the maximum registry size of 30% of the paged pool RAM.

As the registry hive size was growing out of control, the Ica driver ran into a low memory situation and ultimately caused the infinite loop.

The Ica loop and nonpaged saturation was masking the printer driver bloat in the registry.

As the days went on, more and more servers began to saturate the Maximum registry size and go offline.

Corrective actions:

  • Enforce a policy to not allow native drivers, in any way, shape or form, when redirecting printers where possible.
  • Obtain the latest driver from Lexmark if you have Lexmark printers.
  • Give Lexmark an earful for not testing their drivers.

Lessons Learned:

  • Don’t test things in production.
  • Don’t blindly trust a vendor’s “Citrix Ready” stamp; it’s up to them to test these things, and they regularly don’t.
  • Create a monitor for registry size (perfmon > System > % Registry Quota In Use); see the sketch after this list.
  • Install the debugging tools on the XenApp 4.5 servers, as this type of issue is going to become more prevalent. *
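A minimal sketch of such a monitor, using the same counter via Get-Counter (the 80% threshold is an arbitrary example, not a recommendation):

[sourcecode language="powershell"]
# Sketch: warn when registry quota usage crosses an arbitrary 80% threshold.
$quota = (Get-Counter "\System\% Registry Quota In Use").CounterSamples[0].CookedValue
if ($quota -gt 80){
    Write-Warning ("Registry quota in use is {0:N1}% - check for bloated hives." -f $quota)
}
else{
    Write-Host ("Registry quota in use: {0:N1}%" -f $quota)
}
[/sourcecode]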

* This isn’t going to get any better.

As vendors move further and further towards 64-bit architectures, they can and will forget about the extremely restrictive memory sizes available in 32-bit versions of Windows. 64-bit Windows has so much memory available for the pools that they can be as sloppy as they want without much concern.

The death knells for Server 2003, Windows XP and XenApp 4.5 have been sounding for some time.

You are going to see pool floods and other such nasties more and more in the coming months, before you finally decommission your old Server 2003 environment. My advice to you is to:

  • Get very comfortable with the following tools:
    • PoolMon.
    • Process explorer.
    • Windows debugging tools.
  • Have a good read of the following article: 333, read it, read it again.
  • Never be afraid to have a look at a dump file yourself.
  • Throw an issue at every vendor possible during troubleshooting, it’s in their interest to prove it’s not their software at fault.
  • Understand your paged and non-paged pool sizes and limitations (a quick way to check them is sketched after this list).
  • Never trust a printer driver.
  • Never, ever, ever trust a Vendor to behave accordingly or follow the Citrix Ready standards.
  • If you absolutely, positively need to run something in server 2003 or XP, consider using XenDesktop hosted apps to isolate the problem to a singular kernel away from the bulk of your task workers.
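For the pool sizes point above, a rough way to keep an eye on them from PowerShell (these are the standard Memory object counters perfmon exposes):

[sourcecode language="powershell"]
# Sketch: report current paged and non-paged pool usage in MB.
# On 32-bit Server 2003 these pools are small and fixed, so watch them closely.
$samples = (Get-Counter "\Memory\Pool Paged Bytes", "\Memory\Pool Nonpaged Bytes").CounterSamples
foreach ($sample in $samples){
    "{0} : {1:N0} MB" -f $sample.Path, ($sample.CookedValue / 1MB)
}
[/sourcecode]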

Viewing open files on a file server from PowerShell.

So this is a situation you should all be aware of in an SBC / VDI environment: despite all warnings, you’ve redirected folders to your network drive and your file servers are screaming in agony?

Having been in this situation recently, I needed to audit and report on the types of files open on the file server; my hunch was that a select number of users were running applications (like, *gulp*, Lotus Notes) from the network share.

Disappointed with the PowerShell scripts on the interwebs, I decided to write my own function to perform this task:

[sourcecode language="powershell"]
function get-openfiles{
    param(
        $computername = @($env:computername),
        $verbose = $false
    )
    $collection = @()
    foreach ($computer in $computername){
        # Bind to the server service and enumerate its open resources
        $netfile = [ADSI]"WinNT://$computer/LanmanServer"

        $netfile.Invoke("Resources") | foreach {
            try{
                $collection += New-Object PsObject -Property @{
                    Id        = $_.GetType().InvokeMember("Name", "GetProperty", $null, $_, $null)
                    itemPath  = $_.GetType().InvokeMember("Path", "GetProperty", $null, $_, $null)
                    UserName  = $_.GetType().InvokeMember("User", "GetProperty", $null, $_, $null)
                    LockCount = $_.GetType().InvokeMember("LockCount", "GetProperty", $null, $_, $null)
                    Server    = $computer
                }
            }
            catch{
                if ($verbose){write-warning $error[0]}
            }
        }
    }
    Return $collection
}
[/sourcecode]

The function above (get-openfiles) accepts an array of servers on the command line and returns the following items:

  • The ID of the open file.
  • The path to the open file.
  • The server it’s open from.
  • The username that has the file open.
  • The number of locks the file has.

A couple of quick examples for using this command are below:


Retrieving open files from server1:



[sourcecode language="powershell"]get-openfiles -computername server1 | select server,itempath,lockcount[/sourcecode]



Retrieve a count of open files that end with the nsf file type (Lotus Notes):



[sourcecode language="powershell"](get-openfiles -computername server1,server2 | ? {$_.itempath -like "*.nsf*"}).count[/sourcecode]



Retrieve a report of total open files on a number of file servers:



 

[sourcecode language="powershell"]get-openfiles -computername server1,server2,server3,server4,server5 | group -property server[/sourcecode]

 

Monitoring storage disk queues and IO with PowerShell

Here’s one that used to bother me a lot. The problem usually went as follows:

“Your XenApp servers have very high disk queue’s and IO”

“What’s causing it?”

“dunno…”

With Server 2008, Task Manager’s Resource Monitor feature will help you find these items, but in Server 2003 this was a perilous task. The specific details for disk IO per process are stored in Performance Monitor under each running process, and trying to analyse each process by hand was a massive pain. PowerShell, however, can do some very clever work to help alleviate this!

I wrote two quick functions which act similar to “top” in Linux, giving an on-screen view, updated at an interval, of exactly what is creating IO activity. These two functions are:

get-IODataBytes:

[Screenshot: get-iodatabytes sample output]

Get-IODataOperations

[Screenshot: get-iodataoperations sample output]

The code for these functions is below:

[sourcecode language="powershell"]
function get-iodatabytes{
    # Per-process IO throughput, sorted highest first (similar to "top")
    $result = (get-counter -counter "\Process(*)\IO Data Bytes/sec" -ea 0).countersamples |
        ? {$_.cookedvalue -gt 0} |
        select instancename,
            @{Name="SessionID";Expression={if ($_.path.contains("#")){($_.path.split("#)"))[1]}else{"0"}}},
            @{Name="IO Data Bytes/sec";Expression={[math]::Round($_.cookedvalue,0)}},
            @{Name="IO Data KBytes/sec";Expression={[math]::Round($_.cookedvalue / 1024,0)}} |
        sort -Descending "IO Data Bytes/sec" | ft
    $currentqueue = (((get-counter -counter "\PhysicalDisk(0 C:)\Current Disk Queue Length" -ea 0).countersamples) | select cookedvalue).cookedvalue
    clear
    write-warning "Hit [CTRL] + [C] to exit live capture"
    write-host "Current Disk queue: $currentqueue"
    return $result
}

function Get-IODataOperations {
    # Per-process IO operations per second, sorted highest first
    $result = (get-counter -counter "\Process(*)\IO Data Operations/sec" -ea 0).countersamples |
        ? {$_.cookedvalue -gt 0} |
        select instancename,
            @{Name="SessionID";Expression={if ($_.path.contains("#")){($_.path.split("#)"))[1]}else{"0"}}},
            @{Name="IO Data Operations/sec";Expression={[math]::Round($_.cookedvalue,0)}} |
        sort -Descending "IO Data Operations/sec" | ft
    $currentqueue = (((get-counter -counter "\PhysicalDisk(0 C:)\Current Disk Queue Length" -ea 0).countersamples) | select cookedvalue).cookedvalue
    clear
    write-warning "Hit [CTRL] + [C] to exit live capture"
    write-host "Current Disk queue: $currentqueue"
    return $result
}
[/sourcecode]

If you wish to loop one of these functions, simply use the following code:

[sourcecode language="powershell"]
while ($true){
get-iodataoperations
start-sleep 1
}
[/sourcecode]

On IOPS, shared storage and a fresh idea (Part 3): tying it all together in the stack

Note: This is part three; have a read of parts one and two.

Hello there, and thank you for dropping back for part 3…

I suppose I should start with the disappointing news that I have yet to test this option for VDI-in-a-Box. And despite Aaron Parker’s suggestions, it wasn’t due to lack of inspiration, it was down to lack of time! This series has gathered a lot of interest from community and storage vendors alike, and I feel I should set the record straight before I go any further:

  1. This isn’t a production idea; you would be crazy to use this in a live environment.
  2. Throughout this entire project, we’re focusing on pooled stateless. Stateful desktops would be a separate post entirely.
  3. This wasn’t an attack on products in this market space, merely a fresh view on an old problem.
  4. If I had the skills or funds necessary to take this project to a production solution, I wouldn’t have posted it. I would already be hard at work creating a reasonably priced product!

Now that my declarations are out of the way, I’d first like to talk about the moral of the story. This isn’t an unfamiliar expression:

IOPS mitigation is not about read IOPS, it’s about WRITE IOPS!

VMware, Citrix and Microsoft have similar, but very different, solutions for read IOPS negation. Similar in the sense that they all try to negate storage read IOPS. But the key difference with XenServer is that the local disk cache via IntelliCache has the out-of-box functionality to cache the majority of reads to local disk (think SSD*) without the baked-in soft limit of 512 MB found in Microsoft Hyper-V and VMware respectively.

Long story short, VMware’s and Microsoft’s solutions offer about 512 MB of read IOPS negation un-tuned, but enabled. Of course this value can be tuned upwards, but the low default cache size would suggest, at least to me, that tuning up will have an upsetting effect on the host.

This, to me, is why IntelliCache has the upper hand in the (value-add product) VDI space for read IOPS negation, and they even throw in the hypervisor as part of your XenDesktop licensing, so win-win. But what about those pesky write IOPS?

Continue reading