Troubleshooting Tools and Methodology in a Citrix XenApp 5.0 Environment
Kapil Ramlal (KappA), Escalation Engineer
Dec 24, 2015
XenApp troubleshooting
Agenda
The right tool, right place at the right time
Troubleshooting scenarios
Top utilities
Case studies
Additional resources/Q&A
XenApp troubleshooting
Understanding the infrastructure
The anatomy of a XenApp farm
• Information: Static and Dynamic
• Components: Where to focus troubleshooting
Understanding what happens from logon to launch
• Types of issues: Denial of service, bottlenecks
• Troubleshooting: MedEvac, performance monitoring, CDF…
Types of Information
Dynamic
• Dynamic Store
• Constantly changing information
• Load management
• Information required for application launch
Static
• Data Store
• Does not change frequently
• Farm configuration
• Changes made in the Management Console
[Diagram labels: LHC, DATA STORE]
Logon to launch
[Diagram: Client, Web Interface, XML Broker, Zone Data Collector, Data Store, Active Directory, Least Loaded Server]
MedEvac (CTX107935)
• The XML Broker tests
• Verifies that the XML Service is able to respond to an XML / client request
• XML is able to contact the Zone Data Collector
• Zone Data Collector tests
• Verifies that the ZDC can provide the address of the least loaded server for the requested app
• The IMA Service is able to respond
• The IMA Service can read the Local Host Cache
• The IMA Service can read its Dynamic Store
• Least Loaded Server tests
• Verifies that Terminal Service is able to respond
• Verifies that the RPC Service is able to respond
How to Monitor Farm Health using MedEvac?
• See knowledge center article CTX119899
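MedEvac runs service-level tests. As a much cruder first pass, a plain TCP connect can at least show whether a listener is reachable. The sketch below is not MedEvac and does not replace it. The function names are made up for illustration, and which host and port to probe for each role (XML Service, IMA, ICA listener, Terminal Services) is an assumption that has to come from your own farm configuration.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_targets(targets, timeout=2.0):
    """Map each (label, host, port) tuple to True (reachable) or False."""
    return {label: port_open(host, port, timeout) for label, host, port in targets}
```

A call such as `check_targets([("XML Broker", "xmlbroker01", 80)])` (hypothetical host name and port) returns a dictionary keyed by label. A successful connect only proves that something is listening, not that the service answers requests correctly.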
Monitoring
[Diagram: Client, Web Interface, XML Broker, Zone Data Collector and Active Directory, annotated with what to monitor: IMA Work Item Queues, IMA %CPU time, Zone Elections Won, ASP Requests, XML Threads, RSOP, CDF]
Citrix Counter | Description | Threshold | Server to monitor
Application Resolution Time (ms) | Time to resolve LLS | Determine baseline | All XML Brokers
Data Store Connection Failure | Number of minutes the server has been disconnected from the Data Store | Determine threshold considering scheduled reboots and maintenance | All XenApp servers
Number of Busy XML Threads | Number of XML requests currently being processed (Max=16) | 16 sustained for 1 min or longer | All XML Brokers
WorkItem Queue Ready Count | Number of work items that are ready and waiting to be processed by IMA | Sustained above 0 for 1 min or longer | All XML Brokers, Most Preferred and Preferred Data Collectors
Resolution WorkItem Queue Ready Count | Number of work items (related to application launches) waiting to be processed by IMA | Sustained above 0 for 1 min or longer | All XML Brokers, Most Preferred and Preferred Data Collectors
Zone Elections Won | Number of times this server won an election | If this counter increments by 2 in a 1 hour period | Most Preferred and Preferred Data Collectors
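The thresholds above are of two kinds: a value that stays past a limit for a minute or longer, and a cumulative counter that climbs too fast within an hour. A minimal Python sketch of both checks follows. It assumes the counter samples have already been collected (for example, exported from Performance Monitor) as (timestamp in seconds, value) pairs; the function names are illustrative only.

```python
def sustained(samples, predicate, min_seconds):
    """Return True if predicate(value) holds continuously for min_seconds or more.

    samples: list of (timestamp_seconds, value) pairs in ascending time order.
    """
    run_start = None
    for ts, value in samples:
        if predicate(value):
            if run_start is None:
                run_start = ts          # a new run of matching samples begins
            if ts - run_start >= min_seconds:
                return True
        else:
            run_start = None            # the run is broken
    return False


def increments_within(samples, min_increase, window_seconds):
    """Return True if a cumulative counter rises by min_increase inside any window."""
    for i, (t0, v0) in enumerate(samples):
        for t1, v1 in samples[i + 1:]:
            if t1 - t0 > window_seconds:
                break
            if v1 - v0 >= min_increase:
                return True
    return False


# WorkItem Queue Ready Count: sustained above 0 for 1 min or longer
queue = [(0, 0), (15, 2), (30, 1), (45, 3), (60, 1), (75, 4)]
print(sustained(queue, lambda v: v > 0, 60))

# Zone Elections Won: increments by 2 in a 1 hour period
elections = [(0, 5), (1800, 6), (3000, 7)]
print(increments_within(elections, 2, 3600))
```

The same `sustained` helper covers the busy XML threads row with `lambda v: v >= 16`.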
XenApp 5.0 Health Monitoring and Recovery
• Enterprise & Platinum Editions of XenApp
• Performs tests to monitor state and identify health risks
• Terminal Services tests
• XML Service test
• Citrix IMA Service test
• Logon Monitor test
• Check DNS test
• Local Host Cache test
• XML threads test
• Citrix Print Manager Service test
• Microsoft Print Spooler test
• ICA Listener test
• See page 307 of the XenApp 5.0 Administrator’s Guide (CTX115519) for information
Large Farm Tips
• Limit additional roles on Zone Data Collectors
• Limit the number of zones in the environment
• Do not run management consoles on or pointed to the ZDCs
• Read the Key Infrastructure Tuning article: CTX116492
Free the ZDC!
The evolution continues!
• Citrix XenApp 5.0 opens the door for delivering resources on Windows Server 2008
• Customers are also rolling out Windows Vista to more users
• Say hello to the next generation troubleshooting artillery for the XenApp 5 environment
• Existing tools have been updated, and new tools introduced
• The evolution continues!
The right tool, right place at the right time
• DON'T
• Use troubleshooting tools just because you can
• Recommend tools that are not relevant to the problem
• Use troubleshooting tools without understanding their impact on the environment
• DO
• Use tools to help automate time consuming tasks
• Use tools at the right time, such as when the problem is occurring and not afterwards
• Understand what the tool is trying to accomplish, so that the right data is obtained
• Use tools with a clear purpose
• Maintain a local toolkit, so that the right tools are always available in times of crisis
CDF Tracing & CDFControl 2.5
Common Diagnostic Facility (CDF)
• Provides the ability to collect traces for problem diagnosis on Citrix binaries without disrupting the services or users
• Citrix’s standard debug tracing facility
• Efficient and non-intrusive data collection process
• Enabled without stopping and starting services
• Faster & easier tracing for retail modules
• Flexible & customizable troubleshooting facility
• Consistency across most Citrix products
CDF Basics
• To better understand what a CDF trace message is, let’s look at the following pseudo code example
• In the example, the function belongs to a service, which can be considered to be a Trace Provider (more on this later)
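The idea can be sketched in a few lines of Python. This is only an illustration of what a traced function looks like: `cdf_trace`, `initialize_feature` and the "ExampleService" provider name are hypothetical, and real CDF tracing is compiled into the Citrix binaries rather than written like this.

```python
TRACE_LOG = []  # stands in for the trace session's buffer


def cdf_trace(module, message):
    """Record a trace message tagged with the module that emitted it."""
    TRACE_LOG.append(f"{module}: {message}")


def initialize_feature(load_library):
    """Service start-up step that depends on a DLL.

    load_library is a callable that returns a handle or raises OSError.
    """
    cdf_trace("ExampleService", "Entering initialize_feature")
    try:
        handle = load_library("CitrixFeatureDLL.dll")
    except OSError as err:
        cdf_trace("ExampleService", f"Failed to load CitrixFeatureDLL.dll: {err}")
        return None
    cdf_trace("ExampleService", "CitrixFeatureDLL.dll loaded successfully")
    return handle


def missing_dll(name):
    """Simulated loader for a machine where the DLL is absent."""
    raise OSError("module not found")


initialize_feature(missing_dll)
for line in TRACE_LOG:
    print(line)
```

With tracing enabled for the provider, the failed load shows up as a message in the trace instead of a silent failure.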
The moral of the story
• We could capture a CDF trace to determine if the CitrixFeatureDLL.dll loaded successfully
• How difficult would it be to debug without this tracing?
• You need special symbol files to be able to read the trace messages (TMF files)
• This allows certain information to remain private as needed (similar to .pdb files)
• You get more by default!
CDF Internals
• To better understand CDF, let’s take a quick look at how the operating system supports Event Tracing for Windows (ETW)
[Diagram labels: CONTROLLER (Enable/Disable), Buffers, Trace File, Events, CONSUMER, CDFCONTROL; providers shown: CDM.sys, RadeSvc.exe, WFShell.exe]
ETW Components
Providers:
• Modules containing tracing, that can be enabled or disabled
• Example: MF_Driver_Cdm (Cdm.sys)
Controllers:
• Enables/Disables a provider
• Configures trace capture settings
• Starts/Stops a trace
Consumer:
• Reads trace events from log file
• Reads trace events real-time from a trace session
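The three roles can be modelled in a few lines. The following is a toy, in-memory model in Python, not the Windows ETW API; the class and function names are invented for illustration, and only the provider name MF_Driver_Cdm comes from the slide.

```python
class TraceSession:
    """Holds the set of enabled providers and the captured events."""

    def __init__(self):
        self.enabled = set()
        self.running = False
        self.events = []


def enable(session, provider):
    """Controller role: switch tracing on for one provider."""
    session.enabled.add(provider)


def start(session):
    """Controller role: start the trace."""
    session.running = True


def stop(session):
    """Controller role: stop the trace."""
    session.running = False


def emit(session, provider, message):
    """Provider role: the event is kept only if tracing is on for this provider."""
    if session.running and provider in session.enabled:
        session.events.append((provider, message))


def consume(session):
    """Consumer role: read back the events captured so far."""
    return list(session.events)


session = TraceSession()
enable(session, "MF_Driver_Cdm")
start(session)
emit(session, "MF_Driver_Cdm", "request received")
emit(session, "OtherProvider", "not captured, provider not enabled")
stop(session)
emit(session, "MF_Driver_Cdm", "not captured, trace stopped")
print(consume(session))
```

Only the first event is captured, which mirrors why a trace has to be running, with the right providers enabled, while the problem is reproduced.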
CDFControl v2.5
• CDFControl is a hybrid controller and consumer
• It can start/stop/enable and configure an ETW/CDF trace session
• It can consume (read) trace events from a log file, or from a live real-time trace session
• The original version operated only as an ETW Controller, and was published under CTX111961
CDFControl 2.5 Demo
Troubleshooting Scenarios
Troubleshooting scenarios
• Application Streaming
• Seamless/Multi-Monitor
• 3rd Party Applications
• CPU Spikes
• Deadlocks/Hangs
• Database
• Network
• Black Hole Effect
• XenApp Plugin (PNA)
• Debugging
Application Streaming
1. End user launches app from WI or PN Agent
2. RAD file is downloaded
3. RAD file launches client Application Isolation Environment (AIE)
4. RAD file instructs streaming client to download:
• Manifest file | AIE rules | Application executable | Pre and post execution scripts
5. Streaming client launches executable according to instructions in manifest file and AIE rules including pre and post execution scripts and registers with the ctxsbx.sys (redirector)
6. Application is available to user
7. Streaming Client requests additional files as required, checking first in the client cache, then if necessary, downloading additional files from the file server
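Step 7 is a cache-first lookup. A minimal Python sketch of that pattern follows; it is not the streaming client's actual code. The names are hypothetical, and storing a downloaded file in the cache for later requests is an assumption made here for illustration.

```python
def fetch_file(name, cache, download):
    """Return (contents, source), preferring the client cache over the file server."""
    if name in cache:
        return cache[name], "cache"
    data = download(name)      # fall back to the file server
    cache[name] = data         # assumption: keep the file for later requests
    return data, "file server"


downloads = []


def fake_download(name):
    """Stand-in for a file server read; records every request it receives."""
    downloads.append(name)
    return b"contents of " + name.encode()


cache = {}
print(fetch_file("app.dll", cache, fake_download)[1])  # first request
print(fetch_file("app.dll", cache, fake_download)[1])  # second request
```

The first request goes to the file server and the second is served from the cache. When troubleshooting, this is why a stale or damaged cache entry can make a client behave differently from a freshly installed one.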
What happens on the client side?
[Diagram: End User, RAD file, Streaming Client and AIE, and Network File Servers holding the manifest file, executable and AIE rules, plus .dll’s, data files and other .exe’s]
Application Streaming
• Isolate the Issue
• When?
• Profiling
• Publishing
• Streaming
• How?
• Streaming to Server
• Streaming to Client
• Versions?
• WI 4.5, 5.0
• License server 4.5, 5.0
• Client
Application Streaming
Streaming Client Troubleshooting:
• Client installation is required on workstations
• Verify that the Citrix Streaming Service is started, or restart it
• Reference CTX116483 – required permissions
• Enable debug console
• HKEY_LOCAL_MACHINE\Software\Citrix\Rade
• REG_DWORD: “EnableDebugConsole”
• Value: 1 to switch on, 0 to switch off
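If the debug console has to be toggled on several workstations, the registry value can be distributed as a .reg file. The Python sketch below only generates the file text and does not touch the registry. The key and value name are taken from the slide; the helper name is made up for illustration.

```python
def debug_console_reg(enabled):
    """Return .reg file text that sets EnableDebugConsole to 1 (on) or 0 (off)."""
    value = 1 if enabled else 0
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        "[HKEY_LOCAL_MACHINE\\Software\\Citrix\\Rade]\n"
        f"\"EnableDebugConsole\"=dword:{value:08x}\n"
    )


print(debug_console_reg(True))
```

Save the output with a .reg extension and import it on the client. Generate the file again with `False` to switch the console off once the data has been collected.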
Application Streaming
Leverage realtime CDF tracing!
• Run CDFControl on the client (where client is installed)
• Choose the Application Streaming category
• Enable realtime tracing
• Provide a TMF path (CTX106233)
• Start tracing and reproduce the launch failure
Seamless/Multi-Monitor
[Diagram: SEAMLESS HOST COMPONENTS: Winlogon Default, ICA Client, winlogon.exe, icast.exe, wfshell.exe, seamls20.dll, icactls.dll, sehook20.dll, TWIWorker, TWISysTrayAgent, TWIReader]
Seamless/Multi-Monitor
[Diagram: SEAMLESS CLIENT COMPONENTS: wfica32.exe, vdtwin30.dll, vdtwn.dll, ctxsrcc.lib, GAI, LVB]
Seamless/Multi-Monitor
Multi-Monitor
• An optional component
• The client provides the monitor layout via the Thinwire channel; the layout is shared, through shared memory, with every process that loads mmhook.dll
• Work area changes are always posted to the host. A change can come from a change to the existing work area, or from a change in virtual screen size when monitors are added or removed.
• API hooks are controlled by flags and can be customized per process. Refer to CTX115637 for the configuration options
Seamless/Multi-Monitor
• Press Shift+F2 to change to Full Screen mode
• Reconnect as a fixed-size window session
• Set the global flags to 0x26DEA7 to see if that fixes the issue
• This value is a combination of flags (see CTX101644 for details of each bit), including:
• 0x1 (Disable session sharing), 0x2 (Disable modality check), 0x4 (Disable AA hook)
• Analyze CDF trace for MF_DLL_CTXNOTIF and MF_SESSION_TWI
• Analyze window information using SPY++/Window History/Message History
• Try per-window exception flags
• Analyze application logic (API flow) using TracePlus utility
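A global flags value such as 0x26DEA7 is a bitmask, so it can be broken down into its individual bits to test them one at a time. A small Python sketch follows. Only the three flag names given on the slide are filled in; every other bit is deliberately left as a pointer to CTX101644 rather than guessed.

```python
KNOWN_FLAGS = {
    0x1: "Disable session sharing",
    0x2: "Disable modality check",
    0x4: "Disable AA hook",
}


def decode_flags(value):
    """Return (bit, description) for every bit set in value, lowest bit first."""
    result = []
    bit = 1
    while bit <= value:
        if value & bit:
            result.append((bit, KNOWN_FLAGS.get(bit, "see CTX101644")))
        bit <<= 1
    return result


for bit, name in decode_flags(0x26DEA7):
    print(f"0x{bit:X}: {name}")
```

If the combined value fixes the issue, re-testing with subsets of these bits narrows down which flag is actually responsible.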
Seamless/Multi-Monitor
• Get the Window class name which is exhibiting the problem
• Collect CDF traces for the modules concerned ONLY
• CTXNOTIF, MMHOOK, TWCDS, TWI, TWI_HOOK
• Analyze which aspects of the behavior could be affected by hooks
• Enable/disable hooks: does the problem happen on a single monitor too? If yes, mmhook is unlikely to be the cause. Disable mmhook and see what happens.
• Compare the window styles at host and client
• For seamless-specific issues, verify whether the problem also happens in an ICA Desktop/RDP session
3rd Party Applications
• How does the application work?
• Is it Native, or does it run on a Framework, such as .NET or Java?
• Do you have the right versions of the Framework installed?
• Are the correct dependencies present, and does it work at the console?
• Does it require certain file and registry access? (Does it need Write permissions etc. ?)
• Does it require component registration?
• Inspect core functionality
• View the application/process under an analysis tool such as ProcessExplorer or WinDbg
• Inspect all modules (DLLs) loaded by the application
• Validate any dependencies (missing DLL's?)
• Inspect named events and handle usage (synchronization/resource problems?)
• Validate file and registry access using ProcessMonitor
• Run application under the AppVerifier utility to check for a multitude of issues
3rd Party Applications
• Leverage the Global Flags for user-mode applications using the Gflags utility
• Set the 3rd party application to run under Image File Execution Options
• Configure a debugger to invoke the application (such as WinDbg)
• When the application launches, the debugger will automatically attach to the process and halt its execution!
• This gives the opportunity to explore all application threads from process initialization (~*kb)
• From here the internals of the application can be understood at the Native Windows API level (i.e. Which Windows API's are being used)
3rd Party Applications
• Use ProcessExplorer to view the loaded modules for a process, and check for the presence of any hook modules (hooking DLL's)
• Hook modules can alter the natural behavior of applications, which can sometimes cause problems
• Try excluding the problem application from all Citrix hooks (CTX107825)
CPU Spikes
• Try to define a pattern (leverage perfmon)
• Determine the offending Thread ID causing the spike (Process Explorer, QSlice)
• Obtain userdump of offending process immediately after (Userdump.exe, WinDbg.exe)
• Check CDF trace for repeated (looping) messages (if Citrix component)
• Use application spy to look at what the application is doing (TracePlus, Logger)
Deadlocks
• Windows Vista and Server 2008 offer the new Wait Chain Traversal (WCT) API!
• This offers applications a mechanism to check internally for wait conditions, and also allows for custom tools to be created which can also check for application hangs – LIVE!
• No cool WCT tools available? The debugger is your friend!
• Attach to hung process/service and generate a dump for post-mortem analysis:
• .dump /ma c:\PathToDump\DeadlockedApp.dmp
• Manually inspect thread states, and get the debugger's opinion with:
• !analyze -hang -v
THE WINDOWS TASK MANAGER CAN CAPTURE USER DUMPS IN VISTA & 2008!!!
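What a wait chain check looks for can be shown with a small sketch: follow "thread A waits on thread B" links and see whether the chain loops back on itself. The Python below is an illustration of the idea only, not the WCT API. The thread IDs and the `waits_for` mapping are made-up input that would, in practice, come from a dump or a live tool.

```python
def find_deadlock(waits_for, start):
    """Follow the wait chain from start; return the cycle if the chain loops.

    waits_for maps a thread ID to the thread ID that owns the resource
    it is blocked on. Threads that are not waiting have no entry.
    """
    chain = []
    position = {}
    current = start
    while current is not None and current not in position:
        position[current] = len(chain)
        chain.append(current)
        current = waits_for.get(current)
    if current is None:
        return None                      # chain ends at a thread that is running
    return chain[position[current]:]     # threads that wait on each other


waits_for = {1: 2, 2: 3, 3: 1, 4: 1}
print(find_deadlock(waits_for, 4))
```

Thread 4 is blocked but not part of the cycle; threads 1, 2 and 3 wait on each other and can never make progress, which is the pattern to look for when reading thread states in a hang dump.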
Slow logons
• Understand the logon process and identify the slowdown!
• Validate via network trace that the connection between server and client is good
• If the connection makes it to the server, check which processes exist
• Use TaskManager and sort by session ID
• Gather userdumps for each process for the slow session to try to identify any synchronization problems, such as LPC and ALPC wait chain conditions
• Ensure Terminal Services is running (svchost.exe) and that the thread count appears normal
• Ensure critical Citrix processes are okay, such as IMA, CpSvc and XML
The XenApp client
• PNAgent.exe starts up and communicates with PNAMain.exe to share application launch and shortcut details
• PNAMain.exe initiates communication with the Web Server for application requests and config.xml settings
• WFCRun32.exe works with WFICA32.exe to launch an application
• Best to use a live-debug approach as there is no inherent tracing readily available on the client
The XenApp client
For single sign-on problems ensure:
• PNSSON is at the top of the network provider list
• SSONSVR is running
• Nothing is causing any logon delays (such as 3rd party monitoring applications etc.) as this would cause the SSON ticket to expire, therefore causing SSONSVR to exit
• Enable a default debugger to look out for any unexpected termination of the client processes
Debugging
• User Mode versus Kernel Mode
• The Windows operating system can be conceptually divided into 2 parts:
• User Space (User Mode)
• Kernel Space (Kernel Mode)
• Applications run in User Mode
• System drivers run in Kernel Mode (Privileged Mode)
[Diagram: USER MODE / USER SPACE containing USER APPLICATION boxes; KERNEL SPACE containing keyboard.sys, win32k.sys, tcpip.sys, rusb2w2k.sys]
[…]
Debugging
• Windows Vista and Server 2008 no longer rely on boot.ini for debug settings
• Say hello to the BCDEDIT utility!
• (http://technet.microsoft.com/en-us/library/cc721886.aspx)
• To do a live local debug, you first need to enable debugging on the server
• bcdedit /debug on (requires reboot)
Debugging
In the event of a system crash (BSOD), ensure that:
1. The Pagefile (pagefile.sys) is configured to run on the system drive (where Windows is installed)
2. The Pagefile is larger than the amount of physical RAM on the server
3. Startup and recovery options are set for a kernel or complete memory dump
4. Enough space exists to write the dump file
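The four checks above can be written down as a simple validation routine. The Python sketch below works on values you have already gathered from the server. The dictionary keys are hypothetical names chosen for this example, and the code does not query Windows itself.

```python
def dump_readiness(config):
    """Return a list of problems that would prevent a usable crash dump."""
    problems = []
    if config["pagefile_drive"].upper() != config["system_drive"].upper():
        problems.append("pagefile is not on the system drive")
    if config["pagefile_mb"] <= config["ram_mb"]:
        problems.append("pagefile is not larger than physical RAM")
    if config["dump_type"] not in ("kernel", "complete"):
        problems.append("dump type is not kernel or complete")
    if config["free_space_mb"] < config["expected_dump_mb"]:
        problems.append("not enough free space to write the dump file")
    return problems


server = {
    "system_drive": "C:",
    "pagefile_drive": "D:",
    "pagefile_mb": 4096,
    "ram_mb": 8192,
    "dump_type": "small",
    "free_space_mb": 20000,
    "expected_dump_mb": 8192,
}
for problem in dump_readiness(server):
    print(problem)
```

An empty list means all four conditions are met. Running a check like this before a crash happens avoids discovering a missing dump afterwards.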
Debugging
• To debug application crashes, configure a default application debugger to handle fatal application errors!
• Dr. Watson is gone in Vista and Server 2008
• Manually configure a default application debugger (CTX105888)
• Use the TestDefaultDebugger tool to ensure that the server is able to capture userdumps (CTX111901)
Debugger Basics
• NTSD -pn ProcessName: Attaches to a running process
• ~*kb: Lists the call stacks of all threads
• x *!*Symbol*: Searches for a symbol matching the one specified
• bp: Sets a breakpoint (typically used with a symbol)
• kb: Dumps the call stack of the current thread
• !analyze -v: Scans for exceptions
• !analyze -hang -v: Scans for wait chains
Debugger Basics (The Call Stack)
[Annotated debugger output; callouts: Thread #, PID, TID, Function Parameters (First Parameter, Second Parameter), Module Name, Function Name, Offset, "Switch to thread 4", "First Parameter off stack"]
Case Studies
Introducing the Citrix Symbol Server
• #1 feedback during SMART post incident reviews
• Traditional data collection/upload/analysis cycle takes too long
• Live debugging while problem is occurring
• Significant delays introduced when waiting on large uploads to complete
• Resources are strained during CritSits; keep focus on issue resolution
• 64-bit adoption increasing
• Full system dump files will get larger
• Significantly longer upload times
Citrix Symbol Server – The Payoff – A Case Study
• A critical Citrix service is crashing on startup
• Users unable to connect
• Debugger attached to process at startup
• Crash caused by heap corruption
• Full page heap enabled
• New stack trace points to root cause
• Case archives reveal that problem is resolved with an existing hotfix
• Time to resolve
• With symbol server: less than 1 hour
• Estimated time without symbol server: more than 1 business day
Using the Citrix Symbol Server
• Products supported
• Citrix Presentation Server 3.0, 4.0 and 4.5 – all languages / hotfixes
• XenApp 5.0 – all languages / hotfixes
• Location
• Add http://ctxsym.citrix.com/symbols to your symbol path
• Questions / Feedback
• Article CTX118622 on Citrix Knowledge Base (http://support.citrix.com)
• Send additional feedback to [email protected]
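A symbol path that combines the Citrix symbol server with a local download cache can be assembled using the debugger's `srv*cache*server` syntax. The Python sketch below only builds the string. The cache folder c:\symbols is an arbitrary choice, and adding the Microsoft public symbol server alongside the Citrix one is an assumption made for this example.

```python
def symbol_path(cache_dir, servers):
    """Build a debugger symbol path that caches downloads in cache_dir."""
    return ";".join(f"srv*{cache_dir}*{url}" for url in servers)


path = symbol_path(
    r"c:\symbols",
    [
        "http://ctxsym.citrix.com/symbols",
        "http://msdl.microsoft.com/download/symbols",
    ],
)
print(path)
```

The resulting string can be set in the debugger with `.sympath` or through the `_NT_SYMBOL_PATH` environment variable, so that both Citrix and Windows modules resolve in the same session.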
Case Study – CDFControl Realtime Tracing Demo
Questions?