
Connection Server service sometimes stops responding


Solved by lkgsak2


Posted

Hi All,

Sometimes the Connection Server service stops responding and the Horizon Client can't connect to it.

We need to restart the service, and it then takes 3-5 minutes before we can connect normally.
Has anyone encountered this before?

vSphere 7.0u3

Connection-Server-x86_64-8.12.1-23574797(windows)

windows-Agent-x86_64-2312.1-8.12.1-23507832

 


Posted
1 hour ago, Neo178 said:

Hi All,

Sometimes the Connection Server service stops responding and the Horizon Client can't connect to it.

We need to restart the service, and it then takes 3-5 minutes before we can connect normally.
Has anyone encountered this before?

vSphere 7.0u3

Connection-Server-x86_64-8.12.1-23574797(windows)

windows-Agent-x86_64-2312.1-8.12.1-23507832

 

Have you tried deploying a replica and replacing the connection server?

What do the logs say?

Network issues?

Stephen Wagner (President, Digitally Accurate Inc.)

VMware vExpert (vExpert Pro, vSphere, vSAN Awards), Omnissa Tech Insider, NVIDIA NGCA Advisor, VMUG Leader, and Director (Board of Directors) at World of EUC

Check out my Tech Blog: https://www.StephenWagner.com

  • Solution
Posted (edited)

We encountered this too some time ago after we updated from 2312 to 2312.1. We also opened a case regarding this topic.

tl;dr: Just update to 2406 and the issue is gone for good.

 

Support said that, although it was not officially mentioned in the release notes, other customers had reported this issue as well and that we should update; they were convinced that it is fixed in 2406.

Since we had this issue in both our production and lab environments, we updated the lab ASAP, waited 1 1/2 weeks (the latest point at which the issue had come back after a reboot for us), and then pushed the update to production, because this bug was causing us serious issues with duplicated sessions and problematic machines. At first we updated only the connection servers, not the agents on the golden images, which worked for us. (We updated the agents in our next planned maintenance window.)

Details of what we encountered with two connection servers behind NLB on 2312.1 after the service on one of them ran into this issue:

- Sometimes the dashboard shows "Service status not recognized" (or something like that). If you log on to both connection servers via their direct URLs, you see that the counters on the two dashboards differ. You will also see that one of the connection servers reports that it cannot recognize its own status.

- If you schedule an image push in this condition, you will encounter errors and failed push operations. (This causes problems once users start logging in, because of the max pool size configuration and no free machines being left.)

- Since the servers do not seem to synchronize correctly, in automated pools where users can only reconnect to an existing session and not create new ones, users are sporadically able to create duplicate sessions. The old sessions are not flagged as "already used" or cleaned up.

- If you reboot both servers too close to each other while this issue persists, the misalignment in agent status gets worse.

The only workaround for us was to reboot both servers with a gap of 15 minutes between them. After that the dashboard is synchronized again and you will find a few "already used" or "agent unreachable" machines. Let that sit for around another 20 minutes, then start cleaning up the remaining problematic machines.
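The timing of that staged reboot can be sketched in a few lines of Python. This is only an illustration of the schedule described above (15 minutes between reboots, about 20 minutes of settling before cleanup), not a tool from the thread or from support. The function name `staged_reboot_plan` and the server names `cs01` and `cs02` are hypothetical, and the script only computes the timeline; it does not reboot anything.

```python
from datetime import datetime, timedelta


def staged_reboot_plan(servers, start, gap_minutes=15, settle_minutes=20):
    """Return a list of (time, action) tuples for a staggered reboot.

    servers        -- connection server names (hypothetical names are fine)
    start          -- datetime of the first reboot
    gap_minutes    -- wait between consecutive reboots (15 in the post above)
    settle_minutes -- wait after the last reboot before cleanup (about 20)
    """
    if not servers:
        raise ValueError("at least one server is required")

    plan = []
    for index, server in enumerate(servers):
        when = start + timedelta(minutes=index * gap_minutes)
        plan.append((when, f"reboot {server}"))

    # Cleanup starts only after the last rebooted server has had time to settle.
    last_reboot = plan[-1][0]
    plan.append((last_reboot + timedelta(minutes=settle_minutes),
                 "clean up remaining problematic machines"))
    return plan


if __name__ == "__main__":
    # Hypothetical maintenance window starting at 22:00.
    for when, action in staged_reboot_plan(["cs01", "cs02"],
                                           datetime(2024, 1, 1, 22, 0)):
        print(when.strftime("%H:%M"), action)
```

With two servers and a 22:00 start, the plan puts the second reboot at 22:15 and the cleanup at 22:35.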

 

Edit: clarified configuration

Edited by lkgsak2
Posted

Hi @lkgsak2,

So you updated to version 2406 to fix the Connection Server not responding, but were there also problems with NLB in that version?
If NLB is not used, should version 2406 be okay?

Posted

No, I just added the information that we had two connection servers (one replica) behind NLB, and that an additional symptom is that they go out of sync when the service stops working correctly.

From my understanding, this has nothing to do with NLB or DNS round robin.

2406 has been working fine since we updated, and as I said, it even works without updating the agents at first, if it's urgent to get rid of the problem. We did not run into any other problems after updating.

 

  • Employee
Posted

Hello, 

You can start by doing an overall health check of the environment. 

First, check replication of ADAM DB as per https://kb.omnissa.com/s/article/1021805

Check the JMS communication status under Global Settings > Security. It has to be 'Enhanced'; otherwise it is a problem.

Make sure IIS and BranchCache are not present on the server.

Check whether the issue persists without AV active.

Posted
12 hours ago, Victor León said:

Hello, 

You can start by doing an overall health check of the environment. 

First, check replication of ADAM DB as per https://kb.omnissa.com/s/article/1021805

Check the JMS communication status under Global Settings > Security. It has to be 'Enhanced'; otherwise it is a problem.

Make sure IIS and BranchCache are not present on the server.

Check whether the issue persists without AV active.

For us, these checks all came back with no findings. We did not even find a trace of this error in the respective logs (info log level).

The case I'm referring to was "00701348"; here is the quote from the final solution:

Quote

There has been similar reported incidents with respect to connection server issues which will be fixed after restarting the connection servers. Such issues has been addressed in the latest release 2406.

We request you to upgrade to the latest version and see if we are still experiencing the issue and share us the feedback.

But even with the issue present before a restart, we were not able to find any other indicator of the root cause or anything we could actively monitor via external tools. Since we had a lab environment, we tried several things. The ADAM database was functional and replicating without errors, services were up and running, the admin website was working, logging in to machines was working, and the logs showed only informational messages with no warnings or errors. It seemed to affect only the communication between connection servers and agents (possibly only JMS). In the end, only a staged reboot of the connection servers was successful.

This error never appeared under 2312 or earlier, and it has never appeared since 2406. It seems to be a known issue for 2312.1, but I got no information from support on whether there are any special conditions under which it occurs, nor does it appear in the official documents (release notes/known issues).

  • Employee
Posted

Hi lkgsak2, unfortunately the logs in your case are not in 'debug' mode and some files are missing. It is hard to tell why it is failing with only 'info'-level information.

Posted

I've also been running into random issues with 2312.1 suddenly declining new sessions. I'm planning to upgrade to 2406 tonight, and I'll report back on whether the same issues persist.

Essentially, our connection servers randomly stop accepting new users. Nothing actively declines the sessions; they just eventually time out. No errors show in the Connection Server admin console or in the debug log while the session is starting. The handshake just never completes, and we are forced to restart the Connection Server service. After restarting the service, everything is back to normal until it randomly happens again.

The message I generally see is:

Item enqueued on "Outbound JMS Responder" but there are no workers available to process it. Busy workers = 1, queue length = 10

The only way I've found to fix this is to restart the service.
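If you want to watch for that message, a small Python sketch like the one below could scan log lines for it. The regular expression is built only from the message text quoted above; where the log file lives, and whether the wording varies between versions or other queues, are assumptions you would need to check in your own environment. The function name `find_stalled_queues` is made up for this example.

```python
import re

# Pattern derived from the message quoted in this post; the queue name is
# captured so other queues with the same wording would also match.
STALLED_QUEUE = re.compile(
    r'Item enqueued on "(?P<queue>[^"]+)" but there are no workers available '
    r'to process it\. Busy workers = (?P<busy>\d+), '
    r'queue length = (?P<length>\d+)'
)


def find_stalled_queues(lines):
    """Return (queue, busy_workers, queue_length) for each matching line."""
    hits = []
    for line in lines:
        match = STALLED_QUEUE.search(line)
        if match:
            hits.append((match.group("queue"),
                         int(match.group("busy")),
                         int(match.group("length"))))
    return hits


if __name__ == "__main__":
    sample = [
        'Item enqueued on "Outbound JMS Responder" but there are no workers '
        'available to process it. Busy workers = 1, queue length = 10',
        'some unrelated log line',
    ]
    for queue, busy, length in find_stalled_queues(sample):
        print(f"{queue}: busy workers={busy}, queue length={length}")
```

Because the pattern uses `search`, it still matches when the message is preceded by a timestamp or other prefix on the same line.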

Posted

I've found that the update to 2406 did fix the Outbound JMS Responder hanging.

The licensing switch did impact me a bit: I upgraded my load-balanced connection servers without realizing that my non-load-balanced connection servers would lose their licensing. If anyone is in a similar situation, deploying a load balancer but still in the testing phase, I'd recommend upgrading all of your non-load-balanced servers first, at a time when no one is using them. I'm using the edge appliance to get my server licensing, so the experience with perpetual licensing might differ from mine.
