Weird Portal behaviour from 2-5am: Service Requests show only New status but are running in SCSM console

Jason_Tayler · October 2018

Wondering if anyone can help. We are seeing some strange behaviour with Service Requests: My Requests shows service requests as New, but in the SCSM Console they are running.
This only seems to occur overnight, from 2-5am. During this time some SCSM connectors are running, but nothing has changed in that area, and we have only been seeing this over the last 2 weeks.

We do see this in the webconsole log:
Connecting to server failed: The client has been disconnected from the server. Please call ManagementGroup.Reconnect() to reestablish the connection.

And in the event log for the Cireson portal server:
ASP.NET event

Event code: 3005

Event message: An unhandled exception has occurred.

WebPortal event

System.TimeoutException: The requested operation timed out. ---> System.TimeoutException: This request operation sent to net.tcp://imaserverandwotnot.domain.something.com:5724/DispatcherService did not receive a reply within the configured timeout (00:30:00). The time allotted to this operation may have been a portion of a longer timeout. This may be because the service is still processing the operation or because the service was unable to send a reply message. Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client.

Server stack trace:

at System.ServiceModel.Dispatcher.DuplexChannelBinder.SyncDuplexRequest.WaitForReply(TimeSpan timeout)

at System.ServiceModel.Dispatcher.DuplexChannelBinder.Request(Message message, TimeSpan timeout)

at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)

at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)

at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

Exception rethrown at [0]:

at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)

at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)

at Microsoft.EnterpriseManagement.Common.Internal.IDispatcherService.DispatchUnknownMessage(Message message)

at Microsoft.EnterpriseManagement.Common.Internal.ConnectorFrameworkConfigurationServiceProxy.ProcessDiscoveryData(Guid discoverySourceId, IList`1 entityInstances, IDictionary`2 streams, ObjectChangelist`1 extensions)

--- End of inner exception stack trace ---

at Microsoft.EnterpriseManagement.Common.Internal.ExceptionHandlers.HandleChannelExceptions(Exception ex)

at Microsoft.EnterpriseManagement.Common.Internal.ConnectorFrameworkConfigurationServiceProxy.ProcessDiscoveryData(Guid discoverySourceId, IList`1 entityInstances, IDictionary`2 streams, ObjectChangelist`1 extensions)

at Microsoft.EnterpriseManagement.ConnectorFramework.IncrementalDiscoveryData.CommitInternal(EnterpriseManagementGroup managementGroup, Guid discoverySourceId, Boolean useOptimisticConcurrency)

at Microsoft.EnterpriseManagement.ConnectorFramework.IncrementalDiscoveryData.CommitForUserDiscoverySource(EnterpriseManagementGroup managementGroup, Boolean useOptimisticConcurrency)

at Cireson.ServiceManager.ManagementService.ManagementService.<>c__DisplayClass8_0.<InvokeCommand>b__0(EnterpriseManagementGroup emg)

at Cireson.ServiceManager.ManagementService.ManagementService.InvokeCommand[T](Func`2 func, Boolean invokeAsService)

at System.Threading.Tasks.Task.Execute()

--- End of stack trace from previous location where exception was thrown ---

at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()

at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

at Cireson.ServiceManager.ManagementService.ManagementService.<InvokeCommandAsync>d__7.MoveNext()

--- End of stack trace from previous location where exception was thrown ---

at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()

at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

at Cireson.ServiceManager.Services.Projection.<UpdateNewProjectionAsync>d__13.MoveNext()

"Commit new threw an exception" is another one we see.

Has anyone got any ideas? The environment has been very static over the last 6 months, so nothing has changed, and I am not sure whether this is an MS SCSM or a Cireson problem at this stage. Given the requests are running in SCSM, I'm leaning towards a Cireson problem at the moment.

SCSM = 2012 R2 CU9

Current Portal Version: 8.4.3.2012

Management Pack Version: 7.7.2012.185

Brian_Wiest · October 2018

net.tcp://imaserverandwotnot.domain.something.com:5724/DispatcherService

That is your console connection. Basically, the Cache Builder is reporting that the call to the management service is timing out. If this happens during the time your connectors are running, it could very well be that the database is overloaded and cannot respond to the additional traffic.

You should review a few things.

Is the portal server also a management server, or is it a separate server where you have a value other than localhost for the Cache Builder to connect to? If separate, review network bottlenecks between the two points.

Review the Operations Manager logs on your primary management server. I'm betting you will find warnings for

Source: SMCMDB Subscription Data Source Module

Event ID: 33610

This shows that workflows are responding slower than expected due to the load on, or slow responses from, the SQL server.

Those warnings are not uncommon during the overnight connector syncs. But seeing them throughout the day is a concern.
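
To tell those two cases apart, one option is to export the Event ID 33610 warnings and bucket them by hour of day. A minimal Python sketch, assuming the event timestamps have already been exported as ISO-8601 strings; the function names, the sample timestamps and the 2-5am window bounds are illustrative assumptions, not taken from a real log:

```python
from collections import Counter
from datetime import datetime


def warnings_by_hour(timestamps):
    """Count events per hour of day (0-23) from ISO-8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def outside_window(timestamps, start_hour=2, end_hour=5):
    """Return the timestamps whose hour falls outside [start_hour, end_hour)."""
    return [
        ts for ts in timestamps
        if not start_hour <= datetime.fromisoformat(ts).hour < end_hour
    ]


# Made-up sample data standing in for exported 33610 warnings.
sample = [
    "2018-10-15T02:03:11",
    "2018-10-15T02:47:52",
    "2018-10-15T04:10:05",
    "2018-10-15T13:22:40",
]

print(warnings_by_hour(sample))
print(outside_window(sample))
```

Anything returned by `outside_window` is a warning raised outside the connector window, which is the pattern worth investigating.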

You should also review your disk latency on the databases.

In this post https://community.cireson.com/discussion/comment/15512#Comment_15512

I share a SQL query that can show you your latency. (I forgot to note in the post that the query's figures are counted from the last server/instance reboot or restart, so if you have stopped the SQL agent and restarted, the metrics will reset and report from that moment forward.)
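
Because such figures accumulate from the last restart, a long quiet period can hide a short overnight spike. The linked query is not reproduced here; the sketch below only assumes it reports cumulative counters of the kind SQL Server exposes in `sys.dm_io_virtual_file_stats` (total stall time in milliseconds and number of I/Os). The function names and all numbers are made up for illustration:

```python
def avg_latency_ms(io_stall_ms, io_count):
    """Average latency per I/O from cumulative stall time and I/O count."""
    if io_count == 0:
        return 0.0
    return io_stall_ms / io_count


def interval_latency_ms(earlier, later):
    """Average latency between two snapshots of (stall_ms, io_count).

    Raises ValueError if the counters went backwards, which would suggest
    they were reset (for example by a restart) between the snapshots.
    """
    d_stall = later[0] - earlier[0]
    d_count = later[1] - earlier[1]
    if d_stall < 0 or d_count < 0:
        raise ValueError("counters were reset between snapshots")
    return avg_latency_ms(d_stall, d_count)


# Hypothetical snapshots taken just before and just after the 2-5am window.
before_window = (1_000_000, 95_000)
after_window = (1_200_000, 100_000)

print(avg_latency_ms(*after_window))                     # average since restart
print(interval_latency_ms(before_window, after_window))  # average in the window
```

With these made-up numbers the since-restart average looks modest while the average inside the window is several times higher, which is why taking two snapshots around the problem period can be more telling than a single reading.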

HTH

Tom_Hendricks · October 2018

Adding to this--your description sounds eerily familiar. The history (ECL/RECL) grooming kicks off at a few different times throughout the night/morning (server time) and the one that has given me the most trouble by far kicks off at exactly 2:00 A.M. server time.

These jobs are undocumented and do not show up anywhere in an interface that I am aware of. They also will not throw any errors that I have seen, other than errors generated by everything else that cannot get a DB connection while the system is locked up. Even if you use different management servers for your portal servers (and you should!) your sites will grind to a halt at that time.

The management pack that controls this behavior cannot be edited, but the workflows can be overridden--one of the benefits of sharing so much code with SCOM. A simple way to know if this is your issue is to override the jobs to disable them. If your problem goes away, you know that was it. Then you can address it. If not, I would look for any other jobs running around that time that you could re-schedule (too many at once?) or optimize.

In my case, I worked with MS support and we did 2 things:

1. Created and imported an override MP to completely disable the grooming jobs that ran when the trouble was occurring
2. Extracted the SQL scripts that those jobs were executing and modified them to run smaller batches, for longer amounts of time (can be run as a scheduled SQL Agent job or by other means)

In my case, and possibly yours, the reason the jobs were having so much trouble a few years after our initial implementation is that they were timing out due to the amount of data being groomed (not actually that much, but enough to cause the problem...). Each timeout made the stack bigger for the next run instead of smaller, meaning it got worse over time. Moving the grooming out of a workflow meant that there would be no timeout, so it could run as long as it needed to, and reducing the batch size lessened the effect on users while it runs (we are 24/7/365, so recurring jobs cannot bring the system down, ever).
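
The smaller-batches idea can be sketched in general terms. This is not the Microsoft grooming script: it uses an in-memory SQLite database as a stand-in for SQL Server, and the table `history_log`, the column `changed_at`, the cutoff date and the batch size are all hypothetical. It only illustrates deleting in small committed batches so that no single statement holds locks for long:

```python
import sqlite3
import time


def groom_in_batches(conn, cutoff, batch_size=500, pause_seconds=0.0):
    """Delete rows older than cutoff in small batches; return rows deleted.

    Each batch is committed on its own, so other work can get in between
    batches instead of waiting for one huge delete to finish.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM history_log WHERE id IN ("
            " SELECT id FROM history_log WHERE changed_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        time.sleep(pause_seconds)  # optional breathing room between batches
    return total


# Hypothetical data: 1200 old rows to groom, 300 recent rows to keep.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history_log (id INTEGER PRIMARY KEY, changed_at TEXT)")
conn.executemany(
    "INSERT INTO history_log (changed_at) VALUES (?)",
    [("2017-01-01",)] * 1200 + [("2018-10-01",)] * 300,
)
conn.commit()

deleted = groom_in_batches(conn, cutoff="2018-01-01", batch_size=500)
remaining = conn.execute("SELECT COUNT(*) FROM history_log").fetchone()[0]
print(deleted, remaining)
```

The trade-off is the one described above: the job takes longer overall, but each batch is short, and a job that is not bound by a workflow timeout can keep going until the backlog is cleared.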

This is the only blog I have ever seen that gets into this, but it is a great start: https://www.concurrency.com/blog/w/service-manager-change-log-grooming-%E2%80%93-part-2-custo

Jason_Tayler · October 2018

Cheers guys, I will get on to Premier to get them to look at these jobs and see what can be done.
