
Weird portal behaviour from 2-5am: Service Requests show only New status in the portal but are running in the SCSM console

Jason_Tayler Customer IT Monkey ✭
Wondering if anyone can help. We are seeing some strange behaviour with Service Requests: the My Requests view in the portal shows Service Requests as New, however in the SCSM console they are running.
This only seems to occur overnight from 2-5am. During this time some SCSM connectors are running, however nothing has changed in this area; we have only been seeing this over the last 2 weeks.

We do see this in the web console log:
Connecting to server failed: The client has been disconnected from the server. Please call ManagementGroup.Reconnect() to reestablish the connection.

and in the event log on the Cireson portal server:
ASP.net Event
Event code: 3005 
Event message: An unhandled exception has occurred. 

WebPortal event
System.TimeoutException: The requested operation timed out. ---> System.TimeoutException: This request operation sent to net.tcp://imaserverandwotnot.domain.something.com:5724/DispatcherService did not receive a reply within the configured timeout (00:30:00).  The time allotted to this operation may have been a portion of a longer timeout.  This may be because the service is still processing the operation or because the service was unable to send a reply message.  Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client.

Server stack trace: 
   at System.ServiceModel.Dispatcher.DuplexChannelBinder.SyncDuplexRequest.WaitForReply(TimeSpan timeout)
   at System.ServiceModel.Dispatcher.DuplexChannelBinder.Request(Message message, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

Exception rethrown at [0]: 
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
   at Microsoft.EnterpriseManagement.Common.Internal.IDispatcherService.DispatchUnknownMessage(Message message)
   at Microsoft.EnterpriseManagement.Common.Internal.ConnectorFrameworkConfigurationServiceProxy.ProcessDiscoveryData(Guid discoverySourceId, IList`1 entityInstances, IDictionary`2 streams, ObjectChangelist`1 extensions)
   --- End of inner exception stack trace ---
   at Microsoft.EnterpriseManagement.Common.Internal.ExceptionHandlers.HandleChannelExceptions(Exception ex)
   at Microsoft.EnterpriseManagement.Common.Internal.ConnectorFrameworkConfigurationServiceProxy.ProcessDiscoveryData(Guid discoverySourceId, IList`1 entityInstances, IDictionary`2 streams, ObjectChangelist`1 extensions)
   at Microsoft.EnterpriseManagement.ConnectorFramework.IncrementalDiscoveryData.CommitInternal(EnterpriseManagementGroup managementGroup, Guid discoverySourceId, Boolean useOptimisticConcurrency)
   at Microsoft.EnterpriseManagement.ConnectorFramework.IncrementalDiscoveryData.CommitForUserDiscoverySource(EnterpriseManagementGroup managementGroup, Boolean useOptimisticConcurrency)
   at Cireson.ServiceManager.ManagementService.ManagementService.<>c__DisplayClass8_0.<InvokeCommand>b__0(EnterpriseManagementGroup emg)
   at Cireson.ServiceManager.ManagementService.ManagementService.InvokeCommand[T](Func`2 func, Boolean invokeAsService)
   at System.Threading.Tasks.Task.Execute()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cireson.ServiceManager.ManagementService.ManagementService.<InvokeCommandAsync>d__7.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cireson.ServiceManager.Services.Projection.<UpdateNewProjectionAsync>d__13.MoveNext()

Another error we see is "Commit new threw an exception".

Anyone got any ideas? The environment has been very static over the last 6 months, so nothing has changed, and I'm not sure if this is a Microsoft SCSM or a Cireson problem at this stage. But given the requests are running in SCSM, I'm leaning towards a Cireson problem at the moment.


SCSM = 2012 R2 CU9 

Current Portal Version: 8.4.3.2012
Management Pack Version: 7.7.2012.185

Answers

  • Brian_Wiest Customer Super IT Monkey ✭✭✭✭✭
    net.tcp://imaserverandwotnot.domain.something.com:5724/DispatcherService
    That is your console connection. Basically the cache builder is reporting that its call to the management service is timing out. If this happens during the window when your connectors are running, it could very well be that the database is overloaded and cannot respond to the additional traffic.
    You should review a few things.
    Is the portal server also a management server, or is it a separate server where the cache builder connects to something other than localhost? If it is separate, review network bottlenecks between the two points.
    Review the Operations Manager event log on your primary management server. I'm betting you will find warnings for:
    Source:        SMCMDB Subscription Data Source Module
    Event ID:      33610
    This shows that workflows are responding slower than expected due to the load on, and response time of, the SQL Server.
    Those warnings during the overnight connector syncs are not uncommon, but seeing them throughout the day is a concern.
    You should also review the disk latency on your databases.
    I share a SQL query that can show you your latency (a sketch of that kind of check is included at the end of this thread). I forgot to note in that post that the query's metrics are relative to the server/instance's last reboot/restart, so if you have restarted the SQL service the metrics will reset and report from that moment forward.
    HTH
  • Tom_Hendricks Customer Super IT Monkey ✭✭✭✭✭
    Adding to this--your description sounds eerily familiar.  The history (ECL/RECL) grooming kicks off at a few different times throughout the night/morning (server time) and the one that has given me the most trouble by far kicks off at exactly 2:00 A.M. server time. 

    These jobs are undocumented and do not show up anywhere in an interface that I am aware of.  They also will not throw any errors that I have seen, other than errors generated by everything else that cannot get a DB connection while the system is locked up.  Even if you use different management servers for your portal servers (and you should!) your sites will grind to a halt at that time.

    The management pack that controls this behavior cannot be edited, but the workflows can be overridden--one of the benefits of sharing so much code with SCOM.  A simple way to know if this is your issue is to override the jobs to disable them.  If your problem goes away, you know that was it.  Then you can address it.  If not, I would look for any other jobs running around that time that you could re-schedule (too many at once?) or optimize.

    In my case, I worked with MS support and we did 2 things:
    • Created and imported an override MP to completely disable the grooming jobs that ran when the trouble was occurring
    • Extracted the SQL scripts that those jobs were executing and modified them to run smaller batches, for longer amounts of time (these can be run as a scheduled SQL Agent job or by other means; a simplified sketch of that batching pattern is included at the end of this thread)
    In my case, and possibly yours, the reason the jobs were having so much trouble a few years after our initial implementation is that they were timing out due to the amount of data being groomed (not actually that much, but enough to cause the problem), which made the backlog bigger for the next run instead of smaller, so it got worse over time.  Moving the work out of a workflow meant there was no timeout, so it could run as long as it needed to, and reducing the batch size lessened the effect on users while it runs (we are 24/7/365, so recurring jobs cannot bring the system down, ever).

    This is the only blog I have ever seen that gets into this, but it is a great start: https://www.concurrency.com/blog/w/service-manager-change-log-grooming-%E2%80%93-part-2-custo
  • Jason_Tayler Customer IT Monkey ✭
    Cheers guys, I will get on to Premier Support to get them to look at these jobs and see what can be done.
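
A note on Brian_Wiest's latency suggestion: his query itself is not pasted in this thread, but a minimal sketch of that kind of check, built on the documented sys.dm_io_virtual_file_stats DMV, looks something like the following. It is not necessarily his exact query; run it on the SQL instance hosting the ServiceManager database, and treat the column aliases and ordering as illustrative choices.

-- Rough per-file read/write latency, accumulated since the last SQL Server restart
-- (so the numbers reset whenever the instance is restarted, as noted above).
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    -- Average stall per I/O in milliseconds; guard against divide-by-zero.
    CASE WHEN vfs.num_of_reads  = 0 THEN 0
         ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_latency_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id     = vfs.file_id
ORDER BY avg_read_latency_ms DESC;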
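
On Tom_Hendricks' second bullet (running the extracted grooming SQL in smaller batches, for example as a SQL Agent job): the real scripts have to come out of the Microsoft workflows with MS support's help, so the sketch below only illustrates the batching pattern he describes. The dbo.EntityChangeLog target, the TimeAdded cutoff column, and the retention/batch values are placeholders for illustration, not the actual grooming logic; do not run this as-is against a production CMDB.

-- Illustrative batching pattern only: delete history in small chunks with a short
-- pause between batches so grooming never holds long locks or hits a workflow timeout.
-- The table name, cutoff column and retention value are placeholders, not Microsoft's
-- real grooming logic.
DECLARE @BatchSize INT      = 5000;
DECLARE @Cutoff    DATETIME = DATEADD(DAY, -365, GETUTCDATE());
DECLARE @Rows      INT      = 1;

WHILE @Rows > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.EntityChangeLog        -- placeholder target table
    WHERE TimeAdded < @Cutoff;      -- placeholder cutoff column

    SET @Rows = @@ROWCOUNT;

    -- Brief pause so other workloads (portal cache builder, connectors) can get in.
    WAITFOR DELAY '00:00:02';
END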
