Informal Poll: Number of Times Workflows Crash

Adam_Dzyacky · March 2017

It recently struck me that while blogs, Microsoft, and even the community acknowledge that sometimes workflows crash - there really isn't any hard data around this. So as an incredibly informal poll...

How often have your workflows crashed in:

a week
a month
six months
a year

that require a flush of the health service?

Brian_Wiest · March 2017

We have to flush the health service almost weekly. The longest we have gone without a flush is 2 weeks.
I think it depends a lot on system design and the load on the workflows.

Brad_McKenna · April 2017

Curious what behavior others are seeing that may necessitate flushing of the health service.

Currently I/we have not been flushing the health service state, aside from 'patching'.

I am beginning to think I/we may need to do so more often as a possible means of remediating some errors we are seeing in the WebConsole logs.

Example issues:

* https://community.cireson.com/discussion/2262/user-creating-service-requests-w-no-activities

* 'WorkItem Save Call Failed' errors that we have failed to track down root-causes for

Adam_Dzyacky · May 2017

I've recently been performing a rather exhaustive investigation of the root cause of workflow failures across a few different environments and I've come back with some rather interesting results:

* The SQL MP from SCOM (though it could just as easily be any other MP with a Repeat Count property), when it triggers Incidents in Service Manager, updates the Repeat Count based on the type of alert. Depending on the nature/source of the alert, I managed to get SCOM to update the Incidents every 15 minutes. I could significantly exacerbate the issue if the Incident had an Assigned User. This leads to unbelievable bloat of the EntityChangeLog table (I managed to generate about 2 million rows of data doing this). Related to this, as it got worse I had daily grooming failures (event 10880). To be fair, this probably has more to do with Console performance and less to do with workflow failure.

* Active Directory Connectors run secondary schedules (i.e. group expansion) during business hours. You can't change this time anywhere in the console; it can only be done by exporting the unsealed "Service Manager Linking Framework Configuration" MP, editing the time, incrementing the version, and then reimporting the MP. The Rule you'll be looking for within the MP has a name starting with "ADGroupExpansionWorkflow" and contains a property called "SyncTime" that you can edit. If it isn't clear, it takes its value in 24-hour (military) time.

* I found a few custom connectors that were scheduled to run every hour. These only took seconds to run (this can be seen via the workflows -> status window), but they were always, always, always followed by a rash of 33610 warnings on the workflow server. To spare onlookers the search, this event is something to the effect of "The database subscription query is longer than expected. Check the database or simplify the database subscription criteria."
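For anyone who would rather script the SyncTime edit than hand-edit the exported MP, here is a rough Python sketch of the "edit the time, increment the version" steps. Treat all of the structure as assumptions to check against your own export: that the Rule's ID attribute is what starts with "ADGroupExpansionWorkflow", that SyncTime is an element somewhere under that Rule, that the version lives at Manifest/Identity/Version, and that the time is written as HH:MM. The function names `bump_version` and `set_sync_time` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Rule name prefix taken from the post; matching on the ID attribute is an assumption.
RULE_PREFIX = "ADGroupExpansionWorkflow"


def bump_version(version: str) -> str:
    """Increment the last dotted component, e.g. 7.5.0.0 becomes 7.5.0.1."""
    parts = version.strip().split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts)


def set_sync_time(mp_xml: str, new_time: str) -> str:
    """Return the MP XML with SyncTime changed and the MP version incremented."""
    # Assumed format: 24-hour HH:MM. Adjust if your export stores it differently.
    if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", new_time):
        raise ValueError(f"expected 24-hour HH:MM, got {new_time!r}")

    root = ET.fromstring(mp_xml)

    # Update every SyncTime element found under a matching Rule.
    changed = 0
    for rule in root.iter("Rule"):
        if rule.get("ID", "").startswith(RULE_PREFIX):
            for node in rule.iter("SyncTime"):
                node.text = new_time
                changed += 1
    if changed == 0:
        raise LookupError("no SyncTime element found under a matching Rule")

    # Assumed location of the MP version number.
    version = root.find("./Manifest/Identity/Version")
    if version is None or not version.text:
        raise LookupError("no Manifest/Identity/Version element found")
    version.text = bump_version(version.text)

    return ET.tostring(root, encoding="unicode")
```

In practice you would read the exported XML file into a string, pass it through `set_sync_time(xml_text, "02:00")`, write the result back out, and reimport it. Keep a copy of the original export in case the structure in your environment differs from what is assumed here.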

Adam_Dzyacky · May 2017

Here's another question - when workflows crash and assuming you run Travis' subscription query on the SCSM server, what is the leading workflow that is behind?

I'm curious if we all have the same workflow, similar workflows, or radically different.
