SCSM Workflows just stopped. NO events being created :(

Pierre_Smit · February 2018

Hi All,

I have an issue where all workflows stopped being created 2 days ago, so no email notifications are going out and CRs are stuck at pending. Apparently no changes have been made on the SQL side or on the VM with the console on it.

Does anyone have a script to check the number of failed workflows?

thanks,

Nicholas_Velich · February 2018

Hi Pierre,

You can check out this post here on troubleshooting workflows: https://blogs.technet.microsoft.com/servicemanager/2013/01/14/troubleshooting-workflow-performance-and-delays/

Thanks,
Nick

Nicholas_Velich · February 2018

Also, you can try "flushing the health service state" by doing the following:

1. On your workflow server, stop the three SCSM services (OMSDK, OMCFG, and the Microsoft Monitoring Agent).
2. Go to C:\Program Files\Microsoft System Center\Service Manager and rename the "Health Service State" folder to "Health Service State - old" (or something like that).
3. Restart your workflow server, which will also restart those services.

This will pull down a fresh copy of the workflow state from the database, and can help if there is some local issue on the workflow server.

Pierre_Smit · February 2018

Thanks Nicholas. I assume after renaming the Health Service State folder a new one will be created?

Justin_Workman · February 2018

Pierre_Smit said:

Thanks Nicholas. I assume after renaming the Health Service State folder a new one will be created?

Yep. Once you restart the SCSM services, it will create a new Health Service State folder.

Adam_Dzyacky · February 2018

Pierre,

The link @Nicholas_Velich provides is the absolute must, go-to starting point for addressing this issue within SCSM. However, if this is your first foray into workflow troubleshooting, it can be a little overwhelming and confusing, and you may end up down a lot of different rabbit holes trying to address a root cause. On top of this, there is no silver bullet solution here given the variables that exist across SCSM environments (other System Center products being used/synced, the various connectors' schedule start/end times, how many connectors, Cireson's Asset Management workflows, custom management packs/workflows, SQL backend, disk subsystem, etc.).

Here are some things I would check as an initial diagnosis:

How many connectors do you have whose schedules overlap?
How long do those connectors run? Minutes? Hours?
Do any of the stock connectors overlap with core SCSM data warehouse processing? (12am, 2am)
Do any long running Cireson workflows overlap with core SCSM data warehouse processing? (12am, 2am)
How many other System Center products are being used (and in turn, synced) with SCSM? SCOM and SCCM generate a lot of really useful and valuable data, but the SCCM connector does have an issue when it comes to machines in Azure/Hyper-V, which in turn leads to some issues with Cireson Asset Management workflows.

I feel like there are two SQL queries that don't get enough attention in SCSM performance/workflow troubleshooting, and they are the following.

The first will tell you the noisiest objects in the ServiceManager DB or, to be more verbose, which objects are getting the most changes applied to them. In essence, is there a lot of unnecessary background noise in the SCSM database that competes with workflows for resources on SQL? That said, the observation to make from the query isn't necessarily "large numbers are bad" but rather: is there an insanely large gap between the first couple of items listed and the rest?

The second query will tell you the largest tables in the ServiceManager DB. This query isn't anything new because it's taken directly from Kevin Holman's "useful scom sql queries" for troubleshooting SCOM performance. The observation to make from this query is that the "EntityChangeLog" table should not be your largest table in ServiceManager. If it is, this again points to a lot of background noise in SCSM that could be competing with workflows for SQL resources.

--Loudest Objects in SCSM
 SELECT TOP 50 BME.FullName, COUNT(1)
 FROM EntityChangeLog AS ECL WITH(NOLOCK)
 JOIN BaseManagedEntity AS BME WITH(NOLOCK)
    ON ECL.EntityId = BME.BaseManagedEntityId
 WHERE RelatedEntityId IS NULL
 GROUP BY BME.FullName
 ORDER BY COUNT(1) DESC
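
To make the "gap" observation easier to spot, one option is to show each object's share of all logged changes next to its count. This is only a rough sketch built on the same tables, join, and filter as the query above; the aliases ChangeCount and PercentOfTotal are my own names for illustration, and it should be run against the ServiceManager database.

```sql
--Loudest objects with their share of all logged changes (sketch)
--Same tables and filter as the query above; the aliases are illustrative only.
SELECT TOP 50
    BME.FullName,
    COUNT(1) AS ChangeCount,
    --Share of all filtered rows across every object, not just the top 50
    CAST(100.0 * COUNT(1) / SUM(COUNT(1)) OVER () AS DECIMAL(5,2)) AS PercentOfTotal
FROM EntityChangeLog AS ECL WITH(NOLOCK)
JOIN BaseManagedEntity AS BME WITH(NOLOCK)
    ON ECL.EntityId = BME.BaseManagedEntityId
WHERE RelatedEntityId IS NULL
GROUP BY BME.FullName
ORDER BY ChangeCount DESC
```

If one or two objects account for most of the total while the rest sit at a fraction of a percent, that is the kind of gap worth investigating.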

and the second query:

--Largest Tables in SCSM
SELECT TOP 1000
    a2.name AS [tablename],
    (a1.reserved + ISNULL(a4.reserved,0)) * 8 AS 'reserved (KB)',
    a1.rows AS row_count,
    a1.data * 8 AS 'data (KB)',
    (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data
          THEN (a1.used + ISNULL(a4.used,0)) - a1.data ELSE 0 END) * 8 AS 'index size (KB)',
    (CASE WHEN (a1.reserved + ISNULL(a4.reserved,0)) > a1.used
          THEN (a1.reserved + ISNULL(a4.reserved,0)) - a1.used ELSE 0 END) * 8 AS 'unused (KB)',
    (ROW_NUMBER() OVER (ORDER BY (a1.reserved + ISNULL(a4.reserved,0)) DESC)) % 2 AS l1,
    a3.name AS [schemaname]
FROM (
    --Per-table row counts and page counts
    SELECT ps.object_id,
        SUM(CASE WHEN (ps.index_id < 2) THEN row_count ELSE 0 END) AS [rows],
        SUM(ps.reserved_page_count) AS reserved,
        SUM(CASE WHEN (ps.index_id < 2)
                 THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count)
                 ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count) END) AS data,
        SUM(ps.used_page_count) AS used
    FROM sys.dm_db_partition_stats ps
    GROUP BY ps.object_id
) AS a1
LEFT OUTER JOIN (
    --Space used by internal tables that belong to each parent table
    SELECT it.parent_id,
        SUM(ps.reserved_page_count) AS reserved,
        SUM(ps.used_page_count) AS used
    FROM sys.dm_db_partition_stats ps
    INNER JOIN sys.internal_tables it ON (it.object_id = ps.object_id)
    WHERE it.internal_type IN (202,204)
    GROUP BY it.parent_id
) AS a4 ON (a4.parent_id = a1.object_id)
INNER JOIN sys.all_objects a2 ON (a1.object_id = a2.object_id)
INNER JOIN sys.schemas a3 ON (a2.schema_id = a3.schema_id)
WHERE a2.type <> N'S' AND a2.type <> N'IT'
ORDER BY row_count DESC
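
To check the EntityChangeLog observation directly, a shorter query can rank user tables by reserved space and report where EntityChangeLog lands. This is a rough sketch that uses only the standard sys.dm_db_partition_stats and sys.tables views; the names TableSizes, ReservedKB, and SizeRank are made up for illustration, and it ignores the internal-table space that the full query above accounts for. Run it in the ServiceManager database.

```sql
--Where does EntityChangeLog rank by reserved space? (sketch)
WITH TableSizes AS (
    SELECT
        t.name AS TableName,
        SUM(ps.reserved_page_count) * 8 AS ReservedKB,   --pages are 8 KB each
        RANK() OVER (ORDER BY SUM(ps.reserved_page_count) DESC) AS SizeRank
    FROM sys.dm_db_partition_stats AS ps
    JOIN sys.tables AS t
        ON t.object_id = ps.object_id
    GROUP BY t.object_id, t.name
)
--Show the largest table plus EntityChangeLog for comparison
SELECT TableName, ReservedKB, SizeRank
FROM TableSizes
WHERE SizeRank = 1 OR TableName = N'EntityChangeLog'
ORDER BY SizeRank
```

If the only row returned is EntityChangeLog with a SizeRank of 1, it is the largest table by reserved space, which matches the "background noise" symptom described above.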

This particular discussion is one I feel like I could write a book on, so to spare your mouse wheel from so much scrolling, I'll wrap this up for now. Best of luck troubleshooting!

Pierre_Smit · February 2018

Thanks all for your quick responses. The suggestion of renaming the "Health Service State" folder and then restarting the server has resolved the issue!

Thanks again.

Brian_Wiest · February 2018

@Pierre_Smit Do you only have one server? It is best to clear the Health Service State on all management servers, including the portal server, at the same time.

CIRESON COMMUNITY WEB SITE