SCSM DB Meltdown

Rob_Simon · February 2019

Yesterday all was humming alone fine and then we received some reports of slowness for the Portal. This quickly turned out to be a complete work stoppage for all Portal users and even the Console.

Our DBA found something blocking and after killing the process many times, we finally got back to normal. Tried to find some sort if info via MS but no luck so since this Community is filled with very smart people

who here can decode the following:

Application "SC DAL—SDKInstanceProcessing" was trying to execute the following transaction that cause the blocking:

INSERT INTO #CurrentTxnRelatedEntityChangeLog

(

[EntityChangeLogId],

[EntityId],

[EntityTypeId],

[RelatedEntityId],

[EntityTransactionLogId],

[LastModified],

[RECLRowForInstance]

)

SELECT

( SELECT max(EntityChangeLogId)

FROM dbo.[EntityChangeLog]

WHERE EntityId = TME.[BaseManagedEntityId]

AND EntityTypeId = TME.[ManagedTypeId]

AND [EntityTransactionLogId] <= @TransactionId

AND [ChangeType] <= 2

),

TME.[BaseManagedEntityId],

TME.[ManagedTypeId],

NULL,

@TransactionId,

@LastModified,

1

FROM dbo.[TypedManagedEntity] TME

JOIN #CurrentTxnEntityChangeLog ECL

ON ECL.[EntityId] = TME.[BaseManagedEntityId]

JOIN #RelationshipTypeToLog RTL

ON RTL.[RelationshipTypeId] = ECL.[EntityTypeId]

WHERE TME.[IsDeleted] = 0

Brett_Moffett · February 2019

Hi @Rob_Simon

The entity change log is an interesting beast. It is the piece of SCSM that is responsible for the History of any work item. Every time a value is changed in a Work Item (Say the support group is changed from Service Desk to Network Support) the Entity Change log (ECL) records what the old value was, what the new value is, who changed it and when.

If there is a block in this then I'd suggest that 2 services tried to change the same value at the same time.
Now, normally this is not allowed and we end up with those annoying messages in the console that someone else has made changes to the record. This error message is there to prevent the issues you are seeing.

However, if you have any automation or scripts that are running, they may be setting the same value multiple times in the one command and therefore causing the block.

I've never seen a block in the ECL while using the product as intended, so I can only assume its an automation task that is causing it.

Hope this helps.

Brett

Rob_Simon · February 2019

Thanks @Brett_Moffett this does lead us to a possible cause. Will close in on the culprit!

Adam_Dzyacky · February 2019

The entity change log is an interesting beast. - @Brett_Moffett

EntityChangeLog table has got to be one of the most complex topics within SCSM because so much depends on it. Brett is spot on in that it's responsible for all History of Work Items. Every change is a row in this table which means depending on how much work is happening in the system you can see a lot of activity on this table.

Not to mention, it's not just the day to day use of SCSM that will contribute to growth in ECL but also your Connectors (AD, SCOM, SCCM, SCO, SCVMM, Exchange, and Cireson Asset Management). What's more, given how much relies on this table a poorly maintained ECL table will lead to performance woes across different (Microsoft and Cireson) aspects of SCSM. But if the idea of "maintaining" a single table within a SQL database for an application makes you say "That's ridiculous" then fret not as the good news is there is a host of things you can do. Here's some tips for ensuring an autonomous and well maintained ECL table.

1. Connector scheduling: Make sure connectors are spaced out and run during the quietest times of SCSM operations. This ensures the volume of data getting loaded isn't competing with Analysts for resources.
2. Connector optimization: One setting that can be overlooked is that ALL of your connectors if possible should have "Do not write null values for properties that are not set in..." checked off. Why all of them? Because if you're using the AD connector and SCOM CI connector - it's possible one is writing values while the other is nulling them out. This leads to noise in the DB and unnecessary growth of the ECL table. Regardless of the connectors you're using, make sure all these have the value checked.
3. Cireson Asset Connectors - These pull a lot of information out of SCCM and as such have the potential to hit their configured max run time. Strongly advise to run these after hours.
4. History retention: Administration -> Settings -> Data Retention Settings. This is a balancing act between what departments using SCSM reasonably expect to see when searching items in the console and what you can afford to purge and then ultimately search in the DW. That said, you want numbers to be as low is as acceptable for your deployment. The "History" section generally has a fairly direct correlation with ECL table size.
5. SCOM Alerts: If you've integrated SCOM Alerts to fire SCSM Incidents, some Alerts feature a Repeat count metric. Depending on how you've configured the Alert in SCOM, it's possible SCOM could be writing a lot of nonstop updates into SCSM to a single Work Item.
6. Custom Workflows: This is Brett's original point, it sounds like automation or some workflow maybe causing this. It's worth checking out any custom workflows in the admin pane, their last run times, and their status

Next, there are a few SQL queries you can use against the ServiceManager database for troubleshooting.

Loudest Objects in SCSM - This is great because you can possibly identify what is making the most noise in SCSM. To analyze this data keep in mind that large numbers aren't necessarily bad, so what you should be focused on is if you have a huge gap between objects. For example, if the first is 1000 but the second is 30...

<div>SELECT TOP 25 BME.FullName, COUNT(1)
FROM EntityChangeLog AS ECL WITH(NOLOCK)
JOIN BaseManagedEntity AS BME WITH(NOLOCK)
    ON ECL.EntityId = BME.BaseManagedEntityId
WHERE RelatedEntityId IS NULL
GROUP BY BME.FullName
ORDER BY COUNT(1) DESC
</div>

Largest Tables in SCSM - This is straight from Kevin Holman's Useful SCOM queries as SCSM/SCOM share a near identical backend. This will show you largest tables by row count, to say the least the goal here is that ECL needs to be as low in the list as possible. So if it's in your top 5 the avenues above are worth investigating. While the link takes you to the relevant query, it isn't ordering by anything so it's possible you could be misled by the results for your SCSM deployment when running. As such I made a very simple modification to Kevin's query to sort by Row Count, a rather important metric when trying to gauge your ECL health. Further down this page I've posted all of the queries including the modified one I'm referring to here.

These and a few other classic queries I also posted over here on building a SCSM Dashboard in the Portal so you don't have to run queries all the time to check in on this and instead just have a quick Web UI to check things out in.

Rob_Simon · February 2019

This is great! Really appreciate it.

Leigh_Kilday · February 2019

We had exactly the same behavior after upgrading to 8.12.3. The primary node in the SQL Server AG was at 95%+ CPU due to transactions with the server running the cache builder. Restarting the services/website on this server provided temporary reprieve, however it soon came back.

Our DBAs identified two issues.
1. The insert statement on the ServiceManagement.dbo.Access_CI$User_WorkItem table requires a lot of processing, and this was the primary cause of our issue.
2. The grid view query was causing excessive CPU load so we made some tweaks to improve performance.

During resolution of the incident, I rolled forward to portal 8.12.6 and it was resolved, however this has its own bugs that we have raised with Cireson.

We are in a unique state where we cannot archive our old WIs from the OLTP DB due to the inadequacies of the SCSM DW to maintain the record in its entirety (legislative requirement), its inability to be easily queried, and its vulnerability to corruption (Microsoft has often told us to rebuild the DW when we've experienced issues). This means the ServiceManagement.dbo.Access_CI$User_WorkItem table is in the high hundreds of millions of rows, resulting in the ScopedAccess process in the cache builder to take 1.5 hours on a full refresh.

Until we have developed a custom ETL system and DW, we cannot enable grooming, further perpetuating the issue.

Gerhard_Goossens · February 2019

@Adam_Dzyacky

Sorry to jump in here but the past month we have been experiencing the exact same issue.
Running the queries mentioned in the post I got the following.

Loudest Objects

Image: https://us.v-cdn.net/6026663/uploads/editor/ji/e7wp86zutmh6.png

Largest Table

Image: https://us.v-cdn.net/6026663/uploads/editor/kx/e0fn65tibgom.png

It seems like the ECL is the largest table so it should be a concern. Do you have any suggestions on what can be done?

I'm not sure how to interpret the loudest objects. maybe it just shows the Analysts that work the hardest?

Brian_Wiest · February 2019

This killed us as well for quite some time. Even opened MS cases to resolve. In the end it ended up being a simple check box.

The entity change log can be large but that is not the impact of the locked query/task. We found when the issue would happen if we did a exec sp_updatestats it would help clear the lock faster but didn't really explain the issue.

Trimming down the History value in settings would help trim down the entity change log size but the locking would still occur.

This locking would always occur at the same time of day. But if you go thru all your schedules, timer jobs, workflows, etc nothing would match the time frame.

So while Adam provide the overall troubleshooting he is missing one item to review. AD Connector

The AD connector everyone setups to run at a specific time of day to replicate AD into SCSM. What is not clearly documented is the meanings of all the options the connector has. The one being the checkbox causing all our locking issues for months.

"Automatically add user of AD Groups imported by this connector"

Image: https://us.v-cdn.net/6026663/uploads/editor/nk/k4re7njf269n.png

If I check off this option it places a heavy processing burden on SCSM. When it does it tells SCSM to query each group it imported from AD for all members and make sure those members are imported as SCSM users.

So when we first setup SCSM we checked it off thinking yes we want to make sure all group memberships are in SCSM.

So in a large environment were we have thousands of groups the SDK gets swamped querying each AD group for members and then checking for the members in SCSM. Along with in our environment we have groups larger then the 5k AD allows in a query response. This includes checking Domain Users)

Here is the kicker. If you have the AD connector set to ALL there is 0 need to have this check box enabled as the connector would have already ingested all the AD information.

The checkbox is intended for use cases where you select a subset of your AD structure (example just an OU containing just the SCSM groups) and want the system to make sure all the AD users are imported.

Now the icing on the cake. You have the AD connector setup to run at 2am off hours as to not impact your community. What does all this have to do with the locking in the middle of the day.

This workflow has its own set run time and this time is not configurable. It runs every 24 hours from the date/time of the AD connector create day.

So we created our AD connector during business hours when we built the farm at 2:50pm (of course our high point time frame) so every day during peak time the SQL locks were occurring and w could not match it to the workflows.

Just unchecking this option made all the difference.

HTH

Rob_Simon · February 2019

@Brian_Wiest Wow, just when I thought I'd heard of every weird thing MS designed in SCSM, this one sure take the cake.

Brian_Wiest · February 2019

Was it matching your issue?

Rob_Simon · February 2019

We have this option currently turned off for the 4 AD connectors we currently have (different users/groups within the tree structure)

Gerhard_Goossens · February 2019

All mine is also unchecked for "Automatically Add Users..." and ticked for "Do not write..."

Adam_Dzyacky · February 2019

Bah! I always forget that AD Group Expansion one and I forget because the checkbox Brian cites didn't make an appearance until SCSM 2016.

In SCSM 2012 the box wasn't there and as such the only thing you could do is export the "Service Manager Linking Framework Configuration" unsealed management pack. Then once you open it you can do a CTRL+F for "ADGroupExpansionWorkflow" which leads to you something that looks like the following:

You'd have an Expansion rule for each AD Connector you have and then you could control not only the time it runs but also its Enabled property.

Regarding @Gerhard_Goossens's question on "How to interpret all the users, are they working the hardest?" Well to be honest, you aren't that far off. The question here is really what are those users doing that's causing a write to the ECL that causes them to be there that often. For example - do you have users constantly re-assigning Work Items? Do you have a template that X default users are set on and that template is being called frequently? I know when troubleshooting high counts on user objects I use the following PowerShell:

$fullname = ""
<div><div>$mgmtServer = ""

<div>$userClass = get-scsmclass "system.domain.user$" -computername $mgmtServer
<div><div>$user = get-scsmobject -class $userclass -Filter "DisplayName -eq '$fullname'" -ComputerName $mgmtServer
Get-SCSMObjectHistory -Object $user -ComputerName $mgmtServer
</div></div></div></div><div><br></div></div>

So in this case, you're looking up the History of a single user in SCSM (e.g. go after the "loudest" one per your above results). The results will absolutely demand their own specific investigation that is going to be unique to your environment. But it's a way to possibly catch if there is some Workflow that is performing some change frequently on users.

Brian_Wiest · February 2019

Its most likely matching @Gerhard_Goossens issue seeing the screenshots

@Rob_Simon attached are some SQL queries to help review.

WorkflowstatusSQL.txt

This will provide you a run down of your workflows status from the SQL DB.

You want to see Minutes behind at 1/2 as all default workflows process every one minute. EventCount is how many need processing.

This is a per environment judgement call but helps to see if there is a workflow locking up the system doing an unexpected task or taking to long to process the changes.

(Note AND S.IsPeriodicQueryEvent = 0)

I turn it off to always view as it shows my periodic subscriptions. You want as few as possible of these, as they take a larger hit on the system.

ActiveTransactions.txt

This shows active transaction on the database that may not be something triggered by workflow (example scheduled powershell task)

Again this is a per environment judgement call

DBDisk.txt

Provides a table of how SQL sees the disk performance. The lower the better.

Note these stats clear anytime the instance is restarted or failed over.

Brian_Wiest · February 2019

Weird system is blocking all attachments for files

Now its now allowing be to paste into post. Website must really not like the queries

Adam_Dzyacky · February 2019

@Brian_Wiest that's why I couldn't paste the Large Table Query directly in here yesterday. Vanilla doesn't like some kind of aspect of SQL queries.

Gerhard_Goossens · February 2019

@Brian_Wiest @Adam_Dzyacky

Can you please send the queries via mail if you don't mind.
Gerhard.Goossens@nwu.ac.za

Jonathan_Boles · February 2019

@Brian_Wiest and @Adam_Dzyacky

Same here, would you mind email the queries? We've been having a devil of a time with some issues surrounding the ECL.

jonathan.boles@stantec.com

Adam_Dzyacky · February 2019

Queries work again @Brian_Wiest !
I ended up having to create separate responses for each query, otherwise it threw the error again. Code block formatting in the community seems to have some quirky issues. Never the less, it should run just fine.

From Kevin Holman's "Useful SCOM Queries" - Largest Tables. Modified to Sort By Row Count

<div><div>--This is the SAME query used against SCOM and SCOM DW, but it also works against SCSM. This lists the top 1000</div><div>-- largest tables in the ServiceManager DB sorted by row count.</div><br><div>SELECT TOP 1000</div><div>    a2.name AS [tablename], (a1.reserved + ISNULL(a4.reserved,0))* 8 AS 'reserved (KB)',</div><div>    a1.rows as row_count, a1.data * 8 AS 'data (KB)',</div><div>    (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data THEN (a1.used + ISNULL(a4.used,0)) - a1.data ELSE 0 END) * 8 AS 'index size (KB)',</div><div>    (CASE WHEN (a1.reserved + ISNULL(a4.reserved,0)) > a1.used THEN (a1.reserved + ISNULL(a4.reserved,0)) - a1.used ELSE 0 END) * 8 AS 'unused (KB)',</div><div>    (row_number() over(order by (a1.reserved + ISNULL(a4.reserved,0)) desc))%2 as l1,</div><div>    a3.name AS [schemaname]</div><div>FROM (SELECT ps.object_id, SUM (CASE WHEN (ps.index_id < 2) THEN row_count ELSE 0 END) AS [rows],</div><div>        SUM (ps.reserved_page_count) AS reserved,</div><div>        SUM (CASE WHEN (ps.index_id < 2) THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count) </div><div>ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count) END ) AS data,</div><div>        SUM (ps.used_page_count) AS used</div><div>    FROM sys.dm_db_partition_stats ps</div><div>    GROUP BY ps.object_id) AS a1</div><div>    LEFT OUTER JOIN (SELECT it.parent_id,</div><div>        SUM(ps.reserved_page_count) AS reserved,</div><div>        SUM(ps.used_page_count) AS used</div><div>    FROM sys.dm_db_partition_stats ps</div><div>        INNER JOIN sys.internal_tables it ON (it.object_id = ps.object_id)</div><div>    WHERE it.internal_type IN (202,204)</div><div>    GROUP BY it.parent_id) AS a4 ON (a4.parent_id = a1.object_id)</div><div>    INNER JOIN sys.all_objects a2 ON ( a1.object_id = a2.object_id )</div><div>    INNER JOIN sys.schemas a3 ON (a2.schema_id = a3.schema_id)</div><div>WHERE a2.type <> N'S' and a2.type <> N'IT'</div><br><div>--order by a1.data desc</div><div>order by row_count desc</div></div>

Adam_Dzyacky · February 2019

Failed Grooming Jobs - Shows when grooming jobs in core SCSM are potentially having issues grooming objects out. It's important to note here that this has nothing to do with the DW, this is core SCSM just trying to accomplish its grooming/purge jobs at 12am and 2am.

  -- Run against the ServiceManager database. It will return
 -- a list of all internal SCSM grooming jobs that have failed. While jobs may fail from time to time,
 -- as long as they show a successful execution after a failure there is little cause for concern.

SELECT
   InternalJobHistoryId,
   Command,
   TimeStarted,
   timefinished,
   statuscode
FROM InternalJobHistory WITH(NOLOCK)
WHERE
   TimeFinished IS NULL AND
   StatusCode <> 1
order by timestarted desc

--alternatively, you could run the same query without the WHERE clause to return all jobs

/* SELECT
   InternalJobHistoryId,
   Command,
   TimeStarted,
   timefinished,
   statuscode
FROM InternalJobHistory WITH(NOLOCK)
order by timestarted desc
*/

Adam_Dzyacky · February 2019

DCM Volume - If you're using SCCM and DCM, it's possible you could be generating a lot of noise. This query helps show those items off

-- Run against the ServiceManager database. It will return the volume of DCM data within SCSM.
-- But if you're not using SCCM and DCM then you should expect almost no results for this query

SELECT Count (1) AS DeviceHasWSUSUpdates FROM [LFXSTG].[v_Cached_CMv5_DeviceHasWSUSUpdates]
SELECT Count (1) As DCMNonCompliantConfigItems FROM [LFXSTG].[v_Cached_CMv5_DCMNonCompliantConfigItems]
SELECT Count (1) AS DCMConfigItemRelations FROM [LFXSTG].[v_Cached_CMv5_DCMConfigItemRelations]
SELECT Count (1) AS DCMConfigItems FROM [LFXSTG].[v_Cached_CMv5_DCMConfigItems]
SELECT Count (1) AS AdminCategories FROM [LFXSTG].[v_Cached_CMv5_AdminCategories]

Adam_Dzyacky · February 2019

Loudest Objects in SCSM - This is great because you can possibly identify what is making the most noise in SCSM. To analyze this data keep in mind that large numbers aren't necessarily bad, so what you should be focused on is if you have a huge gap between objects. For example, if the first is 1000 but the second is 30...Ultimately you shouldn't expect to see Work Items/Users at the top, if you do it raises the question of what is hammering changes against that specific Work Item/Users.

SELECT TOP 25 BME.FullName, COUNT(1)
FROM EntityChangeLog AS ECL WITH(NOLOCK)
JOIN BaseManagedEntity AS BME WITH(NOLOCK)
    ON ECL.EntityId = BME.BaseManagedEntityId
WHERE RelatedEntityId IS NULL
GROUP BY BME.FullName
ORDER BY COUNT(1) DESC

Brian_Wiest · February 2019

Attachments were fixed by the vendor. Attached are the queries I was referring too.

(well was told fixed. still not working.)

Rob_Simon · February 2019

Well we had a talk with MS and will be performing some DB performance improvements but we found the root cause as to why things were bombing.

MS noticed there was a large amount of work items which were going to be groomed all at once in the near future. This was likely caused by implementing the Cireson Advanced Request Offering tool, which moved the last modified date on all work items to the day which we imported the Management Pack. Without being addressed, this would likely take the SCSM application down as it would be attempting to groom a much larger amount of data than usual. After discussing multiple options to address this, MS agreed that the best option would be a controlled one-time grooming of work items to move past those large groups of work items.

Rob_Simon · February 2019

And as for performance, MS recommended the following:

1. Change compatibility level to 110. We have seen this improve performance of SCSM running on SQL 2014.

2. Change DB owner to service account instead of SQL SA account. Currently, the local SA SQL account owns the SCSM DB. Since you have the database engine in Windows Only authentication mode, this can cause problems with certain Authorization Manager operations.

3. Max degree of parallelism needs to be to 4 (half procs up to 8).

4. Cost threshold for par needs to be set to 25 (A number that I have arrived at over the years to prevent parallelism in fragmented query scenarios and such).

5. Highly recommend setting memory limitations on both SSAS and the database engine (these services will compete for resources).

Adam_Dzyacky · February 2019

Rob_Simon said:

And as for performance, MS recommended the following:
Max degree of parallelism needs to be to 4 (half procs up to 8).

Microsoft Support, SCSM folks, SCOM people, SQL DBAs - everyone seems to have a different opinion and I'm not sure I've ever been convinced there is a consensus. In fact, the only thing I'd say seems to be certain amongst parties is a "try it and see what happens" mentality. I'm not even being critical of that attitude but I don't know if I'm convinced there is a definitive answer on this.

Tom_Hendricks · February 2019

Rob_Simon said:

...

MS noticed there was a large amount of work items which were going to be groomed all at once in the near future. This was likely caused by implementing the Cireson Advanced Request Offering tool, which moved the last modified date on all work items to the day which we imported the Management Pack. Without being addressed, this would likely take the SCSM application down as it would be attempting to groom a much larger amount of data than usual. After discussing multiple options to address this, MS agreed that the best option would be a controlled one-time grooming of work items to move past those large groups of work items.

We have had a very similar experience, but it required more than just a one-time intervention. Yes, this will help you, but you could end up having the same problem next time you upgrade that management pack (or others that update the last modified on so many objects). Microsoft had provided me with a more long-term fix, which has made it into the docs site, albeit extremely briefly and without much helpful detail provided.

In our case, we also had a problem where the grooming job would consume all available resources and prevent even itself from running, which halted the whole system for our offices on the other side of the world (during their afternoon) and also compounded the problem because the job could never clear the backlog of items in need of grooming. Because of the 24x7x365 requirement for operation, we had to have the job run in smaller batches over longer periods of time so that the system could continue to run. What this also did was even things out to where the items could finally get groomed properly and there was a performance lift.

You cannot configure the grooming job from any known interface (including PS or the C# SDK that I know of) but the SQL queries used and the job schedules are contained in a system management pack that can be overridden (like a SCOM MP). We run the queries with a SQL agent job that can take all the time it wants to finish, and that has made the difference. Running low (items per batch) and slow has been the key.

As for the compatibility level, I am told that the reason why 110 works better is because of the "legacy cardinality estimation" that it uses. SCSM does not perform as well with the newer method for some reason. This can also be set with a trace flag on the server, or I believe it can be chosen directly from the DB options in 2016/2017.

Rob_Simon · February 2019

Tom_Hendricks said:

You cannot configure the grooming job from any known interface (including PS or the C# SDK that I know of) but the SQL queries used and the job schedules are contained in a system management pack that can be overridden (like a SCOM MP). We run the queries with a SQL agent job that can take all the time it wants to finish, and that has made the difference. Running low (items per batch) and slow has been the key.

That's exactly what our DBAs are planning!

SCSM DB Meltdown

Answers

CIRESON COMMUNITY WEB SITE