SCSM DB Meltdown
Our DBA found something blocking, and after killing the process many times we finally got back to normal. I tried to find some sort of info via MS but had no luck, so since this Community is filled with very smart people: who here can decode the following?
Application "SC DAL—SDKInstanceProcessing" was trying to execute the following transaction that cause the blocking:
INSERT INTO #CurrentTxnRelatedEntityChangeLog
(
[EntityChangeLogId],
[EntityId],
[EntityTypeId],
[RelatedEntityId],
[EntityTransactionLogId],
[LastModified],
[RECLRowForInstance]
)
SELECT
( SELECT max(EntityChangeLogId)
FROM dbo.[EntityChangeLog]
WHERE EntityId = TME.[BaseManagedEntityId]
AND EntityTypeId = TME.[ManagedTypeId]
AND [EntityTransactionLogId] <= @TransactionId
AND [ChangeType] <= 2
),
TME.[BaseManagedEntityId],
TME.[ManagedTypeId],
NULL,
1
FROM dbo.[TypedManagedEntity] TME
JOIN #CurrentTxnEntityChangeLog ECL
ON ECL.[EntityId] = TME.[BaseManagedEntityId]
JOIN #RelationshipTypeToLog RTL
ON RTL.[RelationshipTypeId] = ECL.[EntityTypeId]
WHERE TME.[IsDeleted] = 0
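For anyone else chasing something similar, here is a rough sketch of how to see who is blocking whom while it's happening (this is the standard sys.dm_exec_requests approach, nothing SCSM-specific):

-- Sketch: show blocked sessions, what is blocking them, and the SQL they are running
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS running_sql
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;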
Answers
The entity change log is an interesting beast. It is the piece of SCSM that is responsible for the History of any work item. Every time a value is changed in a Work Item (say the support group is changed from Service Desk to Network Support), the Entity Change Log (ECL) records what the old value was, what the new value is, who changed it, and when.
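If you want to see those history rows directly, a minimal sketch against the ServiceManager database looks like this (replace the placeholder GUID with the work item's BaseManagedEntityId):

-- Sketch: raw history rows for one object, newest first
SELECT TOP (50) EntityChangeLogId, EntityTypeId, ChangeType, LastModified
FROM dbo.EntityChangeLog
WHERE EntityId = '00000000-0000-0000-0000-000000000000' -- the work item's BaseManagedEntityId
ORDER BY EntityChangeLogId DESC;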
If there is a block in this, then I'd suggest that two services tried to change the same value at the same time.
Now, normally this is not allowed and we end up with those annoying messages in the console that someone else has made changes to the record. This error message is there to prevent the issues you are seeing.
However, if you have any automation or scripts that are running, they may be setting the same value multiple times in the one command and therefore causing the block.
I've never seen a block in the ECL while using the product as intended, so I can only assume it's an automation task that is causing it.
Hope this helps.
Brett
The EntityChangeLog table has got to be one of the most complex topics within SCSM because so much depends on it. Brett is spot on in that it's responsible for all History of Work Items. Every change is a row in this table, which means that, depending on how much work is happening in the system, you can see a lot of activity on it.
Not to mention, it's not just the day-to-day use of SCSM that will contribute to growth in the ECL, but also your Connectors (AD, SCOM, SCCM, SCO, SCVMM, Exchange, and Cireson Asset Management). What's more, given how much relies on this table, a poorly maintained ECL table will lead to performance woes across different (Microsoft and Cireson) aspects of SCSM. But if the idea of "maintaining" a single table within a SQL database for an application makes you say "That's ridiculous," then fret not: the good news is there is a host of things you can do. Here are some tips for ensuring an autonomous and well-maintained ECL table.
1. Connector scheduling: Make sure connectors are spaced out and run during the quietest times of SCSM operations (the hour-by-hour query after this list can help you find those windows). This ensures the volume of data getting loaded isn't competing with Analysts for resources.
2. Connector optimization: One setting that can be overlooked is that ALL of your connectors, if possible, should have "Do not write null values for properties that are not set in..." checked (enabled). Why all of them? Because if you're using the AD connector and the SCOM CI connector, it's possible one is writing values while the other is nulling them out. This leads to noise in the DB and unnecessary growth of the ECL table. Regardless of the connectors you're using, make sure they all have this box checked.
3. Cireson Asset Connectors: These pull a lot of information out of SCCM and as such have the potential to hit their configured max run time. I strongly advise running these after hours.
4. History retention: Administration -> Settings -> Data Retention Settings. This is a balancing act between what departments using SCSM reasonably expect to see when searching items in the console and what you can afford to purge and then ultimately search in the DW. That said, you want these numbers to be as low as is acceptable for your deployment. The "History" section generally has a fairly direct correlation with ECL table size.
5. SCOM Alerts: If you've integrated SCOM Alerts to fire SCSM Incidents, some Alerts feature a Repeat Count metric. Depending on how you've configured the Alert in SCOM, it's possible SCOM could be writing a lot of nonstop updates to a single Work Item in SCSM.
6. Custom Workflows: This is Brett's original point; it sounds like automation or some workflow may be causing this. It's worth checking out any custom workflows in the Administration pane, their last run times, and their status.
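Related to points 1 and 5, here is a quick sketch for gauging when and how fast the ECL is being written to (timestamps in the table are typically UTC, so adjust for your time zone):

-- Sketch: ECL write volume by hour of day over the last 7 days (helps find quiet windows for connectors)
SELECT DATEPART(HOUR, LastModified) AS HourOfDayUtc,
       COUNT(*) AS EclRows
FROM dbo.EntityChangeLog
WHERE LastModified >= DATEADD(DAY, -7, GETUTCDATE())
GROUP BY DATEPART(HOUR, LastModified)
ORDER BY HourOfDayUtc;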
Next, there are a few SQL queries you can use against the ServiceManager database for troubleshooting.
Loudest Objects in SCSM - This one is great because you can possibly identify what is making the most noise in SCSM. When analyzing this data, keep in mind that large numbers aren't necessarily bad; what you should focus on is whether you have a huge gap between objects. For example, if the first is 1000 but the second is 30...
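Roughly speaking, this query joins the ECL to BaseManagedEntity and counts rows per object. A sketch of that shape (an approximation, not necessarily the exact query) looks like this:

-- Sketch: which objects have the most EntityChangeLog rows ("loudest" objects)
SELECT TOP (50)
       bme.DisplayName,
       bme.FullName,
       COUNT(*) AS EclRowCount
FROM dbo.EntityChangeLog ecl
JOIN dbo.BaseManagedEntity bme
    ON bme.BaseManagedEntityId = ecl.EntityId
GROUP BY bme.DisplayName, bme.FullName
ORDER BY EclRowCount DESC;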
I also posted these and a few other classic queries over here, in a post on building an SCSM Dashboard in the Portal, so you don't have to run queries all the time to check in on this and can instead use a quick web UI to check things out.
Our DBAs identified two issues.
1. The insert statement on the ServiceManagement.dbo.Access_CI$User_WorkItem table requires a lot of processing, and this was the primary cause of our issue.
2. The grid view query was causing excessive CPU load so we made some tweaks to improve performance.
During resolution of the incident, I rolled forward to portal 8.12.6 and the issue was resolved; however, this version has its own bugs, which we have raised with Cireson.
We are in a unique state where we cannot archive our old WIs from the OLTP DB due to the inadequacy of the SCSM DW at maintaining the record in its entirety (a legislative requirement), its inability to be easily queried, and its vulnerability to corruption (Microsoft has often told us to rebuild the DW when we've experienced issues). This means the ServiceManagement.dbo.Access_CI$User_WorkItem table is in the high hundreds of millions of rows, resulting in the ScopedAccess process in the cache builder taking 1.5 hours on a full refresh.
Until we have developed a custom ETL system and DW, we cannot enable grooming, further perpetuating the issue.
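For what it's worth, a quick way to get a row count on a table that size without scanning it is to read the partition metadata (a sketch; ServiceManagement is the Cireson portal database):

-- Sketch: row count from partition metadata, no table scan
SELECT t.name AS TableName, SUM(p.rows) AS TotalRows
FROM ServiceManagement.sys.tables t
JOIN ServiceManagement.sys.partitions p
    ON p.object_id = t.object_id
   AND p.index_id IN (0, 1) -- heap or clustered index only
WHERE t.name = 'Access_CI$User_WorkItem'
GROUP BY t.name;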
Sorry to jump in here, but for the past month we have been experiencing the exact same issue.
Running the queries mentioned in the post, I got the following.
Loudest Objects
Largest Table
It seems like the ECL is the largest table, so it should be a concern. Do you have any suggestions on what can be done?
I'm not sure how to interpret the loudest objects. Maybe it just shows the Analysts that work the hardest?
In SCSM 2012 the box wasn't there, and as such the only thing you could do was export the "Service Manager Linking Framework Configuration" unsealed management pack. Then, once you open it, you can do a CTRL+F for "ADGroupExpansionWorkflow", which leads you to something that looks like the following:
You'd have an Expansion rule for each AD Connector you have, and you could then control not only the time it runs but also its Enabled property.
Regarding @Gerhard_Goossens's question on "How to interpret all the users, are they working the hardest?": well, to be honest, you aren't that far off. The real question is what those users are doing that causes a write to the ECL often enough for them to show up there. For example, do you have users constantly re-assigning Work Items? Do you have a template that X default users are set on, and that template is being called frequently? When troubleshooting high counts on user objects, I use the following PowerShell:
Could you please send the queries via email if you don't mind?
Gerhard.Goossens@nwu.ac.za
Same here, would you mind emailing the queries? We've been having a devil of a time with some issues surrounding the ECL.
jonathan.boles@stantec.com
I ended up having to create separate responses for each query; otherwise it threw the error again. Code block formatting in the community seems to have some quirky issues. Nevertheless, they should run just fine.
From Kevin Holman's "Useful SCOM Queries" - Largest Tables, modified to sort by row count:
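In case the code block gets mangled again, a generic sketch that does the same thing (not Kevin Holman's exact query) looks like this:

-- Sketch: largest tables by row count and size (an approximation of the query referenced above)
SELECT t.name AS TableName,
       SUM(CASE WHEN p.index_id IN (0, 1) THEN p.rows ELSE 0 END) AS TotalRows,
       SUM(a.total_pages) * 8 / 1024 AS TotalSpaceMB
FROM sys.tables t
JOIN sys.partitions p
    ON p.object_id = t.object_id
JOIN sys.allocation_units a
    ON a.container_id = p.partition_id
GROUP BY t.name
ORDER BY TotalRows DESC;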
MS noticed there was a large number of work items which were going to be groomed all at once in the near future. This was likely caused by implementing the Cireson Advanced Request Offering tool, which moved the last modified date on all work items to the day we imported the Management Pack. Without being addressed, this would likely take the SCSM application down, as it would be attempting to groom a much larger amount of data than usual. After discussing multiple options to address this, MS agreed that the best option would be a controlled, one-time grooming of work items to move past those large groups of work items.
1. Change compatibility level to 110. We have seen this improve performance of SCSM running on SQL 2014 (see the T-SQL sketch after this list).
2. Change DB owner to service account instead of SQL SA account. Currently, the local SA SQL account owns the SCSM DB. Since you have the database engine in Windows Only authentication mode, this can cause problems with certain Authorization Manager operations.
3. Max degree of parallelism needs to be set to 4 (half the processor count, up to 8).
4. Cost threshold for parallelism needs to be set to 25 (a number that I have arrived at over the years to prevent parallelism in fragmented query scenarios and the like).
5. Highly recommend setting memory limitations on both SSAS and the database engine (these services will compete for resources).
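For reference, the SQL-side pieces of items 1, 3, and 4 as T-SQL (a sketch; verify the values against your own hardware before applying):

-- Sketch of the SQL-side settings above; adjust values for your environment
ALTER DATABASE [ServiceManager] SET COMPATIBILITY_LEVEL = 110;   -- item 1

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;                -- item 3: half the cores, capped at 8
EXEC sp_configure 'cost threshold for parallelism', 25;          -- item 4
RECONFIGURE;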
In our case, we also had a problem where the grooming job would consume all available resources and prevent even itself from running, which halted the whole system for our offices on the other side of the world (during their afternoon) and also compounded the problem because the job could never clear the backlog of items in need of grooming. Because of the 24x7x365 requirement for operation, we had to have the job run in smaller batches over longer periods of time so that the system could continue to run. This also evened things out to the point where items could finally get groomed properly, and there was a performance lift.
You cannot configure the grooming job from any known interface (including PS or the C# SDK that I know of) but the SQL queries used and the job schedules are contained in a system management pack that can be overridden (like a SCOM MP). We run the queries with a SQL agent job that can take all the time it wants to finish, and that has made the difference. Running low (items per batch) and slow has been the key.
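If you want to confirm the batches are completing, the internal job history in the ServiceManager database is worth watching. A sketch (I believe dbo.InternalJobHistory records each grooming/internal job run, though column names can vary slightly between releases):

-- Sketch: most recent internal job runs (grooming included), newest first
SELECT TOP (100) *
FROM dbo.InternalJobHistory
ORDER BY InternalJobHistoryId DESC;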
As for the compatibility level, I am told that the reason 110 works better is the "legacy cardinality estimation" it uses; SCSM does not perform as well with the newer method for some reason. This can also be set with a trace flag on the server, or I believe it can be chosen directly from the DB options in 2016/2017.
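For example, if you would rather not drop the compatibility level, the same behavior can be forced like this (SQL Server 2016 and later for the database-scoped option, or instance-wide via the legacy CE trace flag):

-- Run in the context of the ServiceManager database: keep the newer compatibility level but force the legacy cardinality estimator
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Or instance-wide with the legacy CE trace flag
DBCC TRACEON (9481, -1);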
That's exactly what our DBAs are planning!