SCSM DB Meltdown

Rob_SimonRob_Simon Customer IT Monkey ✭
Yesterday all was humming alone fine and then we received some reports of slowness for the Portal. This quickly turned out to be a complete work stoppage for all Portal users and even the Console.

Our DBA found something blocking and after killing the process many times, we finally got back to normal. Tried to find some sort if info via MS but no luck so since this Community is filled with very smart people :smile: who here can decode the following:

Application "SC DAL—SDKInstanceProcessing" was trying to execute the following transaction that cause the blocking:

INSERT INTO #CurrentTxnRelatedEntityChangeLog

    (

        [EntityChangeLogId],

        [EntityId],

        [EntityTypeId],

        [RelatedEntityId],

        [EntityTransactionLogId],

        [LastModified],

        [RECLRowForInstance]

    )

    SELECT

        ( SELECT max(EntityChangeLogId)

          FROM dbo.[EntityChangeLog]

          WHERE EntityId = TME.[BaseManagedEntityId]

          AND EntityTypeId = TME.[ManagedTypeId]

          AND [EntityTransactionLogId] <= @TransactionId

          AND [ChangeType] <= 2

        ),

        TME.[BaseManagedEntityId],

        TME.[ManagedTypeId],

        NULL,

        @TransactionId,

        @LastModified,

        1

    FROM dbo.[TypedManagedEntity] TME

    JOIN #CurrentTxnEntityChangeLog ECL

        ON ECL.[EntityId] = TME.[BaseManagedEntityId]

    JOIN #RelationshipTypeToLog RTL

        ON RTL.[RelationshipTypeId] = ECL.[EntityTypeId]

    WHERE TME.[IsDeleted] = 0

Answers

  • Brett_MoffettBrett_Moffett Cireson PACE Super IT Monkey ✭✭✭✭✭
    Hi @Rob_Simon

    The entity change log is an interesting beast. It is the piece of SCSM that is responsible for the History of any work item. Every time a value is changed in a Work Item (Say the support group is changed from Service Desk to Network Support) the Entity Change log (ECL) records what the old value was, what the new value is, who changed it and when.

    If there is a block in this then I'd suggest that 2 services tried to change the same value at the same time.
    Now, normally this is not allowed and we end up with those annoying messages in the console that someone else has made changes to the record. This error message is there to prevent the issues you are seeing.

    However, if you have any automation or scripts that are running, they may be setting the same value multiple times in the one command and therefore causing the block.

    I've never seen a block in the ECL while using the product as intended, so I can only assume its an automation task that is causing it.

    Hope this helps.

    Brett
  • Rob_SimonRob_Simon Customer IT Monkey ✭
    Thanks @Brett_Moffett this does lead us to a possible cause. Will close in on the culprit!
  • Rob_SimonRob_Simon Customer IT Monkey ✭
    This is great! Really appreciate it.
  • Leigh_KildayLeigh_Kilday Member Ninja IT Monkey ✭✭✭✭
    We had exactly the same behavior after upgrading to 8.12.3. The primary node in the SQL Server AG was at 95%+ CPU due to transactions with the server running the cache builder. Restarting the services/website on this server provided temporary reprieve, however it soon came back.

    Our DBAs identified two issues.
    1. The insert statement on the ServiceManagement.dbo.Access_CI$User_WorkItem table requires a lot of processing, and this was the primary cause of our issue.
    2. The grid view query was causing excessive CPU load so we made some tweaks to improve performance.

    During resolution of the incident, I rolled forward to portal 8.12.6 and it was resolved, however this has its own bugs that we have raised with Cireson.

    We are in a unique state where we cannot archive our old WIs from the OLTP DB due to the inadequacies of the SCSM DW to maintain the record in its entirety (legislative requirement), its inability to be easily queried, and its vulnerability to corruption (Microsoft has often told us to rebuild the DW when we've experienced issues). This means the ServiceManagement.dbo.Access_CI$User_WorkItem table is in the high hundreds of millions of rows, resulting in the ScopedAccess process in the cache builder to take 1.5 hours on a full refresh.

    Until we have developed a custom ETL system and DW, we cannot enable grooming, further perpetuating the issue.
  • Gerhard_GoossensGerhard_Goossens Customer Adept IT Monkey ✭✭
    @Adam_Dzyacky

    Sorry to jump in here but the past month we have been experiencing the exact same issue.
    Running the queries mentioned in the post I got the following.

    Loudest Objects 


    Largest Table


    It seems like the ECL is the largest table so it should be a concern. Do you have any suggestions on what can be done?

    I'm not sure how to interpret the loudest objects. maybe it just shows the Analysts that work the hardest?
  • Brian_WiestBrian_Wiest Customer Super IT Monkey ✭✭✭✭✭
    This killed us as well for quite some time. Even opened MS cases to resolve. In the end it ended up being a simple check box.

    The entity change log can be large but that is not the impact of the locked query/task. We found when the issue would happen if we did a exec sp_updatestats it would help clear the lock faster but didn't really explain the issue.
    Trimming down the History value in settings would help trim down the entity change log size but the locking would still occur.
    This locking would always occur at the same time of day. But if you go thru all your schedules, timer jobs, workflows, etc nothing would match the time frame.
    So while Adam provide the overall troubleshooting he is missing one item to review. AD Connector
    The AD connector everyone setups to run at a specific time of day to replicate AD into SCSM. What is not clearly documented is the meanings of all the options the connector has. The one being the checkbox causing all our locking issues for months.
    "Automatically add user of AD Groups imported by this connector"


    If I check off this option it places a heavy processing burden on SCSM. When it does it tells SCSM to query each group it imported from AD for all members and make sure those members are imported as SCSM users.
    So when we first setup SCSM we checked it off thinking yes we want to make sure all group memberships are in SCSM.
    So in a large environment were we have thousands of groups the SDK gets swamped querying each AD group for members and then checking for the members in SCSM. Along with in our environment we have groups larger then the 5k AD allows in a query response. This includes checking Domain Users)
    Here is the kicker. If you have the AD connector set to ALL there is 0 need to have this check box enabled as the connector would have already ingested all the AD information.
    The checkbox is intended for use cases where you select a subset of your AD structure (example just an OU containing just the SCSM groups) and want the system to make sure all the AD users are imported.

    Now the icing on the cake. You have the AD connector setup to run at 2am off hours as to not impact your community. What does all this have to do with the locking in the middle of the day.
    This workflow has its own set run time and this time is not configurable. It runs every 24 hours from the date/time of the AD connector create day.
    So we created our AD connector during business hours when we built the farm at 2:50pm (of course our high point time frame) so every day during peak time the SQL locks were occurring and w could not match it to the workflows.

    Just unchecking this option made all the difference.
    HTH

  • Rob_SimonRob_Simon Customer IT Monkey ✭
    @Brian_Wiest Wow, just when I thought I'd heard of every weird thing MS designed in SCSM, this one sure take the cake.
  • Brian_WiestBrian_Wiest Customer Super IT Monkey ✭✭✭✭✭
    Was it matching your issue?
  • Rob_SimonRob_Simon Customer IT Monkey ✭
    We have this option currently turned off for the 4 AD connectors we currently have (different users/groups within the tree structure)
  • Gerhard_GoossensGerhard_Goossens Customer Adept IT Monkey ✭✭
    All mine is also unchecked for "Automatically Add Users..." and ticked for "Do not write..."
  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    edited February 13
    Bah! I always forget that AD Group Expansion one and I forget because the checkbox Brian cites didn't make an appearance until SCSM 2016.

    In SCSM 2012 the box wasn't there and as such the only thing you could do is export the "Service Manager Linking Framework Configuration" unsealed management pack. Then once you open it you can do a CTRL+F for "ADGroupExpansionWorkflow" which leads to you something that looks like the following:



    You'd have an Expansion rule for each AD Connector you have and then you could control not only the time it runs but also its Enabled property.

    Regarding @Gerhard_Goossens's question on "How to interpret all the users, are they working the hardest?" Well to be honest, you aren't that far off. The question here is really what are those users doing that's causing a write to the ECL that causes them to be there that often. For example - do you have users constantly re-assigning Work Items? Do you have a template that X default users are set on and that template is being called frequently? I know when troubleshooting high counts on user objects I use the following PowerShell:

    $fullname = ""
    $mgmtServer = ""
    
    $userClass = get-scsmclass "system.domain.user$" -computername $mgmtServer
    $user = get-scsmobject -class $userclass -Filter "DisplayName -eq '$fullname'" -ComputerName $mgmtServer
    Get-SCSMObjectHistory -Object $user -ComputerName $mgmtServer
    

    So in this case, you're looking up the History of a single user in SCSM (e.g. go after the "loudest" one per your above results). The results will absolutely demand their own specific investigation that is going to be unique to your environment. But it's a way to possibly catch if there is some Workflow that is performing some change frequently on users.
  • Brian_WiestBrian_Wiest Customer Super IT Monkey ✭✭✭✭✭
    edited February 13
    Its most likely matching @Gerhard_Goossens issue seeing the screenshots

    @Rob_Simon attached are some SQL queries to help review.
    WorkflowstatusSQL.txt
    This will provide you a run down of your workflows status from the SQL DB.
    You want to see Minutes behind at 1/2 as all default workflows process every one minute. EventCount is how many need processing.
    This is a per environment judgement call but helps to see if there is a workflow locking up the system doing an unexpected task or taking to long to process the changes.
    (Note AND S.IsPeriodicQueryEvent = 0)
    I turn it off to always view as it shows my periodic subscriptions. You want as few as possible of these, as they take a larger hit on the system.

    ActiveTransactions.txt
    This shows active transaction on the database that may not be something triggered by workflow (example scheduled powershell task)
    Again this is a per environment judgement call

    DBDisk.txt
    Provides a table of how SQL sees the disk performance. The lower the better.
    Note these stats clear anytime the instance is restarted or failed over.

  • Brian_WiestBrian_Wiest Customer Super IT Monkey ✭✭✭✭✭
    Weird system is blocking all attachments for files

    Now its now allowing be to paste into post. Website must really not like the queries
  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    @Brian_Wiest that's why I couldn't paste the Large Table Query directly in here yesterday. Vanilla doesn't like some kind of aspect of SQL queries.
  • Gerhard_GoossensGerhard_Goossens Customer Adept IT Monkey ✭✭
    @Brian_Wiest @Adam_Dzyacky

    Can you please send the queries via mail if you don't mind.
    [email protected]
  • Jonathan_BolesJonathan_Boles Customer Ninja IT Monkey ✭✭✭✭
    @Brian_Wiest and @Adam_Dzyacky

    Same here, would you mind email the queries? We've been having a devil of a time with some issues surrounding the ECL.

    [email protected]
  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    edited February 18
    Queries work again @Brian_Wiest
    I ended up having to create separate responses for each query, otherwise it threw the error again. Code block formatting in the community seems to have some quirky issues. Never the less, it should run just fine.


    From Kevin Holman's "Useful SCOM Queries" - Largest Tables. Modified to Sort By Row Count
    --This is the SAME query used against SCOM and SCOM DW, but it also works against SCSM. This lists the top 1000-- largest tables in the ServiceManager DB sorted by row count.
    SELECT TOP 1000 a2.name AS [tablename], (a1.reserved + ISNULL(a4.reserved,0))* 8 AS 'reserved (KB)', a1.rows as row_count, a1.data * 8 AS 'data (KB)', (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data THEN (a1.used + ISNULL(a4.used,0)) - a1.data ELSE 0 END) * 8 AS 'index size (KB)', (CASE WHEN (a1.reserved + ISNULL(a4.reserved,0)) > a1.used THEN (a1.reserved + ISNULL(a4.reserved,0)) - a1.used ELSE 0 END) * 8 AS 'unused (KB)', (row_number() over(order by (a1.reserved + ISNULL(a4.reserved,0)) desc))%2 as l1, a3.name AS [schemaname]FROM (SELECT ps.object_id, SUM (CASE WHEN (ps.index_id < 2) THEN row_count ELSE 0 END) AS [rows], SUM (ps.reserved_page_count) AS reserved, SUM (CASE WHEN (ps.index_id < 2) THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count) ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count) END ) AS data, SUM (ps.used_page_count) AS used FROM sys.dm_db_partition_stats ps GROUP BY ps.object_id) AS a1 LEFT OUTER JOIN (SELECT it.parent_id, SUM(ps.reserved_page_count) AS reserved, SUM(ps.used_page_count) AS used FROM sys.dm_db_partition_stats ps INNER JOIN sys.internal_tables it ON (it.object_id = ps.object_id) WHERE it.internal_type IN (202,204) GROUP BY it.parent_id) AS a4 ON (a4.parent_id = a1.object_id) INNER JOIN sys.all_objects a2 ON ( a1.object_id = a2.object_id ) INNER JOIN sys.schemas a3 ON (a2.schema_id = a3.schema_id)WHERE a2.type <> N'S' and a2.type <> N'IT'
    --order by a1.data descorder by row_count desc
  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    Failed Grooming Jobs Shows when grooming jobs in core SCSM are potentially having issues grooming objects out. It's important to note here that this has nothing to do with the DW, this is core SCSM just trying to accomplish its grooming/purge jobs at 12am and 2am.
      -- Run against the ServiceManager database. It will return
     -- a list of all internal SCSM grooming jobs that have failed. While jobs may fail from time to time,
     -- as long as they show a successful execution after a failure there is little cause for concern.
    
    SELECT
       InternalJobHistoryId,
       Command,
       TimeStarted,
       timefinished,
       statuscode
    FROM InternalJobHistory WITH(NOLOCK)
    WHERE
       TimeFinished IS NULL AND
       StatusCode <> 1
    order by timestarted desc
    
    --alternatively, you could run the same query without the WHERE clause to return all jobs
    
    /* SELECT
       InternalJobHistoryId,
       Command,
       TimeStarted,
       timefinished,
       statuscode
    FROM InternalJobHistory WITH(NOLOCK)
    order by timestarted desc
    */

  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    DCM Volume - If you're using SCCM and DCM, it's possible you could be generating a lot of noise. This query helps show those items off
    -- Run against the ServiceManager database. It will return the volume of DCM data within SCSM.
    -- But if you're not using SCCM and DCM then you should expect almost no results for this query
    
    SELECT Count (1) AS DeviceHasWSUSUpdates FROM [LFXSTG].[v_Cached_CMv5_DeviceHasWSUSUpdates]
    SELECT Count (1) As DCMNonCompliantConfigItems FROM [LFXSTG].[v_Cached_CMv5_DCMNonCompliantConfigItems]
    SELECT Count (1) AS DCMConfigItemRelations FROM [LFXSTG].[v_Cached_CMv5_DCMConfigItemRelations]
    SELECT Count (1) AS DCMConfigItems FROM [LFXSTG].[v_Cached_CMv5_DCMConfigItems]
    SELECT Count (1) AS AdminCategories FROM [LFXSTG].[v_Cached_CMv5_AdminCategories]
  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    Loudest Objects in SCSMThis is great because you can possibly identify what is making the most noise in SCSM. To analyze this data keep in mind that large numbers aren't necessarily bad, so what you should be focused on is if you have a huge gap between objects. For example, if the first is 1000 but the second is 30...Ultimately you shouldn't expect to see Work Items/Users at the top, if you do it raises the question of what is hammering changes against that specific Work Item/Users.
    SELECT TOP 25 BME.FullName, COUNT(1)
    FROM EntityChangeLog AS ECL WITH(NOLOCK)
    JOIN BaseManagedEntity AS BME WITH(NOLOCK)
        ON ECL.EntityId = BME.BaseManagedEntityId
    WHERE RelatedEntityId IS NULL
    GROUP BY BME.FullName
    ORDER BY COUNT(1) DESC
  • Brian_WiestBrian_Wiest Customer Super IT Monkey ✭✭✭✭✭
    edited February 20
    Attachments were fixed by the vendor. Attached are the queries I was referring too.
    (well was told fixed. still not working.)
  • Rob_SimonRob_Simon Customer IT Monkey ✭
    Well we had a talk with MS and will be performing some DB performance improvements but we found the root cause as to why things were bombing. 

    MS noticed there was a large amount of work items which were going to be groomed all at once in the near future. This was likely caused by implementing the Cireson Advanced Request Offering tool, which moved the last modified date on all work items to the day which we imported the Management Pack. Without being addressed, this would likely take the SCSM application down as it would be attempting to groom a much larger amount of data than usual. After discussing multiple options to address this, MS agreed that the best option would be a controlled one-time grooming of work items to move past those large groups of work items. 
  • Rob_SimonRob_Simon Customer IT Monkey ✭
    And as for performance, MS recommended the following:

    1. Change compatibility level to 110. We have seen this improve performance of SCSM running on SQL 2014.

    2. Change DB owner to service account instead of SQL SA account. Currently, the local SA SQL account owns the SCSM DB. Since you have the database engine in Windows Only authentication mode, this can cause problems with certain Authorization Manager operations.

    3. Max degree of parallelism needs to be to 4 (half procs up to 8).

    4. Cost threshold for par needs to be set to 25 (A number that I have arrived at over the years to prevent parallelism in fragmented query scenarios and such).

    5. Highly recommend setting memory limitations on both SSAS and the database engine (these services will compete for resources).


  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    Rob_Simon said:
    And as for performance, MS recommended the following:

    Max degree of parallelism needs to be to 4 (half procs up to 8).

    Microsoft Support, SCSM folks, SCOM people, SQL DBAs - everyone seems to have a different opinion and I'm not sure I've ever been convinced there is a consensus. In fact, the only thing I'd say seems to be certain amongst parties is a "try it and see what happens" mentality. I'm not even being critical of that attitude but I don't know if I'm convinced there is a definitive answer on this.
  • Tom_HendricksTom_Hendricks Customer Super IT Monkey ✭✭✭✭✭
    Rob_Simon said:
    ...

    MS noticed there was a large amount of work items which were going to be groomed all at once in the near future. This was likely caused by implementing the Cireson Advanced Request Offering tool, which moved the last modified date on all work items to the day which we imported the Management Pack. Without being addressed, this would likely take the SCSM application down as it would be attempting to groom a much larger amount of data than usual. After discussing multiple options to address this, MS agreed that the best option would be a controlled one-time grooming of work items to move past those large groups of work items. 
    We have had a very similar experience, but it required more than just a one-time intervention.  Yes, this will help you, but you could end up having the same problem next time you upgrade that management pack (or others that update the last modified on so many objects).  Microsoft had provided me with a more long-term fix, which has made it into the docs site, albeit extremely briefly and without much helpful detail provided.

    In our case, we also had a problem where the grooming job would consume all available resources and prevent even itself from running, which halted the whole system for our offices on the other side of the world (during their afternoon) and also compounded the problem because the job could never clear the backlog of items in need of grooming.  Because of the 24x7x365 requirement for operation, we had to have the job run in smaller batches over longer periods of time so that the system could continue to run.  What this also did was even things out to where the items could finally get groomed properly and there was a performance lift.

    You cannot configure the grooming job from any known interface (including PS or the C# SDK that I know of) but the SQL queries used and the job schedules are contained in a system management pack that can be overridden (like a SCOM MP).  We run the queries with a SQL agent job that can take all the time it wants to finish, and that has made the difference.  Running low (items per batch) and slow has been the key.

    As for the compatibility level, I am told that the reason why 110 works better is because of the "legacy cardinality estimation" that it uses.  SCSM does not perform as well with the newer method for some reason.  This can also be set with a trace flag on the server, or I believe it can be chosen directly from the DB options in 2016/2017.
  • Rob_SimonRob_Simon Customer IT Monkey ✭
    Tom_Hendricks said:
    You cannot configure the grooming job from any known interface (including PS or the C# SDK that I know of) but the SQL queries used and the job schedules are contained in a system management pack that can be overridden (like a SCOM MP).  We run the queries with a SQL agent job that can take all the time it wants to finish, and that has made the difference.  Running low (items per batch) and slow has been the key.

    That's exactly what our DBAs are planning!
Sign In or Register to comment.