Further optimizing the ServiceManager database and workflows

Tom_HendricksTom_Hendricks Customer Super IT Monkey ✭✭✭✭✭

These questions might belong on the Microsoft forums, but I am interested to find out what all the bright minds here can come up with, first.

  • Has anyone successfully migrated ServiceManager tables (particularly the ECL, RECL, and DiscoverySource tables...) to SQL 2016/2017 in-memory tables, without breaking any functionality? Indexes and constraints can be rebuilt, but since clustered indexes are a no-go, I am curious if re-creating them as hash/noncluster/etc. breaks SCSM.
  • Generally speaking, of course, what is a good DOP (Degrees of Parallelism) Threshold for Service Manager? The default for SQL Server is 5, I've seen recommendations (not specific to SCSM) to bump it to between 25 and 50, but it seems to perform best around 10. This seems to be a function of the execution plan for queries wanting to execute with parallelism while maintaining enough "free" cores to process other, simpler queries at the same time, as I understand it. With the number of blocks and long waits I'm seeing on Insert operations, this seems like it would be a factor.
  • There are some horrendously poor performing stored procs in the ServiceManager DB, which no amount of CPU, RAM, or I/O investments (~3ms SSD here...) seems to help. e.g.: p_EntityChangeLogSnapshot which gets called on all the save/Commit calls to the SDK. Are there any supported ways to optimize these?
  • Are there any ways to limit the performance impact of using the SCOM Alert Connector with a high volume of alert IR's?
  • Does anyone have a good formula to use for picking the right batch size and delay with which to override OOB workflows that are struggling to keep up? (i.e.: more instances keep getting created before the previous one finishes due to too many items) I have been experimenting and getting mixed results.

Best Answer

Answers

  • Tom_HendricksTom_Hendricks Customer Super IT Monkey ✭✭✭✭✭

    Thanks, @Nicholas_Velich. I am going to mark this as answered since this will probably be right for most people, but there are still some other parts to this, too.

    I ended up in a different place, through trial and error, having set the threshold at 25 briefly (also trying it around 40 and 50 with slightly negative results) and the Max DOP at 4 (on an 8-core server). If anything, this seemed to make it a bit worse in my case. I currently have the threshold at 10, with the server Max DOP at 4 and the ServiceManager DB in particular overriding at Max DOP of 6. This has implications for queries against both the ServiceManager and ServiceManagement DBs since the DB against which the query runs dictates the Max DOP in those cases, but that actually seems to be working in our favor.

    There are several other databases on this particular SQL server, which typically do not consume much in the way of resources but occasionally spike high for just a minute or so, which is when I believe giving them fewer cores than ServiceManager gives us some benefit.

    The CPU utilization was very, very low prior to this, in spite of how hard the server was seeming to struggle. Although SQL Server was allocated the majority of 128GB, just leaving enough reserved for the OS, it wasn't utilizing very much of it in actuality according to the VM host. Disk IO looked very responsive too.

    Since moving to the current config, it has definitely sped up the DB reads. Writes are still a bit slower than I would like, but have also improved. One thing we did find however was that statistics were not getting updated properly by the SQL Agent job that was supposed to be handling it, and that has helped enormously (as one would expect!).

    Having said all of this, I do still wonder about the following, though:

    • For those of us who can spare the RAM and can keep the table size below the SQL-imposed size limit, are in-memory tables an option?
    • I still see workflows running a few minutes behind, although they have vastly improved. What is a good guideline (or guidelines) to follow for increasing batch size or adjusting the delay time? I like this classic blog post on the subject, but it is heavy on identifying the problem and light on solving it (without trial and error, that is).
Sign In or Register to comment.