Anyone experiencing "missing" runbooks in SCSM?
Hi.
I thought I'd ask the community. We work in a unique environment, and we have additional security controls in place which can cause some issues.
But our current issue is that our runbooks are unstable, in that they constantly go missing from SCSM. We have a monitor to alert us when this occurs, and we resync the runbooks straight away. However, there's clearly an underlying issue; our thinking is that it might be related to memory, shells, or something along those lines. It wasn't always a problem, but now runbooks go missing every day, which hurts us because every RO submitted with a missing RB fails, leading to dozens of rejected SRs. If anyone's experienced something similar, we welcome your input.
Comments
Are they changing their status to Missing, or are they disappearing entirely?
The fix for this is to delete the RunBook record from SCSM and re-run the connector.
As the GUID for the RunBook has not changed any RunBook Automation Activities that are associated to it should just pick up where it left off once the sync has happened.
Hi Brett, by missing I mean that at random times during the day a number of runbooks within SCSM change from the status Active to Missing. They occasionally correct themselves when SCSM syncs, but often stay Missing until we notice and resync the connector manually. To reduce the time between identifying the error and addressing it, we have a PowerShell script that checks the status of the runbooks every 5 minutes and, once SCSM has recognised that one is missing (which happens when SCSM tries to sync), forces a resync.
Latest thoughts are it's something to do with connectivity issues between SCSM and the SCORCH web service.
I think it's 7 for SCSM. And v8 UR for SCO designer, and v7 for SCO runbook server.
The thing is they worked fine at first; this missing issue seemed to appear after a month or more, but we can't think of anything we've changed config-wise. We did go live, so we thought there was a correlation between the increased usage of the RBs and the increase in errors.
It looks like we need to take a good look through those. I'll update with what, if anything, we find.
This is where those ITIL practices would come in handy.
No, I'm not going to start preaching ITIL, but if another team are responsible for SCO and they are restarting IIS, rebooting servers, deleting or moving Runbooks, changing permissions etc. without you knowing, this could be the cause of this issue, depending on when it is occurring.
There is a great blog here by Travis Wright, that explains what the statuses of the Runbooks mean in SCSM.
His statement for Missing is:
Missing – Missing is the status of a Runbook that cannot be found by the Service Manager – Orchestrator Connector since last synchronization happened. That could be cause by (1) The Runbook was removed from System Center Orchestrator or (2) permissions were changed and the current Run As Account for the Service Manager – Orchestrator Connector does not have access to that Runbook.
Good luck hunting down the cause and I hope this answers your question.
Even on the latest Orch UR we were still experiencing issues with our authorization cache that will cause runbooks to go missing in the web console and SCSM. Truncating the authorization cache table fixes the issue for a while, so we ended up scheduling a job that runs multiple times a day to do this.
As it turns out the invocation of the runbook coincided with the execution of the stored proc that rebuilds the AuthorizationCache table. Instead of every 10 minutes, I moved this job out to once every 24 hours (early morning hours). As such, SCSM no longer produces an error with respect to invoking an SCO runbook (Windows event log produces something to the effect that the runbook can't be found and actually, to be really clear about the error, it's the same one Pete Zerger mentions here - http://www.systemcentercentral.com/scsm-tip-troubleshooting-failed-service-requests-in-scsm-2012-r2/ with identical symptoms/results/etc). This however comes at the cost that if new users/runbooks are added to SCO they won't be able to work until the following day.
It was rumored this (AuthorizationCache table issues) was going to be addressed in SCO 2016 but unfortunately, it has not.
Hrm.
Update - this doesn't explain the missing/invalid issue, but the below offers some helpful advice.
Those are some good tips! The ampersand issue got us a few times.
If you don't want to assign a variable to everything, or there's no reason to save the output, you could also pipe the output to Out-Null.
If you run through and assign variables and redirect your output but still have issues, you can increase the amount of memory available to each PowerShell session on the server you are invoking to:
http://jeffwouters.nl/index.php/2014/03/out-of-memory-exception-in-powershell/
We had the same problem, but have now fixed this issue completely. In our case the issue was that the SCSM and SCORCH servers are on Azure VMs. SCSM and SCORCH were not designed with the cloud in mind at the time. Our RBs would go missing even with no changes to the RBs. This usually stems from a change in the VM that the SCSM connector does not recover from.
But we _had_ this problem and now fixed this 100%. The solution was how we designed our RBs. We no longer have a connector from SCSM to SCORCH, but we do have a connection from SCORCH to SCSM from the SCORCH side. We created a RB called Master Ticket Handler. Every 5 minutes the Master Ticket Handler wakes up and runs a PowerShell activity that retrieves all tickets where a change has been made to the ticket since last execution. We then built in the intelligence to handle tickets accordingly. 95% of the time we ignore the tickets as it's being worked on by analysts or SCSM triggered a workflow update on the tickets and nothing has changed.
Here's the HUGE bonus from this design that we are looking into now and hope to release sometime later in 2017. Here is an important piece of information: our analysts use the Cireson portal 100%; they don't want to learn or use the SCSM Console. So they cannot add tasks to tickets. With this new design, we will be adding tags to SCSM tickets via the private notes to allow our analysts to add manual, review, and automation tasks. So let's say an analyst needs an approval for a ticket. In the notes they would put #review @ccarver @vsridhar and save it as a private note. (Yes, we are modeling after Twitter.) The Master Ticket Handler will pick up the note in the next cycle, determine the analyst is asking for a ticket review, and automatically create a review activity for the ticket.
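To make the tag idea concrete, here is a minimal sketch of how a note like "#review @ccarver @vsridhar" could be split into a tag and a list of reviewers. It is written in Python purely for illustration (the actual handler described above runs as a PowerShell activity in SCORCH), and the function name `parse_note`, the lowercasing of the tag, and the exact regex patterns are all assumptions, not the poster's implementation.

```python
import re

# Hypothetical patterns: a tag is '#' plus word characters, a mention is '@' plus word characters.
TAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def parse_note(note):
    """Return (tag, mentions) for the first #tag in a note, or None if the note has no tag."""
    tag_match = TAG_RE.search(note)
    if tag_match is None:
        return None
    mentions = MENTION_RE.findall(note)
    # Lowercasing the tag is an assumption so that #Review and #review behave the same.
    return tag_match.group(1).lower(), mentions


print(parse_note("#review @ccarver @vsridhar"))  # ('review', ['ccarver', 'vsridhar'])
print(parse_note("Called the customer, no answer."))  # None
```

Notes without a tag return None, which matches the behaviour described above where most tickets are simply ignored.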
Instead of thinking of SCORCH as an extension for SCSM, we treat SCORCH as our artificial intelligence (a digital analyst) that has the ability to roam and look at the tickets at will. This really helped to expand our functionality.
We do have the ability to add, remove and edit Activities via the portal as one of our development goals at the moment, so it will be coming out in the coming versions. But the idea of a #tag-based AI (as you call it) is a really amazing way for analysts to interact with their tickets.
I can see many uses for such a thing.
Thanks for sharing your idea with the community.
If you could share your Runbooks too that would be amazing, as I feel there are many others out there who would like to replicate the idea.
Here is the basic flow.
Devils and Details:
In the Get Runbook Info, I need to extract the current runbook's GUID. The reason for this is, I use the GUID in the Get Last Synchronized activity to perform a SQL query on a table that is used to store the timestamp of the last lookup. (Exactly the same as the LastModified table for the CiresonCacheBuilder.) The reason for the GUID and not the runbook name is, runbook names change from time to time and GUIDs will not.
In the Get Last Synchronized activity, I do a simple SQL query passing the runbook GUID and pulling back the timestamp. Then subsequently I update the timestamp with a new one with Update Last Synchronization. This removes issues when automation is offline or you need to increase frequency of execution in picking up only the new comments.
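The read-then-update timestamp pattern described here can be sketched as follows. This is only an illustration under stated assumptions: it uses Python with an in-memory SQLite database as a stand-in for the real SQL table, and the table name `last_synchronized`, its columns, the helper names, and the placeholder GUID are all hypothetical.

```python
import sqlite3


def open_store():
    """Create an in-memory stand-in for the timestamp table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE last_synchronized (runbook_guid TEXT PRIMARY KEY, last_sync TEXT)"
    )
    return conn


def get_and_advance_watermark(conn, runbook_guid, now):
    """Return the previously stored timestamp for this runbook, then store `now`."""
    row = conn.execute(
        "SELECT last_sync FROM last_synchronized WHERE runbook_guid = ?",
        (runbook_guid,),
    ).fetchone()
    previous = row[0] if row else None
    # Upsert keyed on the GUID, so renaming the runbook does not lose the timestamp.
    conn.execute(
        "INSERT OR REPLACE INTO last_synchronized (runbook_guid, last_sync) VALUES (?, ?)",
        (runbook_guid, now),
    )
    conn.commit()
    return previous


demo = open_store()
guid = "00000000-0000-0000-0000-000000000001"  # placeholder GUID
print(get_and_advance_watermark(demo, guid, "2017-01-01T10:00:00Z"))  # None
print(get_and_advance_watermark(demo, guid, "2017-01-01T10:05:00Z"))  # 2017-01-01T10:00:00Z
```

Because each run reads whatever timestamp was stored last, a run after an outage picks up everything entered since the last successful cycle rather than only the last 5 minutes.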
In Latest Private Messages, I am pulling from Trouble Ticket Analyst Comments. What's nice is this separates analyst comments from customer comments. We don't want customers invoking automation requests. I pass the timestamp into the "Entered date" field with the "After" relationship; that way I am only parsing new comments. I also make sure this is a private comment. Automation invocations should only be coming from private comments.
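The filter amounts to "private and entered after the last timestamp". A small sketch of that rule, in Python for illustration only; the `Comment` class and its field names are hypothetical and do not reflect the actual SCSM comment schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Comment:
    entered: datetime   # when the comment was added
    is_private: bool    # True for private analyst notes
    text: str


def new_private_comments(comments, last_sync):
    """Keep only private comments entered strictly after the last synchronization time."""
    return [c for c in comments if c.is_private and c.entered > last_sync]


last_sync = datetime(2017, 1, 1, 10, 0)
comments = [
    Comment(datetime(2017, 1, 1, 9, 55), True, "#review @ccarver"),   # too old
    Comment(datetime(2017, 1, 1, 10, 2), False, "#review @ccarver"),  # not private
    Comment(datetime(2017, 1, 1, 10, 3), True, "#review @vsridhar"),  # kept
]
print([c.text for c in new_private_comments(comments, last_sync)])  # ['#review @vsridhar']
```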
(Side comment) Between the Latest Private Messages and Map Automation to Runbook, this is where it would be a good place to extract who exactly the commenter of the private comment is if you want finer control on who can request an automation sequence. Last thing you want is the intern to kick off a costly request. No offense to interns out there. This is an academic activity of extracting who the commenter is and easy to implement if you want it.
Then in Regex Match, I find the # match from the comments.
Now in Map Automation to Runbook I use the Utilities -> Map Published Data activity. (This can easily be replaced by a SQL query to a table mapping, but for simplicity for now I use this until we go to production.) In this activity I map the matched # automation request to a runbook GUID to handle the request. Reason again, runbook names can change in our environment, so using a GUID is safer.
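As a rough picture of what the mapping does, here is a lookup from a matched tag to the GUID of the runbook that services it. This is a Python sketch, not the Map Published Data activity itself; the tag names come from the examples in this thread, while the GUID values, the dictionary name, and the function name are placeholders invented for illustration.

```python
# Hypothetical mapping of automation tags to runbook GUIDs (placeholder values).
RUNBOOK_BY_TAG = {
    "review": "11111111-1111-1111-1111-111111111111",
    "createra": "22222222-2222-2222-2222-222222222222",
}


def runbook_for_tag(tag):
    """Return the runbook GUID registered for a tag, or None if the tag is unknown."""
    return RUNBOOK_BY_TAG.get(tag.lower())


print(runbook_for_tag("CreateRA"))  # 22222222-2222-2222-2222-222222222222
print(runbook_for_tag("unknown"))   # None
```

Returning None for an unknown tag lets the handler skip the comment instead of invoking the wrong runbook. Swapping the dictionary for a SQL table lookup, as suggested above, would not change the calling code.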
The next two steps (Get SR Relationship and Get Service Request) are very basic 101 automation for finding the relationship between the comment and the ticket itself.
The last activity is now invoking the runbook to go off and perform the automation action requested by the ticket and passing over the SR GUID and the comment too. Some automation passes arguments along with the automation invocation. Example: #CreateRA @ccarver will create a review activity where I am the reviewer. I put the burden of the details on the runbook that is servicing the request and not this generic handling. I keep with the UNIX mantra, do one thing really well.
And there it is. Pretty simple. This framework can easily be added to and expanded to meet your needs.
If your runbooks go missing every few days, there are a few issues Microsoft identified that were causing our runbooks to go missing randomly.
1) Apparently, workflows should not (cannot) have multiple criteria for the same field. I am not sure if this is true but the two workflows causing our issue did have that although we have workflows with multiple criteria that are not problems.
2) We have almost 300 runbooks so the Folders setting in IIS was not high enough to handle that.
Attached is a document that may help with troubleshooting the issue. Let me know if you have questions.