Anyone experiencing "missing" runbooks in SCSM?

Mark_WahlertMark_Wahlert Customer Advanced IT Monkey ✭✭✭

Hi.

I thought I'd ask the community. We work in a unique environment and have additional security controls in place, which can cause some issues.

Our current issue is that our runbooks are unstable: they constantly go missing from SCSM. We have a monitor to alert us when this occurs, and we resync the runbooks straight away. However, there's clearly an underlying issue; our thoughts are it might be related to memory, shells, or something along those lines. It wasn't always a problem, but now runbooks go missing every day, and every RO submitted with a missing RB fails, leading to dozens of rejected SRs. If anyone's experienced something similar, we welcome your input.

Comments

  • Brett_MoffettBrett_Moffett Cireson PACE Super IT Monkey ✭✭✭✭✭
    Define "Missing" RunBooks?
    Are they changing their status to Missing, or are they disappearing entirely?
  • Brett_MoffettBrett_Moffett Cireson PACE Super IT Monkey ✭✭✭✭✭
    A common issue I see is when people change the input properties of a RunBook in the "Initialize RunBook" action. This will cause the RunBook record within SCSM to change its status to 'Invalid'.
    The fix for this is to delete the RunBook record from SCSM and re-run the connector.

    As the GUID for the RunBook has not changed, any RunBook Automation Activities that are associated with it should just pick up where they left off once the sync has happened.
  • Mark_WahlertMark_Wahlert Customer Advanced IT Monkey ✭✭✭

    Hi Brett, by missing I mean that at random times during the day a number of runbooks within SCSM change from the status 'Active' to 'Missing'. They occasionally correct themselves when SCSM syncs, but often stay Missing until we notice and resync the connector manually. To reduce the time between identifying the error and addressing it, we have a PowerShell script that checks the status of the runbooks every 5 minutes and forces a resync once SCSM has recognised that one is missing (which happens when SCSM tries to sync).

    Latest thoughts are it's something to do with connectivity issues between SCSM and the SCORCH web service.

  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    What UR are you on between SCSM and SCO?
  • Mark_WahlertMark_Wahlert Customer Advanced IT Monkey ✭✭✭

    I think it's 7 for SCSM, v8 UR for the SCO designer, and v7 for the SCO runbook server.


    The thing is, they worked fine at first. This missing issue seemed to appear after a month or more, but we can't think of anything we've changed config-wise. We did go live, so we thought there was a correlation between the increased usage of the RBs and the increase in errors.

  • Brett_MoffettBrett_Moffett Cireson PACE Super IT Monkey ✭✭✭✭✭
    Any errors in the Event log?
  • Mark_WahlertMark_Wahlert Customer Advanced IT Monkey ✭✭✭
    Hi Brett, in SCSM no, other than that it syncs the runbook connector - that's all we can see. For SCO it's a little more frustrating, as another team manages it, so we are demanding logs. >:)
    It looks like we need a good look through those. I'll update with what, if anything, we find.
  • Brett_MoffettBrett_Moffett Cireson PACE Super IT Monkey ✭✭✭✭✭
    ITIL best practices call for event tracking so things like server, network, and DB outages are all tracked.
    This is where those ITIL practices would come in handy.

    No, I'm not going to start preaching ITIL, but if another team is responsible for SCO and they are restarting IIS, rebooting servers, deleting or moving Runbooks, changing permissions, etc. without you knowing, this could be the cause of the issue, depending on when it occurs.

    There is a great blog post here by Travis Wright that explains what the statuses of the Runbooks mean in SCSM.

    His statement for Missing is:
    Missing – Missing is the status of a Runbook that cannot be found by the Service Manager – Orchestrator Connector since last synchronization happened. That could be caused by (1) The Runbook was removed from System Center Orchestrator or (2) permissions were changed and the current Run As Account for the Service Manager – Orchestrator Connector does not have access to that Runbook.

    Good luck hunting down the cause and I hope this answers your question.
  • Cory_BoweCory_Bowe Customer Adept IT Monkey ✭✭
    Mark_Wahlert - Please look at this link: https://blogs.technet.microsoft.com/thomase/2012/05/29/orchestrator-runbooks-not-appearing-on-the-web-console/

    Even on the latest Orch UR we were still experiencing issues with our authorization cache that caused runbooks to go missing in the web console and SCSM. Truncating the authorization cache table fixes the issue for a while, so we ended up scheduling a job that runs multiple times a day to do this.
  • Cory_BoweCory_Bowe Customer Adept IT Monkey ✭✭
    I forgot to mention: if you clear the auth cache and then run the SCSM Orch connector again, the runbooks will return to normal without any other intervention. You shouldn't need to delete them before syncing.
  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    edited October 2016
    The AuthorizationCache table is frequently mentioned on several independent blogs as the source of essentially all issues with respect to SCO. Very recently I was troubleshooting an issue wherein runbooks occasionally (and I mean really, really, really occasionally) don't invoke when SCSM calls them.

    As it turns out, the invocation of the runbook coincided with the execution of the stored proc that rebuilds the AuthorizationCache table. Instead of every 10 minutes, I moved this job out to once every 24 hours (early morning hours). Since then, SCSM no longer produces an error when invoking an SCO runbook. (The Windows event log reported something to the effect that the runbook can't be found; to be really clear about the error, it's the same one Pete Zerger mentions here - http://www.systemcentercentral.com/scsm-tip-troubleshooting-failed-service-requests-in-scsm-2012-r2/ - with identical symptoms/results/etc.) The cost, however, is that if new users/runbooks are added to SCO, they won't work until the following day.

    It was rumored that this (AuthorizationCache table issues) was going to be addressed in SCO 2016, but unfortunately it has not been.
  • Cory_BoweCory_Bowe Customer Adept IT Monkey ✭✭
    Adam_Dzyacky - We had it running every 24 hours also, but what we noticed is that sometimes the authorization tables would still calculate incorrectly, and then we would have failures until the job was run manually. Once an hour seems to be the sweet spot for us: any incorrectly calculated authorization values are generally overwritten before anyone notices, yet the job doesn't run often enough to leave much of a gap during the re-calc.
  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    Interesting @Cory_Bowe ...and also slightly aggravating to hear, only in that it means this value easily varies from deployment to deployment. But nevertheless insightful on the topic!

    Hrm.
  • Mark_WahlertMark_Wahlert Customer Advanced IT Monkey ✭✭✭
    @Cory_Bowe @Adam_Dzyacky Thanks guys. I'm told we've tried the above; we've had to schedule some tasks and monitor the SCSM connections for missing and invalid runbook statuses. Unfortunately it still occurs, and we end up with dozens of broken SRs that have to be resolved manually, either by rerunning the runbook, by skipping it and adding the Reviewers manually (for the AD manager lookup runbook), or sometimes by using the console to turn off automation and force the activity through. It still leaves us with a flaky automation solution.
  • Mark_WahlertMark_Wahlert Customer Advanced IT Monkey ✭✭✭
    I forgot to add, we have MS Premier on the case, but 3 weeks in we've been given some other ideas and no resolution yet. I'll update when we have one.
  • Mark_WahlertMark_Wahlert Customer Advanced IT Monkey ✭✭✭

    Update - this doesn't explain the missing/invalid issue, but the below offers some helpful advice.

    1. Don't use the ampersand '&' character in anything you pass to SCORCH, e.g. prompts passed to SCORCH. This will leave the runbook in pending as it can't interpret the XML.
    2. Large amounts of output not assigned to a variable in a remote session will run the session out of memory. This is fixed by making commands more efficient, with every output assigned to a variable within the remote session. This prevents failing runbooks.
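
On tip 1, an alternative to avoiding the character entirely may be to XML-escape values before they are handed over. The following is only a minimal Python sketch of the idea, assuming the value really does end up inside an XML element as described above. The `<Parameter>` element and the function name `safe_runbook_param` are hypothetical, and whether the receiving side unescapes the value correctly would need testing in your own environment.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def safe_runbook_param(value: str) -> str:
    """Escape &, < and > so the value can sit inside an XML element."""
    return escape(value)


raw = "Finance & Payroll"
escaped = safe_runbook_param(raw)

# Unescaped, the fragment is not well-formed XML and the parser rejects it.
try:
    ET.fromstring(f"<Parameter>{raw}</Parameter>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# Escaped, the fragment parses and the original text round-trips.
parsed_text = ET.fromstring(f"<Parameter>{escaped}</Parameter>").text
```
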
  • Cory_BoweCory_Bowe Customer Adept IT Monkey ✭✭
    @Mark_Wahlert

    Those are some good tips!  The ampersand issue got us a few times.

    If you don't want to assign a variable to everything, or there's no reason to save the output, you could also pipe the output to Out-Null.

    If you run through and assign variables and redirect your output but still have issues, you can also increase the amount of memory available to each PowerShell session on the server you are invoking to:
    http://jeffwouters.nl/index.php/2014/03/out-of-memory-exception-in-powershell/
  • Christopher_CarverChristopher_Carver Customer Adept IT Monkey ✭✭
    Trolling through the posts and raising this from the depths. 
    We had the same problem, but have now fixed the issue completely. In our case the issue was that the SCSM and SCORCH servers are on Azure VMs. SCSM and SCORCH were not designed with cloud in mind at the time. Our RBs would go missing even with no changes to the RBs. This usually stems from a change in the VM that the SCSM connector does not recover from.

    But we _had_ this problem and have now fixed it 100%. The solution was in how we designed our RBs. We no longer have a connector from SCSM to SCORCH; instead, we have a connection to SCSM made from the SCORCH side. We created an RB called Master Ticket Handler. Every 5 minutes the Master Ticket Handler wakes up and runs a PowerShell activity that retrieves all tickets that have changed since the last execution. We then built in the intelligence to handle tickets accordingly. 95% of the time we ignore the tickets, as they're being worked on by analysts or SCSM triggered a workflow update on them and nothing has changed.

    Here's the HUGE bonus from this design, which we are looking into now and hope to release sometime later in 2017. An important piece of information: our analysts use the Cireson portal 100%; they don't want to learn or use the SCSM Console, so they cannot add tasks to tickets. With this new design, we will be adding tags to SCSM tickets via the private notes to allow our analysts to add manual, review, and automation tasks. So let's say an analyst needs an approval for a ticket. In the notes they would put #review @ccarver @vsridhar and save it as a private note. (Yes, we are modeling this after Twitter.) The Master Ticket Handler will pick up the note in the next cycle, determine that the analyst is asking for a ticket review, and automatically create a review activity for the ticket.
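
To make the tag format concrete, here is a rough Python sketch of how a note such as `#review @ccarver @vsridhar` could be split into a tag and a list of mentioned users. It is not the poster's runbook (which runs in SCORCH); the names `parse_private_note` and `AutomationRequest` are hypothetical, and the rule that a tag must start with a letter is an assumption added so that ticket numbers like `#12345` are not mistaken for tags.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: a tag is '#' followed by a letter and then word characters.
TAG_RE = re.compile(r"#([A-Za-z]\w*)")
MENTION_RE = re.compile(r"@(\w+)")


class AutomationRequest(NamedTuple):
    tag: str             # lower-cased tag without the leading '#'
    mentions: List[str]  # user names that follow the tag, without '@'


def parse_private_note(note: str) -> Optional[AutomationRequest]:
    """Return the first tag and the mentions after it, or None if no tag."""
    tag_match = TAG_RE.search(note)
    if tag_match is None:
        return None
    # Only mentions that appear after the tag count as its arguments.
    mentions = MENTION_RE.findall(note[tag_match.end():])
    return AutomationRequest(tag_match.group(1).lower(), mentions)


request = parse_private_note("#review @ccarver @vsridhar")
```

Notes without a tag return `None`, which corresponds to the common case where the handler simply ignores the ticket.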

    Instead of thinking of SCORCH as an extension for SCSM, we treat SCORCH as our artificial intelligence (a digital analyst) that has the ability to roam and look at the tickets at will. This really helped to expand our functionality. 
  • Brett_MoffettBrett_Moffett Cireson PACE Super IT Monkey ✭✭✭✭✭
    WOW! Nice idea @Christopher_Carver.
    We do have the ability to add, remove and edit Activities via the portal as one of our development goals at the moment, so it will be coming out in the coming versions, but the idea of a #tag based AI (as you call it) is a really amazing way for analysts to interact with their tickets.

    I can see many uses for such a thing.

    Thanks for sharing your idea with the community.
    If you could share your Runbooks too that would be amazing, as I feel there are many others out there who would like to replicate the idea.
  • Trevor_WendtTrevor_Wendt Customer IT Monkey ✭
    edited May 2017
    Second that "WOW! Nice idea @Christopher_Carver."... and sharing a sample runbook or two would be awesome too. 
  • Adam_DzyackyAdam_Dzyacky Customer Contributor Monkey ✭✭✭✭✭
    Not related to the missing runbooks, but I too am leveraging a "digital analyst" of sorts by replacing the stock Exchange Connector with a PowerShell/runbook that is handling some of these scenarios. - https://community.cireson.com/discussion/2471/an-smlets-based-exchange-connector
  • Christopher_CarverChristopher_Carver Customer Adept IT Monkey ✭✭
    Invoking ad-hoc automation from tickets is pretty straightforward. I would note that I use Kelverion integration packs, and this makes things a lot easier. You do not need Kelverion integration packs to make this work, but you will need to come up with some creative alternatives.

    Here is the basic flow. 
    • Figure out the last time I got private ticket comments.
    • Get the private comments since the last time they were retrieved.
    • Make sure the private comment contains an automation tag.
    • Map the automation tag to the proper runbook.
    • Gather some minor ticket details.
    • Kick off the right runbook to process the automation request.



    Devils and Details:

    In the Get Runbook Info activity, I need to extract the current runbook's GUID. The reason for this is that I use the GUID in the Get Last Synchronized activity to perform a SQL query on a table that is used to store the timestamp of the last lookup. (Exactly the same as the LastModified table for the CiresonCacheBuilder.) The reason for using the GUID and not the runbook name is that runbook names change from time to time and GUIDs will not.

    In the Get Last Synchronized activity, I do a simple SQL query passing the runbook GUID and pulling back the timestamp. Then I update the timestamp with a new one in Update Last Synchronization. This avoids problems with picking up only the new comments when automation has been offline or when you need to increase the frequency of execution.
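
The read-then-update pattern described above can be sketched as follows. This is an illustration only, written in Python against an in-memory SQLite database rather than the SQL Server table used in the thread; the table name `LastSynchronized`, its columns, and the function names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed by runbook GUID (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE LastSynchronized ("
        "RunbookId TEXT PRIMARY KEY, LastRun TEXT NOT NULL)"
    )
    return conn


def get_and_advance(conn: sqlite3.Connection, runbook_id: str,
                    now: datetime) -> Optional[datetime]:
    """Return the previously stored timestamp, then store `now` in its place."""
    row = conn.execute(
        "SELECT LastRun FROM LastSynchronized WHERE RunbookId = ?",
        (runbook_id,),
    ).fetchone()
    previous = datetime.fromisoformat(row[0]) if row else None
    conn.execute(
        "INSERT OR REPLACE INTO LastSynchronized (RunbookId, LastRun) "
        "VALUES (?, ?)",
        (runbook_id, now.isoformat()),
    )
    conn.commit()
    return previous


conn = open_store()
guid = "00000000-0000-0000-0000-000000000001"  # placeholder GUID
first_run = datetime(2017, 5, 1, 9, 0, tzinfo=timezone.utc)
second_run = datetime(2017, 5, 1, 10, 30, tzinfo=timezone.utc)

seen_on_first = get_and_advance(conn, guid, first_run)
seen_on_second = get_and_advance(conn, guid, second_run)
```

Because the query window starts at the stored timestamp rather than at "now minus 5 minutes", a run that happens 90 minutes late still covers the whole gap.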

    In Latest Private Messages, I am pulling from Trouble Ticket Analyst Comments. What's nice is that this separates analyst comments from customer comments. We don't want customers invoking automation requests. I pass the timestamp into the "Entered date" field with the "After" relationship; that way I am only parsing new comments. I also make sure this is a private comment. Automation invocations should only be coming from private comments.



    (Side comment) Between Latest Private Messages and Map Automation to Runbook is a good place to extract exactly who the commenter of the private comment is, if you want finer control over who can request an automation sequence. The last thing you want is the intern kicking off a costly request. No offense to the interns out there. Extracting the commenter is an academic exercise and easy to implement if you want it.

    Then in Regex Match, I find the # match from the comments.


    Now in Map Automation to Runbook I use the Utilities -> Map Published Data activity. (This can easily be replaced by a SQL query to a table mapping, but for simplicity for now I use this until we go to production.) In this activity I map the matched # automation request to a runbook GUID to handle the request. Reason again, runbook names can change in our environment, so using a GUID is safer. 
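
As an illustration of the mapping step, here is a small Python sketch of a tag-to-GUID lookup. The thread uses the Map Published Data activity (or, later, a SQL mapping table) for this; the dictionary, the GUID values, and the function name `resolve_runbook` below are made-up placeholders, not values from the poster's environment.

```python
from typing import Dict, Optional

# Hypothetical mapping from automation tag to the GUID of the runbook
# that services it. Keys are stored lower-case without the leading '#'.
TAG_TO_RUNBOOK: Dict[str, str] = {
    "review": "11111111-1111-1111-1111-111111111111",
    "createra": "11111111-1111-1111-1111-111111111111",
    "createma": "22222222-2222-2222-2222-222222222222",
}


def resolve_runbook(tag: str) -> Optional[str]:
    """Return the runbook GUID for a tag, or None if the tag is unknown."""
    return TAG_TO_RUNBOOK.get(tag.lstrip("#").lower())


runbook_id = resolve_runbook("#CreateRA")
```

Returning `None` for an unknown tag lets the caller skip the comment instead of invoking the wrong runbook, and renaming a runbook never requires touching the mapping.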


    The next two steps (Get SR Relationship and Get Service Request) are very basic 101 automation for finding the relationship between the comment and the ticket itself.

    The last activity invokes the runbook to go off and perform the automation action requested by the ticket, passing over the SR GUID and the comment too. Some automation passes arguments along with the automation invocation. Example: #CreateRA @ccarver will create a review activity where I am the reviewer. I put the burden of the details on the runbook that is servicing the request and not on this generic handling. I keep with the UNIX mantra: do one thing really well.



    And there it is. Pretty simple. This framework can easily be added to and expanded to meet your needs.