
Tuesday, November 17, 2015

ESXi and NIC Teaming gone wrong

Lately I have had the pleasure of discovering a new issue that has risen up from the 1s and 0s.  Currently, there is a network that is not using LACP but is using NIC teaming instead, which is fine.  The weird part is that one of their UC VMs randomly stopped talking outside its subnet.  Rebooting the VM did nothing, and it kept looking more and more like a network problem.  A TAC case was opened and even Cisco was pointing the finger at the network.

Without getting too in-depth on ESXi, since I'm no guru, I took a look at the network side of the house and everything looked like it was in working order as well.  Pings to and from the ESXi host were working, but for some strange reason one VM was sending data out one NIC and receiving it on the other.  This was the only VM doing this, and I think if LACP had been enabled it never would have happened.  The end result was a VM that wasn't talking correctly and couldn't communicate with its subscriber.  We ended up moving one of the NICs from active to standby and ran a test.  Everything magically worked itself out as all traffic was being forced over one lane.  We then put it back into active mode and everything was still working fine.  Weird... how did the TCP/UDP data streams end up getting hosed up?
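
If you ever want to flip an uplink between active and standby from the ESXi shell instead of the vSphere client, the sketch below is roughly what it looks like.  The vSwitch name (vSwitch0) and uplink names (vmnic0/vmnic1) are just placeholders for whatever your environment actually uses, so adjust accordingly.

  # Check the current failover/teaming policy on the vSwitch
  esxcli network vswitch standard policy failover get -v vSwitch0

  # Demote one uplink to standby so all traffic is forced over a single NIC
  esxcli network vswitch standard policy failover set -v vSwitch0 \
      --active-uplinks vmnic0 --standby-uplinks vmnic1

  # Once things settle, put both uplinks back into the active list
  # (re-run the get command afterwards to double-check the standby list)
  esxcli network vswitch standard policy failover set -v vSwitch0 \
      --active-uplinks vmnic0,vmnic1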

I checked again today and traffic was still evenly distributed, with no weirdness going on behind the scenes and no VM sending out one port and receiving on the other.  I'm not sure this was ever really the problem since, again, I'm not an ESXi expert, but it seems to have resolved the issues and they haven't returned.  This also seems to have fixed a second issue they were having, which was calls going to voicemail but then terminating after 5 seconds.  If this was indeed the resolution, nothing was showing as wrong anywhere on the network and no errors were being thrown.  Just a heads up for any of you that run into this issue yourselves: bounce the port or flip it to standby and back to active, and it should end with the same result.

Friday, November 13, 2015

7940 / 7960 DST change and CUCM 6.x - 7.x

Last week I was working on a set of problem phones on a really old version of CUCM.  The problem was that when daylight saving time ended, all the newer phones rolled back an hour with the change pushed via NTP and the Date/Time Group, but the older 7940s and 7960s did not.  This is a constant yearly issue with this particular client's CUCM since they are on 6.x.  After doing some research I found a bug that affects the 6.x train and also appears to go all the way to 8.0.  This bug essentially holds the 40s and 60s back an hour, and you either have to manually change the time, update the CUCM with a .cop file, or just wait a week for them to change on their own.
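
If you go the COP file route, the install can be run from the OS Administration page or straight from the CLI.  A rough sketch of the CLI side is below; the DST updater COP file name and the SFTP details are whatever TAC or the bug notes point you to, so treat all of that as placeholders.

  # Sanity-check the publisher's own clock and NTP sync first
  utils ntp status

  # Kick off the install wizard from the CLI and follow the prompts for
  # source (remote SFTP), directory, server, and credentials, then pick
  # the DST updater .cop.sgn file when it shows up in the list
  utils system upgrade initiate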

The weird part about all of this is that I changed it earlier this year to bump the time ahead an hour, and I didn't have to change it again afterwards.  I cycled the time back to -7 since I am on Central time and everything was fine.  Later last week I got a call that their phones had once again jumped forward an hour.  This didn't make sense at all.  I went in and did a screen cap of a 7940 and sure enough, it was showing Eastern time again.  I set the Date/Time Group back to Central time and this ultimately resolved the issue.

So the lessons learned are:

  1. Upgrade your CUCM
  2. Get rid of those ancient phones
  3. Upgrade and get rid of the phones
  4. Wait a week for the time to change
  5. Update the CUCM with a COP file

Why this issue still persists is beyond me.  It was going on some four years ago, back when 6.x and 7.x were still supported.  It should have been fixed then but never was.  Still, it's all the more reason to get off of an MCS server and move over to a BE6k or BE7k!

Wednesday, November 11, 2015

T1s being T1s

So there I was at the end of the day when a ticket came in for a long distance code not working for a customer.  Easy, right?  They probably don't have a FAC on CUCM for some reason, just as the others didn't.  Well, long story short, there was a FAC, the call was hitting the gateway, and I was getting a cause code 41 (temporary failure) via RTMT.  DSPs were fine and the T1 looked good.  I came in this morning figuring whatever issue the carrier had was fixed, but nope, still hosed.  More tickets were coming in saying their long distance codes had stopped working.
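
For what it's worth, the gateway-side checks boiled down to something like the commands below.  The slot/port numbers are just placeholders; point them at whatever controller actually carries the CAS trunk.

  ! Check the T1 controller for alarms, slips, and errored seconds
  show controllers t1 0/0/0

  ! Make sure the CAS voice ports are up and not stuck busied out
  show voice port summary

  ! Quick sanity check on the DSPs in the gateway
  show voice dsp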

That was the last 24 hours of my work life.  Everything looked good: no errors, no issues except randomly failing calls out of one particular site.  Sure, I could have re-routed the entire site out another gateway, but that would have taken significant time, even with BAT to swap out the CSSes and route everything the other direction.  It would also have put a huge load on the other T1 bundle and soaked up a crap ton of ports and DSPs.

So I sat there in discontent, wondering what was going on.  Logs looked good, traces were fairly clean, and still cause code 41.  I tried rapid long distance dialing and got a few calls to work.  My thoughts immediately went back to the carrier.  I even tried changing out the gateways, but since both T1 CAS circuits were failing that was also doomed to fail.  I was talking with another voice guy and he figured we should try resetting the T1 CAS.  I thought... why didn't I try this?  So simple...  My thought process had been that you only reset something that looks broken, but now I know better.  Reset it even if it appears to be working; who cares, since long distance isn't working anyway, right?  I reset the port, made about 10 calls, had the customer call me back, and everything was hunky-dory.
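
For reference, the reset itself is nothing fancy; it's just a shut/no shut on the controller.  A rough sketch is below, with 0/1/0 standing in for whatever slot/port the CAS T1 actually lives on.

  ! Bounce the T1 controller to clear whatever invisible state it was in
  configure terminal
   controller T1 0/1/0
    shutdown
    no shutdown
   end

  ! Watch it come back up and confirm the controller shows no alarms
  show controllers t1 0/1/0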

So what was the problem?  Damned if I know.  Neither of us did anything and the carrier sure didn't do anything.  A reset seemed to fix an invisible hangup on the T1.  Note to self: just reset the damn T1, even if it doesn't look like it needs it!  Also, who the heck still uses T1 CAS?  That's the first time I've seen a CAS in about 8 years.

Tuesday, November 3, 2015

Unity Connection integration with an analog PBX system

Recently I got the awesome experience of integrating Unity Connection into an analog setup.  While the customer does indeed use Cisco, the non-Cisco side is still in place for a few reasons and needs to continue to work.  Their old Unity box was decommissioned but then brought back online.  When the phones received calls, they would send the voicemail to the Unity Connection box as you would expect.  The problem was that, for some reason, the old Unity box was still being triggered when the messages button was pressed on the non-Cisco phones.

After poking around a bit I found they have a DMG, or Dialogic Media Gateway.  This had routes to Unity Connection, but they were listed second in the top-down list.  The first thing to fix was this: get rid of the old Unity route completely.  The issue is, being unfamiliar with the DMG, I was rolling the dice.  The configuration menus were not too difficult to figure out, but I still didn't fully understand everything.  It turns out, however, that my solution was correct.

The next step was to make PIMG ports on the Unity Connection box.  This allows the analog phones to get their MWI and to access their voicemail on demand when the messages button is pressed.  The one thing I still don't fully understand is how the DMG was reaching Unity Connection in the first place, since voicemails were being delivered there somehow without the PIMG ports.  My guess is that the call hit a route elsewhere once it got sent over to the Cisco side.  However, being on the clock since this was paid per hour, I didn't have much time to try to figure that aspect out.