SIP Monitoring and Troubleshooting

Operating a VoIP system with a focus on great customer experience can be quite challenging, especially if you run a heterogeneous network with lots of different SIP clients (various software clients, all kinds of SIP phones and terminal adapters, and especially IP PBXs). SIP clients are known to have all kinds of quirks and implementation errors, and if you don’t control them yourself (e.g. with a central device provisioning tool), configuration errors introduced by your customers come into play as well. Putting the right values into the configuration interface of a client is not always straightforward; it sometimes takes an engineering degree to find out what’s up with parameters like registrar, outbound-proxy, session-timers, codec ordering etc. Flexibility is not always key, especially when it comes to end user interfaces. That’s why Skype is so successful: “it just works”.

Anyway, if a customer uses your VoIP service (especially if it’s a paid service), it just needs to work, and if it doesn’t, you’d better pin down the cause of the error as soon as possible and provide a solution to the customer, otherwise she’ll turn away from you quite quickly.

The poor man’s approach

In the past, VoIP troubleshooting went somewhere along these lines (we’ve been there and done that):

  • Ask the customer when approximately she did the failed call or failed to register her phone
  • Grep the (hopefully extensive) log files for hints pointing to the error
  • If nothing obvious comes up there, start a tcpdump on the system and ask the customer to try the call again
  • Copy the resulting trace to your local machine and try to extract the relevant packets from a potentially HUGE trace
  • Analyze the call, take your actions, and if necessary repeat the process
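
The "extract the relevant packets" step is where most of the time goes. As a rough sketch of what that filtering amounts to, the Python below assumes the SIP messages have already been pulled out of the trace as plain text (decoding the capture file itself is not shown), and it ignores folded headers. The function names `header` and `messages_for_user` are made up for this illustration and do not belong to any particular tool.

```python
def header(message, *names):
    """Return the value of the first header matching one of `names`, or None."""
    wanted = {n.lower() for n in names}
    # Headers end at the first blank line; line 0 is the request/status line.
    head = message.replace("\r\n", "\n").split("\n\n", 1)[0]
    for line in head.split("\n")[1:]:
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in wanted:
            return value.strip()
    return None


def messages_for_user(messages, user):
    """Keep every message of any dialog where `user` shows up in From or To."""
    needle = f"sip:{user}@".lower()
    call_ids = set()
    for m in messages:
        # "f" and "t" are the compact forms of From and To.
        parties = (header(m, "From", "f") or "") + (header(m, "To", "t") or "")
        if needle in parties.lower():
            call_ids.add(header(m, "Call-ID", "i"))
    call_ids.discard(None)
    # Second pass: keep all messages sharing one of the matching Call-IDs.
    return [m for m in messages if header(m, "Call-ID", "i") in call_ids]
```

Matching on Call-ID in a second pass keeps the whole transaction together, including responses where the customer's URI is written differently than in the request.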

This approach has some obvious flaws. First, your support agent needs direct access to the system and the proper rights to start a trace.

It is also quite time consuming, and it hardly paints a professional picture if you need to ask your customer to take some action in order for you to find the problem. It’s also a heavily manual process that requires quite some technical expertise to pull off, and if the support agent needs to escalate the issue to 2nd Level Support, it involves uploading SIP traces somewhere, or even worse, sending them back and forth by email.

External Monitoring Tools to the rescue!

Due to the huge overhead of the traditional troubleshooting approach, a whole new ecosystem has emerged around external SIP monitoring and analysis. New start-ups were created to tackle these issues, and established network monitoring vendors pushed into the market, providing traffic analyzer solutions to ease the pain of VoIP support. The problem for small VoIP operators is that these solutions can be horrendously expensive. In the telephony industry, licensing models are broken down to a per-line or per-subscriber price, and it’s not uncommon for the line price of the analyzer tools to exceed the line price of the VoIP soft-switch, which is simply not feasible.

However, since open source projects are increasingly gaining a foothold in the VoIP market, it’s quite natural that open source VoIP monitoring and troubleshooting tools are starting to appear as well. The most promising project in the open source landscape is Homer, an open source SIP capturing server. Since it can passively wiretap traffic on mirrored switch ports, it integrates nicely into a VoIP network environment without interfering with existing networking elements.

Using such tools, the support process changes significantly, because all SIP packets are constantly captured on the network and can be filtered and viewed on web interfaces. Most of them, like Homer, also visually present the call flows of the SIP packets, so it gets very easy to spot issues between the involved hops.

Instead of having to involve the customer in the troubleshooting process, it becomes something like this:

  • Filter for calls or registrations of the respective customer
  • Visually check the call flows and packets for obvious issues
  • If necessary, grep the logs for specific calls
  • Take actions and repeat the process if necessary
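
To give an idea of what "visually check the call flows" boils down to, here is a small Python sketch that groups captured messages by Call-ID and reduces each one to a single arrow line. It is not how Homer or any other product works internally; the input format (a list of `(source, destination, raw_message)` tuples) and the function names are assumptions made for this illustration.

```python
from collections import defaultdict


def call_id(raw):
    """Return the Call-ID header value (long or compact form "i"), or None."""
    head = raw.replace("\r\n", "\n").split("\n\n", 1)[0]
    for line in head.split("\n")[1:]:  # skip the request/status line
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("call-id", "i"):
            return value.strip()
    return None


def summary(raw):
    """Reduce a SIP message to its method or to "<status code> <reason>"."""
    first = raw.split("\n", 1)[0].strip()
    parts = first.split(" ", 2)
    if parts[0].startswith("SIP/"):
        return " ".join(parts[1:])  # response: status code and reason phrase
    return parts[0]                 # request: method name


def call_flows(packets):
    """Map each Call-ID to its messages, in capture order, as arrow lines."""
    flows = defaultdict(list)
    for src, dst, raw in packets:
        cid = call_id(raw)
        if cid is not None:
            flows[cid].append(f"{src} -> {dst}: {summary(raw)}")
    return dict(flows)
```

Printing the lines of one flow gives a crude text version of the ladder diagram a web interface would draw, which is often enough to see which hop rejected a call.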

If more people need to be involved in the troubleshooting process, only the link to the call flow in question needs to be shared.

However, the problem with such tools is that they can only provide an external view of a VoIP system, because in most cases it’s not possible to hook into the internal communication of a VoIP soft-switch.