FreeSWITCH Verto performance is low

This is my first question; thanks in advance for all your help!
I set up a FreeSWITCH server and call it from a web page. During the call, FreeSWITCH uses more than 13 percent CPU, while idle it uses less than 1 percent. If I call it with a SIP client instead, it uses about 4 percent CPU. Does anyone know why Verto costs roughly 9 percent more CPU? Some detailed information is below.
FreeSWITCH version: 1.7.0+git~20151231T160311Z~de5bbefdf0~64bit (git de5bbef 2015-12-31 16:03:11Z 64bit).
OS and hardware: Ubuntu 14.04 on an Intel i5 CPU with 16 GB RAM.
The SIP client used is Zoiper on Windows.

You did not say anything about the codecs you're using. If your WebRTC client uses Opus and FreeSWITCH has to transcode it, the workload looks quite plausible: Opus is a rather expensive codec in terms of the CPU effort needed for transcoding.
The same applies to the SIP client. The CPU load depends significantly on the codec encoding/decoding work during the call. In the ideal situation, both legs of a call use the same codec, and then your FreeSWITCH server is only busy sending and receiving RTP frames, without heavy processing of the payload (see the sketch below).
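For illustration, a rough sketch of aligning the codec preferences in FreeSWITCH's vars.xml so both legs can negotiate the same codec (assuming your endpoints all support Opus; the codec list here is only an example):

<!-- Prefer the same codec order for inbound and outbound legs so
     FreeSWITCH can relay RTP instead of transcoding. -->
<X-PRE-PROCESS cmd="set" data="global_codec_prefs=OPUS,G722,PCMU,PCMA"/>
<X-PRE-PROCESS cmd="set" data="outbound_codec_prefs=OPUS,G722,PCMU,PCMA"/>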
Keep in mind that the primary platform for FreeSWITCH is Debian Jessie; there may also be issues with the kernel or libraries in Ubuntu, as nobody has taken care to analyze and optimize for that platform.

Related

Network performance issues and slow tcp_write_xmit/tcp_ack syscalls with a lot of save_stack calls on OpenVZ kernel

I ran into trouble with bad network performance on CentOS. The issue was observed on the latest OpenVZ RHEL7 kernel (3.10-based) on a Dell server with 24 cores and a Broadcom 5720 NIC, no matter whether it was the host system or an OpenVZ container. The server receives RTMP connections and reproxies the RTMP streams to other consumers. Reads and writes were unstable and the streams froze periodically for a few seconds.
I started to check the system with strace and perf. strace affects the system heavily, and it seems that only perf can help. I used an OpenVZ debug kernel with debugfs enabled. The system spends too much time in the swapper process (according to perf data). I built a flame graph for the system under load (100 Mbit/s of data in, 200 Mbit/s out) and noticed that the kernel spent too much time in tcp_write_xmit and tcp_ack. On top of these calls I see save_stack calls.
On the other hand, I tested the same scenario on an Amazon EC2 instance (latest Amazon Linux AMI 2017.09) and perf doesn't show such issues. The total number of samples was 300,000; the system spends 82% of its time in swapper according to the perf samples, but net_rx_action (and consequently tcp_write_xmit and tcp_ack) within swapper takes only 1,797 samples (0.59% of the total). At the top of the net_rx_action call in the flame graph I don't see any calls related to stack traces.
The output for the OpenVZ system looks different: among 1,833,152 samples, 500,892 (27%) were in the swapper process, and 194,289 samples (10.5%) were in net_rx_action.
The full SVG of the calls on vzkernel7 is here, and the SVG of the EC2 instance's calls is here. You can download them and open them in a browser to interactively examine the flame graphs.
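For reference, flame graphs like these are typically captured along these lines (assuming Brendan Gregg's FlameGraph scripts, stackcollapse-perf.pl and flamegraph.pl, are on the PATH; the sampling rate and duration are arbitrary):

# sample all CPUs with call stacks at 99 Hz for 30 seconds
perf record -F 99 -a -g -- sleep 30
# fold the stacks and render an interactive SVG
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg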
So I want to ask for help, and I have a few questions.
Why doesn't the flame graph from the EC2 instance contain as many save_stack calls as the one from my server?
Does perf force the system to call save_stack, or is it some kernel setting? Can it be disabled, and how?
Does Xen on EC2 process all the tcp_ack and other calls for the guest? Is it possible that the host system on the EC2 server does some of the work and the guest system doesn't see it?
Thank you for your help.
I've read the kernel sources and have an answer to my questions.
The save_stack calls are caused by the Kernel Address Sanitizer (KASAN) feature, which was enabled in the OpenVZ debug kernel by the CONFIG_KASAN option. When this option is enabled, on each kmem_cache_free call the kernel calls __cache_free:
static inline void __cache_free(struct kmem_cache *cachep, void *objp,
                                unsigned long caller)
{
        /* Put the object into the quarantine, don't touch it for now. */
        if (kasan_slab_free(cachep, objp))
                return;

        ___cache_free(cachep, objp, caller);
}
With CONFIG_KASAN disabled, kasan_slab_free responds with false (check include/linux/kasan.h). The OpenVZ debug kernel was built with CONFIG_KASAN=y; the Amazon AMI kernel wasn't.
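For reference, the no-op stub in include/linux/kasan.h looks roughly like this in kernels of that era (the exact signature varies between versions):

/* Compiled in when CONFIG_KASAN is not set: nothing is quarantined,
   no stacks are saved, and __cache_free frees the object immediately. */
static inline bool kasan_slab_free(struct kmem_cache *s, void *object)
{
        return false;
}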

Are there routers available that allow you to introduce latency and configure BER for testing?

I am currently working on a client-server application. The client and server will be on a wireless network with limited bandwidth, and both could be moving. I need to simulate latency and BER issues in order to test that my application's performance does not degrade too much.
I was wondering if there are any routers available that would allow me to introduce latency and also increase or reduce the BER. If anyone knows of such a router that I can buy, or of software that I can install to simulate this on a LAN, please do answer.
Thanks.
You can try netem if your application runs on Linux.
With it, you can simulate packet delay, loss, corruption, etc.; a short sketch follows below.
For detailed information, please refer to:
http://www.linuxfoundation.org/collaborate/workgroups/networking/netem
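As a rough sketch (assuming the traffic leaves through eth0; netem expresses corruption and loss as percentages, which you would derive from your target BER and packet size):

# emulate 100 ms +/- 10 ms of delay, 1% packet loss, 0.1% corrupted packets
tc qdisc add dev eth0 root netem delay 100ms 10ms loss 1% corrupt 0.1%
# remove the emulation when the test is done
tc qdisc del dev eth0 root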

Soft real-time on Windows OS: what to consider?

What considerations should we have (both software and hardware) when we build a soft real-time application on Windows: a task that occurs every XXX milliseconds and that should be completed within YYY milliseconds? (Although the consequences of missing a deadline are bad, the application can still recover from a missed deadline, hence the "soft" real-time.)
A few questions that already come to my mind:
Are there registry settings that should be changed or looked at?
Is it better to use an external graphics card instead of onboard video?
Example of an expected answer:
You should read up on (and disable) the Nagle algorithm if you use TCP, as it can delay packet sending.
(This could maybe be turned into a community wiki.)
Consider using the Multimedia Class Scheduler Service (MMCSS).
From the documentation:
The Multimedia Class Scheduler service (MMCSS) enables multimedia applications to ensure that their time-sensitive processing receives prioritized access to CPU resources. This service enables multimedia applications to utilize as much of the CPU as possible without denying CPU resources to lower-priority applications.
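A minimal sketch of opting a worker thread into MMCSS (assuming the stock "Pro Audio" task class; avrt.h, linked against avrt.lib):

#include <windows.h>
#include <avrt.h>   /* AvSetMmThreadCharacteristics; link with avrt.lib */

/* Register the calling thread with MMCSS for the duration of its
   periodic, time-sensitive work. */
DWORD WINAPI RealtimeWorker(LPVOID arg)
{
        DWORD taskIndex = 0;
        HANDLE mmcss = AvSetMmThreadCharacteristics(TEXT("Pro Audio"), &taskIndex);
        if (mmcss == NULL) {
                /* MMCSS unavailable; fall back to a plain priority boost. */
                SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
        }

        /* ... the XXX-millisecond periodic task runs here ... */

        if (mmcss != NULL)
                AvRevertMmThreadCharacteristics(mmcss);
        return 0;
}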
Another option available to you is to adjust your thread priorities, but you need to be very careful not to get too aggressive with this.
Hardware-wise, will this be running on server-class equipment? If so, the usual steps apply: disable hyperthreading, Turbo Boost, and CPU C-states, and implement some level of CPU affinity for your critical processes.

How many requests per second can libmemcached handle?

I have a Linux server with 2 GB of memory and an Intel Core 2 Duo 2.4 GHz CPU, and I am developing a networking system. I use libmemcached/memcached to store and access packet info. I want to know how many requests per second libmemcached can handle on a plain Linux server. Thanks!
There are too many things that could affect the request rate (CPU speed, other hardware, drivers, the exact kernel version, request size, cache hit rate, etc. ad infinitum). There's no such thing as a "plain Linux server."
Since it sounds like you've got fixed hardware, your best bet is to test on the hardware you've got and see how well it performs under your desired load.
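A minimal sketch of such a measurement with the libmemcached C API (the server address, key count, and payload are placeholders; compile with -lmemcached):

#include <stdio.h>
#include <time.h>
#include <libmemcached/memcached.h>

int main(void)
{
        memcached_return_t rc;
        memcached_st *memc = memcached_create(NULL);
        memcached_server_st *servers =
                memcached_server_list_append(NULL, "127.0.0.1", 11211, &rc);
        memcached_server_push(memc, servers);
        memcached_server_list_free(servers);

        const int N = 100000;   /* number of requests to time */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
                char key[32], val[64];
                int klen = snprintf(key, sizeof key, "bench:%d", i);
                int vlen = snprintf(val, sizeof val, "payload-%d", i);
                memcached_set(memc, key, klen, val, vlen,
                              (time_t)0, (uint32_t)0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d sets in %.2f s = %.0f requests/s\n", N, secs, N / secs);
        memcached_free(memc);
        return 0;
}

libmemcached also ships a load-generation tool (memslap, later memaslap) that can drive the server from multiple threads, which is closer to a realistic workload than this single-threaded loop.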

How do I improve Tx performance in a USB device driver?

I developed a device driver for a USB 1.1 device on Windows 2000 and later with the Windows Driver Model (WDM).
My problem is the pretty bad Tx performance when using 64-byte bulk transfers. Depending on the USB host controller used, the maximum packet throughput is either 1000 packets (UHCI) or 2000 packets (OHCI) per second. I've developed a similar driver on Linux kernel 2.6 that achieves roughly 5000 packets per second.
The Linux driver uses up to 10 asynchronous bulk transfers while the Windows driver uses 1 synchronous bulk transfer. Comparing the two makes it clear why the performance is so bad, but I have already tried asynchronous bulk transfers as well without success (no performance gain).
Does anybody have tips and tricks on how to boost the performance on Windows?
I've now managed to speed up sending to about 6.6k messages/s. The solution was pretty simple: I just implemented the same mechanism as in the Linux driver.
So now I'm scheduling up to 20 URBs at once, and what can I say, it worked.
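For comparison, the Linux-side pattern being imitated looks roughly like this (a sketch: error paths and buffer refilling are trimmed, and NUM_URBS and the endpoint are illustrative):

#include <linux/slab.h>
#include <linux/usb.h>

#define NUM_URBS 20     /* keep several transfers in flight at once */
#define BUF_SIZE 64     /* 64-byte bulk packets, as in the question */

/* Resubmit each URB from its completion handler so the pipe never idles. */
static void tx_complete(struct urb *urb)
{
        if (urb->status == 0)
                usb_submit_urb(urb, GFP_ATOMIC); /* real code refills the buffer first */
}

static int start_tx(struct usb_device *udev, unsigned int ep)
{
        int i;

        for (i = 0; i < NUM_URBS; i++) {
                struct urb *urb = usb_alloc_urb(0, GFP_KERNEL);
                void *buf = kmalloc(BUF_SIZE, GFP_KERNEL);

                if (!urb || !buf)
                        return -ENOMEM;
                usb_fill_bulk_urb(urb, udev, usb_sndbulkpipe(udev, ep),
                                  buf, BUF_SIZE, tx_complete, NULL);
                usb_submit_urb(urb, GFP_KERNEL);
        }
        return 0;
}

The Windows equivalent is to keep a comparable number of asynchronous IRPs pending on the bulk pipe instead of issuing one synchronous transfer at a time.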
What kind of throughput are you getting? USB 1.1 is limited to 12 Mbit/s (about 1.5 MB/s) at full speed.
It might be a limitation that you'll have to live with. The one thing you must never do is starve the system of resources; I've seen so many poor driver implementations where the driver hogs system resources in an utter failure to increase its own performance.
My guess is that you're using the wrong API calls. Have you looked at the USB samples in the Win32 DDK?
