reducing lag and packet loss in online multiplayer games - amazon-ec2

Good day,
I have developed a turn-based multiplayer game in Unity, using a Nakama server deployed on an AWS EC2 instance.
Since I did not use their managed cloud, I had to do all of the server configuration myself, and I'm now facing a lot of lag, desync and packet loss even when the internet connection is not particularly bad. The socket gets disconnected on most devices, which results in even more packet loss.
Here is my netcode:
My game is turn-based, so I don't send any data inside the Update method. A player makes a handful of moves (always either 2 or 4, usually completed within a second or two), and each move is sent as a packet. Each packet has 5 payloads, which usually contain a small number of bytes (for example, a string such as "2.736" or "move").
For serialization I use Tiny JSON.
My server type:
Since I'm not using Nakama Enterprise, I can't create a cluster. I only have one instance of type t2.xlarge (4 CPUs, 16 GiB memory) deployed in Ireland, and I'm in Iraq.
Server configuration:
max_request_size_bytes: 8000
max_message_size_bytes: 8000
outgoing_queue_size: 32
read_buffer_size_bytes: 8192
write_buffer_size_bytes: 8192
Could someone please guide me towards fixing this issue? I think it all occurs because of a bad server setup.
Any help is much appreciated.

Related

Performance of Nats Jetstream

I'm trying to understand how Nats Jetstream scales and have a couple of questions.
How efficient is subscribing by subject to historic messages? For example, let's say I have a stream foo that consists of 100 million messages with a subject of foo.bar and then a single message with a subject of foo.baz. If I then make a subscription to foo.baz from the start of the stream, will something on the server have to perform a linear scan of all messages in foo, or will it be able to immediately seek to the foo.baz message?
How well does the system horizontally scale? I ask because I'm having issues getting Jetstream to scale much above a few thousand messages per second, regardless of how many machines I throw at it. Test parameters are as follows:
Nats Server 2.6.3 running on 4 core 8GB nodes
Single Stream replicated 3 times (disk or in-memory appears to make no difference)
500 byte message payloads
n publishers each publishing 1k messages per second
The bottleneck appears to be on the publishing side as I can retrieve messages at least as fast as I can publish them.
Publishing in NATS JetStream is slightly different than publishing in Core NATS.
Yes, you can publish a Core NATS message to a subject that is recorded by a stream and that message will indeed be captured in the stream, but in the case of the Core NATS publication, the publishing application does not expect an acknowledgement back from the nats-server, while in the case of the JetStream publish call, there is an acknowledgement sent back to the client from the nats-server that indicates that the message was indeed successfully persisted and replicated (or not).
So when you do js.Publish() you are actually making a synchronous relatively high latency request-reply (especially if your replication is 3 or 5, and more so if your stream is persisted to file, and depending on the network latency between the client application and the nats-server), which means that your throughput is going to be limited if you are just doing those synchronous publish calls back to back.
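To put rough numbers on this: a publisher that waits for each acknowledgement before sending the next message can complete at most one publish per round trip. The sketch below is plain arithmetic, not a NATS API, and the 1 ms round-trip time is an assumed figure purely for illustration. It also shows how the bound moves when several publishes are allowed to be outstanding at once, which is what the asynchronous approach described next allows.

```python
def max_sync_rate(rtt_seconds: float) -> float:
    """Upper bound on publishes/second for one publisher that waits for each ack."""
    return 1.0 / rtt_seconds


def max_windowed_rate(rtt_seconds: float, in_flight: int) -> float:
    """Upper bound when up to `in_flight` publishes may be awaiting acks at once."""
    return in_flight / rtt_seconds


if __name__ == "__main__":
    rtt = 0.001  # assumed 1 ms publish-to-ack round trip (illustrative only)
    print("sync bound (msgs/s):", max_sync_rate(rtt))
    print("bound with 100 in flight (msgs/s):", max_windowed_rate(rtt, 100))
```

These are ceilings from latency alone; the servers' replication and storage throughput will cap the real figure well before a large window does.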
If you want high throughput when publishing messages to a stream, you should use the asynchronous version of the JetStream publish call instead (in the Go client that is js.PublishAsync(), which returns a PubAckFuture).
However, in that case you must also remember to introduce some amount of flow control by limiting the number of 'in-flight' asynchronous publications you want to have at any given time. This is because you can always publish asynchronously much, much faster than the nats-server(s) can replicate and persist messages.
If you were to continuously publish asynchronously as fast as you can (e.g. when publishing the result of some kind of batch process), you would eventually overwhelm your servers, which is something you really want to avoid.
You have two options to flow-control your JetStream async publications:
Specify a max number of in-flight asynchronous publication requests as an option when obtaining your JetStream context, i.e. js, err := nc.JetStream(nats.PublishAsyncMaxPending(100))
Do a simple batch mechanism to check for the publications' PubAcks every so many asynchronous publications, like nats bench does: https://github.com/nats-io/natscli/blob/e6b2b478dbc432a639fbf92c5c89570438c31ee7/cli/bench_command.go#L476
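The idea behind both options is the same: never let more than a fixed number of unacknowledged publishes pile up. Here is a minimal sketch of that pattern in Python using only the standard library. It does not use any NATS client; fake_publish is a hypothetical stand-in that simulates a publish whose ack arrives after a short delay, and max_pending plays the role of the in-flight limit.

```python
import asyncio


async def fake_publish(msg: bytes) -> int:
    """Hypothetical stand-in for an async publish: returns after a simulated ack delay."""
    await asyncio.sleep(0.001)
    return len(msg)


async def publish_all(messages, max_pending: int = 100):
    """Publish every message while keeping at most `max_pending` acks outstanding."""
    window = asyncio.Semaphore(max_pending)
    in_flight = 0
    peak = 0
    acked = 0

    async def publish_one(msg: bytes) -> None:
        nonlocal in_flight, peak, acked
        try:
            in_flight += 1
            peak = max(peak, in_flight)
            await fake_publish(msg)
            acked += 1
        finally:
            in_flight -= 1
            window.release()  # free a slot once the ack (or an error) arrives

    tasks = []
    for msg in messages:
        await window.acquire()  # the producer waits here when the window is full
        tasks.append(asyncio.create_task(publish_one(msg)))
    await asyncio.gather(*tasks)
    return acked, peak


if __name__ == "__main__":
    acked, peak = asyncio.run(publish_all([b"x" * 500] * 1000, max_pending=100))
    print("acked:", acked, "peak in flight:", peak)
```

The producer loop blocks on the semaphore rather than dropping messages, so a slow server slows the publisher down instead of being flooded.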
About the expected performance: using async publications allows you to really get the throughput that NATS and JetStream are capable of. A simple way to validate or measure performance is to use the nats CLI tool (https://github.com/nats-io/natscli) to run benchmarks.
For example, you can start with a simple test: nats bench foo --js --pub 4 --msgs 1000000 --replicas 3 (in-memory stream with 3 replicas, 4 goroutines each with its own connection, publishing 128-byte messages in batches of 100), and you should get a lot more than a few thousand messages per second.
For more information and examples of how to use the nats bench command you can take a look at this video: https://youtu.be/HwwvFeUHAyo
Would be good to get an opinion on this. I have a similar behaviour and the only way to achieve higher throughput for publishers is to lower replication (from 3 to 1) but that won't be an acceptable solution.
I have tried adding more resources (cpu/ram) with no success on increasing the publishing rate.
Also, scaling horizontally did not make any difference.
In my situation, I am using the bench tool to publish to JetStream.
For an R3 filestore you can expect ~250k small msgs per second. If you utilize synchronous publish that will be dominated by RTT from the application to the system, and from the stream leader to the closest follower. You can use windowed intelligent async publish to get better performance.
You can get higher numbers with memory stores, but again will be dominated by RTT throughout the system.
If you give me a sense of how large your messages are, we can show you some results from nats bench against the demo servers (R1) and NGS (R1 & R3).
For the original question regarding filtered consumers, >= 2.8.x will not do a linear scan to retrieve foo.baz. We could also show an example of this as well if it would help.
Feel free to join the slack channel (slack.nats.io) which is a pretty active community. Even feel free to DM me directly, happy to help.

Azure Stream Analytics is too slow - also time values are irrelevant

We want to migrate our dedicated servers to the Azure platform for easier scaling, and we have investigated a lot of Azure services for our needs. One of the Azure services that we want to use is Azure Stream Analytics (ASA).
We've set up a few Azure services to run some tests (what they are for is not important for now). Here is the structure:
SimpleApp (Sending Request, Not In Azure) -> Event Hub 1 (EH1) -> ASA -> Event Hub 2 (EH2) -> Function App (FA)
SimpleApp sends a simple HTTP GET request to a classic dedicated server named TESTSERVER. This takes 100-150 ms at most, and it represents our start time. After that it sends the message to EH1.
ASA's query is simple like this: SELECT * INTO [Output] FROM [Input]
Function App sends a simple HTTP GET request to TESTSERVER to mark the finish time.
We were shocked when we saw the results in our TESTSERVER logs. It took 4000-5000 ms!
Then we started to investigate the issue. We checked values like EventEnqueuedUtcTime and EventProcessedUtcTime to identify which block causes this slowness, but these time values are totally unreliable. For example, EventEnqueuedUtcTime should be earlier than EventProcessedUtcTime, but it is not! This suggests to us that the clocks may differ between Azure components and that we cannot use these values for measurement. Am I wrong?
Anyway, after this we suspected that maybe the last Azure Function App may cause this slowness. We thought that maybe Function App's Event Hub Trigger does not work well. So we designed a new test environment:
SimpleApp (Sending Request, Not In Azure) -> Event Hub 1 (EH1) -> Function App (FA1) -> Event Hub 2 (EH2) -> Function App 2 (FA2)
Second shock... It took only ~400 ms in total!
Then we performed a lot of tests with different architectures that include ASA, but all of them are too slow for us.
Have you experienced any performance issues with ASA? Could you please share your experience and the total time your flows take?
Best regards.
There is some latency when merging all the events from the Event Hub into chronological order.
ASA will visit all partitions from EH, get the data and organize the events into chronological order. This means that data must arrive at all partitions in the EH. I think this will also explain the strange behavior you are seeing with the EventProcessedUtcTime, it might be that because the events are ordered, the logical processing time is before the actual arrival time. Although I'm unsure about this because I do not know the inner workings of ASA.
This latency will increase with the number of partitions used, especially when the dataflow is slow.
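To illustrate why a chronological merge has to wait on the slowest partition, here is a small conceptual sketch in Python. It is not how ASA is implemented (I don't know its internals); it only shows the general constraint that an ordered merge can safely emit events no later than the earliest "latest timestamp" seen across all partitions. The function name and the data layout are made up for the example.

```python
from typing import Dict, List


def releasable(partitions: Dict[str, List[int]]) -> List[int]:
    """Return the timestamps that can be emitted in order right now.

    `partitions` maps a partition id to the ascending timestamps received from
    it so far. An event is only safe to emit once every partition has delivered
    something at least as late; otherwise an earlier event could still show up
    from a lagging partition.
    """
    if any(not events for events in partitions.values()):
        return []  # a partition with no data yet blocks all output
    watermark = min(events[-1] for events in partitions.values())
    merged = sorted(t for events in partitions.values() for t in events)
    return [t for t in merged if t <= watermark]


if __name__ == "__main__":
    # p1 is lagging: nothing after timestamp 4 can be emitted yet.
    print(releasable({"p0": [1, 2, 3, 9], "p1": [4]}))
    # p1 has delivered nothing at all: nothing can be emitted.
    print(releasable({"p0": [1, 2, 3], "p1": []}))
```

With a slow data flow and many partitions, some partition is almost always lagging, which is consistent with the latency growing as partitions are added.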
You can sidestep the merger by partitioning on the field partitionid from EH.
Make sure you are sending the data to the correct partition in EH as well.
More information can be found here at the Stream Analytics blog.

How to debug performance issue/optimize your meteor app

I just deployed my Meteor app onto a production server on Digital Ocean.
I noticed that for about 7500 documents, it takes about 3-5 seconds to fully fetch the objects (selecting only 3 fields) and populate the autocomplete data. I believe it should be close to instantaneous for that amount of data, so I am curious how I can debug performance issues from here and optimize further. How should I go about debugging performance issues in a Meteor app? I looked at the network tab, but nothing seems to take more than a second. I am not sure why it takes 3-5 seconds for the search bar with an autocomplete feature to become ready. On closer inspection, populating the autocomplete fields is instantaneous, and the time until the subscribe function's callback is called is about 3 to 5 seconds.
I've already looked into Kadira, but it reported that everything was complete within milliseconds, so I am confused.
possibly related: Meteor's subscription and sync are slow
After all, is 3-5 seconds for 7800 documents with 2 fields reasonable?
Let me tell you what's really happening here.
Kadira shows the time taken to fetch the data from the server and queue it to the network. So, 500 - 700 ms is reasonable for that.
So, this 3-5 second latency is the network latency, that is, the time taken to send the data to the client over the network. That's quite normal for 7500+ documents, even with only three fields, over DDP.
So, my suggestion is to do the search on the server and use something like Search Source for that.
With that, only the data that is actually required gets sent to the client, which reduces the latency and saves CPU in your app.
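As a rough illustration of the idea (not Meteor or Search Source code), the sketch below does the lookup on the "server" side and returns only a handful of matches, instead of shipping every document to the client. The function name, the sorted in-memory list and the limit of 10 are all assumptions for the example; a real app would query the database instead.

```python
import bisect


def search_prefix(sorted_names, prefix, limit=10):
    """Return up to `limit` names that start with `prefix`.

    `sorted_names` must be sorted ascending so that all matches are contiguous.
    """
    start = bisect.bisect_left(sorted_names, prefix)  # first candidate >= prefix
    results = []
    for name in sorted_names[start:start + limit]:
        if not name.startswith(prefix):
            break  # past the block of matching names
        results.append(name)
    return results


if __name__ == "__main__":
    names = sorted(["alpha", "alpine", "beta", "alps", "gamma"])
    print(search_prefix(names, "alp"))
```

The client then receives at most `limit` short strings per keystroke rather than the whole collection up front.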

Azure Redis cache - timeouts on GET calls

We've got several web and worker roles in Azure connecting to our Azure Redis cache via the StackExchange.Redis library, and we're receiving regular timeouts that are making our end-to-end solution grind to a halt. An example of one of them is below:
System.TimeoutException: Timeout performing GET stream:459, inst: 4, mgr: Inactive, queue: 12, qu=0, qs=12, qc=0, wr=0/0, in=65536/0
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1785
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 79
   at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 1346
   at OptiRTC.Cache.RedisCacheActions.<>c__DisplayClass41.<Get>b__3() in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 104
   at Polly.Retry.RetryPolicy.Implementation(Action action, IEnumerable`1 shouldRetryPredicates, Func`1 policyStateFactory)
   at OptiRTC.Cache.RedisCacheActions.Get[T](String key, Boolean allowDirtyRead) in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 107
   at OptiRTC.Cache.RedisCacheAccess.d__e4.MoveNext() in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheAccess.cs:line 1196; TraceSource 'WaWorkerHost.exe' event
All the timeouts have different queue and qs numbers, but the rest of the messages are consistent. These StringGet calls are across different keys in the cache. In each of our services, we use a singleton cache access class with a single ConnectionMultiplexer that is registered with our IoC container in the web or worker role startup:
container.RegisterInstance<ICacheAccess>(cacheAccess);
In our implementation of ICacheAccess, we're creating the multiplexer as follows:
ConfigurationOptions options = new ConfigurationOptions();
options.EndPoints.Add(serverAddress);
options.Ssl = true;
options.Password = accessKey;
options.ConnectTimeout = 1000;
options.SyncTimeout = 2500;
redis = ConnectionMultiplexer.Connect(options);
where the redis object is used throughout the instance. We've got about 20 web and worker role instances connecting to the cache via this ICacheAccess implementation, but the management console shows an average of 200 concurrent connections to the cache.
I've seen other posts that reference using version 1.0.333 of StackExchange.Redis, which we're doing via NuGet, but when I look at the actual version of the StackExchange.Redis.dll reference added, it shows 1.0.316.0. We've tried adding and removing the NuGet reference as well as adding it to a new project, and we always get the version discrepancy.
Any insight would be appreciated. Thanks.
Additional information:
We've upgraded to 1.0.371. We have two services that each access the same cache object at different intervals: one that edits it and occasionally reads it, and one that reads it several times a second. Both services are deployed with the same caching code and StackExchange.Redis library version. I almost never see timeouts in the service that edits the object, but I get timeouts between 50 and 75% of the time in the service that reads it. The timeouts have the same format as the one shown above, and they continue to occur after wrapping the db.StringGet call in a Polly retry block that handles both RedisException and System.TimeoutException and retries once after 500 ms.
We contacted Microsoft about this issue, and they confirm that they see nothing in the Redis logs that indicate an issue on the Redis service side. Our cache miss % is extremely low on the Redis server, but we continue to get these timeouts, which substantially hinder our application's functionality.
In response to the comments, yes, we always have a number in qs and never in qc. We always have a number in the first part of the in and never in the second.
Even more additional information:
When I run a service with fewer instances at a higher CPU, I get significantly more of these timeout errors than when instances are running at lower CPUs. More specifically, I pulled some numbers from our services this morning. When they were running at around 30% CPU, I saw very few timeout issues - just 42 over 30 minutes. When I removed half the instances and they started to run at around 60-65% CPU, the rate increased 10-fold to 536 over 30 minutes.
I know this thread is months old but I think my own experiences can add some value here. I had the same problem with Azure Redis Cache (timeouts on Gets) but realized that it was almost exclusively happening on Gets where the string value was relatively large (> 250K in length). I implemented gzip on both Gets and Sets (when the string value is large) and now I almost never get a timeout.
Even if this doesn't solve your particular problem, it's probably good practice to compress the values in general to reduce costs and improve performance.
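For anyone wanting to try the compression approach, here is a minimal sketch of the idea in Python using only the standard library. It is not StackExchange.Redis code: the one-byte flag format and the function names are my own assumptions, and the 250,000-byte threshold is just borrowed from the figure above. The encoded bytes are what you would pass to the cache's set call, and decode is applied to whatever the get call returns.

```python
import gzip

COMPRESSED = b"\x01"  # hypothetical flag byte: payload is gzip-compressed
PLAIN = b"\x00"       # hypothetical flag byte: payload is stored as-is


def encode(value: str, threshold: int = 250_000) -> bytes:
    """Serialize `value`, compressing it only when it exceeds `threshold` bytes."""
    raw = value.encode("utf-8")
    if len(raw) > threshold:
        return COMPRESSED + gzip.compress(raw)
    return PLAIN + raw


def decode(stored: bytes) -> str:
    """Reverse `encode`, decompressing when the flag byte says so."""
    flag, body = stored[:1], stored[1:]
    if flag == COMPRESSED:
        body = gzip.decompress(body)
    return body.decode("utf-8")


if __name__ == "__main__":
    big = "abc" * 200_000
    stored = encode(big)
    print(len(big), "->", len(stored), "bytes")
    assert decode(stored) == big
```

Small values skip compression so that frequent tiny reads do not pay the CPU cost; how much large values shrink depends entirely on how repetitive the data is.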
Regarding the version numbers, it seems that the AssemblyVersion has been locked at 1.0.316 for the last several releases, but the AssemblyFileVersion has been updated to match the NuGet package version. For now, I recommend ignoring AssemblyVersion and just using AssemblyFileVersion to ensure you have the correct binary.
Please contact us at AzureCache@microsoft.com if you are still seeing timeouts using Azure Redis Cache.

Maximum concurrent Socket.IO connections

This question has been asked previously but not recently and not with a clear answer.
Using Socket.io, is there a maximum number of concurrent connections that one can maintain before you need to add another server?
Does anyone know of any active production environments that are using websockets (particularly socket.io) on a massive scale? I'd really like to know what sort of setup is best for maximum connections?
Because Websockets are built on top of TCP, my understanding is that unless ports are shared between connections you are going to be bound by the 64K port limit. But I've also seen reports of 512K connections using Gretty. So I don't know.
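On the 64K port question: a TCP connection is identified by the combination of source address, source port, destination address and destination port, so a server does not use up one of its own ports per client. Every accepted connection shares the listening port; the roughly 64K limit applies to how many connections a single client address can open to one server address and port. The small Python demo below, run entirely on localhost with made-up names, shows several accepted connections all reporting the same server-side port.

```python
import socket


def accept_many(n: int = 3):
    """Open n client connections to one listening socket on localhost.

    Returns the listening port, the server-side local port of each accepted
    connection, and the client-side local port of each connection.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(n)
    listen_port = server.getsockname()[1]

    clients, accepted = [], []
    try:
        for _ in range(n):
            clients.append(socket.create_connection(("127.0.0.1", listen_port)))
            conn, _addr = server.accept()
            accepted.append(conn)
        server_side_ports = [conn.getsockname()[1] for conn in accepted]
        client_side_ports = [c.getsockname()[1] for c in clients]
    finally:
        for s in clients + accepted + [server]:
            s.close()
    return listen_port, server_side_ports, client_side_ports


if __name__ == "__main__":
    port, server_ports, client_ports = accept_many(3)
    print("listening port:", port)
    print("server-side ports:", server_ports)
    print("client-side ports:", client_ports)
```

In practice the per-server ceiling tends to come from file descriptors, memory and CPU rather than from port numbers.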
This article may help you along the way: http://drewww.github.io/socket.io-benchmarking/
I wondered the same thing, so I ended up writing a small test (using XHR-polling) to see when the connections started to fail (or fall behind). I found (in my case) that the sockets started acting up at around 1400-1800 concurrent connections.
This is a short gist I made, similar to the test I used: https://gist.github.com/jmyrland/5535279
I tried to use socket.io on AWS, and at most I could keep around 600 connections stable.
I found out that this is because socket.io uses long polling first and upgrades to websocket later.
After I set the config to use websocket only, I could keep around 9000 connections.
Set this config at client side:
const socket = require('socket.io-client')
const conn = socket(host, { upgrade: false, transports: ['websocket'] })
GO THROUGH THE COMMENT OF THIS ANSWER BEFORE PROCEEDING FURTHER
The question asks about socket.io sockets, while the answer is for native sockets. These changes are dangerous as they apply to everything on the system, not just socket.io sockets. Besides, these days the network is never the bottleneck for socket.io. Do not make these changes to your system without understanding the implications first.
For 300k+ concurrent connections:
Set these variables in /etc/sysctl.conf:
fs.file-max = 10000000
fs.nr_open = 10000000
Also, change these variables in /etc/security/limits.conf:
* soft nofile 10000000
* hard nofile 10000000
root soft nofile 10000000
root hard nofile 10000000
And finally, increase TCP buffers in /etc/sysctl.conf, too:
net.ipv4.tcp_mem = 786432 1697152 1945728
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216
for more information please refer to this.
This guy appears to have succeeded in having over 1 million concurrent connections on a single Node.js server.
http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/
It's not clear to me exactly how many ports he was using though.
After making these changes, you can check them by running this command in a terminal:
sysctl -a | grep file
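The sysctl output shows the system-wide values. Each open socket also counts against the per-process open-file limit (the nofile setting from limits.conf), so it can be worth checking what the server process itself actually sees. A small sketch using Python's standard resource module (Unix only; the function name is just for the example):

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft nofile limit:", soft)
    print("hard nofile limit:", hard)
```

If the soft limit printed here is still a small default, the process cannot hold more sockets than that, whatever fs.file-max says.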
I would like to provide yet another answer in 2023.
We only use the websocket transport in socket.io-client. We have done 2 types of performance tests:
My test team uses JMeter to test up to 5000 concurrent connections. Due to the nature of our product, 5000 connections is enough for us, so we didn't go higher.
I use https://a.testable.io/ for another performance test. The reason I use Testable (this is NOT a sales pitch for them lol) is that I can choose ws clients from different locations, e.g. I chose 3 different locations in NA and one location in Asia. I believe this is closer to a real-life scenario than just running a test script from my local machine (which I also do). This kind of test costs money; to quote their technical support after I ran my test, "I see you ran a 20,000 user test successfully today too, great! Less than $20 for a test of this size is by far the best pricing out there :)."
BTW, you can also refer to https://ably.com/topic/scaling-socketio, which is the most recently published article about socket.io performance I can find.
So in summary I would argue that if you only use websocket, 5000 to 10,000 concurrent connections should not be too hard to achieve.
