Firestore query issues with 500 concurrent clients - Go

I am doing some stress testing on my project and need help understanding the behavior. I have a web server that accepts JSON data from users and stores it in a Firestore collection. Users can query this data. The document JSON has only two fields, id1 and id2, both strings. As part of my stress test I start 500 threads to mimic 500 clients, each of which queries the collection for the documents where id1 == thread_id, like this:
query := client.Collection("mycollection").Where("id1", "==", myID) // the comparison operator is passed as a string
iter := query.Documents(ctx)
snapList, err := iter.GetAll()
I see two issues:
Some of these queries take very long, up to 20 seconds, to return.
Some of the queries fail with a connection error / I/O timeout. I am using the Go SDK.
"message":"error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp x.x.x.x:443: i/o timeout""}
As per the Firestore documentation, up to 1 million concurrent clients are allowed. Then why am I getting issues with just 500? Even running this test on an empty collection I observe the same behavior. Is there some other rate limit that I am missing?

When adding load to Firestore it is recommended to follow the 500/50/5 rule, as explained in the documentation on ramping up traffic:
You should gradually ramp up traffic to new collections or lexicographically close documents to give Cloud Firestore sufficient time to prepare documents for increased traffic. We recommend starting with a maximum of 500 operations per second to a new collection and then increasing traffic by 50% every 5 minutes. You can similarly ramp up your write traffic, but keep in mind the Cloud Firestore Standard Limits. Be sure that operations are distributed relatively evenly throughout the key range. This is called the "500/50/5" rule.
So you might want to start with a lower number of threads, and then increase 50% every 5 minutes until you reach the desired load. Some of the SDKs even have support classes for this, such as the BulkWriter in Node.js.
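In Go you can apply the same idea in the load generator itself by capping the query rate and raising the cap by 50% every 5 minutes, rather than launching all 500 goroutines at full speed. A rough sketch under the question's own setup (the collection name and id1 scheme come from the question; the starting rate of 500 ops/sec follows the quoted guideline, and the ramp logic here is just one way to do it):
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"cloud.google.com/go/firestore"
)

// rampUp issues the test queries at a rate that starts at 500 ops/sec and
// grows by 50% every 5 minutes, per the 500/50/5 guideline, instead of
// starting all 500 simulated clients at full speed at once.
func rampUp(ctx context.Context, client *firestore.Client) error {
	rate := 500.0 // starting operations per second

	bump := time.NewTicker(5 * time.Minute)
	defer bump.Stop()
	limiter := time.NewTicker(time.Duration(float64(time.Second) / rate))
	defer limiter.Stop()

	for i := 0; ; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-bump.C:
			rate *= 1.5 // 50% more traffic every 5 minutes
			limiter.Reset(time.Duration(float64(time.Second) / rate))
		case <-limiter.C:
			id := fmt.Sprintf("%d", i%500) // rotate through the 500 simulated client IDs
			go func(id string) {
				_, err := client.Collection("mycollection").
					Where("id1", "==", id).
					Documents(ctx).GetAll()
				if err != nil {
					log.Printf("query for id1=%s failed: %v", id, err)
				}
			}(id)
		}
	}
}

func main() {
	ctx := context.Background()
	client, err := firestore.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	log.Fatal(rampUp(ctx, client))
}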

Related

Best practices for loading large data from Oracle to Tabular model

We have created some new SSAS Tabular models which fetch data directly from Oracle. But after some testing, we found that with real customer data (a few million rows), processing times get close to 4 hours. Our goal is to keep them under about 15 minutes (due to existing system performance). We fetch from Oracle tables, so query performance is not the bottleneck.
Are there any general design guides/best practices to handle such a scenario?
Check your application-side array fetch size, as you could be experiencing network latency.
Array fetch size note:
As per the Oracle documentation the Fetch Buffer Size is an application side memory setting that affects the number of rows returned by a single fetch. Generally, you balance the number of rows returned with a single fetch (a.k.a. array fetch size) with the number of rows needed to be fetched.
A low array fetch size compared to the number of rows needed to be returned will manifest as delays from increased network and client side processing needed to process each fetch (i.e. the high cost of each network round trip [SQL*Net protocol]).
If this is the case, on the Oracle side you will likely see very high waits on “SQL*Net message from client”. [This wait event is posted by the session when it is waiting for a message from the client to arrive. Generally, this means that the session is just sitting idle, however, in a Client/Server environment it could also mean that either the client process is running slow or there are network latency delays. The database performance is not degraded by high wait times for this wait event.]
As I like to say: "SQL*Net is a chatty protocol"; so even though Oracle may be done with its processing of the query, excessive network round-trips result in slower response times on the client side. One should expect that a low array fetch size may be contributing to the slowness if the elapsed time to get the data into the application is much longer than the elapsed time for the DB to run the SQL; in this case app-side processing time can also be a factor contributing to the slowness [you can look into app-specific ways to troubleshoot/tune app-side processing].
Array fetch size is not an attribute of the Oracle account nor is it an Oracle side session setting. Array fetch size can only be set at the client; there is no DB setting for the array fetch size the client will use. Every client application has a different mechanism for specifying the array fetch size:
Informatica: ?? config. file param ??? setting at the connection or result set level ??
Cognos: http://www-01.ibm.com/support/docview.wss?uid=swg21981559
SQL*Plus: set arraysize n
Java/JDBC: setFetchSize(int rows), a method on the Statement, PreparedStatement, CallableStatement, and ResultSet objects; or the "defaultRowPrefetch" property set via the connection Properties object's put method. See http://download.oracle.com/otn_hosted_doc/jdeveloper/905/jdbc-javadoc/oracle/jdbc/OracleDriver.html and, for another reference on Oracle JDBC DefaultRowPrefetch, http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-faq-090281.html
.NET (Oracle .NET Developer's Guide): the FetchSize property represents the total memory size in bytes that ODP.NET allocates to cache the data fetched from a database round trip. The FetchSize property can be set on the OracleCommand, OracleDataReader, or OracleRefCursor object, depending on the situation. It controls the fetch size for filling a DataSet or DataTable using an OracleDataAdapter.
ODBC driver: ?? something like: SetRowsetSize
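For what it's worth, here is a rough sketch of the same setting from a Go program. The godror driver and its FetchArraySize query option are my assumptions here, not something from the original answer, which only covers the clients listed above:
package main

import (
	"context"
	"database/sql"
	"log"

	"github.com/godror/godror" // assumed driver choice; any Oracle driver with a fetch-size knob will do
)

func main() {
	// Placeholder DSN; replace with your own connection details.
	db, err := sql.Open("godror", `user="scott" password="tiger" connectString="dbhost:1521/orclpdb1"`)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()
	// Ask the driver for 1000 rows per round trip instead of the default,
	// trading a little client memory for far fewer SQL*Net round trips.
	rows, err := db.QueryContext(ctx,
		"SELECT id, payload FROM big_table", // placeholder query
		godror.FetchArraySize(1000))         // assumed option name in the godror driver
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var payload string
		if err := rows.Scan(&id, &payload); err != nil {
			log.Fatal(err)
		}
		// hand the row to the consumer here
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}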

Load big amount of data - Allowed memory size exhausted

I have a Java Spring Boot application (I'll call it A1) that is connected to RabbitMQ. A1 receives data and saves it to a MySQL database (I'll call the database DB1). You can imagine this data as football matches with their markets and outcomes. We receive data for the next 10 days via the A1 app, and that data is stored in the database.
One more thing worth emphasizing: every football match has 4 markets, and every market has 7 outcomes.
I will explain what DB1 looks like. There are 3 tables worth mentioning besides the others: matches, markets, and outcomes. They are related as follows: matches 1..* markets, markets 1..* outcomes.
The data received over A1 is constantly updating (every second some update is received for the football events from the current moment up to 2 hours ahead).
There is another PHP Symfony application (I'll call it S1). This application serves as a REST API.
There is one more frontend application (I'll call it F1) that is communicating with S1 over HTTP in order to retrieve data from the database.
The F1 application sends an HTTP request to S1 to retrieve this data (matches with markets and outcomes), but the time frame is from the current moment up to 7 days ahead (a business requirement).
When an HTTP request is sent to S1, an error occurs because there are over 10,000 football matches plus bets and outcomes.
PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in.
I am considering two options to solve this issue. If neither of them is good enough, I would appreciate any other suggested solution.
Option 1 - F1 iterates per day, sending 7 asynchronous HTTP requests to S1 to retrieve the data for all 7 days.
Option 2 - F1 sends an HTTP request to S1. S1 returns data only for today, and the next 6 days are sent over a socket, iterating per day, using https://pusher.com/ or something similar.
One more thing to emphasize: we currently count about two of these HTTP requests per second, and that number tends to grow.
10K matches turning into 134 MB of data? That is more than 13 KB for each record... Likely you are making your data structure too flat, duplicating metadata about matches/bets/etc. into every single flat record. Try making your objects hierarchical, having a match contain its bets instead of using single-row objects.
If not, then you have an inefficiency in your processing of the data that we cannot diagnose remotely.
You would do even better if you did more processing server-side instead of sending the raw data over the wire. The more you can answer questions inside of S1 instead of sending the data to the client, the less data you will have to send.
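To make the shape change concrete, here is a rough Go sketch of the two layouts; the field names are invented for illustration, since the original only names the matches, markets, and outcomes tables:
package model

// FlatRow is roughly what a straight join produces: match and market metadata
// are repeated for every one of the 4 markets x 7 outcomes = 28 rows per match.
type FlatRow struct {
	MatchID    int64
	HomeTeam   string // repeated 28 times per match
	AwayTeam   string // repeated 28 times per match
	Kickoff    string // repeated 28 times per match
	MarketID   int64
	MarketName string // repeated 7 times per market
	OutcomeID  int64
	Odds       float64
}

// The nested shape sends each piece of metadata exactly once per match.
type Match struct {
	ID      int64    `json:"id"`
	Home    string   `json:"home"`
	Away    string   `json:"away"`
	Kickoff string   `json:"kickoff"`
	Markets []Market `json:"markets"`
}

type Market struct {
	ID       int64     `json:"id"`
	Name     string    `json:"name"`
	Outcomes []Outcome `json:"outcomes"`
}

type Outcome struct {
	ID   int64   `json:"id"`
	Odds float64 `json:"odds"`
}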

Initial ElasticSearch Bulk Index/Insert /Upload is really slow, How do I increase the speed?

I'm trying to upload about 7 million documents to ES 6.3 and I've been running into an issue where the bulk upload slows to a crawl at about 1 million docs (I have no documents in the index prior to this).
I have a 3-node ES setup (16 GB RAM and an 8 GB JVM heap per node), 1 index, 5 shards.
I have turned off refresh ("-1"), set replicas to 0, and increased the index buffer size to 30%.
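For reference, the index-level part of that (refresh interval and replicas) is a plain settings update; the 30% index buffer is a node-level setting in elasticsearch.yml and is not shown. A small sketch in Go (the original job is Ruby; the host and index name are placeholders):
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Disable refresh and replicas for the duration of the bulk load, then
	// restore them (e.g. "30s" and 1) once the import is finished.
	body := []byte(`{"index": {"refresh_interval": "-1", "number_of_replicas": 0}}`)

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/myindex/_settings", // placeholder host and index name
		bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("settings update:", resp.Status)
}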
On the upload side I have 22 threads running 150 docs per bulk insert request. This is just a basic Ruby script using PostgreSQL, ActiveRecord, Net::HTTP (for the network call), and the ES Bulk API directly (no gem).
For all of my nodes and upload machines, the CPU, memory, and SSD disk IO usage is low.
I've been able to get about 30k-40k inserts per minute, but that seems really slow to me since others have been able to do 2k-3k per second. My documents do have nested JSON, but they don't seem very large to me (is there a way to check the size of a single doc, or the average?).
I would like to be able to bulk upload these documents in less than 12-24 hours, and it seems like ES should handle that, but once I get to 1 million it slows to a crawl.
I'm pretty new to ES, so any help would be appreciated. I know this seems like a question that has already been asked, but I've tried just about everything I could find and wonder why my upload speed is a factor slower.
I've also checked the logs and only saw some errors about a mapping field that couldn't be changed, but nothing about running out of memory or anything like that.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x, and settings that people were using are no longer supported.
I think I found a bottleneck in the active connections to my original database and increased that connection pool, which helped, but it still slows to a crawl at about 1 million records; it did reach 2 million over about 8 hours of running.
I also tried an experiment on a big machine, the one used to run the upload job, running 80 threads at 1000 documents per upload. I did some calculations and found that my documents are about 7-10 KB each, so each bulk index is about 7-10 MB. This got the document count to 1M faster, but once you get there everything slows to a crawl. The machine's stats are still really low. I see output from the threads about every 5 minutes or so in the job logs, about the same time I see the ES count change.
The ES machines still have low CPU and memory usage. The IO is around 3.85 MB/s, and the network bandwidth was at 55 MB/s and drops to about 20 MB/s.
Any help would be appreciated. I'm not sure if I should try the ES gem and use its bulk insert, which might keep a connection open, or try something totally different.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x, and settings that people were using are no longer supported.
Could you give an example of a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those, and I can't recall anything off the top of my head.
I've started profiling that DB and noticed that once you use an offset of about 1 million, the queries start to take a long time.
Deep pagination is terrible performance-wise. There is the great blog post "No Offset", which explains:
Why it's bad: to get results 1,000 to 1,010 you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination, the more expensive it gets.
How to avoid it: define a unique order for your entries (for example by ID, or a combination of date and ID, but something absolute) and add a condition on where to start. For example, order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that one, order by ID again, but with the condition that the ID must be greater than the last one from your previous run, and fetch the next 10 entries, again remembering the last ID. Repeat until done. The sketch below shows the idea.
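Here is a rough sketch of that keyset approach against the PostgreSQL source, written in Go rather than the original Ruby job; the table and column names are placeholders:
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const batchSize = 150
	lastID := int64(0) // keyset cursor: the ID of the last row from the previous batch

	for {
		// No OFFSET: with an index on id, each batch is equally cheap no matter
		// how deep into the table we are.
		rows, err := db.Query(
			`SELECT id, doc FROM documents WHERE id > $1 ORDER BY id LIMIT $2`,
			lastID, batchSize)
		if err != nil {
			log.Fatal(err)
		}

		count := 0
		for rows.Next() {
			var id int64
			var doc string
			if err := rows.Scan(&id, &doc); err != nil {
				log.Fatal(err)
			}
			// ... add doc to the current Elasticsearch bulk request here ...
			lastID = id
			count++
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			log.Fatal(err)
		}

		if count < batchSize {
			break // fewer rows than requested: we've reached the end
		}
		fmt.Println("fetched batch ending at id", lastID)
	}
}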
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look into the part that is fetching the data first.

A data structure to query the number of events in different time intervals

My program receives thousands of events per second, of different types. For example, 100k API accesses per second from users with millions of different IP addresses. I want to keep statistics and limit the number of accesses in 1 minute, 1 hour, 1 day, and so on. So I need event counts for the last minute, hour, or day for every user, and I want it to behave like a sliding window. In this case, the type of an event is the user's address.
I started using a time series database, InfluxDB, but it failed to insert 100k events per second, and aggregate queries to find event counts over a minute or an hour are even worse. I am sure InfluxDB is not capable of inserting 100k events per second while performing 300k aggregate queries at the same time.
I don't need to retrieve the events themselves from the database, because each one is just a simple address; I just want to count them as fast as possible over different time intervals. I want to get the number of events of type x in a specific time interval (for example, the past hour).
I don't need to store statistics on disk, so maybe a data structure that keeps event counts for different time intervals would work for me. On the other hand, it needs to behave like a sliding window.
Storing all the events in RAM in a linked list and iterating over it to answer queries is another solution that comes to mind, but because the number of events is so high, keeping all of them in RAM would not be a good idea.
Is there any good data structure or even a database for this purpose?
You didn't provide enough detail on the events' input format and how events are delivered to the statistics backend: is it a stream of UDP messages, HTTP PUT/POST requests, or something else?
One possible solution would be to use the Yandex ClickHouse database.
Rough description of suggested pattern:
Load incoming raw events from your application into a memory-based table Events with the Buffer storage engine.
Create a materialized view with per-minute aggregation into another memory-based Buffer table, EventsPerMinute.
Do the same for hourly aggregation of the data in EventsPerHour.
Optionally, use Grafana with the ClickHouse datasource plugin to build dashboards.
In ClickHouse, a Buffer-engine table that is not associated with any on-disk table is kept entirely in memory, and older data is automatically replaced with fresh data. This gives you simple housekeeping for the raw data.
The tables (materialized views) EventsPerMinute and EventsPerHour can also be created with the MergeTree storage engine in case you want to keep statistics on disk. ClickHouse can easily handle billions of records.
At 100K events/second you may need some kind of shaper/load balancer in front of the database.
You could consider a Hazelcast cluster instead of plain RAM. Graylog or plain Elasticsearch might also work, but with this kind of load you should test. You can also think about your data structure: construct an hour map for each address and put each event into its hour bucket. When the hour has passed, you can calculate the count and cache it in that hour's bucket. When you need minute granularity, you go to the hour bucket and count the events in that hour's list.
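Building on that bucket idea, a compact in-memory structure is a ring of per-minute counters per key: incrementing is O(1), and a "last hour" query just sums 60 buckets, which gives a sliding window accurate to one minute. A minimal sketch (the granularity and window sizes are choices, not requirements):
package main

import (
	"fmt"
	"sync"
	"time"
)

// windowCounter keeps one counter per minute in a fixed-size ring, so the
// last N minutes can be summed without storing individual events.
type windowCounter struct {
	mu      sync.Mutex
	buckets []int64 // one counter per minute
	stamps  []int64 // the unix minute each bucket currently belongs to
}

func newWindowCounter(minutes int) *windowCounter {
	return &windowCounter{
		buckets: make([]int64, minutes),
		stamps:  make([]int64, minutes),
	}
}

// Add records one event at time now.
func (w *windowCounter) Add(now time.Time) {
	minute := now.Unix() / 60
	i := int(minute) % len(w.buckets)
	w.mu.Lock()
	if w.stamps[i] != minute { // the bucket is stale: it belonged to an older minute
		w.stamps[i] = minute
		w.buckets[i] = 0
	}
	w.buckets[i]++
	w.mu.Unlock()
}

// CountLast sums the buckets covering the last d (rounded down to whole minutes).
func (w *windowCounter) CountLast(now time.Time, d time.Duration) int64 {
	minutes := int(d / time.Minute)
	if minutes > len(w.buckets) {
		minutes = len(w.buckets)
	}
	cur := now.Unix() / 60
	var total int64
	w.mu.Lock()
	for m := cur; m > cur-int64(minutes); m-- {
		i := int(m) % len(w.buckets)
		if w.stamps[i] == m {
			total += w.buckets[i]
		}
	}
	w.mu.Unlock()
	return total
}

func main() {
	// One counter per key (e.g. per IP address); a real setup would shard this map.
	counters := map[string]*windowCounter{"10.0.0.1": newWindowCounter(24 * 60)} // one day of minute buckets
	c := counters["10.0.0.1"]
	c.Add(time.Now())
	fmt.Println("last minute:", c.CountLast(time.Now(), time.Minute))
	fmt.Println("last hour:", c.CountLast(time.Now(), time.Hour))
}
A per-address map of such counters covers the rate-limiting case; for sub-minute precision you would need smaller buckets or a timestamp log.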

Google BigQuery API returns "too many free query bytes scanned for this project"

I am using Google's BigQuery API to retrieve results from their n-gram dataset. So I send multiple queries like "SELECT ngram from trigram_dataset where ngram == 'natural language processing'".
I'm basically using the same code posted here (https://developers.google.com/bigquery/bigquery-api-quickstart) replaced with my query statement.
On every program run, I have to get a new authorization code and type it into the console, which authorizes my program to send queries to Google BigQuery under my project ID. However, after sending 5 queries, it just returns "message": "Exceeded quota: too many free query bytes scanned for this project".
According to Google BigQuery policy, the free quota is 100 GB/month, and I don't think I've come anywhere near that. Someone suggested in a previous thread that I should enable billing to use the free quota, which I did, but it still gives me the same error. Is there any way to check the remaining quota, or how can I resolve this problem? Thank you very much!
The query you've mentioned scans 1.12 GB of data, so you should be able to run it 89 times in a month.
The way the quota works is that you start out with 100 GB of monthly quota -- if you use it up, you don't have to wait an entire month, but you get about 3.3 GB more quota every day.
My guess (please confirm) is that you ran a bunch of queries and used up your 100 GB monthly free quota, then waited a day, and only were able to run a few queries before hitting the quota cap. If this is not the case, please let me know, and provide your project id and I can take a look in the logs.
Also, note that this isn't the most efficient usage of BigQuery; an option would be to batch together multiple requests. In this case you could do something like:
SELECT ngram
FROM trigram_dataset
WHERE ngram IN (
'natural language processing',
'some other trigram',
'three more words')
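A rough sketch of that batched lookup with the Go client for BigQuery (the question's quickstart is Java; the project ID, dataset path, and the standard-SQL parameter syntax here are my adaptation):
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project-id") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// One query for many trigrams: one scan against the quota instead of one per trigram.
	q := client.Query(`
SELECT ngram
FROM mydataset.trigram_dataset
WHERE ngram IN UNNEST(@ngrams)`) // placeholder dataset name
	q.Parameters = []bigquery.QueryParameter{
		{Name: "ngrams", Value: []string{
			"natural language processing",
			"some other trigram",
			"three more words",
		}},
	}

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row[0])
	}
}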
