Loading a large amount of data - Allowed memory size exhausted - performance

I have a Java Spring Boot application (I'll call it A1) that is connected to RabbitMQ. A1 receives data and saves it to a MySQL database (I'll call it DB1). You can think of this data as football matches with their markets and outcomes. We receive data for the next 10 days through the A1 app, and that data is stored in the database.
It is worth emphasizing that every football match has 4 markets and every market has 7 outcomes.
Here is how DB1 looks. Besides other tables, there are 3 tables worth mentioning (matches, markets, outcomes), related as follows:
matches 1.....* markets
markets 1.....* outcomes
The data received through A1 is constantly updated (every second some update arrives for football events between the current moment and 2 hours ahead).
There is another PHP Symfony application (I'll call it S1). This application serves as a REST API.
There is one more application, a frontend (I'll call it F1), that communicates with S1 over HTTP to retrieve data from the database.
F1 sends an HTTP request to S1 to retrieve this data (matches with markets and outcomes), but for a time frame from the current moment up to 7 days ahead (a business requirement).
When this HTTP request is sent to S1, an error occurs because there are over 10,000 football matches plus bets and outcomes:
PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in.
I am considering two options to solve this issue; if neither of them is good enough, I would appreciate it if you could suggest another solution to my problem.
Option 1 - F1 iterates per day, sending 7 HTTP requests asynchronously to S1 to retrieve data for all 7 days (see the sketch below).
Option 2 - F1 sends an HTTP request to S1. S1 returns data only for today, and the next 6 days are pushed over a socket, one day at a time, using https://pusher.com/ or something similar.
One more thing to emphasize: we currently count two of these HTTP requests per second, and that number tends to grow.
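For Option 1, the day-by-day fan-out could look roughly like the sketch below (written in Go purely for illustration, since F1's actual stack isn't stated; the /matches?date=... endpoint shape is an assumption):

// Sketch of Option 1: fetch each of the 7 days concurrently and merge the results.
// The base URL and the per-day endpoint are hypothetical.
package fanout

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

// fetchWeek issues one request per day and returns the raw response bodies;
// error handling is trimmed to the essentials.
func fetchWeek(base string) ([][]byte, error) {
    days := make([][]byte, 7)
    errs := make(chan error, 7)
    var wg sync.WaitGroup
    for i := 0; i < 7; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            date := time.Now().AddDate(0, 0, i).Format("2006-01-02")
            resp, err := http.Get(fmt.Sprintf("%s/matches?date=%s", base, date))
            if err != nil {
                errs <- err
                return
            }
            defer resp.Body.Close()
            body, err := io.ReadAll(resp.Body)
            if err != nil {
                errs <- err
                return
            }
            days[i] = body // each goroutine writes a distinct index
        }(i)
    }
    wg.Wait()
    close(errs)
    if err := <-errs; err != nil { // nil if no goroutine reported an error
        return nil, err
    }
    return days, nil
}

Each response then carries only one day's worth of matches, so no single S1 response has to build the full 7-day payload in memory.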

10K matches turning into 134 MB of data? That is more than 13 KB per record... Likely you are making your data structure too flat, duplicating metadata on matches/bets/etc. into a single flat record. Try making your objects hierarchical, so that a match contains its bets, instead of single-row objects.
If not, then you have an inefficiency in your processing of the data that we cannot diagnose remotely.
You would do even better if you did more processing server-side instead of sending the raw data over the wire. The more you can answer questions inside of S1 instead of sending the data to the client, the less data you will have to send.
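To make the "hierarchical" suggestion concrete, here is a minimal sketch in Go (field names are hypothetical, not taken from the actual schema): each match embeds its 4 markets and each market its 7 outcomes, so match-level metadata is serialized once per match instead of being repeated on every outcome row.

// Sketch of a hierarchical response shape: match -> markets -> outcomes.
package model

import "time"

type Outcome struct {
    Name string  `json:"name"`
    Odds float64 `json:"odds"`
}

type Market struct {
    Name     string    `json:"name"`
    Outcomes []Outcome `json:"outcomes"` // 7 outcomes per market
}

type Match struct {
    ID       int64     `json:"id"`
    KickOff  time.Time `json:"kickOff"`
    HomeTeam string    `json:"homeTeam"`
    AwayTeam string    `json:"awayTeam"`
    Markets  []Market  `json:"markets"` // 4 markets per match
}

Serialized this way, the shared match fields appear once per match rather than 28 times (4 markets x 7 outcomes).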

Related

Browser: Network - response time shows 2 seconds when it displays after ~15

I wanted to check how quickly my web application displays the results of a query: SELECT * FROM orders.
The query returns about 20k records on one page, and it takes about 15 seconds.
Why does every browser show the response time stopping at two seconds? Is it because the browser has trouble displaying so many records on one page? At 70k records it runs out of memory.
The database is MySQL on a hosting provider.
[Screenshots: the problem, and the correct response time]
If you want to check how long the web app takes to process the request, you can add logging before and after running the query.
You could also log the current time when receiving the request and just before returning the response.
As for why the reported response time stops at two seconds, I don't think we have enough information to say.
It could come from the default configuration of the web server you use.
In my opinion, displaying 20k records might not be an efficient approach.
Beyond the query time and response time, you might also want to consider the looping that happens on the front end.
Personally, I would recommend paging with a smaller page size, and if you need to display all the data at once, you might consider lazy loading as an option.
I know this is a very generic answer, but hopefully, this could help you out.
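As a rough illustration of both suggestions (sketched in Go with made-up column names, not the asker's actual stack), the snippet below logs how long the query itself takes and fetches one page at a time with LIMIT/OFFSET instead of all 20k rows:

// Sketch: time the database query and fetch a single page instead of all rows.
package ordersdemo

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/go-sql-driver/mysql" // MySQL driver
)

func fetchPage(db *sql.DB, page, pageSize int) error {
    start := time.Now()
    rows, err := db.Query("SELECT id, total FROM orders ORDER BY id LIMIT ? OFFSET ?",
        pageSize, page*pageSize)
    if err != nil {
        return err
    }
    defer rows.Close()

    count := 0
    for rows.Next() {
        var id int64
        var total float64
        if err := rows.Scan(&id, &total); err != nil {
            return err
        }
        count++
    }
    log.Printf("page %d: %d rows in %s", page, count, time.Since(start))
    return rows.Err()
}

The same kind of timestamps around the HTTP handler will tell you whether the 15 seconds is spent in the database, in serialization, or in the browser rendering the rows.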

firestore query issues with 500 concurrent clients

I am doing some stress testing on my project and need your help in understanding the behavior. I have a web server which accepts JSON data from users and stores it in a Firestore collection. Users can query this data. The document JSON has only two fields, id1 and id2, and both are strings. Now, as part of my stress test, I start 500 threads to mimic 500 clients which query the collection, giving each thread the documents where id1 == thread_id, like this:
query := client.Collection("mycollection").Where("id1", "==", my_id) // the comparison operator is passed as a string
iter := query.Documents(ctx)
snapList, err = iter.GetAll()
I see two issues:
Some of these queries take very long, up to 20 seconds, to return.
Some of the queries fail with a connection error / I/O timeout. I am using the Go SDK.
"message":"error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp x.x.x.x:443: i/o timeout""}
As per the Firestore documentation, up to 1 million concurrent clients are allowed. Then why am I getting issues with just 500? Even running this test on an empty collection I observe the same behavior. Is there any other rate limit that I am missing?
When adding load to Firestore it is recommended to follow the 500/50/5 rule, as explained in the documentation on ramping up traffic:
You should gradually ramp up traffic to new collections or lexicographically close documents to give Cloud Firestore sufficient time to prepare documents for increased traffic. We recommend starting with a maximum of 500 operations per second to a new collection and then increasing traffic by 50% every 5 minutes. You can similarly ramp up your write traffic, but keep in mind the Cloud Firestore Standard Limits. Be sure that operations are distributed relatively evenly throughout the key range. This is called the "500/50/5" rule.
So you might want to start with a lower number of threads, and then increase 50% every 5 minutes until you reach the desired load. Some of the SDKs even have support classes for this, such as the BulkWriter in Node.js.
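If you want to keep the ramp-up inside the Go test itself, a rough sketch might look like this (rampUp and runWorker are hypothetical names for your own harness, not a Firestore API):

// Sketch of the 500/50/5 ramp-up: start with a small batch of workers and add
// roughly 50% more every 5 minutes until the target concurrency is reached.
package loadtest

import (
    "context"
    "time"
)

func rampUp(ctx context.Context, start, target int, runWorker func(ctx context.Context, workerID int)) {
    running := 0
    batch := start // must be > 0
    for running < target {
        for i := 0; i < batch && running < target; i++ {
            go runWorker(ctx, running) // each worker runs your existing query loop
            running++
        }
        time.Sleep(5 * time.Minute)
        batch = running / 2 // next round adds ~50% of what is already running
    }
}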

A data structure to query the number of events in different time intervals

My program receives thousands of events per second of different types, for example 100k API accesses per second from users with millions of different IP addresses. I want to keep statistics and limit the number of accesses per 1 minute, 1 hour, 1 day, and so on. So I need event counts for the last minute, hour, or day for every user, and I want it to behave like a sliding window. In this case, the type of event is the user's address.
I started using a time series database, InfluxDB, but it failed to insert 100k events per second, and aggregate queries to find event counts per minute or hour are even worse. I am sure InfluxDB is not capable of inserting 100k events per second while performing 300k aggregate queries at the same time.
I don't need to retrieve the events themselves from the database, because each one is just a simple address; I just want to count them as fast as possible over different time intervals, i.e. get the number of events of type x in a specific time interval (for example, the past hour).
I don't need to store the statistics on disk, so maybe an in-memory data structure that keeps event counts over different time intervals would work for me. On the other hand, I need it to behave like a sliding window.
Storing all the events in RAM in a linked list and iterating over it to answer queries is another solution that comes to mind, but because the number of events is so high, keeping all of them in RAM is probably not a good idea.
Is there any good data structure or even a database for this purpose?
You didn't provide enough detail on the event input format and how events are delivered to the statistics backend: is it a stream of UDP messages, HTTP PUT/POST requests, or something else?
One possible solution would be to use the Yandex ClickHouse database.
A rough description of the suggested pattern:
Load incoming raw events from your application into a memory-based table Events with the Buffer storage engine
Create a materialized view with per-minute aggregation in another memory-based table EventsPerMinute with the Buffer engine
Do the same for hourly aggregation of data in EventsPerHour
Optionally, use Grafana with the ClickHouse datasource plugin to build dashboards
In ClickHouse, a table with the Buffer storage engine that is not associated with any on-disk table is kept entirely in memory, and older data is automatically replaced with fresh data. This gives you simple housekeeping for the raw data.
The tables (materialized views) EventsPerMinute and EventsPerHour can also be created with the MergeTree storage engine in case you want to keep statistics on disk. ClickHouse can easily handle billions of records.
At 100K events/second you may need some kind of shaper/load balancer in front of the database.
You can think of a Hazelcast cluster instead of plain RAM. I also think of Graylog or plain Elasticsearch, but with this kind of load you should test. You can think about your data structure as well: construct an hour map for each address and put each event into its hour bucket. When the hour passes, you can calculate the count and cache it in that hour's bucket. When you need minute granularity, you go to the hour bucket and count the events in that hour's list.
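Building on that bucketing idea, here is a minimal in-memory sketch in Go (the names and the one-minute granularity are illustrative, not taken from any of the systems mentioned): keep a ring of per-minute counters per address, so the count for the last hour is the sum of the last 60 buckets and no individual events are stored.

// Sketch: per-key sliding-window counter using fixed one-minute buckets.
// The hourly count is approximate (bucket granularity), but only 60 small
// counters are kept per key instead of the raw events.
package slidingwindow

import (
    "sync"
    "time"
)

const buckets = 60 // one bucket per minute, covering roughly the last hour

type bucket struct {
    minute int64 // unix minute this bucket currently represents
    count  int64
}

type Counter struct {
    mu   sync.Mutex
    keys map[string]*[buckets]bucket
}

func NewCounter() *Counter {
    return &Counter{keys: make(map[string]*[buckets]bucket)}
}

// Add records one event for key (e.g. an IP address) at time t.
func (c *Counter) Add(key string, t time.Time) {
    minute := t.Unix() / 60
    c.mu.Lock()
    defer c.mu.Unlock()
    ring, ok := c.keys[key]
    if !ok {
        ring = new([buckets]bucket)
        c.keys[key] = ring
    }
    b := &ring[minute%buckets]
    if b.minute != minute { // bucket is stale: reuse it for the current minute
        b.minute = minute
        b.count = 0
    }
    b.count++
}

// CountLastHour returns the number of events for key in roughly the last hour.
func (c *Counter) CountLastHour(key string, now time.Time) int64 {
    minute := now.Unix() / 60
    c.mu.Lock()
    defer c.mu.Unlock()
    ring, ok := c.keys[key]
    if !ok {
        return 0
    }
    var total int64
    for i := range ring {
        if minute-ring[i].minute < buckets { // bucket falls within the last hour
            total += ring[i].count
        }
    }
    return total
}

At 100k events/second a single mutex would become the bottleneck, so in practice you would shard the map by key (or use per-shard locks), but the bucketing idea stays the same.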

Handling Duplicates in Data Warehouse

I was going through the link below on handling data quality issues in a data warehouse.
http://www.kimballgroup.com/2007/10/an-architecture-for-data-quality/
"
Responding to Quality Events
I have already remarked that each quality screen has to decide what happens when an error is thrown. The choices are: 1) halting the process, 2) sending the offending record(s) to a suspense file for later processing, and 3) merely tagging the data and passing it through to the next step in the pipeline. The third choice is by far the best choice.
"
In some dimensional feeds (like the client list), we sometimes get the same client twice (the two records differing in certain attributes). What is the best solution in this scenario?
I don't want to reject both records (as that would mean incomplete client data).
The source systems are very slow in fixing the issue, so we get the same issues every day. That means a manual fix is also tough, as it has to be done every day (we receive the client list every day).
Selecting a single record is not possible as we don't know what the correct value is.
Having both records in our warehouse means our joins are disrupted: because there are two rows for the same ID, the fact table rows are doubled in a join.
Any thoughts?
What is the best solution in this scenario?
There are a lot of permutations and combinations in your scenario. The big question is: are the differing details valid or invalid? This will change how you deal with them.
Valid Data Example: Record 1 has John Smith living at 12 Main St, Record 2 has John Smith living at 45 Main St. This is valid data: John Smith simply moved address between the first and second record. If the data is valid, you have options such as creating a slowly changing dimension and tracking the changes (end-date the old record, start-date the new record).
Invalid Data Example: However, if the data is INVALID (e.g. your system somehow creates duplicate keys incorrectly), then your options are different. I doubt you want to surface this data, as it's currently invalid and, as you pointed out, you don't have a way to identify which duplicate record is "correct". But you don't want your whole load to fail/halt.
In this instance you would usually:
Push these duplicate rows to a "Quarantine" area
Push an alert to the people who have the power to fix this operationally
Optionally select one of the records randomly as the "golden detail" record (so your system will still tally with totals) and mark an attribute on the record saying that it's "Invalid" and under investigation.
The point Kimball is trying to make is that Option 1 is not desirable because it halts your entire system for errors that will inevitably happen; Option 2 isn't ideal because your aggregations will appear out of sync with your source systems; Option 3 is the most desirable, as it still leads to a data fix but doesn't halt the process or the use of the data (while alerting users that this data is suspect).
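As a rough sketch of that quarantine-and-flag flow in Go (the record type and the quarantine/alert hooks are placeholders, not part of any real ETL framework):

// Sketch: when a client ID arrives more than once with differing attributes,
// quarantine all copies, alert the data owners, and load one copy flagged as
// suspect so fact rows still join to exactly one dimension row.
package etl

type ClientRecord struct {
    ClientID string
    Name     string
    Address  string
    Suspect  bool // set when the record was chosen arbitrarily from conflicting duplicates
}

func screenClients(feed []ClientRecord, quarantine func(ClientRecord), alert func(clientID string)) []ClientRecord {
    byID := make(map[string][]ClientRecord)
    for _, r := range feed {
        byID[r.ClientID] = append(byID[r.ClientID], r)
    }

    var clean []ClientRecord
    for id, recs := range byID {
        if len(recs) == 1 {
            clean = append(clean, recs[0])
            continue
        }
        for _, r := range recs { // conflicting duplicates: park them for investigation
            quarantine(r)
        }
        alert(id)
        golden := recs[0] // arbitrary pick so totals still tally
        golden.Suspect = true
        clean = append(clean, golden)
    }
    return clean
}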

REST Api for Infinite scrolled query results

I'm building an internal server which contains a database of customer events. The webpage that allows access to the events is going to use an infinite scroll/dynamic loading scheme to display live events as well as to browse the results of queries to the database. So you might query the database and get maybe 200k results. The webpage would display the 'first' 50 and allow you to scroll and scroll to see more and more results (loading perhaps 50 more at a time).
I'm supposed to be using a REST API for the database access (a C# server). I'm unsure what the API should look like so that it remains RESTful. I've come up with 3 options. The question is: are any of them RESTful, and which is the most RESTful (is there such a thing -- if not, I'll pick one of the RESTful ones)?
Option 1:
GET /events?query=asdfasdf&first=1&last=50
This simply does the query and specifies the range of results to return. The server, unable to keep state, would have to requery the database each time the infinite scroll occurs (though perhaps using the first/last hints to stop early). This seems bad, and there isn't any feedback about how many results are forthcoming.
Option 2:
GET /events/?query=asdfasdf
GET /events/details?id1=asdf&id2=qwer&id3=zxcv&id4=tyui&...&id50=vbnm
This option first does a query which returns the list of event ids but no further details. The webpage simply has the list of all the ids (at least it knows the count). The webpage holds onto the event id list and, as infinite scroll/dynamic load is needed, makes another request for the event details of the specified ids. Each id would nominally be a GUID, so about 36 characters per id (plus &id##= for 41 characters). At 50 ids per request, the URL would be quite long, 2000+ characters. The URL limit mentioned elsewhere on SO is around 2k. Maybe if I limit it to 40 ids per query this would be fine. It would be nicer to simply have a comma-separated list instead of all the separate query parameters. Can you make a query parameter like ?ids=qwer,asdf,zxcv,wert,sdfg,rtyu,gfhj, ... ,vbnm ?
Option 3:
POST /events/?query=asdfasdf
GET /events/results/{id}?first=1&last=50
This would POST the query to the server and cause it to create a results resource. The ID of the results resource would be returned and would then be used to get blocks of the query results, which in turn contain the event details needed for the webpage. The XML returned from the POST could contain the number of records and other useful information besides the ID. Either the webpage would have to delete the resource later, when the query page is closed, or the server would have to clean such resources up once they expire (days or weeks later).
I am concerned that Option 1, while RESTful, is horrible for the server. I'm not sure requesting so many resources at once, like the second GET in Option 2, is really RESTful or practical (it seems like there has to be a better way). I'm not sure Option 3 is RESTful at all, or if it is, whether it's sort of cheating the REST idea by creating state via a POST (or should that be a PUT?).
Option 3 worked out fine. It required the server to maintain the query results, and there was a bit of debate about how many queries (from various users) should be kept at the same time, since there was no way to know when a user was actually done with a query.
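For what it's worth, a minimal sketch of Option 3's two endpoints (written in Go with Go 1.22 routing rather than C#, JSON rather than XML, and a made-up runQuery stub, purely to illustrate the shape):

// Sketch of Option 3: POST runs the query and stores the full result set under
// a new id; GET returns the requested slice of that stored result set.
package api

import (
    "encoding/json"
    "net/http"
    "strconv"
    "sync"
)

func runQuery(q string) []map[string]any { return nil } // placeholder for the real database query

type resultStore struct {
    mu      sync.Mutex
    nextID  int
    results map[string][]map[string]any // result-set id -> rows
}

func (s *resultStore) create(w http.ResponseWriter, r *http.Request) {
    rows := runQuery(r.URL.Query().Get("query"))
    s.mu.Lock()
    s.nextID++
    id := strconv.Itoa(s.nextID)
    s.results[id] = rows
    s.mu.Unlock()
    // Return the resource id and total count so the client knows how much it can page through.
    json.NewEncoder(w).Encode(map[string]any{"id": id, "count": len(rows)})
}

func (s *resultStore) page(w http.ResponseWriter, r *http.Request) {
    s.mu.Lock()
    rows, ok := s.results[r.PathValue("id")]
    s.mu.Unlock()
    if !ok {
        http.NotFound(w, r)
        return
    }
    first, _ := strconv.Atoi(r.URL.Query().Get("first"))
    last, _ := strconv.Atoi(r.URL.Query().Get("last"))
    if first < 1 {
        first = 1
    }
    if last > len(rows) {
        last = len(rows)
    }
    if first > last {
        json.NewEncoder(w).Encode([]map[string]any{})
        return
    }
    json.NewEncoder(w).Encode(rows[first-1 : last]) // first/last are 1-based and inclusive
}

func Routes() *http.ServeMux {
    s := &resultStore{results: make(map[string][]map[string]any)}
    mux := http.NewServeMux()
    mux.HandleFunc("POST /events", s.create)           // create a results resource
    mux.HandleFunc("GET /events/results/{id}", s.page) // read a block of it
    return mux
}

An expiry/cleanup goroutine (or an explicit DELETE endpoint) would handle the lifetime concern mentioned above.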

Resources