Parse: replacing a large number (several thousand) of records - parse-platform

I've got a class in Parse with 1-4k records per user. These need to be replaced from time to time (the records represent multiple timetables).
The problem I'm facing is that deleting and inserting these records takes a ton of requests. Is there a method to delete and insert a bunch of records that counts as one request? Maybe it's possible from Cloud Code?
I tried compacting all this data into one record, but then I hit the size limit for records (128 KB). Using any sub-format (like a db or file inside a record) would be really tedious, because the app targets nearly all platforms supported by Parse.
EDIT
For clarification, the problem isn't the limit on saveAll/destroyAll. My problem is hitting the req/s limit (or rather, as the docs state, req/min).
Also, I just checked that requests from Cloud Code also seem to count towards that limit.
Another possible solution would be to redesign my datasets and use Array columns or something, but I'd rather avoid it if possible.

I think you could try Parse.Object.saveAll which batch processes the save() function.
Docs: https://www.parse.com/docs/js/api/symbols/Parse.Object.html#.saveAll
Guide: https://parse.com/questions/parseobjectsaveall-performances

I would use saveAll/destroyAll and any other -All method that Parse provides in its SDK.
You'd still hit the limit of 1000 objects per query, but to get around that you can loop using the skip property of the query.
Set a limit of 1000 and a skip of 0 and run the query, then increase the skip by the limit and query again, and so on. You stop the loop when the number of results is smaller than the limit; otherwise you query again with the skip set to limit x loop count. That gives you a few requests of up to 1000 objects each.
Since you say you're facing size issues, you could reduce the query limit to, say, 400; the loop would just run for more iterations until the number of results is smaller than the limit, and then you can stop.
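The skip/limit loop described above can be sketched as follows. This is a minimal illustration in Python, not Parse SDK code: fetch_page is a hypothetical stand-in for whatever query call your SDK provides, and the in-memory list only simulates the remote class.

```python
from typing import Callable, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]], limit: int = 1000) -> List[dict]:
    """Collect every record by paging with skip/limit until a short page arrives."""
    results: List[dict] = []
    skip = 0
    while True:
        page = fetch_page(skip, limit)
        results.extend(page)
        if len(page) < limit:  # a short (or empty) page means we reached the end
            break
        skip += limit  # the next page starts where this one ended
    return results


# Hypothetical stand-in for a remote query: 2500 fake records served from a list.
records = [{"id": i} for i in range(2500)]


def fake_fetch(skip: int, limit: int) -> List[dict]:
    return records[skip:skip + limit]


all_records = fetch_all(fake_fetch, limit=1000)
print(len(all_records))
```

Note that every page is still one request, so this addresses the per-query object limit rather than the request rate limit raised in the question.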

Okay, so this isn't an answer to my question, but it's a solution to my problem, so I'm posting it.
My problem was storing and then replacing a large amount of small records which add up to significant size (up to 500KB JSON [~1.5MB XML] in my current plans).
So I've chosen a middle path - I implemented sort of vertical partitions.
What I have is a master User record which holds an array of pointers to another class (called Entries). Entries have only 2 fields: the ID of the school record and data, which is of type Array.
I decided to split the "partitions" every 1000 records, which is about 60-70KB per record, but by my calculations may go up to ~100KB.
I also made the field names in the JSON 1 letter long, because every letter across 1000 records is 1 or 2 KB, depending on encoding.
That approach actually made the PHP code about twice as fast, and there is a lot less load on the network and the remote database (basically 1000 times fewer inserts/destroys).
So, that is my solution. If anybody has other ideas, please post them as an answer here, because I'm probably not the only one with this problem and this certainly isn't the only solution.
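A rough sketch of the partitioning idea above, written in Python for illustration only (the original implementation was in PHP and is not shown). The field names and the KEY_MAP mapping are made-up assumptions; only the chunk size of 1000 and the one-letter keys come from the description.

```python
import json
from typing import Dict, List


def partition(records: List[dict], size: int = 1000) -> List[List[dict]]:
    """Split the records into consecutive chunks of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def shorten_keys(record: dict, key_map: Dict[str, str]) -> dict:
    """Rename long field names to short ones; unknown keys are kept as-is."""
    return {key_map.get(k, k): v for k, v in record.items()}


# Hypothetical timetable fields and their one-letter replacements.
KEY_MAP = {"subject": "s", "room": "r", "start": "t"}
records = [{"subject": "Math", "room": "101", "start": i} for i in range(3500)]

# Each inner list would become the Array field of one Entries record.
entries = [[shorten_keys(r, KEY_MAP) for r in chunk] for chunk in partition(records)]
print(len(entries), [len(e) for e in entries])

# Compare the serialized size of one partition with long and short keys.
long_size = len(json.dumps(records[:1000]))
short_size = len(json.dumps(entries[0]))
print(long_size, short_size)
```

Replacing the whole dataset then means rewriting a handful of Entries records instead of thousands of small ones.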

Related

Max frame length of 65536 has been exceeded

I have a setup where I am using the gremlin-core library to query a remote Janusgraph server. The data size is moderate for now but will increase in the future.
A few days ago, I saw the "Max frame length of 65536 has been exceeded" error on my client. The value for the maxContentLength parameter in my server yaml file was set to the default (65536). I dug into the code and realized that I was sending a large array of vertex ids as a query parameter to fetch vertices. I split the array into batches of 100 vertex ids each and that resolved the issue.
After some time I started seeing this error again in my client logs. This time around, there was no query with a large number of parameters being sent to the server. I saw a proposed solution on this topic which said that I need to set the maxContentLength parameter on the client side as well. I did that and the issue got resolved. However, it raised a few questions regarding the configuration parameters, their values and their impact on the query request/response size.
Is the maxContentLength parameter related to the response size of a query? If yes, how do I figure out the value for this parameter with respect to my database size?
Are there any other parameters that dictate the maximum size of the query parameters in the request? If yes, which are they and how do they relate to the size of the query parameters?
Are there any parameters that dictate the size of a query response? If yes, which are they and how do they relate to the size of the query response?
The answers to these questions are crucial for me to make a robust server that will not break under the onslaught of data.
Thanks in advance
Anya
The maxContentLength is the number of bytes a single "message" can contain as a request or a response. It serves the same function as similar settings in web servers to allow filtering of obviously invalid requests. The setting has little to do with database size and more to do with the types of requests you are making and the nature of your results. For requests, I tend to think it atypical for a request to exceed 65k in most situations. Folks who exceed that size are typically trying to do batch loading or are using code generated scripts (the latter is typically problematic, but I won't go into details). For responses, 65k may not be enough depending on the nature of your queries. For example, the query:
g.V().valueMap(true)
will return all vertices in your database as an Iterator<Map> and Gremlin Server will stream those results back in batches controlled by the resultIterationBatchSize (default is 64). So if you have 128 vertices in your database, Gremlin Server will stream back two batches of results behind the scenes. If those two batches are each below maxContentLength in size then there are no problems. If your batches are bigger than that (because you have, say, 1000 properties on each vertex) then you need to do one of the following:
1) limit the data you return - e.g. return fewer properties
2) increase maxContentLength
3) lower the resultIterationBatchSize
Also note that the previous query is very different from something like:
g.V().valueMap(true).fold()
because the fold() will realize all the vertices into a list in memory and then that list must be serialized as a whole. There is only 1 result (i.e. a List<Map> with 128 vertices) and thus nothing to batch, so it's much more likely that you would exceed the maxContentLength here, and lowering the resultIterationBatchSize wouldn't even help. Your only recourse would be to increase maxContentLength or alter the query to allow batching to kick in, to hopefully break up that large chunk of data so it fits in the maxContentLength.
Setting your maxContentLength to 2 MB or larger shouldn't be too big a deal. If you need to go higher for requests, then I'd be curious what the reason was for that. If you need to go much higher for responses, then perhaps I'd take a look at my queries and see if there's a better way to limit the data I'm returning, or see if there's a nicer way to get Gremlin Server streaming to work for me.
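To make the batching arithmetic above concrete, here is a small back-of-the-envelope sketch in Python. It is not Gremlin Server code: the per-result byte sizes are invented for illustration, and the real serialized size of a batch includes protocol overhead that this estimate ignores.

```python
import math


def batches_needed(result_count: int, batch_size: int = 64) -> int:
    """Number of response messages when results are streamed in batches."""
    return math.ceil(result_count / batch_size)


def batch_fits(avg_result_bytes: int, batch_size: int, max_content_length: int = 65536) -> bool:
    """Rough check: does one full batch stay within maxContentLength?"""
    return avg_result_bytes * batch_size <= max_content_length


# 128 vertices with the default batch size of 64 gives two batches.
print(batches_needed(128))

# Assumed average serialized sizes per vertex map (hypothetical numbers).
print(batch_fits(500, 64))    # small vertices: 32,000 bytes per batch
print(batch_fits(2000, 64))   # large vertices: 128,000 bytes per batch
print(batch_fits(2000, 32))   # same vertices, smaller batch: 64,000 bytes

# With fold() there is a single result holding all 128 vertices,
# so the batch size no longer matters.
print(batch_fits(128 * 2000, 1))
```

Under these assumed sizes, halving the batch size is enough for the streamed query, while the fold() variant only fits if maxContentLength is raised or the query is changed.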

Best way to retrieve 150,000 records from Oracle with JDBC

I have been searching for an answer to this today, and it seems the best approach divides opinion somewhat.
I have 150,000 records that I need to retrieve from an Oracle database using JDBC. Is it better to retrieve the data with one select query and let the JDBC driver take care of transferring the records from the database using an Oracle cursor and the default fetchSize, or to split the query into batches using LIMIT / OFFSET?
With the LIMIT / OFFSET option, I think the pros are that you can take control over the number of results you return in each chunk. The cons are that the query is executed multiple times, and you also need to run a COUNT(*) up front using the same query to calculate the number of iterations required.
The pros of retrieving all at once are that you rely on the JDBC driver to manage the retrieval of data from the database. The cons are that the setFetchSize() hint can sometimes be ignored meaning that we could end up with a huge resultSet containing all 150,000 records at once!!
Would be great to hear some real life experiences solving similar issues, and recommendations would be much appreciated.
The native way in Oracle JDBC is to use prepareStatement for the query, then executeQuery, and fetch the results in a loop with a defined fetchSize.
Yes, of course the details depend on the Oracle Database and JDBC driver version, and in some cases the requested fetchSize can be ignored. But the typical problem is that the requested fetch size is reset to fetchSize = 1 and you effectively make a round trip for each record (not that you get all records at once).
Your alternative with LIMIT seems meaningful at first glance. But if you investigate the implementation you will probably decide not to use it.
Say you divide the result set into 15 chunks of 10K each:
You open 15 queries, each of them on average with half of the resource consumption of the original query (OFFSET selects the data and then skips it).
So the only thing you will achieve is that the processing will take approximately 7.5x more time.
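The estimate above can be checked with a simple cost model. The sketch below assumes the cost of a chunked query is proportional to the number of rows the database has to produce (skipped rows plus returned rows) and that no index lets it jump straight to the offset; both are simplifying assumptions, not measurements.

```python
def rows_touched_with_offset(total_rows: int, chunk_size: int) -> int:
    """Rows produced across all chunked queries, counting the rows each OFFSET skips."""
    touched = 0
    for offset in range(0, total_rows, chunk_size):
        # Each query produces `offset` rows to skip plus the rows it returns.
        touched += min(offset + chunk_size, total_rows)
    return touched


total = rows_touched_with_offset(150_000, 10_000)
print(total, total / 150_000)  # 1200000 8.0
```

Under this model the 15 chunked queries touch 1.2 million rows, about 8x the single query, which is in the same range as the roughly 7.5x figure obtained from the "half the cost on average" rule of thumb.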
Best Practice
Take your query, write a simple script with JDBC fetch, and use a 10046 trace to see the fetch size that is actually used.
Test with a range of fetch sizes and observe the performance; choose the optimal one.
My preference is to maintain a safe execution time with the ability to continue if interrupted. I prefer this approach because it is future-proof and respects memory and execution time limits. Remember, you're not planning for today, you're planning for 6 months down the road. What may be 150,000 today may be 1.5 million in 6 months.
I use a length + 1 recipe to know if there is more to fetch, although the count query will let you show a progress bar in % if that is important.
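A minimal sketch of the "length + 1" recipe mentioned above, in Python with an in-memory list standing in for the table: each request asks for one row more than the chunk size, and the presence of that extra row tells you whether another request is needed, without a COUNT(*) up front. The function name and the data are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def fetch_chunk(rows: Sequence[int], offset: int, size: int) -> Tuple[List[int], bool]:
    """Return up to `size` rows plus a flag saying whether more rows remain."""
    window = list(rows[offset:offset + size + 1])  # ask for one extra row
    has_more = len(window) > size
    return window[:size], has_more  # the extra row is dropped, only used as a signal


data = list(range(25))  # stand-in for the result of an ordered query
offset, chunks = 0, []
while True:
    chunk, has_more = fetch_chunk(data, offset, 10)
    chunks.append(chunk)
    offset += len(chunk)  # persist this offset to be able to resume after an interruption
    if not has_more:
        break

print([len(c) for c in chunks])  # [10, 10, 5]
```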
With a 150,000 record result set, this is a memory pressure question. It will depend on the average size of each row. If it is a row with three integers, that's small. If it is a row with a bunch of text elements storing user profile details, then that's potentially very large. So be prudent about which fields you're pulling.
Also ask yourself whether you need to pull all the records all the time. It may be useful to apply a sync pattern and only pull records with an updated date newer than your last pull.

What are the performance considerations when adding a large number of documents to a large Solr core?

If I have a Solr core with a half-dozen small fields that's loaded with 100 million documents, will adding a batch of 1 million documents run in a reasonable amount of time? How about 10 million? By reasonable, I'm thinking hours, rather than days. I've been told that this will take a long time to run. Is this really an issue? What are known strategies to improve performance? The fields are typically small, that is, 5-50 characters.
Two suggestions for improving performance, on top of those already mentioned in other answers (the first tried, the second still to be tried):
1) decrease logging while updating: at INFO level Solr appends one log entry per document. See here on how we did it: http://dmitrykan.blogspot.fi/2011/01/solr-speed-up-batch-posting.html Some people reported "x3 speed increase".
2) set the number of segments in solrconfig.xml to something very large for indexing, like 10000. Once the batch indexing is complete, change the parameter value back to something reasonably low, like 10.
This is a very "tricky" question whose answer differs from schema to schema.
Your Solr installation has a half-dozen fields. But how many are actually indexed? If only one field is indexed, then adding 1 million documents will be faster than adding 1 million docs when 6 fields are indexed.
I think the type of the indexed fields also matters. A field of type "text_general" is broken down into tokens while indexing, whereas a field of type "string" is not. The "string" type is not analyzed and is stored as one complete token.
I have some very long fields which are indexed, and adding 2 million docs takes a few minutes (although my installation does not contain 100 million documents). So I do not think that it will take days to add 10 million records to your installation.
I am not sure about this, but the configuration of the CPU running the Solr instance may also matter. So you might need to check whether your CPU and memory can handle this much load.
It's up to you to decide whether a long-running data post is an issue or not. If your application is user intensive, then I suggest that you follow some kind of master-slave configuration so that users are not impacted by the high CPU usage when you post the data. One strategy I know of for improving performance is "sharding":
http://carsabi.com/car-news/2012/03/23/step-by-step-solr-sharding/
Alternatively, if it is possible, demarcate the records by some field and put the different groups of documents onto different servers.
100 million records is a fairly large index for Solr. But adding 10 million records on a good machine should be hours not days. You may find the following email thread interesting as it includes both in-depth questions and some final advice on tuning for 10M records index process.
Also, you did not say if you 'store' the fields as well as index them. If you do, you may also look forward to Solr 4.1 field compression.
An important factor which impacts indexing performance (in terms of time) is the way in which you have defined your data-config.xml file.
If your fields come from multiple tables in a Database, you can configure it in 2 ways:
Entities within entities
A single entity with a join query
The second method is comparatively faster than the first one by a large degree because the number of queries fired against the database is decreased.

Spreading out data from bursts

I am trying to spread out data that is received in bursts. This means I have data that is received by some other application in large bursts. For each data entry I need to do some additional requests on some server, to which I should limit the traffic. Hence I try to spread out the requests over the time that I have until the next data burst arrives.
Currently I am using a token bucket to spread out the data. However, because the data I receive is already badly shaped, I am still either filling up the queue of pending requests, or I get spikes whenever a burst comes in. So this algorithm does not seem to do the kind of shaping I need.
What other algorithms are there available to limit the requests? I know I have times of high load and times of low load, so both should be handled well by the application.
I am not sure if I was really able to explain the problem I am currently having. If you need any clarifications, just let me know.
EDIT:
I'll try to clarify the problem some more and explain why a simple rate limiter does not work.
The problem lies in the bursty nature of the traffic and the fact that bursts have a different size at different times. What is mostly constant is the delay between bursts. Thus we get a bunch of data records for processing and we need to spread them out as evenly as possible before the next bunch comes in. However, we are not 100% sure when the next bunch will come in, just approximately, so simply dividing the time by the number of records does not work as it should.
Rate limiting does not work, because the spread of the data is not sufficient this way. If we are close to saturation of the rate, everything is fine, and we spread out evenly (although this should not happen too frequently). If we are below the threshold, the spreading gets much worse though.
I'll give an example to make this problem clearer:
Let's say we limit our traffic to 10 requests per second and new data comes in about every 10 seconds.
When we get 100 records at the beginning of a time frame, we will query 10 records each second and we have a perfectly even spread. However, if we get only 15 records we'll have one second where we query 10 records, one second where we query 5 records and 8 seconds where we query 0 records, so we have very unequal levels of traffic over time. Instead it would be better if we just queried 1.5 records each second. However, setting this rate would also cause problems, since new data might arrive earlier, so we do not have the full 10 seconds and 1.5 queries per second would not be enough. If we use a token bucket, the problem actually gets even worse, because token buckets allow bursts to get through at the beginning of the time frame.
However, this example oversimplifies, because we cannot actually tell the exact number of pending requests at any given moment, just an upper limit. So we would have to throttle each time based on this number.
This sounds like a problem within the domain of control theory. Specifically, I'm thinking a PID controller might work.
A first crack at the problem might be dividing the number of records by the estimated time until next batch. This would be like a P controller - proportional only. But then you run the risk of overestimating the time, and building up some unsent records. So try adding in an I term - integral - to account for built up error.
I'm not sure you even need a derivative term, if the variation in batch size is random. So try using a PI loop - you might build up some backlog between bursts, but it will be handled by the I term.
If it's unacceptable to have a backlog, then the solution might be more complicated...
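The following is a loose Python sketch of the P and I terms described above, not a tuned controller. The class name, the gains kp and ki, the one-second tick and the way the error is measured (records left over when the next burst arrives) are all assumptions made for illustration. The integral here only grows, so a real controller would also need some way to bound it or bleed it off.

```python
class PIPacer:
    """Paces sends so that a backlog is spread over the time until the next burst."""

    def __init__(self, kp: float = 1.0, ki: float = 0.1, max_rate: float = 10.0):
        self.kp = kp
        self.ki = ki
        self.max_rate = max_rate  # hard ceiling in requests per tick
        self.integral = 0.0       # accumulated leftover records across bursts
        self.credit = 0.0         # fractional sends carried between ticks

    def on_burst(self, leftover: int) -> None:
        # Error signal: records still unsent when the new burst arrived.
        self.integral += leftover

    def rate(self, backlog: int, est_seconds_left: float) -> float:
        # P term: backlog divided by the estimated time until the next burst.
        proportional = backlog / max(est_seconds_left, 1.0)
        return min(self.max_rate, self.kp * proportional + self.ki * self.integral)

    def tick(self, backlog: int, est_seconds_left: float) -> int:
        """Return how many requests to send during this one-second tick."""
        self.credit += self.rate(backlog, est_seconds_left)
        to_send = min(backlog, int(self.credit))
        self.credit -= to_send
        return to_send


# The example from the question: 15 records, next burst expected in about 10 seconds.
pacer = PIPacer()
pacer.on_burst(leftover=0)  # nothing was left over from the previous burst
backlog, sent_per_second = 15, []
for second in range(10):
    n = pacer.tick(backlog, est_seconds_left=10 - second)
    backlog -= n
    sent_per_second.append(n)

print(sent_per_second, backlog)
```

With these inputs the sends alternate between 1 and 2 per second instead of 10, 5 and then nothing. If bursts keep arriving before the backlog is empty, the leftover passed to on_burst raises the rate for the following intervals.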
If there are no other constraints, what you should do is figure out the maximum data rate at which you are comfortable sending additional requests, and limit your processing speed according to that. Then monitor what is happening. If that gets through all of your requests quickly, then there is no harm. If the sustained level of processing is not fast enough, then you need more capacity.

SQL Server - Merging large tables without locking the data

I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up" and our website that uses the data receives timeouts when attempting to access the data. I even tried splitting it up into 100 record chunks and even tried a WAITFOR DELAY '000:00:5' to see if it would help to pause between merging the chunks. It's still rather sluggish.
I'm looking for any suggestions, best practices, or examples on how to merge large sets of data without locking the tables.
Thanks
Change your front end to use NOLOCK or READ UNCOMMITTED when doing the selects.
You can't NOLOCK MERGE, INSERT, or UPDATE, as the records must be locked in order to perform the update. However, you can NOLOCK the SELECTs.
Note that you should use this with caution. If dirty reads are okay, then go ahead. However, if the reads require the updated data then you need to go down a different path and figure out exactly why merging 3M records is causing an issue.
I'd be willing to bet that most of the time is spent reading data from the disk during the merge command and/or working around low memory situations. You might be better off simply stuffing more ram into your database server.
An ideal amount would be enough RAM to pull the whole database into memory as needed. For example, if you have a 4GB database, then make sure you have 8GB of RAM, in an x64 server of course.
I'm afraid that I've quite the opposite experience. We were performing updates and insertions where the source table had only a fraction of the number of rows as the target table, which was in the millions.
When we combined the source table records across the entire operational window and then performed the MERGE just once, we saw a 500% increase in performance. My explanation for this is that you are paying for the up front analysis of the MERGE command just once instead of over and over again in a tight loop.
Furthermore, I am certain that merging 1.6 million rows (source) into 7 million rows (target), as opposed to 400 rows into 7 million rows over 4000 distinct operations (in our case) leverages the capabilities of the SQL server engine much better. Again, a fair amount of the work is in the analysis of the two data sets and this is done only once.
Another question I have to ask is whether you are aware that the MERGE command performs much better with indexes on both the source and target tables. I would like to refer you to the following link:
http://msdn.microsoft.com/en-us/library/cc879317(v=SQL.100).aspx
From personal experience, the main problem with MERGE is that since it takes page locks, it precludes any concurrency in your INSERTs directed at a table. So if you go down this road it is fundamental that you batch all updates that will hit a table in a single writer.
For example: we had a table on which an INSERT took a crazy 0.2 seconds per entry, most of this time seemingly being wasted on transaction latching. So we switched over to using MERGE, and some quick tests showed that it allowed us to insert 256 entries in 0.4 seconds or even 512 in 0.5 seconds. We tested this with load generators and all seemed to be fine, until it hit production and everything blocked to hell on the page locks, resulting in a much lower total throughput than with the individual INSERTs.
The solution was to not only batch the entries from a single producer in a MERGE operation, but also to batch the batches from all producers going to an individual DB into a single MERGE operation through an additional level of queue. (Previously there was also a single connection per DB, but it used MARS to interleave all the producers' calls to the stored procedure doing the actual MERGE transaction.) This way we were able to handle many thousands of INSERTs per second without problems.
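The "additional level of queue" with a single writer might look roughly like the sketch below. It is written in Python purely as an illustration of the pattern; the answer does not show its implementation, and merge_batch here is a hypothetical placeholder for the one call that would run the actual MERGE with the whole batch.

```python
import queue
from typing import Callable, List


def drain_batch(q: "queue.Queue[dict]", max_items: int) -> List[dict]:
    """Take up to max_items pending rows off the queue without blocking."""
    batch: List[dict] = []
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def run_single_writer(q: "queue.Queue[dict]", merge_batch: Callable[[List[dict]], None],
                      max_items: int = 512) -> int:
    """Single writer: drain the shared queue and issue one merge call per batch."""
    calls = 0
    while True:
        batch = drain_batch(q, max_items)
        if not batch:
            return calls
        merge_batch(batch)  # hypothetical: one MERGE for all producers' rows
        calls += 1


# Several producers put rows on one shared queue (simulated sequentially here).
pending: "queue.Queue[dict]" = queue.Queue()
for producer in range(3):
    for i in range(400):
        pending.put({"producer": producer, "row": i})

batch_sizes: List[int] = []
calls = run_single_writer(pending, lambda batch: batch_sizes.append(len(batch)))
print(calls, batch_sizes)  # 3 [512, 512, 176]
```

Because only the writer touches the table, the producers never compete for the page locks; they only contend on the in-memory queue.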
Having the NOLOCK hints on all of your front-end reads is an absolute must, always.

Resources