From Request Rate and Performance Guidelines - Amazon Simple Storage Service:
Amazon S3 automatically scales to high request rates. For example,
your application can achieve at least 3,500 PUT/POST/DELETE and 5,500
GET requests per second per prefix in a bucket. There are no limits to
the number of prefixes in a bucket. It is simple to increase your read
or write performance exponentially. For example, if you create 10
prefixes in an Amazon S3 bucket to parallelize reads, you could scale
your read performance to 55,000 read requests per second.
Assume an S3 bucket with the folder s3bucket/sample/. There are multiple objects in the folder, for example s3bucket/sample/object_1 and s3bucket/sample/object_2.
What does "prefix" mean in this statement? Is it the full object path like s3bucket/sample, or s3bucket/sample/object_1?
Are the 5,500 requests per second for the folder s3bucket/sample, or are 5,500 requests per second allowed for every object in the folder?
We have multiple asset types belonging to one content item and would like to understand which of the two options below will scale better?
Option 1
s3bucket/contentId_1/assetType_1
s3bucket/contentId_1/assetType_2
s3bucket/contentId_1/assetType_3
s3bucket/contentId_2/assetType_1
s3bucket/contentId_2/assetType_2
s3bucket/contentId_3/assetType_3
or
Option 2
s3bucket/contentId_1_assetType_1
s3bucket/contentId_1_assetType_2
s3bucket/contentId_1_assetType_3
s3bucket/contentId_2_assetType_1
s3bucket/contentId_2_assetType_2
s3bucket/contentId_3_assetType_3
The page says requests per second per prefix in a bucket, which is effectively the same as saying "per directory per bucket".
Frankly, you are unlikely to go anywhere near these performance limits. Large companies with millions of customers might want to use these performance hints, but the vast majority of AWS customers would not approach such levels of usage.
I would suggest you deploy your data in the most meaningful way for your application rather than having to conform to these techniques, at least until you scale to very large usage patterns.
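If you do eventually need to parallelize reads across prefixes, here is a minimal sketch of what that could look like with boto3 (the bucket and keys are the hypothetical ones from the question, and the "prefix" is taken, as above, to be the key path up to the object name, e.g. contentId_1/):

    # Rough sketch: parallel GETs spread across per-content prefixes (Option 1 layout).
    # Assumes boto3 is installed and credentials are configured; the bucket and
    # key names are the hypothetical ones from the question.
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "s3bucket"  # hypothetical

    # Per the quoted guideline, each contentId_N/ prefix gets its own 5,500 GET/s
    # budget, so reads for different content IDs scale independently.
    keys = [
        "contentId_1/assetType_1",
        "contentId_1/assetType_2",
        "contentId_2/assetType_1",
    ]

    def fetch(key):
        # Each GET counts against the request rate of the key's prefix
        # (everything up to the object name, e.g. "contentId_1/").
        resp = s3.get_object(Bucket=BUCKET, Key=key)
        return key, len(resp["Body"].read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        for key, size in pool.map(fetch, keys):
            print(key, size, "bytes")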
Related
I am using Nutch 1.17 to crawl over a million websites. I have to perform the following tasks.
First, run the crawler as a deep crawler so that it fetches the maximum number of URLs from the given (1 million) domains. This first run can take a maximum of 48 hours.
After this, run the crawler on the same 1 million domains every 5 to 6 hours and only select URLs that are new on those domains.
After the job completes, index the URLs in Solr.
Later on, there is no need to store raw HTML, so to save storage (HDFS), remove only the raw data and keep each page's metadata so that the next job avoids re-fetching a page (before its scheduled time).
There isn't any other processing or post-analysis. I have the option of using a medium-sized Hadoop cluster (max 30 machines). Each machine has 16 GB RAM, 12 cores, and 2 TB of storage. The Solr machine(s) have the same specs. To achieve the above, I am curious about the following:
a. How do I achieve the above crawl rate, i.e., how many machines are enough?
b. Should I add more machines, or is there a better solution?
c. Is it possible to remove raw data from Nutch and keep only the metadata?
d. Is there a best strategy to achieve the above objectives?
a. How do I achieve the above crawl rate, i.e., how many machines are enough?
Assuming a polite delay between successive fetches to the same domain: if 10 pages can be fetched per domain and minute, the maximum crawl rate is 600 million pages per hour (10^6 * 10 * 60). A cluster with 360 cores should be enough to come close to this rate. Whether it's possible to crawl the one million domains exhaustively within 48 hours depends on the size of each domain: at the mentioned rate of 10 pages per domain and minute, it's only possible to fetch 10 * 60 * 48 = 28,800 pages per domain within 48 hours.
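Spelled out as a quick back-of-envelope calculation (same assumptions as above):

    # Back-of-envelope crawl capacity, using the assumptions above.
    domains = 1_000_000
    pages_per_domain_per_minute = 10          # polite fetch delay assumed

    max_pages_per_hour = domains * pages_per_domain_per_minute * 60
    print(max_pages_per_hour)                 # 600,000,000 pages/hour (upper bound)

    hours = 48
    max_pages_per_domain = pages_per_domain_per_minute * 60 * hours
    print(max_pages_per_domain)               # 28,800 pages per domain in 48 hours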
c. Is it possible to remove raw data from Nutch and keep only the metadata?
As soon as a segment has been indexed, you can delete it. The CrawlDb is sufficient to decide whether a link found on one of the 1 million home pages is new.
After the job completes, index the URLs in Solr.
Maybe index segments immediately after each cycle.
b. Should I add more machines, or is there a better solution?
d. Is there a best strategy to achieve the above objectives?
A lot depends on whether the domains are of similar size or not. If they follow a power-law distribution (which is likely), you have a few domains with multiple millions of pages (hardly crawled exhaustively) and a long tail of domains with only a few pages (at most a few hundred). In this situation you need fewer resources but more time to achieve the desired result.
I have a requirement to bulk upload an Excel sheet to a DynamoDB table, and the maximum number of rows is 200,000. The website for bulk upload will be used infrequently, so we can assume there are only 1-2 bulk uploads being processed at a given time. In the backend, I am using the Apache POI API to parse the Excel sheet into DynamoDB items.
Because we can only send up to 25 items in a batchWriteItem call, the current latency is around 15 minutes (900 seconds) to upload all 200,000 items. Hence I am planning to implement multi-threading to execute multiple batchWriteItem API calls in parallel. Can you help me understand which EC2 instance types are best suited for multi-threading for this purpose?
Any references will be really helpful.
Normally, multi-threading would be helped by using an Instance Type that has multiple CPUs.
However, you are describing behaviour that is waiting on network rather than CPU. Therefore, it is likely that the operation you describe is not being heavily impacted by CPU Utilization.
The best way to answer your question is to recommend that you experiment with different instance types to find the one that is best for your application's combination of needs:
Pick an instance family (eg m5) and try a few different sizes
Compare this against another family (eg c5) to see whether the improved performance is worth the extra cost
Monitor the application to find the bottleneck, which would either be RAM, CPU, Network or Disk access
Please note that smaller instances have less Network bandwidth, so you might need to choose a larger instance type to avoid being throttled on network bandwidth. This might result in excess CPU that isn't being fully utilized.
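Since the work is mostly waiting on network round-trips rather than CPU, the threading itself can be modest. A rough sketch of what the parallel BatchWriteItem calls might look like with boto3 (the table name, item shape, and worker count are hypothetical):

    # Sketch: send 25-item BatchWriteItem chunks in parallel threads (boto3).
    # Table name, item attributes, and worker count are hypothetical.
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    dynamodb = boto3.client("dynamodb")
    TABLE = "bulk_upload_table"  # hypothetical

    def chunks(items, size=25):
        # BatchWriteItem accepts at most 25 put/delete requests per call.
        for i in range(0, len(items), size):
            yield items[i:i + size]

    def write_chunk(chunk):
        requests = [{"PutRequest": {"Item": item}} for item in chunk]
        while requests:
            resp = dynamodb.batch_write_item(RequestItems={TABLE: requests})
            # Retry anything DynamoDB could not process (add exponential
            # backoff in real code).
            requests = resp.get("UnprocessedItems", {}).get(TABLE, [])

    # The items would come from the Apache POI / Excel parsing step; shown
    # here in DynamoDB's attribute-value format as a placeholder.
    items = [{"pk": {"S": f"row-{i}"}, "payload": {"S": "..."}} for i in range(200)]

    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(write_chunk, chunks(items)))

In practice, retrying UnprocessedItems (ideally with backoff) and the table's write capacity tend to matter more than the exact instance size, which matches the observation above that the work is network-bound.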
Reading the documentation here, I am still not clear on the following points:
Are there any limits on the size of the API request to a Google Cloud Storage bucket? (We need to transfer PDFs from a CRM to Google Cloud Storage.)
How many files can we send? It mentions a limit of 1,000 writes per second; is that the same thing?
Files are the objects you store in Cloud Storage, so the 5 TB maximum size limit applies to individual files. There is no write limit across multiple objects; however, a bucket initially supports roughly 1,000 writes per second and then scales up as needed.
For parallel uploads, please note that up to 32 objects/files can be composed in a single composition request, and there is a per-project composition rate limit of approximately 1,000 source objects per second.
I also recommend that you take a look at the best practices on how to ramp up the request rate.
You can go with more than 1,000 writes per second per bucket - the GCS infrastructure will scale automatically to accommodate it. The only requirement is not to ramp up the load too quickly, namely "double the request rate no faster than every 20 minutes".
https://cloud.google.com/storage/docs/request-rate#ramp-up
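As a rough illustration of what that doubling rule implies (the target rate here is a hypothetical number):

    # Estimate ramp-up time under the "double no faster than every 20 minutes" rule.
    import math

    base_rate = 1000          # writes/sec a bucket initially supports
    target_rate = 20000       # hypothetical target writes/sec

    doublings = math.ceil(math.log2(target_rate / base_rate))
    print(f"~{doublings * 20} minutes to ramp from {base_rate} to {target_rate} writes/sec")
    # ceil(log2(20)) = 5 doublings -> roughly 100 minutes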
I am very new to the parallel computing world. My group use Amazon EC2 and S3 to manage all the data and it really opens a new world to me.
My question is how to estimate costs for computation. Suppose I have n TB of data in k files on Amazon S3 (for example, 0.5 TB of data in 7,000 zip files). I would like to loop through all the files and perform one regex-matching operation on each line of the files using Pig Latin.
I am very interested in estimating these costs:
1. How many instances should I select to perform this task? What capacity should the instances have (the size of the master instance and the map-reduce instances)? Can I deduce these capacities and costs based on n and k as well as the cost of each operation?
2. I have designed an example data flow: I used one xlarge instance as my master node, and 10 medium instances as my map-reduce group. Would this be enough?
3. How do I maximize the bandwidth for each of these instances to fetch data from S3? From my designed data flow, it looks like the read speed from S3 is about 250,000,000 bytes per minute. How much data exactly is transported to the EC2 instance? Would this be the bottleneck of my job flow?
1- IMHO, it depends solely on your needs. You need to choose it based on the intensity of computation you are going to perform. You can obviously cut down the cost based on your dataset and the amount of computation you are going to perform on that data.
2- For how much data? What kind of operations? Latency/throughput? For POCs and small projects it seems good enough.
3- It actually depends on several things, like whether you're in the same region as your S3 endpoint, the particular S3 node you're hitting at a point in time, etc. You might be better off using EBS if you need quicker data access, IMHO. You could mount an EBS volume to your EC2 instance and keep the data you frequently need there. Otherwise, some straightforward solutions are using 10 Gigabit connections between servers or perhaps using dedicated (costly) instances. But nobody can guarantee whether data transfer will be a bottleneck or not. Sometimes it may be.
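For point 3, here is a rough back-of-envelope check, assuming the 250,000,000 bytes per minute you observed is per reader and that the 10 medium instances read in parallel (both of these are assumptions):

    # Back-of-envelope: how long to pull 0.5 TB at the observed per-reader rate.
    data_bytes = 0.5 * 10**12            # 0.5 TB from the question
    rate_per_reader = 250_000_000 / 60   # ~250 MB/min observed => ~4.2 MB/s (assumed per reader)

    single_reader_hours = data_bytes / rate_per_reader / 3600
    print(f"single reader: ~{single_reader_hours:.0f} hours")   # ~33 hours

    readers = 10                         # e.g. the 10 medium map instances
    print(f"{readers} parallel readers: ~{single_reader_hours / readers:.1f} hours")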
I don't know if this answers your cost queries completely, but their Monthly Calculator certainly would.
How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into Couchbase buckets, and what the performance implications of the views required on the data are. My data consists of more than a dozen relational DBs, at least half of which have hundreds of millions of rows across a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on; allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDBMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had a fairly large number of views (around 8 design documents and some dd's with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries, can make a single index that may appear crowded at first, but is actually very flexible.
The less design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide "universal" rules that give you an exact number.
But based on the best-practice document that you have used and some discussion (here), you should be able to design your database/views properly.
Let's start with the last question:
YES, the reason Couchbase advises having a small number of buckets is for performance - and more importantly resource consumption - reasons. I invite you to read these blog posts, which help explain what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
Yes, most of the time you will organize the design documents and views by type of document.
It is NOT a problem to have all the document "types" in a single bucket (or a few); this is in fact the way you work with Couchbase.
The most important things to look at are the size of your documents (to see how long the JSON parsing will take) and how often documents are created, updated, and deleted, since the JS code of the view is ONLY executed when you create or change a document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how many views will you have in each design document?
In fact, the most expensive part is not the indexing or querying; it is when you have to rebalance the data and indices between nodes (adding, removing, or failure of nodes).
Finally, though it looks like you already know it, this chapter is quite good for understanding how views work (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.