I have a project that will generate a huge number of images (about 1,000,000; sorry, I mis-stated the number earlier).
I need to process every image through an algorithm.
Can you advise me on an architecture for this project?
It is a proprietary algorithm in the area of computer vision.
The average image size is about 20 kB.
I need to process each image when it is uploaded, and then 1 or 2 more times on request.
On average I receive a million images a day, and each of them needs to go through the algorithm 1-2 times per day.
Yes, most often the images will be stored on a local disk.
When I process an image, I generate a new image.
Current view:
Most likely I will have a few servers (which I do not own), and on each of them I have to perform the procedure described above.
Internet bandwidth between the servers is very thin (about 1 Mb/s), but I need to exchange messages between them (to update the coefficients of the neural network) and to push updates to the algorithm.
On current hardware (Intel family 6, model 26) it takes about 10 minutes to complete the full procedure for 50,000 images.
Maybe there will also be wide internet channels somewhere, so I can upload these images to the servers I have.
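A rough back-of-envelope check of the raw throughput, using only the numbers above (treating 2 passes per image per day as the upper bound is my assumption):

    # Rough capacity check; 2 passes/image is an assumed upper bound
    # of the "1-2 times per day" stated above.
    images_per_day = 1_000_000
    passes_per_image = 2
    required_per_sec = images_per_day * passes_per_image / 86_400   # ~23 images/s

    # Measured: ~10 minutes for 50,000 images on one current server.
    one_server_per_sec = 50_000 / 600                               # ~83 images/s

    print(required_per_sec / one_server_per_sec)   # ~0.28, so one such server
                                                   # keeps up with the average load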
I don't know much about images, but I guess this should help: http://www.cloudera.com/content/cloudera/en/why-cloudera/hadoop-and-big-data.html
Also, please let us know what kind of processing you are talking about when you say a huge number of images. How many do you expect per hour or per day?
I am using Nutch 1.17 to crawl over a million websites. I have to perform the following tasks:
One time, run the crawler as a deep crawl so that it fetches as many URLs as possible from the given (1 million) domains. This first run may take at most 48 hours.
After this, run the crawler on the same 1 million domains every 5 to 6 hours, selecting only those URLs that are new on those domains.
After the job completes, index the URLs in Solr.
Later on, there is no need to store raw HTML, so to save storage (HDFS), remove only the raw data and maintain each page's metadata so that the next job avoids re-fetching a page before its scheduled time.
There isn't any other processing or post-analysis. I have the option of a medium-sized Hadoop cluster (max 30 machines), each with 16 GB RAM, 12 cores, and 2 TB storage. The Solr machine(s) have the same specs. Given the above, I am curious about the following:
a. How can I achieve the above crawl rate, i.e., how many machines are enough?
b. Should I add more machines, or is there a better solution?
c. Is it possible to remove raw data from Nutch and keep metadata only?
d. Is there a best strategy to achieve the above objectives?
a. How can I achieve the above crawl rate, i.e., how many machines are enough?
Assuming a polite delay between successive fetches to the same domain: if 10 pages can be fetched per domain per minute, the maximum crawl rate is 600 million pages per hour (10^6 × 10 × 60). A cluster with 360 cores should be enough to come close to this rate. Whether it is possible to crawl the one million domains exhaustively within 48 hours depends on the size of each domain. Keep in mind that at the mentioned rate of 10 pages per domain per minute, only 10 × 60 × 48 = 28,800 pages per domain can be fetched within 48 hours.
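For clarity, the same arithmetic as a small sketch (all numbers taken directly from the paragraph above):

    # The crawl-rate arithmetic, spelled out.
    domains = 1_000_000
    pages_per_domain_per_minute = 10        # assumed polite per-domain fetch rate

    pages_per_hour = domains * pages_per_domain_per_minute * 60
    print(pages_per_hour)                   # 600,000,000 pages/hour upper bound

    hours = 48
    pages_per_domain = pages_per_domain_per_minute * 60 * hours
    print(pages_per_domain)                 # 28,800 pages per domain at most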
c. Is it possible to remove raw data from Nutch and keep metadata only?
As soon as a segment has been indexed, you can delete it. The CrawlDb is sufficient to decide whether a link found on one of the 1 million home pages is new.
After the job completes, index the URLs in Solr.
Maybe index segments immediately after each cycle.
b. Should I add more machines, or is there a better solution?
d. Is there a best strategy to achieve the above objectives?
A lot depends on whether the domains are of similar size or not. If their sizes follow a power-law distribution (which is likely), you have a few domains with multiple millions of pages (hardly crawled exhaustively) and a long tail of domains with only a few pages (a few hundred at most). In this situation you need fewer resources but more time to achieve the desired result.
I have a web app that allows users to insert short advert videos (30 to 60 seconds) into a longer main video (typically 45 minutes, but file sizes can vary widely).
The entire process involves:
Importing all selected files from s3
Encoding each to a common scheme, ipad-high.
Extracting clips from the main video.
Concatenating all clips from the main video with the advert videos.
For n videos to be inserted into the main video, n + 1 clips will be extracted.
Since Transloadit does not provide any estimates on how long an assembly may run, I'm looking to find a way to estimate this myself so I can display a progress bar or just an ETA to give users an idea of how long their jobs will take.
My first thought is to determine the total size of all files in the assembly and save that to a redis database, along with the completion time for that assembly.
Subsequent runs would use this as a benchmark of sorts: if 60 GB took 50 minutes, how long will 25 GB take?
The data in redis would be continually updated (I guess I could make the values a running average of sorts) to make the estimates as reliable as possible.
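Something like this rough sketch is what I have in mind (redis-py assumed; the key name and smoothing factor are placeholders):

    import redis

    r = redis.Redis()
    KEY = "assembly:sec_per_gb"   # hypothetical key holding the running average
    ALPHA = 0.2                   # smoothing factor (a guess at a sane value)

    def estimate_eta_seconds(total_gb):
        """Return an ETA in seconds, or None if there is no history yet."""
        rate = r.get(KEY)
        return float(rate) * total_gb if rate is not None else None

    def record_run(total_gb, duration_seconds):
        """Fold a finished assembly into the running average."""
        observed = duration_seconds / total_gb   # seconds per GB for this run
        prev = r.get(KEY)
        new = observed if prev is None else ALPHA * observed + (1 - ALPHA) * float(prev)
        r.set(KEY, new)

With the example above: 60 GB in 50 minutes is 50 s/GB, so a 25 GB assembly would be estimated at roughly 21 minutes.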
Any ideas are welcome, thanks :)
I'll paraphrase some of the conversation had over at Transloadit regarding this question:
Estimating the duration of an Assembly is a complex problem to solve because of how many factors go into the calculation. For example: how many files are in a zip that is being uploaded? How many files are in the directory that will be imported? How many files will pass the filter on colorspace: rgb? These are things that are only found out as the Assembly runs, but they can wildly alter the ETA.
There are plans for a dashboard that will showcase graphs with information on your Assemblies, such as throughput in Mbit/s; combined with historical data on the Template and file sizes, this could be used for rough estimations.
One suggestion was that instead of an ETA, it may be easier to implement a progress bar showing when each step or job has been completed. The downside with this is of course the accuracy, but it may be all you need for a front-facing solution.
You may also be interested in looking into turbo mode. If you're using the /video/encode or /video/concat robot, it may dramatically reduce encoding times.
I have an ASP.net web application deployed on a small AWS instance (Dual Core AMD, 2.60 GHz, 1.7 GB RAM). I would like to perform load testing on this server for 300 concurrent users, and for the future I want to design tentative capacity planning and a deployment architecture for 250,000 registered users of my application.
I am very new to this area and have not done any kind of load testing before.
The use case and scenario for my application are as follows:
Scenario: 250,000 registered users in the database
Concurrency: 5-7%, approximately 17,500 concurrent users
Each user has a bookshelf; assume each user is subscribed to 10 books. Each book is around 25 MB in size, with 400 pages.
Use cases
User Login
Database authentication & authorization
View Book Shelf with book images
Book Shelf (.swf) - 400 KB (gets loaded for each user)
10 book images will be loaded (approximately 20 KB per image)
catalog.xml - 30 KB per user
Note: approximately 650 KB of data gets downloaded to the client machine
Browse book: on clicking a book image, the following files (and their sizes) will be downloaded to the client machine
One time
Reader.swf - 950 KB (first download)
XML data of approximately 100 KB per book (on click)
Book.xml
Annotation.xml
Catalog.xml
Usersettings.xml - 40 KB × 4 = 160 KB per user
Note: approximately 1,200 KB of data gets downloaded to the client machine
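For reference, my own rough estimate of the aggregate download this implies (the assumption that all concurrent users load the shelf within the same minute is an arbitrary worst case):

    # Worst-case burst estimate from the numbers above; assumes every
    # concurrent user loads the 650 KB bookshelf within the same minute.
    concurrent_users = 17_500
    shelf_kb = 650

    total_mb = concurrent_users * shelf_kb / 1024     # ~11,100 MB per burst
    mbit_per_s = total_mb * 8 / 60                    # spread over 60 seconds
    print(round(total_mb), round(mbit_per_s))         # ~11108 MB, ~1481 Mbit/s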
Could someone please suggest how I can proceed with this?
Many thanks in advance,
Amar
Completing the first goal (test 300 users) is pretty straightforward - choose a load testing tool, build the scenarios and test. Then tune/optimize and repeat.
But I think your bigger question is how to approach testing and planning for your full capacity - which you say is ~18k concurrent users. First, make sure that number (7% of user base) is the peak concurrency, not average. You need to test the peak.
So assuming that you are planning a load-balanced multiple-server cluster to handle that load, the next step is to determine the maximum capacity of a single web/app server, without the load-balancer in place. This gives you a baseline that you can use to judge the performance of the cluster. This is a really important step and many of our clients skip this step, to their own detriment. It is important because there are many conditions under which a load-balanced system does not scale linearly with the number of servers in the cluster. Ideally it should and good systems get pretty close. You'd be surprised how frequently we see systems that don't scale well at all. We've even seen a few systems that actually have lower capacity as a cluster than a single server could handle on its own.
Once you have that baseline established, you can make a preliminary estimate about the total number of servers you'll need and you can build your cluster. I recommend next testing with 2 web/app servers. This should nearly double your capacity. If it doesn't then you need to determine why before moving on to larger tests. Likely candidates are the load balancer setup or the database (if a single database server is servicing all the web/app servers). Occasionally something more fundamental to the application architecture is at play.
When you are satisfied that scaling from 1 to 2 servers is performing optimally, then you can proceed to scale up to your full cluster and test maximum capacity. Be prepared to back-track if you don't see the scalability you expected - test with 3, 4, 5 servers, etc.
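One way to quantify "scaling well" at each of these steps is a simple efficiency ratio (a sketch with made-up capacity numbers; substitute your own load-test results):

    # Illustrative scaling-efficiency check; the capacity numbers here are
    # hypothetical and would come from your own load-test runs.
    baseline = 450                    # hypothetical max users on 1 server
    measured = {2: 880, 4: 1650}      # hypothetical cluster results

    for n, capacity in measured.items():
        efficiency = capacity / (n * baseline)
        print(f"{n} servers: {capacity} users, {efficiency:.0%} of linear")
    # Values far below 100% point at the load balancer or a shared
    # database becoming the bottleneck before the app servers do.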
I hope that helps! Good luck :>
This link has links to some tools to stress-test your website: http://support.microsoft.com/kb/231282
This is obviously a complicated area. You may have 250,000 registered users (really that many?), but how many are concurrent, and which areas of the website will they be using?
All of these things (and many more) will impact the capacity planning for your system.
We are currently developing a servlet that will stream large image files to a client. We are trying to determine how many JBoss nodes we would need in our cluster with an Apache mod_jk load balancer. I know that it takes roughly 5000 milliseconds to serve a single request. I am trying to use the formula here, http://people.apache.org/~mturk/docs/article/ftwai.html, to figure out how many connections are possible, but I am having an issue because they don't explain each of the numbers in the formula. Specifically, they say that you should limit each server to 200 requests per CPU, but I don't know whether I should use that in the formula or not. Each server we are using will have 8 cores, so I think the formula should go either like this:
Concurrent Users = (500/5000) * 200 * 8 = 160 concurrent users
Or like this:
Concurrent Users = (500/5000) * (200 * 8) * 8 = 1280 concurrent users
It makes a big difference which one they meant. Without an example in their documentation, it is hard to tell. Could anyone clarify?
Thanks in advance.
I guess these images aren't static, or you'd have stopped at this line?
First thing to ease the load from the Tomcat is to use the Web server for serving static content like images, etc.
Even if not, you've got larger issues than a factor of 8: the purpose of his formula is to determine how many concurrent connections you can handle without the AART (average application response time) exceeding 0.5 seconds. Your application takes 5 seconds to serve a single request. The formula as you're applying it is telling you 9 women can produce a baby in one month.
If you agree that 0.5 seconds is the maximum acceptable AART, then you first have to be able to serve a single request in <=0.5 seconds.
Otherwise, you need to replace his value for maximum AART in ms (500) with yours (which must be greater than or equal to your actual AART).
Finally, as to the question of whether his CPU term should account for cores: it's going to vary depending on CPU & workload. If you're serving large images, you're probably IO-bound, not CPU-bound. You need to test.
Max out Tomcat's thread pools & add more load until you find the point where your AART degrades. That's your actual value for the second half of his equation. But at that point you can keep testing and see the actual value for "Concurrent Users" by determining when the AART exceeds your maximum.
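To make that concrete, here is the corrected arithmetic under both readings of the CPU term (a sketch of my reading of the answer; whether the 200 is per server or per core still has to come from testing):

    # Both readings of the formula, with the 500 ms target replaced by a max
    # AART that is at least the actual 5000 ms, as the answer describes.
    actual_aart_ms = 5000
    max_aart_ms = 5000            # best case: you accept 5 s responses
    requests_per_cpu = 200
    cores = 8

    reading_1 = (max_aart_ms / actual_aart_ms) * requests_per_cpu           # 200
    reading_2 = (max_aart_ms / actual_aart_ms) * requests_per_cpu * cores   # 1600
    print(reading_1, reading_2)
    # With the original 500 ms target these become 20 and 160; either way,
    # the 5-second service time dominates the result, not the factor of 8.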
I understand there are several factors involved when making DB calls and going over the internet, but I am referring strictly to the methods processing the requests, not the round trip to the client. I am using a stopwatch to get an average range, but I do not know what is generally considered fast or decent performance. 10 ms? 500 ms?
It is a really subjective question, but I think it is valid. All of us know that 4 MPH is slow for a car, while 150 MPH is very fast. Now let's get back to servers. A fast (indexed) call to the DB takes about 20 ms; let's say we need 5 of them. Storage latency is also about 5-10 milliseconds, with throughput of dozens of megabytes per second; let's say we need to read 1 MB, which should take about 50 milliseconds. 10 milliseconds of CPU is enough to make dozens of searches in various maps, and 10-20 ms is enough to efficiently fill some template with the result.
So we sum: 20 × 5 (DB) + 50 (file system) + 10 (in-memory searches) + 20 (template filling) = 180 milliseconds. Very roughly, we can assume that an efficient server, not overloaded and not doing excessive scans over data, should have a response time of around 200 milliseconds. From the above we can also conclude that getting below 50 ms is very challenging.
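The same budget, spelled out (all numbers are the rough estimates from above):

    # The latency budget above, in milliseconds.
    db_calls       = 5 * 20    # five indexed DB calls at ~20 ms each
    file_read      = 50        # ~1 MB from storage
    in_memory_work = 10        # dozens of map lookups
    template_fill  = 20        # filling the result template

    print(db_calls + file_read + in_memory_work + template_fill)   # 180 ms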
Of course, all of the above depends on many factors, but the goal of this post is to give some feeling for what is fast and what is slow.