We are creating a site where users will upload images that need to be classifiable and searchable.
My question is about how to store those images: what would make a solid, maintainable solution?
I've looked at S3 - it looks promising.
If S3 is a good option, where would I store the references to the objects (along with the metadata/tags)?
Thanks :)
If I were architecting such a system, I would certainly look no further than S3 for scalability and durability for actually storing the images -- and thumbnails -- and metadata, to some extent.
S3 metadata storage is limited to 2KB (total number of bytes of all keys and all values combined), is limited to US-ASCII, and is not indexed -- you have to fetch the metadata for the specific object. For many applications, this is entirely sufficient but that's very doubtful in your case.
http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-metadata
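For illustration, here is a minimal boto3 sketch (the bucket name, key, and metadata values are made up) of attaching user-defined metadata at upload time and reading it back with a HEAD request:

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Upload an image with a few user-defined metadata entries. All keys and
# values together must stay under the 2 KB limit and be US-ASCII.
with open("cat-1234.jpg", "rb") as image:
    s3.put_object(
        Bucket="example-image-bucket",        # hypothetical bucket
        Key="uploads/cat-1234.jpg",
        Body=image,
        ContentType="image/jpeg",
        Metadata={"uploader": "user-42", "tags": "cat,funny"},
    )

# The metadata is not indexed; you have to address the specific object to read it.
head = s3.head_object(Bucket="example-image-bucket", Key="uploads/cat-1234.jpg")
print(head["Metadata"])   # {'uploader': 'user-42', 'tags': 'cat,funny'}
```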
So, the question "is S3 a good option" is easily answered: if you mean among AWS services, the answer is yes; it's difficult to argue that it isn't the best fit.
You may also consider CloudFront -- not instead of, but in addition to S3. It can improve load times by caching your "popular" content closer to where users are located, among other things.
Where to store the references to the objects goes off into the land of "opinion based," which we don't do on Stack Overflow. The answer is, of course, "in a database," but AWS has options here.
I'm a relational database DBA, so of course, my inclination is that everything should have a relational database (such as RDS) as its authoritative data store, while others would probably say the DynamoDB NoSQL database offering would be a useful data store.
From there (wherever "there" is), CloudSearch could be populated with the metadata, keywords, etc., for handling the actual search operations, using indexes it builds that are potentially better suited to search-intensive operations than proper databases are. I would not, however, try to use CloudSearch as the authoritative store of all your valuable metadata. Search indexes should be treated as disposable, rebuildable assets... although I fear even that statement might strike some as being opinion-based.
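As a rough sketch of the "references in a database" idea, storing the S3 key plus tags in DynamoDB with boto3 might look like the following; the table name, key schema, and attributes are invented for illustration, and an RDS table would carry the same information in relational form:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
images = dynamodb.Table("Images")   # hypothetical table with ImageId as partition key

# Authoritative record: where the object lives in S3, plus its searchable metadata.
images.put_item(
    Item={
        "ImageId": "cat-1234",
        "S3Bucket": "example-image-bucket",
        "S3Key": "uploads/cat-1234.jpg",
        "Uploader": "user-42",
        "Tags": ["cat", "funny"],
        "UploadedAt": "2016-03-01T12:00:00Z",
    }
)

# Fetch the record later, then serve the image via its S3 key (or a pre-signed URL).
record = images.get_item(Key={"ImageId": "cat-1234"})["Item"]
print(record["S3Key"], record["Tags"])
```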
One thing that isn't a matter of opinion is that all of these various cloud services allow you to spin up a substantial proof-of-concept infrastructure at costs that are so low as to have been unimaginable just a few years ago... so you can try them, play with them, and throw them away if they don't do what you expect. You don't have to buy before you try.
Related
I have been looking but haven't found the answer to this simple question: in general, how much faster is AWS EBS than S3?
I'm looking for a general ballpark answer in terms of "X times faster", "Y orders of magnitude faster", or a range of "somewhere # to # times faster depending on the specific use case". I'll even take answers that give different use cases as long as there are actual relative performance NUMBERS associated with them.
And PLEASE, I do NOT want this thread to devolve into an architectural discussion of various other solutions I might try (e.g., DynamoDB, ElastiCache, RDS, etc) or the limitations imposed by whatever compute solution I choose (e.g. EC2, Lambda, ECS, etc). Nor do I find "why would you want to do that?" counter-questions helpful no matter how often they appear on StackOverflow.
I'm just looking for "How much faster is EBS than S3?" because I haven't the slightest clue about it right now and haven't found a resource that gives me the type of answer I'm looking for.
Yes, yes, I know the real answer HAS to be "it depends" because determining the answer is infinitely more nuanced than my question.
I know that generally EBS is faster and that how much faster will depend on all sorts of things, like drive type, PIOPs, network speed, etc. all related to a specific use case. But surely there's a general rule of thumb to help choose between the two when evaluating a system's design cost/benefit tradeoffs. ("No there isn't, and stop calling me Shirley.")
If you need to know why I'm asking, let's say I'm just curious what the general speed difference is in case I ever want to decide between them when standing up a cheap-as-heck web site with a dirt-simple data store that is either on EBS or S3. (Again, I'm NOT looking for design options).
Thanks
For the edge case where you are storing a file as an immutable data file and need to choose between S3 object store and EBS-based file system, both are valid options if you stretch your definition of validity.
To answer the question: EBS is faster (20x?) than S3, but more expensive and bound to a single EC2 instance. S3 is slower but cheaper and more widely accessible to multiple resources. S3 can be made much faster with some additional components (see @Mark B's comments above for more details).
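If you want your own numbers rather than anyone's rule of thumb, a crude micro-benchmark run from an EC2 instance is easy to throw together. The sketch below (file path, bucket, and key are hypothetical) times repeated reads of the same data from an EBS-backed filesystem and from S3; expect the ratio to swing widely with object size, instance type, and OS caching:

```python
import time
import boto3

s3 = boto3.client("s3")

def time_ebs_reads(path, n=50):
    # Read a file that lives on an EBS-backed volume (hypothetical path).
    start = time.perf_counter()
    for _ in range(n):
        with open(path, "rb") as f:
            f.read()
    return (time.perf_counter() - start) / n

def time_s3_reads(bucket, key, n=50):
    # GET a similar-sized object from S3 (hypothetical bucket/key).
    start = time.perf_counter()
    for _ in range(n):
        s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return (time.perf_counter() - start) / n

ebs_avg = time_ebs_reads("/data/sample.bin")
s3_avg = time_s3_reads("example-bucket", "sample.bin")
print(f"EBS: {ebs_avg * 1000:.1f} ms/read, S3: {s3_avg * 1000:.1f} ms/read, "
      f"S3 is ~{s3_avg / ebs_avg:.0f}x slower here")
```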
You (and I) will want to use S3.
Additional Info
For immutable data files, S3 makes more sense in all normal cases: it's fast enough, cheaper, durable, and its contents are more available to multiple parallel processes, making it better for HA solutions.
EBS could make sense if you are hyper-sensitive to speed (and can't reach the speeds you need with S3 plus other components) but not to cost or high availability. EBS (or its LAN-like equivalent, EFS) is the only choice (out of the two being explored) if your file is going to be frequently updated (e.g., random-access files for an RDBMS) or used by an application that requires a file system to be in place.
(Thanks @Mark B for your patience, and @Jarmod for the objective data)
I'm quite new to big-data architecture, so please don't be too harsh on me.
I am trying to figure out the best alternative for building a BI architecture able to deal with huge amounts of data. As I see it, the solution has to be clustered/horizontally scalable to cope with system growth. I would like to be able to interact with the system using SQL, so HBase + Hive (or even Pig, not for SQL but to avoid writing MR tasks by hand) could be a solution. What would be the benefits/disadvantages of such an architecture compared with, for instance, EXASOL and their in-memory MPP columnar solution?
Are there other alternatives which might have some extra benefits? What about maintenance and configuration? Is there any Microsoft solution? (I may run into customer-specific needs regarding this.)
Sorry for posting such an open question, but I would like to see some discussion so that I can learn from you as much as possible.
Though I am an EXASOL guy, I will not try to convince you that EXASOL is the one and only good solution out there. It heavily depends on the use case you are trying to implement and the requirements you have to fulfill.
Hadoop is a very flexible, scalable system and used very often for storing and processing huge volumes of data.
EXASOL in contrast is a specialized RDBMS for complex analytic query processing.
I think that these two options don't really compete directly but complement each other. In many cases companies need a scalable data lake to store and preprocess their data, or to query it in rather simple ways. Once you want to enter the real-time business with complex analytics, where dozens, hundreds or even thousands of analysts are running lots of queries, then an in-memory RDBMS is a great choice.
King, the producer of Candy Crush, combines these two worlds into a powerful data management ecosystem. They store petabytes of data within Hadoop and use EXASOL on top as an in-memory layer for hundreds of terabytes of data. You can read more about that use case here: http://bit.ly/1TR8APY
Another important difference between these two worlds is complexity. While EXASOL is tuning-free because it is a specialized system (similar to an appliance) for a certain use case, namely running SQL queries or R/Python/Java in-database analytics, the Hadoop stack is much more complex. You'll need a certain level of know-how to set up, maintain and tune that system. This doesn't have to be the deciding factor for either option; as mentioned, it heavily depends on what you want.
From a price perspective, Hadoop is free, so it should be much cheaper than an in-memory database such as EXASOL, right? Wait a minute, it's not that easy. Again, you have to consider the whole picture: how much data you really want to store, how much of it needs to be queried for analysis, how much hardware you would need to buy, and how many people have to be hired and trained to operate the system or to build the analytics deployed on it.
Summary
To summarize my thoughts, the world is too complicated to compare these two technologies directly. Depending on the use case and your personal requirements, either one could be the better option. And in my opinion, the trend in the market is combining such systems into data management ecosystems where you get the best of both worlds... actually three worlds, because the world of operational data processing with NoSQL solutions should also be mentioned here.
I hope that helped a bit. If you need any further details especially about EXASOL, don't hesitate to contact me or connect with me on LinkedIn: de.linkedin.com/in/exagolo
I am looking for a database/mechanism to store data where I can write and read the data with high performance.
This storage will be used for logging important information across multiple systems. Since the logged data is critical and will be used to show history, read performance needs to be fast. We never update or delete the data, and we never do any kind of joins, so I am looking for the right solution for that pattern. We will probably archive the data eventually, but that's fine to deal with later.
I have tried looking at different sources to understand the various NoSQL databases, but expert opinions are always better :)
Must Have:
1. Fast Read without fail
2. Fast Write without fail
3. Random access Performance
4. Replication-like feature: if one node goes down, another should immediately be up and working
5. Concurrent write/read data
Good to Have:
1. Searching the content, e.g. analysing the data for auditing, with or without indexes
Not required:
1. Transactions are not required at all
2. Update never happens
3. Delete never happens
4. Joins are not required
Referred: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Disclosure: Kevin Porter has been a Senior Software Engineer at Aerospike, Inc. since May 2013. (ref)
Be sure to consider Aerospike; Aerospike dominates in the adtech space, where high-throughput reads and writes are a requirement. Aerospike is frequently touted as having "the speed of Redis with the scalability of Cassandra." For searching/querying, see Aerospike's secondary index documentation (a small sketch follows the links below).
For more information see the discussion/articles below:
Aerospike vs Cassandra
Aerospike vs Redis and Mongo
Aerospike Benchmarks
Lastly verify the performance for yourself with the One million TPS on EC2 Instructions.
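If you do evaluate it, a minimal sketch with the Aerospike Python client might look like this; the namespace, set, and bin names are hypothetical, and a secondary index on the "system" bin is assumed to already exist:

```python
import aerospike
from aerospike import predicates as p

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# Write a log record; the key is (namespace, set, user key).
key = ("test", "logs", "log-0001")
client.put(key, {"system": "billing", "level": "ERROR", "msg": "payment failed"})

# Query through the secondary index on the 'system' bin.
query = client.query("test", "logs")
query.where(p.equals("system", "billing"))
for _key, _meta, record in query.results():
    print(record)

client.close()
```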
Let me be the Cassandra sponsor.
Disclaimer: I'm not saying Cassandra is better than the others, because I don't know mongo/redis/whatever deeply enough, and I don't even want to get into that kind of debate.
The reason why I suggest Cassandra is that your needs match perfectly with what Cassandra offers, and your "not required" list is a set of features that are either not supported in Cassandra (joins, for instance) or considered an anti-pattern (deletes, and in some situations updates).
From your "Must Have" list, point by point
Fast Read without fail: Supported. You can choose the consistency level of each read operation, deciding how important it is to retrieve the freshest information versus how important speed is
Fast Write without fail: Same as point 1
Random access Performance: When coming into the Cassandra world you have to consider many parameters to get good random-access performance, but the most important one that comes to my mind is the data model: if you create a data model that scales horizontally (have a look here) and you avoid hotspots, you get what you need. If you model your DB well, you should get O(1) for each operation, since the data are structured to be queried (see the sketch after this list)
Replication: Here Cassandra is even better than you might think. If one node goes down, nothing changes for the cluster and everything(*) keeps working perfectly. Cassandra has no single point of failure. I can tell you that with older Cassandra versions I've had an uptime of more than 3 years
Concurrent write/read data: Cassandra uses the LWW (last-write-wins) policy to handle concurrent writes on the same key. The system supports many concurrent readers and writers, and, with newer protocols, asynchronous operations as well.
There are lots of other interesting features Cassandra offers: linear horizontal scaling is the one I appreciate most, but there is also the fact that you can know the instant at which every piece of data was updated (the LWW timestamp), counter features, and so on.
(*) - if you don't use Consistency Level All which, imho, should NEVER be used in such a system.
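To make the data-model point above concrete, here is a rough sketch with the DataStax Python driver: a time-partitioned, append-only table for log entries with a per-statement consistency level. The keyspace, table, and (system, day) bucketing scheme are just illustrative choices, not the one right model:

```python
import datetime
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
# Partitioning by (system, day) spreads writes across the cluster and
# avoids a single ever-growing hot partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS logs.entries (
        system  text,
        day     text,
        ts      timeuuid,
        message text,
        PRIMARY KEY ((system, day), ts)
    )
""")

# Consistency level is chosen per statement: trade freshness against speed.
insert = SimpleStatement(
    "INSERT INTO logs.entries (system, day, ts, message) VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(insert, ("billing", datetime.date.today().isoformat(),
                         uuid.uuid1(), "payment failed"))
```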
Here are a few more links on how you can span in-memory and disk (DRAM, SSD, and rotational disk storage) with Aerospike:
http://www.aerospike.com/hybrid-memory/
http://www.aerospike.com/docs/architecture/storage.html
I think everyone is right in terms of matching the specific DB to your specific use case. For instance, Aerospike is optimal for key-value data. Other options might be better.
By way of analogy, I'll always remember how, decades ago, a sister of mine once borrowed my computer and wrote her term paper in Microsoft Excel. Line after line was a different row of a spreadsheet. It looked ugly as heck, but, uh, okay. She got the task done. She cursed and swore at how difficult it was to edit the thing. No kidding!
Choosing the right NoSQL database for the right task will either make your job a breeze, or could cause you to curse a blue streak if you decided on the wrong basic tool for the task at hand.
Of course, every vendor's going to defend their product. I think it's best the community answer the question. Here's another Stack Overflow thread answering a similar question:
Has anyone worked with Aerospike? How does it compare to MongoDB?
btw: Do you have any more specific insights for us on what type of problem you are trying to solve?
Currently Table Storage supports From, Where, Take, and First.
Are there plans to support any of the other 29 operators?
Are there architectural or design practices in regards to storage that one should follow in order to implement things like COUNT, SUM, GROUP BY, etc?
If we have to code for these ourselves, how much of a performance difference are we looking at to something similar via SQL and SQL Server? Do you see it being somewhat comparable or will it be far far slower if I need to do a Count or Sum or Group By over a gigantic dataset?
I like the Azure platform and the idea of cloud based storage. I like Table Storage for the amount of data it can store and its schema-less nature. SQL Azure just won't work due to the high cost of storage space.
Ryan,
As Steve said, aggregations are resolved "client side", which might lead to bad performance if your datasets are too large.
An alternative is to think about the problem in a different way. You might want to pre-compute those values so they are readily available. For example if you have master-detail data (like the proverbial Purchase order + line items), you might want to store the "sum of line items" in the header. This might appear to be "redundant" (and it is), but de-normalization is something you will have to consider.
These pre-computations can be done synchronously or asynchronously. In some situations you can afford approximations, so delaying the computation might be beneficial from a performance perspective.
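A rough sketch of the pre-computation idea with the Azure Tables SDK for Python (table, entity, and property names are invented, and a real implementation would need ETag handling for concurrent writers): whenever a line item is written, the running total on the order header is updated as well, so no client-side SUM is ever needed.

```python
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage connection string>")
table = service.get_table_client("Orders")   # hypothetical table

def add_line_item(order_id: str, item_id: str, amount: float) -> None:
    # Store the detail row for the line item.
    table.upsert_entity({
        "PartitionKey": order_id,
        "RowKey": f"item-{item_id}",
        "Amount": amount,
    })
    # Denormalize: keep a running total on the order header entity
    # (assumed to exist already) so the sum never has to be computed
    # client side. Concurrent writers would need ETag checks here.
    header = table.get_entity(partition_key=order_id, row_key="header")
    header["Total"] = header.get("Total", 0.0) + amount
    table.update_entity(header)
```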
The only alternative is to pull everything down locally and run Count() or Sum() over the local objects. Because you have to transfer the entire contents of your table before doing the count, this will certainly be much slower than doing something server-side like with SQL. How much slower depends on the size of your data.
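For contrast, the client-side approach looks roughly like this (same invented table layout as the sketch above); every matching entity crosses the wire before the aggregate can be computed, which is exactly where the slowdown comes from:

```python
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage connection string>")
table = service.get_table_client("Orders")   # same hypothetical table

# Client-side SUM: every matching entity is downloaded, then aggregated locally.
items = table.query_entities(query_filter="PartitionKey eq 'order-001'")
total = sum(e["Amount"] for e in items if str(e["RowKey"]).startswith("item-"))
print(total)
```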
What are the best practices for database design and normalization for high traffic websites like stackoverflow?
Should one use a normalized database for record keeping, or a denormalized technique, or a combination of both?
Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?
or
Should the main database be denormalized but with normalized views at the application level for fast database operations?
or some other approach?
The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.
So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.
That said, for ridiculous amounts of traffic, good relational db performance can be hard to achieve. That is why many bigger sites use key-value stores (e.g. memcached) and other caching mechanisms.
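As a sketch of the caching side, the usual cache-aside pattern with pymemcache looks like this; the profile-loading function is a stand-in for whatever relational query the cache is protecting:

```python
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def load_profile_from_db(user_id: int) -> dict:
    # Stand-in for the real relational-database query.
    return {"id": user_id, "name": "example"}

def get_user_profile(user_id: int) -> dict:
    """Cache-aside: try memcached first, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = load_profile_from_db(user_id)
    cache.set(key, json.dumps(profile), expire=300)  # cache for 5 minutes
    return profile

print(get_user_profile(42))
```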
The Art of Capacity Planning is pretty good.
You can listen to a discussion on this very topic by the creators of Stack Overflow on their podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html
First: define for yourself what high-traffic means:
50,000 page views per day?
500,000 page views per day?
5,000,000 page views per day?
more?
Then calculate this down to probable peak page views per minute and per second.
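As a quick back-of-the-envelope example (the daily total and the peak factor are assumptions you would replace with your own traffic profile):

```python
page_views_per_day = 500_000
average_per_second = page_views_per_day / (24 * 60 * 60)   # ~5.8 views/s
peak_factor = 10       # assume peak traffic is roughly 10x the daily average
peak_per_second = average_per_second * peak_factor          # ~58 views/s
print(f"average {average_per_second:.1f}/s, peak ~{peak_per_second:.0f}/s")
```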
After that think about the data you want to query per page-view. Is the data cacheable? How dynamic is the data, how big is the data?
Analyze your individual requirements, program some code, do some load-testing, optimize. In most cases, before you need to scale out the database servers you need to scale out the web-servers.
A relational database can, if fully optimized, be amazingly fast, even when joining tables!
A relational database might be hit only seldom when used as a back-end, to populate a cache or to fill some denormalized data tables. I would not make denormalization the default approach.
(You mentioned search, look into e.g. lucene or something similar, if you need full-text search.)
The best best-practice answer is definitely: It depends ;-)
For a project I'm working on, we've gone the denormalized-table route, as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find http://highscalability.com/ a good read for examples of how the "big sites" cope with the volume; Stack Overflow was recently featured.
Neither matters if you aren't caching properly.