I am evaluating a number of different NoSQL databases to store time series JSON data. ElasticSearch has been very interesting because of its query engine; I just don't know how well it is suited to storing time series data.
The data is composed of various metrics and stats collected at various intervals from devices. Each piece of data is a JSON object. I expect to collect around 12GB/day, but only need to keep the data in ES for 180 days.
Would ElasticSearch be a good fit for this data compared to MongoDB or HBase?
You can read up on an ElasticSearch time-series use-case example here.
But I think columnar databases are a better fit for your requirements.
My understanding is that ElasticSearch works best when your queries return a small subset of results, and it caches the filters used so they can be reused later. If the same filters appear in later queries, it can combine the cached results, returning results really fast. But with time series data you generally need to aggregate, which means traversing a lot of rows and columns together. Such behavior is quite structured and easy to model, in which case there does not seem to be a reason why ElasticSearch should perform better than a columnar database. On the other hand, it may provide ease of use, less tuning, and so on, all of which may make it preferable.
Columnar databases generally provide a more efficient data structure for time series data. If your query structure is known well in advance, then you can use Cassandra. Beware that queries which do not use the primary key will not be performant in Cassandra. You may need to create different tables with the same data for different queries, as its read speed depends on the way it writes to disk. You need to learn its intricacies; a time-series example is here.
Another columnar database that you can try is the columnar extension for PostgreSQL. Considering that your maximum DB size will be about 180 days x 12 GB/day = 2.16 TB, this approach should work well and may actually be your best option. You can also expect significant compression, on the order of 3x. You can learn more about it here.
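As a rough illustration of that route, here is a minimal sketch using the cstore_fdw columnar extension through psycopg2. The database, table, and column names are made up for illustration, and the extension must already be installed on the server; exact options may differ by version.

```python
# Minimal columnar-Postgres sketch (assumes cstore_fdw is installed;
# names "metrics", "device_metrics", etc. are placeholders).
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")
conn.autocommit = True
cur = conn.cursor()

# One-time setup: register the columnar foreign data wrapper and a server for it.
cur.execute("CREATE EXTENSION IF NOT EXISTS cstore_fdw")
cur.execute("CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw")

# Columnar foreign table for the time series metrics; compression is optional.
cur.execute("""
    CREATE FOREIGN TABLE device_metrics (
        device_id   text,
        metric      text,
        ts          timestamptz,
        value       double precision
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz')
""")
```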
Using time-based indices, for instance one index per day, together with the index-template feature and an alias to query all indices at once, could be a good match (a small sketch follows below). Still, there are many factors that you have to take into account, such as:
- Type of queries
- Structure of the documents and the query requirements over that structure
- Amount of reads versus writes
- Availability, backups, monitoring
- Etc.
Not an easy question to answer with a yes or no; I am afraid you will have to do more research yourself before you can really say it is the best tool for the job.
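To make the time-based-index idea concrete, here is a minimal sketch against the Elasticsearch REST API using the requests library. The index pattern, alias, and field names are assumptions, and the template syntax shown targets 6.x/7.x (older versions use "template" instead of "index_patterns").

```python
# Minimal daily-index + template + alias sketch (assumes Elasticsearch on localhost:9200).
import datetime
import json
import requests

ES = "http://localhost:9200"

# Index template: every index matching metrics-* gets the same mappings and is
# attached to one alias, so all days can be queried at once.
template = {
    "index_patterns": ["metrics-*"],
    "aliases": {"metrics-all": {}},
    "mappings": {
        "properties": {
            "device_id":  {"type": "keyword"},
            "metric":     {"type": "keyword"},
            "@timestamp": {"type": "date"},
            "value":      {"type": "double"},
        }
    },
}
requests.put(f"{ES}/_template/metrics", json=template).raise_for_status()

# Writers target the index for the current day, e.g. metrics-2016-03-01.
today = datetime.date.today().isoformat()
doc = {"device_id": "dev-42", "metric": "cpu",
       "@timestamp": f"{today}T00:00:00Z", "value": 0.73}
requests.post(f"{ES}/metrics-{today}/_doc", json=doc).raise_for_status()

# Readers query the alias, which spans every daily index.
query = {"query": {"range": {"@timestamp": {"gte": "now-7d"}}}}
print(json.dumps(requests.get(f"{ES}/metrics-all/_search", json=query).json(), indent=2))
```

Retention then becomes trivial: keeping only 180 days of data is just a matter of deleting the daily indices older than that.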
I'm working on a game application where I need real-time data for the leaderboards I'm building. I've read a bunch of Stack Overflow posts and company blogs, but honestly I'm not sure which one best fits my use case. I am using DynamoDB to record players' recent moves, and the history of moves is in Kafka. I am looking to stream data from these two sources into a database so that my leaderboard service can query the database to render the contents of each leaderboard. My data velocity is modest (1K game events/sec). I found three different databases that I could use; has anybody used any of them for game leaderboards? If so, can you share the advantages or pains that you encountered while doing so? According to all three companies, they can handle real-time data.
You would have to evaluate the scale and performance that you require, and it is difficult for me to estimate those based on the data you provided, but I can do a feature comparison of some of these systems.
The first option is to run your leaderboards by querying DynamoDB itself, so you do not need any additional systems. The obvious advantage is that there is one less component for you to manage. But I am assuming that your leaderboards need complex logic to render, and because the DynamoDB API deals with keys and values, you would have to fetch a lot of data from DynamoDB for every query that renders the leaderboard.
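For illustration, here is a minimal sketch of what that first option might look like with boto3; the table name "player_moves", its key schema, and the attribute names are assumptions.

```python
# Render a leaderboard straight from DynamoDB: pull all rows, aggregate in the app.
import heapq
from collections import defaultdict

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("player_moves")  # placeholder table, partition key game_id

def top_players(game_id, n=10):
    # With only key/value access there is no server-side aggregation,
    # so every matching row comes back to the application.
    scores = defaultdict(int)
    kwargs = {"KeyConditionExpression": Key("game_id").eq(game_id)}
    while True:
        page = table.query(**kwargs)
        for item in page["Items"]:
            scores[item["player_id"]] += int(item["points"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    # Rank in application code.
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])
```

All of the ranking happens client-side, which is exactly the extra data transfer described above.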
The second option you specified is Elastic Search. It's a great system that returns query results really fast because it stores data as an inverted index. However, you won't be able to do JOINs between your DynamoDB data and the Kafka stream. But you certainly can run a lot of concurrent queries on Elastic, and I am assuming that you need concurrent queries because you are powering an online game where multiple players are accessing the leaderboard at the same time.
The third option, Druid, is a hybrid between a data lake and a data warehouse. You can store large volumes of semi-structured data, but unlike Elastic, you would need to flatten the nested JSON data at ingest time. I have used Druid for large-scale analytical processing to power my dashboards, and it does not support as high a concurrency as Elastic.
Rockset seems to be a much newer product and is a hosted service in the cloud. It says that it builds an inverted index like Elastic and also supports JOINs. It can auto-tail data from DynamoDB (using change streams) and Kafka. I do not see any performance numbers on the website, but the functionality is very compatible with what I would need for building a game leaderboard.
Currently we use Oracle for storing images in the application, but we expect to see a lot of images/videos in the application. We would like to move away from Oracle to be able to shard easily and achieve high throughput. Any recommendations?
Did anyone try using NoSQL databases such as Couchbase/MongoDB for this purpose? Are they optimized for this purpose?
I see that Cloudinary uses Amazon S3 for this purpose. But I am looking for something which can be deployed in our data center because of privacy concerns.
From your problem description, I can't see any indication pro or contra a NoSQL database.
Having media like pictures, sound, or video in a database just means having a large uninterpreted binary object. Uninterpreted means: the database can store and deliver the binary, but can't analyze it for its properties, take it as a basis for queries, and the like (which is what databases are made for).
Both relational and non-relational databases provide data types for that kind of BLOB. The features in which they differ are, for example,
- tabular vs. tree-structured data - not applicable to the BLOB, as it will be one attribute no matter how large it becomes,
- different sorts of transaction logic (CAP theorem), which aren't addressed by the BLOB subject matter.
So I'm afraid your architecture will need to be decided on a much broader basis than just your media data. What are your data structures? What are your query and update scenarios?
What I see people do with Couchbase is store all of the metadata about the image in a JSON document in Couchbase, but host the image itself in something optimized for files. You get the benefits of both worlds. In the kind of use case you mention, in my experience a NoSQL database will be much better than a relational database.
Having managed very large relational and NoSQL databases with blobs in them, IMO it is a terrible idea in most cases, regardless of the database type. So I wrote up this blog post for just such a situation.
As you are looking for private deployment in your data center, you may consider MongoDB or OpenStack Swift.
I have seen people using MongoDB gridfs (https://docs.mongodb.com/manual/core/gridfs/) for storing images/videos.
The advantages of using MongoDB gridfs:
You can use MongoDB replica set for fault tolerance/high availability.
You can access a portion of a large file without loading the whole file into memory; because MongoDB stores files in small chunks (255 KB by default), video files can be streamed faster (see the sketch after this list).
You can scale using MongoDB sharding.
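A minimal GridFS sketch with pymongo, assuming a local MongoDB and made-up database/file names:

```python
# Store and stream a video through GridFS.
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["media"]          # placeholder database name
fs = gridfs.GridFS(db)

# Store a video; GridFS splits it into 255 KB chunks behind the scenes.
with open("intro.mp4", "rb") as f:
    file_id = fs.put(f, filename="intro.mp4")

# Stream it back chunk by chunk instead of loading it all into memory.
grid_out = fs.get(file_id)
with open("intro_copy.mp4", "wb") as out:
    for chunk in grid_out:
        out.write(chunk)
```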
Openstack Swift is a highly available, distributed, eventually consistent object/blob store comparable to Amazon S3, which you can deploy in your data center.
OpenStack Swift is also used by many companies; Rackspace's Cloud Files runs on Swift. You may want to give Swift a look:
http://docs.openstack.org/developer/swift/
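For comparison, here is a minimal upload/download sketch with python-swiftclient. The auth URL, credentials, and container name are placeholders, and real deployments typically use Keystone rather than the v1 auth shown here.

```python
# Put and get an object in a private Swift cluster.
from swiftclient.client import Connection

conn = Connection(
    authurl="http://swift.example.local:8080/auth/v1.0",  # placeholder endpoint
    user="images:uploader",
    key="secret",
)

conn.put_container("images")

# Upload a blob together with its content type.
with open("photo.jpg", "rb") as f:
    conn.put_object("images", "photo.jpg", contents=f, content_type="image/jpeg")

# Download it again; get_object returns (headers, body_bytes).
headers, body = conn.get_object("images", "photo.jpg")
print(headers.get("content-type"), len(body))
```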
S3 has a very strong commitment to privacy; what are your concerns regarding S3? Also, which data center are you planning to move the storage to once you leave Oracle?
I look after a system which uploads flat files generated by ABAP. We have a large file (500,000 records) generated from the HR module in SAP every day, containing a record for every person for the next year. A person gets a record for a given day if they are rostered on for that day or have planned leave on that day.
This job takes over 8 hours to run and it is starting to get time critical. I am not an ABAP programmer but I was concerned when discussing this with the programmers as they kept on mentioning 'loops'.
Looking at the source, it's just a bunch of single-row selects inside nested loop after nested loop. Not only that, it has loads of SELECT *.
I suggested to the programmers that they use SQL more heavily, but they insist the SAP-approved way is to use loops instead of SQL and the provided SAP functions (e.g. to look up the work schedule rule), and that using SQL would be slower.
Being a database programmer I never use loops (cursors) because they are far slower than SQL, and cursors are usually a giveaway that a procedural programmer has been let loose on the database.
I just can't believe that changing an existing program to use SQL more heavily than loops will slow it down. Does anyone have any insight? I can provide more info if needed.
Looking at google, I'm guessing I'll get people from both sides saying it is better.
I've read the question and I stopped when I read this:
Looking at the source, it's just a bunch of single-row selects inside nested loop after nested loop. Not only that, it has loads of SELECT *.
Without knowing more about the issue, this looks like overkill, because every loop iteration executes a call to the database. Maybe it was done this way because the selected dataset is too big; however, it is possible to load chunks of data, process them, and repeat, or you can make one big JOIN and operate over that data. This is a little tricky, but trust me, it does the job.
In SAP you have to use these kinds of techniques when such situations arise. Nothing is more efficient than handling datasets in memory. For this I can recommend the use of sorted and/or hashed internal tables and BINARY SEARCH.
On the other hand, using a JOIN does not necessarily improve performance; it depends on the knowledge and use of the indexes and foreign keys in the tables. For example, if you join a table just to get a description, I think it is better to load that data into an internal table and get the description from it with a BINARY SEARCH.
I can't tell you the exact formula; it depends on the case. Most of the time you have to tweak the code, debug and test, and make use of transactions ST05 and SE30 to check performance, then repeat the process. Experience with these issues in SAP gives you a clear sense of these patterns.
My best advice is to make a copy of that program and improve it according to your experience. The code you describe can definitely be improved. What can you lose?
Hope it helps
Sounds like the import as it stands is looping over single records and importing them into a DB one at a time. It's highly likely that there's a lot of redundancy there. It's a pattern I've seen many times and the general solution we've adopted is to import data in batches...
A SQL Server stored procedure can accept 'table' typed parameters, which on the client/C# side of the database connection are simple lists of some data structure corresponding to the table structure.
A stored procedure can then receive and process multiple rows of your csv file in one call, therefore any joins you need to do are being done on sets of input data which is how relational databases are designed to be used. This is especially beneficial if you're joining out to commonly used data or have lots of foreign keys (which are essentially invoking a join in order to validate the keys you're trying to insert).
We've found that the SQL Server CPU and IO load for a given amount of import data is much reduced by using this approach. It does however require consultation with DBAs and some tuning of indexes to get it to work well.
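This is not the table-valued-parameter call itself, but a sketch of the same batching idea from Python with pyodbc, under the assumption of a staging table plus a set-based stored procedure; the DSN, table, and procedure names are placeholders.

```python
# Batch the flat-file rows into SQL Server in a few round trips instead of
# one statement per record, then let a set-based procedure do the joins.
import csv
import pyodbc

conn = pyodbc.connect("DSN=hr_dw;UID=loader;PWD=secret")  # placeholder DSN
cur = conn.cursor()
cur.fast_executemany = True  # send parameter batches instead of row-by-row calls

with open("roster_export.csv", newline="") as f:
    rows = [(r["person_id"], r["date"], r["status"]) for r in csv.DictReader(f)]

# One batched insert instead of 500,000 single-row statements.
cur.executemany(
    "INSERT INTO staging_roster (person_id, roster_date, status) VALUES (?, ?, ?)",
    rows,
)

# A single set-based procedure call then validates/merges the whole batch at once.
cur.execute("{CALL dbo.MergeRoster}")
conn.commit()
```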
You are correct.
Without knowing the code, in most cases it is much faster to use views or joins instead of nested loops (there are exceptions, but they are very rare).
You can define views in SE11 or SE80, and they usually heavily reduce the communication overhead between the ABAP server and the database server.
Often there are readily defined views from SAP for common cases.
edit:
You can check where your performance goes to: http://scn.sap.com/community/abap/testing-and-troubleshooting/blog/2007/11/13/the-abap-runtime-trace-se30--quick-and-easy
Badly written parts that are rarely executed don't matter.
With the statistics you know where it hurts and where your optimization effort pays off.
We have 1 million datasets and each dataset is around 180 MB, so the total size of our data is around 180 TB. Each dataset is a plain DEL file with only three columns. The first two columns form the row key and the last one is the value of the row. For example, the first column is A, the second is B, and the third is C. The value of A is the dataset number, so A is fixed within one dataset and its range is from 1 to 1 million. B is the position number and can range from 1 to 3 million.
What we are planning to do is: given any set of non-overlapping ranges of B, like 1-1000, 10000-13000, 16030-17000, ..., calculate the sum of the values of each dataset over all these ranges and return the top 200 dataset numbers (A) within seconds (a minimal sketch of this computation follows below).
Does anyone with big-data expertise have any idea how many servers we will need to handle this case? My boss believes 10 servers (16 cores each) can do it with a budget of $50,000. Do you think that's feasible?
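For concreteness, here is a minimal single-machine sketch of the per-query computation, assuming each dataset's values can be held as an array indexed by position B; file parsing and distribution across servers are left out, and all names are made up.

```python
# Range sums via prefix sums, then top-k datasets by total.
import heapq
from itertools import accumulate

def range_sums(values, ranges):
    """Sum `values` over non-overlapping inclusive [start, end] ranges of B,
    using a prefix-sum array so each range costs O(1) after O(len) setup."""
    prefix = [0] + list(accumulate(values))          # prefix[i] = sum of values[:i]
    return sum(prefix[end] - prefix[start - 1] for start, end in ranges)

def top_datasets(datasets, ranges, k=200):
    """datasets: iterable of (dataset_id A, values indexed by B starting at 1)."""
    scored = ((range_sums(values, ranges), a) for a, values in datasets)
    return [a for total, a in heapq.nlargest(k, scored)]

# Tiny example: two datasets and the ranges 1-3 and 6-7.
demo = [(1, [5, 1, 1, 0, 0, 2, 2]), (2, [1, 1, 1, 9, 9, 0, 0])]
print(top_datasets(demo, [(1, 3), (6, 7)], k=2))     # -> [1, 2]
```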
I think that services such as Microsoft Azure can be your friend in this case. I think your budget will go further using pay-as-you-go services, and you can decide how many servers/instances you would like to use to crunch your data.
I do think that one slight problem might be the way your data is currently formatted. I would certainly look at using Azure Table Storage and first work on getting your data into a service such as that. Once that is complete, you have a more queryable and reliable data store. From there you can use your language of choice to interact with that data. Using Table Storage, you can create partition keys.
Once you have the partitions you would like to use, you can then create a service to which you supply a partition, or more likely a partition range, which it will process. You will be able to adjust the size of your instances and the hardware that drives them; with something like this in place you can then determine an average of how long it takes one instance to process x records. Perhaps you could write some logs recording the performance.
Once you have those logs, it will be simple to determine with reasonable accuracy how long the process would take. You can then start adding more instances to your service, working through the data at a faster pace.
Table storage was also designed to work with big datasets, so going through the documentation on this, you will find many key features that you could use.
There are honestly many ways in which this problem could be solved; this is simply one option that I have used in the past, and it worked for me at the time.
If this is a viable option for you, I would make sure to place your data and services in the same data centre. While I assume you have some form of sequence in your files, you could also persist placeholders storing your sum values for future use; should your data grow, you could simply add the new data and run your services again to update the system.
I would not go on this journey without making sure that you can persist your sum values in some way or other; otherwise, should you need the values again in the future, you will have to start from the beginning again.
I managed to find one quick write up about the services mentioned above working with big data. Perhaps it might be able to help you further. http://www.troyhunt.com/2013/12/working-with-154-million-records-on.html
I am looking for a database/mechanism to store data that I can write and read with high performance.
This storage will be used for logging important information across multiple systems. Since this is critical data being logged, read performance should be fast, as the data will be used to show history. Since we never update or delete the data, or do any kind of joins, I am looking for the right solution. We will probably archive the data eventually, but that is okay to deal with.
I tried looking at different sources to understand the different NoSQL databases, but experts' opinions are always better :)
Must Have:
1. Fast Read without fail
2. Fast Write without fail
3. Random access Performance
4. Replication or a similar feature: if one node goes down, another should immediately be up and working
5. Concurrent write/read data
Good to Have:
1. Searching content, e.g. analysing the data for auditing, with or without indexes
Not required:
1. Transactions are not required at all
2. Update never happens
3. Delete never happens
4. Joins are not required
Referred: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Disclosure: Kevin Porter has been a Senior Software Engineer at Aerospike, Inc. since May 2013. (ref)
Be sure to consider Aerospike; Aerospike dominates in the adtech space, where high-throughput reads and writes are a requirement. Aerospike is frequently touted as having "the speed of Redis with the scalability of Cassandra." For searching/querying, see Aerospike's secondary index documentation (a small client sketch follows after the links below).
For more information see the discussion/articles below:
Aerospike vs Cassandra
Aerospike vs Redis and Mongo
Aerospike Benchmarks
Lastly verify the performance for yourself with the One million TPS on EC2 Instructions.
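For a feel of the API, here is a minimal Aerospike Python client sketch for the append-only logging case described in the question; the host, namespace, and set names are placeholders.

```python
# Write one record per log event and read it back.
import time
import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}   # placeholder cluster address
client = aerospike.client(config).connect()

# Key is (namespace, set, primary key); here keyed by source and timestamp.
key = ("logs", "events", f"svc-42:{int(time.time() * 1000)}")
client.put(key, {"level": "INFO", "system": "svc-42", "msg": "user login ok"})

# get returns (key, metadata, bins).
_, meta, bins = client.get(key)
print(meta, bins)

client.close()
```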
Let me be the Cassandra sponsor.
Disclaimer: I won't say Cassandra is better than the others, because I don't know Mongo/Redis/whatever deeply enough, and I don't even want to get into that kind of debate.
The reason why I suggest Cassandra is that your needs match perfectly with what Cassandra offers, and your "not required" list is a set of features that are either not supported in Cassandra (joins, for instance) or considered anti-patterns (deletes and, in some situations, updates).
From your "Must Have" list, point by point
Fast Read without fail: Supported. You can choose the consistency level of each read operation, deciding how important it is to retrieve the freshest information versus how important speed is.
Fast Write without fail: Same as point 1
Random access Performance: When coming into the Cassandra world you have to consider many parameters to get good random-access performance, but the most important one that comes to my mind is the data model: if you create a data model that scales horizontally (have a look here) and you avoid hotspots, you get what you need. If you model your DB well, you should get O(1) for each operation, since the data is structured to be queried (a minimal data-model sketch follows after this list).
Replication: Here Cassandra is even better than you might think. If one node goes down, nothing changes for the cluster and everything(*) keeps working perfectly. Cassandra has no single point of failure. I can tell you that with older Cassandra versions I've had an uptime of more than 3 years.
Concurrent write/read data: Cassandra uses the LWW (last-write-wins) policy to handle concurrent writes on the same key. The system supports many concurrent readers and writers, and with newer protocols also asynchronous operations.
There are lots of other interesting features Cassandra offers: linear horizontal scaling is the one I appreciate most, but there is also the fact that you can know the instant at which every piece of data was last updated (the LWW timestamp), counter features, and so on.
(*) - if you don't use Consistency Level All which, imho, should NEVER be used in such a system.
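To illustrate the data-model point above, here is a minimal sketch with the DataStax Python driver. The keyspace, table, and daily bucketing scheme are assumptions for illustration, not a recommendation for your exact workload.

```python
# A hotspot-avoiding append-only log model: partition by (system, day).
import datetime
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS logs.events (
        system   text,
        day      text,          -- daily bucket keeps partitions bounded
        ts       timeuuid,
        message  text,
        PRIMARY KEY ((system, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Writes and reads both hit a single partition (system + day), so they stay fast
# while the cluster spreads different systems/days across nodes.
day = datetime.date.today().isoformat()
session.execute(
    "INSERT INTO logs.events (system, day, ts, message) VALUES (%s, %s, %s, %s)",
    ("auth-service", day, uuid.uuid1(), "user login ok"),
)
rows = session.execute(
    "SELECT ts, message FROM logs.events WHERE system=%s AND day=%s LIMIT 100",
    ("auth-service", day),
)
for row in rows:
    print(row.ts, row.message)
```

The composite partition key (system, day) is what keeps any single partition from growing without bound and avoids hotspotting a single node with all writes.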
Here are a few more links on how you can span in-memory and disk (DRAM, SSD, and disk storage) with Aerospike:
http://www.aerospike.com/hybrid-memory/
http://www.aerospike.com/docs/architecture/storage.html
I think everyone is right in terms of matching the specific DB to your specific use case. For instance, Aerospike is optimal for key-value data. Other options might be better.
By way of analogy, I'll always remember how, decades ago, a sister of mine once borrowed my computer and wrote her term paper in Microsoft Excel. Line after line was a different row of a spreadsheet. It looked ugly as heck, but, uh, okay. She got the task done. She cursed and swore at how difficult it was to edit the thing. No kidding!
Choosing the right NoSQL database for the right task will either make your job a breeze, or could cause you to curse a blue streak if you decided on the wrong basic tool for the task at hand.
Of course, every vendor's going to defend their product. I think it's best the community answer the question. Here's another Stack Overflow thread answering a similar question:
Has anyone worked with Aerospike? How does it compare to MongoDB?
btw: Do you have any more specific insights for us on what type of problem you are trying to solve?