Fastest nosql option for number crunching? - ruby

I had always thought that Mongo had excellent performance with it's mapreduce functionality, but am now reading that it is a slow implementation of it. So if I had to pick an alternative to benchmark against, what should it be?
My software will be such that users will often have millions of records, and often be sorting and crunching through unpredictable subsets that are 10s or 100s of thousands. Most of the analysis of data that uses the full millions of records can be done in summary tables and the like. I'd originally thought Hypertable was a viable alternative, but in doing research I saw in their documents their mention that Mongo would be a more performant option, while Hypertable had other benefits. But for my application speed is my number one initial priority.

First of all, it's important to decide on what is "fast enough". Undoubtedly there are faster solutions than MongoDB's map/reduce but in most cases you may be looking at significantly higher development cost.
That said MongoDB's map/reduce runs, at time of writing, on a single thread which means it will not utilize all the cpu available to it. Also, MongoDB has very little in the way of native aggregation functionality. This will change fixed with version 2.1 onwards that should improve performance though (see https://jira.mongodb.org/browse/SERVER-447 and http://www.slideshare.net/cwestin63/mongodb-aggregation-mongosf-may-2011).
Now, what MongoDB is good at is scaling up easily, especially when it comes to reads. And this is important because the best solution for number crunching on large datasets is definitely a map/reduce cloud like Augusto suggested. Let such an m/r do the number crunching while MongoDB makes the required data available at high speeds. Database query throughput too low is easily solved by adding more mongo shards. Number crunching/aggregation performance too slow is solved by adding more m/r boxes. Basically performance becomes a function of number of instances you reserve for the problem, and thus cost.

Related

Is my application running efficiently?

The question is generic and can be extended to other frameworks or contexts beyond Spark & Machine Learning algorithms.
Regardless of the details, from a high-level point-of-view, the code is applied on a large dataset of labeled text documents. It passes by 9 iterations of cross-validation to tune some parameters of a Logistic Regression multi-class classifier.
It is expected that this kind of Machine Learning processing will be expensive in term of time and resources.
I am running now the code and everything seems to be OK, except that I have no idea if my application is running efficiently or not.
I couldn't find guidelines saying that for a certain type and amount of data, and for certain type of processing and computing resources the processing time should be in the approximate order of...
Is there any method that help in judging if my application is running slow or fast, or it is purely a matter of experience?
I had the same question and I didn't find a real answer/tool/way to test how good my performances were just looking "only inside" my application.
I mean, as far as I know, there's no tool like a speedtest or something like for the internet connection :-)
The only way I found is to re-write my app (if possible) with another stack in order to see if the difference (in terms of time) is THAT big.
Otherwise, I found very interesting 2 main resources, even if quite old:
1) A sort of 4 point guide to remember when coding:
Understanding the Performance of Spark Applications, SPark Summit 2013
2) A 2-episode article from Cloudera blog to tune at best your jobs:
episode1
episode2
Hoping it could help
FF
Your question is pretty generic, so I would also highlight few generic areas where you can look out for performance optimizations: -
Scheduling Delays - Are there significant scheduling delays in scheduling the tasks? if yes then you can analyze the reasons (may be your cluster needs more resources etc).
Utilization of Cluster - are your jobs utilizing the available cluster resources (like CPU, mem)? In case not then again look out for the reasons. May be creating more partitions helps in faster execution. May be there is significant time taken in serialization, so can you switch to Kyro Serialization.
JVM Tuning - Consider analyzing GC logs and tune if you find anomalies.
Executor Configurations - Analyze the memory/ cores provided to your executors. It should be sufficient to hold the data processed by the task/job.
your DAG and
Driver Configuration - Same as executors, Driver should also have enough memory to hold the results of certain functions like collect().
Shuffling - See how much time is spend in Shuffling and kind of Data Locality used by your task.
All the above are needed for the preliminary investigations and in some cases it can also increase the performance of your jobs to an extent but there could be complex issues for which the solution will depend upon case to case basis.
Please also see Spark Tuning Guide

Rethinkdb Scalability

How scalable is rethinkdb? Can it be used for TB's of data?
I have around 400 GB's of data, which are bound to increase by 10-25 GB's per week. Any suggestions, would be of great help.
Rethinkdb is very scalable and you should be able to use it with TBs of data no problem, that's what its designed for. As for its stability it should be very stable as of version 2.0 and fully production.
http://www.rethinkdb.com/stability/
Note that a NoSQL-Database does increase the amount of storage required massively, so be sure to have enough disk space on your nodes.
But some terabytes are no issue for RethinkDB, it's designed to handle a nearly infinite amount of data if your nodes have a lot of RAM it also provides nearly Memory-alike performance because the caching algorithms are very good.
You can easily spread out your data on many nodes, by sharding, which also increases the absolute throughput of your cluster, but slightly increases the latency of queries, because of the round trips between the cluster-members.

Cassandra + Solr/Hadoop/Spark - Choosing the right tools

I'm currently investigating how to store and analyze enriched time based data with up to 1000 columns per line. At the moment Cassandra together with either Solr, Hadoop or Spark offered by Datastax Enterprise seem to fulfill my requirements on the rough. But the devil is in the detail.
Out of the 1000 columns about 60 are used for real-time-like queries (web-frontend, user sends form and expect quick response). These queries are more or less GROUPBY statements where the number or occurrences are counted.
As Cassandra itself does not provide the required analytical capabilities (no GROUPBY), I'm left these alternatives:
Roughly query via Cassandra and filter the resultset within self-written code
Index the data with Solr and run facet.pivot queries
Use either Hadoop or Spark and run the queries
The first approach seems cumbersome and prone to errors… Solr does have some anayltic features but without multifield grouping I'm stuck with pivots. I don't know whether this is a good or performant approach though… Last but not least there are Hadoop and Spark, the prior known not to be the best for real-time queries, the later pretty new and maybe not production ready.
So which way to go? There is no one-fits-all here, but before I go one way through I'd like to get some feedback. Maybe I'm thinking to complex or my expectations are too high :S
Thanks in advance,
Arman
In a place I work now we have a similar set of tech requirements and a solution is Cassandra-Solr-Spark, exactly in that order.
So if a query can be "covered" by Cassandra indices - good, if not - it's covered by Solr. For testing & less often queries - Spark (Scala, no SparkSQL due to old version of it -- it's a bank, everything should be tested and matured, from cognac to software, argh).
Generally I agree with the solution, though sometimes I have a feeling that some client's requests should NOT be taken seriously at all, saving us from loads of weird queries :)
I would recommend Spark, if you take a loot at the list of companies using it you'll such names as Amazon, eBay and Yahoo!. Also, as you noted in the comment, it's becoming a mature tool.
You've given arguments against Cassandra and Solr already, so I'll focus on explaining why Hadoop MapReduce wouldn't do as well as Spark for real-time queries.
Hadoop and MapReduce were designed to leverage hard disk under the assumption that for big data IO is negligible. As a result data are read and wrote at least twice - in map stage and in reduce stage. This allows you to recover from failures as partial result are secured but it that's not want you want when aiming for real-time queries.
Spark not only aims to fix MapReduce shortcomings, it also focuses on interactive data analysis, which is exactly what you want. This goal is achieved mainly by utilizing RAM and the results are astonishing. Spark jobs will often be 10-100 times faster than MapReduce equivalents.
The only caveat is the amount of memory you have. Most probably your data is probably going to feat in the RAM you can provide or you can rely on sampling. Usually when interactively working with data there is no real need to use MapReduce and it seems to be so in your case.

Riak performance - unexpected results

In the last days I played a bit with riak. The initial setup was easier then I thought. Now I have a 3 node cluster, all nodes running on the same vm for the sake of testing.
I admit, the hardware settings of my virtual machine are very much downgraded (1 CPU, 512 MB RAM) but still I am a quite surprised by the slow performance of riak.
Map Reduce
Playing a bit with map reduce I had around 2000 objects in one bucket, each about 1k - 2k in size as json. I used this map function:
function(value, keyData, arg) {
var data = Riak.mapValuesJson(value)[0];
if (data.displayname.indexOf("max") !== -1) return [data];
return [];
}
And it took over 2 seconds just for performing the http request returning its result, not counting the time it took in my client code to deserialze the results from json. Removing 2 of 3 nodes seemed to slightly improve the performance to just below 2 seconds, but this still seems really slow to me.
Is this to be expected? The objects were not that large in bytesize and 2000 objects in one bucket isnt that much, either.
Insert
Batch inserting of around 60.000 objects in the same size as above took rather long and actually didnt really work.
My script which inserted the objects in riak died at around 40.000 or so and said it couldnt connect to the riak node anymore. In the riak logs I found an error message which indicated that the node ran out of memory and died.
Question
This is really my first shot at riak, so there is definately the chance that I screwed something up.
Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I used for interacting with riak is the limiting factor here?
Running all nodes on the same physical machine is rather stupid, but if this is a problem - how can i better test the performance of riak?
Is map reduce really that slow? I read about the performance hit that map reduce has on the riak mailing list, but if Map Reduce is slow, how are you supposed to perform "queries" for data needed nearly in realtime? I know that riak is not as fast as redis.
It would really help me a lot if anyone with more experience in riak could help me out with some of these questions.
This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.
Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large you bucket is, enumerating the keys in a single bucket of size n will take m time, where m is the total number of keys in all buckets.
These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large amounts of keys. Cassandra also implements true buckets.
A recommendation I'd have now that some time has passed and several new versions of Riak have come about is this. Never rely on full bucket map/reduce, that's not an optimized operation, and chances are very good there are other ways to optimize your map/reduce so you don't have to look through so much data to pull out the singlets you need.
Secondary indices now available in newer versions of Riak are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce a secondary index with hundreds of thousands of keys in microseconds which a full bucket scan would have taken multiple seconds to consider.
I don't have direct experience of Riak, but have worked with Cassandra a little, which is similar.
Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.
Secondly, MapReduce is designed for batch processing, not realtime queries.
Riak and all similar Key-Value stores are designed for high write performance, high read performance for simple lookups, no complex querying at all.
Just for comparison, Cassandra on a single node (6 core, 6GB) can do 20,000 individual inserts per second.

Is excessive use of lucene good?

In my project, entire searching and listing of content is depend on Lucene. I am not facing any performance issues. Still, the project is in development phase and long way to go in production.
I have to find out the performance issues before the project completed in large structure.
Whether the excessive use of lucene is feasible or not?
As an example, I have about 3 GB of text in a Lucene index, and it functions very quickly (milliseconds response times on searches, filters, and sorts). This index contains about 300,000 documents.
Hope that gave some context to your concerns. This is in a production environment.
Lucene is very mature and has very good performance for what it was designed to do. However, it is not an RDBMS. The amount of fine-tuning you can do to improve performance is more limited than a database engine.
You shouldn't rely only on lucene if:
You need frequent updates
You need to do joined queries
You need sophisticated backup solutions
I would say that if your project is large enough to hire a DBA, you should use one...
Performance wise, I am seeing acceptable performance on a 400GB index across 10 servers (a single (4GB, 2CPU) server can handle 40GB of lucene index, but no more. YMMV).
By excessive, do you mean extensive/exclusive?
Lucene's performance is generally very good. I recently ran some performance tests for Lucene on my Desktop with QuadCore # 2.4 GHz 2.39 GHz
I ran various search queries against a disk index composed of 10MM documents, and the slowest query (MatchAllDocs) returned results within 1500 ms. Search queries with two or more search terms would return around 100 ms.
There are tons of performance tweaks you can do for Lucene, and they can significantly increase your search speed.
What would you define as excessive?
If your application has a solid design, and the performance is good, I wouldn't worry too much about it.
Perhaps you could get a data dump to test the performance in a live scenario.
We use lucence to enable type-ahead searching. This means for every letter typed, it hits the lucence index to get the results. Multiple that to tens of textboxes on multiple interfaces and again tens of employees typing, with no complaints and extremely fast response times. (Actually it works faster than any other type-ahead solution we tried).

Resources