Greenplum vs. mysql performance comparison - greenplum

I tested the data analysis performance of Greenplum and mysql under the same hardware conditions. Under the condition of less than 10 million data, there was no obvious difference between them. Greenplum had 1 master on 1 node and 4 segments distributed on 2 nodes, which was different from what I expected,Is that normal?

Related

Indexing multiple indexes in elastic search at the same time

I am using logstash for ETL purpose and have 3 indexes in the Elastic search .Can I insert documents into my 3 indexes through 3 different logtash processes at the same time to improve the parallelization or should I insert documents into 1 index at a time.
My elastic search cluster configuration looks like:
3 data nodes
1 client node
3 data nodes - 64 GB RAM, SSD Disk
1 client node - 8 GB RAM
Shards - 20 Shards
Replica - 1
Thanks
As always it depends. The distribution concept of Elasticsearch is based on shards. Since the shards of an index live on different nodes, you are automatically spreading the load.
However, if Logstash is your bottleneck, you might gain performance from running multiple processes. Though if running multiple LS process on a single machine will make a positive impact is doubtful.
Short answer: Parallelising over 3 indexes won't make much sense, but if Logstash is your bottleneck, it might make sense to run those in parallel (on different machines).
PS: The biggest performance improvement generally is batching requests together, but Logstash does that by default.

Hbase cluster running Phoenix is slow using jdbc and python phoenixdb

I have a cluster setup running HBase and a phoenix queryserver. Currently my cluster contains a master node and 3 slaves. The table I am connecting to consists of 124 columns and a total of 16 million rows. A simple COUNT(*) or DISTINCT "value" query takes around 1-2 minutes, which as far as I understood shouldn't be the case - How fast is Phoenix? Why is it so fast?
In the documentation linked above a full table scan of 100 million rows should take around 20 seconds. And since my table size is significantly smaller I don't understand why my queries take that long. What could I do to optimize my queries? I plan on reconstructing my table using column families (which I know improves performance, but I was wondering if there are other ways to get quick performance boosts, as reconstructing my current table will be a quite huge task.
I am using Phoenix 4.9 and HBase 1.2.

ElasticSearch replica and nodes

I am testing now clustering with ElasticSearch and have question about the replicas between the nodes.
As you can see in the screenshot from Head I have 2 indexes.
movies has 5 shards and 2 replica
students has 5 shards and 1 replica
Which one is better and which one is faster with 3 active nodes and why?
Costs of having more number of replicas would be
more storage space required(Obviously)
less indexing performance
while the advantage from it would be
better search performance
better resiliency
Note that even though you have 2 replicas, it does not mean that your cluster can endure 2 nodes going down since all indexing request would fail if only one out of 3 copies of shards is available.(because of indexing quorum)
For detailed explanation please refer to this official document
"Better" is subjective.
With two replicas, you can handle two of the three machines in your cluster going down, though at the price of writing all the data to every machine. Read performance should also be higher as the cluster has more nodes from which to request the data.
With one replica, you can only survive the outage of one machine in your cluster, but you'll get a performance boost by writing 2 copies of the data across 3 servers (less IO on each server).
So it comes down to risk and performance. Hope that helps.

Hadoop machine configuration

I want to analyze 7TB of data and store the output in a database, say HBase.
My monthly increment is 500GB, but to analyze 500GB data I don't need to go through 7TB of data again.
Currently I am thinking of using Hadoop with Hive for analyzing the data, and
Hadoop with MapReducer and HBase to process and store the data.
At the moment I have 5 machines of following configuration:
Data Node Server Configuration: 2-2.5 Ghz hexa core CPU, 48 GB RAM, 1 TB -7200 RPM (X 8)
Number of data nodes: 5
Name Node Server: Enterprise class server configuration (X 2) (1 additional for secondary
I want to know if the above process is sufficient given the requirements, and if anyone has any suggestions.
Sizing
There is a formula given by Hortonworks to calculate your sizing
((Initial Size + YOY Growth + Intermediate Data Size) * Repl Cpount * 1.2) /Comp Ratio
Assuming default vars
repl_count == 3 (default)
comp_ration = 3-4 (default)
Intermediate data size = 30%-50% of raw data size .-
1,2 factor - temp space
So for your first year, you will need 16.9 TB. You have 8TB*5 == 40. So space is not the topic.
Performance
5 Datanodes. Reading 1 TB takes in average 2.5 hours (source Hadoop - The definitive guide) on a single drive. 600 GB with one drive would be 1.5 hours. Estimating that you have replicated so that you can use all 5 nodes in parallel, it means reading the whole data with 5 nodes can get up to 18 minutes.
You may have to add some more time time depending on what you do with your queries and how have configured your data processing.
Memory consumution
48 GB is not much. The default RAM for many data nodes is starting from 128 GB. If you use the cluster only for processing, it might work out. Depending also a bit, how you configure the cluster and which technologies you use for processing. If you have concurrent access, it is likely that you might run into heap errors.
To sum it up:
It depends much what you want to do with you cluster and how complex your queries are. Also keep in mind that concurrent access could create problems.
If 18 minutes processing time for 600 GB data (as a baseline - real values depend on much factors unknown answering that questions) is enough and you do not have concurrent access, go for it.
I would recommend transforming the data on arrival. Hive can give tremendous speed boost by switching to a columnar compressed format, like ORC or Parquet. We're talking about potential x30-x40 times improvements in queries performance. With latest Hive you can leverage streaming data ingest on ORC files.
You can leave things as you planned (HBase + Hive) and just rely on brute force 5 x (6 Core, 48GB, 7200 RPM) but you don't have to. A bit of work can get you into interactive ad-hoc query time territory, which will open up data analysis.

Does HBase use the compute capacity of all nodes in the cluster for query execution?

We are having a setup of 1 master and 2 slave nodes. The data is setup in postgres and in hbase and its a similar dataset (same number of rows) - 65 million rows. Yet, we dont find a measurable increase in performance from HBase for the same query.
My first thought is - does HBase use the compute capacity of all nodes to fork the query out? Perhaps this is why the performance is not measurably better.
Any other reasons for why the performance between Postgres and HBase would be about the same? Any specific configuration items to look for?
EDIT : Something I found while researching this : http://www.flurry.com/2012/06/12/137492485#.VaQP_5QpBpg
This is kind of a yes and no answer. Depending on what you are doing for your 'query' and your region distribution, you may or may not be using all the nodes. For example, if you are running a scan across the table it will run against each region (assuming more then one) in sequence. However if you are using a multi-get for keys that are in different regions, this will run in parallel.
The real benefit is going to come as the number of regions increase and you start parallelizing requests (multiple clients). Regions will be distributed across region servers by the Master as regions are split.

Resources