How to decide the number of segments per host/node in Greenplum

I am setting up a production cluster and want to choose the number of segments per host/node. How do I decide this? And what are the benefits of having multiple primary segments (excluding mirrors) per node?

Take a look at this guide, specifically page 22:
https://gpdb.docs.pivotal.io/43100/pdf/Pivotal_Clustering_Concepts_A03.pdf
In short: each primary segment is an independent PostgreSQL instance that processes its slice of every query in parallel, so multiple primaries per host let a single host contribute more parallelism. The practical ceiling is the host's CPU cores, RAM, and disk bandwidth, which is what the sizing guidance in that document works through. Hope this helps.

Related

Examples of 10 million plus Elastic clusters?

Unless I'm missing an obvious list that's provided somewhere, there doesn't seem to be a list that gives examples of large-ish Elastic clusters.
To answer this question, I'd appreciate it if you could list a solution you know of and some brief details about it. Note that no organisational details need be shared unless they are already public.
Core info:
Number of nodes (machines)
GB of index size
GB of source size
Number of documents / items (millions)
When the system was built (year)
Any of the following information would be appreciated as well:
Node layout / GB of memory in each node; number of master nodes (generally smaller), number and layout of data nodes
Ingest and/or query performance (docs per second, queries per second)
Types of CPU: number of cores, year of manufacture, or actual CPU specifics
Any other relevant information or suggestions of types of additional info to request
And as always, many thanks for this; I hope it helps all of us!
24 m4.2xlarge instances for data nodes,
separate master nodes and a monitoring cluster,
multiple indices (~30 per day), 1-2 TB of data per day,
700-1000M documents per day.
It has been continuously built, changed, and optimized (since version 1.4).
Hundreds of search requests per second, 10-30k documents indexed per second.

Putting cache entries on a specific Ignite server

I have an Ignite data grid of five servers (say A, B, C, D and E). A partitioned cache has been distributed across these five servers with the number of backups set to 1.
I want to store 100 million entries in this partitioned cache, but I want to control the partitioning of my cache entries across the Ignite servers.
Is it possible to direct my Ignite client to put a cache entry on a particular server (say E)?
The only way to do this is to implement your own Affinity Function instead of the ones provided out of the box (a minimal sketch follows below). However, I would encourage you to rethink this approach, because it's not scalable. The affinity functions included in Ignite are designed to provide even distribution across any set of nodes, so you can dynamically scale up and down whenever you need to. Your approach is much less flexible.
Also, I would recommend you go through the documentation page about Affinity Collocation. Very likely this will give you hints on how to implement your logic in a better way.
And finally, can you give some more details about your use case? I will be happy to give some advice on how to approach it.
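For reference, here is a minimal sketch of such a custom AffinityFunction. It pins all partitions of the cache to nodes carrying a marker attribute; the attribute name ("cache.group") and value ("E") are hypothetical and would be set on server E via IgniteConfiguration.setUserAttributes(). As noted above, this deliberately sacrifices the even distribution the built-in functions provide:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.ignite.cache.affinity.AffinityFunction;
import org.apache.ignite.cache.affinity.AffinityFunctionContext;
import org.apache.ignite.cluster.ClusterNode;

// Pins every partition of the cache to nodes tagged with a user attribute.
// "cache.group" / "E" are hypothetical names, not part of Ignite itself.
public class PinnedAffinityFunction implements AffinityFunction {
    private static final int PARTS = 1024;

    @Override public void reset() {
        // Stateless: no internal caches to clear.
    }

    @Override public int partitions() {
        return PARTS;
    }

    @Override public int partition(Object key) {
        // Standard hash-based spreading of keys over partitions.
        return (key.hashCode() & Integer.MAX_VALUE) % PARTS;
    }

    @Override public List<List<ClusterNode>> assignPartitions(AffinityFunctionContext ctx) {
        // Keep only the nodes tagged with the pinning attribute.
        List<ClusterNode> pinned = new ArrayList<>();
        for (ClusterNode n : ctx.currentTopologySnapshot())
            if ("E".equals(n.attribute("cache.group")))
                pinned.add(n);

        // Fall back to the whole topology if no node is tagged,
        // otherwise the cache would have no owners at all.
        if (pinned.isEmpty())
            pinned = ctx.currentTopologySnapshot();

        int owners = Math.min(1 + ctx.backups(), pinned.size()); // primary + backups
        List<List<ClusterNode>> assignment = new ArrayList<>(PARTS);
        for (int p = 0; p < PARTS; p++) {
            List<ClusterNode> nodes = new ArrayList<>(owners);
            for (int i = 0; i < owners; i++)
                nodes.add(pinned.get((p + i) % pinned.size())); // rotate so backup != primary
            assignment.add(nodes);
        }
        return assignment;
    }

    @Override public void removeNode(UUID nodeId) {
        // Stateless: nothing cached per node.
    }
}
```

You would plug this in with CacheConfiguration.setAffinity(new PinnedAffinityFunction()). Note that if the tagged node leaves, the fallback branch spreads the data back across the whole cluster, which is exactly the kind of rebalancing headache that makes this approach hard to scale.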

Setting up Elasticsearch for production

I want to know what configuration would be ideal for my case. I have 4 servers (nodes), each with 128 GB of RAM, and all 4 nodes will be in one cluster.
The total number of indexes would be 10, each receiving 1,500,000 documents per day.
Since I'll have 4 servers (nodes), I'll set master: true and data: true on all of them, so that if one node goes down, another becomes master. Every index will have 5 shards.
I want to know which config parameters I should alter to get the maximum potential out of Elasticsearch.
Also, tell me how much memory is enough for my usage, since I'll have very frequent select queries in production (maybe 1000 requests per second).
I need detailed suggestions.
I'm not sure anyone can give you a definitive answer on exactly how to configure your servers, since it is very dependent on your data structure, mapping, and specific queries.
You should read this great article series by Elastic regarding production environments.
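That said, one concrete piece you can pin down up front is the per-index shard and replica layout mentioned in the question. Here is a hedged sketch using the Elasticsearch 2.x-era Java TransportClient; the cluster name, host, and index name are made up, and whether 5 shards is right for your data volume is exactly what the articles above help you decide:

```java
import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class CreateIndexExample {
    public static void main(String[] args) throws Exception {
        // Connect to one of the four nodes; "prod-cluster" and "node1" are hypothetical.
        Settings settings = Settings.settingsBuilder()
            .put("cluster.name", "prod-cluster")
            .build();
        TransportClient client = TransportClient.builder().settings(settings).build()
            .addTransportAddress(
                new InetSocketTransportAddress(InetAddress.getByName("node1"), 9300));

        // 5 primary shards as planned, plus 1 replica of each so a single
        // node failure leaves every shard still available.
        client.admin().indices().prepareCreate("myindex-000001")
            .setSettings(Settings.settingsBuilder()
                .put("index.number_of_shards", 5)
                .put("index.number_of_replicas", 1)
                .build())
            .get();

        client.close();
    }
}
```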

Elasticsearch Multiple Cluster Search

I just watched Rafal Kuc's presentation and would like to use it as the basis of an Elasticsearch question.
If I added 50 million documents per day to a cluster where each day created a new index (the time-based data design pattern), that would eventually get pretty big. For example's sake, we'll put the average document at 15 KB.
Now let's say I needed to do that for 10 years. Eventually, I would need to create multiple clusters. Can I create multiple clusters in ES and search them all simultaneously? Could I use an alias for something like this? Or is it not possible?
No, I think a search via the API or your client of choice (Java/Python/etc.) is going to be against a single cluster.
Your client could make multiple requests, one to each cluster, perhaps if you organized your clusters by year (see the sketch below).
In theory a cluster could just grow forever, although at some point I would think the overhead of scattering and gathering a query to N nodes (where N is very, very large) would cause problems.
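Here is a minimal sketch of that fan-out approach, using only the JDK's HTTP classes against each cluster's REST endpoint. The host names and index pattern are hypothetical, and a real client would parse the JSON and re-sort the merged hits by _score:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MultiClusterSearch {
    // One entry point per cluster, e.g. an archive cluster per year
    // plus the current one. Both host names are made up.
    private static final String[] CLUSTERS = {
        "http://es-archive.example.com:9200",
        "http://es-current.example.com:9200"
    };

    public static void main(String[] args) throws Exception {
        String query = "{\"query\":{\"match\":{\"message\":\"error\"}},\"size\":10}";
        List<String> rawResponses = new ArrayList<>();

        // Fan the same query out to every cluster and collect the raw JSON.
        for (String cluster : CLUSTERS)
            rawResponses.add(post(cluster + "/logs-*/_search", query));

        rawResponses.forEach(System.out::println);
    }

    // Plain HTTP POST of the query body to one cluster's _search endpoint.
    private static String post(String url, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null)
                sb.append(line);
        }
        return sb.toString();
    }
}
```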

Cassandra scaling cheat-sheet

Of course you can only know the performance of your system, with your load and your use cases, by ... actually implementing it! That aside, before embarking on a prototype, I'm searching for some very rough estimates of how Cassandra performs.
For various configurations of nodes and data centres, and for various read and write consistency levels, what are the chances of reading a stale value? What kind of key reads and writes per second would you expect to sustain, and what kind of latency would each read and write have?
Cassandra benchmarking presented at VLDB earlier this year: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Performance/consistency tradeoffs: http://www.datastax.com/dev/blog/your-ideal-performance-consistency-tradeoff
We run an application that posts 500 datapoints per second per web node (we have 6 Cassandra nodes). We could probably get to 1000 datapoints per second per node if we cached 100 MB of data in the client to avoid the read.
The profile of that is PlayOrm with one findAll(List keys) and one putAll(List entities) on each request, where each key in that list is a single datapoint; the clients send a batch of datapoints over HTTP so we don't have as much HTTP overhead (sketched below). Maybe that gives you some idea, at least.
We have not yet tested the correct ratio of web nodes to Cassandra nodes, but I suspect it is like my last project, where it was near one to one, though it changes with the profile.
We currently run 4 web nodes and get 2000 datapoints per second.
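For illustration, here is roughly what that batched write path looks like with the DataStax Java driver instead of PlayOrm. The contact point, keyspace, and schema are all hypothetical; assume a table datapoints(sensor_id text, ts bigint, value double) with sensor_id as the partition key:

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BatchedDatapoints {
    public static void main(String[] args) {
        // "cassandra1" and the "metrics" keyspace are made-up names.
        Cluster cluster = Cluster.builder().addContactPoint("cassandra1").build();
        Session session = cluster.connect("metrics");

        PreparedStatement insert = session.prepare(
            "INSERT INTO datapoints (sensor_id, ts, value) VALUES (?, ?, ?)");

        // One batch per incoming HTTP request: the client already groups its
        // datapoints, so we pay the round trip once per batch rather than
        // once per datapoint. All rows here target the same partition, which
        // keeps an unlogged batch efficient.
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (int i = 0; i < 100; i++)
            batch.add(insert.bind("sensor-42", (long) i, Math.random()));
        session.execute(batch);

        cluster.close();
    }
}
```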
