What is the minimum System requirement to launch janusgraph - elasticsearch

I just started JanusGraph and my total RAM usage is more than 5GB without firing any queries. I have the following dependent services running
JanusGraph
GremlinServer (via JanusGraph)
Cassandra (via JanusGraph)
Elasticsearch (via JanusGraph)
There are some java instances running
I am trying to find out the minimum system requirements for janusgraph but I am unable to find it mentioned in its documentation. Anyone has a link for the same? or any idea how much is the minimum system requirements?
Also are there any configuration changes to make it consume less RAM?

Related

Ability to offer only part of the node resources?

Using dc/os we like to schedule tasks close to the data that the task requires that in our case is stored in hadoop/hdfs (on an HDP cluster). Issue is that the hadoop cluster is not run from within dc/os and so we are looking for a way to offer only a subset of the system resources.
For example: say we like to reserve 8GB of memory to data node services, then we like to provide the remainder to dc/os to schedule tasks.
From what i have read so far, the task can specify the resources it requires, but i have not found any means to specify what you want to offer from the node perspective.
I'm aware that a CDH cluster can be run on dc/os, that would be one way to go, but for now that is not provided for HDP.
Thanks for any idea's/tips,
Paul

elasticsearch configure hard disk size limit

We have a Linux machine with 56GB of hard disk, but querying the max available space of ES shows only 28GB. We would like the ES to use more of the available space on the machine, but could not find the configuration for this. Thanks.
Elasticsearch uses all the space that is available on your machine. However if this is a single machine in your cluster, then no copies of data will be created (as it does only make sense in a distributed environment to be failsafe friendly). So if you dont have more data, then no more is used.
It might make sense to talk about your setup and provide more information.

system_auth replication in Cassandra

I'm trying to configure authentication on Cassandra. It seems like because of replication strategy that is used for system_auth, it can't replicate user credentials to all the nodes in cluster, so I end up getting Incorrect credentials on one node, and getting successful connection on another.
This is related question. The guy there says you have to make sure credentials are always on all nodes.
How to do it? The option that is offered there says you have to alter keyspace to put replication factor equal to amount of nodes in cluster, then run repair on each node. That's whole tons of work to be done if you want your cassandra to be dynamically scalable. If I add 1 node today, 1 node another day, alter keyspace replication and then keep restarting nodes manually that will end up some kind of chaos.
Hour of googling actually leaded to slightly mentioned EverywhereStrategy, but I don't see anywhere in docs it mentioned as available. How do people configure APIs to work with Cassandra authentication then, if you can't be sure that your user actually present on node, that you're specifying as contact point?
Obviously, talking about true scale, when you can change the size of cluster without doing restarts of each node.
When you enable authentication in Cassandra, then Yes you have increase the system_auth keyspace replication_factor to N(total number of nodes) and run a complete repair, but you don't need to restart the nodes after you add a new Node.
If repair is consuming more time then you optimize your repair like repair only the system_auth keyspace
nodetool repair system_auth
(or)
nodetool repair -pr system_auth
As per Cassandra a complete repair should be done regularly. For more details on repair see the below links:
http://www.datastax.com/dev/blog/repair-in-cassandra
https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra/
http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html
Answering your questions:
Question: How do people configure APIs to work with Cassandra authentication then, if you can't be sure that your user actually present on node, that you're specifying as contact point?
Answer: I'm using Cassandra 2.2 and Astyanax thrift API from my Spring project, using which I am able to handle the Cassandra authentication effectively. Specify what version of Cassandra you are using and what driver you are using to connect CQL driver or Astyanax thrift API?
Question: Obviously, talking about true scale, when you can change the size of cluster without doing restarts of each node.
Answer: Yes you can scale your Cassandra cluster without restarting nodes, please check the datastax documentation for Cassandra 2.2 version:
http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/operations/opsAddNodeToCluster.html
Check the datastax docs for the version you are using.

Ambari scaling memory for all services

Initially I had two machines to setup hadoop, spark, hbase, kafka, zookeeper, MR2. Each of those machines had 16GB of RAM. I used Apache Ambari to setup the two machines with the above mentioned services.
Now I have upgraded the RAM of each of those machines to 128GB.
How can I now tell Ambari to scale up all its services to make use of the additional memory?
Do I need to understand how the memory is configured for each of these services?
Is this part covered in Ambari documentation somewhere?
Ambari calculates recommended settings for memory usage of each service at install time. So a change in memory post install will not scale up. You would have to edit these settings manually for each service. In order to do that yes you would need an understanding of how memory should be configured for each service. I don't know of any Ambari documentation that recommends memory configuration values for each service. I would suggest one of the following routes:
1) Take a look at each services documentation (YARN, Oozie, Spark, etc.) and take a look at what they recommend for memory related parameter configurations.
2) Take a look at the Ambari code that calculates recommended values for these memory parameters and use those equations to come up with new values that account for your increased memory.
I used this https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/determine-hdp-memory-config.html
Also, Smartsense is must http://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.2.0/index.html
We need to define cores, memory, Disks and if we use Hbase or not then script will provide the memory settings for yarn and mapreduce.
root#ttsv-lab-vmdb-01 scripts]# python yarn-utils.py -c 8 -m 128 -d 3 -k True
Using cores=8 memory=128GB disks=3 hbase=True
Profile: cores=8 memory=81920MB reserved=48GB usableMem=80GB disks=3
Num Container=6
Container Ram=13312MB
Used Ram=78GB
Unused Ram=48GB
yarn.scheduler.minimum-allocation-mb=13312
yarn.scheduler.maximum-allocation-mb=79872
yarn.nodemanager.resource.memory-mb=79872
mapreduce.map.memory.mb=13312
mapreduce.map.java.opts=-Xmx10649m
mapreduce.reduce.memory.mb=13312
mapreduce.reduce.java.opts=-Xmx10649m
yarn.app.mapreduce.am.resource.mb=13312
yarn.app.mapreduce.am.command-opts=-Xmx10649m
mapreduce.task.io.sort.mb=5324
Apart from this, we have formulas there to do calculate it manually. I tried with this settings and it was working for me.

Strategy to persist the node's data for dynamic Elasticsearch clusters

I'm sorry that this is probably a kind of broad question, but I didn't find a solution form this problem yet.
I try to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. Therefore, I built a Docker image that can start on Marathon and dynamically scale via either the frontend or the API.
This works great for test setups, but the question remains how to persist the data so that if either the cluster is scaled down (I know this is also about the index configuration itself) or stopped, and I want to restart later (or scale up) with the same data.
The thing is that Marathon decides where (on which Mesos Slave) the nodes are run, so from my point of view it's not predictable if the all data is available to the "new" nodes upon restart when I try to persist the data to the Docker hosts via Docker volumes.
The only things that comes to my mind are:
Using a distributed file system like HDFS or NFS, with mounted volumes either on the Docker host or the Docker images themselves. Still, that would leave the question how to load all data during the new cluster startup if the "old" cluster had for example 8 nodes, and the new one only has 4.
Using the Snapshot API of Elasticsearch to save to a common drive somewhere in the network. I assume that this will have performance penalties...
Are there any other way to approach this? Are there any recommendations? Unfortunately, I didn't find a good resource about this kind of topic. Thanks a lot in advance.
Elasticsearch and NFS are not the best of pals ;-). You don't want to run your cluster on NFS, it's much too slow and Elasticsearch works better when the speed of the storage is better. If you introduce the network in this equation you'll get into trouble. I have no idea about Docker or Mesos. But for sure I recommend against NFS. Use snapshot/restore.
The first snapshot will take some time, but the rest of the snapshots should take less space and less time. Also, note that "incremental" means incremental at file level, not document level.
The snapshot itself needs all the nodes that have the primaries of the indices you want snapshoted. And those nodes all need access to the common location (the repository) so that they can write to. This common access to the same location usually is not that obvious, that's why I'm mentioning it.
The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort is this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project, which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don't know what is the status, but you may want to give it a try.

Resources