We are using JanusGraph with an HBase backend to store large data lineage graphs. The basic usage is to find a node and then do an impact analysis by recursively traversing all the nodes that are affected by it.
The throughput I am currently getting is about 620 edge traversals per second, which I consider quite slow.
Here is the Gremlin query:
g.V().has('name', 'xxx').
  repeat(
    outE('flows_into').dedup().inV()
  ).
  until(
    or(
      outE('flows_into').count().is(0),
      cyclicPath()
    )
  ).
  path().
  unfold().
  dedup().
  group().by(label).by(count())
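For comparison, here is a variant I have been sketching with the gremlinpython client (just a sketch; the connection URL is a placeholder, and unlike the query above it counts only the distinct vertices reached, not the edges). It drops the path() bookkeeping, using simplePath() for cycle protection and emit().dedup() instead of path().unfold().dedup():

# Sketch only: same impact analysis without materializing full paths.
# The server URL is a placeholder for our JanusGraph Server endpoint.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import T
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# simplePath() stops cyclic walks; emit() surfaces every vertex visited
# during the repeat; dedup() collapses vertices reached via multiple paths
# before the per-label count.
impacted_by_label = (
    g.V().has('name', 'xxx').
      repeat(__.out('flows_into').simplePath()).
      emit().
      dedup().
      groupCount().by(T.label).
      next()
)
print(impacted_by_label)
conn.close()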
Here is our configuration/environment:
- JanusGraph Server 0.6.2: 32 GB RAM, enough CPU
- HBase 2.1.4: 13 nodes, 750 GB RAM on each node
The graph has around 4 million nodes and 5 million edges.
Is this speed normal? Is there a way to make the query run faster? Would Cassandra be a better backend for our use case?
It is hard to find any statistics about JanusGraph performance.
At our university we have an Elasticsearch cluster with one node. Now we have the budget to install more powerful servers. We produce 7-10 million access logs per day.
What is better for creating the cluster:
a. 3 powerful servers, each with 64 GB RAM, 16 CPUs, and SSDs.
b. 14 less powerful servers, each with 32 GB RAM, 8 CPUs, and SSDs.
(a and b have the same price.)
c. Or maybe you have another recommendation?
Thank you in advance
It depends on the scenario. For the logging case you are describing, option b seems more flexible to me. Let me explain my reasoning:
Since you are in a logging scenario, implement the hot/warm architecture. You will mainly write to and read from recent indices; only occasionally will you access older data, and you will probably want to shrink old indices and close even older ones.
Set up at least 3 master-eligible nodes to prevent split-brain problems. Configure the same nodes also as coordinating nodes (11 nodes left).
Install 2 ingest nodes to move the ingestion workload onto dedicated nodes (9 nodes left).
Install 3 hot data nodes for storing the most recent indices (6 nodes left).
Install 6 warm data nodes for holding older, shrunk, and closed indices (0 nodes left); see the hot/warm routing sketch after these notes.
The previous setup is just an example; the node counts/roles should be adjusted to your needs.
If you need more resiliency, add more master nodes and increase the replica count for the indices; this will also reduce the total capacity.
The more old data you need to keep searchable or held in already-closed indices, the more warm nodes you will need; rebalance the hot/warm node count according to your needs. If you can drop your old data early, increase the hot node count instead.
If you have an X-Pack license, consider installing ML/alerting nodes. Add these roles to the master nodes, or reduce the data node count in favor of ML/alerting.
Do you need Kibana/Logstash? Depending on the workload, dedicate one or two nodes exclusively to them.
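As a rough illustration of the hot/warm routing (a sketch only, assuming the data nodes are tagged with a custom node.attr.box_type attribute and that you use the Python elasticsearch client; hosts, index names, and shard counts are placeholders, and the exact parameters vary by ES version):

# Sketch: create new access-log indices on the hot tier and move aged
# indices to the warm tier. Assumes data nodes carry a custom
# "node.attr.box_type: hot|warm" setting in elasticsearch.yml.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://coordinating-node:9200"])

# Index template: every new accesslog-* index starts on the hot nodes.
es.indices.put_template(
    name="accesslog",
    body={
        "index_patterns": ["accesslog-*"],
        "settings": {
            "index.number_of_shards": 3,
            "index.number_of_replicas": 1,
            "index.routing.allocation.require.box_type": "hot",
        },
    },
)

# Later (e.g., from a nightly job), relocate an old index to the warm tier.
es.indices.put_settings(
    index="accesslog-2018.01.01",
    body={"index.routing.allocation.require.box_type": "warm"},
)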
Assuming both options use the same mainboards, you have more potential to quickly scale the 14 boxes up just by adding more RAM/CPU/storage. With the 3 nodes already maxed out on specs, you would need to set up new boxes and join them to the cluster in order to scale up, though that may also put more recent hardware in your rack over time.
Please also have a look at this: https://www.elastic.co/pdf/architecture-best-practices.pdf
If you need some background on sharding configuration, please see "ElasticSearch - How does sharding affect indexing performance?".
BTW: Thomas is right in his comment about the heap size. Please have a look at this if you want the background: https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
Elasticsearch is used as a cache in front of a PostgreSQL database to avoid a lot of joins and speed up my application's selects.
Initially everything runs on a single large server (32 GB RAM): webapp, nginx, PostgreSQL, Celery, Elasticsearch.
Now I have 2 additional smaller nodes which are not used at all (only for extra storage via nbd-server).
So I have:
- 1 large node with ES; about 12-16 GB of RAM is available for ES.
- 2 small nodes with 8 GB RAM each; all of it is free for ES.
All 3 nodes have SSDs and the same CPU.
Later I will add more 8 GB nodes (as storage + ES).
What is the best way to build an ES cluster on these 3 nodes? Should all of them be data/master nodes, or would it be better to use the large node as a master and the 2 small ones as data nodes?
I presume that having more nodes in a Storm cluster increases the "keep-topology-alive" intra-cluster communication.
Given that the topology works fine with 10 nodes (2 or 4 CPUs, 4 GB RAM) for small data, can we scale the topology to 1,000 or 10,000 nodes and still be competitive for (very) big data? Is there any known practical limit?
Thanks
Scaling a Storm cluster is limited by the speed of state storage in ZooKeeper, most of which is "heartbeats" from workers. The theoretical limit is more or less 1,200 nodes (it depends on the disk speed; an 80 MB/s write speed is assumed here). Obviously, using a faster disk will let things scale further.
However, people at Yahoo are working on an in-memory store for worker heartbeats. Their solution will increase the limit to about 6,250 nodes over Gigabit Ethernet connections, and 10-Gigabit connections raise this theoretical limit to 62,500 nodes. You can take a look at the Hadoop Summit 2015 presentation by Bobby Evans for further details.
I have a 15-node Elasticsearch cluster and am indexing a lot of documents. The documents are of the form { "message": "some sentences" }. When I had a 9-node cluster, I could get CPU utilization up to 80% on all of the nodes; after growing it to a 15-node cluster, I get 90% CPU usage on 4 nodes and only ~50% on the rest.
The specification of the cluster is:
- 15 nodes, c4.2xlarge EC2 instances
- 15 shards, no replicas
- A load balancer sits in front of all the instances, and the instances are accessed through it
- Marvel is running and is used to monitor the cluster
- Refresh interval: 1s
I could index 50k docs/sec on 9 nodes and only 70k docs/sec on 15 nodes. Shouldn't I be able to do more?
I'm not yet an expert on scalability and load balancing in ES, but here are some things to consider:
- Load balancing is native to ES, so putting a load balancer in front can actually work against the built-in load balancing. It's a bit like having a speed limiter on your car but also braking manually: it doesn't make much sense, since the limiter should already do the job and your "manual regulation" prevents it from doing it properly. Have you tried dropping your load balancer and using only the native load balancing to see how it fares? (See the sketch at the end of this answer.)
- While you gain more CPU/computation power across different servers/shards, you are also forced to go through multiple shards every time you write/read a document, so if 1 shard can do N computations, M shards won't actually be able to do M*N computations.
- Having 15 shards is probably overkill in a lot of cases.
- Having 15 shards but no replicas is risky: if any of your 15 servers goes down, you won't be able to access your whole index.
- You can actually run multiple nodes on a single server.
What is your index size in terms of storage?
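On the first point, here is a sketch of what relying on the native load balancing could look like from the client side, assuming the Python elasticsearch client (hostnames, index name, and document shape are placeholders; very old clusters may additionally require a _type field in the actions):

# Sketch: bulk-index directly against several cluster nodes and let the
# client round-robin between them, bypassing the external load balancer.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(
    ["http://node-1:9200", "http://node-2:9200", "http://node-3:9200"],
    sniff_on_start=True,            # discover the rest of the cluster
    sniff_on_connection_fail=True,  # drop nodes that stop responding
)

def actions(messages):
    for message in messages:
        # One bulk action per document; the index name is a placeholder.
        yield {"_index": "messages", "_source": {"message": message}}

messages = ("some sentences %d" % i for i in range(100000))
ok, errors = helpers.bulk(es, actions(messages), chunk_size=5000)
print(ok, errors)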
I'm having an odd issue where I set up a DSE 4.0 cluster with 1 Cassandra node and 1 Solr node (using DseSimpleSnitch) and performance is great. If I add additional nodes to have 3 Cassandra nodes and 3 Solr nodes, the performance of my Solr queries goes downhill dramatically. Does anyone have any idea what I might be doing wrong? I have basically all default options for DSE, and I have tried wiping all data and recreating everything from scratch several times with the same result. I've also tried creating the keyspace with replication factors of 1 and 2, with the same results.
Maybe my use case is a bit odd, but I'm using Solr for OLTP-type queries (via SolrJ with binary writers/readers), which is why the performance is critical. With a very light workload of, say, 5 clients making very simple Solr queries, the response times go up about 50% from a single Solr node to 3 Solr nodes, with only a few hundred small documents seeded for my test (~25 ms to ~50 ms). The response times get about 2 to 3 times slower with 150 clients against 3 nodes compared to a single node. The response times for Cassandra are unchanged; it's only the Solr queries that get slower.
Could there be something with my configuration causing this?
Solr queries need to fan out to cover the full range of keys for the column family. So, when you go from one node to three nodes, it should be no surprise that the total query time rises to three times that of a query that can be satisfied by a single node.
You haven't mentioned the RF for the Search DC.
For more complex queries, the fan out would give a net reduction in query latency since only a fraction of the total query time would occur on each node, while for a small query the overhead of the fanout and aggregation of query results dwarfs the time to do the actual Solr core query.
Generally, Cassandra queries tend to be much simpler than Solr queries, so they are rarely comparable.
Problem solved. After noticing that the documentation mentions not to use virtual nodes for Solr nodes (without saying why), I checked my configuration and noticed I was using virtual nodes. I changed my configuration to not use virtual nodes and the performance issue disappeared. I also upgraded from 4.0.0 to 4.0.2 at the same time, but I'm pretty sure it was the virtual nodes causing the problem.
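For anyone making the same change: turning vnodes off means giving each node an explicit initial_token in cassandra.yaml. Here is a small sketch of the usual even-spacing calculation, assuming Murmur3Partitioner and a three-node DC (the node count is just an example):

# Sketch: evenly spaced initial_token values for a DC that does not use
# vnodes. Assumes Murmur3Partitioner, whose token range is [-2**63, 2**63 - 1].
def initial_tokens(node_count):
    return [(2**64 // node_count) * i - 2**63 for i in range(node_count)]

for i, token in enumerate(initial_tokens(3)):
    # Paste each value into the corresponding node's cassandra.yaml.
    print("node %d -> initial_token: %d" % (i + 1, token))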