JanusGraph (Gremlin Server): improving import performance

I'm trying to import graph data of about 1GB (~100k vertices, 3.6 million edges) in Gryo format. When I try to import it through the Gremlin client, I get the following error:
gremlin> graph.io(IoCore.gryo()).readGraph('janusgraph_dump_2020_09_30_local.gryo')
GC overhead limit exceeded
Type ':help' or ':h' for help.
Display stack trace? [yN]y
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.<init>(NonBlockingHashMapLong.java:471)
at org.cliffc.high_scale_lib.NonBlockingHashMapLong.initialize(NonBlockingHashMapLong.java:241)
Gremlin Server and Cassandra details are as follows:
Gremlin-Server:
Janusgraph Version: 0.5.2
Gremlin Version: 3.4.6
Heap: JAVA_OPTIONS="-Xms4G -Xmx4G …
// gremlin conf
threadPoolWorker: 8
gremlinPool: 16
scriptEvaluationTimeout: 90000
// cql props
query.batch=true
Cassandra is in Cluster with 3 nodes
Cassandra version: 3.11.0
Node1: RAM: 8GB, Cassandra Heap: 1GB (-Xms1G -Xmx1G)
Node2: RAM: 8GB, Cassandra Heap: 4GB (-Xms4G -Xmx4G)
Node3: RAM: 8GB, Cassandra Heap: 4GB (-Xms4G -Xmx4G)
Gremlin Server is installed on each node (with a load balancer in front for clients), but we are executing Gremlin queries on Node1.
Can someone help me with the following:
What do I need to do to import the data (any configuration changes)?
>>> What is the best way to export/import huge amounts of data into JanusGraph (Gremlin Server)? (This is the question I most need answered.)
Is there any way I can export the data in chunks and import it in chunks?
Thanks in advance.
Edit:
I've increased the Gremlin Server heap on Node1 to 2GB, but the import query response was cancelled. The RAM is probably not enough to hold larger heaps for both Gremlin Server and Cassandra, which is why I had kept the Gremlin Server heap at 1GB so that the query would at least execute.
Considering huge data volumes (billions of vertices/edges), this is very little; I hope 8GB RAM and 2-4 cores would be sufficient for each node in the cluster.

Graph.io() and the now preferred Gremlin io() step use GryoReader to read your file (unless the graph provider overrides the Gremlin io() step, and I don't think that JanusGraph does). So, if you use GryoReader you typically end up needing a lot of memory (more than you would expect) because it holds a cache of all vertices to speed up loading. Ultimately, it is not terribly efficient at loading, and the expectation, from TinkerPop's perspective, has been that providers would optimize loading with their own native bulk loader by intercepting the io() step when encountered. In the absence of that optimization, the general recommendation is to use the bulk loading tools of the graph you are using directly. For JanusGraph that likely means parallelizing the load yourself as part of a script, or using a Gremlin OLAP method of loading. Some recommendations can be found in the JanusGraph documentation as well as in these blog posts:
https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-ace7d146af05
https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-part-2-ca946db26582
You can also consider a custom VertexProgram for bulk loading. TinkerPop has CloneVertexProgram, the more general successor to BulkLoaderVertexProgram (now deprecated/removed in recent versions), which had some popularity with JanusGraph as its generalized bulk loading tool before TinkerPop moved away from trying to supply that sort of functionality.
At your scale of a few million edges, I probably would have written a small Groovy script to run in the Gremlin Console that loads the data directly into the graph, avoiding the step through an intermediate format like Gryo. It would probably go much faster and would save you from having to dig too far into bulk loading tactics for JanusGraph. If you go that route, the link to the JanusGraph documentation I supplied above should be of most help to you. You can save worrying about OLAP, Spark and other options until you have hundreds of millions of edges (or more) to load.
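For illustration, a minimal sketch of such a Gremlin Console (Groovy) script, assuming a hypothetical edges.csv of fromId,toId,label rows, a 'myId' vertex property backed by a composite index, and the stock conf/janusgraph-cql.properties; adapt the file layout and property keys to your own data:

// Rough sketch only: load edges straight from CSV in the Gremlin Console,
// committing in batches so the open transaction (and the heap) stays small.
graph = JanusGraphFactory.open('conf/janusgraph-cql.properties')
g = graph.traversal()

batchSize = 10000
counter = 0
new File('edges.csv').eachLine { line ->
    def (fromId, toId, label) = line.tokenize(',')
    // look up each endpoint by the (assumed indexed) 'myId' property, creating it if absent
    def from = g.V().has('myId', fromId).tryNext().orElseGet { g.addV().property('myId', fromId).next() }
    def to   = g.V().has('myId', toId).tryNext().orElseGet { g.addV().property('myId', toId).next() }
    g.V(from).addE(label).to(to).iterate()
    if (++counter % batchSize == 0) {
        graph.tx().commit()
    }
}
graph.tx().commit()
graph.close()

Batching the commits is what keeps the memory footprint flat, which is exactly what the GryoReader's all-vertex cache does not do.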

Related

Apache NIFI: A cluster or a single big server?

My flow is as below and runs as a cron job scheduled every 10 minutes:
Query data from a database. Each time, the query result can contain 200 million records.
Use PartitionRecord to group records by a specific field of the query result.
Transform each group produced by PartitionRecord into XML. It is hard to say how many flowfiles a group contains.
Send the XML to ActiveMQ-Artemis.
I will use NiFi to implement the above flow (as requested by my customer).
Now I have below computing resources:
OS: Ubuntu Server 20.04 LTS
CPU: 48 cores
Memory: 384 GB
Storage: SSD, enough space.
There are two options I can think of:
Build a NiFi cluster composed of three nodes, each with 16 cores and 128GB RAM.
Build a single NiFi instance with 48 cores and 384GB RAM.
Which option should I use?
Thanks
Here are some pros and cons I can think of:
Single node pros:
Easier to configure/setup
Easier to manage
Single node cons:
If the node has any unexpected issue, you're no longer processing
Come upgrade time, you may have some downtime
NiFi may not efficiently use large amounts of RAM (you're not getting as much bang for your buck)
Cluster cons:
More complex setup/configuration (needs Zookeeper and extra NiFi cluster config)
More complex management
Sometimes experiences cluster connection issues
Cluster pros:
Reasonable level of redundancy
Should be able to upgrade a node at a time and keep operations going (you would need to investigate how simple this is)
Should maximize hardware utilization

MemSQL performance issues

I have a single-node MemSQL install with one master aggregator and two leaves (all on a single box). The machine has 2 cores, 16GB RAM, and the MemSQL columnstore data is ~7GB (coming from a 21GB CSV). When running queries on the data, memory usage caps at ~2150MB (11GB sitting free). I've configured both leaves to have maximum_memory = 7000 in the memsql.cnf files for both nodes (memsql-optimize does similar). During query execution, the master aggregator sits at 100% CPU, with the leaves at 0-8% CPU.
This does not seem like an efficient use of system resources, but I'm not sure what I can do to configure the system or MemSQL to make more efficient use of CPU or memory. Any help would be greatly appreciated!
If your machine is at 100% CPU (on all cores) during query execution, it doesn't really matter which MemSQL node is using it: your workload throughput is still bottlenecked on CPU. However, for most queries you wouldn't expect most of the CPU use to be on the aggregator, so you may want to take a look at the EXPLAIN or PROFILE output of your queries.
Columnstore data is cached in memory as part of the OS file cache - it isn't counted as memory reserved by MemSQL, which is why your memory usage is less than the size of the columnstore data.
It turned out my database was coming from somewhere other than the current MemSQL install (perhaps an older cluster configuration), despite there being only a single MemSQL cluster on the machine. The Databases section in the Web UI was displaying no databases/tables, yet my queries were succeeding with the expected answers.
Dropping the database and reloading from CSV remedied the situation. All cores are now used during query execution.

Amount of data storage : HDFS vs NoSQL

Several sources on the internet explain that HDFS is built to handle a greater amount of data than NoSQL technologies (Cassandra, for example). In general, when we go beyond 1TB we must start thinking about Hadoop (HDFS) and not NoSQL.
Besides the architecture and the fact that HDFS supports batch processing and that most NoSQL technologies (e.g. Cassandra) perform random I/O, and besides the schema design differences, why can't NoSQL Solutions (again, for example Cassandra) handle as much data as HDFS?
Why can't we use a NoSQL technology as a Data Lake? Why should we only use them as hot storage solutions in a big data architecture?
why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?
HDFS has been designed to store massive amounts of data and support batch mode (OLAP) whereas Cassandra was designed for online transactional use-cases (OLTP).
The current recommendation for server density is 1TB/node for spinning disk and 3TB/node when using SSD.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore there are a few JIRA tickets to improve server density in the future.
There is a limit right now for server density in Cassandra because of:
repair. With an eventually consistent DB, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely, to compute the Merkle tree, a binary tree of digests). But the issue of repair is mostly solved by the incremental repair introduced in Cassandra 2.1.
compaction. With an LSM-tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of deprecated or deleted data. The more data you have on one node, the longer compaction takes. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy, which has some tuning knobs to stop compacting data after a time threshold. There are a few people using DateTiered compaction in production with densities up to 10TB/node.
node rebuild. Imagine one node crashes and is completely lost; you'll need to rebuild it by streaming data from other replicas. The higher the node density, the longer it takes to rebuild the node.
load distribution. The more data you have on a node, the greater the load average (high disk I/O and high CPU usage). This will greatly impact the node latency for real-time requests. Whereas a difference of 100ms is negligible for a batch scenario that takes 10 hours to complete, it is critical for a real-time database/application subject to a tight SLA.

Spark SQL performance with Simple Scans

I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables), and I am running my queries through Spark SQL using hive context.
After checking the performance tuning documents on the Spark site and some talks from the latest Spark Summit, I decided to set the following configs in my spark-env:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M
(As my tasks tend to be long, the overhead of starting multiple JVMs, one per worker, is much less than the total query time.) As I monitored job progress, I realized that while the worker memory is 2.5GB, the executors (one per worker) have a max memory of 512MB (the default). I enlarged this value in my application as:
conf.set("spark.executor.memory", "2.5g");
I was trying to give the maximum available memory on each worker to its only executor, but I observed that my queries ran slower than in the previous case (the default 512MB). Changing 2.5g to 1g improved the time; it is close to, but still worse than, the 512MB case. I guess what I am missing here is the relationship between SPARK_WORKER_MEMORY and spark.executor.memory.
Isn't it the case that the worker tries to split this memory among its executors (in my case, its only executor)? Or is there other work being done by the worker that needs memory?
What other important parameters do I need to look into and tune at this point to get the best response time out of my hardware? (I have read about the Kryo serializer and I am about to try it; I am mainly concerned about memory-related settings and also knobs related to the parallelism of my jobs.) As an example, for a simple scan-only query, Spark is worse than Hive (almost 3 times slower) while both are scanning the exact same table and file format. That is why I believe I am missing some parameters by leaving them at their defaults.
Any hint/suggestion would be highly appreciated.
SPARK_WORKER_CORES is shared across the instances. Increase the cores to, say, 8 and you should see the kind of behavior (and performance) that you had anticipated.
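As a side note, the Kryo serializer mentioned in the question is a configuration-only change; a minimal sketch in the same style as the conf.set call above (the application name and memory value are illustrative, not tuned recommendations):

import org.apache.spark.SparkConf

SparkConf conf = new SparkConf()
    .setAppName("tpch-like-queries")                                         // hypothetical app name
    .set("spark.executor.memory", "2500m")                                   // keep in line with SPARK_WORKER_MEMORY
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");  // switch from Java serialization to Kryo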

Having performance issues with Datastax cassandra

I have installed DataStax Cassandra on 2 independent machines (one with 16GB RAM and the other with 32GB RAM) and am going with mostly default configuration.
I have created a table with some 700 columns. When I try to insert records using Java it is able to insert about 1000 records per 30 seconds, which seems very low to me; as per the DataStax benchmark it should be around 18000+. To my surprise, performance is the same on both the 32GB and 16GB RAM machines.
I am new to Cassandra, can anyone help me in this regard? I feel I am doing something wrong with the cassandra.yaml configuration.
I did a benchmarking and tuning exercise on Cassandra some time ago and found some useful settings, which are mentioned below:
In Cassandra, data distribution is based on strategies. The default is a combination of round-robin and a token-aware policy, which works best in almost all cases. If you want to customize data distribution, it is possible to write a new data distribution strategy in Cassandra, i.e. distribute the data based on a location, based on an attribute, etc., which can be best for a customized requirement.
Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row. We used a bloom filter value of 0.1 to maintain a balance between efficiency and overhead.
Consistency level is a key parameter in NoSQL databases. Try QUORUM or ONE.
Other JVM tuning options, like heap memory size and survivor ratio, should be optimal to achieve maximum performance.
If plenty of memory is available, the memtable size can be increased so that it fits in memory, which will improve performance. The interval for flushing memtables to disk should be high enough that it doesn't perform unnecessary I/O operations.
Concurrency settings in Cassandra are important for scale. Based on our tests and observations, we found that Cassandra performs better when concurrency is set to number of cores * 5 and native_transport_max_threads is set to 256.
Follow the additional tuning settings recommended by Cassandra, like disabling swap, ulimit settings, and compaction settings.
The replication factor in Cassandra should be equal to the number of nodes in the cluster to achieve maximum throughput of the system.
These mostly apply to insertion, with a small impact on reads.
I hope this will help you :)
Are you using async writes?
Try running cassandra-stress; that way you can isolate client issues.
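For example, a baseline run along these lines (the numbers and host are purely illustrative) exercises the server independently of your Java client:
cassandra-stress write n=1000000 -rate threads=50 -node <cassandra_host>
If cassandra-stress reaches the expected throughput while your application does not, the bottleneck is in the client code rather than in cassandra.yaml.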
Another option is Brian's cassandra-loader:
https://github.com/brianmhess/cassandra-loader
Since you are writing in Java, use Brian's code as a best practice example.
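To illustrate the async-write suggestion, here is a minimal sketch against the DataStax Java driver 3.x API (written as Groovy for consistency with the snippets above; the keyspace, table, column names and the 'records' collection are hypothetical stand-ins for your own). Brian's cassandra-loader linked above does essentially this, plus batching, rate limiting and retries:

import com.datastax.driver.core.Cluster

def cluster = Cluster.builder().addContactPoint('127.0.0.1').build()
def session = cluster.connect('my_keyspace')                         // hypothetical keyspace
def insert  = session.prepare('INSERT INTO wide_table (id, col1) VALUES (?, ?)')  // hypothetical table, trimmed to two columns

def inFlight = []
records.each { rec ->                                                 // 'records' stands in for your input rows
    inFlight << session.executeAsync(insert.bind(rec.id, rec.col1))   // returns immediately with a ResultSetFuture
    if (inFlight.size() >= 128) {                                     // cap the number of concurrent requests
        inFlight.each { it.getUninterruptibly() }                     // wait for this window to complete
        inFlight.clear()
    }
}
inFlight.each { it.getUninterruptibly() }
cluster.close()

Synchronous execute() calls pay a full network round trip per insert; keeping a window of asynchronous requests in flight is usually the first thing to try before touching cassandra.yaml.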
