Stanford CoreNLP NER training freezes - stanford-nlp

I'm trying to train a NER model for Portuguese. I succeeded when training with 10 entity classes. However, with the same training dataset, increasing the number of entity classes to around 30 makes training freeze after some iterations.
I even increased the RAM allocation up to 30g, but no luck. I'm using version 3.7.0 of Stanford CoreNLP and ran the following command (with the default prop configuration):
java -d64 -Xmx30g -cp stanford-corenlp.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop "prop.prop"
Any idea on how to get it working?
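For reference, a minimal CRFClassifier properties file of the kind referred to here typically looks something like this (a sketch based on the standard NER training examples; the file names and feature flags are illustrative, not the asker's actual configuration):
trainFile = train.tsv
serializeTo = ner-model-pt.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
wordShape = chris2useLC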

@arop, the problem is that the system requires more heap memory.
The 30gb you set is not RAM as such; it is the JVM heap size, the memory Stanford CoreNLP can use for its temporary working data.
Increase its size further (e.g. toward 100gb) and see, within what your hard disk size allows.
If the server shuts down when you try to increase the heap size, install the JDK again.
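If you want to confirm that the apparent freeze is really heap pressure (the JVM spending nearly all its time in garbage collection) rather than a true hang, you can watch GC activity while training runs (a minimal sketch; replace <pid> with the process id of the training JVM):
jstat -gcutil <pid> 5000
If the old-generation column (O) stays near 100% and the full-GC count (FGC) keeps climbing between samples, the 30g heap really is exhausted.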

Related

Nifi memory continues to expand

I run a three-node NiFi cluster, version 1.16.3. Each node has 8 cores, 32 GB of memory, and a 2 TB high-speed solid-state drive. The OS is CentOS 7.9 on ARM64 hardware.
NiFi's initial heap configuration is -Xms12g and -Xmx12g (bootstrap.conf).
It is a native installation; Docker is not used, only NiFi is installed on all those machines, and it uses the embedded ZooKeeper.
We run 20 workflows every day from 00:00 to 03:00, with a total data size of 1.2 GB, collecting CSV documents into a Greenplum database.
My problem is that NiFi's memory usage increases every day, about 0.2 GB per day, on all three nodes. The memory slowly fills up and then the machine dies. This takes about a month (when the heap is set to 12 GB).
That is to say, I need to restart the cluster every month. I use only native processors and workflows.
I can't locate the problem. Can anyone help?
If I have left out any details, please feel free to ask. Thanks.
I have made the following attempts:
I set the heap to 18 GB and to 6 GB; the speed of workflow processing did not change. The difference is that, after setting it to 18 GB, the freeze is shorter.
I used OpenJRE 1.8 and tried upgrading to 11, but it was useless.
I added the following configuration, which was also useless:
java.arg.7=-XX:ReservedCodeCacheSize=256m
java.arg.8=-XX:CodeCacheMinimumFreeSpace=10m
java.arg.9=-XX:+UseCodeCacheFlushing
The daily scheduled tasks consume few resources: even with the heap set to 6 GB and 20 tasks running at the same time, memory consumption is about 30%, and the run completes in about half an hour.
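One low-effort way to gather hard data on where the memory goes (a sketch under the assumption that the growth shows up in the Java heap; the java.arg indexes and the dump path are placeholders, pick unused indexes in your bootstrap.conf):
java.arg.10=-XX:+HeapDumpOnOutOfMemoryError
java.arg.11=-XX:HeapDumpPath=/var/log/nifi/heap.hprof
You can also snapshot the live object histogram on different days and diff the top entries:
jmap -histo:live <nifi-pid> | head -n 30
If the histogram stays flat while the process RSS keeps growing, the leak is off-heap (native memory) rather than in the Java heap, which changes where you look next.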

Janusgraph(GremlinServer) Import improve performance

I'm trying to import 1 GB of graph data (~100k vertices, 3.6 million edges) in Gryo format. I tried to import it through the Gremlin client and I'm getting the following error:
gremlin> graph.io(IoCore.gryo()).readGraph('janusgraph_dump_2020_09_30_local.gryo')
GC overhead limit exceeded
Type ':help' or ':h' for help. Display stack trace? [yN]y
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.<init>(NonBlockingHashMapLong.java:471)
    at org.cliffc.high_scale_lib.NonBlockingHashMapLong.initialize(NonBlockingHashMapLong.java:241)
Gremlin-Server, Cassandra details as follows:
Gremlin-Server:
Janusgraph Version: 0.5.2
Gremlin Version: 3.4.6
Heap: JAVA_OPTIONS="-Xms4G -Xmx4G …
// gremlin conf
threadPoolWorker: 8
gremlinPool: 16
scriptEvaluationTimeout: 90000
// cql props
query.batch=true
Cassandra is in a cluster with 3 nodes
Cassandra version: 3.11.0
Node1: RAM: 8GB, Cassandra Heap: 1GB (-Xms1G -Xmx1G)
Node2: RAM: 8GB, Cassandra Heap: 4GB (-Xms4G -Xmx4G)
Node3: RAM: 8GB, Cassandra Heap: 4GB (-Xms4G -Xmx4G)
Gremlin-Server is installed on each node (with a load balancer for clients), but we are executing Gremlin queries on Node1.
Can someone help me with the following:
What do I need to do to import (any configuration changes)?
>>> What is the best way to export/import huge data into JanusGraph (Gremlin-Server)? (I need an answer for this)
Is there any way I can export the data in chunks and import it in chunks?
Thanks in advance.
Edit:
I've increased Node1's Gremlin-Server heap to 2 GB; the import query response was cancelled. Perhaps the RAM allocation is not sufficient for both Gremlin and Cassandra together, which is why I had kept Cassandra's heap at 1 GB on Node1, so that the query could be executed.
Considering huge data (billions of vertices/edges), this is very little; I hope 8 GB of RAM and 2-4 cores would be sufficient for each node in the cluster.
Graph.io() and the now-preferred Gremlin step io() use the GryoReader to read your file (unless the graph provider overrides the latter Gremlin io() step, and I don't think that JanusGraph does). So, if you use GryoReader, you typically end up needing a lot of memory (more than you would expect), because it holds a cache of all vertices to speed up loading. Ultimately, it is not terribly efficient at loading, and the expectation from TinkerPop's perspective has been that providers would optimize loading with their own native bulk loader by intercepting the io() step when encountered.
In the absence of that optimization, the general recommendation is to use the bulk-loading tools of the graph you are using directly. For JanusGraph that likely means parallelizing the load yourself as part of a script, or using a Gremlin OLAP method of loading. Some recommendations can be found in the JanusGraph documentation as well as in these blog posts:
https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-ace7d146af05
https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-part-2-ca946db26582
You can also consider a custom VertexProgram for bulk loading. TinkerPop has CloneVertexProgram, the more general successor to BulkLoaderVertexProgram (now deprecated/removed in recent versions), which had some popularity with JanusGraph as its generalized bulk-loading tool before TinkerPop moved away from trying to supply that sort of functionality.
At your scale of a few million edges, I probably would have written a small Groovy script to run in the Gremlin Console and load the data directly into the graph, avoiding an intermediate format like Gryo entirely. It would probably go much faster and would save you from having to dig too far into bulk-loading tactics for JanusGraph. If you go that route, the link to the JanusGraph documentation I supplied above should be of most help to you. You can save worrying about OLAP, Spark, and other options until you have hundreds of millions of edges (or more) to load.
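As a concrete illustration of that direct-load approach, here is a rough sketch (written in Java; a Gremlin Console Groovy version would be nearly identical). The CSV layout, property key, file names, and batch size are invented for the example, not taken from the question:
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraphFactory;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster config file; point this at your CQL-backed JanusGraph.
        Graph graph = JanusGraphFactory.open("janusgraph-cql.properties");
        Map<String, Vertex> cache = new HashMap<>(); // id -> vertex, avoids repeated lookups
        long count = 0;
        // Assumed file format: fromId,toId,edgeLabel per line.
        for (String line : Files.readAllLines(Paths.get("edges.csv"))) {
            String[] f = line.split(",");
            Vertex from = cache.computeIfAbsent(f[0], id -> graph.addVertex("userId", id));
            Vertex to = cache.computeIfAbsent(f[1], id -> graph.addVertex("userId", id));
            from.addEdge(f[2], to);
            if (++count % 10_000 == 0) graph.tx().commit(); // commit in batches
        }
        graph.tx().commit();
        graph.close();
    }
}
At ~100k vertices the in-memory vertex cache is cheap, and batching commits every few thousand edges keeps transactions small without paying per-edge commit overhead.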

h2o autoML network usage

First, I would like to thank the H2O team for a great product and rapid development/iteration.
I was testing H2O AutoML on a 4-machine cluster (40 cores, 256 GB of RAM, gigabit bandwidth).
For a 20 MB dataset I am noticing that the cluster is using a lot of network and hardly touching the CPU. I was wondering if it makes sense for H2O to train one model per machine instead of trying to train every model on the entire cluster.
AutoML trains H2O models in sequence, so this advice applies to H2O models in general, not just AutoML -- if your dataset is small enough, adding machines to your H2O cluster will only slow down the training process.
For a 20MB dataset I am noticing that the cluster is using up a lot of network and hardly touching the CPU.
If you have a 20MB dataset, it's always going to be better to run H2O on a single machine. The overhead of using multiple machines is only worth it when your training frame won't fit into RAM on a single machine.
There is a longer explanation in another Stack Overflow answer I wrote here.
I was wondering if it makes sense for h2o to train 1 model per computer instead of trying to train every model on the entire cluster.
It does make sense for small data, but H2O was designed to scale to big data (with millions or hundreds of millions of rows), so training several models in parallel is not the design pattern that was used. To speed up the training process, you can use a single machine with more cores.
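For instance, a single-node H2O sized for the whole machine could be launched roughly like this (a sketch; the heap size, thread count, and cloud name are placeholders to match your hardware):
java -Xmx200g -jar h2o.jar -nthreads 40 -name h2o-single-node
Since a 20 MB frame fits comfortably in one node's RAM, all training then happens locally with no network shuffling between nodes.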

how can i evaluate my spark application

Hello, I just finished creating my first Spark application; now I have access to a cluster (12 nodes, where each node has two Intel(R) Xeon(R) CPU E5-2650 2.00GHz processors, and each processor has 8 cores). I want to know the criteria that can help me tune my application and observe its performance.
I have already visited the official Spark website; it talks about data serialization, but I couldn't work out what exactly it is or how to specify it.
It also talks about "memory management" and "level of parallelism", but I didn't understand how to control these.
One more thing: I know that the size of the data has an effect, but all the .csv files I have are small. How can I get files of large size (10 GB, 20 GB, 30 GB, 50 GB, 100 GB, 300 GB, 500 GB)?
Please try to explain things well, because cluster computing is new to me.
For tuning your application you need to know a few things:
1) You need to monitor your application: whether your cluster is under-utilized, and how many resources your application actually uses.
Monitoring can be done using various tools, e.g. Ganglia.
From Ganglia you can see CPU, memory, and network usage.
2) Based on those observations about CPU and memory usage, you can get a better idea of what kind of tuning your application needs.
From Spark's point of view:
In spark-defaults.conf
you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and you can even change the garbage-collection algorithm.
Below are a few examples; you can tune these parameters based on your requirements:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
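The same settings can also be supplied per job at submit time instead of globally in spark-defaults.conf (a sketch; the application jar name is a placeholder):
spark-submit --driver-memory 5g --executor-memory 3g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  your-application.jar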
For more details, refer to http://spark.apache.org/docs/latest/tuning.html
Hope this Helps!!

How to setup Apache Spark to use local hard disk when data does not fit in RAM in local mode?

I have a 50 GB dataset, which doesn't fit in the 8 GB of RAM of my work computer, but the machine has a 1 TB local hard disk.
The link below from the official documentation mentions that Spark can use the local hard disk if the data doesn't fit in memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn't fit in RAM, as well as to preserve intermediate output between stages.
For me, computation time is not a priority at all; fitting the data onto a single computer's RAM/hard disk for processing matters more, due to a lack of alternative options.
Note:
I am looking for a solution which doesn't include the below items
Increase the RAM
Sample & reduce data size
Use cloud or cluster computers
My end objective is to use Spark MLLIB to build machine learning models.
I am looking for real-life, practical solutions where people have successfully used Spark to operate on data that doesn't fit in RAM, in standalone/local mode on a single computer. Has someone done this successfully without major limitations?
Questions
SAS has a similar out-of-core processing capability, with which it can use both RAM and the local hard disk for model building, etc. Can Spark be made to work the same way when the data is larger than RAM?
SAS persists the complete dataset to the hard disk in ".sas7bdat" format; can Spark do a similar persist to hard disk?
If this is possible, how do I install and configure Spark for this purpose?
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
You can use various persistence models as per your need. MEMORY_AND_DISK is what will solve your problem. If you want better performance, use MEMORY_AND_DISK_SER, which stores data in serialized form.
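As a minimal sketch of what that looks like in practice (shown in Java; the paths are placeholders, and spark.local.dir is set so spill and shuffle files land on the 1 TB disk):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class OutOfCoreDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("out-of-core-demo")
                .master("local[*]")
                .config("spark.local.dir", "/mnt/bigdisk/spark-tmp") // spill/shuffle space on the big disk
                .getOrCreate();

        // Read the 50 GB CSV; nothing is loaded until an action runs.
        Dataset<Row> df = spark.read().option("header", "true").csv("/mnt/bigdisk/data.csv");

        // Keep partitions that fit in RAM (serialized) and spill the rest to disk.
        df.persist(StorageLevel.MEMORY_AND_DISK_SER());

        System.out.println(df.count()); // the first action materializes the persisted dataset
        spark.stop();
    }
}
MLlib can then train on this persisted dataset without ever needing the whole 50 GB in memory at once, though each pass over the spilled partitions will be disk-bound.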
