CoreNLP runs too slow - stanford-nlp

I intend to use CoreNLP to annotate some Amazon reviews; however, I have been waiting for over 6 hours and no output has been produced.
1. The review file is about 1 MB.
2. The cluster has 12 CPUs and 64 GB of memory.
3. The command is:
java -cp "*" -Xmx64g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,ner,sentiment -outputFormat json -file amazon_apple_comments_4.csv
What has happened? Is it really this slow?

That's waaaay too slow for a 1 MB document. Try running fewer annotators to narrow down which one is taking the most time. The tokenize and ssplit annotators should be extremely fast; pos is a bit slower, but not bad; ner is slower than pos, but in 1 MB of Amazon reviews it shouldn't find many named entities. I've never used sentiment, but I imagine it's nontrivial.
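If it helps, one way to narrow it down is to time the pipeline once per cumulative annotator set. This is only a sketch in Python: the loop, the smaller heap size, and the timing are my own additions, while the class name, flags, and file name are taken from the question.
import subprocess, time

# Time each cumulative annotator set separately to see which one dominates.
for annotators in ["tokenize,ssplit",
                   "tokenize,ssplit,pos",
                   "tokenize,ssplit,pos,ner",
                   "tokenize,ssplit,pos,ner,sentiment"]:
    start = time.time()
    subprocess.run(["java", "-cp", "*", "-Xmx16g",
                    "edu.stanford.nlp.pipeline.StanfordCoreNLP",
                    "-annotators", annotators,
                    "-outputFormat", "json",
                    "-file", "amazon_apple_comments_4.csv"],
                   check=True)
    print(annotators, "->", round(time.time() - start, 1), "seconds")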

Stanford CoreNLP NER training freezes

I'm trying to train an NER model for Portuguese. I succeeded when training with 10 entity classes; however, with the same training dataset, increasing the number of entity classes to around 30 makes training freeze after some iterations.
I even increased the RAM up to 30 GB, but no luck. I used version 3.7.0 of Stanford CoreNLP and ran the following command (using the default prop configuration):
java -d64 -Xmx30g -cp stanford-corenlp.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop "prop.prop"
Any idea on how to get it working?
@arop, the problem is that the system needs some more heap memory.
The 30 GB you set is not RAM as such; it is the JVM heap size, i.e., the memory that Stanford CoreNLP can use for its temporary data.
Increase it further, towards 100 GB, depending on your disk size, and see.
If the server shuts down when you try to increase the heap size, install the JDK again.

Spark taking 2 seconds to count to 10 ...?

We're just trialling Spark, and it's proving really slow. To show what I mean, I've given an example below - it's taking Spark nearly 2 seconds to load in a text file with ten rows from HDFS, and count the number of lines. My questions:
Is this expected? How long does it take on your platform?
Any ideas why? Currently I'm using Spark 1.3 on a two-node Hadoop cluster (8 cores and 64 GB RAM each). I'm pretty green when it comes to Hadoop and Spark, so I've done little configuration beyond the Ambari/HDP defaults.
Initially I was testing on a hundred million rows - Spark was taking about 10 minutes to simply count it.
Example:
Create text file of 10 numbers, and load it into hadoop:
for i in {1..10}; do echo $i >> numbers.txt; done
hadoop fs -put numbers.txt numbers.txt
Start pyspark (which takes about 20 seconds ...):
pyspark --master yarn-client --executor-memory 4G --executor-cores 1 --driver-memory 4G --conf spark.python.worker.memory=4G
Load the file from HDFS and count it:
sc.textFile('numbers.txt').count()
Based on the timings reported, it takes Spark around 1.6 seconds to do that. Even with terrible configuration, I wouldn't expect it to take that long.
This is definitely too slow (0.3 seconds on my local machine), even for a bad Spark configuration (moreover, the default Spark configuration usually suits most normal uses). Maybe you should double-check your HDFS configuration or network-related configuration.
It has nothing to do with cluster configuration. It is due to lazy evaluation.
There are two types of operations in Spark: transformations and actions.
Have a look at the Spark RDD programming guide; the relevant passage is quoted below.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
sc.textFile('numbers.txt').count() is an action because of the count() call.
This is also why it took about 2 seconds the first time you ran it but only a fraction of a second the second time.
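To see the laziness directly in the pyspark shell, here is a minimal sketch using the file from the question (sc is the SparkContext that the shell provides):
rdd = sc.textFile('numbers.txt').map(lambda line: int(line))  # transformations only: nothing is read yet
rdd.count()  # count() is an action: the file is read and the job actually runs here
rdd.count()  # a second run is typically faster, since executors and code paths are already warm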

Mapreduce Vs Spark Vs Storm Vs Drill - For Small files

I know Spark does in-memory computation and is much faster than MapReduce.
I was wondering how well Spark works for, say, fewer than 10,000 records per file.
I have a huge number of files (each with around 10,000 records and roughly 100 columns) coming into my Hadoop data platform, and I need to perform some data quality checks before I load them into HBase.
I do the data quality checks in Hive, which uses MapReduce at the back end. For each file it takes about 8 minutes, and that's pretty bad for me.
Will Spark give me better performance, let's say 2-3 minutes?
I know I have to do benchmarking, but I was trying to understand the basics before I really get going with Spark.
As I recall, creating RDDs for the first time carries some overhead, and since I have to create a new RDD for each incoming file, that is going to cost me a bit.
I am confused about which would be the best approach for me: Spark, Drill, Storm, or MapReduce itself?
I have just been exploring the performance of Drill vs Spark vs Hive over a few million records. Drill and Spark are both around 5-10 times faster in my case (I did not run the test on a cluster with significant RAM; I only tested on a single node). The reason for the fast computation is that both of them perform in-memory computation.
The performance of Drill and Spark is almost comparable in my case, so I can't say which one is better. You need to try this at your end.
Testing on Drill will not take much time: download the latest Drill, install it on your MapR Hadoop cluster, add the Hive storage plugin, and run the query.
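For a rough sense of what such a per-file check looks like in Spark, here is a minimal pyspark-shell sketch; the file path, column count, and rules are hypothetical:
lines = sc.textFile("incoming/file_0001.csv")  # hypothetical path, ~10,000 records
header = lines.first()
rows = lines.filter(lambda l: l != header).map(lambda l: l.split(","))
total = rows.count()
bad = rows.filter(lambda r: len(r) != 100 or r[0].strip() == "").count()  # e.g. wrong column count or empty key
print(total, bad)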

Hadoop file size Clarification

I need some clarification about using Hadoop with a very large number of files, around 2 million. I have a data file consisting of 2 million lines, and I want to split each line into its own file, copy them into the Hadoop File System, and compute term frequency using Mahout, which uses MapReduce in a distributed fashion. So if I have a file of 2 million lines and I want to treat each line as a document for the term-frequency calculation, I will end up with one directory containing 2 million documents, each consisting of a single line. Will this create one map task per file, i.e., 2 million map tasks for the process? That would take a lot of time to compute. Is there any alternative way of representing the documents for faster computation?
2 million files is a lot for Hadoop. More than that, running 2 million tasks will carry roughly 2 million seconds of overhead (assuming about a second of task-startup overhead per file), which means a few days of work for a small cluster.
I think the problem is algorithmic in nature: how to map your computation onto the MapReduce paradigm so that you end up with a modest number of mappers. Please drop a few lines about the task you need, and I might suggest an algorithm.
Mahout has an implementation for calculating TF and IDF for text.
Check the Mahout library for it.
Also, splitting each line into its own file is not a good idea in the Hadoop MapReduce framework.
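As an illustration of the alternative, namely keeping all 2 million lines in one file and treating each line as a document, here is a minimal sketch that computes per-document term frequencies in a single job. It uses PySpark rather than Mahout purely as an illustration, and the paths and app name are hypothetical:
from pyspark import SparkContext

sc = SparkContext(appName="per-line-tf")            # hypothetical app name
lines = sc.textFile("hdfs:///data/corpus.txt")      # one document per line; hypothetical path
tf = (lines.zipWithIndex()                          # (line, doc_id)
           .flatMap(lambda x: [((x[1], w), 1) for w in x[0].split()])
           .reduceByKey(lambda a, b: a + b))        # ((doc_id, term), frequency)
tf.saveAsTextFile("hdfs:///data/term_frequencies")  # hypothetical output path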

Parsing large XML to TSV

I need to parse a few XMLs to TSV. The XML files are on the order of 50 GB each, and I am mainly unsure which implementation I should choose for parsing them. I have two options:
using a SAXParser
using Hadoop
I have a fair bit of an idea about a SAXParser implementation, but I think that, having access to a Hadoop cluster, I should use Hadoop, as this is what Hadoop is for, i.e., big data.
It would be great if someone could provide a hint or doc on how to do this in Hadoop, or an efficient SAXParser implementation for such a big file, or simply advice on which I should go for: Hadoop or SAXParser?
I process large XML files in Hadoop quite regularly. I found it to be the best way (not the only way... the other is to write SAX code) since you can still operate on the records in a dom-like fashion.
With these large files, one thing to keep in mind is that you'll most definitely want to enable compression on the mapper output (see "Hadoop, how to compress mapper output but not the reducer output"); this will speed things up quite a bit.
I've written a quick outline of how I've handled all this, maybe it'll help: http://davidvhill.com/article/processing-xml-with-hadoop-streaming. I use Python and Etrees which makes things really simple....
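For what it's worth, the mapper in such a streaming job can stay very small. Here is a minimal sketch, assuming the input format (like the record readers discussed in the articles above) hands each complete record element to the mapper on its own line; the field names are hypothetical:
#!/usr/bin/env python
import sys
import xml.etree.ElementTree as ET

# Hadoop Streaming mapper: one XML record per input line in, one TSV row out.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = ET.fromstring(line)                      # parse the single record
    fields = [record.findtext("id", ""),              # hypothetical field names
              record.findtext("title", ""),
              record.findtext("price", "")]
    print("\t".join(fields))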
I don't know about SAXParser, but Hadoop will definitely do your job if you have a Hadoop cluster with enough data nodes. 50 GB is nothing, as I was performing operations on more than 300 GB of data on my cluster. Write a MapReduce job in Java; the documentation for Hadoop can be found at http://hadoop.apache.org/
It is relatively trivial to process XML on Hadoop by having one mapper per XML file. This approach is fine for a large number of relatively small XMLs.
The problem is that in your case the files are big and their number is small, so without splitting, Hadoop's benefit will be limited. Taking Hadoop's overhead into account, the benefit may even be negative...
In Hadoop we need to be able to split input files into logical parts (called splits) in order to process large files efficiently.
In general, XML does not look like a "splittable" format, since there is no well-defined division into blocks that can be processed independently. At the same time, if the XML contains "records" of some kind, splitting can be implemented.
A good discussion about splitting XMLs in Hadoop is here:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
where Mahout's XML input format is suggested.
Regarding your case: I think that as long as the number of files is not much bigger than the number of cores you have on a single system, Hadoop will not be an efficient solution.
At the same time, if you want to accumulate files over time, you can also profit from Hadoop as scalable storage.
I think that SAX has traditionally been mistakenly associated with processing big XML files... in reality, VTD-XML is often the best option, far better than SAX in terms of performance, flexibility, code readability and maintainability... on the issue of memory, VTD-XML's in-memory model is only 1.3x~1.5X the size of the corresponding XML document.
VTD-XML has another significant benefit over SAX: its unparalleled XPath support. Because of it, VTD-XML users routinely report performance gains of 10 to 60x over SAX parsing on XML files of hundreds of MB.
http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java#anch104307
Read this paper that comprehensively compares the existing XML parsing frameworks in Java.
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
