In SPL TEDA 4.2, is there a limit on the number of input file types that can be included? - ibm-streams

I am working with TEDA v2.0.1 and SPL v4.2. When I try to add more than 18 different input file types, the job compiles successfully, but at runtime it goes to status 'no' without any error in the logs.
I have faced this issue while developing multiple applications.

Are you seeing PE trace files with a size of 0 bytes?
Is the status "no" the job status?
How much memory does ZooKeeper have? As jobs get bigger, ZooKeeper can become the bottleneck; increasing the ZooKeeper heap size helps in that case.

Related

How to fix size limit error when performing actions on hive table in pyspark

I have a hive table with 4 billion rows that I need to load into pyspark. When I try to perform any action, such as a count, against that table, I get the following exception (followed by TaskKilled exceptions):
Py4JJavaError: An error occurred while calling o89.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6732 in stage 13.0 failed 4 times, most recent failure: Lost task 6732.3 in stage 13.0 (TID 30759, some_server.XX.net, executor 38): org.apache.hive.com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
My version of HBase is 1.1.2.2.6.1.0-129 and I am unable to upgrade at this time.
Is there some way I can get around this issue without upgrading, maybe by modifying an environment variable or config somewhere, or by passing an argument to pyspark via the command line?
I would say no.
Based on the following JIRAs, increasing the protobuf size limit seems to require a code change: all of them were resolved with code patches that use CodedInputStream, as the exception suggests (see the sketch after the list).
HDFS-6102 Lower the default maximum items per directory to fix PB fsimage loading
HDFS-10312 Large block reports may fail to decode at NameNode due to 64 MB protobuf maximum length restriction.
HBASE-14076 ResultSerialization and MutationSerialization can throw InvalidProtocolBufferException when serializing a cell larger than 64MB
HIVE-11592 ORC metadata section can sometimes exceed protobuf message size limit
SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
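For reference, this is roughly the kind of change those patches make inside the reader code itself, which is why no environment variable or pyspark command-line argument reaches it. The stream, the chosen limit, and the class name below are illustrative assumptions, not code from any of the JIRAs.

```java
import com.google.protobuf.CodedInputStream;

import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class ProtobufLimitSketch {

    // Hypothetical sketch: the patched readers raise the limit at the point where
    // the message is decoded, so the cap cannot be changed from the outside.
    static CodedInputStream openWithLargerLimit(InputStream in) {
        CodedInputStream cis = CodedInputStream.newInstance(in);
        cis.setSizeLimit(1024 << 20); // lift the historical 64 MB default to 1 GB
        return cis;
    }

    public static void main(String[] args) {
        // Stand-in for the real ORC-metadata / fsimage stream the readers consume.
        InputStream dummy = new ByteArrayInputStream(new byte[0]);
        CodedInputStream cis = openWithLargerLimit(dummy);
        System.out.println("total bytes read so far: " + cis.getTotalBytesRead());
    }
}
```

In other words, the limit lives in compiled reader code rather than in a configuration file, which is why the answer above is "no".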

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please help with the following questions to get started:
If the file I am trying to copy into HDFS is very large and cannot be accommodated by the commodity hardware available in my Hadoop cluster, what can be done? Will the copy wait until space frees up, or will there be an error?
How can I predict well in advance that the above scenario will occur in a production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but which files do I need to alter?
How many blocks does a node hold? Suppose a node is a machine with 500 GB of disk, 1 GB of RAM and a dual-core processor. In this scenario is it 500 GB / 64 MB, assuming that each block is configured to be 64 MB?
If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node? How can I find this out? (See the sketch after these questions.)
How can I find out which record/row of the input file ends up in which of the multiple split files Hadoop creates?
What is the purpose of each of the configured XML files (core-site.xml, hdfs-site.xml & mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How do I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking such basic questions. Kindly suggest ways to find answers to all of the above.
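On the block-placement and capacity questions above, a minimal sketch of how both can be inspected through the HDFS client API is shown below; the file path and the printed fields are illustrative assumptions, not details from the original question.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;
import org.apache.hadoop.fs.Path;

public class HdfsInspectSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 1) Capacity check: how much space is left before a large copy would fail.
        FsStatus status = fs.getStatus();
        System.out.printf("capacity=%d used=%d remaining=%d%n",
                status.getCapacity(), status.getUsed(), status.getRemaining());

        // 2) Block placement: which block of a file lives on which DataNode.
        Path file = new Path("/data/bigfile"); // hypothetical path
        FileStatus stat = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
    }
}
```

As for the block-count arithmetic: with a 64 MB block size, a 500 GB disk can store at most about 500 GB / 64 MB = 8000 block replicas, and a 1 TB file is split into roughly 16,000 blocks before replication.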

Hadoop EMR job runs out of memory before RecordReader initialized

I'm trying to figure out what could be causing my EMR job to run out of memory before it has even started processing my file inputs. I'm getting a
"java.lang.OutOfMemoryError cannot be cast to java.lang.Exception" error before my RecordReader is even initialized (i.e., before it has even tried to unzip the files and process them). I am running my job on a directory with a large number of inputs. I am able to run my job just fine on a smaller input set. Does anyone have any ideas?
I realized that the answer is that there was too much metadata overhead on the master node. The master node must store ~150 KB of data for each file that will be processed; at that rate even 100,000 files amount to roughly 15 GB of bookkeeping, so with millions of files this was far too much and caused the master node to crash.
Here's a good source for more information: http://www.inquidia.com/news-and-info/working-small-files-hadoop-part-1#sthash.YOtxmQvh.dpuf

Hadoop Error: Java heap space

So, after the job has been running for a percent or so, I get an error that says "Error: Java heap space" and then something along the lines of "Application container killed".
I am literally running an empty map and reduce job. However, the job does take an input of roughly 100 GB, and for whatever reason I run out of heap space even though the job does nothing.
I am using the default configuration on a single machine running Hadoop version 2.2 on Ubuntu. The machine has 4 GB of RAM.
Thanks!
// Note
Got it figured out.
It turns out I had set the configuration to use a different terminating token/string. The format of the data had changed, so that token/string no longer existed, and the job ended up trying to load all 100 GB into RAM as the value for one key (see the sketch below).
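For illustration, one common way such a terminating token is configured is Hadoop's textinputformat.record.delimiter property; the poster does not say which setting they used, so the property and the ##END## delimiter below are assumptions showing how a never-matching delimiter turns the entire input into a single record.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DelimiterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical custom record delimiter. If this token never occurs in the
        // 100 GB input, the record reader keeps buffering until it has the whole
        // input as one record, which exhausts the heap exactly as described.
        conf.set("textinputformat.record.delimiter", "##END##");

        Job job = Job.getInstance(conf, "empty-map-reduce");
        job.setInputFormatClass(TextInputFormat.class);
        // Mapper/reducer setup and job submission omitted; the delimiter alone
        // is enough to reproduce the "one giant key" behaviour.
    }
}
```

With the default newline delimiter, or any token that actually occurs in the data, each record stays small and even a 100 GB input is processed with roughly constant memory per record.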

Page error 0xc0000006 with VC++

I have a VS 2005 application written in C++. It basically imports a large XML file of around 9 GB into the application. After running for more than 18 hours it raised exception 0xc0000006, an in-page error. The virtual memory consumed was 2.6 GB (I have set the 3GB flag).
Does anyone have a clue as to what caused this error and what the solution could be?
Instead of loading the whole file into memory, you can use a SAX parser to load only part of the file into memory at a time.
9 GB seems overly large to read in. I would say that even 3 GB is too large in one go.
Is your OS 64bit?
What is the maximum pagefile size set to?
How much RAM do you have?
Were you running this in debug or release mode?
I would suggest that you try reading the XML in smaller chunks.
Why are you trying to read in such a large file in one go?
I would imagine that your application took so long to run before failing because it started to copy the file into virtual memory, which is basically a large file on the hard disk. So the OS is reading the XML from disk and writing it back onto a different area of the disk.
**Edit - added text below**
Having had a quick peek at the Expat XML parser, it does look as if you're running into problems with stack or event handling; most likely you are putting too much on the stack.
Do you really need 3 GB of data on the stack? At a guess I would say that you are trying to process an XML database file, but I can't imagine that you have a table row that large.
I think you should really use the parser to search for the key areas and discard what is not wanted.
I know nothing about the Expat XML parser beyond what I have just read, but I would suggest that you are not using it in the most efficient manner.

Resources