PIG UDF throwing error - hadoop

I am getting an error in a Pig script.
PIG SCRIPT :
REGISTER /var/lib/hadoop-hdfs/udf.jar;
REGISTER /var/lib/hadoop-hdfs/udf2.jar;
INPUT_LINES = Load 'hdfs:/Inputdata/DATA_GOV_US_Farmers_Market_DataSet.csv' using PigStorage(',') AS (FMID:chararray, MarketName:chararray, Website:chararray, Street:chararray, City:chararray, County:chararray, State:chararray, Zip:chararray, Schedule:chararray, X:chararray, Y:chararray, Location:chararray, Credit:chararray, WIC:chararray, WICcash:chararray, SFMNP:chararray, SNAP:chararray, Bakedgoods:chararray, Cheese:chararray, Crafts:chararray, Flowers:chararray, Eggs:chararray, Seafood:chararray, Herbs:chararray, Vegetables:chararray, Honey:chararray, Jams:chararray, Maple:chararray, Meat:chararray, Nursery:chararray, Nuts:chararray, Plants:chararray, Poultry:chararray, Prepared:chararray, Soap:chararray, Trees:chararray, Wine:chararray);
FILTERED_COUNTY = FILTER INPUT_LINES BY County=='Los Angeles';
REQUIRED_COLUMNS = FOREACH FILTERED_COUNTY GENERATE FMID,MarketName,$12..;
PER = FOREACH REQUIRED_COLUMNS GENERATE FMID,MarketName,fm($2..) AS Percentage;
STATUS = FOREACH PER GENERATE FMID,MarketName,Percentage,status(Percentage) AS Stat;
UDF1 :
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class fm extends EvalFunc<Integer>
{
    String temp;
    int per;
    int count = 0;

    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return -1;
        try
        {
            for (int i = 0; i < 25; i++)
            {
                if (input.get(i) == "" || input.get(i) == null)
                    return -1;
                temp = (String) input.get(i);
                if (temp.equals("Y"))
                    count++;
            }
            per = count * 4;
            count = 0;
            return per;
        }
        catch (Exception e)
        {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
UDF2 :
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class status extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException
    {
        if (input == null || input.size() == 0)
            return null;
        try
        {
            String str = (String) input.get(0);
            int i = Integer.parseInt(str);
            if (i >= 60)
                return "HIGH";
            else if (i <= 40)
                return "LOW";
            else
                return "MEDIUM";
        }
        catch (Exception e)
        {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
Dataset :
https://onedrive.live.com/redir?resid=7F81451078F4DBE8%21113
ERROR :
Pig Stack Trace
ERROR 2078: Caught error from UDF: status [Caught exception processing input row ]
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias STATUS. Backend error : Caught error from UDF: status [Caught exception processing input row ]
at org.apache.pig.PigServer.openIterator(PigServer.java:828)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: status [Caught exception processing input row ]
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:365)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)

It appears that your problem may be that you are casting your input to a String in your status UDF. Your fm UDF actually returns an Integer. So instead you should have:
Integer i = (Integer)input.get(0);
This definitely will cause a problem unless you fix it. Without the original error message I can't say whether or not there is some other problem that occurs earlier.
I would have expected your stack trace to include the original exception message, which would help you debug this issue. Strange that it doesn't. Without it all you have to go off of is analyzing the code.
This might help with debugging in the future:
throw new IOException("Caught exception processing input row " + e.getMessage(), e);
For the fm UDF, I also recommend making the variables temp, per, and count local to the exec method instead of instances of the class, because they don't need to be. This probably won't cause an error but it is better coding practice.
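Putting those suggestions together, a corrected status UDF might look something like this (a sketch that keeps your class name, reads the fm result back as an Integer, and includes the exception message):
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class status extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException
    {
        if (input == null || input.size() == 0)
            return null;
        try
        {
            // fm returns an Integer, so read it back as an Integer instead of casting to String
            Integer i = (Integer) input.get(0);
            if (i == null)
                return null;
            if (i >= 60)
                return "HIGH";
            else if (i <= 40)
                return "LOW";
            else
                return "MEDIUM";
        }
        catch (Exception e)
        {
            // include the original message so the backend stack trace is actually useful
            throw new IOException("Caught exception processing input row " + e.getMessage(), e);
        }
    }
}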

Related

How to create a UDF in Pig to categorize a column with respect to another field

I want to categorize one column with respect to another column using a UDF in Pig.
Data I have:
Id,name,age
1,jhon,31
2,adi,15
3,sam,25
4,lina,28
Expected output
1,jhon,31,30-35
2,adi,15,10-15
3,sam,25,20-25
4,lina,28,25-30
Please suggest
You can do this without a UDF. Assuming you have loaded the data to a relation A.
B = FOREACH A GENERATE Id,name,age,(age%5 == 0 ? age-5 : (age/5)*5) as lower_age,(age%5 == 0 ? age : ((age/5)*5) + 5) as upper_age;
C = FOREACH B GENERATE Id,name,age,CONCAT(CONCAT((chararray)lower_age,'-'),(chararray)upper_age);
DUMP C;
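For example, for age 28: 28 % 5 is not 0, so lower_age = (28 / 5) * 5 = 25 and upper_age = 25 + 5 = 30, which gives the 25-30 bucket from the expected output. For age 15, 15 % 5 == 0, so the bucket is 10-15.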
You can create Pig UDFs in Eclipse.
Create a project in Eclipse with the Pig jars on the build path and try the code below:
package com;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class Age extends EvalFunc<String> {

    @Override
    public String exec(Tuple a) throws IOException {
        if (a == null || a.size() == 0) {
            return null;
        }
        try {
            Object object = a.get(0);
            if (object == null) {
                return null;
            }
            int i = (Integer) object;
            if (i >= 10 && i <= 20) {
                return "10-20";
            } else if (i >= 21 && i <= 30) {
                return "20-30";
            } else {
                return ">30";
            }
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
Now export the project as a jar and register it in the Pig shell:
REGISTER <path of your .jar file>
Define it with its package and class:
DEFINE U com.Age();
a = LOAD '<input path>' using PigStorage(',') as (id:int,name:chararray,age:int);
b = FOREACH a GENERATE id,name,age,U(age);

Protobuf 3.0 Any Type pack/unpack

I would like to know how to transform a Protobuf Any Type to the original Protobuf message type and vice versa. In Java from Message to Any is easy:
Any.Builder anyBuilder = Any.newBuilder().mergeFrom(protoMess.build());
But how can I parse that Any back to the original message (e.g. to the type of "protoMess")? I could probably parse everything on a stream just to read it back in, but that's not what I want. I want to have some transformation like this:
ProtoMess.MessData.Builder protoMessBuilder = (ProtoMess.MessData.Builder) transformToMessageBuilder(anyBuilder)
How can I achieve that? Is it already implemented for Java? The Protobuf Language Guide says there were pack and unpack methods, but there are none in Java.
Thank you in Advance :)
The answer might be a bit late but maybe this still helps someone.
In the current version of Protocol Buffers 3 pack and unpack are available in Java.
In your example packing can be done like:
Any anyMessage = Any.pack(protoMess.build());
And unpacking like:
ProtoMess protoMess = anyMessage.unpack(ProtoMess.class);
Here is also a full example for handling Protocol Buffers messages with nested Any messages:
ProtocolBuffers Files
A simple Protocol Buffers file with a nested Any message could look like:
syntax = "proto3";
import "google/protobuf/any.proto";
message ParentMessage {
string text = 1;
google.protobuf.Any childMessage = 2;
}
A possible nested message could then be:
syntax = "proto3";
message ChildMessage {
string text = 1;
}
Packing
To build the full message the following function can be used:
public ParentMessage createMessage() {
    // Create child message
    ChildMessage.Builder childMessageBuilder = ChildMessage.newBuilder();
    childMessageBuilder.setText("Child Text");
    // Create parent message
    ParentMessage.Builder parentMessageBuilder = ParentMessage.newBuilder();
    parentMessageBuilder.setText("Parent Text");
    parentMessageBuilder.setChildMessage(Any.pack(childMessageBuilder.build()));
    // Return message
    return parentMessageBuilder.build();
}
Unpacking
To read the child message from the parent message the following function can be used:
public ChildMessage readChildMessage(ParentMessage parentMessage) {
    try {
        return parentMessage.getChildMessage().unpack(ChildMessage.class);
    } catch (InvalidProtocolBufferException e) {
        e.printStackTrace();
        return null;
    }
}
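Putting the two helpers together, a minimal usage sketch (using the method and message names above) would be:
ParentMessage parentMessage = createMessage();
ChildMessage childMessage = readChildMessage(parentMessage);
System.out.println(childMessage.getText()); // prints "Child Text"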
EDIT:
If your packed messages can have different Types, you can read out the typeUrl and use reflection to unpack the message. Assuming you have the child messages ChildMessage1 and ChildMessage2 you can do the following:
@SuppressWarnings("unchecked")
public Message readChildMessage(ParentMessage parentMessage) {
    try {
        Any childMessage = parentMessage.getChildMessage();
        String clazzName = childMessage.getTypeUrl().split("/")[1];
        String clazzPackage = String.format("package.%s", clazzName);
        Class<Message> clazz = (Class<Message>) Class.forName(clazzPackage);
        return childMessage.unpack(clazz);
    } catch (ClassNotFoundException | InvalidProtocolBufferException e) {
        e.printStackTrace();
        return null;
    }
}
For further processing, you could determine the type of the message with instanceof, which is not very efficient. If you want to get a message of a certain type, you should compare the typeUrl directly:
public ChildMessage1 readChildMessage(ParentMessage parentMessage) {
    try {
        Any childMessage = parentMessage.getChildMessage();
        String clazzName = childMessage.getTypeUrl().split("/")[1];
        if (clazzName.equals("ChildMessage1")) {
            return childMessage.unpack(ChildMessage1.class);
        }
        return null;
    } catch (InvalidProtocolBufferException e) {
        e.printStackTrace();
        return null;
    }
}
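A slightly shorter variant of the same check, assuming the Any#is(Class) helper is available in your protobuf-java version, avoids comparing the type URL string by hand:
Any childMessage = parentMessage.getChildMessage();
if (childMessage.is(ChildMessage1.class)) {
    return childMessage.unpack(ChildMessage1.class); // still declares InvalidProtocolBufferException
}
return null;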
Just to add information in case someone has the same problem: currently, to unpack you have to do the following (C#, .NET Core 3.1, Google.Protobuf 3.11.4):
Foo myobject = anyMessage.Unpack<Foo>();
I know this question is very old, but it still came up when I was looking for the answer. Using @sundance's answer, I had to do this a little differently. The problem was that the actual message class is a nested class of the generated outer class, so the binary class name requires a $.
for (Any x : in.getDetailsList()) {
    try {
        String clazzName = x.getTypeUrl().split("/")[1];
        String[] split_name = clazzName.split("\\.");
        String nameClass = String.join(".", Arrays.copyOfRange(split_name, 0, split_name.length - 1)) + "$" + split_name[split_name.length - 1];
        Class<Message> clazz = (Class<Message>) Class.forName(nameClass);
        System.out.println(x.unpack(clazz));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
With this being the definition of my proto messages
syntax = "proto3";
package cb_grpc.msg.Main;
service QueryService {
rpc anyService (AnyID) returns (QueryResponse) {}
}
enum Buckets {
main = 0;
txn = 1;
hxn = 2;
}
message QueryResponse{
string content = 1;
string code = 2;
}
message AnyID {
Buckets bucket = 1;
string docID = 2;
repeated google.protobuf.Any details = 3;
}
and
syntax = "proto3";
package org.querc.cb_grpc.msg.database;
option java_package = "org.querc.cb_grpc.msg";
option java_outer_classname = "database";
message TxnLog {
string doc_id = 1;
repeated string changes = 2;
}
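To make the $ handling concrete: for the TxnLog message above, the type URL of a packed message ends in org.querc.cb_grpc.msg.database.TxnLog. Joining everything except the last segment with dots and appending $TxnLog yields org.querc.cb_grpc.msg.database$TxnLog, which is the binary name Class.forName expects here, because java_package (org.querc.cb_grpc.msg) plus java_outer_classname (database) form the generated outer class and TxnLog is a nested class inside it.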

Memory issues when running Spark job on relatively large input

I am running a Spark cluster with 50 machines. Each machine is a VM with 8 cores and 50 GB of memory (41 GB of which seems to be available to Spark).
I am running over several input folders; I estimate the total input size to be ~250 GB, gz compressed.
Although the number and configuration of machines I am using seems sufficient, the job fails after about 40 minutes of running, and I can see the following errors in the logs:
2558733 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 345.0 in stage 1.0 (TID 345, hadoop-w-3.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: Java heap space
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
and also:
2653545 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 122.1 in stage 1.0 (TID 392, hadoop-w-22.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
How do I go about debugging such an issue?
EDIT: I found the root cause of the problem. It is this piece of code:
private static final int MAX_FILE_SIZE = 40194304;
....
....
JavaPairRDD<String, List<String>> typedData = filePaths.mapPartitionsToPair(new PairFlatMapFunction<Iterator<String>, String, List<String>>() {
    @Override
    public Iterable<Tuple2<String, List<String>>> call(Iterator<String> filesIterator) throws Exception {
        List<Tuple2<String, List<String>>> res = new ArrayList<>();
        String fileType = null;
        List<String> linesList = null;
        if (filesIterator != null) {
            while (filesIterator.hasNext()) {
                try {
                    Path file = new Path(filesIterator.next());
                    // filter non-trc files
                    if (!file.getName().startsWith("1")) {
                        continue;
                    }
                    fileType = getType(file.getName());
                    Configuration conf = new Configuration();
                    CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
                    CompressionCodec codec = compressionCodecs.getCodec(file);
                    FileSystem fs = file.getFileSystem(conf);
                    ContentSummary contentSummary = fs.getContentSummary(file);
                    long fileSize = contentSummary.getLength();
                    InputStream in = fs.open(file);
                    if (codec != null) {
                        in = codec.createInputStream(in);
                    } else {
                        throw new IOException();
                    }
                    byte[] buffer = new byte[MAX_FILE_SIZE];
                    BufferedInputStream bis = new BufferedInputStream(in, BUFFER_SIZE);
                    int count = 0;
                    int bytesRead = 0;
                    try {
                        while ((bytesRead = bis.read(buffer, count, BUFFER_SIZE)) != -1) {
                            count += bytesRead;
                        }
                    } catch (Exception e) {
                        log.error("Error reading file: " + file.getName() + ", trying to read " + BUFFER_SIZE + " bytes at offset: " + count);
                        throw e;
                    }
                    Iterable<String> lines = Splitter.on("\n").split(new String(buffer, "UTF-8").trim());
                    linesList = Lists.newArrayList(lines);
                    // get rid of first line in file
                    Iterator<String> it = linesList.iterator();
                    if (it.hasNext()) {
                        it.next();
                        it.remove();
                    }
                    //res.add(new Tuple2<>(fileType, linesList));
                } finally {
                    res.add(new Tuple2<>(fileType, linesList));
                }
            }
        }
        return res;
    }
In particular, it allocates a 40 MB buffer for each file in order to read the file's content through a BufferedInputStream. This eventually exhausts the heap.
The thing is:
If I read line by line (which does not require such a buffer), the read will be very inefficient.
If I allocate one buffer and reuse it for each file read - is that even possible given the parallelism, or will it get overwritten by several threads?
Any suggestions are welcome...
EDIT 2: I fixed the first memory issue by moving the byte array allocation outside the iterator, so it gets reused for all of the partition's elements. But there is still the new String(buffer, "UTF-8").trim(), which is created for the split - that object is also created every time. I could use a StringBuffer/StringBuilder, but then how would I set the charset encoding without a String object?
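One way around that, and essentially what the final version below ends up doing, is to skip the big byte[]-to-String conversion entirely and decode while reading, by wrapping the input stream in a reader with an explicit charset:
// decode on the fly instead of materializing one large String
BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);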
Eventually I changed the code as follows:
// Transform list of files to list of all files' content in lines grouped by type
JavaPairRDD<String, List<String>> typedData = filePaths.mapToPair(new PairFunction<String, String, List<String>>() {
    @Override
    public Tuple2<String, List<String>> call(String filePath) throws Exception {
        Tuple2<String, List<String>> tuple = null;
        try {
            String fileType = null;
            List<String> linesList = new ArrayList<String>();
            Configuration conf = new Configuration();
            CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
            Path path = new Path(filePath);
            fileType = getType(path.getName());
            tuple = new Tuple2<String, List<String>>(fileType, linesList);
            // filter non-trc files
            if (!path.getName().startsWith("1")) {
                return tuple;
            }
            CompressionCodec codec = compressionCodecs.getCodec(path);
            FileSystem fs = path.getFileSystem(conf);
            InputStream in = fs.open(path);
            if (codec != null) {
                in = codec.createInputStream(in);
            } else {
                throw new IOException();
            }
            BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);
            // Get rid of the first line in the file
            r.readLine();
            // Read all lines
            String line;
            while ((line = r.readLine()) != null) {
                linesList.add(line);
            }
        } catch (IOException e) { // Filtering of files whose reading went wrong
            log.error("Reading of the file " + filePath + " went wrong: " + e.getMessage());
        } finally {
            return tuple;
        }
    }
});
So now I do not use a 40 MB buffer but rather build the lines list dynamically using an ArrayList. This solved my current memory issue, but now I am getting other strange errors that fail the job. I will report those in a different question...

hbase InternalScanner and filter in coprocessor

All:
Recently I wrote a coprocessor in HBase (0.94.17): a class that extends BaseEndpointCoprocessor, with a rowCount method to count one table's rows.
And I ran into a problem.
If I do not set a filter on the scan, my code works fine for two tables. One table has 1,000,000 rows, the other has 160,000,000 rows; it takes about 2 minutes to count the bigger table.
However, if I set a filter on the scan, it only works on the small table. On the bigger table it throws an exception:
org.apache.hadoop.hbase.ipc.ExecRPCInvoker$1@2c88652b, java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Trust me, I have checked my code over and over again.
So, to count my table with a filter, I had to write the following clumsy code: first, I do not set a filter on the scan, and then, after I get one row record, I call my own method to filter it.
And that works on both tables.
But I do not know why.
I tried to read the scanner source code in HRegion.java, but I did not get it.
So, if you know the answer, please help me. Thank you.
@Override
public long rowCount(Configuration conf) throws IOException {
    Scan scan = new Scan();
    parseConfiguration(conf);
    Filter filter = null;
    if (this.mFilterString != null && !mFilterString.equals("")) {
        ParseFilter parse = new ParseFilter();
        filter = parse.parseFilterString(mFilterString);
        // scan.setFilter(filter);
    }
    scan.setCaching(this.mScanCaching);
    InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment()).getRegion().getScanner(scan);
    long sum = 0;
    try {
        List<KeyValue> curVals = new ArrayList<KeyValue>();
        boolean hasMore = false;
        do {
            curVals.clear();
            hasMore = scanner.next(curVals);
            if (filter != null) {
                filter.reset();
                if (HbaseUtil.filterOneResult(curVals, filter)) {
                    continue;
                }
            }
            sum++;
        } while (hasMore);
    } finally {
        scanner.close();
    }
    return sum;
}
The following is my hbase util code:
public static boolean filterOneResult(List<KeyValue> kvList, Filter filter) {
    if (kvList.size() == 0)
        return true;
    KeyValue kv = kvList.get(0);
    if (filter.filterRowKey(kv.getBuffer(), kv.getRowOffset(), kv.getRowLength())) {
        return true;
    }
    for (KeyValue kv2 : kvList) {
        if (filter.filterKeyValue(kv2) == Filter.ReturnCode.NEXT_ROW) {
            return true;
        }
    }
    filter.filterRow(kvList);
    if (filter.filterRow())
        return true;
    else
        return false;
}
OK, it was my mistake. After I used jdb to debug my code, I got the following exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
It is obvious: my result list is empty.
hasMore = scanner.next(curVals);
It means that if I use a Filter in the scan, the curVals list might be empty, but hasMore is still true.
I had thought that if a record was filtered, the scanner would jump to the next row and this list would never be empty. I was wrong.
And my client did not print any remote error message on my console; it just caught this remote exception and retried.
After retrying 10 times, it printed another exception, which was meaningless.
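Given that, the counting loop only needs to skip the empty batches when the filter is set directly on the scan; a sketch against the rowCount code above:
do {
    curVals.clear();
    hasMore = scanner.next(curVals);
    // with a filter on the scan, next() can return an empty batch while hasMore is still true
    if (!curVals.isEmpty()) {
        sum++;
    }
} while (hasMore);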

Problems during counting strings in the txt file

I am developing a program which reads a text file and creates a report. For every line in the file, the report contains the line number, its "status", and the first few characters of the line. It works well with files up to 100 MB.
But when I run the program with input files that are bigger than 1.5 GB and contain more than 100,000 lines, I get the following error:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Unknown Source) at
> java.lang.String.<init>(Unknown Source) at
> java.lang.StringBuffer.toString(Unknown Source) at
> java.io.BufferedReader.readLine(Unknown Source) at
> java.io.BufferedReader.readLine(Unknown Source) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:771) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:723) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:745) at
> org.apache.commons.io.FileUtils.readLines(FileUtils.java:1512) at
> org.apache.commons.io.FileUtils.readLines(FileUtils.java:1528) at
> org.apache.commons.io.ReadFileToListSample.main(ReadFileToListSample.java:43)
I increased the VM arguments to -Xms128m -Xmx1600m (in the Eclipse run configuration), but this did not help. Specialists from the OTN forum advised me to read some books and improve my program's performance. Could anybody help me to improve it? Thank you.
code:
import org.apache.commons.io.FileUtils;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.PrintStream;
import java.util.List;

public class ReadFileToList {
    public static void main(String[] args) throws FileNotFoundException
    {
        File file_out = new File("D:\\Docs\\test_out.txt");
        FileOutputStream fos = new FileOutputStream(file_out);
        PrintStream ps = new PrintStream(fos);
        System.setOut(ps);
        // Create a file object
        File file = new File("D:\\Docs\\test_in.txt");
        FileReader fr = null;
        LineNumberReader lnr = null;
        try {
            // Here we read a file, sample.txt, using FileUtils
            // class of commons-io. Using FileUtils.readLines()
            // we can read file content line by line and return
            // the result as a List of string.
            List<String> contents = FileUtils.readLines(file);
            //
            // Iterate the result to print each line of the file.
            fr = new FileReader(file);
            lnr = new LineNumberReader(fr);
            for (String line : contents)
            {
                String begin_line = line.substring(0, 38); // return 38 chars from the string
                String begin_line_without_null = begin_line.replace("\u0000", " ");
                String begin_line_without_null_spaces = begin_line_without_null.replaceAll(" +", " ");
                int stringlenght = line.length();
                line = lnr.readLine();
                int line_num = lnr.getLineNumber();
                String status;
                // some correct length for if
                int c_u_length_f = 12;
                int c_ea_length_f = 13;
                int c_a_length_f = 2130;
                int c_u_length_e = 3430;
                int c_ea_length_e = 1331;
                int c_a_length_e = 442;
                int h_ext = 6;
                int t_ext = 6;
                if (stringlenght == c_u_length_f ||
                    stringlenght == c_ea_length_f ||
                    stringlenght == c_a_length_f ||
                    stringlenght == c_u_length_e ||
                    stringlenght == c_ea_length_e ||
                    stringlenght == c_a_length_e ||
                    stringlenght == h_ext ||
                    stringlenght == t_ext)
                    status = "ok";
                else
                    status = "fail";
                System.out.println(+ line_num + stringlenght + status + begin_line_without_null_spaces);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Also, specialists from OTN said that this program opens the input and reads it twice. Maybe there is some mistake in the for statement? But I can't find it.
Thank you.
You're declaring variables inside the loop and doing a lot of unneeded work, including reading the file twice - not good for performance either. On top of that, FileUtils.readLines loads the entire file into memory as a List, which is what blows the heap on a 1.5 GB input. You can use the LineNumberReader to get both the line number and the text, and reuse the line variable (declared outside the loop). Here's a shortened version that does what you need. You'll need to complete the validLength method to check all the values, since I included only the first couple of tests.
import java.io.*;

public class TestFile {
    // a method to determine if the length is valid, implemented outside the method that does the reading
    private static String validLength(int length) {
        if (length == 12 || length == 13 || length == 2130) // you can finish it
            return "ok";
        return "fail";
    }

    public static void main(String[] args) {
        try {
            LineNumberReader lnr = new LineNumberReader(new FileReader(args[0]));
            BufferedWriter out = new BufferedWriter(new FileWriter(args[1]));
            String line;
            int length;
            while (null != (line = lnr.readLine())) {
                length = line.length();
                line = line.substring(0, 38);
                line = line.replace("\u0000", " ");
                line = line.replaceAll(" +", " "); // collapse runs of spaces, as in the original code
                // the leading "" forces string concatenation instead of numerically adding the two ints
                out.write("" + lnr.getLineNumber() + length + validLength(length) + line);
                out.newLine();
            }
            out.close();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Call this as java TestFile D:\Docs\test_in.txt D:\Docs\test_out.txt, or replace args[0] and args[1] with the file names if you want to hard-code them.
