Apache Spark with custom InputFormat for HadoopRDD - hadoop

I am currently working on Apache Spark. I have implemented a Custom InputFormat for Apache Hadoop that reads key-value records through TCP Sockets. I wanted to port this code to Apache Spark and use it with the hadoopRDD() function. My Apache Spark code is as follows:
public final class SparkParallelDataLoad {
public static void main(String[] args) {
int iterations = 100;
String dbNodesLocations = "";
if(args.length < 3) {
System.err.printf("Usage ParallelLoad <coordinator-IP> <coordinator-port> <numberOfSplits>\n");
System.exit(1);
}
JobConf jobConf = new JobConf();
jobConf.set(CustomConf.confCoordinatorIP, args[0]);
jobConf.set(CustomConf.confCoordinatorPort, args[1]);
jobConf.set(CustomConf.confDBNodesLocations, dbNodesLocations);
int numOfSplits = Integer.parseInt(args[2]);
CustomInputFormat.setCoordinatorIp(args[0]);
CustomInputFormat.setCoordinatorPort(Integer.parseInt(args[1]));
SparkConf sparkConf = new SparkConf().setAppName("SparkParallelDataLoad");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaPairRDD<LongWritable, Text> records = sc.hadoopRDD(jobConf,
CustomInputFormat.class, LongWritable.class, Text.class,
numOfSplits);
JavaRDD<LabeledPoint> points = records.map(new Function<Tuple2<LongWritable, Text>, LabeledPoint>() {
private final Log log = LogFactory.getLog(Function.class);
/**
*
*/
private static final long serialVersionUID = -1771348263117622186L;
private final Pattern SPACE = Pattern.compile(" ");
@Override
public LabeledPoint call(Tuple2<LongWritable, Text> tuple)
throws Exception {
if(tuple == null || tuple._1() == null || tuple._2() == null)
return null;
double y = Double.parseDouble(Long.toString(tuple._1.get()));
String[] tok = SPACE.split(tuple._2.toString());
double[] x = new double[tok.length];
for (int i = 0; i < tok.length; ++i) {
if(tok[i].isEmpty() == false)
x[i] = Double.parseDouble(tok[i]);
}
return new LabeledPoint(y, Vectors.dense(x));
}
});
System.out.println("Number of records: " + points.count());
LinearRegressionModel model = LinearRegressionWithSGD.train(points.rdd(), iterations);
System.out.println("Model weights: " + model.weights());
sc.stop();
}
}
In my project I also have to decide which Spark worker is going to connect to which data source (something like a "matchmaking" process with a 1:1 relation). Therefore, I create a number of InputSplits equal to the number of data sources so that my data is sent to the SparkContext in parallel. My questions are the following:
Does the result of the InputSplit.getLength() method affect how many records a RecordReader returns? In detail, I have seen in my test runs that a job ends after returning only one record, simply because CustomInputSplit.getLength() returns 0.
In the Apache Spark context, is the number of workers equal to the number of InputSplits produced by my InputFormat, at least for the execution of the records.map() call?
The answer to question 2 above is really important for my project.
Thank you,
Nick

Yes. Spark's sc.hadoopRDD will create an RDD with as many partitions as are reported by InputFormat.getSplits.
The last argument to hadoopRDD, called minPartitions (numOfSplits in your code), is passed to InputFormat.getSplits as a hint, but the number of splits that getSplits actually returns is respected whether it is greater or smaller than that hint.
See the code at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L168
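You can verify this yourself by comparing the partition count of the resulting RDD with the number of splits your InputFormat produces. The following is a minimal sketch; the local master, the input path taken from args[0], and the use of TextInputFormat as a stand-in for your custom format are illustrative assumptions:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class PartitionCountCheck {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("PartitionCountCheck").setMaster("local[2]"));
        JobConf jobConf = new JobConf();
        // args[0] is any readable input path; TextInputFormat stands in for the
        // custom InputFormat from the question.
        FileInputFormat.setInputPaths(jobConf, args[0]);
        // The last argument (4) is only the minPartitions *hint* handed to getSplits.
        JavaPairRDD<LongWritable, Text> rdd = sc.hadoopRDD(
                jobConf, TextInputFormat.class, LongWritable.class, Text.class, 4);
        // One RDD partition is created per InputSplit returned by getSplits.
        System.out.println("Partitions: " + rdd.partitions().size());
        sc.stop();
    }
}

If getSplits returns, say, 10 splits, the printed partition count will be 10 regardless of the hint of 4.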

Related

Last reducer has been running for 24 hours on a 200 GB data set

Hi, I have a MapReduce application that bulk loads data into HBase.
I have 142 text files with a total size of 200 GB.
My mappers complete within 5 minutes, and so do all the reducers except the last one, which is stuck at 100%.
It is taking a very long time and has been running for the past 24 hours.
I have one column family.
My row keys look like the ones below.
48433197315|1972-03-31T00:00:00Z|4 48433197315|1972-03-31T00:00:00Z|38 48433197315|1972-03-31T00:00:00Z|41 48433197315|1972-03-31T00:00:00Z|23 48433197315|1972-03-31T00:00:00Z|7 48433336118|1972-03-31T00:00:00Z|17 48433197319|1972-03-31T00:00:00Z|64 48433197319|1972-03-31T00:00:00Z|58 48433197319|1972-03-31T00:00:00Z|61 48433197319|1972-03-31T00:00:00Z|73 48433197319|1972-03-31T00:00:00Z|97 48433336119|1972-03-31T00:00:00Z|7
I have created my table like this:
private static Configuration getHbaseConfiguration() {
try {
if (hbaseConf == null) {
System.out.println(
"UserId= " + USERID + " \t keytab file =" + KEYTAB_FILE + " \t conf =" + KRB5_CONF_FILE);
HBaseConfiguration.create();
hbaseConf = HBaseConfiguration.create();
hbaseConf.set("mapreduce.job.queuename", "root.fricadev");
hbaseConf.set("mapreduce.child.java.opts", "-Xmx6553m");
hbaseConf.set("mapreduce.map.memory.mb", "8192");
hbaseConf.setInt(MAX_FILES_PER_REGION_PER_FAMILY, 1024);
System.setProperty("java.security.krb5.conf", KRB5_CONF_FILE);
UserGroupInformation.loginUserFromKeytab(USERID, KEYTAB_FILE);
}
} catch (Exception e) {
e.printStackTrace();
}
return hbaseConf;
}
/**
* HBase bulk import example Data preparation MapReduce job driver
*
* args[0]: HDFS input path args[1]: HDFS output path
*
* @throws Exception
*
*/
public static void main(String[] args) throws Exception {
if (hbaseConf == null)
hbaseConf = getHbaseConfiguration();
String outputPath = args[2];
hbaseConf.set("data.seperator", DATA_SEPERATOR);
hbaseConf.set("hbase.table.name", args[0]);
hbaseConf.setInt(MAX_FILES_PER_REGION_PER_FAMILY, 1024);
Job job = new Job(hbaseConf);
job.setJarByClass(HBaseBulkLoadDriver.class);
job.setJobName("Bulk Loading HBase Table::" + args[0]);
job.setInputFormatClass(TextInputFormat.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapperClass(HBaseBulkLoadMapperUnzipped.class);
// job.getConfiguration().set("mapreduce.job.acl-view-job",
// "bigdata-app-fricadev-sdw-u6034690");
if (HbaseBulkLoadMapperConstants.FUNDAMENTAL_ANALYTIC.equals(args[0])) {
HTableDescriptor descriptor = new HTableDescriptor(Bytes.toBytes(args[0]));
descriptor.addFamily(new HColumnDescriptor(COLUMN_FAMILY));
HBaseAdmin admin = new HBaseAdmin(hbaseConf);
byte[] startKey = new byte[16];
Arrays.fill(startKey, (byte) 0);
byte[] endKey = new byte[16];
Arrays.fill(endKey, (byte) 255);
admin.createTable(descriptor, startKey, endKey, REGIONS_COUNT);
admin.close();
// HColumnDescriptor hcd = new
// HColumnDescriptor(COLUMN_FAMILY).setMaxVersions(1);
// createPreSplitLoadTestTable(hbaseConf, descriptor, hcd);
}
job.getConfiguration().setBoolean("mapreduce.compress.map.output", true);
job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
job.getConfiguration().setBoolean("mapreduce.output.fileoutputformat.compress", true);
job.getConfiguration().setClass("mapreduce.map.output.compression.codec",
org.apache.hadoop.io.compress.GzipCodec.class, org.apache.hadoop.io.compress.CompressionCodec.class);
job.getConfiguration().set("hfile.compression", Compression.Algorithm.LZO.getName());
// Connection connection =
// ConnectionFactory.createConnection(hbaseConf);
// Table table = connection.getTable(TableName.valueOf(args[0]));
FileInputFormat.setInputPaths(job, args[1]);
FileOutputFormat.setOutputPath(job, new Path(outputPath));
job.setMapOutputValueClass(Put.class);
HFileOutputFormat.configureIncrementalLoad(job, new HTable(hbaseConf, args[0]));
System.exit(job.waitForCompletion(true) ? 0 : -1);
System.out.println("job is successfull..........");
// LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConf);
// loader.doBulkLoad(new Path(outputPath), (HTable) table);
HBaseBulkLoad.doBulkLoad(outputPath, args[0]);
}
/**
* Enum of counters.
* It used for collect statistics
*/
public static enum Counters {
/**
* Counts data format errors.
*/
WRONG_DATA_FORMAT_COUNTER
}
}
There is no reducer in my code, only a mapper.
My mapper code is like this:
public class FundamentalAnalyticLoader implements TableLoader {
private ImmutableBytesWritable hbaseTableName;
private Text value;
private Mapper<LongWritable, Text, ImmutableBytesWritable, Put>.Context context;
private String strFileLocationAndDate;
@SuppressWarnings("unchecked")
public FundamentalAnalyticLoader(ImmutableBytesWritable hbaseTableName, Text value, Context context,
String strFileLocationAndDate) {
//System.out.println("Constructing Fundalmental Analytic Load");
this.hbaseTableName = hbaseTableName;
this.value = value;
this.context = context;
this.strFileLocationAndDate = strFileLocationAndDate;
}
@SuppressWarnings("deprecation")
public void load() {
if (!HbaseBulkLoadMapperConstants.FF_ACTION.contains(value.toString())) {
String[] values = value.toString().split(HbaseBulkLoadMapperConstants.DATA_SEPERATOR);
String[] strArrFileLocationAndDate = strFileLocationAndDate
.split(HbaseBulkLoadMapperConstants.FIELD_SEPERATOR);
if (17 == values.length) {
String strKey = values[5].trim() + "|" + values[0].trim() + "|" + values[3].trim() + "|"
+ values[4].trim() + "|" + values[14].trim() + "|" + strArrFileLocationAndDate[0].trim() + "|"
+ strArrFileLocationAndDate[2].trim();
//String strRowKey=StringUtils.leftPad(Integer.toString(Math.abs(strKey.hashCode() % 470)), 3, "0") + "|" + strKey;
byte[] hashedRowKey = HbaseBulkImportUtil.getHash(strKey);
Put put = new Put((hashedRowKey));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FUNDAMENTAL_SERIES_ID),
Bytes.toBytes(values[0].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FUNDAMENTAL_SERIES_ID_OBJECT_TYPE_ID),
Bytes.toBytes(values[1].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FUNDAMENTAL_SERIES_ID_OBJECT_TYPE),
Bytes.toBytes(values[2]));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FINANCIAL_PERIOD_END_DATE),
Bytes.toBytes(values[3].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FINANCIAL_PERIOD_TYPE),
Bytes.toBytes(values[4].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.LINE_ITEM_ID), Bytes.toBytes(values[5].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_ITEM_INSTANCE_KEY),
Bytes.toBytes(values[6].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_VALUE), Bytes.toBytes(values[7].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_CONCEPT_CODE),
Bytes.toBytes(values[8].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_VALUE_CURRENCY_ID),
Bytes.toBytes(values[9].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_IS_ESTIMATED),
Bytes.toBytes(values[10].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_AUDITABILITY_EQUATION),
Bytes.toBytes(values[11].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FINANCIAL_PERIOD_TYPE_ID),
Bytes.toBytes(values[12].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_CONCEPT_ID),
Bytes.toBytes(values[13].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.ANALYTIC_LINE_ITEM_IS_YEAR_TO_DATE),
Bytes.toBytes(values[14].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.IS_ANNUAL), Bytes.toBytes(values[15].trim()));
// put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
// Bytes.toBytes(HbaseBulkLoadMapperConstants.TAXONOMY_ID),
// Bytes.toBytes(values[16].trim()));
//
// put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
// Bytes.toBytes(HbaseBulkLoadMapperConstants.INSTRUMENT_ID),
// Bytes.toBytes(values[17].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FF_ACTION),
Bytes.toBytes(values[16].substring(0, values[16].length() - 3)));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FILE_PARTITION),
Bytes.toBytes(strArrFileLocationAndDate[0].trim()));
put.add(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
Bytes.toBytes(HbaseBulkLoadMapperConstants.FILE_PARTITION_DATE),
Bytes.toBytes(strArrFileLocationAndDate[2].trim()));
try {
context.write(hbaseTableName, put);
} catch (IOException e) {
context.getCounter(Counters.WRONG_DATA_FORMAT_COUNTER).increment(1);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
} else {
System.out.println("Values length is less 15 and value is " + value.toString());
}
}
}
Any help to improve the speed is highly appreciated.
(A screenshot of the job counters was attached to the original post.)
I suspect that all records go into a single region.
When you created the empty table, HBase split the key address space into even ranges. But because all of your actual keys share the same prefix, they all land in a single region. That means a single region/reduce task does all the work while the other regions/reduce tasks do nothing useful. You can check this hypothesis by looking at the Hadoop counters: how many bytes the slow reduce task read and wrote compared to the other reduce tasks.
If this is the problem, then you need to prepare split keys manually and create the table with createTable(HTableDescriptor desc, byte[][] splitKeys). The split keys should divide your actual dataset evenly for optimal performance.
Example #1. If your keys were ordinary English words, it would be easy to split the table into 26 regions by the first character (split keys 'a', 'b', ..., 'z'), or into 26*26 regions by the first two characters ('aa', 'ab', ..., 'zz'). The regions would not necessarily be even, but this would still be better than having only a single region.
Example #2. If your keys were 4-byte hashes, it would be easy to split the table into 256 regions by the first byte (0x00, 0x01, ..., 0xff) or into 2^16 regions by the first two bytes.
In your particular case, I see two options:
Find the smallest key and the largest key (in sorted order) in your dataset and use them as startKey and endKey for Admin.createTable(). This works well only if the keys are uniformly distributed between startKey and endKey.
Prefix your keys with hash(key) and use the approach from Example #2 (see the sketch below). This should work well, but you won't be able to run semantic range queries such as (KEY >= ${first} and KEY <= ${last}).
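As a minimal sketch of the second option (the table name, the column family, and the 256-region split on the first byte of the hashed key are illustrative assumptions, not taken from your job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public final class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_table")); // placeholder name
        desc.addFamily(new HColumnDescriptor("cf"));                                 // placeholder family
        // 255 split keys on the first byte of the hashed row key -> 256 regions,
        // assuming that first byte is roughly uniformly distributed.
        byte[][] splitKeys = new byte[255][];
        for (int i = 0; i < 255; i++) {
            splitKeys[i] = new byte[] { (byte) (i + 1) };
        }
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}

With the table pre-split this way, HFileOutputFormat.configureIncrementalLoad gives you one reducer per region instead of funnelling everything through a single one.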
If a job hangs in the last minutes or seconds, the cause is usually a particular node, or resource/concurrency issues.
A small checklist:
1. Try again with a smaller data set. This rules out the basic functioning of the code.
2. Since most of the job completes, the mapper and reducer are probably fine. Try running the job a few times with the same volume; the logs can help you identify whether the same node has issues on repeated runs.
3. Verify that the output is being generated as expected.
4. You can also reduce the number of columns you are adding to HBase. This relieves the load for the same volume.
Hanging jobs can be caused by a variety of issues, but troubleshooting mostly consists of the steps above: verifying whether the cause is data related, resource related, specific to one node, memory related, etc.

Requested row out of range for doMiniBatchMutation on HRegion

The error occurred while the HBase client was writing data in batches. At first it works fine; some time later it fails. The detailed error is:
: 1 time, org.apache.hadoop.hbase.exceptions.FailedSanityCheckException: Requested row out of range for doMiniBatchMutation on HRegion idcard,bfef6945ac273d83\x00\x00\x00\x00\x00\x17\xCC$,1461584032622.dadb8843fe441dac4a3d4d7669597ef5., startKey='bfef6945ac273d83\x00\x00\x00\x00\x00\x17\xCC$', getEndKey()='', row='9a6ec957205e1d74\x00\x00\x00\x00\x01\x90\x1F\xF5'
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:712)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:662)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2046)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2117)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
at java.lang.Thread.run(Thread.java:745)
The environment is:
HBase server 1.1.3
Hadoop 2.6
hbase-client 1.2.0
The hbase client's code is :
public static void batchPutData(Connection connection, long startNum, long count) throws IOException, ParseException{
//table
Table table = connection.getTable(TableName.valueOf(TABLE_NAME));
//index table
Table index_table = connection.getTable(TableName.valueOf(INDEX_TABLE_NAME));
//random name
RandomChineseName randomChineseName = new RandomChineseName();
//random car
RandomCar randomCar = new RandomCar();
List<Put> puts = new ArrayList<Put>();
List<Put> indexPlateputs = new ArrayList<Put>();
for(long i = 0; i < count; i++){
long index = startNum+i;
Date birthdate = RandomUtils.randomDate();
String birthdateStr = DateUtil.dateToStr(birthdate, "yyyy-MM-dd");
boolean isBoy = i%2==0?true:false;
String name = isBoy?randomChineseName.randomBoyName():randomChineseName.randomGirlName();
String nation = RandomUtils.randomNation();
String plate = randomCar.randomPlate();
byte[] idbuff = Bytes.toBytes(index);
String hashPrefix = MD5Hash.getMD5AsHex(idbuff).substring(0, 16);
//create a put for table
Put put = new Put(Bytes.add(Bytes.toBytes(hashPrefix), idbuff));
put.addColumn(Bytes.toBytes("idcard"), Bytes.toBytes("name"), Bytes.toBytes(name));
put.addColumn(Bytes.toBytes("idcard"), Bytes.toBytes("sex"), Bytes.toBytes(isBoy?1:0));
put.addColumn(Bytes.toBytes("idcard"), Bytes.toBytes("birthdate"), Bytes.toBytes(birthdateStr));
put.addColumn(Bytes.toBytes("idcard"), Bytes.toBytes("nation"), Bytes.toBytes(nation));
put.addColumn(Bytes.toBytes("idcard"), Bytes.toBytes("plate"), Bytes.toBytes(plate));
puts.add(put);
//create a put for index table
String namehashPrefix = MD5Hash.getMD5AsHex(Bytes.toBytes(name)).substring(0, 16);
byte[] bprf = Bytes.add(Bytes.toBytes(namehashPrefix), Bytes.toBytes(name));
bprf = Bytes.add(bprf, Bytes.toBytes(SPLIT), Bytes.toBytes(birthdateStr));
Put namePut = new Put(Bytes.add(bprf, Bytes.toBytes(SPLIT), Bytes.toBytes(index)));
namePut.addColumn(Bytes.toBytes("index"), Bytes.toBytes("idcard"), Bytes.toBytes(0));
indexPlateputs.add(namePut);
//insert for every ten thousands
if(i%10000 == 0){
table.put(puts);
index_table.put(indexPlateputs);
puts.clear();
indexPlateputs.clear();
}
}
}
This looks like a conflict between your HBase client and server versions (hbase-client 1.2.0 against an HBase 1.1.3 cluster). Try changing the HBase client version to one that matches your server, for example 1.1.4, or to 1.0.0 or another stable release, and see whether the error goes away.
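As a quick sanity check before switching versions, you can compare the version compiled into the client jar with the version the running cluster reports. This is only a diagnostic sketch using standard HBase 1.x client APIs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.VersionInfo;

public final class VersionCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Version of the hbase-client jar on the classpath.
            System.out.println("Client: " + VersionInfo.getVersion());
            // Version reported by the running cluster.
            System.out.println("Server: " + admin.getClusterStatus().getHBaseVersion());
        }
    }
}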

Spring Data Neo4j Ridiculously Slow Over Rest

public List<Errand> interestFeed(Person person, int skip, int limit)
throws ControllerException {
person = validatePerson(person);
String query = String
.format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n ORDER BY n.added DESC SKIP %s LIMIT %S",
person.getLongitude(), person.getLatitude(),
person.getWidth(), skip, limit);
String queryFast = String
.format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n SKIP %s LIMIT %S",
person.getLongitude(), person.getLatitude(),
person.getWidth(), skip, limit);
Set<Errand> errands = new TreeSet<Errand>();
System.out.println(queryFast);
Result<Map<String, Object>> results = template.query(queryFast, null);
Iterator<Errand> objects = results.to(Errand.class).iterator();
return copyIterator (objects);
}
public List<Errand> copyIterator(Iterator<Errand> iter) {
Long start = System.currentTimeMillis();
Double startD = start.doubleValue();
List<Errand> copy = new ArrayList<Errand>();
while (iter.hasNext()) {
Errand e = iter.next();
copy.add(e);
System.out.println(e.getType());
}
Long end = System.currentTimeMillis();
Double endD = end.doubleValue();
p ((endD - startD)/1000);
return copy;
}
When I profile the copyIterator function, it takes about 6 seconds to fetch just 10 results. I use Spring Data Neo4j REST to connect to a Neo4j server running on my local machine. I even added a print statement to see how fast the iterator is converted to a list, and it does appear slow. Does each iterator.next() make a new HTTP call?
If Errand is a node entity then yes, Spring Data Neo4j will make an HTTP call for each entity to fetch all of its labels (this is a limitation of Neo4j, which does not return labels when you return a whole node from Cypher).
You can enable DEBUG-level logging for org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine to log all Cypher statements going to Neo4j.
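For example, if Log4j 1.x backs your logging (an assumption; adjust accordingly for Logback or a log4j.properties entry), the level can be raised programmatically early at startup:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public final class CypherLogging {
    // Call once during application startup so every Cypher statement (and thus
    // every REST round trip) is written to the log.
    public static void enable() {
        Logger.getLogger("org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine")
              .setLevel(Level.DEBUG);
    }
}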
To avoid this per-entity call, use @QueryResult: http://docs.spring.io/spring-data/data-neo4j/docs/current/reference/html/#reference_programming-model_mapresult
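A minimal sketch of that approach, assuming a repository for the Errand entity; the interface, query, and column names (errandId, added) are hypothetical and only illustrate returning plain columns instead of whole nodes, so no per-entity label lookup is needed:

import java.util.List;
import org.springframework.data.neo4j.annotation.Query;
import org.springframework.data.neo4j.annotation.QueryResult;
import org.springframework.data.neo4j.annotation.ResultColumn;
import org.springframework.data.neo4j.repository.GraphRepository;

// Projection mapped straight from the returned columns.
@QueryResult
interface ErrandFeedEntry {
    @ResultColumn("errandId")
    Long getErrandId();

    @ResultColumn("added")
    Long getAdded();
}

interface ErrandRepository extends GraphRepository<Errand> {
    // {0} is the 'withinDistance:[...]' index query string, {1}/{2} are skip/limit.
    @Query("START n=node:ErrandLocation({0}) "
         + "RETURN id(n) AS errandId, n.added AS added "
         + "ORDER BY n.added DESC SKIP {1} LIMIT {2}")
    List<ErrandFeedEntry> interestFeed(String withinDistance, int skip, int limit);
}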

Memory issues when running Spark job on relatively large input

I am running a Spark cluster with 50 machines. Each machine is a VM with 8 cores and 50 GB of memory (41 GB appears to be available to Spark).
I am running over several input folders; I estimate the input size to be ~250 GB gz-compressed.
Although the number and configuration of the machines seem sufficient to me, the job fails after about 40 minutes of running, and I can see the following errors in the logs:
2558733 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 345.0 in stage 1.0 (TID 345, hadoop-w-3.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: Java heap space
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
and also:
2653545 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 122.1 in stage 1.0 (TID 392, hadoop-w-22.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
How do I go about debugging such an issue?
EDIT: I found the root cause of the problem. It is this piece of code:
private static final int MAX_FILE_SIZE = 40194304;
....
....
JavaPairRDD<String, List<String>> typedData = filePaths.mapPartitionsToPair(new PairFlatMapFunction<Iterator<String>, String, List<String>>() {
@Override
public Iterable<Tuple2<String, List<String>>> call(Iterator<String> filesIterator) throws Exception {
List<Tuple2<String, List<String>>> res = new ArrayList<>();
String fileType = null;
List<String> linesList = null;
if (filesIterator != null) {
while (filesIterator.hasNext()) {
try {
Path file = new Path(filesIterator.next());
// filter non-trc files
if (!file.getName().startsWith("1")) {
continue;
}
fileType = getType(file.getName());
Configuration conf = new Configuration();
CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
CompressionCodec codec = compressionCodecs.getCodec(file);
FileSystem fs = file.getFileSystem(conf);
ContentSummary contentSummary = fs.getContentSummary(file);
long fileSize = contentSummary.getLength();
InputStream in = fs.open(file);
if (codec != null) {
in = codec.createInputStream(in);
} else {
throw new IOException();
}
byte[] buffer = new byte[MAX_FILE_SIZE];
BufferedInputStream bis = new BufferedInputStream(in, BUFFER_SIZE);
int count = 0;
int bytesRead = 0;
try {
while ((bytesRead = bis.read(buffer, count, BUFFER_SIZE)) != -1) {
count += bytesRead;
}
} catch (Exception e) {
log.error("Error reading file: " + file.getName() + ", trying to read " + BUFFER_SIZE + " bytes at offset: " + count);
throw e;
}
Iterable<String> lines = Splitter.on("\n").split(new String(buffer, "UTF-8").trim());
linesList = Lists.newArrayList(lines);
// get rid of first line in file
Iterator<String> it = linesList.iterator();
if (it.hasNext()) {
it.next();
it.remove();
}
//res.add(new Tuple2<>(fileType,linesList));
} finally {
res.add(new Tuple2<>(fileType, linesList));
}
}
}
return res;
}
In particular, it allocates a 40 MB buffer for each file in order to read the file's content through a BufferedInputStream. This causes the executors to run out of heap memory at some point.
The thing is:
If I read line by line (which does not require a big buffer), the read will be very inefficient.
If I allocate one buffer and reuse it for each file read, is that safe with respect to parallelism, or will it get overwritten by several threads?
Any suggestions are welcome...
EDIT 2: I fixed the first memory issue by moving the byte array allocation outside the iterator, so it is reused for all partition elements. But there is still the new String(buffer, "UTF-8").trim() created for the split, an object that is also allocated every time. I could use a StringBuffer/StringBuilder, but then how would I set the charset encoding without a String object?
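(One way to decode the reused buffer without creating a new String per file would be java.nio's CharsetDecoder; this is only a sketch, where buffer and count are assumed to be the reused byte array and the number of valid bytes in it:)

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

final class Utf8Decode {
    // Decode only the valid portion of a reused byte buffer into a CharBuffer,
    // which can then be scanned for '\n' without building one big String.
    static CharBuffer decode(byte[] buffer, int count) throws CharacterCodingException {
        return Charset.forName("UTF-8").newDecoder()
                .decode(ByteBuffer.wrap(buffer, 0, count));
    }
}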
Eventually I changed the code as follows:
// Transform list of files to list of all files' content in lines grouped by type
JavaPairRDD<String,List<String>> typedData = filePaths.mapToPair(new PairFunction<String, String, List<String>>() {
@Override
public Tuple2<String, List<String>> call(String filePath) throws Exception {
Tuple2<String, List<String>> tuple = null;
try {
String fileType = null;
List<String> linesList = new ArrayList<String>();
Configuration conf = new Configuration();
CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
Path path = new Path(filePath);
fileType = getType(path.getName());
tuple = new Tuple2<String, List<String>>(fileType, linesList);
// filter non-trc files
if (!path.getName().startsWith("1")) {
return tuple;
}
CompressionCodec codec = compressionCodecs.getCodec(path);
FileSystem fs = path.getFileSystem(conf);
InputStream in = fs.open(path);
if (codec != null) {
in = codec.createInputStream(in);
} else {
throw new IOException();
}
BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);
// Get rid of the first line in the file
r.readLine();
// Read all lines
String line;
while ((line = r.readLine()) != null) {
linesList.add(line);
}
} catch (IOException e) { // Filtering of files whose reading went wrong
log.error("Reading of the file " + filePath + " went wrong: " + e.getMessage());
} finally {
return tuple;
}
}
});
So now I no longer use a 40 MB buffer; instead I build the list of lines dynamically with an ArrayList. This solved my current memory issue, but now I am getting other strange errors that fail the job. I will report those in a different question...

HBase Aggregation

I'm having some trouble doing aggregation on a particular column in HBase.
This is the snippet of code I tried:
Configuration config = HBaseConfiguration.create();
AggregationClient aggregationClient = new AggregationClient(config);
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("drs"), Bytes.toBytes("count"));
ColumnInterpreter<Long, Long> ci = new LongColumnInterpreter();
Long sum = aggregationClient.sum(Bytes.toBytes("DEMO_CALCULATIONS"), ci , scan);
System.out.println(sum);
sum returns a value of null.
The aggregationClient API works fine if I do a rowcount.
I was trying to follow the directions in http://michaelmorello.blogspot.in/2012/01/row-count-hbase-aggregation-example.html
Could the problem be that I am using a LongColumnInterpreter while the 'count' field was written as an int? What am I missing here?
With the default setup you can only sum values that are stored as longs (8 bytes), because AggregateImplementation's getSum method treats every returned KeyValue as a long:
List<KeyValue> results = new ArrayList<KeyValue>();
try {
boolean hasMoreRows = false;
do {
hasMoreRows = scanner.next(results);
for (KeyValue kv : results) {
temp = ci.getValue(colFamily, qualifier, kv);
if (temp != null)
sumVal = ci.add(sumVal, ci.castToReturnType(temp));
}
results.clear();
} while (hasMoreRows);
} finally {
scanner.close();
}
and LongColumnInterpreter.getValue() returns null for any cell whose value is not exactly 8 bytes long:
public Long getValue(byte[] colFamily, byte[] colQualifier, KeyValue kv)
throws IOException {
if (kv == null || kv.getValueLength() != Bytes.SIZEOF_LONG)
return null;
return Bytes.toLong(kv.getBuffer(), kv.getValueOffset());
}
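So if the 'count' column was written with Bytes.toBytes(someInt) (4 bytes), every cell is skipped and the sum comes back null. A minimal sketch of writing the count as a long instead (the row key is a placeholder; family and qualifier are taken from your scan):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public final class LongCountPut {
    public static Put countPut(String rowKey, long count) {
        Put put = new Put(Bytes.toBytes(rowKey));
        // Bytes.toBytes(count) with a long produces an 8-byte value, which is
        // exactly what LongColumnInterpreter expects; a 4-byte int would be
        // rejected and the AggregationClient sum would stay null.
        put.add(Bytes.toBytes("drs"), Bytes.toBytes("count"), Bytes.toBytes(count));
        return put;
    }
}

Alternatively, a custom ColumnInterpreter that understands 4-byte ints could be supplied, but it would also have to be available where the AggregateImplementation coprocessor runs.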
