I have a 7-node Cassandra (1.1.1) and Hadoop (1.0.3) cluster (a TaskTracker is installed on every Cassandra node).
My column family uses the wide-row pattern; one row contains about 200k columns (at most about 300k).
My problem is that when we use Hadoop to run analytic jobs (counting the number of occurrences of a word), the result I receive is wrong (it is much lower than I expected from the test records).
There is also something strange when monitoring on the JobTracker: the map task progress is reported incorrectly (see my image below), and the number of "Map input records" is not the same when I rerun the job on the same data.
Here is my job init code:
Job job = new Job(conf);
job.setJobName(this.jobname);
job.setJarByClass(BannerCount.class);
job.setMapperClass(BannerViewMapper.class);
job.setReducerClass(BannerClickReducer.class);
FileSystem fs = FileSystem.get(conf);
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "192.168.23.114,192.168.23.115,192.168.23.116,192.168.23.117,192.168.23.121,192.168.23.122,192.168.23.123");
ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
ConfigHelper.setRangeBatchSize(job.getConfiguration(), 500);
SlicePredicate predicate = new SlicePredicate();
SliceRange sliceRange = new SliceRange();
sliceRange.setStart(ByteBufferUtil.EMPTY_BYTE_BUFFER);
sliceRange.setFinish(ByteBufferUtil.EMPTY_BYTE_BUFFER);
sliceRange.setCount(200000);
predicate.setSlice_range(sliceRange);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
String outPathString = "BannerViewResultV3" + COLUMN_FAMILY;
if (fs.exists(new Path(outPathString)))
fs.delete(new Path(outPathString), true);
FileOutputFormat.setOutputPath(job, new Path(outPathString));
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setNumReduceTasks(28);
job.waitForCompletion(true);
return 1;
Related
I'm running an MR job on the EMR master host.
My input file is in S3 and the output is set to a table in Hive via HCatalog.
The job runs successfully and I do see the reducers' output rows, but looking at the new partition folders in S3 I can only see the 0-byte MR _SUCCESS file and no actual data files.
Note: when the reducer stage starts I do see files being written to a temp folder in S3, but it seems the final step moves the files somewhere else.
I don't see any errors in the MR logs.
Relevant MR driver code:
Job job = Job.getInstance();
job.setJobName("Build Events");
job.setJarByClass(LoggersApp.class);
job.getConfiguration().set("fs.defaultFS", "s3://my-bucket");
// set input paths
Path[] inputPaths = "file on s3";
FileInputFormat.setInputPaths(job, inputPaths);
// set input/output formats
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);
_configureOutputTable(job);

private void _setReducer(Job job) {
    job.setReducerClass(Reducer.class);
    job.setOutputValueClass(DefaultHCatRecord.class);
}

private void _configureOutputTable(Job job) throws IOException {
    OutputJobInfo jobInfo = OutputJobInfo.create(_cli.getOptionValue("hive-dbname"),
            _cli.getOptionValue("output-table"), null);
    HCatOutputFormat.setOutput(job, jobInfo);
    HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
    HCatFieldSchema partitionDate = new HCatFieldSchema("date", TypeInfoFactory.stringTypeInfo, null);
    HCatFieldSchema partitionBatchId = new HCatFieldSchema("batch_id", TypeInfoFactory.stringTypeInfo, null);
    schema.append(partitionDate);
    schema.append(partitionBatchId);
    HCatOutputFormat.setSchema(job, schema);
}
Any help?
I have already read previous posts related to this but didn't find anything meaningful.
My use case is:
Aggregate impression and click data.
Separate clicked and non-clicked data into different files.
I have written a mapper and reducer for this, but the reducer's output contains both clicked and non-clicked data, and it all goes into the same file. I want to separate the data so that clicked records end up in one file and non-clicked records in another.
Error:
java.lang.IllegalStateException: Reducer has been already set
at org.apache.hadoop.mapreduce.lib.chain.Chain.checkReducerAlreadySet(Chain.java:662)
Code
Configuration conf = new Configuration();
conf.set("mapreduce.output.fileoutputformat.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
Job job = Job.getInstance(conf, "IMPRESSION_CLICK_COMBINE_JOB");
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setReducerClass(ImpressionClickReducer.class);
FileInputFormat.setInputDirRecursive(job, true);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// job.setMapperClass(ImpressionMapper.class);
Path p = new Path(args[2]);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(p)) {
    fs.delete(p, true);
}
/**
* Here directory of impressions will be present
*/
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, ImpressionMapper.class);
/**
* Here directory of clicks will be present
*/
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, ClickMapper.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setNumReduceTasks(10);
job.setPartitionerClass(TrackerPartitioner.class);
ChainReducer.setReducer(job, ImpressionClickReducer.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, ImpressionClickMapper.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
//Below mentioned line is giving Error
//ChainReducer.setReducer(job, ImpressionAndClickReducer.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
job.waitForCompletion(true);
ChainReducer is used to chain Map tasks after the Reducer; you can only call setReducer() once (see the code here).
From the Javadocs:
The ChainReducer class allows to chain multiple Mapper classes after a
Reducer within the Reducer task.
Using the ChainMapper and the ChainReducer classes is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And immediate benefit of this pattern is a dramatic reduction in disk IO.
So the idea is you set a single Reducer and then chain Map operations after that.
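To make the expected call pattern concrete, here is a minimal sketch (the post-reduce mapper class is hypothetical, not from your code):

// One setReducer() call, then zero or more addMapper() calls for the post-reduce chain.
ChainReducer.setReducer(job, ImpressionClickReducer.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, PostProcessMapper.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));
// Calling setReducer() a second time is exactly what raises "Reducer has been already set".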
It sounds like you actually want to use MultipleOutputs. The Hadoop Javadocs provide an example of how to use it. With this you can define more than one output, and it's down to you which output each key/value gets written to.
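As a rough sketch of that approach, assuming the reducer values carry some marker that distinguishes clicks from impressions (the flag check and the output names "clicked"/"nonclicked" are assumptions, not taken from your code):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ImpressionClickReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Assumption: clicked records can be recognised from the value itself.
            if (value.toString().contains("CLICK")) {
                mos.write("clicked", key, value);
            } else {
                mos.write("nonclicked", key, value);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // required, otherwise the named-output files stay empty
    }
}

The two named outputs would also need to be registered in the driver, e.g. MultipleOutputs.addNamedOutput(job, "clicked", TextOutputFormat.class, Text.class, Text.class) and the same for "nonclicked".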
I am trying to write output into two different named output files using AvroMultipleOutputs, but I am getting empty files and no errors in the logs. The counters show the correct number of records. This also works fine when writing to a single file.
Avro version 1.7.1
Code
Job job = new Job(config, "AVRO_MULTITEST");
job.setJarByClass(AvroMultiWriter.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
job.setMapperClass(AvroMultiWriteMapper.class);
job.setNumReduceTasks(0);
AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.STRING));
AvroMultipleOutputs.setCountersEnabled(job, true);
AvroMultipleOutputs.addNamedOutput(job,"F1",
AvroKeyValueOutputFormat.class, Schema.create
(Schema.Type.STRING),Schema.create(Schema.Type.STRING));
AvroMultipleOutputs.addNamedOutput(job,"F2",
AvroKeyValueOutputFormat.class, Schema.create
(Schema.Type.STRING),Schema.create(Schema.Type.STRING));
LazyOutputFormat.setOutputFormatClass(job, AvroKeyValueOutputFormat.class);
Job counters:
mapred.JobClient: org.apache.avro.mapreduce.AvroMultipleOutputs
mapred.JobClient: F1=3
mapred.JobClient: F2=3
Have you tried calling multipleOutputs.close() in the close() method of the mapper class?
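With the new mapreduce API the equivalent hook is cleanup(). A minimal sketch of what the mapper could look like (the class name, key/value types, and the literal values written are assumptions):

import java.io.IOException;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvroMultiWriteMapper
        extends Mapper<LongWritable, Text, AvroKey<CharSequence>, AvroValue<CharSequence>> {

    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
        amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Write each input line to both named outputs registered in the driver.
        amos.write("F1", new AvroKey<CharSequence>(value.toString()),
                new AvroValue<CharSequence>("wrote to F1"));
        amos.write("F2", new AvroKey<CharSequence>(value.toString()),
                new AvroValue<CharSequence>("wrote to F2"));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        amos.close(); // without this the Avro writers are never flushed and the files stay empty
    }
}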
I have quite a simple Hadoop job using Cassandra as input and output. Here is the job configuration code (nothing special):
Job job = new Job(getConf(), JOB_NAME);
job.setJarByClass(getClass());
job.setMapperClass(CassandraHadoopCounterMapper.class);
job.setReducerClass(CassandraHadoopCounterReducer.class);
job.setCombinerClass(CassandraHadoopCounterCombiner.class);
job.setInputFormatClass(CqlInputFormat.class);
job.setOutputFormatClass(CqlOutputFormat.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Map.class);
job.setOutputValueClass(List.class);
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, INPUT_COLUMN_FAMILY, WIDE_ROWS);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, OUTPUT_COLUMN_FAMILY);
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setInputPartitioner(job.getConfiguration(), Murmur3Partitioner.class.getName());
ConfigHelper.setOutputPartitioner(job.getConfiguration(), Murmur3Partitioner.class.getName());
String query = "UPDATE " + KEYSPACE + "." + OUTPUT_COLUMN_FAMILY + " SET c = ?";
CqlConfigHelper.setOutputCql(job.getConfiguration(), query);
// additional properties:
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "2000");
ConfigHelper.setInputSplitSize(job.getConfiguration(), 4 * 64 * 1024);
My input Cassandra table has 10k rows.
In Hadoop I have set max mappers = 2 and max reducers = 2.
In the job counters I can see the following:
Map input records=4000
which is InputCQLPageRowSize * mappers.
If InputCQLPageRowSize is not set, then Map input records equals 2000 (because the default InputCQLPageRowSize is 1000).
My question: how can I make my Hadoop job read all the rows in the input table?
The job runs entirely locally on my PC.
I am using Cassandra v2.0.11 and Hadoop v1.0.4.
My problem was related to a bug in Cassandra 2.0.11 that added a strange LIMIT clause to the underlying CQL query used to read data into the map task.
I posted that issue to cassandra jira:
https://issues.apache.org/jira/browse/CASSANDRA-9074
It turned out that the problem was strictly related to the following bug, fixed in Cassandra 2.0.12:
https://issues.apache.org/jira/browse/CASSANDRA-8166
I was trying to cluster data in Mahout, and an error is showing.
Here is the error:
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.populateClusterModels(ClusterClassificationMapper.java:129)
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.setup(ClusterClassificationMapper.java:74)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
13/03/07 19:29:31 INFO mapred.JobClient: map 0% reduce 0%
13/03/07 19:29:31 INFO mapred.JobClient: Job complete: job_local_0010
13/03/07 19:29:31 INFO mapred.JobClient: Counters: 0
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
at org.apache.mahout.clustering.kmeans.KMeansDriver.clusterData(KMeansDriver.java:260)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152)
at com.ifm.dataclustering.SequencePrep.<init>(SequencePrep.java:95)
at com.ifm.dataclustering.App.main(App.java:8)
Here is my code:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path vector_path = new Path("E:/Thesis/Experiments/Mahout dataset/input/vector_input");
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, vector_path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector outputVec : vector) {
vec.set(outputVec);
writer.append(new Text(outputVec.getName()), vec);
}
writer.close();
// create initial cluster
Path cluster_path = new Path("E:/Thesis/Experiments/Mahout dataset/clusters/part-00000");
SequenceFile.Writer cluster_writer = new SequenceFile.Writer(fs, conf, cluster_path, Text.class, Kluster.class);
// number of cluster k
int k=4;
for (int i = 0; i < k; i++) {
NamedVector outputVec = vector.get(i);
Kluster cluster = new Kluster(outputVec, i, new EuclideanDistanceMeasure());
// System.out.println(cluster);
cluster_writer.append(new Text(cluster.getIdentifier()), cluster);
}
cluster_writer.close();
// set cluster output path
Path output = new Path("E:/Thesis/Experiments/Mahout dataset/output");
HadoopUtil.delete(conf, output);
KMeansDriver.run(conf, new Path("E:/Thesis/Experiments/Mahout dataset/input"), new Path("E:/Thesis/Experiments/Mahout dataset/clusters"),
output, new EuclideanDistanceMeasure(), 0.001, 10,
true, 0.0, false);
SequenceFile.Reader output_reader = new SequenceFile.Reader(fs,new Path("E:/Thesis/Experiments/Mahout dataset/output/" + Kluster.CLUSTERED_POINTS_DIR+ "/part-m-00000"), conf);
IntWritable key = new IntWritable();
WeightedVectorWritable value = new WeightedVectorWritable();
while (output_reader.next(key, value)) {
System.out.println(value.toString() + " belongs to cluster "
+ key.toString());
}
output_reader.close();
}
The paths to your input/output data seem incorrect. The MapReduce job runs on the cluster, so the data is read from HDFS and not from your local hard disk.
The error message:
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
gives you a hint about the incorrect path.
Before running the job, make sure that you first upload the input data to HDFS:
hadoop fs -mkdir input
hadoop fs -copyFromLocal E:\\file input
...
then instead of:
new Path("E:/Thesis/Experiments/Mahout dataset/input")
you should use the HDFS path:
new Path("input")
or
new Path("/user/<username>/input")
EDIT:
Use FileSystem#exists(Path path) to check whether a Path is valid or not.
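For example, a small sketch of that check before submitting the job (the path is a placeholder, reusing the conf and fs objects already created in your code):

Path input = new Path("/user/<username>/input");
if (!fs.exists(input)) {
    throw new IllegalArgumentException("Input path does not exist in HDFS: " + input);
}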