Mahout k-means clustering: showing error - Hadoop

I was trying to cluster data in Mahout, but an error is thrown.
Here is the error:
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.populateClusterModels(ClusterClassificationMapper.java:129)
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.setup(ClusterClassificationMapper.java:74)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
13/03/07 19:29:31 INFO mapred.JobClient: map 0% reduce 0%
13/03/07 19:29:31 INFO mapred.JobClient: Job complete: job_local_0010
13/03/07 19:29:31 INFO mapred.JobClient: Counters: 0
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
at org.apache.mahout.clustering.kmeans.KMeansDriver.clusterData(KMeansDriver.java:260)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152)
at com.ifm.dataclustering.SequencePrep.<init>(SequencePrep.java:95)
at com.ifm.dataclustering.App.main(App.java:8)
Here is my code:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// write the input vectors as a sequence file
Path vector_path = new Path("E:/Thesis/Experiments/Mahout dataset/input/vector_input");
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, vector_path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector outputVec : vector) {
    vec.set(outputVec);
    writer.append(new Text(outputVec.getName()), vec);
}
writer.close();

// create the initial clusters
Path cluster_path = new Path("E:/Thesis/Experiments/Mahout dataset/clusters/part-00000");
SequenceFile.Writer cluster_writer = new SequenceFile.Writer(fs, conf, cluster_path, Text.class, Kluster.class);

// number of clusters k
int k = 4;
for (int i = 0; i < k; i++) {
    NamedVector outputVec = vector.get(i);
    Kluster cluster = new Kluster(outputVec, i, new EuclideanDistanceMeasure());
    // System.out.println(cluster);
    cluster_writer.append(new Text(cluster.getIdentifier()), cluster);
}
cluster_writer.close();

// set cluster output path
Path output = new Path("E:/Thesis/Experiments/Mahout dataset/output");
HadoopUtil.delete(conf, output);

KMeansDriver.run(conf, new Path("E:/Thesis/Experiments/Mahout dataset/input"), new Path("E:/Thesis/Experiments/Mahout dataset/clusters"),
        output, new EuclideanDistanceMeasure(), 0.001, 10,
        true, 0.0, false);

// read the clustered points back
SequenceFile.Reader output_reader = new SequenceFile.Reader(fs,
        new Path("E:/Thesis/Experiments/Mahout dataset/output/" + Kluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
IntWritable key = new IntWritable();
WeightedVectorWritable value = new WeightedVectorWritable();
while (output_reader.next(key, value)) {
    System.out.println(value.toString() + " belongs to cluster " + key.toString());
}
output_reader.close();

The paths to your input/output data seem incorrect. The MapReduce job runs on the cluster, so the data is read from HDFS, not from your local hard disk.
The error message:
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
gives you a hint about the incorrect path.
Before running the job, make sure that you first upload the input data to HDFS:
hadoop fs -mkdir input
hadoop fs -copyFromLocal E:\\file input
...
then instead of:
new Path("E:/Thesis/Experiments/Mahout dataset/input")
you should use the HDFS path:
new Path("input")
or
new Path("/user/<username>/input")
EDIT:
Use FileSystem#exists(Path path) to check whether a path exists.
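For example, a quick sanity check before launching the driver (just a sketch; it assumes fs was obtained from a Configuration that points at HDFS, and "input" is the directory created above):
Path input = new Path("input");
if (!fs.exists(input)) {
    throw new IllegalStateException("Input path does not exist on HDFS: " + input);
}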

Related

Unable to write data in HDFS "datanode" - Node added in excluded list

I'm running "namenode" and "datanode" in the same jvm, when I try to write data I'm getting the following exception
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:836)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:724)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:631)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:591)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:490)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:421)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:297)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:148)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:164)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2127)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2771)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:567)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
final File file = new File("C:\\ManageEngine\\test\\data\\namenode");
final File file1 = new File("C:\\ManageEngine\\test\\data\\datanode1");
BasicConfigurator.configure();
final HdfsConfiguration nameNodeConfiguration = new HdfsConfiguration();
FileSystem.setDefaultUri(nameNodeConfiguration, "hdfs://localhost:5555");
nameNodeConfiguration.set(DFSConfigKeys.DFS_NAMENODE_NAME_DIR_KEY, file.toURI().toString());
nameNodeConfiguration.set(DFSConfigKeys.DFS_REPLICATION_KEY, "1" );
final NameNode nameNode = new NameNode(nameNodeConfiguration);
final HdfsConfiguration dataNodeConfiguration1 = new HdfsConfiguration();
dataNodeConfiguration1.set(DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, file1.toURI().toString());
dataNodeConfiguration1.set(DFSConfigKeys.DFS_DATANODE_ADDRESS_KEY, "localhost:5556" );
dataNodeConfiguration1.set(DFSConfigKeys.DFS_REPLICATION_KEY, "1" );
FileSystem.setDefaultUri(dataNodeConfiguration1, "hdfs://localhost:5555");
final DataNode dataNode1 = DataNode.instantiateDataNode(new String[]{}, dataNodeConfiguration1);
final FileSystem fs = FileSystem.get(dataNodeConfiguration1);
Path hdfswritepath = new Path(fileName);
if (!fs.exists(hdfswritepath)) {
    fs.create(hdfswritepath);
    System.out.println("Path " + hdfswritepath + " created.");
}
System.out.println("Begin Write file into hdfs");
FSDataOutputStream outputStream = fs.create(hdfswritepath);
// Classical output stream usage
outputStream.writeBytes(fileContent);
outputStream.close();
System.out.println("End Write file into hdfs");
[Image: request data]
You cannot have the number of replicas higher than the number of datanodes.
If you want to run on a single node, set dfs.replication to 1 in your hdfs-site.xml.
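For a standalone single-node setup, that is roughly the following snippet in hdfs-site.xml (in the embedded setup above, the equivalent is the DFS_REPLICATION_KEY value that is already being set programmatically):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>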

Map Reduce job on EMR successfully running but no output data on S3

I'm running an MR job on the EMR master host.
My input file is in S3, and the output is set to a table in Hive via HCatalog.
The job runs successfully and I do see the reducers' output rows, but looking at the new partition folders on S3 I can only see the MR 0-byte SUCCESS file and no actual data files.
Note: when the reducer stage starts I do see files being written to S3 into a temp folder, but it seems the last operation moves the files somewhere.
I don't see any errors in the MR logs.
Relevant MR driver code:
Job job = Job.getInstance();
job.setJobName("Build Events");
job.setJarByClass(LoggersApp.class);
job.getConfiguration().set("fs.defaultFS", "s3://my-bucket");

// set input paths
Path[] inputPaths = ...; // "file on s3"
FileInputFormat.setInputPaths(job, inputPaths);

// set input/output formats
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);
_configureOutputTable(job);

private void _setReducer(Job job) {
    job.setReducerClass(Reducer.class);
    job.setOutputValueClass(DefaultHCatRecord.class);
}

private void _configureOutputTable(Job job) throws IOException {
    OutputJobInfo jobInfo = OutputJobInfo.create(_cli.getOptionValue("hive-dbname"),
            _cli.getOptionValue("output-table"), null);
    HCatOutputFormat.setOutput(job, jobInfo);
    HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
    HCatFieldSchema partitionDate = new HCatFieldSchema("date",
            TypeInfoFactory.stringTypeInfo, null);
    HCatFieldSchema partitionBatchId = new HCatFieldSchema("batch_id",
            TypeInfoFactory.stringTypeInfo, null);
    schema.append(partitionDate);
    schema.append(partitionBatchId);
    HCatOutputFormat.setSchema(job, schema);
}
Any help?

Reading from a sequence file placed in DistributedCache Hadoop

How can I read sequence files from distributed cache?
I have tried some things, but I'm always getting FileNotFoundException.
I'm adding the file to the distributed cache like this:
DistributedCache.addCacheFile(new URI(currentMedoids), conf);
And reading from it in the mapper's setup() method:
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
Path[] paths = DistributedCache.getLocalCacheFiles(conf);
List<Element> sketch = new ArrayList<Element>();
SequenceFile.Reader medoidsReader = new SequenceFile.Reader(fs, paths[0], conf);
Writable medoidKey = (Writable) medoidsReader.getKeyClass().newInstance();
Writable medoidValue = (Writable) medoidsReader.getValueClass().newInstance();
while (medoidsReader.next(medoidKey, medoidValue)) {
    ElementWritable medoidWritable = (ElementWritable) medoidValue;
    sketch.add(medoidWritable.getElement());
}
It seems that I should have used getCacheFiles(), which returns URI[], instead of getLocalCacheFiles(), which returns Path[].
Now it works after making that change.
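For reference, a minimal sketch of the setup() code after that change (same classes as in the question; error handling and the read loop omitted):
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
// getCacheFiles() returns the URIs the files were registered under,
// which can be opened through the job's FileSystem
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
SequenceFile.Reader medoidsReader =
        new SequenceFile.Reader(fs, new Path(cacheFiles[0].toString()), conf);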

hadoop mapreduce job on Cassandra result is wrong

I have a 7-node Cassandra (1.1.1) and Hadoop (1.0.3) cluster (a TaskTracker is installed on every Cassandra node), and my column family uses the wide-row pattern: one row contains about 200k columns (max about 300k).
My problem is that when we use Hadoop to run analytic jobs (counting the number of occurrences of a word), the result I receive is wrong (much lower than I expected on the test records).
There is also something strange when monitoring on the JobTracker: the map task progress indicator is wrong (in my image below), and the number of "Map input records" is not the same when I rerun the job on the same data.
Here is my job init code:
Job job = new Job(conf);
job.setJobName(this.jobname);
job.setJarByClass(BannerCount.class);
job.setMapperClass(BannerViewMapper.class);
job.setReducerClass(BannerClickReducer.class);
FileSystem fs = FileSystem.get(conf);
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "192.168.23.114,192.168.23.115,192.168.23.116,192.168.23.117,192.168.23.121,192.168.23.122,192.168.23.123");
ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
ConfigHelper.setRangeBatchSize(job.getConfiguration(), 500);
SlicePredicate predicate = new SlicePredicate();
SliceRange sliceRange = new SliceRange();
sliceRange.setStart(ByteBufferUtil.EMPTY_BYTE_BUFFER);
sliceRange.setFinish(ByteBufferUtil.EMPTY_BYTE_BUFFER);
sliceRange.setCount(200000);
predicate.setSlice_range(sliceRange);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
String outPathString = "BannerViewResultV3" + COLUMN_FAMILY;
if (fs.exists(new Path(outPathString)))
fs.delete(new Path(outPathString), true);
FileOutputFormat.setOutputPath(job, new Path(outPathString));
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setNumReduceTasks(28);
job.waitForCompletion(true);
return 1;

Migrating Data from HBase to FileSystem. (Writing Reducer output to Local or Hadoop filesystem)

My purpose is to migrate the data from HBase tables to flat (say, CSV-formatted) files.
I am using
TableMapReduceUtil.initTableMapperJob(tableName, scan,
GetCustomerAccountsMapper.class, Text.class, Result.class,
job);
for scanning through the HBase table, and TableMapper for the Mapper.
My challenge is in forcing the Reducer to dump the row values (which are normalized into a flattened format) to the local (or HDFS) file system.
My problem is that I can neither see the Reducer's logs nor any files at the path that I have specified in the Reducer.
It's my 2nd or 3rd MR job and my first serious one. After trying hard for two days, I am still clueless about how to achieve my goal.
It would be great if someone could show me the right direction.
Here is my reducer code:
public void reduce(Text key, Iterable<Result> rows, Context context)
        throws IOException, InterruptedException {
    FileSystem fs = LocalFileSystem.getLocal(new Configuration());
    Path dir = new Path("/data/HBaseDataMigration/" + tableName + "_Reducer" + "/" + key.toString());
    FSDataOutputStream fsOut = fs.create(dir, true);
    for (Result row : rows) {
        try {
            String normRow = NormalizeHBaserow(
                    Bytes.toString(key.getBytes()), row, tableName);
            fsOut.writeBytes(normRow);
            // context.write(new Text(key.toString()), new Text(normRow));
        } catch (BadHTableResultException ex) {
            throw new IOException(ex);
        }
    }
    fsOut.flush();
    fsOut.close();
}
My configuration for the Reducer output:
Path out = new Path(args[0] + "/" + tableName+"Global");
FileOutputFormat.setOutputPath(job, out);
Thanks in Advance - Panks
Why not reduce into HDFS and, once finished, use the HDFS shell to export the file:
hadoop fs -get /user/hadoop/file localfile
If you do want to handle it in the reduce phase, take a look at this article on OutputFormat on InfoQ.
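For the first suggestion, the usual pattern is to let the framework do the writing: the reducer only emits key/value pairs through context.write, and the job's FileOutputFormat decides where the part files land on HDFS. A rough sketch along the lines of the question's code (NormalizeHBaserow and tableName are the question's own helpers, not standard API):
// reducer: no manual FileSystem writes, just emit the normalized line
public void reduce(Text key, Iterable<Result> rows, Context context)
        throws IOException, InterruptedException {
    for (Result row : rows) {
        String normRow = NormalizeHBaserow(Bytes.toString(key.getBytes()), row, tableName);
        context.write(key, new Text(normRow));
    }
}

// driver: point the output at HDFS, then pull it down with "hadoop fs -get"
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[0] + "/" + tableName + "Global"));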
