How to hold objects in an array in Hadoop

How can we have an array of our own objects in Hadoop? Does Hadoop have containers like List, Set, LinkedList, etc., similar to Java? Are the following lines correct?
Text[] textArray = new Text[2];
textArray[0] = new Text(maxSalaryDeptEmployee.getEmployeeName());
textArray[1] = new Text(Integer.toString(maxSalaryDeptEmployee.getEmployeeSalary()));
ArrayWritable arrayWritable = new ArrayWritable(Text.class, textArray);

Your code snippet looks good. Out of the box, Hadoop provides only ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable and EnumSetWritable as "containers".
You can also implement custom Writables; a good reference is here.
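If the built-in containers don't fit, a custom Writable is straightforward. Below is a minimal sketch for the employee name/salary pair from the question; the class and field names are made up for illustration.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom Writable holding an employee name and salary.
public class EmployeeWritable implements Writable {
    private String name;
    private int salary;

    // Hadoop needs a no-arg constructor to instantiate the Writable on deserialization.
    public EmployeeWritable() {}

    public EmployeeWritable(String name, int salary) {
        this.name = name;
        this.salary = salary;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(salary);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        salary = in.readInt();
    }

    @Override
    public String toString() {
        return name + "\t" + salary;
    }
}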

Related

Can I write different types of messages to a chronicle-queue?

I would like to write different types of messages to a chronicle-queue, and process messages in consumers depending on their types.
How can I do that?
Chronicle-Queue provides low-level building blocks you can use to write any kind of message, so it is up to you to choose the right data structure.
For example, you can prefix the data you write to a chronicle with a small header containing some metadata, and then use it as a discriminator for data processing.
To achieve this I use Wire:
try (DocumentContext dc = appender.writingDocument())
{
    final Wire wire = dc.wire();
    final ValueOut valueOut = wire.getValueOut();
    valueOut.typePrefix(m.getClass());
    valueOut.marshallable(m);
}
When reading back I:
try (DocumentContext dc = tailer.readingDocument())
{
    final Wire wire = dc.wire();
    final ValueIn valueIn = wire.getValueIn();
    final Class clazz = valueIn.typePrefix();
    // msgPool is a preallocated map from class to a reusable message instance
    final ReadMarshallable readObject = msgPool.get(clazz);
    valueIn.readMarshallable(readObject);
    // readObject can now be used
}
You can also write/read a generic object. This will be slightly slower than using your own scheme, but it is a simple way to always read back the type you wrote.
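For reference, a rough sketch of that generic-object approach using Wire's self-describing object() calls. The builder method, the "queue-dir" path and the String payload are assumptions, and the exact API differs between Chronicle-Queue versions.
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;
import net.openhft.chronicle.wire.DocumentContext;

public class GenericObjectExample {
    public static void main(String[] args) {
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("queue-dir").build()) {
            // Write: object() stores the type information alongside the payload.
            ExcerptAppender appender = queue.acquireAppender();
            try (DocumentContext dc = appender.writingDocument()) {
                dc.wire().getValueOut().object("hello world");
            }

            // Read: object() reconstructs whatever type was written.
            ExcerptTailer tailer = queue.createTailer();
            try (DocumentContext dc = tailer.readingDocument()) {
                if (dc.isPresent()) {
                    Object msg = dc.wire().getValueIn().object();
                    System.out.println(msg);
                }
            }
        }
    }
}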

Two arrays or one in Map structure?

I'm trying to create a Map where the data will be static and not change after the program starts (it's actually loaded from a server).
Is it better to have two arrays, e.g. in Java:
String keys[] = new String[10];
String values[] = new String[10];
where keys[i] corresponds to values[i]?
or to keep them in a single array, e.g.
String[][] map = new String[10][2];
where map[i][0] is the key and map[i][1] is the value?
Personally, the first makes more sense to me, but the second makes more sense to my partner. Is either better performance-wise? Easier to understand?
Update: I'm looking to do this in JavaScript where Map and KeyValuePairs don't exist
Using a Map implementation (in Java) would make this easier to understand as the association is clearer:
static final Map<String, String> my_map;
static
{
    my_map = new HashMap<String, String>();
    // Populate.
}
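If the server really hands you the data as two parallel arrays, one option is to fold them into the map once at startup. A small sketch; fetchKeys/fetchValues are hypothetical placeholders for whatever actually loads the data.
import java.util.HashMap;
import java.util.Map;

public class StaticLookup {
    static final Map<String, String> my_map = new HashMap<String, String>();

    static {
        String[] keys = fetchKeys();     // hypothetical server call
        String[] values = fetchValues(); // hypothetical server call
        for (int i = 0; i < keys.length; i++) {
            my_map.put(keys[i], values[i]); // keys[i] corresponds to values[i]
        }
    }

    // Placeholders so the sketch compiles; replace with the real server calls.
    private static String[] fetchKeys()   { return new String[] { "key1", "key2" }; }
    private static String[] fetchValues() { return new String[] { "value1", "value2" }; }
}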
A Hashtable looks like what you need. It hashes the keys in such a way that lookup can happen in O(1).
So, you're looking to do this in JavaScript. Any array or object in JS is a map, so you could just do:
var mymap = {'key1':'value1','key2':'value2'};

Map Support in Shell Scripting

I am new to shell scripting, but I am familiar with Java Maps. I just want to know how I can use a Map-like facility in shell scripting. Below is the facility I need to use, shown in Java:
HashMap<String, ArrayList<String>> users = new HashMap<String, ArrayList<String>>();
String username = "test_user1";
String address = "test_user1_address";
String emailId = "test_user1_emailId";
ArrayList<String> values = new ArrayList<String>();
values.add(address);
values.add(emailId);
users.put(username, values);
String anotherUser = "test_user2";
if (users.containsKey(anotherUser)) {
    System.out.println("Do some stuff here");
}
In short, I want to use a Map that has a String as key and either a Vector or an ArrayList as value (otherwise I can live with arrays instead of ArrayList and take care of indexes manually), a put method to insert, and one more method to check for the presence of a key in the existing Map.
The above code is a sample.
Thank you in advance.
bash does not support nested structures like this. Either use separate variables for each array, or use something more capable such as Python.

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is map a shard of the database (please don't ask why I need to do that) to a mapper, do certain operations on this data, output the results to reducers, and use that output again to do a second-phase map/reduce job on the same data using the same shard format.
Hadoop does not provide any input method to send a shard of the database. You can only send the data line by line using LineInputFormat and LineRecordReader. NLineInputFormat doesn't help in this case either. I need to extend the FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader, since the underlying code already deals with FileSplits and all the problems associated with splitting the files.
All I need to do now is override the nextKeyValue() method, which I don't exactly know how to do.
for (int i = 0; i < shard_size; i++) {
    if (lineRecordReader.nextKeyValue()) {
        lineValue.append(lineRecordReader.getCurrentValue().getBytes(), 0,
                lineRecordReader.getCurrentValue().getLength());
    }
}
The above snippet is the one that I wrote, but somehow it doesn't work well.
I would suggest putting connection strings, and some other indication of where to find the shard, into your input files.
The mapper will take this information, connect to the database and do the job. I would not suggest converting result sets to Hadoop's Writable classes - it will hinder performance.
The problem I see to be addressed is having enough splits of this relatively small input.
You can simply create enough small files with a few shard references each, or you can tweak the input format to build small splits. The second way is more flexible; a sketch follows.
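One way to get a split per shard reference, under the assumption that you can use the newer org.apache.hadoop.mapreduce API, is NLineInputFormat, which turns every N input lines into a separate split. The driver below is only a sketch; ShardMapper and its "processed" output are placeholders for whatever actually opens the shard.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ShardJobDriver {

    // Hypothetical mapper: each input value is one shard reference (e.g. a connection string).
    public static class ShardMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // A real job would open the connection described by 'value' and query the shard here.
            context.write(value, new Text("processed"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "shard-job");
        job.setJarByClass(ShardJobDriver.class);

        // One shard reference per line, one line per split => one mapper per shard.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        NLineInputFormat.addInputPath(job, new Path(args[0]));

        job.setMapperClass(ShardMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}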
What I did is something like this: I wrote my own record reader to read n lines at a time and send them to mappers as input.
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    // Read the next 5 lines from the wrapped LineRecordReader and
    // concatenate them into a single value for the mapper.
    for (int i = 0; i < 5; i++) {
        if (!lineRecordReader.nextKeyValue()) {
            return false;
        }
        lineKey = lineRecordReader.getCurrentKey();
        lineValue = lineRecordReader.getCurrentValue();
        sb.append(lineValue.toString());
        sb.append(eol);
    }
    lineValue.set(sb.toString());
    return true;
}
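For completeness, here is a rough sketch of how such a reader can be wrapped into a full custom InputFormat with the new org.apache.hadoop.mapreduce API. The class names NLinesInputFormat and NLinesRecordReader are made up for illustration.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat that hands each mapper blocks of 5 lines at a time.
public class NLinesInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new NLinesRecordReader();
    }

    public static class NLinesRecordReader extends RecordReader<LongWritable, Text> {
        private static final String EOL = "\n";
        private final LineRecordReader lineRecordReader = new LineRecordReader();
        private LongWritable lineKey = new LongWritable();
        private final Text lineValue = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Delegate split handling to the built-in line reader.
            lineRecordReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 5; i++) {
                if (!lineRecordReader.nextKeyValue()) {
                    return false;
                }
                lineKey = lineRecordReader.getCurrentKey();
                sb.append(lineRecordReader.getCurrentValue().toString()).append(EOL);
            }
            lineValue.set(sb.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return lineKey; }

        @Override
        public Text getCurrentValue() { return lineValue; }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineRecordReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineRecordReader.close();
        }
    }
}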

Why not mapper/reducer for hadoop TeraSort

I am planning to insert some code into the mapper of the TeraSort class in Hadoop 0.20.2. However, after reviewing the source code, I cannot locate the segment where the mapper is implemented.
Normally we would see a call like job.setMapperClass() which indicates the mapper class. However, for TeraSort I can only see calls like setInputFormat and setOutputFormat. I cannot find where the mapper and reduce methods are specified.
Can anyone please give me some hints about this? Thanks.
The source code is something like this:
public int run(String[] args) throws Exception {
    LOG.info("starting");
    JobConf job = (JobConf) getConf();
    Path inputDir = new Path(args[0]);
    inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
    Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
    URI partitionUri = new URI(partitionFile.toString() +
                               "#" + TeraInputFormat.PARTITION_FILENAME);
    TeraInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setJobName("TeraSort");
    job.setJarByClass(TeraSort.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormat(TeraInputFormat.class);
    job.setOutputFormat(TeraOutputFormat.class);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TeraInputFormat.writePartitionFile(job, partitionFile);
    DistributedCache.addCacheFile(partitionUri, job);
    DistributedCache.createSymlink(job);
    job.setInt("dfs.replication", 1);
    // TeraOutputFormat.setFinalSync(job, true);
    job.setNumReduceTasks(0);
    JobClient.runJob(job);
    LOG.info("done");
    return 0;
}
For other classes, like TeraValidate, we can find code like:
job.setMapperClass(ValidateMapper.class);
job.setReducerClass(ValidateReducer.class);
I cannot see such methods for TeraSort.
Thanks,
Why would a sort need to set a Mapper and Reducer class at all?
The defaults are the standard Mapper (formerly the identity Mapper) and the standard Reducer; these are the classes you usually inherit from.
You can basically say that everything from the input is emitted unchanged and Hadoop does its own sorting during the shuffle, so the sort works "by default".
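So if the goal is to run your own code inside TeraSort's map phase, one option is to register a pass-through mapper of your own in run(). A minimal sketch with the old mapred API used by Hadoop 0.20.2; the class name MyTeraSortMapper is made up.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical identity-style mapper that lets you add your own code to TeraSort's map phase.
public class MyTeraSortMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    @Override
    public void map(Text key, Text value, OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // ... your extra code goes here ...

        // Pass the record through unchanged so the sort still behaves like the identity job.
        output.collect(key, value);
    }
}

// In TeraSort.run(), before JobClient.runJob(job):
//     job.setMapperClass(MyTeraSortMapper.class);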
Thomas's answer is right, i.e. the mapper and reducer are the identity ones, since the shuffled data is sorted before your reduce function is applied. What's special about TeraSort is its custom partitioner (which is not the default hash function). You should read more about it in Hadoop's implementation of TeraSort, which states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
