Which files are ignored as input by mapper? - hadoop

I'm chaining multiple MapReduce jobs and want to pass along/store some meta information (e.g. the configuration or the name of the original input) with the results. At least the file "_SUCCESS" and anything in the directory "_logs" seems to be ignored.
Are there any filename patterns which are by default ignored by the InputReader? Or is this just a fixed limited list?

The FileInputFormat uses the following hiddenFileFilter by default:
private static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
};
So if you use any FileInputFormat (such as TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat), hidden files (files whose names start with "_" or ".") are ignored.
You can use FileInputFormat.setInputPathFilter to set your own custom PathFilter. Remember that the hiddenFileFilter is always active on top of it.
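For illustration, a minimal sketch of registering a custom filter with the new (mapreduce) API; the filter class name and the ".tmp" rule are assumptions, not something from the question:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical filter that additionally skips files ending in ".tmp";
// the built-in hiddenFileFilter still applies on top of it.
public class NoTmpFilesFilter implements PathFilter {
    @Override
    public boolean accept(Path p) {
        return !p.getName().endsWith(".tmp");
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "filtered-input");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileInputFormat.setInputPathFilter(job, NoTmpFilesFilter.class);
        // ... set mapper, reducer, output path, etc. as usual ...
    }
}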

How to change the program path in Ghidra

I have a Ghidra project with an imported binary file, which was created on computer A, and I want to move this project to computer B. However, the path of the binary file isn't the same as on A. How do I change the path setting in Ghidra?
Edit: error message screenshot (the blacked-out blocks are the original path on computer A).
It seems that Ghidra uses the information from currentProgram.getExecutablePath() which takes the value from the options stored with the binary information inside the project:
Code snippet from ghidra.program.database.ProgramDB:
@Override
public String getExecutablePath() {
    String path = null;
    Options pl = getOptions(PROGRAM_INFO);
    path = pl.getString(EXECUTABLE_PATH, UNKNOWN);
    return path == null ? UNKNOWN : path;
}

@Override
public void setExecutablePath(String path) {
    Options pl = getOptions(PROGRAM_INFO);
    pl.setString(EXECUTABLE_PATH, path);
    changed = true;
}
To change this you should be able to simply use the corresponding setExecutablePath method, e.g. by running
currentProgram.setExecutablePath("/new/path/to/binary.elf")
inside the Python REPL.
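Equivalently, a small Java GhidraScript should work as a sketch (the script name is arbitrary and the path is a placeholder; when run from the Script Manager the change is made inside the script's transaction, so save the program afterwards):
import ghidra.app.script.GhidraScript;

// Hypothetical helper script: updates the stored executable path of the
// current program. Replace the path below with the binary's location on
// computer B.
public class UpdateExecutablePathScript extends GhidraScript {
    @Override
    protected void run() throws Exception {
        println("Old executable path: " + currentProgram.getExecutablePath());
        currentProgram.setExecutablePath("/new/path/to/binary.elf");
        println("New executable path: " + currentProgram.getExecutablePath());
    }
}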

SuperCSV with null delimiter

I'm creating a file that isn't really a CSV file, but SuperCSV can make creating it easier. Each line of the file has a different length, following a layout that doesn't separate the individual fields. So, to know which information is in a line, you have to look at the first 2 characters (the name of the register), count characters, and extract each field by its size.
I've configured SuperCSV to use an empty delimiter; however, the created file uses a space where it should have nothing.
public class TarefaGerarArquivoRegistrosFiscais implements ITarefa {

    private static final CsvPreference FORMATO_ANEXO_IV =
            new CsvPreference.Builder('"', '\0', "\r\n").build();

    private ICsvListWriter writer;

    public void processar() {
        try {
            writer = new CsvListWriter(getFileWriter(), FORMATO_ANEXO_IV);
            writer.write(geradorRegistroU1.gerar());
        } finally {
            if (writer != null)
                writer.close();
        }
    }
}
Am I doing something wrong? Is '\0' the correct code for a null char?
It's probably not what you want to hear, but I wouldn't recommend using Super CSV for this (and I'm a committer!). Its sole purpose is to deal with delimited files - and you're not using delimiters.
You could misuse Super CSV by creating a wrapper object (containing your List) whose toString() method simply concatenates all of the values together, then passing that single object to writer.write(), but it's an awful hack.
I'd recommend either finding another library more suited to your problem, or writing your own solution.
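If you go the "write your own" route, a minimal sketch of a fixed-width record writer could look roughly like this (the class name, field widths, and line terminator are illustrative assumptions, not part of the question):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

// Minimal fixed-width writer sketch: each field is padded to a fixed size and
// fields are concatenated with no delimiter between them.
public class FixedWidthWriter implements AutoCloseable {

    private final BufferedWriter out;

    public FixedWidthWriter(String path) throws IOException {
        this.out = new BufferedWriter(new FileWriter(path));
    }

    public void writeRecord(String registerName, List<String> fields, int[] widths) throws IOException {
        StringBuilder line = new StringBuilder(registerName);
        for (int i = 0; i < fields.size(); i++) {
            // pad each field to its fixed width (truncation is not handled in this sketch)
            line.append(String.format("%-" + widths[i] + "s", fields.get(i)));
        }
        out.write(line.toString());
        out.write("\r\n");
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}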

How does Mapper class identify the SequenceFile as inputfile in hadoop?

In one of my MapReduce tasks, I extend BytesWritable as KeyBytesWritable and ByteWritable as ValueBytesWritable. Then I output the result using SequenceFileOutputFormat.
My question is: when I start the next MapReduce task, I want to use this SequenceFile as the input file. So how could I set the job class, and how could the Mapper class identify the key and value types in the SequenceFile (the custom classes I defined above)?
I understand that I could use SequenceFile.Reader to read the key and value:
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    // process key and value here
}
But I don't know how to use this Reader to pass the key and value into the Mapper class as parameters. How could I set conf.setInputFormat to SequenceFileInputFormat and then let the Mapper get the keys and values?
Thanks
You do not need to manually read the sequence file. Just set the input format class to sequence file:
job.setInputFormatClass(SequenceFileInputFormat.class);
and set the input path to the directory containing your sequence files:
FileInputFormat.setInputPaths(job, <path to the dir containing your sequence files>);
You will need to make sure the parameterized (key, value) types of your Mapper class match the (key, value) types stored inside your sequence file.
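A minimal sketch of what that looks like, assuming the sequence file was written with the custom KeyBytesWritable/ValueBytesWritable classes from the question (the driver and mapper names and the Text output types are illustrative):
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SequenceFileDriver {

    // The mapper's input key/value types must match what was written into the sequence file.
    public static class SeqMapper extends Mapper<KeyBytesWritable, ValueBytesWritable, Text, Text> {
        @Override
        protected void map(KeyBytesWritable key, ValueBytesWritable value, Context context)
                throws IOException, InterruptedException {
            // The framework deserializes key and value from the sequence file for you.
            context.write(new Text(key.toString()), new Text(value.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-sequence-file");
        job.setJarByClass(SequenceFileDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(SeqMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}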

Multiple Custom Writable formats

I have multiple input sources, and I have used Sqoop's codegen tool to generate a custom class for each input source:
public class SQOOP_REC1 extends SqoopRecord implements DBWritable, Writable
public class SQOOP_REC2 extends SqoopRecord implements DBWritable, Writable
On the Map side, based on the input source, I create objects of the above 2 classes accordingly.
I have the key as type "Text" and since I have 2 different types of values, I kept the value output type as "Writable".
On the reduce side, I accept the value type as Writable.
public class SkeletonReduce extends Reducer<Text, Writable, Text, Text> {
    public void reduce(Text key, Iterable<Writable> values, Context context)
            throws IOException, InterruptedException {
    }
}
I also set
job.setMapOutputValueClass(Writable.class);
During execution, it does not enter the reduce function at all.
Could someone tell me if it is possible to do this? If so, what am I doing wrong?
You can't specify Writable as your output type; it has to be a concrete type. All records need to have the same (concrete) key and value types, in Mappers and Reducers. If you need different types you can create some kind of hybrid Writable that contains either an "A" or "B" inside. It's a little ugly but works and is done a lot in Mahout for example.
But I don't know why any of this would make the reducer not run; this is likely something quite separate and not answerable based on this info.
Look into extending GenericWritable for your value type. You need to define the set of classes which are allowed (SQOOP_REC1 and SQOOP_REC2 in your case). It's not as efficient, because it creates new object instances in the readFields method (but you can override this if you have a small set of classes: just keep instance variables of both types and a flag which denotes which one is valid).
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/io/GenericWritable.html
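A hedged sketch of such a GenericWritable subclass, using the two Sqoop-generated record classes named in the question (the wrapper class name is an assumption):
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;

// Wrapper value type that allows either SQOOP_REC1 or SQOOP_REC2 as the
// map output value.
public class SqoopRecordWritable extends GenericWritable {

    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] TYPES =
            (Class<? extends Writable>[]) new Class[] {
                SQOOP_REC1.class,
                SQOOP_REC2.class
            };

    public SqoopRecordWritable() {
        // no-arg constructor required for Hadoop serialization
    }

    public SqoopRecordWritable(Writable instance) {
        set(instance);
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return TYPES;
    }
}
The mapper would then emit new SqoopRecordWritable(record), and the job would use job.setMapOutputValueClass(SqoopRecordWritable.class).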
OK, I think I figured out how to do this, based on a suggestion given by Doug Cutting himself:
http://grokbase.com/t/hadoop/common-user/083gzhd6zd/multiple-output-value-classes
I wrapped the class using ObjectWritable
ObjectWritable obj = new ObjectWritable(SQOOP_REC2.class, sqoop_rec2);
And then on the reduce side, I can get the name of the wrapped class and cast it back to the original class:
if (val.getDeclaredClass().getName().equals("SQOOP_REC2")) {
    SQOOP_REC2 temp = (SQOOP_REC2) val.get();
}
And don't forget
job.setMapOutputValueClass(ObjectWritable.class);
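Put together, the reduce side could look roughly like this (a sketch only; the reducer name and Text output types are assumptions, and the dispatch mirrors the snippet above):
import java.io.IOException;

import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Unwrap each ObjectWritable and dispatch on the wrapped class.
public class UnwrapReducer extends Reducer<Text, ObjectWritable, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<ObjectWritable> values, Context context)
            throws IOException, InterruptedException {
        for (ObjectWritable val : values) {
            if (val.getDeclaredClass().getName().equals("SQOOP_REC2")) {
                SQOOP_REC2 rec2 = (SQOOP_REC2) val.get();
                // ... handle a SQOOP_REC2 value ...
            } else {
                SQOOP_REC1 rec1 = (SQOOP_REC1) val.get();
                // ... handle a SQOOP_REC1 value ...
            }
        }
    }
}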

Partitioner of Hadoop for first two words of key

When I run Hadoop streaming, the mapper outputs (Key, Value) pairs.
The key is a word sequence separated by whitespace.
I'd like to use a partitioner that returns the hash value of the first two words.
So I implemented it as:
public static class CounterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String[] line = key.toString().split(" ");
        String prefix = (line.length > 1) ? (line[0] + line[1]) : line[0];
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
My question is:
is there a way to achieve the same thing with the built-in Hadoop library, by modifying configuration properties such as
mapred.output.key.comparator.class
stream.map.output.field.separator
stream.num.map.output.key.fields
map.output.key.field.separator
mapred.text.key.comparator.options
...
Thanks in advance.
The built-in Hadoop library is Java-based, while the purpose of streaming is to use languages other than Java that talk over STDIN/STDOUT.
I don't see the purpose of changing the streaming-related properties through the Hadoop API, which is built in Java.
BTW, Configuration#set can be used to set configuration properties, besides setting them in the configuration files and on the command line.
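For completeness, a minimal sketch of setting such properties programmatically via Configuration#set (the property names are the ones listed in the question; whether they have the desired effect in your particular streaming setup is an assumption):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class StreamingConfExample {
    public static void main(String[] args) throws Exception {
        // Set key-field related properties in code instead of on the
        // command line or in configuration files.
        Configuration conf = new Configuration();
        conf.set("stream.num.map.output.key.fields", "2");
        conf.set("map.output.key.field.separator", " ");
        Job job = Job.getInstance(conf, "partition-on-first-two-words");
        // ... configure mapper, reducer, and paths as usual ...
    }
}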
