InputFormat Decision - hadoop

I am trying to figure out which of the given answers best suits the question:
Given a directory of files where each line has the following structure: line
number, tab character, string.
Example (the tab character is not visible below):
1abialkjfjkaoasdfjksdlkjhqweroij
2kadfjhuwqounahagtnbvaswslmnbfgy
3kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which
InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?
A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueFileInputFormat
D. BDBInputFormat
My analysis:
Option A is a format that does exist, but I'm not sure how it is used or whether it fits as an answer here.
Option B is not possible, since SequenceFiles are files of binary (K,V) pairs and thus not suitable for plain text input.
Option C is not possible because there is no KeyValueFileInputFormat; though if it is a typo and it is actually KeyValueTextInputFormat, then I think it would be a good choice. Or wouldn't it?
Option D is not possible because there is no BDBInputFormat, and even if it were a typo for DBInputFormat, it wouldn't suit this case.
Thank you!

The answer is Option C; it is likely a typo.
KeyValueTextInputFormat splits each line at the tab character,
so the line number will be the key and the string will be the value.

It is probably a typo in option C, as you guessed, and it should be KeyValueTextInputFormat (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html).
See for more details: How to specify KeyValueTextInputFormat Separator in Hadoop-.20 api?
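For reference, here is a minimal sketch of how the completed line would sit in a driver using the old mapred API that the question's conf.setInputFormat(...) call implies (job name and paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class LineRecords {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LineRecords.class);
        conf.setJobName("line-records");  // placeholder job name

        // Splits each line at the first tab: key = line number, value = string.
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // set mapper/reducer classes here, then JobClient.runJob(conf);
    }
}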

Related

Logic to compare rows in pig

I need logic for the below scenario, which needs to be implemented using Pig scripts. Can anyone help with some ideas on how to do this?
The input contains a column groupName with some values like 'others' and 'unknown'. These values need to be replaced by the previous record's groupName.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, 126,unknown should be replaced with 125's group sale0001; 130,others should be replaced by 129's sale0002; and 132,unknown, 133,unknown, and 134,others should be replaced with 131's casc0004.
--Edit--
I tried the LEAD function in Pig, but it compares only a fixed number of rows at a time, so it cannot solve this completely.
Another approach that works, though I am looking for a more optimized one:
- COGROUP the same data set with itself (Dataset and Dataset_self)
- FILTER on Dataset.id == Dataset_self.id OR Dataset_self.groupName == 'others' OR Dataset_self.groupName == 'unknown'
- GENERATE idDiff as (Dataset_self.id - Dataset.id), and CASE WHEN id == id THEN (id, group) ELSE (id_self, group)
- FOREACH (GROUP id) {
    ordered = ORDER BY id, diff, group;
    limited = LIMIT ordered 1;
    GENERATE limited;
  }
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to trace back to the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, as they have LEAD and LAG functions readily available which could be used to do exactly what you need.
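If you do write a MapReduce job for this anyway, here is a minimal sketch of those two knobs, assuming the newer mapreduce API (the subclass name is made up for illustration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Forces the whole file into a single split, so the "previous record"
// is always available to the same mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

// In the driver:
// job.setInputFormatClass(NonSplittableTextInputFormat.class);
// job.setNumReduceTasks(1);  // a single reducer preserves global order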

How to handle the Nominal Data by Weka J48

When I ran Weka's J48 with the binary split option, the following decision tree was built:
http://www.fastpic.jp/viewer.php?file=2693704973.jpg
The input explanatory variable is one nominal attribute, formed from question id + answer id: one nominal value per transaction.
I'm wondering why the tree branches only to one side.
Is it caused by my data set, my table definition, or the binary split method itself?
I'd like the tree to have nodes on both sides.
If there is such an option, please show me.
Sample data:
usr,qa,class
A,11,1
A,21,1
A,31,1
B,12,2
B,22,2
B,32,2
C,13,3
C,23,3
C,33,3
D,11,4
D,22,4
D,31,4
E,11,1
E,23,1
E,31,1
F,12,2
F,22,2
F,33,2
G,13,3
G,22,3
G,32,3
H,12,4
H,21,4
H,33,4
There's no error in the tree built, and no option would really modify it. If your question is related to your Akinator project, please reformat your data to get all questions (i.e. 11, 21, 31) on the same instance/line and the answer as the target class.
PS: if you import those data as CSV, Weka will treat them as numeric (not as nominal). You should then add a non-digit character (e.g. #1, #2, #3...) so that Weka will treat them as nominal.
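For what it's worth, here is a sketch of that workflow with Weka's Java API (the file name is a placeholder; the NumericToNominal filter is one way to turn digit-only CSV columns into nominal attributes instead of editing the data):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class J48BinarySplitDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("sample.csv");  // placeholder path

        // CSV import treats digit-only columns as numeric,
        // so convert every attribute to nominal first.
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("first-last");
        toNominal.setInputFormat(data);
        Instances nominal = Filter.useFilter(data, toNominal);
        nominal.setClassIndex(nominal.numAttributes() - 1);

        J48 tree = new J48();
        tree.setBinarySplits(true);  // the binary split option from the question
        tree.buildClassifier(nominal);
        System.out.println(tree);
    }
}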

Generate code in code_128 format from another code

I have a list of the following codes and their corresponding codes in CODE_128 format. Given a string, I want to be able to generate the corresponding code in CODE_128 format. Based on this list, how could I generate a CODE_128 number for the string A4Y9387VY34, for example?
code code in code_128
A4Y9387VY34 ????
ADN38Y644YT7 9611019020018632869509
AXCW99QYTD34 9622019021500078083444
A9YQC44W9J3K 9611083009710754539701
AT8V7T3G3874 9622083021255845940154
A7K444N4FKB8 9622083033510467186874
AYCHFW448HTQ 9611005019246067403120
AY63CWBMTDCC 9622005028182439033426
ANY7TF46NGQ3 9622005031345848081170
AYY48TBVQ3FH 9611200003793988696055
AT8Q4CF4DQ9Q 9611200021606968867090
A764WYQFJWTT 9622200022706968919275
AC649ND7N8B6 9622148007265209832185
A4VDPTJ99YN4 9611148013412173923039
AHDYK498BD6T 9622148021309216149530
A4YYYNY7C3DJ 9611017021934363499071
AYG6XWVCCQ89 9622017031009914238743
A68YJHGQKCCM 9622017031138587166053
APMB7XG9XQC9 9611021011608391750002
AGP8C44Y8VYK 9622021021608111646113
A7C68B9T69XB 9622021021958603678086
AJYYWKR6BDGN 9611010022528724015883
AKMNVXDT9PYN 9622010027475034102229
AXPXMK9QMDFD 9622010031475028243694
I read a lot about it, but I didn't come to any solution. Thanks in advance!!
Well, this is a pretty open question, so I will give you my suggestions:
- If it is a finite list, you can use a hash or a dictionary where the keys are the codes and the values are the corresponding CODE_128 strings (see the sketch below).
- Some scanners have software installed that allows you to change what has been read to a new value, format it, etc.
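For the first suggestion, a minimal sketch in Java (the pairs are abbreviated from your table):

import java.util.HashMap;
import java.util.Map;

public class CodeLookup {
    public static void main(String[] args) {
        // Known code -> CODE_128 value pairs taken from the list above.
        Map<String, String> code128 = new HashMap<>();
        code128.put("ADN38Y644YT7", "9611019020018632869509");
        code128.put("AXCW99QYTD34", "9622019021500078083444");
        // ... remaining pairs from the table ...

        String code = "ADN38Y644YT7";
        String value = code128.get(code);  // null if the code is unknown
        System.out.println(code + " -> " + value);
    }
}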
If you need more insight, please give us more detail about the environment you are using.
Hope that helps,
I decided to create a new answer because now I get your point. If you are talking about a GS1-128 code (please see www.gs1.org), do not start without visiting the Wikipedia page about it. As you can see there, there is a thorough explanation of how to work with that type of code: it is composed of several application identifiers, each followed by its corresponding value. There is also a better way of encoding them, using special characters such as parentheses. Here is other info that may help you.
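As an illustration only (the data string and the AI meanings in the comments are made-up examples, not derived from your table), parsing the parenthesized human-readable form could look like this:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Gs1Parse {
    public static void main(String[] args) {
        // Hypothetical human-readable GS1-128 data: (AI)value pairs.
        String data = "(01)09501101530003(17)260102";
        Pattern p = Pattern.compile("\\((\\d{2,4})\\)([^(]+)");
        Matcher m = p.matcher(data);
        Map<String, String> fields = new LinkedHashMap<>();
        while (m.find()) {
            fields.put(m.group(1), m.group(2));  // application identifier -> value
        }
        System.out.println(fields);  // e.g. {01=09501101530003, 17=260102}
    }
}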
Hope it helps,

Getting output files which contain the value of one key only?

I have a use case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply output each value in the iterator. For example, here's some Python streaming code:
import sys

for line in sys.stdin:
    # each input line is "key<tab>value"; emit only the value
    data = line.rstrip("\n").split("\t")
    print data[1]
This method works for a small dataset (around 4 GB): each output file of the job contains the values for one key only.
However, if I increase the size of the dataset (over 40 GB), then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order, and I could simply do a sequential scan and append to files. But it seems that this shouldn't be necessary, since Hadoop sorts and partitions the keys for you.
The question may not be the clearest, so I'll clarify if anyone has any comments. Thanks.
OK, then create a custom JAR implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the file name (in your case the key) as the key in your reducer and the entire payload as the value; your data will then be written to a file named after your key.
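A minimal sketch of that, assuming Text keys and values (the class name is a placeholder):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to an output file named after its key, so each
// file ends up holding the values of exactly one key.
public class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString();
    }
}

// In the driver (old mapred API):
// conf.setOutputFormat(KeyBasedOutput.class);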

Pass data from workspace to a function

I created a GUI and used uiimport to import a dataset into the MATLAB workspace. I would like to pass this imported data to another function in MATLAB. How do I pass this imported dataset to another function? I tried using diz, but it couldn't pick up diz; it doesn't pick up the data in the MATLAB workspace. Any ideas?
[file_input, pathname] = uigetfile( ...
{'*.txt', 'Text (*.txt)'; ...
'*.xls', 'Excel (*.xls)'; ...
'*.*', 'All Files (*.*)'}, ...
'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = freed(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of Matlab, which suggests to me that you would find it valuable to read some of the on-line help and some of the documentation for Matlab. When you've done that you'll probably find neater and quicker ways of doing this.
EDIT: Well, @Tim, if all else fails, RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So you either use uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
In your script you have three ways to read the file. Choose one of them depending on your file format. But first, I would combine the file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since the user can change the way the data is read, and the variable name depends on the file name and the user.
With DLMREAD you can only read numerical data from the file. You can also skip a number of rows or columns, e.g.
M = dlmread(file_input,'\t',1,1);
skips the first row and the first column on the left.
Alternatively, you can define a range in Excel style. See the DLMREAD documentation for more details.
The file name you pass to DLMREAD must be a string. Don't pass a file handle or any data, or you will get the error "Filename must be a string".
FREAD reads data from a binary file; see its documentation if you really have to use it.
There are many other functions for reading data from files. If you still have problems, show us an example of your file format so we can suggest the best way to read it.
