Implementation issue in Cascading while reading data from HDFS - hadoop

Suppose I have these files in an HDFS directory:
500/Customer/part-001
500/Customer/part-002
500/Customer/part-003
Is it possible to check which part file a tuple is coming from?
Note: I have researched this but found nothing.

Your question is not very clear.
Let's say your output has the following layout and the delimiter is ';':
id;name;age
1;Jordan;22
2;Nathan;33
and so on.
You could use awk or grep (or both) to find the record.
For example, if you want to search for the record Nathan, try:
grep -r "Nathan" part*
The command above searches for the string "Nathan"; if the string is present in any part file, the first entry (word) of each output line will be the name of that file.
If you don't want the file name, you can use:
grep -hr "Nathan" part*
Please be clearer when asking a question.

I found out how to determine which part file each tuple comes from. I solved my problem using the code below:
String fileName = flowProcess.getProperty("cascading.source.path").toString();
Thanks,

Related

From all the files whose names are composed of 4 letters, which ones contain the string “user” in their content?

I have to answer this question as an exercise.
Sample input: no sample; I am just trying to select and filter files using the Unix shell according to some conditions.
Sample output: a list of files whose names are composed of 4 letters and which contain the string “user” in their content.
I tried to use the basename command to get the file name of some files, then tried to combine it with wc by doing, for example, basename /etc/ | wc -c.
Finally, I tried grep user file_test.txt on an arbitrary file to see whether it contains the word "user".
I am trying to combine all the required commands to answer the question.
I am supposed to use substitutions, which I am not used to.
Could someone please help me?
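The attempts above can be combined into a single command: a shell glob selects the four-character names, and grep filters by content. A minimal sketch, assuming the files live in the current directory:

```shell
# ???? matches file names of exactly four characters;
# grep -l prints only the names of the files that contain the string.
grep -l "user" ????
```

Strictly speaking, ???? matches any four characters, not just letters; matching four letters only would need a pattern such as [a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z].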

Delete a string in a file using bash script

We have a file which has unformatted XML content in a single line:
<sample:text>Report</sample:text><sample:user name="11111111" guid="163g673"/><sample:user name="22222222" guid="aknen1763y82bjkj18"/><sample:user name="33333333" guid="q3k4nn5k2nk53n6"/><sample:user name="44444444" guid="34bkj3b5kjbkq"/><sample:user name="55555555" guid="k4n5k34nlk6n711kjnk5253"/><sample:user name="66666666" guid="1n4k14nknl1n4lb1"/>
If we find a particular string, say "22222222", I want to remove the entire element that surrounds the matched string. In our case the whole portion around 22222222, i.e. <sample:user name="22222222" guid="aknen1763y82bjkj18"/>, should be removed and the file saved.
How can we do it? Please help.
You can do it with the sed utility by invoking it like this:
sed -i -e 's/<[^<]*"22222222"[^>]*>//' file
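A quick sanity check of that pattern (the file name sample.xml is hypothetical, holding a shortened version of the one-line XML from the question):

```shell
# Create a sample file, run the substitution in place, then inspect it.
printf '%s' '<sample:text>Report</sample:text><sample:user name="22222222" guid="aknen1763y82bjkj18"/><sample:user name="33333333" guid="q3k4nn5k2nk53n6"/>' > sample.xml
sed -i -e 's/<[^<]*"22222222"[^>]*>//' sample.xml
cat sample.xml
```

After the edit, only the <sample:text> element and the remaining <sample:user> elements are left. Note that GNU sed's -i edits in place; on BSD/macOS sed the flag needs an explicit (possibly empty) backup suffix, e.g. sed -i ''.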

How to remove a string from a text file in shell command?

I have a file named non.txt and want to write a shell script to remove a string from the entire file. The file contains the following data:
24321,247,654,"^A","91350","JEFFR2",21714,,1,243,654,"^A","91350","JEFFR2",21714,,1,654,0,"P","N","1140828","CA",,,,,"06037","C016","14","7",0,"21714 JEFFERS LN","","SANTA CLARITA","CA","913503917","","","","20140828"
From the above data I want to remove every occurrence of "^A".
Please help me find a solution.
Try this:
sed -i 's/"^A"//g' non.txt
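A small demonstration of that substitution on a shortened copy of the data (the file is created on the spot; this assumes "^A" is the literal two characters caret and A, which works because in a basic regular expression ^ is only an anchor at the start of the pattern and matches itself elsewhere):

```shell
# Create a sample line, remove every "^A" in place, then show the result.
# The removed field leaves an empty slot between the commas.
printf '%s\n' '24321,247,654,"^A","91350","JEFFR2"' > non.txt
sed -i 's/"^A"//g' non.txt
cat non.txt
```

If the file actually contains the control character Ctrl-A (which many tools display as ^A), match that byte instead, e.g. with bash's $'\x01' quoting.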

Shell script file takes partial path from parameter file

I have a parameter file (parameter.txt) which contains the following:
SASH=/home/ec2-user/installers
installer=/home/hadoop/path1
And my shell script (temp_pull.sh) is as below:
EPATH=`cat $1|grep 'SASH' -w| cut -d'=' -f2`
echo $EPATH
${EPATH}/data-integration/kitchen.sh -file="$KJBPATH/hadoop/temp/maxtem/temp_pull.kjb"
When I run my temp_pull.sh like below:
./temp_pull.sh parameter.txt
$EPATH gives me the correct path, but the 3rd line of the script uses only a partial path.
The error output is pasted below:
/home/ec2-user/installers --> output of the 2nd line
/data-integration/kitchen.sh: No such file or directory**2-user/installer** --> output of the 3rd line
There is no need to manually parse the values in the file, because it already contains data in the format in which shell variables are defined: var=value.
Hence, if the file is safe enough, you can source it so that the SASH value will be available simply as $SASH.
Then, you can use the following:
source "$1" # source the file given as first parameter
"$SASH"/data-integration/kitchen.sh -file="$KJBPATH/hadoop/temp/maxtem/temp_pull.kjb"
The problem was that the file we were using had been copied from Windows to Unix, so the Windows line endings were the root cause.
By running dos2unix on the parameter file we were able to fix the issue.
Command:
dos2unix paramfile.txt
This converts all Windows (CRLF) line endings to Unix (LF) format.
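When dos2unix is not installed, stripping the carriage returns with tr achieves the same thing. A minimal sketch (the file name paramfile.txt and its single line are illustrative):

```shell
# Write a file with Windows (CRLF) line endings, then delete the \r characters.
printf 'SASH=/home/ec2-user/installers\r\n' > paramfile.txt
tr -d '\r' < paramfile.txt > paramfile.unix.txt
mv paramfile.unix.txt paramfile.txt
cat paramfile.txt
```

Unlike sed -i, tr cannot edit in place, hence the temporary file and the mv back over the original.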

grep search string and copy line from tsv to another file

I have a TSV file with thousands of tab-delimited lines, and I need to search for someone's name and then copy the entire matching line to a separate file, over and over. Can anyone help? Thanks!!
The question is vague, but this should generally work:
grep "some-name" *.tsv > output
It sounds very simple:
grep "someone's name" tsv-file > separate-file
What's the catch? Is the name in one or two fields? Middle initials?
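If the name needs to match a specific tab-separated column exactly, rather than appear anywhere on the line, awk is safer than grep. A sketch under the assumption that the second column holds the name (the file name data.tsv is hypothetical):

```shell
# -F '\t' splits fields on tabs; print lines whose second field is exactly the name.
awk -F '\t' '$2 == "Nathan"' data.tsv > separate-file
```

The exact-equality test avoids the partial matches grep would produce, e.g. "Nathan" matching inside "Nathaniel".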
