Splitting file1 using AWK and then naming the new files based on lines from file2 - bash

File 1 has long strings of data between <PhotoField1> markers, repeated multiple times.
Example:
<PhotoField1>alidkfjaeijwoeij<PhotoField1>akdfjalskdfasd<PhotoField1>
File 2 has a list of IDs that I want to use to name the files.
Example:
A00565415
A00505050
A54531245
I have an AWK command that writes each string between <PhotoField1> markers in File1 into its own file, but it only names the files temp followed by a number:
awk -v RS="<PhotoField1>" '{ print $0 > "temp" NR }' File1.xml
I need to replace the temp* part with a line from the second file,
so the new files would be named A00565415, A00505050, A54531245, etc.
- It would be excellent if I could also add .txt to the end of the file names: A54531245.txt
The awk command works great for separating the data into different files, but I need to be able to name them based on the File2 list.

awk 'NR==FNR{fname[NR]=$0".txt";next} {print > fname[FNR]}' File2.list RS="<PhotoField1>" File1.xml

You can use this awk:
awk -v RS="<PhotoField1>|\n" 'FNR==NR{a[NR]=$0; next}
NF{ print $0 > a[FNR] ".txt" }' file2 file1
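Both answers above index the name array by FNR, so an empty record (for example the one in front of a leading <PhotoField1>) still uses up a name. Below is a minimal sketch of a variant that counts only non-empty chunks. It assumes the data in File1.xml sits on a single line, as in the example. The counter k and the arrays name and part are my own names, and split() is used instead of a multi-character RS, which not every awk supports.

```shell
# Sample inputs copied from the question.
printf '%s\n' '<PhotoField1>alidkfjaeijwoeij<PhotoField1>akdfjalskdfasd<PhotoField1>' > File1.xml
printf '%s\n' A00565415 A00505050 A54531245 > File2.list

awk '
NR == FNR { name[NR] = $0; next }           # first file: remember the IDs in order
{
    n = split($0, part, "<PhotoField1>")    # cut the line at every marker
    for (i = 1; i <= n; i++) {
        if (part[i] == "") continue         # skip empty pieces around the markers
        k++                                 # count only real chunks
        if (!(k in name)) exit              # assumption: stop when the IDs run out
        out = name[k] ".txt"
        print part[i] > out
        close(out)                          # keep the number of open files small
    }
}' File2.list File1.xml
```

With the sample data this writes A00565415.txt and A00505050.txt; the third ID stays unused because there are only two chunks.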

Related

How to save the "awk" command output files into a new directory

I have a directory "d_files" containing a text file with many columns. I used an awk command to split the text file based on column 16.
I ran the following command inside the directory "d_files":
awk 'NR==1{ h=$0 }NR>1{ print (!a[$16]++? h ORS $0 : $0) > substr($16, 1, 15)".txt" }' input_file
This gave me all the required files, but I want them to be written to a different directory. I tried the following, but it didn't work:
awk 'NR==1{ h=$0 }NR>1{ print (!a[$16]++? h ORS $0 : $0) ("mkdir split_files")> split_files/substr($16, 1, 15)".txt" }' input_file
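A minimal sketch of one way this could be done, assuming the directory can be created from the shell before awk runs and the output path is then built as a string inside awk. The variables outdir and col and the small sample input are my own illustration; the sample has only two columns, so it splits on column 2 rather than 16.

```shell
# Hypothetical sample: a header plus three rows, split on column 2.
printf '%s\n' 'id group' '1 alpha' '2 beta' '3 alpha' > input_file

mkdir -p split_files                        # create the target directory first

# Pass col=16 instead of col=2 for the file described in the question.
awk -v outdir="split_files" -v col=2 '
NR == 1 { h = $0; next }                    # remember the header line
{
    out = outdir "/" substr($col, 1, 15) ".txt"
    if (!seen[out]++) print h > out         # header goes first into each new file
    print > out
}' input_file
```

The file name has to be a string expression, which is why split_files/substr(...) without quotes does not work in the attempt above.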

ksh shell script to print and delete matched line based on a string

I have 2 files like the ones below. I need a script that looks up each string from File2 in File1, deletes the lines of File1 that contain it, and writes what is left to another file (Output1.txt). It should also print the deleted lines, and write each string that doesn't exist in File1 to a second file (Output2.txt).
File1:
Apple
Boy: Goes to school
Cat
File2:
Boy
Dog
I need output like below.
Output1.txt:
Apple
Cat
Output2.txt:
Dog
Can anyone help, please?
If you have awk available on your system:
awk -v FS='[ :]' 'NR==FNR{a[$1]}NR>FNR&&!($1 in a){print $1}' File2 File1 > Output1.txt
awk -v FS='[ :]' 'NR==FNR{a[$1]}NR>FNR&&!($1 in a){print $1}' File1 File2 > Output2.txt
The script stores the first field ($1) of each line of the first file argument as a key in the array a.
If the first field of a line in the second file is not a key of the array, that field is printed.
Note that the field delimiter is either a space or a colon.
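The two commands above print only $1, which happens to equal the whole line for Apple and Cat. Here is a sketch of a single-pass variant that keeps whole lines, assuming the same File1 and File2 as in the question; the array names want, hit and order are my own.

```shell
# Sample inputs copied from the question.
printf '%s\n' 'Apple' 'Boy: Goes to school' 'Cat' > File1
printf '%s\n' 'Boy' 'Dog' > File2

awk -F'[ :]' '
NR == FNR { want[$1]; order[++n] = $1; next }   # File2: strings to look for
($1 in want) { hit[$1] = 1; next }              # File1 line matches: drop it
{ print > "Output1.txt" }                       # File1 line survives
END {
    for (i = 1; i <= n; i++)                    # strings never seen in File1
        if (!(order[i] in hit)) print order[i] > "Output2.txt"
}' File2 File1
```

Like the original answer, this compares only the first field of each File1 line, so a string that appears later in a line would not count as a match.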

Redirect output of one command to different files in shell script

I have a tab-separated string.
I want to copy the first column to one file and the remaining columns to another file in one go, because the string can change in between if I use two separate commands.
I tried:
tab_seperated_string | awk -F"\t" '{ print $2"\t"$3"\t"$4"\t"$5} {print $1}'
Columns 2, 3, 4 and 5 should go to one file and column 1 should go to another file.
You can do it like this:
tab_seperated_string | awk -F"\t" '{print $2,$3,$4,$5 > "file2"; print $1 > "file1"}' OFS="\t"
This saves the data to two different files.
By setting OFS to \t, you do not need all the \t in the print statement.
Here is another way, useful if many fields go to one file and the first field goes to another:
awk -F"\t" '{print $1 > "file1"; sub(/[^\t]+\t/,""); print $0 > "file2"}' OFS="\t"
The sub(/[^\t]+\t/,"") removes the first field and the first tab.
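To try the first variant without the real input, here is a small sketch in which printf stands in for tab_seperated_string; the sample values a to e are my own.

```shell
# printf stands in for the command that produces the tab-separated string.
printf 'a\tb\tc\td\te\n' |
awk -F'\t' -v OFS='\t' '{
    print $1 > "file1"                  # first column
    print $2, $3, $4, $5 > "file2"      # remaining columns, joined by OFS
}'
```

Because both print statements run inside a single awk process, each input line is read only once, which is what the question asks for.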

awk: Remove duplicates and create a new csv file

I have the following CSV file:
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en
I want to remove the duplicates based on the value of the first column. If there is more than one record with the same value, I want to keep only one of them in the new file.
I started with the following, which does find the duplicates, but I want to create a new file instead of just printing:
sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS=","
Not 100% sure what you want, but this keeps only the last record when several share the same first field:
awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}' file > newfile
cat newfile
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
1393036,293296,68,59,Mithridates,ny,io
353,291765,434,434,Lar,ny,en
10155431,14595886,1807,135860,Riemogerz,ny,id
19332,7401441,296,352647,WikiDreamer,ny,fr
If it is not important which record is kept, as long as field 1 is unique,
this keeps the first hit when several are equal:
awk -F, '!a[$1]++' file > newfile
cat newfile
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
To get the duplicated keys into a new file:
awk -F, '++a[$1]==2 {print $1}' file > newfile
cat newfile
1536088
353
This will show only the first entry for a given first column value:
awk -F, '!(seen[$1]++)' file > newfile
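The for (i in a) loop in the keep-last version visits the keys in an unspecified order, which is why that output comes out shuffled. Below is a sketch that still keeps the last record per key but prints in first-seen order. It uses four rows from the question as a sample, and the arrays last and keys are my own names.

```shell
# Four rows taken from the question; key 353 appears twice.
printf '%s\n' \
    '1393036,293296,68,59,Mithridates,ny,io' \
    '353,291765,434,434,Lar,ny,en,en-N' \
    '19332,7401441,296,352647,WikiDreamer,ny,fr' \
    '353,291765,434,434,Lar,ny,en' > sample.csv

awk -F, '
!($1 in last) { keys[++n] = $1 }    # remember each key the first time it shows up
{ last[$1] = $0 }                   # later records overwrite earlier ones
END { for (i = 1; i <= n; i++) print last[keys[i]] }
' sample.csv > newfile
```

The record for 353 stays in second position but carries the content of its last occurrence.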

Split file into different parts based on the data using awk

I need to split the data in file 1 based on the value in $4 using awk. The target file names should be taken from a mapping file, file 2.
File 1
text;text;text;AB;text
text;text;text;AB;text
text;text;text;CD;text
text;text;text;CD;text
text;text;text;EF;text
text;text;text;EF;text
File 2
AB;valid
CD;not_valid
EF;not_specified
Desired output where the file names are the value of $2 in file 2.
File valid
text;text;text;AB;text
text;text;text;AB;text
File not_valid
text;text;text;CD;text
text;text;text;CD;text
File not_specified
text;text;text;EF;text
text;text;text;EF;text
Any suggestions on how to perform the split?
Using awk:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}
$4 != p {if (p) close(a[p]); p=$4}' file2 file1
It seems that just the first part of the code works on its own:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}' file2 file1
So why is the last half of the code,
$4 != p {if (p) close(a[p]); p=$4}
needed? Thanks!
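As far as I can tell, the close() part is there so that not every output file stays open at once, since some awk implementations limit the number of simultaneously open files. It relies on file1 being grouped by $4: a file that is reopened with > after close() gets truncated. Here is a sketch that tolerates ungrouped input by appending with >> instead. It assumes the output files do not exist beforehand, and the ungrouped sample is my own.

```shell
# Mapping from the question, plus a hypothetical file1 that is NOT grouped by $4.
printf '%s\n' 'AB;valid' 'CD;not_valid' > file2
printf '%s\n' 'r1;x;x;AB;x' 'r2;x;x;CD;x' 'r3;x;x;AB;x' > file1
rm -f valid not_valid                   # >> appends, so start from a clean slate

awk -F';' '
FNR == NR { a[$1] = $2; next }          # file2: key -> output file name
$4 in a {
    out = a[$4]
    print >> out                        # append, so reopening never truncates
    close(out)                          # at most one output file open at a time
}' file2 file1
```

With only three distinct keys, as in the question, the short version without close() is unlikely to hit any limit.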
