split up (with specified delimiter) a selected column - shell

I have a tab-delimited file and want to modify it. The last column is pipe-delimited and I'd like to split that column up (from pipe to tab) without splitting any other columns that contain pipes.
The command below converts every pipe to a tab, but I'm unable to make it do the splitting only on a selected column (13). Is there a way to have this work on just the last column without having to specify its number?
awk -F'|' '$13=$13' OFS="\t" inputfile.tsv > split.tsv

Let's consider this tab-delimited test file:
$ cat file
a|b c|d e|f g
one two three four
To break up the third column on |:
$ awk -F'\t' '{gsub(/[|]/, "\t", $3)} 1' OFS='\t' file
a|b c|d e f g
one two three four
For your file, you will want to replace $3 with $13.
awk -F'\t' '{gsub(/[|]/, "\t", $13)} 1' OFS='\t' file
Or, to replace the last column, whatever column it is, use:
awk -F'\t' '{gsub(/[|]/, "\t", $NF)} 1' OFS='\t' file
How it works
-F'\t' sets the field separator on input to a tab.
gsub(/[|]/, "\t", $13) replaces | with a tab in field $13.
1 is awk's cryptic short-hand for print-the-line.
OFS='\t' tells awk to use a tab as the field separator on output.
Alternate form
It may be clearer and easier to maintain if \t is coded just once instead of three times. In that case (hat tip: Ed Morton):
awk 'BEGIN{FS=OFS="\t"} {gsub(/[|]/, OFS, $NF)} 1' file
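As a quick check that the $NF version leaves everything else alone, here is a hypothetical second file (name and contents invented; tabs shown as spaces, as above) where earlier columns also contain pipes:
$ cat file2
x|y p|q r|s|t
$ awk -F'\t' '{gsub(/[|]/, "\t", $NF)} 1' OFS='\t' file2
x|y p|q r s t
The pipes in the first two fields are untouched; only the pipes in the last field become tabs.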

Related

awk rounding the float numbers

My file a.txt contains the following lines, with fields separated by a comma (,):
ab,b1,c,d,5.986627e738,e,5.986627e738
cd,g2,h,i,7.3423542344,j,7.3423542344
ef,l3,m,n,9.3124234323,o,9.3124234323
When I issue the command below
awk -F"," 'NR>-1{OFS=",";gsub($5,$5+10);OFS=",";print }' a.txt
it is printing
ab,b1,c,d,inf,e,inf
cd,g2,h,i,17.3424,j,17.3424
ef,l3,m,n,19.3124,o,19.3124
Here I have two issues:
I asked awk to add 10 only to the 5th column, but it has added it to the 7th column as well because the values are duplicated.
It is rounding the numbers; instead, I need the decimals printed as they are.
How can I fix this?
awk 'BEGIN {FS=OFS=","}{$5=sprintf("%.10f", $5+10)}7' file
In your data, $5 on line 1 contains an e, so it was turned into 10.0000... in the output.
You did the substitution with gsub, so all occurrences on the line were replaced.
printf/sprintf should be used to produce output in a specific format.
Tested with gawk.
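The rounding itself comes from awk's default output conversion: when print writes a computed number, it formats it with OFMT, which defaults to "%.6g" (6 significant digits). That is why an explicit printf/sprintf format is needed to keep the full decimals, as this quick illustration shows:
$ awk 'BEGIN{x=7.3423542344+10; print x; printf "%.10f\n", x}'
17.3424
17.3423542344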
If you want to set the format in printf dynamically:
kent$ cat f
ab,b1,c,d,5.9866,e,5.986627e738
cd,g2,h,i,7.34235,j,7.3423542344
ef,l3,m,n,9.312423,o,9.3124234323
kent$ awk 'BEGIN {FS=OFS=","}{split($5,p,".");$5=sprintf("%.*f",length(p[2]), $5+10)}7' f
ab,b1,c,d,15.9866,e,5.986627e738
cd,g2,h,i,17.34235,j,7.3423542344
ef,l3,m,n,19.312423,o,9.3124234323
What you did was a substitution on the whole record; what you really want to do is:
awk 'BEGIN {FS=OFS=","}
{$5+=10}1' a.txt

How can I replace a character in a specific column? [duplicate]

I have a text file and I'm trying to replace a specific character (.) in the first column with another character (-). Every field is delimited by a comma. Some of the lines have the last 3 columns empty, so they have 3 commas at the end.
Example of text file:
abc.def.ghi,123.4561.789,ABC,DEF,GHI
abc.def.ghq,124.4562.789,ABC,DEF,GHI
abc.def.ghw,125.4563.789,ABC,DEF,GHI
abc.def.ghe,126.4564.789,,,
abc.def.ghr,127.4565.789,,,
What I tried was using awk to replace '.' in the first column with '-', then print out the contents.
ETA: Tried out sarnold's suggestion and got the output I want.
ETA2: I could have a longer first column. Is there a way to change ONLY the first 3 '.' in the first column to '-', so that I get this output:
abc-def-ghi-qqq.www,123.4561.789,ABC,DEF,GHI
abc-def-ghq-qqq.www,124.4562.789,ABC,DEF,GHI
abc-def-ghw-qqq.www,125.4563.789,ABC,DEF,GHI
abc-def-ghe-qqq.www,126.4564.789,,,
abc-def-ghr-qqq.www,127.4565.789,,,
. is regexp notation for "any character". Escape it with \ and it means .:
$ awk -F, '{gsub(/\./,"-",$1); print}' textfile.csv
abc-def-ghi 123.4561.789 ABC DEF GHI
abc-def-ghq 124.4562.789 ABC DEF GHI
abc-def-ghw 125.4563.789 ABC DEF GHI
abc-def-ghe 126.4564.789
abc-def-ghr 127.4565.789
$
The output field separator is a space by default. Set OFS = "," to change that:
$ awk -F, 'BEGIN {OFS=","} {gsub(/\./,"-",$1); print}' textfile.csv
abc-def-ghi,123.4561.789,ABC,DEF,GHI
abc-def-ghq,124.4562.789,ABC,DEF,GHI
abc-def-ghw,125.4563.789,ABC,DEF,GHI
abc-def-ghe,126.4564.789,,,
abc-def-ghr,127.4565.789,,,
This still allows changing multiple fields:
$ awk -F, 'BEGIN {OFS=","} {gsub(/\./,"-",$1); gsub("1", "#",$2); print}' textfile.csv
abc-def-ghi,#23.456#.789,ABC,DEF,GHI
abc-def-ghq,#24.4562.789,ABC,DEF,GHI
abc-def-ghw,#25.4563.789,ABC,DEF,GHI
abc-def-ghe,#26.4564.789,,,
abc-def-ghr,#27.4565.789,,,
I don't know what -OFS, does, but it isn't a supported command line option; using it to set the output field separator was a mistake on my part. Setting OFS within the awk program works well.
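For completeness, the output field separator can also be set on the command line with awk's -v option (which is what the answer below does as well); for example:
awk -F, -v OFS=, '{gsub(/\./,"-",$1); print}' textfile.csv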
This might work for you:
awk -F, -vOFS=, '{for(n=1;n<=3;n++)sub(/\./,"-",$1)}1' file
abc-def-ghi-qqq.www,123.4561.789,ABC,DEF,GHI
abc-def-ghq-qqq.www,124.4562.789,ABC,DEF,GHI
abc-def-ghw-qqq.www,125.4563.789,ABC,DEF,GHI
abc-def-ghe-qqq.www,126.4564.789,,,
abc-def-ghr-qqq.www,127.4565.789,,,
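This works because sub(), unlike gsub(), replaces only the leftmost match on each call, so running it three times converts just the first three dots. The same command written out long-hand, purely for illustration:
awk -F, -v OFS=, '{
  sub(/\./, "-", $1)   # first dot in column 1
  sub(/\./, "-", $1)   # second dot
  sub(/\./, "-", $1)   # third dot
  print
}' textfile.csv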

Compare two csv files, use the first three columns as identifier, then print common lines

I have two csv files. File 1 has 9861 rows and 4 columns while File 2 has 6037 rows and 5 columns. Here are the files:
Link of File 1
Link of File 2
The first three columns are years, months, days respectively.
I want to get the lines in File 2 that have the same identifier as in File 1 and print them to File 3.
I found this command from some posts here but this only works using one column as identifier:
awk -F, 'NR==FNR {a[$1]=$0;next}; $1 in a {print a[$1]; print}' file1 file2
Is there a way to do this using awk or any simpler command where I can use the first three columns as the identifier?
I'll appreciate any help.
Just use more columns to make the uniqueness you need:
$ awk -F, 'NR==FNR {a[$1, $2, $3] = $0; next}
$1 SUBSEP $2 SUBSEP $3 in a' file1 file2
SUBSEP is the subscript separator. It has the default value of "\034" and is used to separate the parts of the indices of a multi-dimensional array. Thus, the expression foo["A", "B"] really accesses foo["A\034B"].
awk -F, '{k=$1 FS $2 FS $3} NR==FNR{a[k];next} k in a' file1 file2
Untested of course since you didn't provide any sample input/output.
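For illustration only, here is how it behaves on two tiny invented files keyed by year, month and day (the real files are the ones linked above):
$ cat file1
2003,1,1,4.5
2003,1,2,3.2
2003,1,3,0.0
$ cat file2
2003,1,2,10,20
2003,1,4,11,21
$ awk -F, '{k=$1 FS $2 FS $3} NR==FNR{a[k];next} k in a' file1 file2 > file3
$ cat file3
2003,1,2,10,20
Only the File 2 line whose first three columns also appear in File 1 ends up in File 3.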

Redirect output of one command to different files in shell script

I have a tab-separated string.
I want to copy one column to one file and the remaining columns to another file in one go, as the string can change in between if I use two different commands.
I tried:
tab_seperated_string | awk -F"\t" '{ print $2"\t"$3"\t"$4"\t"$5} {print $1}'
Columns 2, 3, 4 and 5 should go to one file and column 1 should go to another file.
You can do it like this:
tab_seperated_string | awk -F"\t" '{print $2,$3,$4,$5 > "file2"; print $1 > "file1"}' OFS="\t"
It will then save data to two different files.
By setting OFS to \t, you do not need all the \t in the print statement.
Here is another way if you have many fields that go to one file and first field to another:
awk -F"\t" '{print $1 > "file1"; sub(/[^\t]+\t/,""); print $0 > "file2"}' OFS="\t"
The sub(/[^\t]+\t/,"") removes first field and first tab.

Cut and replace bash

I have to process a file with data organized like this
AAAAA:BB:CCC:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
etc
Columns can have different lengths, but lines always have the same number of columns.
I want to be able to cut a specific column of a given line and change it to the value I want.
For example I'd apply my command and change the file to
AAAAA:BB:XXXX:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
I know how to select a specific line with sed and then cut the field, but I have no idea how to replace the field with the value I have.
Thanks
Here's a way to do it with awk:
Going with your example, if you wanted to replace the 3rd field of the 1st line:
awk 'BEGIN{FS=OFS=":"} {if (NR==1) {$3 = "XXXX"}; print}' input_file
Input:
AAAAA:BB:CCC:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
Output:
AAAAA:BB:XXXX:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
Explanation:
awk: invoke the awk command
'...': everything enclosed by the single quotes is the program passed to awk
BEGIN{FS=OFS=":"}: Use : as delimiters for both input and output. FS stands for Field Separator. OFS stands for Output Field Separator.
if (NR==1) {$3 = "XXXX"};: If Number of Records (NR) read so far is 1, then set the 3rd field ($3) to "XXXX".
print: print the current line
input_file: name of your input file.
If instead what you are trying to accomplish is simply replace all occurrences of CCC with XXXX in your file, simply do:
sed -i 's/CCC/XXXX/g' input_file
Note that this will also replace partial matches, such as ABCCCDD -> ABXXXXDD
This might work for you (GNU sed):
sed -r 's/^(([^:]*:?){2})CCC/\1XXXX/' file
or
awk -F: -vOFS=: '$3=="CCC"{$3="XXXX"};1' file
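If the line number, column and replacement value need to vary, they can be passed in as awk variables; a minimal sketch along the same lines (variable names are just illustrative):
awk -v line=1 -v col=3 -v val='XXXX' 'BEGIN{FS=OFS=":"} NR==line{$col=val} 1' file
Run against the sample above, this reproduces the AAAAA:BB:XXXX:EEEE:DDDD output.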
