Cut and replace bash - bash

I have to process a file with data organized like this
AAAAA:BB:CCC:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
etc
Columns can have different length but lines always have the same number of columns.
I want to be able to cut a specific column of a given line and change it to the value I want.
For example I'd apply my command and change the file to
AAAAA:BB:XXXX:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
I know how to select a specific line with sed and then cut the field but I have no idea on how to replace the field with the value I have.
Thanks

Here's a way to do it with awk:
Going with your example, if you wanted to replace the 3rd field of the 1st line:
awk 'BEGIN{FS=OFS=":"} {if (NR==1) {$3 = "XXXX"}; print}' input_file
Input:
AAAAA:BB:CCC:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
Output:
AAAAA:BB:XXXX:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
Explanation:
awk: invoke the awk command
'...': everything enclosed by single-quotes are instructions to awk
BEGIN{FS=OFS=":"}: Use : as delimiters for both input and output. FS stands for Field Separator. OFS stands for Output Field Separator.
if (NR==1) {$3 = "XXXX"};: If Number of Records (NR) read so far is 1, then set the 3rd field ($3) to "XXXX".
print: print the current line
input_file: name of your input file.
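The same idea can be generalized so that the line number, the column and the new value are passed in as parameters. This is only a minimal sketch: the helper name replace_field is made up for illustration, and it assumes the new value contains no backslashes, because awk interprets escape sequences in -v assignments.

```shell
# replace_field LINE COL VALUE [FILE...]
# Hypothetical helper: prints the input with field COL of line LINE set to VALUE.
replace_field() {
    rf_line=$1 rf_col=$2 rf_val=$3
    shift 3
    awk -v line="$rf_line" -v col="$rf_col" -v val="$rf_val" '
        BEGIN { FS = OFS = ":" }
        NR == line { $col = val }   # only the chosen line is modified
        { print }                   # every line is printed
    ' "$@"
}

# Example: replace the 3rd field of the 1st line, reading from stdin
printf 'AAAAA:BB:CCC:EEEE:DDDD\nFF:III:JJJ:KK:LLL\n' | replace_field 1 3 XXXX
```

To change the file itself, redirect the output to a temporary file and then move it over the original.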
If instead you simply want to replace all occurrences of CCC with XXXX in your file, do:
sed -i 's/CCC/XXXX/g' input_file
Note that this will also replace partial matches, such as ABCCCDD -> ABXXXXDD
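If partial matches are a problem, one possible way around them is to compare whole fields in awk instead of substituting text. A sketch, assuming : is still the delimiter (the helper name and the second sample line are made up for illustration):

```shell
# Hypothetical helper: replace only fields that are exactly "CCC".
replace_whole_field() {
    awk 'BEGIN { FS = OFS = ":" }
         { for (i = 1; i <= NF; i++) if ($i == "CCC") $i = "XXXX" }  # exact match only
         1' "$@"
}

printf 'AAAAA:BB:CCC:EEEE\nFF:ABCCCDD:JJJ:CCC\n' | replace_whole_field
# AAAAA:BB:XXXX:EEEE
# FF:ABCCCDD:JJJ:XXXX
```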

This might work for you (GNU sed):
sed -r 's/^(([^:]*:?){2})CCC/\1XXXX/' file
or
awk -F: -vOFS=: '$3=="CCC"{$3="XXXX"};1' file
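If you would rather address the field by position than by its current content, the numeric flag of sed's s command can pick the Nth match. A sketch, assuming no field is empty (an empty field would not be counted by this pattern); the function name is made up for illustration:

```shell
# Replace the 3rd run of non-colon characters, on line 1 only.
replace_third_on_first() {
    sed '1s/[^:][^:]*/XXXX/3' "$@"
}

printf 'AAAAA:BB:CCC:EEEE:DDDD\nFF:III:JJJ:KK:LLL\n' | replace_third_on_first
# AAAAA:BB:XXXX:EEEE:DDDD
# FF:III:JJJ:KK:LLL
```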

Related

Print part of a comma-separated field using AWK

I have a line containing this string:
$DLOAD , 123 , Loadcase name=SUBCASE_1
I am trying to only print SUBCASE_1. Here is my code, but I get a syntax error.
awk -F, '{n=split($3,a,"="); a[n]} {printf(a[1]}' myfile
How can I fix this?
1st solution: If you only want the last field (the part after the =), then with your shown samples please try the following:
awk -F',[[:space:]]+|=' '{print $NF}' Input_file
2nd solution: Or, if you specifically want the value after = in the 3rd field, try the following awk code. It uses a comma followed by whitespace as the field separator, splits the 3rd field on = into the array arr, and prints the 2nd element of arr.
awk -F',[[:space:]]+' '{split($3,arr,"=");print arr[2]}' Input_file
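The two solutions above differ if the value itself contains an =: $NF keeps only what follows the last =, and arr[2] only what sits between the first and second =. If everything after the first = is wanted, a sketch with index() and substr() might look like this (the helper name and the extra = in the sample line are made up for illustration):

```shell
# Hypothetical helper: print everything after the first "=" in the 3rd field.
# If the field has no "=", index() returns 0 and the whole field is printed.
after_first_equals() {
    awk -F',[[:space:]]+' '{ print substr($3, index($3, "=") + 1) }' "$@"
}

printf '%s\n' '$DLOAD , 123 , Loadcase name=SUB=CASE_1' | after_first_equals
# SUB=CASE_1
```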
Possibly the shortest solution would be:
awk -F= '{print $NF}' file
Where you simply use '=' as the field-separator and then print the last field.
Example Use/Output
Using your sample input in a heredoc, with the delimiter quoted to prevent expansion of $DLOAD, you would have:
$ awk -F= '{print $NF}' << 'eof'
> $DLOAD , 123 , Loadcase name=SUBCASE_1
> eof
SUBCASE_1
(of course in this case it probably doesn't matter whether $DLOAD was expanded or not, but for completeness, in case $DLOAD included another '=' ...)

Extract the last three columns from a text file with awk

I have a .txt file like this:
ENST00000000442 64073050 64074640 64073208 64074651 ESRRA
ENST00000000233 127228399 127228552 ARF5
ENST00000003100 91763679 91763844 CYP51A1
I want to get only the last 3 columns of each line.
As you can see, sometimes there are empty lines between two lines, and these must be ignored. Here is the output that I want:
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
awk  '/a/ {print $1- "\t" $-2 "\t" $-3}'  file.txt.
It does not return what I want. Do you know how to correct the command?
The following awk may help you with this.
awk 'NF{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file
Output will be as follows.
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
EDIT: Adding an explanation of the command. (NOTE: the following is for explanation purposes only; run the command above to get the results.)
awk 'NF ###NF is a built-in awk variable holding the number of fields in the current line of Input_file.
###Used as a condition, it is true only for lines that have at least one field, so empty lines are skipped.
{
print $(NF-2),$(NF-1),$NF ###Print $(NF-2), the 3rd-last field of the current line, then $(NF-1), the 2nd-last field, then $NF, the last field.
}
' OFS="\t" Input_file ###Set OFS (output field separator) to TAB and name the Input_file.
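The $(NF-k) pattern extends to any number of trailing columns. A minimal sketch with a made-up helper name, last_cols; unlike the command above it also skips lines that have fewer than N fields, which is an assumption about the desired behavior:

```shell
# last_cols N [FILE...]: hypothetical helper printing the last N fields, tab-separated.
last_cols() {
    lc_n=$1
    shift
    awk -v n="$lc_n" -v OFS='\t' '
        NF >= n {                            # skips empty and too-short lines
            out = $(NF - n + 1)              # first of the last n fields
            for (i = NF - n + 2; i <= NF; i++)
                out = out OFS $i             # append the remaining ones
            print out
        }' "$@"
}

printf 'ENST00000000442 64073050 64074640 64073208 64074651 ESRRA\n\nENST00000000233 127228399 127228552 ARF5\n' | last_cols 3
```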
You can use sed too
sed -E '/^$/d;s/.*\t(([^\t]*[\t|$]){2})/\1/' infile
With some piping:
$ cat file | tr -s '\n' | rev | cut -f 1-3 | rev
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
First, tr squeezes repeated \ns, which gets rid of the empty lines. Then the lines are reversed, the first three fields are cut, and the lines are reversed again. Note that cut splits on tabs by default. You could drop the useless cat by putting the first rev at the front and giving it the file name.

Append and replace using awk/sed

I have this file:
2016,05,P,0002 ,CJGLOPSD8
00,BBF,BBDFTP999,051000100,GBP, , -2705248.00
00,BBF,BBDFTP999,059999998,GBP, , -3479679.38
00,BBF,BBDFTP999,061505141,GBP, , -0.40
00,BBF,BBDFTP999,061505142,GBP, , 6207621.00
00,BBF,BBDFTP999,061505405,GBP, , -0.16
00,BBF,BBDFTP999,061552000,GBP, , -0.24
00,BBF,BBDFTP999,061559010,GBP, , -0.44
00,BBF,BBDFTP999,062108021,GBP, , -0.34
00,BBF,BBDFTP999,063502007,GBP, , -0.28
I want to programmatically (in unix, or informatica if possible) grab the first two fields in the top row, concatenate them, append them to the end of each line and remove that first row.
Like so:
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
This is my current attempt:
awk -vvar1=`cat OF\ OPSDOWN8.CSV | head -1 | cut -d',' -f1` -vvar2=`cat OF\ OPSDOWN8.CSV | head -1 | cut -d',' -f2` 'BEGIN {FS=OFS=","} {print $0, var 1var2}' OF\ OPSDOWN8.CSV> OF_OPSDOWN8.csv
Any pointers? I've tried looking around the forum but can only find answers to part of my question.
Thanks for your help.
Use this awk:
awk 'BEGIN{FS=OFS=","} NR==1{val=$1$2;next} {gsub(/ */,"");print $0,val}' file
Explanation:
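BEGIN{FS=OFS=","} - This block sets FS (Field Separator) and OFS (Output Field Separator) to ,.
NR==1{val=$1$2;next} - On line number 1, store the concatenation of fields $1 and $2 in val, then skip to the next line so the header row is not printed.
gsub(/ */,"") - Remove all spaces from the current line.
print $0,val - Print $0 (the whole line) followed by the stored value from val.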
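For a quick check without the real file, the same approach can be tried on two of the sample lines. A sketch only: the function name append_period is made up, and / +/ is used instead of / */ purely so the pattern never matches the empty string.

```shell
# Hypothetical wrapper around the awk approach described above.
append_period() {
    awk 'BEGIN { FS = OFS = "," }
         NR == 1 { val = $1 $2; next }      # remember 2016 and 05, drop the header row
         { gsub(/ +/, ""); print $0, val }  # strip spaces, then append the stored value
    ' "$@"
}

printf '%s\n' '2016,05,P,0002 ,CJGLOPSD8' \
              '00,BBF,BBDFTP999,051000100,GBP, , -2705248.00' | append_period
# 00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
```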
I would use the following awk command:
awk 'NR==1{d=$1$2;next}{$(NF+1)=d;gsub(/[[:space:]]/,"")}1' FS=, OFS=, file
Explanation:
NR==1{d=$1$2;next} applies to line 1 and sets a variable d(ate) to the concatenation of the first and second fields. The variable is used when processing the remaining lines. next tells awk to move on to the next line right away without processing further instructions on this line.
{$(NF+1)=d;gsub(/[[:space:]]/,"")}1 appends a new field to the line (NF is the number of fields, so assigning d to $(NF+1) effectively adds a field). gsub() is used to remove whitespace. 1 at the end always evaluates to true and makes awk print the modified line.
FS=, is a command line argument. It sets the input field delimiter to ,.
OFS=, is a command line argument. It sets the output field delimiter to ,.
Output:
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
With sed:
sed '1{s/\([^,]*\),\([^,]*\),.*/\1\2/;h;d};/.*/G;s/\n/,/;s/ //g' file
In ERE mode:
sed -r '1{s/([^,]*),([^,]*),.*/\1\2/;h;d};/.*/G;s/\n/,/;s/ //g' file
Output:
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
This might work for you (GNU sed):
sed '1s/,//;1s/,.*//;1h;1d;s/ //g;G;s/\n/,/' file
For the first line only: remove the first comma, remove from the next comma to the end of the line, store the amended line in the hold space (HS) and then delete the current line (the d abruptly ends processing). For subsequent lines: remove all spaces, append the HS and replace the newline (from the G command) with a comma.
Or if you prefer:
sed '1{s/,//;s/,.*//;h;d};s/ //g;G;s/\n/,/' file
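The one-liner can also be written as a multi-line script with one comment per command, which may make the hold-space steps easier to follow. This is a sketch of the same commands, wrapped in a made-up function name:

```shell
# Hypothetical wrapper; the commands are the ones from the one-liner above.
append_period_sed() {
    sed '
        1 {
            # 2016,05,P,... becomes 201605,P,...
            s/,//
            # keep only 201605
            s/,.*//
            # copy it into the hold space
            h
            # delete the header line and start the next cycle
            d
        }
        # data lines: remove all spaces
        s/ //g
        # append a newline plus the hold space to the pattern space
        G
        # replace that newline with a comma
        s/\n/,/
    ' "$@"
}

printf '%s\n' '2016,05,P,0002 ,CJGLOPSD8' \
              '00,BBF,BBDFTP999,051000100,GBP, , -2705248.00' | append_period_sed
# 00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
```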
If you want to use Informatica for this, use two Source Qualifiers. Read the file twice - just one line in one SQ (filter out the rest) and in the second SQ read the whole file except the first line (skip header). Join the two on dummy port and you're done.

How to add a prefix to the 2nd delimited field of every record in a shell script?

I have a file with following records which is comma delimited:
143849998,+4564656
6345353,000345345
754656,0345345
64555546,3453452345
The requirement is to add a certain prefix to every 2nd field in every record. The prefix is different in different conditions. The logic is :
If the second field starts with "+", then leave it as it is.
If the second field starts with "0" (any number of zeroes), replace all the leading zeroes with "+".
In any other case, prefix it with "+234".
The output should be something like this:
143849998,+4564656
6345353,+345345
754656,+345345
64555546,+2343453452345
How can I achieve this using AWK? I am able to perform the last condition, the first condition is straight forward, but I am failing when I am trying to club all the conditions in one awk command.
This line should do:
awk -F, -v OFS="," '$2!~/^\+/{if(!sub(/^0+/,"+",$2))$2="+234"$2}7' file
143849998,+4564656
6345353,+345345
754656,+345345
64555546,+2343453452345
or it could be this too:
awk -F, -v OFS="," '$2!~/^\+/&&!sub(/^0+/,"+",$2){$2="+234"$2}7' file
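For readers who find the compact forms hard to follow, the three rules can be spelled out as an explicit if/else chain. A sketch with a made-up function name; it is meant to behave the same as the one-liners on the sample data:

```shell
# Hypothetical helper applying the three prefix rules to the 2nd field.
add_prefix() {
    awk 'BEGIN { FS = OFS = "," }
    {
        if ($2 ~ /^0/) {
            sub(/^0+/, "+", $2)      # leading zeroes become a single "+"
        } else if ($2 !~ /^\+/) {
            $2 = "+234" $2           # anything else that lacks a "+" gets +234
        }                            # fields already starting with "+" are left alone
        print
    }' "$@"
}

printf '143849998,+4564656\n6345353,000345345\n754656,0345345\n64555546,3453452345\n' | add_prefix
```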
Using sed:
sed 's|,\([1-9]\)|,+234\1|; s|,0\+|,+|' file
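The equivalent in awk (sub() does not support \1 backreferences, so the line is tested first and the prefix is inserted after the comma):
awk '/,[1-9]/ { sub(/,/, ",+234") } { sub(/,0+/, ",+") } 1' file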
Output:
143849998,+4564656
6345353,+345345
754656,+345345
64555546,+2343453452345
$ awk -F, '{print $1",+"($2~/^[+0]/?"":234)$2+0}' file
143849998,+4564656
6345353,+345345
754656,+345345
64555546,+2343453452345
Adding 0 to $2 strips off any leading zeros and/or plus sign since it's doing an arithmetic operation on it and so the natural result will not have a sign or leading zeros.
Note that this approach will convert +03 to +3 and -3 to +234-3, so if those can occur in your input and that's not the desired behavior, update your question to show those cases in the sample input/output.
Using awk:
awk 'BEGIN{FS=OFS=","} $2~/^0/{sub(/^0+/, "+", $2);} !($2~/^\+/){$2="+234" $2}1' file
143849998,+4564656
6345353,+345345
754656,+345345
64555546,+2343453452345
OR using non-regex based checks:
awk 'BEGIN{FS=OFS=","} substr($2,1,1)=="0"{sub(/^0+/, "+", $2);}
substr($2,1,1)!="+"{$2="+234" $2}1' file
143849998,+4564656
6345353,+345345
754656,+345345
64555546,+2343453452345
If your data is exactly as shown:
awk -F",|,[0]+" '/,\+/{ print $0;next } /,0/ {print $1",+"$2; next} {print $1",+234"$2} ' file
gives
143849998,+4564656
6345353,+345345
754656,+345345
64555546,+2343453452345
It works as follows:
Either , or ,000 (with an arbitrary number of 0s) is a field separator. Then we have three rules:
match ,+ (literally) -> print the line, go to the next line
match ,0 -> the line is already split (awk removes the , and the 0s for us), so print the two parts with ,+ between them, go to the next line
if nothing has matched so far, prefix the second field with "+234"
You can use the following awk script:
awk -F, 'BEGIN{OFS=","}{sub(/^0+/, "+", $2)}!($2~/^\+/){$2="+234"$2}1' file
This might work for you (GNU sed):
sed '/,+/b;/,00*/s//,+/;t;s/,/&+234/' file
As per spec.

Filtering data in a text file in bash

I am trying to filter the data in a text file. There are 2 fields in the text file. The first one is text, while the 2nd one has 3 parts separated by _. The first part of the second field is a date in yyyyMMdd format and the next 2 are strings:
xyz yyyyMMdd_abc_lmn
Now I want to filter the lines in the file based on the date in the second field. I have come up with the following awk command, but it doesn't seem to work: it outputs the entire file, so I am definitely missing something.
Awk command:
awk -F'\t' -v ldate='20140101' '{cdate=substr($2, 1, 8); if( cdate <= ldate) {print $1'\t\t'$2}}' label
Try:
awk -v ldate='20140101' '{split($2,fld,/_/); if(fld[1]<=ldate) print $1,$2}' file
Note:
We are using the split function, which splits the field on the regex given as its third argument and stores the pieces in the array given as its second argument.
You don't need to set -F'\t' unless your input file is tab-delimited. The default FS is a single space, which makes awk split on runs of spaces and tabs; forcing it to tab can throw off the interpretation of $2 if the fields are actually separated by spaces.
To output with two tabs you can set the OFS variable like:
awk -F'\t' -v OFS='\t\t' -v ldate='20140101' '{split($2,fld,/_/); if(fld[1]<=ldate) print $1,$2}' file
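To try the split-based filter without the real file, it can be run on a few made-up lines. In this sketch the function name and the sample rows are invented, and the default whitespace field separator is assumed:

```shell
# filter_by_date LIMIT [FILE...]: hypothetical helper keeping lines whose date is <= LIMIT.
filter_by_date() {
    fd_limit=$1
    shift
    awk -v ldate="$fd_limit" '{
        split($2, fld, "_")              # fld[1] holds the yyyyMMdd part
        if (fld[1] <= ldate) print $1, $2
    }' "$@"
}

printf '%s\n' 'aaa 20131215_abc_lmn' 'bbb 20140101_abc_lmn' 'ccc 20140302_abc_lmn' |
    filter_by_date 20140101
# aaa 20131215_abc_lmn
# bbb 20140101_abc_lmn
```

Because the dates are fixed-width yyyyMMdd values, the comparison gives the same answer whether awk treats them as numbers or as strings.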
Try this:
awk -v ldate='20140101' 'substr($NF,1,8) <= ldate' label
