Extract the last three columns from a text file with awk - bash

I have a .txt file like this:
ENST00000000442 64073050 64074640 64073208 64074651 ESRRA

ENST00000000233 127228399 127228552 ARF5

ENST00000003100 91763679 91763844 CYP51A1
I want to get only the last 3 columns of each line.
As you can see, sometimes there are empty lines between two data lines, which must be ignored. Here is the output that I want to get:
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
I tried:
awk '/a/ {print $1- "\t" $-2 "\t" $-3}' file.txt
but it does not return what I want. Do you know how to correct the command?

The following awk may help:
awk 'NF{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file
Output will be as follows.
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
EDIT: Adding an explanation of the command too now. (NOTE: the annotated command below is for explanation purposes only; run the command above to get the actual results.)
awk 'NF        ##Condition NF: NF is a built-in awk variable that holds the number of fields in the line currently being read.
               ##A non-empty line has a non-zero field count, so empty lines fail this condition and are skipped.
{
  print $(NF-2),$(NF-1),$NF   ##Print $(NF-2), the 3rd-last field of the current line, then $(NF-1), the 2nd-last field, then $NF, the last field.
}
' OFS="\t" Input_file         ##Set OFS (the output field separator) to a tab and name the Input_file.
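If a non-empty line could ever have fewer than three fields, $(NF-2) would reference a field before the start of the line; a slightly stricter guard (a sketch of the same idea, assuming such short lines should simply be skipped) avoids that:
awk 'NF>=3{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file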

You can use sed too:
sed -E '/^$/d;s/.*\t(([^\t]*\t){2})/\1/' infile

With some piping:
$ cat file | tr -s '\n' | rev | cut -f 1-3 | rev
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
First, cat the file into tr to squeeze out repeated \ns and get rid of the empty lines. Then rev reverses each line character by character, cut takes the first three fields, and the second rev restores the original order. You could drop the useless cat by redirecting the file into tr instead (tr -s '\n' < file).

Related

Bash: Keep all lines with duplicate values in column X

I have a file with a few thousand lines and 20+ columns. I now want to keep only the lines that have the same e-mail address in column 3 as in other lines.
file: (First Name; Last Name; E-Mail; ...)
Mike;Tyson;mike#tyson.com
Tom;Boyden;tom#boyden.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Jennifer;Lopez;jennifer#lopez.com
Andre;Agassi;tom#boyden.com
Paul;Walker;paul#walker.com
I want to keep ALL lines that have a matching e-mail address. In this case the expected output would be
Mike;Tyson;mike#tyson.com
Tom;Boyden;tom#boyden.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Andre;Agassi;tom#boyden.com
If I use
awk -F';' 'seen[$3]++' file
I will lose the first instance of each e-mail address, in this case lines 1 and 2, and will keep ONLY the duplicates.
Is there a way to keep all lines?
This awk one-liner will help you:
awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file
It passes the file twice: the first pass counts the occurrences of each address, and the second pass checks the counts and prints the lines whose address occurred more than once.
With the given input example, it prints:
Mike;Tyson;mike#tyson.com
Tom;Boyden;tom#boyden.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Andre;Agassi;tom#boyden.com
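If the key column ever moves, the same two-pass idiom parameterizes cleanly (a sketch; col is a hypothetical variable holding the 1-based column number):
awk -F';' -v col=3 'NR==FNR{a[$col]++;next} a[$col]>1' file file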
If the output order doesn't matter, here's a one-pass approach:
$ awk -F';' '$3 in first{print first[$3] $0; first[$3]=""; next} {first[$3]=$0 ORS}' file
Mike;Tyson;mike#tyson.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Tom;Boyden;tom#boyden.com
Andre;Agassi;tom#boyden.com
You could also try the following, which reads the Input_file only once, in a single awk:
awk '
BEGIN{
  FS=";"
}
{
  mail[$3]++
  mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0
}
END{
  for(i in mailVal){
    if(mail[i]>1){ print mailVal[i] }
  }
}' Input_file
Explanation: Adding a detailed explanation for the above.
awk '                  ##Start the awk program.
BEGIN{                 ##Start the BEGIN section.
  FS=";"               ##Set the field separator to ;.
}
{
  mail[$3]++           ##Count occurrences: mail is indexed by the 3rd field and its value is incremented for every line read.
  mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0   ##mailVal, also indexed by the 3rd field, collects the lines themselves, concatenated with newlines.
}
END{                   ##Start the END block.
  for(i in mailVal){   ##Traverse mailVal.
    if(mail[i]>1){ print mailVal[i] }   ##If an address occurred more than once, print its collected lines.
  }
}
' Input_file           ##Mention the Input_file name.
I think @ceving just needs to go a little further.
ASSUMING the chosen column is NOT the first or last -
cut -f$col -d\; file | # slice out the right column
tr '[:upper:]' '[:lower:]' | # standardize case
sort | uniq -d | # sort and output only the dups
sed 's/^/;/; s/$/;/;' > dups # save the lowercased keys
grep -iFf dups file > subset.csv # pull matching records
This breaks if the chosen column is the first or last, but should otherwise preserve case and order from the original version.
If it might be the first or last, then pad the stream to that last grep and clean it afterwards -
sed 's/^/;/; s/$/;/;' file | # pad with leading/trailing delims
grep -iFf dups | # grab relevant records
sed 's/^;//; s/;$//;' > subset.csv # strip the padding
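End to end, the padded variant might look like this (a sketch, assuming the key column number is in a shell variable col, and that an address never also appears in some other column):
col=3
sed 's/^/;/; s/$/;/' file |            # pad every record with leading/trailing delims
  cut -d\; -f$((col + 1)) |            # field numbers shift by one because of the leading ;
  tr '[:upper:]' '[:lower:]' |         # standardize case
  sort | uniq -d |                     # keep one copy of each duplicated key
  sed 's/^/;/; s/$/;/' > dups          # turn the keys back into ;key; form
sed 's/^/;/; s/$/;/' file |            # pad the data the same way
  grep -iFf dups |                     # pull records whose padded key matches
  sed 's/^;//; s/;$//' > subset.csv    # strip the padding again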
Find the duplicate e-mail addresses:
sed -s 's/^.*;/;/;s/$/$/' < file.csv | sort | uniq -d > dups.txt
Report the duplicate csv rows:
grep -f dups.txt file.csv
Update:
As "Ed Morton" pointed out the above commands will fail, when the e-mail addresses contain characters, which have a special meaning in a regular expression. This makes it necessary to escape the e-mail addresses.
One way to do so is to use Perl compatible regular expression. In a PCRE the escape sequences \Q and \E mark the beginning and the end of a string, which should not be treated as a regular expression. GNU grep supports PCREs with the option -P. But this can not be combined with the option -f. This makes it necessary to use something like xargs. But xargs interprets backslashes and ruins the regular expression. In order to prevent it, it is necessary to use the option -0.
Lessen learned: it is quite difficult to get it right without programming it in AWK.
sed -s 's/^.*;/;\\Q/;s/$/\\E$/' < file.csv | sort | uniq -d | tr '\n' '\0' > dups.txt
xargs -0 -i < dups.txt grep -P '{}' file.csv

How to print the csv file excluding first column till end using awk

I have a csv file with dynamic columns.
I've tried to use awk -F , 'NF>1' resul1.txt but it still prints all columns.
Since the number of columns is dynamic, it's quite difficult to list the fields explicitly, as in print $2, $3, ..., up to the end.
Try this awk command:
awk -F, '{$1=""}1' input.txt | awk -vOFS=, '{$1=$1}1' > output.txt
Make the 1st field empty
Print out the entire line again
The second awk re-splits the result on whitespace and rejoins the fields with , as the output separator (note this assumes the fields themselves contain no spaces).
try substr function :
substr(string, start [, length ])
Return a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example, substr("washington", 5, 3) returns "ing".
awk -F, '{print substr($0,length($1)+1+length(FS))}' file
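A closely related variant finds the first separator with index() instead (a sketch; substr() and index() are both standard awk built-ins):
awk '{print substr($0, index($0, ",") + 1)}' file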
You can use cut:
cut -d',' -f2- yourfile.csv > output.csv
Explanation:
-d - setting delimiter to ,
-f - fields to print
2- - from the 2nd field to the end of the line
With awk:
awk '{sub(/^[^,]*,/,"")}1' yourfile.csv > output.csv
With sed:
sed -i.bak 's/^[^,]\+,//' yourfile.csv
-i.bak - in-place edit, saving a backup of the original with a .bak suffix

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like the one-liner below should work for you, though without any sample data I can't say for sure (try to always include sample data in questions on SO).
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file has a header you want to preserve, you can modify this as follows, which additionally tells awk to print the whole line when the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
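For illustration, given a hypothetical tab-separated inputfile such as:
foo	x	8
bar	y	3
baz	z	8
the first command would print:
foo
baz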
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
cat x | while read y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
The input is read into variable y, then split into an array. IFS (the input field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third field of the array is then compared to '8'; if they are equal, the first field of the array is printed. Remember that array indices start at zero.
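A slightly more robust bash-only variant reads the columns into named variables (a sketch; read -r avoids backslash mangling and the per-command IFS restricts splitting to tabs):
while IFS=$'\t' read -r c1 c2 c3; do [ "$c3" = 8 ] && printf '%s\n' "$c1"; done < file1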

awk: csv split works, but ignores the last field in the row

I have a sample file that looks like:
Sample.csv
Data_1,0,289,292,293,300,306
Data_2,0,294,3,306
Data_3,0,294,305,306
Data_4,0,294,305,306
And Im running awk on it:
scr.sh:
awk -F ',' -v tId="$1" '{for(i=3; i<NF; i++){if($i==tId) print}}' $2
By calling
./scr.sh 300 Sample.csv
That works fine and returns me exactly the one row that matches:
Data_1,0,289,292,293,300,306
Original problem statement: from the 3rd column onwards, if any column matches the given number, the line should be printed.
But if I call:
./scr.sh 306 Sample.csv
That returns me NOTHING!
I've double checked the lines in Sample.csv and confirmed that there are NO trailing spaces on any of the lines.
Any clues? Thanks.
This awk will do what you're looking for:
awk -F ',' -v tId="$1" '$0 ~ "(^|,)" tId "(,|$)"' file
Alternatively this egrep will also do the job:
egrep '(^|,)306(,|$)' file
UPDATE: Based on your comments below you can use:
awk -v tId="$1" 'BEGIN{FS=OFS=","} {p=$0; $1=$2=""} $0 ~ "(^|,)" tId "(,|$)"{print p}' file
Here is a simple solution to your problem.
Let's say your argument is stored in a variable named var, i.e. var=$1.
Then run the following command to find the occurrences in your file:
grep -E "^${var},|,${var},|,${var}$" yourfilename
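Incidentally, the immediate bug in the posted script is the loop bound: i<NF stops one field short, so the last column (where 306 sits) is never compared. Presumably i<=NF was intended; a corrected sketch (with next added so a line prints only once even if the value occurs twice):
awk -F ',' -v tId="$1" '{for(i=3; i<=NF; i++) if($i==tId){print; next}}' $2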

Cut and replace bash

I have to process a file with data organized like this
AAAAA:BB:CCC:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
etc
Columns can have different lengths, but lines always have the same number of columns.
I want to be able to cut a specific column of a given line and change it to the value I want.
For example I'd apply my command and change the file to
AAAAA:BB:XXXX:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
I know how to select a specific line with sed and then cut the field, but I have no idea how to replace the field with the value I have.
Thanks
Here's a way to do it with awk:
Going with your example, if you wanted to replace the 3rd field of the 1st line:
awk 'BEGIN{FS=OFS=":"} {if (NR==1) {$3 = "XXXX"}; print}' input_file
Input:
AAAAA:BB:CCC:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
Output:
AAAAA:BB:XXXX:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
Explanation:
awk: invoke the awk command
'...': everything enclosed by the single quotes is the program passed to awk
BEGIN{FS=OFS=":"}: Use : as delimiters for both input and output. FS stands for Field Separator. OFS stands for Output Field Separator.
if (NR==1) {$3 = "XXXX"};: If Number of Records (NR) read so far is 1, then set the 3rd field ($3) to "XXXX".
print: print the current line
input_file: name of your input file.
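To turn this into a general "set field F of line N to value V" command, all three can be passed in with -v (a sketch; line, field and val are hypothetical variable names):
awk -v line=1 -v field=3 -v val="XXXX" 'BEGIN{FS=OFS=":"} NR==line{$field=val} 1' input_file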
If instead what you are trying to accomplish is simply replace all occurrences of CCC with XXXX in your file, simply do:
sed -i 's/CCC/XXXX/g' input_file
Note that this will also replace partial matches, such as ABCCCDD -> ABXXXXDD
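With GNU sed, word-boundary anchors can limit that (a sketch; \b is a GNU extension, and the : delimiters count as word boundaries here):
sed -i 's/\bCCC\b/XXXX/g' input_file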
This might work for you (GNU sed):
sed -r 's/^(([^:]*:?){2})CCC/\1XXXX/' file
or
awk -F: -vOFS=: '$3=="CCC"{$3="XXXX"};1' file
