Get the contents of one column given another column - bash

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.

Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could just modify this like the following, which tells awk to print if the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile

awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1

You can do it with bash only too:
cat x | while read y; do split=(${y}); [ ${split[2]} == '8' ] && echo $split[0]; done
The input is read in variable y, then split into an array. The IFS (input field separator) defaults to <space><tab<>newline>, so it splits on tabs too. The third field of the array is then compared to '8'. If it equals, it prints the first field of the array. Remember that fields in arrays start counting at zero.

Related

Comparing 2 files with a for loop in bash

I am trying to compare the values in 2 files. For each row in Summits3.txt I want to define the value in Column 1 as "Chr" and then find the rows in generef.txt which have my value for "Chr" in column 2.
Then I would like to output some info about that row from generef.txt to out.txt and then repeat until the end.
I am using the following script:
#!/bin/bash
IFS=$'\n'
for i in $(cat Summits3.txt)
do
Chr=$(echo "$i" | awk '{print $1}')
awk -v var="$Chr" '{
if ($2==""'${Chr}'"")
print $2, $3
}' generef.txt > out.txt
done
it "works" but its only comparing values from the last line of Summits3.txt. It seems like it not looping through the awk bit.
Anyway please help if you can!
I think you might be looking for something like this:
awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt > out.txt
Basically you read column one from the first file into an array (array index is your chr and the value is empty character) then for the second file print only rows where the second column is in the index set of the array. FNR row number in file that is currently being processed, NR row number of all processed rows so far. This is a general look-up command I use for pulling out genes or variants from one file that are present in the other.
In your code above it should be appending to out.txt: >> out.txt. But you have to make sure to re-set out.txt at each run.
Besides using external scripts inside a loop (that is expensive), the first thing we see is that you redirect your output to a file from insside the loop. The output files is recreated each time, so please change inte append (>>) or better move the redirection outdide the loop.
When you want to use a loop, try this
while read -r Chr other; do
cut -d" " -f2,3 generef.txt | grep -E "^${Chr} "
done < Summits3.txt > out.txt
When you want to avoid the loop (needed for large inputfiles), an awk or some combined command can be used.
The first solution can fail:
grep -f <(cut -d" " -f1 Summits3.txt) <(cut -d" " -f2,3 generef.txt)
You only want matches of the complete field Chr, so starting at the first position until a space ( I assume that is the field-sep).
grep -f <(cut -d" " -f1 Summits3.txt| sed 's/.*/^& /') <(cut -d" " -f2,3 generef.txt)

Extract the last three columns from a text file with awk

I have a .txt file like this:
ENST00000000442 64073050 64074640 64073208 64074651 ESRRA
ENST00000000233 127228399 127228552 ARF5
ENST00000003100 91763679 91763844 CYP51A1
I want to get only the last 3 columns of each line.
as you see some times there are some empty lines between 2 lines which must be ignored. here is the output that I want to make:
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
awk  '/a/ {print $1- "\t" $-2 "\t" $-3}'  file.txt.
it does not return what I want. do you know how to correct the command?
Following awk may help you in same.
awk 'NF{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file
Output will be as follows.
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
EDIT: Adding explanation of command too now.(NOTE this following command is for only explanation purposes one should run above command only to get the results)
awk 'NF ###Checking here condition NF(where NF is a out of the box variable for awk which tells number of fields in a line of a Input_file which is being read).
###So checking here if a line is NOT NULL or having number of fields value, if yes then do following.
{
print $(NF-2),$(NF-1),$NF###Printing values of $(NF-2) which means 3rd last field from current line then $(NF-1) 2nd last field from line and $NF means last field of current line.
}
' OFS="\t" Input_file ###Setting OFS(output field separator) as TAB here and mentioning the Input_file here.
You can use sed too
sed -E '/^$/d;s/.*\t(([^\t]*[\t|$]){2})/\1/' infile
With some piping:
$ cat file | tr -s '\n' | rev | cut -f 1-3 | rev
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
First, cat the file to tr to squeeze out repeted \ns to get rid of empty lines. Then reverse the lines, cut the first three fields and reverse again. You could replace the useless cat with the first rev.

passing a parameter in awk command won't work

I run the script bellow with ./command script.sh 11, the first line of code bellow stores the output (321) successfully in parameter x (checked with echo on line 2). On line 3 I try to use parameter x to retrieve the last two columns on all lines where the value in the first column is equal to x (in doc2.csv). This won't work but when I replace z=$x by z=321it works fine. Why won't this code work when passing the parameter?
#!/bin/bash
x="$(awk -v y=$1 -F\; '$1 == y' ~/Documents/doc1.csv | cut -d ';' -f2)"
echo $x
awk -v z=$x -F, '$1 == z' ~/Documents/doc2.csv | cut -d ',' -f2,3
doc1.csv (all columns have unique values)
33;987
22;654
11;321
...
doc2.csv
321,156843,ABCD
321,637253,HYEB
123,256843,BHJN
412,486522,HDBC
412,257843,BHJN
862,256843,BHLN
...
Like others have mentioned there is probably some extra characters coming along for the ride in field 2 of your cut command.
If you just use awk to print the column you want instead of the entire line and cutting that you shouldn't have any problems. If you still do then you will need to look into dos2unix.
n=33;
x=$(awk -v y=$n -F\; '$1 == y {print $2}' d1);
echo ${x};
awk -v z=$x -F, '$1 == z' d2
d1 and d2 contain doc1 and doc2 contents as you outlined.
As you can see all I did was stop using cut on the output of awk and just told awk to print the second field if the first field is equal to the input variable.
By the way awk is pretty powerful if you weren't aware... You can do this entire program within awk.
n=11; awk -v x=$n -F\; 'NR==FNR{ if($1==x){ y[$2]; } next} $1 in y{print $2, $3}' d1 <( sed 's/,/;/g' d2)
NR==FNR Is a trick that effectively says "If we are still in the first file, do this"... the key is not forgetting to use next to skip the rest of the awk command. Once we get to the second file FNR flips back to 1 but NR keeps incrementing up so they'll never be equal again.
So for the first file we just load up the second column values into an array where the first column matches our passed variable. You could optimize this since you said d1 was always unique lines.
So once we get into the next file the logic skips everything and runs $1 in y. This just checks if the first column is in the array we have created. If it is awk prints column 2 and 3.
<( sed 's/,/;/g' d2) just means we want to treat the output of the sed command as a file. The sed command is just converting the commas in d2 to semicolons so that it matches the FS that awk expects.
Hopefully you've learned a bit about awk, read more here http://www.catonmat.net/blog/ten-awk-tips-tricks-and-pitfalls/ and a great redirection cheat sheet is available here http://www.catonmat.net/download/bash-redirections-cheat-sheet.pdf .

awk combine 2 commands for csv file formatting

I have a CSV file which has 4 columns. I want to first:
print the first 10 items of each column
only print the items in the third column
My method is to pipe the first awk command into another but i didnt get exactly what i wanted:
awk 'NR < 10' my_file.csv | awk '{ print $3 }'
The only missing thing was the -F.
awk -F "," 'NR < 10' my_file.csv | awk -F "," '{ print $3 }'
You don't need to run awk twice.
awk -F, 'NR<=10{print $3}'
This prints the third field for every line whose record number (line) is less than or equal to 10.
Note that < is different from <=. The former matches records one through nine, the latter matches records one through ten. If you need ten records, use the latter.
Note that this will walk through your entire file, so if you want to optimize your performance:
awk -F, '{print $3} NR>10{exit}'
This will print the third column. Then if the record number is greater than 10, it will exit. This does not step through your entire file.
Note also that awk's "CSV" matching is very simple; awk does not understand quoted fields, so the record:
red,"orange,yellow",green
has four fields, two of which have double quotes in them. YMMV depending on your input.

awk - split only by first occurrence

I have a line like:
one:two:three:four:five:six seven:eight
and I want to use awk to get $1 to be one and $2 to be two:three:four:five:six seven:eight
I know I can get it by doing sed before. That is to change the first occurrence of : with sed then awk it using the new delimiter.
However replacing the delimiter with a new one would not help me since I can not guarantee that the new delimiter will not already be somewhere in the text.
I want to know if there is an option to get awk to behave this way
So something like:
awk -F: '{print $1,$2}'
will print:
one two:three:four:five:six seven:eight
I will also want to do some manipulations on $1 and $2 so I don't want just to substitute the first occurrence of :.
Without any substitutions
echo "one:two:three:four:five" | awk -F: '{ st = index($0,":");print $1 " " substr($0,st+1)}'
The index command finds the first occurance of the ":" in the whole string, so in this case the variable st would be set to 4. I then use substr function to grab all the rest of the string from starting from position st+1, if no end number supplied it'll go to the end of the string. The output being
one two:three:four:five
If you want to do further processing you could always set the string to a variable for further processing.
rem = substr($0,st+1)
Note this was tested on Solaris AWK but I can't see any reason why this shouldn't work on other flavours.
Some like this?
echo "one:two:three:four:five:six" | awk '{sub(/:/," ")}1'
one two:three:four:five:six
This replaces the first : to space.
You can then later get it into $1, $2
echo "one:two:three:four:five:six" | awk '{sub(/:/," ")}1' | awk '{print $1,$2}'
one two:three:four:five:six
Or in same awk, so even with substitution, you get $1 and $2 the way you like
echo "one:two:three:four:five:six" | awk '{sub(/:/," ");$1=$1;print $1,$2}'
one two:three:four:five:six
EDIT:
Using a different separator you can get first one as filed $1 and rest in $2 like this:
echo "one:two:three:four:five:six seven:eight" | awk -F\| '{sub(/:/,"|");$1=$1;print "$1="$1 "\n$2="$2}'
$1=one
$2=two:three:four:five:six seven:eight
Unique separator
echo "one:two:three:four:five:six seven:eight" | awk -F"#;#." '{sub(/:/,"#;#.");$1=$1;print "$1="$1 "\n$2="$2}'
$1=one
$2=two:three:four:five:six seven:eight
The closest you can get with is with GNU awk's FPAT:
$ awk '{print $1}' FPAT='(^[^:]+)|(:.*)' file
one
$ awk '{print $2}' FPAT='(^[^:]+)|(:.*)' file
:two:three:four:five:six seven:eight
But $2 will include the leading delimiter but you could use substr to fix that:
$ awk '{print substr($2,2)}' FPAT='(^[^:]+)|(:.*)' file
two:three:four:five:six seven:eight
So putting it all together:
$ awk '{print $1, substr($2,2)}' FPAT='(^[^:]+)|(:.*)' file
one two:three:four:five:six seven:eight
Storing the results of the substr back in $2 will allow further processing on $2 without the leading delimiter:
$ awk '{$2=substr($2,2); print $1,$2}' FPAT='(^[^:]+)|(:.*)' file
one two:three:four:five:six seven:eight
A solution that should work with mawk 1.3.3:
awk '{n=index($0,":");s=$0;$1=substr(s,1,n-1);$2=substr(s,n+1);print $1}' FS='\0'
one
awk '{n=index($0,":");s=$0;$1=substr(s,1,n-1);$2=substr(s,n+1);print $2}' FS='\0'
two:three:four five:six:seven
awk '{n=index($0,":");s=$0;$1=substr(s,1,n-1);$2=substr(s,n+1);print $1,$2}' FS='\0'
one two:three:four five:six:seven
Just throwing this on here as a solution I came up with where I wanted to split the first two columns on : but keep the rest of the line intact.
Comments inline.
echo "a:b:c:d::e" | \
awk '{
split($0,f,":"); # split $0 into array of fields `f`
sub(/^([^:]+:){2}/,"",$0); # remove first two "fields" from `$0`
print f[1],f[2],$0 # print first two elements of `f` and edited `$0`
}'
Returns:
a b c:d::e
In my input I didn't have to worry about the first two fields containing escaped :, if that was a requirement, this solution wouldn't work as expected.
Amended to match the original requirements:
echo "a:b:c:d::e" | \
awk '{
split($0,f,":");
sub(/^([^:]+:)/,"",$0);
print f[1],$0
}'
Returns:
a b:c:d::e

Resources