Comparing 2 files with a for loop in bash

I am trying to compare the values in 2 files. For each row in Summits3.txt I want to define the value in Column 1 as "Chr" and then find the rows in generef.txt which have my value for "Chr" in column 2.
Then I would like to output some info about that row from generef.txt to out.txt and then repeat until the end.
I am using the following script:
#!/bin/bash
IFS=$'\n'
for i in $(cat Summits3.txt)
do
  Chr=$(echo "$i" | awk '{print $1}')
  awk -v var="$Chr" '{
    if ($2==""'${Chr}'"")
      print $2, $3
  }' generef.txt > out.txt
done
it "works" but its only comparing values from the last line of Summits3.txt. It seems like it not looping through the awk bit.
Anyway please help if you can!

I think you might be looking for something like this:
awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt > out.txt
Basically, you read column one of the first file into an array (the array index is your chr; the value is left empty), then for the second file you print only the rows whose second column is in the array's index set. FNR is the row number within the file currently being processed, NR is the row number across all rows processed so far, so FNR == NR is only true while the first file is being read. This is a general look-up command I use for pulling out genes or variants from one file that are present in the other.
In your code above, the redirection should be appending to out.txt (>> out.txt), but then you have to make sure to reset out.txt before each run.
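If you prefer to keep a loop close to your original script, a minimal sketch with the redirection fixed (reset out.txt once, append inside the loop) and the -v variable actually used inside awk could look like this:
: > out.txt                      # reset the output file once
while read -r Chr rest; do
  awk -v var="$Chr" '$2 == var { print $2, $3 }' generef.txt >> out.txt
done < Summits3.txt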

Besides using external scripts inside a loop (that is expensive), the first thing we see is that you redirect your output to a file from inside the loop. The output file is recreated each time, so either change it to append (>>) or, better, move the redirection outside the loop.
When you want to use a loop, try this:
while read -r Chr other; do
  cut -d" " -f2,3 generef.txt | grep -E "^${Chr} "
done < Summits3.txt > out.txt
When you want to avoid the loop (needed for large input files), awk or some combined command can be used.
This first attempt can fail:
grep -f <(cut -d" " -f1 Summits3.txt) <(cut -d" " -f2,3 generef.txt)
You only want matches of the complete Chr field, so the pattern should be anchored at the first position and run until a space (I assume that is the field separator):
grep -f <(cut -d" " -f1 Summits3.txt| sed 's/.*/^& /') <(cut -d" " -f2,3 generef.txt)

Related

check if column has more than one value in unix

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So, it's like the first column has colors, so I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, for the first column I want to know how many Red, Blue, Green etc. coloured objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique values, you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this, for example to list all the unique values in the first column
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l
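For example, with the test.txt used above, the first pipeline counts how many distinct values column 1 has, while the second (using uniq -c) counts the occurrences of each value, which is what the updated question asks for:
awk < test.txt '{print $1}' | sort | uniq | wc -l
awk < test.txt '{print $1}' | sort | uniq -c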
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where the fields are:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
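For example, a usage sketch with the sample data above (the file name test.tsv is an assumption, and the file must really be tab-separated):
COLUMN=1
INPUT_FILE=test.tsv
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l   # number of distinct colours in column 1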
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 2))
for ((i = 1; i < cols; i++))
do
echo Column $i ::
cut -f $i < "$FILE" | sort | uniq -c
echo
done
This script outputs, for every column of a given file, each unique value together with its count. It assumes that the first line of the given file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter to this script.
Code
#!/bin/bash
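# Note: arr[x][y] (arrays of arrays) below requires GNU awk (gawk) 4.0 or newer.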
awk '
(NR==1){
for(fi=1; fi<=NF; fi++)
fname[fi]=$fi;
}
(NR!=1){
for(fi=1; fi<=NF; fi++)
arr[fname[fi]][$fi]++;
}
END{
for(fi=1; fi<=NF; fi++){
out=fname[fi];
for (item in arr[fname[fi]])
out=out"\t"item"_"arr[fname[fi]][item];
print(out);
}
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

how to use cut command -f flag as reverse

This is a text file called a.txt
ok.google.com
abc.google.com
I want to select every subdomain separately
cat a.txt | cut -d "." -f1 (it select ok From left side)
cat a.txt | cut -d "." -f2 (it select google from left side)
Is there any way, so I can get result from right side
cat a.txt | cut (so it can select com From right side)
There could be a few ways to do this; the one I can think of right now is a rev + cut + rev solution. It reverses the input with the rev command, then sets the field separator to . and prints the fields as if they ran from left to right (they are actually reversed because of rev), then passes this output to rev again to restore the original order.
rev Input_file | cut -d'.' -f 1 | rev
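Applied to the a.txt from the question, this prints:
com
com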
You can use awk to print the last field:
awk -F. '{print $NF}' a.txt
-F. sets the field separator to "."
$NF is the last field
And you can give your file directly as an argument, so you can avoid the famous "Useless use of cat"
For other fields, counting from the last, you can use expressions as suggested in the comment by @sundeep or described in the user's guide under 4.3 Nonconstant Field Numbers. For example, to get the domain before the TLD, you can subtract 1 from the number of fields NF:
awk -F. '{ print $(NF-1) }' a.txt
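For the sample a.txt, the second-to-last field is google on both lines, so this prints:
google
google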
You might use sed with a quantifier for the grouped value repeated till the end of the string.
( Start group
\.[^[:space:].]+ Match 1 dot and 1+ occurrences of any char except a space or dot
){1} Close the group followed by a quantifier
$ End of string
Example
sed -E 's/(\.[^[:space:].]+){1}$//' file
Output
ok.google
abc.google
If the quantifier is {2} the output will be
ok
abc
Depending on what you want to do after getting the values, you could use bash to split your domain into an array of its components:
#!/bin/bash
IFS=. read -ra comps <<< "ok.google.com"
echo "${comps[-2]}"
# or for bash < 4.2
echo "${comps[${#comps[#]}-2]}"
google
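A minimal sketch applying the same split to every line of a.txt (the negative index form assumes bash 4.2+, as noted above):
while IFS=. read -ra comps; do
  echo "${comps[-1]}"    # last component, i.e. com
done < a.txt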

bash shell how to cut the first column out of a file

so I have a file named 'file' that contains these characters
a 1 z
b 2 y
c 3 x
how can I cut the first column and put it in its own file?
I know how to do the rest using the space as a delimiter like this:
cut -f1 -d ' ' file > filecolumn1
but I'm not sure how to cut just the first column since there isn't any character in the front that I can use as a delimiter.
The delimiter doesn't have to be before the column, it's between the columns. So use the same delimiter, and specify field 1.
cut -f1 -d ' ' file > filecolumn1
Barmar's got a good option. Another option is awk:
awk '{print $1}' file > output.txt
If you have a different delimiter, you can use the -F switch to provide it. For example, if your data were like this:
a,1,2
b,2,3
c,3,4
you can use awk's -F switch in this manner:
awk -F',' '{print $1}' file > output.txt

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could just modify this like the following, which additionally tells awk to print the whole line when the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
cat x | while read -r y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
The input is read into the variable y, then split into an array. The IFS (input field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third field of the array is then compared to '8'. If they are equal, it prints the first field of the array. Remember that fields in arrays start counting at zero.
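A slightly more readable sketch of the same idea, reading the tab-separated fields directly into named variables instead of an array (file1 and file_with_column_3 are the names from the question):
while IFS=$'\t' read -r col1 col2 col3; do
  [ "$col3" = "8" ] && echo "$col1"
done < file1 >> file_with_column_3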

Exclude a column when pasting two data files

I have one file "dat1.txt" which is like:
0 5.71159e-01
1 1.92632e-01
2 -4.73603e-01
and another file "dat2.txt" which is:
0 5.19105e-01
1 2.29702e-01
2 -3.05675e-01
To combine these two files into one I use
paste dat1.txt dat2.txt > data.txt
But I do not want the 1st column of the 2nd file in the output file. How do I modify the unix command?
If your files are in sorted order along column 1, you could try:
join dat[12].txt
You could try this in awk itself:
$ awk 'FNR==NR {a[FNR]=$0;next} {print a[FNR],$2}' data1.txt data2.txt
0 5.71159e-01 5.19105e-01
1 1.92632e-01 2.29702e-01
2 -4.73603e-01 -3.05675e-01
Use cut to remove the first column and then pipe to paste.
cut -d' ' -f 1 --complement dat2.txt | paste dat1.txt - > data.txt
Note that the - in the paste command means to read from stdin in place of the second file.
If your cut does not support --complement (it is a GNU extension and is missing from the BSD cut on OSX), awk might work.
awk '{for (i=2; i<=NF; i++) print $i}' dat2.txt | paste dat1.txt - > data.txt
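Note that the loop above prints each field on its own line, which only lines up because dat2.txt has exactly two columns; a sketch that keeps fields 2 through NF of each row on a single line before pasting:
awk '{ out = $2; for (i = 3; i <= NF; i++) out = out OFS $i; print out }' dat2.txt | paste dat1.txt - > data.txt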
paste dat1.txt <(cut -d" " -f2- dat2.txt)
Using cut to remove column 1, and using process substitution to use its output in paste
Output:
0 5.71159e-01 5.19105e-01
1 1.92632e-01 2.29702e-01
2 -4.73603e-01 -3.05675e-01
