I have two files,
File 1
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
3,3,0,0,Test3,1540591243,36
File 2
2,1,0,2,Test1,1540584051,52
6,5,0,2,Test2,1540579206,54
I want to look up the column 7 value from File 1 and check if it matches the column 7 value in File 2, and when it matches, replace that line in File 2 with the corresponding line from File 1.
So the output would be
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
Thanks in advance.
You can do that with the following script:
BEGIN { FS="," }

NR==FNR {
    lookup[$7] = $0
    next
}

{
    if (lookup[$7] != "") {
        $0 = lookup[$7]
    }
    print
}

END {
    print ""
    print "Lookup table used was:"
    for (i in lookup) {
        print " Key '" i "', Value '" lookup[i] "'"
    }
}
The BEGIN section simply sets the field separator to a comma so that individual fields can be easily processed.
The NR and FNR variables are, respectively, the line number of the full input stream (all files) and the line number of the current file in the input stream. When you are processing the first (or only) file, these will be equal, so we use this as a means to simply store the lines from the first file, keyed on field seven.
When NR and FNR are not equal, it's because you've started the second file and this is where we want to replace lines if their key exists in the first file.
This is done by simply checking whether a line exists in the lookup table with the desired key and, if it does, replacing the current line with the lookup table line. Then we print the (original or replaced) line.
The END section is there just for debugging purposes, it outputs the lookup table that was created and used, and you can remove it once you're satisfied the script works as expected.
The following transcript shows the output, hopefully illustrating that it works correctly:
pax$ cat file1
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
3,3,0,0,Test3,1540591243,36
pax$ cat file2
2,1,0,2,Test1,1540584051,52
6,5,0,2,Test2,1540579206,54
pax$ awk -f sudarshan.awk file1 file2
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
Lookup table used was:
Key '36', Value '3,3,0,0,Test3,1540591243,36'
Key '52', Value '2,1,1,1,Test1,1540584051,52'
Key '54', Value '6,5,1,1,Test2,1540579206,54'
If you need it as a "short as possible" one-liner to use from your script, just use:
awk -F, 'NR==FNR{x[$7]=$0;next}{if(x[$7]!=""){$0=x[$7]};print}' file1 file2
though I prefer the readable version myself.
This might work for you (GNU sed):
sed -r 's|^([^,]*,){6}([^,]*).*|/^([^,]*,){6}\2/s/.*/&/p|' file1 | sed -rnf - file2
This turns file1 into a sed script and, using the 7th field as the lookup key, replaces any line in file2 that matches.
In your example the 7th field is the last one, so a short version of the above solution is:
sed -r 's|.*,(.*)|/.*,\1/s/.*/&/p|' file1 | sed -nf - file2
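To see what is happening, the first sed stage of the shorter version turns each file1 line into a one-line substitute command keyed on the last field; for the sample file1 it generates roughly:
/.*,52/s/.*/2,1,1,1,Test1,1540584051,52/p
/.*,54/s/.*/6,5,1,1,Test2,1540579206,54/p
/.*,36/s/.*/3,3,0,0,Test3,1540591243,36/p
The second sed invocation (-nf -) then runs that generated script against file2, printing only the lines it replaces.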
I have a big data text file (more than 100,000 rows) in this format:
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc
0.00118343;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00276134;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;
0.0607495;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00670611;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
Now, each row contains a gene name; for example, the first 4 rows contain the CLCNKA gene. I am using the grep command to count the frequency of each gene name in this data file, as:
grep -w "CLCNKA" my_data_file | wc -l
There are about 300 genes in a separate file which are to be searched for in the above data file. Can some expert please write a simple shell script with a loop that takes each gene name from the list one by one and stores its frequency in a separate file? So, the output file would be like this:
CLCNKA 4
XDH 2
GRK4 4
You've confused us. I and some others think all you want is a count of each gene in the file, since that's what your input/output and some of your descriptive text state ("count the frequency of each gene name in this data file"), which would just be this:
$ awk -F'[=;]' '{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' file
GRK4 4
CLCNKA 4
XDH 2
while everyone else thinks you want a count of specific genes that exist in a different file, since that's what your Subject line, proposed algorithm and the rest of your text state.
If everyone else is right then you'd need this tweak to read the "genes" file first and only count the genes in "file" that were listed in "genes":
awk -F'[=;]' 'NR==FNR{genes[$0]; next} $11 in genes{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' genes file
GRK4 4
CLCNKA 4
XDH 2
Your example doesn't help since it would produce the same output with either interpretation of your requirements, so edit your question to clarify what it is you want. In particular, if there are genes that you do NOT want counted, then include lines containing those in the sample input.
awk is your friend
awk '{sub(/^.*Gene\.refGene=/,"");sub(/;.*$/,"");
genelist[$0]++}END{for(i in genelist){print i,genelist[i]}}' file
Output
GRK4 4
CLCNKA 4
XDH 2
Sidenote: This may not give you the gene name frequencies in the order in which they appear in the file. I guess that is not a requirement after all.
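If the original order ever does matter, a small variation of the same awk (just a sketch along the same lines) records the order in which each gene is first seen:
awk '{ sub(/^.*Gene\.refGene=/,""); sub(/;.*$/,"")
       if (!count[$0]++) order[++n] = $0 }           # remember first-seen order
     END{ for (i = 1; i <= n; i++) print order[i], count[order[i]] }' file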
This can also be done in pure bash, by using the associative array feature to count the frequencies:
#!/bin/bash
# declare assoc array
declare -A freq
# split stdin input csv
for gene in $(cut -d ';' -f 6|cut -d = -f 2);do
let freq[$gene]++
done
# loop over array keys
for key in ${!freq[@]}; do
echo ${key} ${freq[$key]}
done
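Note that the script reads the data on standard input. Assuming it is saved as count_genes.sh (a name chosen only for this example) and the counts should go to a separate file (here gene_counts.txt), it could be run as:
chmod +x count_genes.sh
./count_genes.sh < my_data_file > gene_counts.txt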
A simpler solution relying on the uniq command:
#!/bin/bash
cut -d ';' -f 6|cut -d = -f 2|sort|uniq -c|while read -a kv;do
echo ${kv[1]} ${kv[0]}
done
Here is a one-liner:
sed "s/.*Gene.refGene=//;s/\;.*//" test | sort | uniq -c | awk '{print $2,$1}'
sed - will remove everything from line except gene name
sort will do sorting by name
uniq -c - will count number of gene repeats
awk will swap the uniq output columns (by default uniq -c prints: count pattern)
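For the sample data, the intermediate output of sort | uniq -c looks like this (count first, which is why the final awk swaps the two columns):
      4 CLCNKA
      4 GRK4
      2 XDH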
To preserve order, provided the input file is grouped by gene as in the sample:
$ perl -lne '
($g) = /Gene\.refGene=([^;]+)/;
if($g ne $p && $. > 1)
{
print "$p\t$c";
$c = 0;
}
$c++; $p = $g;
END { print "$p\t$c" }' ip.txt
CLCNKA 4
XDH 2
GRK4 4
If not, use a hash to count occurrences with the gene name as key, and an array to remember the order of the keys:
$ perl -lne '
($k) = /Gene\.refGene=([^;]+)/;
push(@o, $k) if !$h{$k}++;
END { print "$_\t$h{$_}" foreach (@o) }' ip.txt
CLCNKA 4
XDH 2
GRK4 4
If you only search for a list of genes, an inefficient but straightforward way is:
while read g; do echo -n "$g "; grep -c "$g" file; done < genes
assuming your genes are listed one at a time in the genes file.
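For reference, that genes file would simply contain one gene name per line, e.g.:
CLCNKA
XDH
GRK4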
If your file structure is fixed, a more efficient version will be
awk 'NR==FNR{genes[$1];next}
{sub(/Gene.refGene=/,"",$6)}
$6 in genes{count[$6]++}
END{for(g in count) print g,count[g]}' genes FS=';' file
I need to print the 2 columns after a specific string (in my case it is 64). There can be multiple instances of 64 within the same CSV row; however, the next instance will not occur within 3 columns of the previous occurrence. The output for each instance should be on its own line, and results should be unique. The problem is that the specific string does not fall in the same column for all rows. Every row has dynamic data and there is no header in the CSV. Let's say the below is the input file (it's just a sample; the actual file has approx 300 columns & 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance-related concerns. I have tried to search many threads but could not find anything closely related. Please help.
We can use a pipe of two commands: the first puts each 64 at the start of its own line, and the second prints the first three columns of any line that starts with 64.
sed 's/,64[,\n]/\n64,/g' | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
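Both pipelines take the data on the sed side, so assuming the input is in a file called data.csv (a name used purely for illustration), the first one would be run as:
sed 's/,64[,\n]/\n64,/g' data.csv | awk -F, '/^64/ { print $1 FS $2 FS $3 }'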
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
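Spread over multiple lines for readability, the same one-liner reads as:
awk -F, '{
    for (i = 0; ++i <= NF; ) {
        if ($i == "64") a = 4                  # start a window of 3 fields at each 64
        if (--a > 0)    s = s ? s "," $i : $i  # append the current field to s
        if (a == 1)     { print s; s = "" }    # third field collected: print and reset
    }
}' file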
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
assuming columns are separated by blank space or comma
a sort -u is needed for uniqueness (it is possible in sed too, but it would mean adding another "simple" action of the same kind in this case)
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
if($i==64)
{k=$i FS $(++i) FS $(++i);
if (!a[k]++)
print k
}
}' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.
I have 100 text files which look like this:
File title
4
Realization number
variable 2 name
variable 3 name
variable 4 name
1 3452 4538 325.5
The first number on the 7th line (1) is the realization number, which SHOULD relate to the file name, i.e. the first file is called file1.txt and has realization number 1 (as shown above). The second file is called file2.txt and should have realization number 2 on the 7th line. file3.txt should have realization number 3 on the 7th line, and so on...
Unfortunately every file has realization=1, where they should be incremented according to the file name.
I want to extract variables 2, 3 and 4 from the 7th line (3452, 4538 and 325.5) in each of the files and append them to a summary file called summary.txt.
I know how to extract the information from 1 file:
awk 'NR==7,NR==7{print $2, $3, $4}' file1.txt
Which, correctly gives me:
3452 4538 325.5
My first problem is that this command doesn't seem to give the same results when run from a bash script on multiple files.
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR=7,NR==7{print $2, $3, $4}' File$((i)).txt
done
I get multiple lines being printed to the screen when I use the above script.
Secondly, I would like to output those values to the summary file along with the CORRECT preceding realization number, i.e. I want a file that looks like this:
1 3452 4538 325.5
2 4582 6853 158.2
...
100 4865 3589 15.15
Thanks for any help!
You can simplify some things and get the result you're after:
#!/bin/bash
for ((i=1;i<=100;i++))
do
echo $i $(awk 'NR==7{print $2, $3, $4}' File$i.txt)
done
You really don't want to assign to NR=7 (as you did) and you don't need to repeat the NR==7,NR==7 either. You also really don't need the $((i)) notation when $i is sufficient.
If all the files are exactly 7 lines long, you can do it all in one awk command (instead of 100 of them):
awk 'NR%7==0 { print ++i, $2, $3, $4 }' File*.txt
Notice that you have only one = in your bash script. Do all the files have exactly 7 lines? If you are only interested in the 7th line, then:
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR==7{print $2, $3, $4}' File$((i)).txt
done
Since your realization number starts from 1, you can simply add that using the nl command.
For example, if your bash script is called s.sh then:
./s.sh | nl > summary.txt
will get you the result with the expected lines in summary.txt
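With nl's default numbering, summary.txt would then look something like this (using the sample values from the question):
     1  3452 4538 325.5
     2  4582 6853 158.2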
Here's one way using awk:
awk 'FNR==7 { print ++i, $2, $3, $4 > "summary.txt" }' $(ls -v file*)
The -v flag simply sorts the glob by version numbers. If your version of ls doesn't support this flag, try: ls file* | sort -V instead.
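Putting that fallback together, the equivalent command would be:
awk 'FNR==7 { print ++i, $2, $3, $4 > "summary.txt" }' $(ls file* | sort -V)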
I have a small file with around 50 lines and 2 fields, like below:
file1
-----
12345 8373
65236 7376
82738 2872
..
..
..
I have around 100 files which are comma (",") separated, as below:
file2
-----
1,3,4,4,12345,,,23,3,,,2,8373,1,1
Each file has many lines similar to the above line.
I want to extract the lines from all these 100 files whose
5th field is equal to the 1st field in the first file and
13th field is equal to the 2nd field in the first file.
I want to search all the 100 files using that single file.
I came up with the below in the case of a single comma-separated file. I am not even sure whether this is correct!
But I have multiple comma-separated files.
awk -F"\t|," 'FNR==NR{a[$1$2]++;next}($5$13 in a)' file1 file2
Can anyone help me, please?
EDIT:
The above command is working fine in the case of a single file.
Here is another using an array, avoiding multiple work files:
#!/bin/awk -f
FILENAME == "file1" {
    keys[$1] = ""
    keys[$2] = ""
    next
}

{
    split($0, fields, ",")
    if (fields[5] in keys && fields[13] in keys) print "*:", $0
}
I am using split because the field separators in the two files are different. You could swap it around if necessary. You should call the script thus (after making it executable with chmod +x runit.awk):
./runit.awk file1 file2
An alternative is to read the first file explicitly with getline in a BEGIN block.
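A minimal sketch of what that might look like (keeping the same key handling as the script above):
#!/bin/awk -f
# file1 is read up front with getline, so only the comma-separated
# files need to be passed on the command line.
BEGIN {
    while ((getline line < "file1") > 0) {
        split(line, f)
        keys[f[1]] = ""
        keys[f[2]] = ""
    }
    close("file1")
}
{
    split($0, fields, ",")
    if (fields[5] in keys && fields[13] in keys) print "*:", $0
}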
Here is a simple approach. Extract each line from the small file, split it into fields and then use awk to print lines from the other files which match those fields:
while read line
do
f1=$(echo $line | awk '{print $1}')
f2=$(echo $line | awk '{print $2}')
awk -v f1="$f1" -v f2="$f2" -F, '$5==f1 && $13==f2' file*
done < small_file
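The same idea can also let read split the two fields directly, avoiding the two extra awk calls per line (same logic, just a sketch):
while read -r f1 f2
do
    awk -v f1="$f1" -v f2="$f2" -F, '$5==f1 && $13==f2' file*
done < small_file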