Change file information based on another file - shell

I have two simple text files:
The first, the reference file, looks like this (the first letter of every row is the important one):
G A
C A
G A
The second one looks like this:
G G G G
A A A A
A A A G
The second file is the one I want to change based on the information of the first.
For example, in the first row the first two columns contain G G. Because G was the first letter of the corresponding row in my reference file, I want to convert these two columns to a single column with the number 2 (indicating there were two Gs). The third and fourth columns also contain two Gs, so I want to convert these two columns to a single column with the number 2 as well.
In the last row of the second file, the first two columns contain the letters A and A, but because the first letter of the last row of my reference file was a G, I want to convert these two columns to the number 0 (indicating there were zero Gs; the first letter of the reference file is the one I am counting). The third and fourth columns contain an A and a G; because there is one G, I want to convert these two columns to a single column with the number 1.
The converted file should look like this:
2 2
0 0
0 1
Any help would be appreciated. Handling two files at a time and doing such conversions is not within my programming skills.
NOTE: My real files contain the letters A, C, G and T.

Assuming that the first file is called ref and the second file is called data:
$ awk 'NR==FNR{a[FNR]=$1; next} {print (a[FNR]==$1)+(a[FNR]==$2), (a[FNR]==$3)+(a[FNR]==$4)}' ref data
2 2
0 0
0 1
Explanation:
NR==FNR{a[FNR]=$1; next}
NR is the number of lines that have been read in so far and FNR is the number of lines that have been read in so far from the current file. So, when NR==FNR, we know that awk is still processing the first file. In that case, we save the first letter on the line in the array a. The next statement tells awk to skip the rest of the commands and go on to the next line.
print (a[FNR]==$1)+(a[FNR]==$2), (a[FNR]==$3)+(a[FNR]==$4)
Because of the next command above, this command is only executed if we are processing the second file. If so, we print out how many letters in the first two columns match the first letter on the corresponding row in the ref file, and then do the same for the third and fourth columns.
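For readability, the same logic can be kept in a standalone script and run with awk -f (a minimal sketch; the file name count_ref.awk is just a placeholder):
# count_ref.awk -- run as: awk -f count_ref.awk ref data
NR == FNR {          # still reading the first file (ref)
    a[FNR] = $1      # remember the reference letter for this row
    next             # skip the second block for ref lines
}
{                    # now reading the second file (data)
    print (a[FNR] == $1) + (a[FNR] == $2), (a[FNR] == $3) + (a[FNR] == $4)
}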
Handling missing data
Suppose that missing data is indicated by 0 0. As an example, take this data file:
$ cat data2
G G G G
0 0 C A
A G 0 0
The following awk script has been extended to show "?" where the data is missing:
$ awk 'NR==FNR{a[FNR]=$1; next} {print ($1==0)?"?":(a[FNR]==$1)+(a[FNR]==$2), ($3==0)?"?":(a[FNR]==$3)+(a[FNR]==$4)}' ref data2
2 2
? 1
1 ?
(The same ref file was used as before.)
Handling an arbitrary number of columns
awk 'NR==FNR{a[FNR]=$1; next} {s="";for (i=1;i<NF;i=i+2) {s=s (s==""?"":OFS) (($i==0)?"?":((a[FNR]==$i)+(a[FNR]==$(i+1))))}; print s}' ref3 data3
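The same loop spread out with comments (a sketch; ref3 and data3 stand in for whichever reference and data files are used):
awk 'NR == FNR { a[FNR] = $1; next }            # remember the reference letter per row
{
    s = ""
    for (i = 1; i < NF; i += 2) {               # walk the data columns two at a time
        v = ($i == 0) ? "?" : (a[FNR] == $i) + (a[FNR] == $(i+1))
        s = s (s == "" ? "" : OFS) v            # append, avoiding a leading separator
    }
    print s
}' ref3 data3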

Related

split one file into multiple files according to columns using bash cut or awk

I have a 2TB text table file separated by tab, and one column further separated by ";". Yeah, it is in fact a very large vcf file.
Using the tab delimiter we have 8 columns, and using the ";" delimiter we can split the 8th column into another 12 columns.
For easier statistical analysis, I need to split the file into 19 files, each containing one column. Preferably I would go through each file only once (since the files are large and I have about 100 of them, the IO cost is really high) and write the 19 columns into 19 separate files.
I have solved the problem, but in an inefficient way; basically:
cut -f 1-2 file.txt > column12.txt
but to get all 19 columns I need to go through the file 19 times, which is not efficient.
I am wondering if there is an efficient way to go through the file once and write out the 19 files?
Thanks very much indeed for your help.
The file example is like below
a b c d e f g;h;i;j;k
m n o p q l x;y;z;o;p
a b c d e f g;h;i;j;k
a b c d e f g;h;i;j;k
then I want files containing, for example:
a
m
a
a
With awk:
awk -F '[\t;]' '{for(i=1; i<=NF; i++) print $i >> "column" i ".txt"}' file
Tab and semicolon are used as field separators. NF contains the number of columns in the current row, $i is the content of the current column, and i is the number of the current column.
This creates 11 files. column11.txt contains:
k
p
k
k
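A slightly more defensive variant of the same command (a sketch, not a different approach): parenthesizing the file-name expression keeps stricter awk implementations happy, and > truncates each output file on its first use in the run instead of appending to leftovers from a previous run:
awk -F '[\t;]' '{
    for (i = 1; i <= NF; i++)
        print $i > ("column" i ".txt")    # one file per column, opened once for the whole run
}' file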

bash: identifying the first value list that also exists in another list

I have been trying to come up with a nice way in BASH to find the first entry in list A that also exists in list B, where A and B are in separate files.
A B
1024dbeb 8e450d71
7e474d46 8e450d71
1126daeb 1124dae9
7e474d46 7e474d46
1124dae9 3217a53b
In the example above, 7e474d46 is the first entry in A that also appears in B, so I would return 7e474d46.
Note: A can be millions of entries, and B can be around 300.
awk is your friend.
awk 'NR==FNR{a[$1]++;next}{if(a[$1]>=1){print $1;exit}}' file2 file1
7e474d46
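The same command spread out with comments (same logic; file2 holds list B, file1 holds list A):
awk 'NR == FNR { a[$1]++; next }    # first pass: remember every value that occurs in B (file2)
     $1 in a   { print $1; exit }   # second pass: print the first value of A (file1) seen in B, then stop
    ' file2 file1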
Note: Check the [ previous version ] of this answer too, which assumed that the values are listed in a single file as two columns. This one was written after you clarified in [ this ] comment that the values come as two separate files.
A few points are not clear, though; for example, what should happen if a value occurs in list A two or more times? (In your given example, 7e474d46 itself appears twice.) Assuming you need all the line numbers of list A that are present in list B, the following will help:
awk '{col1[$1]=col1[$1]?col1[$1]","FNR:FNR;col2[$2];} END{for(i in col1){if(i in col2){print col1[i],i}}}' Input_file
OR(NON-one liner form of above solution)
awk '{
col1[$1]=col1[$1]?col1[$1]","FNR:FNR;
col2[$2];
}
END{
for(i in col1){
if(i in col2){
print col1[i],i
}
}
}
' Input_file
The above code will produce the following output.
3,5 7e474d46
6 1124dae9
This creates an array col1 whose indices are the first-field values, and an array col2 whose indices are the second-field values ($2). col1's value is the current line number (FNR), concatenated onto whatever line numbers were already stored for that key. In the END section we traverse the col1 array and check whether each index is also present in col2; if it is, we print col1's value (the line numbers) and the index itself.
If you have GNU grep, you can try this:
grep -m 1 -f B A
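Note that -f B treats each line of B as a regular expression and matches it anywhere within a line of A. If the IDs should be compared as whole, fixed strings (and A has one entry per line), adding -F and -x may be safer; these are standard GNU grep options:
grep -m 1 -x -F -f B A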

Bash/Awk: Reformat uneven columns with multiple deliminators

I have a CSV where I need to reformat a single column's contents.
The problem is that each cell has completely different lengths to reformat.
The current column looks like this (these are two lines of a single column):
Foo*foo*foo*1970,1980+Bar*bar*bar*1970
Foobar*Foobar*foobarbar*1970,1975,1980
The result should look like this (still two lines, one column):
Foo*foo*foo*1970+Foo*foo*foo*1980+Bar*bar*bar*1970
Foobar*Foobar*foobarbar*1970+Foobar*Foobar*foobarbar*1975+Foobar*Foobar*foobarbar*1980
This is what I'm trying to do:
#!/bin/bash
cat foocol | \
awk -F'+' \
'{for i in NF print $i}' \
| awk -F'*' \
'{$Foo=$1"*"$2"*"$3"*" print $4}' \
\
| awk -v Foo=$Foo -F',' \
'{for j in NF do \
print Foo""$j"+" }' \
> newcol
The idea is to iterate over the '+'-delimited entries, and for each entry pair its first three '*'-delimited values with every ','-delimited year, joining the resulting records with '+'.
But I'm just getting syntax errors everywhere.
Thanks
$ awk --re-interval -F, -v OFS=+ '{match($1,/([^*]*\*){3}/);
prefix=substr($0,RSTART,RLENGTH);
for(i=2;i<=NF;i++) $i=prefix $i }1' file
Foo*foo*foo*1970+Foo*foo*foo*1980+Bar*bar*bar*1970
Foobar*Foobar*foobarbar*1970+Foobar*Foobar*foobarbar*1975+Foobar*Foobar*foobarbar*1980
perhaps add validation with if(match(...
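A sketch of that validation (same logic, just skipping and reporting records that do not have three *-delimited fields in front of the years; the input file name is assumed):
awk --re-interval -F, -v OFS=+ '{
    if (!match($1, /([^*]*\*){3}/)) {           # no prefix found: report and skip the record
        print "bad record: " $0 > "/dev/stderr"
        next
    }
    prefix = substr($0, RSTART, RLENGTH)        # e.g. "Foo*foo*foo*"
    for (i = 2; i <= NF; i++) $i = prefix $i    # prepend the prefix to every additional year
} 1' file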
Solution in TXR:
$ txr reformat.txr data
Foo*foo*foo*1970+Foo*foo*foo*1980+Bar*bar*bar*1970
Foobar*Foobar*foobarbar*1970+Foobar*Foobar*foobarbar*1975+Foobar*Foobar*foobarbar*1980
Code in reformat.txr:
#(repeat)
# (coll)#/\+?/#a*#b*#c*#(coll)#{x /[^,+]+/}#(until)+#(end)#(end)
# (output :into items)
# (repeat)
# (repeat)
#a*#b*#c*#x
# (end)
# (end)
# (end)
# (output)
# {items "+"}
# (end)
#(end)
This solution is based on treating the data as having nested syntax: groups of records are delimited by newlines. Records within groups are separated by + and within records there are four fields separated by *. The last field contains comma-separated items. The data is to be normalized by expanding copies of the records such that the comma-separated items are distributed across the copies.
The outer #(repeat) handles walking over the lines. The outer #(coll) iterates over records, collecting the first three fields into variables a, b and c. Then an inner #(coll) gets each comma separated item into the variable x. The inner #(coll) collects the x-s into a list, and the outer #(coll) also collects all the variables into lists, so a, b, c become lists of strings, and x is a list of lists of strings.
The :into items keyword parameter in the output causes the lines which would normally go to the standard output device to be collected into a list of strings and bound to a variable. For instance:
#(output :into lines)
a
b
cd
#(end)
establishes a variable lines which contains the list ("a" "b" "cd").
So here we are getting the output of the doubly-nested repeat as a bunch of lines, where each line represents a record, stored in a variable called items. Then we output these using #{items "+"}, a syntax which outputs the contents of a list variable with the given separator.
The doubly nested repeat handles the expansion of records over each comma separated item from the fourth field. The outer repeat implicitly iterates over the lists a, b, c and x. Inside the repeat, these variables denote the items of their respective lists. Variable x is a list of lists, and so the inner repeat iterates over that. Inside the outer repeat, variables a, b, c are already scalar, and stay that way in the scope of the inner repeat: only x varies, which is exactly what we want.
In the data collection across each line, there are some subtleties:
# (coll)#/\+?/#a*#b*#c*#(coll)#{x /[^,+]+/}#(until)+#(end)#(end)
Firstly, we match an optional leading plus with the /\+?/ regex, thereby consuming it. Without this, the a field of every record, except for the first one, would include that separating + and we would get double +-s in the final output. The a, b, c variables are matched simply. TXR is non-greedy with regard to the separating material: #a* means match some characters up to the nearest * and bind them to a variable a. Collecting the x list is more tricky. Here we use a positive-regex-match variable: #{x /[^,+]+/} to extract the sub-field. Each x is a sequence of one or more characters which are not pluses or commas, extracted positively without regard for whatever follows, much like a tokenizer extracts a token. This inner collect terminates when it encounters a +, which is what the #(until)+ clause ensures. It will also implicitly terminate if it hits the end of the line; the #(until) match isn't mandatory (by default). That terminating + stays in the input stream, which is why we have to recognize it and discard it in front of the #a.
It should be noted that #(coll), by default, scans for matches and skips regions of text that do not match, just like its cousin #(collect) does with lines. For instance if we have #(coll)#{foo /[a-z]+/}#(end), which collects sequences of lower-case letters into foo, turning foo into a list of such strings, and if the input is 1234abcd-efgh.... ijk, then foo ends up with the list ("abcd" "efgh" "ijk"). This is why there is no explicit logic in the inner #(coll) to consume the separating commas: they are implicitly skipped.

Get common lines, for only specific fields, from multiple files

I am trying to understand the following code used to pull out overlapping lines over multiple files using BASH.
awk 'END {
# the END block is executed after
# all the input has been read
# loop over the rec array
# and build the dup array indexed by the number of
# filenames containing a given record
for (R in rec) {
n = split(rec[R], t, "/")
if (n > 1)
dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
sprintf("\t%-20s -->\t%s", rec[R], R)
}
# loop over the dup array
# and report the number and the names of the files
# containing the record
for (D in dup) {
printf "records found in %d files:\n\n", D
printf "%s\n\n", dup[D]
}
}
{
# build an array named rec (short for record), indexed by
# the content of the current record ($0), concatenating
# the filenames separated by / as values
rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}' file[a-d]
After understanding what each sub-block of code is doing, I would like to extend this code to find specific fields with overlap, rather than the entire line. For example, I have tried changing the line:
n = split(rec[R], t, "/")
to
n = split(rec[R$1], t, "/")
to find the lines where the first field is the same across all files but this did not work. Eventually I would like to extend this to check that a line has fields 1, 2, and 4 the same, and then print the line.
Specifically, for the files mentioned in the example in the link:
if file 1 is:
chr1 31237964 NP_055491.1 PUM1 M340L
chr1 33251518 NP_037543.1 AK2 H191D
and file 2 is:
chr1 116944164 NP_001533.2 IGSF3 R671W
chr1 33251518 NP_001616.1 AK2 H191D
chr1 57027345 NP_001004303.2 C1orf168 P270S
I would like to pull out:
file1/file2 --> chr1 33251518 AK2 H191D
I found this code at the following link:
http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent from the files themselves. It is unclear from the comments provided, and the printf statements I've added within the sub-loops fail.
Thank you very much for any insight on this!
The script works by building an auxiliary array, the indices of which are the lines in the input files (denoted by $0 in rec[$0]), and the values are filename1/filename3/... for those filenames in which the given line $0 is present. You can hack it up to just work with $1,$2 and $4 like so:
awk 'END {
# the END block is executed after
# all the input has been read
# loop over the rec array
# and build the dup array indexed by the number of
# filenames containing a given record
for (R in rec) {
n = split(rec[R], t, "/")
if (n > 1) {
split(R,R1R2R4,SUBSEP)
dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3]) : \
sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3])
}
}
# loop over the dup array
# and report the number and the names of the files
# containing the record
for (D in dup) {
printf "records found in %d files:\n\n", D
printf "%s\n\n", dup[D]
}
}
{
# build an array named rec (short for record), indexed by
# the partial content of the current record
# (special concatenation of $1, $2 and $4)
# concatenating the filenames separated by / as values
rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
}' file[a-d]
This solution makes use of multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special syntax of awk concatenates the indices with the SUBSEP character, which is by default non-printable ("\034" to be precise), and so it is unlikely to be part of any of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=.... Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script, and finish with the END block.
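A tiny illustration of the SUBSEP mechanics (hypothetical values, taken from the sample rows above):
awk 'BEGIN {
    rec["chr1", "33251518", "AK2"] = "file1/file2"   # same as rec["chr1" SUBSEP "33251518" SUBSEP "AK2"]
    for (R in rec) {
        split(R, parts, SUBSEP)                      # recover the three fields from the concatenated index
        print parts[1], parts[2], parts[3], "-->", rec[R]
    }
}'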
The first part of the code also has to be changed: now for (R in rec) loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. This is fine for indexing, but you need to split R at the SUBSEP characters to recover the printable fields $1, $2, $4. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3]. In effect we're doing sprintf ...%s,...,$1,$2,$4 with the pre-saved fields $1, $2, $4. For your input example this will print
records found in 2 files:
foo11.inp1/foo11.inp2 --> chr1 33251518 AK2
Note that the output is missing H191D, but rightly so: that is not in field 1, 2 or 4 (but rather in field 5), so there's no guarantee that it is the same in the printed files! You probably don't want to print that, or at any rate you have to specify how to treat the columns which are not checked between files (and so may differ).
A bit of explanation for the original code:
rec is an array, the indices of which are full lines of input, and the values are the slash-separated list of files in which those lines appear. For instance, if file1 contains a line "foo bar", then rec["foo bar"]=="file1" initially. If then file2 also contains this line, then rec["foo bar"]=="file1/file2". Note that there are no checks for multiplicity, so if file1 contains this line twice, then eventually you'll get rec["foo bar"]=file1/file1/file2 and obtain 3 for the number of files containing this line.
R goes over the indices of the array rec after it has been fully built. This means that R will eventually assume each unique line of every input file, allowing us to loop over rec[R], containing the filenames in which that specific line R was present.
n is a return value from split, which splits the value of rec[R] --- that is the filename list corresponding to line R --- at each slash. Eventually the array t is filled with the list of files, but we don't make use of this, we only use the length of the array t, i.e. the number of files in which line R is present (this is saved in the variable n). If n==1, we don't do anything, only if there are multiplicities.
The loop over R then builds classes according to the multiplicity of a given line: n==2 applies to lines that are present in exactly 2 files, n==3 to those which appear thrice, and so on. What this loop does is build an array dup, which for every multiplicity class (i.e. for every n) collects the output strings "filename1/filename2/... --> R", separated by RS, for each value of R that appears n times total in the files. So eventually dup[n] for a given n will contain a given number of strings in the form of "filename1/filename2/... --> R", concatenated with the RS character (by default a newline).
The loop over D in dup will then go through multiplicity classes (i.e. valid values of n larger than 1), and print the gathered output lines which are in dup[D] for each D. Since we only defined dup[n] for n>1, D starts from 2 if there are multiplicities (or, if there aren't any, then dup is empty, and the loop over D will not do anything).
First you'll need to understand the 3 blocks in an AWK script:
BEGIN{
# A code that is executed once before the data processing start
}
{
# block without a name (default/main block)
# executed per line of input
# $0 contains all line data/columns
# $1 first column
# $2 second column, and so on..
}
END{
# A code that is executed once after all data processing finished
}
so you'll probably need to edit this part of the script:
{
# build an array named rec (short for record), indexed by
# the content of the current record ($0), concatenating
# the filenames separated by / as values
rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}

How to check the number of common values between two columns from different files

Suppose I have two files
$A
a b
1 5
2 6
3 7
4 8
$B
a b
1 5
2 6
5 6
My question is: in the shell or terminal, how do I calculate the total number of values of B's first column (1, 2, 5) that are also in A's first column (1, 2, 3, 4)? (Here the answer is 2, namely 1 and 2.)
The following awk solution counts column1 entries of file2 in file1:
awk 'FNR==1{next}NR==FNR{a[$1]=$2;next}$1 in a{count++}END{print count}' file1 file2
2
Skip the first line from both files using FNR==1{next}. You can remove this if you don't have header fields (a b) in your actual data files.
Read the entire first file into an array using NR==FNR{a[$1]=$2;next}. I am assigning column 2 here in case you wish to extend the solution to match both columns. You can also do a[$1]++ if you are not interested in column 2 at all; it won't hurt either way.
If the value of column 1 from the second file is in our array, increment a count variable.
In the END block print the count variable.
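If you do want to require both columns to match (the reason column 2 was stored above), a sketch of that variant (file names assumed as before):
awk 'FNR==1{next} NR==FNR{a[$1]=$2; next} ($1 in a) && a[$1]==$2 {count++} END{print count+0}' file1 file2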
