awk to combine all lines in file with another - bash

The awk below combines target.txt with out_parse.txt, and the output is GJ-53.txt. If there are multiple lines in out_parse.txt, how can they all be written to GJ-53.txt? As it stands, the first line of out_parse.txt is saved to the text file GJ-53, but the second line is not. Thank you :).
awk '{close(fname)} (getline fname<f)>0 {print>fname}' f=target.txt out_parse.txt
Contents of out_parse.txt
13 20763612 20763612 C T
13 20763620 20763620 A G
Contents of target.txt
GJ-53.txt
cat -v out_parse.txt
13 20763612 20763612 C T
13 20763620 20763620 A G

If I understand correctly, you want to copy the contents of out_parse.txt to a new file, whose name is given in the file target.txt. To do that, you don't really need to use awk at all:
cp out_parse.txt "$(< target.txt)"
In bash, $(< file) can be used as a substitution for the contents of file. It achieves the same thing as $(cat file).
If you wanted to use awk, you could do something like this:
awk 'NR==FNR{f=$0;next}{print>f}' target.txt out_parse.txt
The first block applies to the first file, where the total record number NR is equal to the current file's record number FNR. It saves the content of the line (i.e. the filename) to f and skips any further instructions. The second block applies only to the second file and prints every line to the filename saved in f.
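For example, with the sample files shown above, either approach should leave both records in GJ-53.txt:
$ awk 'NR==FNR{f=$0;next}{print>f}' target.txt out_parse.txt
$ cat GJ-53.txt
13 20763612 20763612 C T
13 20763620 20763620 A G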

Related

Copying first lines of multiple text files into single file

Using a single bash command (pipes and stdio allowed), copy the first line of each file whose name begins with ABC to a file named DEF.
Example:
Input:
ABC0:
qwe\n
rty\n
uio\n
ABC1:
asd\n
fgh\n
jkl\n
ABC2:
zxc\n
bvn\n
m,.\n
Result:
DEF:
qwe\n
asd\n
zxc\n
I already tried cat ABC* | head -n1, but it takes only the first line from the first file; the others are omitted.
You would want head -n1 ABC* to let head take the first line from each file. Reading from standard input, head knows nothing about where its input comes from.
head, though, adds its own header to identify which file each line comes from, so use awk instead:
awk 'FNR == 1 {print}' ./ABC* > DEF
FNR is the variable containing the line number of the current line of input, reset to 0 each time a new file is opened. Using ./ABC* instead of ABC* guards against filenames containing an = (which awk treats specially if the part before the = is a valid awk variable name; HT William Pursell).
Assuming that the file names don't contain spaces or newlines, and that there are no directories with names starting with ABC:
ls ABC* | xargs -n 1 head -n 1
The -n 1 ensures that head receives only one name at a time.
If the aforementioned conditions are not met, use a loop like chepner suggested, but explicitly guard against directory entries which are not plain files, to avoid error messages issued by head.
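Chepner's loop isn't reproduced here, but a minimal sketch of that idea, with an explicit guard against non-regular files, might look like this:
for f in ABC*
do
    [ -f "$f" ] || continue   # skip directories and anything else that is not a plain file
    head -n 1 "$f"
done > DEF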

Use grep only on specific columns in many files?

Basically, I have one file with patterns and I want every line to be searched in all text files in a certain directory. I also only want exact matches. The many files are zipped.
However, I have one more condition. I need the first two columns of a line in the pattern file to match the first two columns of a line in any given text file that is searched. If they match, the output I want is the pattern (the entire line) followed by the names of all the text files in which a match was found, together with their full matching lines (not just the first two columns).
An output such as:
pattern1
file23:"text from entire line in file 23 here"
file37:"text from entire line in file 37 here"
file156:"text from entire line in file 156 here"
pattern2
file12:"text from entire line in file 12 here"
file67:"text from entire line in file 67 here"
file200:"text from entire line in file 200 here"
I know that grep can take an input file, but the problem is that it takes every pattern in the pattern file and searches for them in a given text file before moving on to the next file, which makes the above output more difficult. So I thought it would be better to loop through each line in a file, print the line, and then search for the line in the many files, seeing if the first two columns match.
I thought about this:
cat pattern_file.txt | while read line
do
    echo $line >> output.txt
    zgrep -w -l $line many_files/*txt >> output.txt
done
But with this code, it doesn't search by the first two columns only. Is there a way to specify the first two columns, both for the pattern line and for the lines that grep searches through?
What is the best way to do this? Would something other than grep, like awk, be better to use? There were other questions like this, but none that used columns for both the search pattern and the searched file.
Few lines from pattern file:
1 5390182 . A C 40.0 PASS DP=21164;EFF=missense_variant(MODERATE|MISSENSE|Aag/Cag|p.Lys22Gln/c.64A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390200 . G T 40.0 PASS DP=21237;EFF=missense_variant(MODERATE|MISSENSE|Gcc/Tcc|p.Ala28Ser/c.82G>T|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390228 . A C 40.0 PASS DP=21317;EFF=missense_variant(MODERATE|MISSENSE|gAa/gCa|p.Glu37Ala/c.110A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
Few lines from a file in searched files:
1 10699576 . G A 36 PASS DP=4 GT:GQ:DP 1|1:36:4
1 10699790 . T C 40 PASS DP=6 GT:GQ:DP 1|1:40:6
1 10699808 . G A 40 PASS DP=7 GT:GQ:DP 1|1:40:7
In reality, both files are much larger.
It sounds like this might be what you want:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
If it's not then update your question to provide a clear, simple statement of your requirements and concise, testable sample input and expected output that demonstrates your problem and that we could test a potential solution against.
if anyfile is actually a zip file then you'd do something like:
zcat anyfile | awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile -
Replace zcat with whatever command you use to produce text from your zip file if that's not what you use.
Per the question in the comments, if both input files are compressed and your shell supports it (e.g. bash) you could do:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' <(zcat patternfile) <(zcat anyfile)
otherwise just uncompress patternfile to a tmp file first and use that in the awk command.
Use read to parse the pattern file's columns and add an anchor to the zgrep pattern:
while read -r column1 column2 rest_of_the_line
do
    echo "$column1 $column2 $rest_of_the_line"
    zgrep -w -l "^$column1\s*$column2" many_files/*txt
done < pattern_file.txt >> output.txt
read is able to parse lines into multiple variables passed as parameters, the last of which gets the rest of the line. It separates fields around the characters of the $IFS Internal Field Separator (by default tabs, spaces and newlines; it can be overridden for the read command by using while IFS='...' read ...).
Using -r avoids unwanted escape processing and makes the parsing more reliable, and while ... do ... done < file performs a bit better since it avoids a useless use of cat. Since the output of all the commands inside the while is redirected, I also put the redirection on the while rather than on each individual command.
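If you also want the full matching lines rather than just the file names (as in the example output in the question), one option is to drop -l and add -H so each hit is printed as filename:line. This is only a sketch built on the loop above, assuming zgrep can read your compressed files:
while read -r column1 column2 rest_of_the_line
do
    echo "$column1 $column2 $rest_of_the_line"
    # -H prefixes every matching line with the name of the file it came from
    zgrep -w -H "^$column1\s*$column2" many_files/*txt
done < pattern_file.txt >> output.txt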

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
One file per distinct value in column 7 of the original file, where the rows in each new file are grouped based on that shared value. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[@]}"
do
    :
    echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
    :
    echo $i
done
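To cover the first point in that list, the clean directory, here is a rough sketch; split_out is a hypothetical directory name, and the awk command is the one from the question with the unused myarray variable dropped:
mkdir split_out && cd split_out || exit 1   # fresh directory, so no stray *.csv files can match the glob
awk -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' ../input_file.csv
for i in *.csv   # only the files awk just created are here
do
    echo "$i"
done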
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
    header=$0
    next
}
!($7 in files) {
    # first time we see this sport: build its filename and write the header line to it
    files[$7]=sprintf("sport-%s.csv", $7)
    print header > files[$7]
}
{
    print > files[$7]
}
END {
    printf("declare -a sportlist=( ")
    for (sport in files) {
        printf("\"%s\" ", sport)
    }
    printf(")\n")
}
The idea here is that we store sport names in the array files[], and build filenames out of that array. (You can format the filename inside sprintf() as you see fit.) We step through the file, adding a header line whenever we get a new sport with no recorded filename. Then for non-headers, print to the file based on the sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script inside command substitution, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you then source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray (the -t option strips the trailing newline from each element):
readarray -t filesArray < <( awk ... )
If the files might have newlines in them, too, then things get tricky...
If your file is not large, you can run another script to get the unique $7 elements, for example
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values; you can change it to your file-name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
This can then be piped to your other process, here for example to wc:
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated, as you show in the sample input in your question, then adjust the awk script to suit (you might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
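A rough sketch of that close-as-you-go variant for a non-GNU awk (untested; note the switch to >>, which is needed because after close() a plain > would truncate the file again on every reopen, so remove leftover output files from earlier runs before re-running):
oIFS="$IFS"; IFS=$'\n'
# write each line, then close its file immediately so we never exceed the open-file limit
array=( $(awk '{out=$7".csv"; print >> out; close(out)} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"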

File comparison

I am a beginner. I am looking for a basic shell script solving what looks like a simple problem:
I have one long file, file A, that looks like the example below.
I would like to generate a new file (target file C) that is essentially file A, but with an extra field on the first line, say "Comment", where every line whose first field matches any of the items in column 1 of file B is identified by a mark, say "SHARED". Files A and B are CSV files.
I have tried awk and a basic shell script that is easier for me to understand, but I could not get it to work. I could generate a blank target file, with the target first line containing the 3 fields, if necessary.
File A
"Part Number","Description"
"1468896-1","MCD-MXSER-21-P-X-0209"
"1495581-1","MC-P-15S5127854ST1"
"1497458-3","MC -N1-P-569RT1"
File B
"1466826-1"
"1495582-1"
"1495581-1"
Desired target file C
"Part Number","Description","Comment"
"1468896-1","MCD-MXSER-21-P-X-0209"
"1495581-1","MC-P-15S5127854ST1","SHARED"
"1497458-3","MC -N1-P-569RT1"
This one-liner should do the job:
awk -F, -v c='"Comment"' -v s='"SHARED"' 'NR==FNR{a[$1]=1;next}FNR==1{$0=$0 FS c}FNR>1&&a[$1]{$0=$0 FS s}7' fileb filea
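Assuming file A and file B are saved as filea and fileb, redirecting the output should reproduce the desired target file C:
$ awk -F, -v c='"Comment"' -v s='"SHARED"' 'NR==FNR{a[$1]=1;next}FNR==1{$0=$0 FS c}FNR>1&&a[$1]{$0=$0 FS s}7' fileb filea > filec
$ cat filec
"Part Number","Description","Comment"
"1468896-1","MCD-MXSER-21-P-X-0209"
"1495581-1","MC-P-15S5127854ST1","SHARED"
"1497458-3","MC -N1-P-569RT1"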
If you want to do it in bash:
#!/bin/bash
while IFS=, read -r f1 line
do
    if grep -qw "$f1" fileB ; then
        echo "$f1,$line,\"SHARED\""
    else
        echo "$f1,$line"
    fi
done < fileA
You can do it like this:
awk -F, 'FNR==NR{a[i++]=$1;next} {extra="";for(t in a)if($1==a[t])extra=",\"SHARED\"";print $0 extra}' fileB fileA
You will see both fileA and fileB are passed into awk. The processing in {} following FNR==NR only applies to fileB. It stores the first element of each line in an array a[] and then skips to the next line.
The processing in the second set of {} only applies to fileA. Basically it pre-sets a string called extra to nothing. It then tests whether the first field of the current record is in array a[]. If it is, it sets extra to ,"SHARED". It then prints the current record followed by the string extra, which may or may not be ,"SHARED".

Shell: searching for a pattern in a file and extracting a data block which contains the pattern

Given is a file with the following structure:
a 123
b 876234
c 56456
d 65765
a 9879
b 9361
c 6527
d 823468
So there are blocks of data (lines beginning with a, b, c, d; an empty line divides two blocks), and I'm looking for a pattern (e.g. 9361) inside this file. If this pattern is found, I want to copy the whole block (the lines starting with "a 9879" and ending with "d 823468") and write it to another file.
This pattern can be found zero or more times inside a file. If there is more than one result, each matching block should be written to the other file.
How would you do it?
You can do it with gawk.
gawk 'BEGIN {RS=""} /here goes your pattern/ { print $0}' INPUTFILE > OUTPUTFILE
This sets gawk's record separator to the empty line(s).
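For example, with the sample data above saved as file.txt, that command should print the second block:
$ gawk 'BEGIN {RS=""} /9361/ { print $0}' file.txt
a 9879
b 9361
c 6527
d 823468
If several blocks match and you want them separated by blank lines in OUTPUTFILE, also set ORS="\n\n" in the BEGIN block.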
HTH
Assuming your data is in file.txt, this is a solution in perl.
perl -00 -ne "print if /9361/" file.txt
The result is on stdout.
-00 causes perl to read the file a paragraph at a time.
-n wraps the command in a loop that reads the input one record at a time (here, one paragraph at a time because of -00).
-e is to specify a perl command.
Ruby(1.8+)
ruby -00 -ne 'print if /9361/' file

Resources