How to build a tab-delimited text file with many calculated values? - bash

Background
From a set of similar reference files (sample1.txt, sample2.txt, etc.), I am calculating or retrieving data for many different parameters (pc.genes, pc.transcripts, pc.genes.antisense, etc.).
A simplified example of a single ref.file (e.g., sample1.txt):
word1 word2 word3 405438 409170 . Y . word4; word5
word1 word2 word3 405438 409170 . N . word4; word5
word1 word2 word3 409006 409170 . N . word4; word5
word1 word2 word3 405438 408401 . Y . word4; word5
word1 word2 word3 407099 408361 . N 0 word4; word5
A calculation for “avg.exons” parameter might look like:
$ awk '$3 == "word3"' sample1.txt | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}'
5.96732
A retrieval for “pc.genes” parameter might look like:
$ awk '$3 == "word3"' sample1.txt | grep -c "word4"
19062
These are just examples in case the solution requires the commands to be piped to a function that transfers/adds them to a table. The output value of these commands is always a single number.
Desired Output
I would like to put these calculated/retrieved values into an organized table format (preferably a tab-delimited text file) so that I can generate plots from the data:
ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength
sample1.txt 19062 116573 2585576 1318321 5.96732 3732.57
sample2.txt 19753 138563 5834759 1433785 5.84654 4023.89
sample3.txt 19376 124576 2871235 1983263 6.78929 3890.32
Is this possible? And if so, how can I achieve this?
Attempt
for file in sample*.txt
do
printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'
pc.genes=$(awk '$3 == "word3"' ${file} | grep -c "word4")
avg.exons=$(awk '$3 == "word3"' ${file} | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
... # get rest of desired values
done > table.txt
Resulting Errors
-bash: pc.genes=19062: command not found
... # other errors with corresponding CORRECT value outputs
-bash: avg.exons=5.96732: command not found
... # the errors even continue into the other sample*.txt files, which is good
-bash: pc.genes=19753: command not found
...
All of the values corresponding to a given parameter (i.e., "=###") are correct, but the error is preventing them from being put in the table.

Based solely on the details provided by the OP, and assuming a looping construct is being used to process a single file at a time, something like:
# print header
printf "ref.file\tpc.genes\tpc.transcripts\tpc.genes.antisense\tpc.genes.sense\tavg.exons\tavg.genelength\n"
while read -r fn
do
aexons=$(awk '$3 == "word1"' ${fn} | sed -n 's/.*word2 \([^;]*\).*word3 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
pgenes=$(awk '$3 == "word1"' ${fn} | grep -c "word2")
... # get rest of desired values
# print tab-delimited output to stdout; adjust formats as needed
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" ${fn} ${pgenes} .... ${axeons} ...
done < <('ls' sample*.txt) # replace with whatever logic OP is using to find desired files
While the above should work, it's not very efficient what with all of the subprocess calls ($(...); piped commands), and the need to process each input file (${fn}) 6x times (for 6x values).
A more efficient method would look at processing each input file (${fn}) just once.
An additional step might be to eliminate the loop in favor of a single program to process all files in one pass.
Since awk is capable of parsing data (from multiple files), calculating sums/averages, and generating (tab-delimited) output, I'd probably be leaning towards a single awk command/invocation as a more efficient solution ... but can't tell for sure without sample data and more details on the desired calculations.

The following answer works perfectly and comes from the combined suggestions of markp-fuso and KamilCuk. Thank you both!
# add the table headers
printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'
for file in sample*.txt
do
# create variables containing code for all parameter calculations/retrievals
pcgenes=$(awk '$3 == "word3"' ${file} | grep -c "word4")
pctranscripts=$(...)
pcgenesantisense=$(...)
pcgenessense=$(...)
avgexons=$(awk '$3 == "word3"' ${file} | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
avggenelength=$(...)
# print all resulting values in a single tab separated row of the table
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" ${file} ${pcgenes} ${pctranscripts} ${pcgenesantisense} ${pcgenessense} ${avgexons} ${avggenelength}
done > table.txt

Related

How can one dynamically create a new csv from selected columns of another csv file?

I dynamically iterate through a csv file and select columns that fit the criteria I need. My CSV is separated by commas.
I save these indexes to an array that looks like
echo "${cols_needed[#]}"
1 3 4 7 8
I then need to write these columns to a new file and I've tried the following cut and awk commands, however, as the array is dynamically created, I cant seem to find the right commands that can select them all at once. I have tried cut, awk and paste commands.
awk -v fields=${cols_needed[#]} 'BEGIN{ n = split(fields,f) }
{ for (i=1; i<=n; ++i) printf "%s%s", $f[i], (i<n?OFS:ORS) }' test.csv
This throws an error as it cannot split the fields unless I hard code them (even then, it can only do 2), split on spaces.
fields="1 2’
I have tried to dynamically create -f parameters, but can only do so with one variable in a loop like so
for item in "${cols_needed[#]}";
do
cat test.csv | cut -f$item
done
which outputs one column at a time.
And I have tried to dynamically create it with commas - input as 1,3,4,7...
cat test.csv | cut -f${cols_needed[#]};
which also does not work!
Any help is appreciated! I understand awk does not work like bash and we cannot pass variables around in the same way. I feel like I'm going around in circles a bit! Thanks in advance.
Your first approach is ok, just:
change -v fields=${cols_needed[#]} to -v fields="${cols_needed[*]}", to pass the array as a single shell word
add FS=OFS="," to BEGIN, after splitting (you want to split on spaces, before FS is changed to ,)
ie. BEGIN {n = split(fields, f); FS=OFS=","}
Also, if there are no commas embedded in quoted csv fields, you can use cut:
IFS=,; cut -d, -f "${cols_needed[*]}" test.csv
If there are embedded commas, you can use gawk's FPAT, to only split fields on unquoted commas.
Here's an example using that.
# prepend $ to each number
for i in "${cols_needed[#]}"; do
fields[j++]="\$$i"
done
IFS=,
gawk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=, "{print ${fields[*]}}"
Injecting shell code in to an awk command is generally not great practice, but it's ok here IMO.
Expanding on my comments re: passing the bash array into awk:
Passing the array in as an awk variable:
$ cols_needed=(1 3 4 7 8)
$ typeset -p cols_needed
declare -a cols_needed=([0]="1" [1]="3" [2]="4" [3]="7" [4]="8")
$ awk -v fields="${cols_needed[*]}" 'BEGIN{n=split(fields,f); for (i=1;i<=n;i++) print i,f[i]}'
1 1
2 3
3 4
4 7
5 8
Passing the array in as a 'file' via process substitution:
$ awk 'FNR==NR{f[++n]=$1;next} END {for (i=1;i<=n;i++) print i,f[i]}' <(printf "%s\n" "${cols_needed[#]}")
1 1
2 3
3 4
4 7
5 8
As for OP's main question of extracting a specific set of columns from a .csv file ...
Borrowing dawg's .csv file:
$ cat file.csv
1,2,3,4,5,6,7,8
11,12,13,14,15,16,17,18
21,22,23,24,25,26,27,28
Expanding on the suggestion for passing the bash array in as an awk variable:
awk -v fields="${cols_needed[*]}" '
BEGIN { FS=OFS=","
n=split(fields,f," ")
}
{ pfx=""
for (i=1;i<=n;i++) {
printf "%s%s", pfx, $(f[i])
pfx=OFS
}
print ""
}
' file.csv
NOTE: this assumes OP has provided a valid list of column numbers; if there's some doubt as to the validity of the input (column) numbers then OP can add some logic to address said doubts (eg, are they integers? are they positive integers? do they reference a field (in file.csv) that actually exists?, etc)
This generates:
1,3,4,7,8
11,13,14,17,18
21,23,24,27,28
Suppose you have this variable in bash:
$ echo "${cols_needed[#]}"
3 4 7 8
And this CSV file:
$ cat file.csv
1,2,3,4,5,6,7,8
11,12,13,14,15,16,17,18
21,22,23,24,25,26,27,28
You can select columns of that csv file in awk this way:
awk '
BEGIN{FS=OFS=","}
FNR==NR{split($0, cols," "); next}
{
s=""
for (e=1;e<=length(cols); e++)
s=e<length(cols) ? s $(cols[e]) OFS : s $(cols[e])
print s
}' <(echo "${cols_needed[#]}") file.csv
Prints:
3,4,7,8
13,14,17,18
23,24,27,28
Or, you can do:
awk -v cw="${cols_needed[*]}" '
BEGIN{FS=OFS=","; split(cw, cols," ")}
{
s=""
for (e=1;e<=length(cols); e++)
s=e<length(cols) ? s $(cols[e]) OFS : s $(cols[e])
print s
}' file.csv
# same output
BTW, you can do this entirely with cut:
cut -d ',' -f $(IFS=, ; echo "${cols_needed[*]}") file.csv
3,4,7,8
13,14,17,18
23,24,27,28

Bash: Working with CSV file to build a loop and save the result

Using Bash, I'm wanting to get a list of email addresses from a CSV file to do a recursive grep search on it for a bunch of directories looking for a match in specific metadata XML files, and then also tallying up how many results I find for each address throughout the directory tree (i.e. updating the tally field in the same CSV file).
accounts.csv looks something like this:
updated to more accurately reflect real-world data
email,date,bar,URL,"something else",tally
address#somewhere.com,21/04/2015,1.2.3.4,https://blah.com/,"blah blah",5
something#that.com,17/06/2015,5.6.7.8,https://blah.com/,"lah yah",0
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",1
For example, if we put address#somewhere.com in $email from the list, run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
on it and then add that result to the tally column.
At the moment I can get the first column of that CSV file (minus the heading/first line) using
awk -F"," '{print $1}' accounts.csv | tail -n +2
but I'm lost how to do the looping and also the writing of the result back to the CSV file...
So for instance, with another#here.com if we run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
and the result is say 17, how can I update that line to become:
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",17
Is this possible with maybe awk or sed?
This is where I'm up to:
#!/bin/bash
# make temporary list of email addresses
awk -F"," '{print $1}' accounts.csv | tail -n +2 > emails.tmp
# loop over each
while read email; do
# count how many uploads for current email address
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
done < emails.tmp
XML Metadata looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<identifier>SomeTitleNameGoesHere</identifier>
<mediatype>audio</mediatype>
<collection>opensource_movies</collection>
<description>example <br /></description>
<subject>testing</subject>
<title>Some Title Name Goes Here</title>
<uploader>another#here.com</uploader>
<addeddate>2017-05-28 06:20:54</addeddate>
<publicdate>2017-05-28 06:21:15</publicdate>
<curation>[curator]email#address.com[/curator][date]20170528062151[/date][comment]checked for malware[/comment]</curation>
</metadata>
how to do the looping and also the writing of the result back to the CSV file
awk does the looping automatically. You can change any field by assigning to it. So to change a tally field (the 6th in each line) you would do $6 = ....
awk is a great tool for many scenarios. You probably can safe a lot of time in the future by investing some minutes in a short tutorial now.
The only non-trivial part is getting the output of grep into awk.
The following script increments each tally by the count of *_meta.xml files containing the given email address:
awk -F, -v OFS=, -v q=\' 'NR>1 {
cmd = "grep -rlFw " q $1 q " --include=\\*_meta.xml | wc -l";
cmd | getline c;
close(cmd);
$6 = c
} 1' accounts.csv
For simplicity we assume that filenames are free of linebreaks and email addresses are free of '.
To reduce possible false positives, I also added the -F and -w option to your grep command.
-F searches literal strings; without it, searching for a.b#c would give false positives for things like axb#c and a-b#c.
-w matches only whole words; without it, searching for b#c would give a false positive for ab#c. This isn't 100% safe, as a-b#c would still give a false positive, but without knowing more about the structure of your xml files we cannot fix this.
A pipeline to reduce the number of greps:
grep -rHo --include=\*_meta.xml -f <(awk -F, 'NR > 1 {print $1}' accounts.csv) \
| gawk -F, -v OFS=',' '
NR == FNR {
# store the filenames for each email
if (match($0, /^([^:]+):(.+)/, m)) tally[m[2]][m[1]]
next
}
FNR > 1 {$4 = length(tally[$1])}
1
' - accounts.csv
Here is a solution using single awk command to achieve this. This solution will be highly performant as compared to other solutions because it is scanning each XML file only once for all the email addresses found in first column of the CSV file. Also it is not invoking any external command or spawning a sub0shell anywhere.
This should work in any version of awk.
cat srch.awk
# function to escape regex meta characters
function esc(s, tmp) {
tmp = s
gsub(/[&+.]/, "\\\\&", tmp)
return tmp
}
BEGIN {FS=OFS=","}
# while processing csv file
NR == FNR {
# save escaped email address in array em skipping header row
if (FNR > 1)
em[esc($1)] = 0
# save each row in rec array
rec[++n] = $0
next
}
# this block will execute for eaxh XML file
{
# loop each email and save count of matched email in array em
# PS: gsub return no of substitutionx
for (i in em)
em[i] += gsub(i, "&")
}
END {
# print header row
print rec[1]
# from 2nd row onwards split row into columns using comma
for (i=2; i<=n; ++i) {
split(rec[i], a, FS)
# 6th column is the count of occurrence from array em
print a[1], a[2], a[3], a[4], a[5], em[esc(a[1])]
}
}
Use it as:
awk -f srch.awk accounts.csv $(find . -name '*_meta.xml') > tmp && mv tmp accounts.csv
A script that handles accounts.csv line by line and replaces the data in accounts.new.csv for comparison.
#! /bin/bash
file_old=accounts.csv
file_new=${file_old/csv/new.csv}
delimiter=","
x=1
# Copy file
cp ${file_old} ${file_new}
while read -r line; do
# Skip first line
if [[ $x -gt 1 ]]; then
# Read data into variables
IFS=${delimiter} read -r address foo bar tally somethingelse <<< ${line}
cnt=$(find . -name '*_meta.xml' -exec grep -lo "${address}" {} \; | wc -l)
# Reset tally
tally=$cnt
# Change line number $x in new file
sed "${x}s/.*/${address} ${foo} ${bar} ${tally} ${somethingelse}/; ${x}s/ /${delimiter}/g" \
-i ${file_new}
fi
((x++))
done < ${file_old}
The input and ouput:
# Input
$ find . -name '*_meta.xml' -exec cat {} \; | sort | uniq -c
2 address#somewhere.com
1 something#that.com
$ cat accounts.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,-1,blah
something#that.com,bar2,foo3,-1,blah
another#here.com,bar4,foo5,-1,blah
# output
$ ./test.sh
$ cat accounts.new.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,2,blah
something#that.com,bar2,foo3,1,blah
another#here.com,bar4,foo5,0,blah

Print a comma except on the last line in Awk

I have the following script
awk '{printf "%s", $1"-"$2", "}' $a >> positions;
where $a stores the name of the file. I am actually writing multiple column values into one row. However, I would like to print a comma only if I am not on the last line.
Single pass approach:
cat "$a" | # look, I can use this in a pipeline!
awk 'NR > 1 { printf(", ") } { printf("%s-%s", $1, $2) }'
Note that I've also simplified the string formatting.
Enjoy this one:
awk '{printf t $1"-"$2} {t=", "}' $a >> positions
Yeh, looks a bit tricky at first sight. So I'll explain, first of all let's change printf onto print for clarity:
awk '{print t $1"-"$2} {t=", "}' file
and have a look what it does, for example, for file with this simple content:
1 A
2 B
3 C
4 D
so it will produce the following:
1-A
, 2-B
, 3-C
, 4-D
The trick is the preceding t variable which is empty at the beginning. The variable will be set {t=...} only on the next step of processing after it was shown {print t ...}. So if we (awk) continue iterating we will got the desired sequence.
I would do it by finding the number of lines before running the script, e.g. with coreutils and bash:
awk -v nlines=$(wc -l < $a) '{printf "%s", $1"-"$2} NR != nlines { printf ", " }' $a >>positions
If your file only has 2 columns, the following coreutils alternative also works. Example data:
paste <(seq 5) <(seq 5 -1 1) | tee testfile
Output:
1 5
2 4
3 3
4 2
5 1
Now replacing tabs with newlines, paste easily assembles the date into the desired format:
<testfile tr '\t' '\n' | paste -sd-,
Output:
1-5,2-4,3-3,4-2,5-1
You might think that awk's ORS and OFS would be a reasonable way to handle this:
$ awk '{print $1,$2}' OFS="-" ORS=", " input.txt
But this results in a final ORS because the input contains a newline on the last line. The newline is a record separator, so from awk's perspective there is an empty last record in the input. You can work around this with a bit of hackery, but the resultant complexity eliminates the elegance of the one-liner.
So here's my take on this. Since you say you're "writing multiple column values", it's possible that mucking with ORS and OFS would cause problems. So we can achieve the desired output entirely with formatting.
$ cat input.txt
3 2
5 4
1 8
$ awk '{printf "%s%d-%d",t,$1,$2; t=", "} END{print ""}' input.txt
3-2, 5-4, 1-8
This is similar to Michael's and rook's single-pass approaches, but it uses a single printf and correctly uses the format string for formatting.
This will likely perform negligibly better than Michael's solution because an assignment should take less CPU than a test, and noticeably better than any of the multi-pass solutions because the file only needs to be read once.
Here's a better way, without resorting to coreutils:
awk 'FNR==NR { c++; next } { ORS = (FNR==c ? "\n" : ", "); print $1, $2 }' OFS="-" file file
awk '{a[NR]=$1"-"$2;next}END{for(i=1;i<NR;i++){print a[i]", " }}' $a > positions

Explode to Array

I put together this shell script to do two things:
Change the delimiters in a data file ('::' to ',' in this case)
Select the columns and I want and append them to a new file
It works but I want a better way to do this. I specifically want to find an alternative method for exploding each line into an array. Using command line arguments doesn't seem like the way to go. ANY COMMENTS ARE WELCOME.
# Takes :: separated file as 1st parameters
SOURCE=$1
# create csv target file
TARGET=${SOURCE/dat/csv}
touch $TARGET
echo #userId,itemId > $TARGET
IFS=","
while read LINE
do
# Replaces all matches of :: with a ,
CSV_LINE=${LINE//::/,}
set -- $CSV_LINE
echo "$1,$2" >> $TARGET
done < $SOURCE
Instead of set, you can use an array:
arr=($CSV_LINE)
echo "${arr[0]},${arr[1]}"
The following would print columns 1 and 2 from infile.dat. Replace with
a comma-separated list of the numbered columns you do want.
awk 'BEGIN { IFS='::'; OFS=","; } { print $1, $2 }' infile.dat > infile.csv
Perl probably has a 1 liner to do it.
Awk can probably do it easily too.
My first reaction is a combination of awk and sed:
Sed to convert the delimiters
Awk to process specific columns
cat inputfile | sed -e 's/::/,/g' | awk -F, '{print $1, $2}'
# Or to avoid a UUOC award (and prolong the life of your keyboard by 3 characters
sed -e 's/::/,/g' inputfile | awk -F, '{print $1, $2}'
awk is indeed the right tool for the job here, it's a simple one-liner.
$ cat test.in
a::b::c
d::e::f
g::h::i
$ awk -F:: -v OFS=, '{$1=$1;print;print $2,$3 >> "altfile"}' test.in
a,b,c
d,e,f
g,h,i
$ cat altfile
b,c
e,f
h,i
$

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
if the data of file 1 is present in file 2 it should return 1 or else 0, in a tab seprated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
the above code is not giving me the output which I am looking for.
Kindly have a look and suggest correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
The following code should do it.
Take a close look to the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file1 file2 | sed $'s/$/\t1/'
grep -vxF -f file1 file2 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
if [[ $REPLY = $'\t'* ]] ; then
printf "%s\t0\n" "${REPLY#?}"
else
printf "%s\t1\n" "${REPLY}"
fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have python installed.
If you're familiar with Python and are interested in the solution, you only need a bit of formatting.
#/bin/python
f1 = open('file1').readlines()
f2 = open('file2').readlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n,c in zip(f1, f1_in_f2):
print n,c

Resources