Column to row translation using sed - ruby

I'm writing a bash script to get data from procmail and produce an output CSV file.
At the end, I need to translate from this:
---label1: 123456
---label2: 654321
---label3: 246810
---label4: 135791
---label5: 101010
to this:
label1,label2,label3,label4,label5
123456,654321,246810,135791,101010
I could do this easily with a Ruby script, but I don't want to call another script from the bash script. So I've thought of doing it with sed.
I can extract the data I want like this:
sed -nr 's/^---(\S+): (\S+)$/\1,\2/p'
But I don't know how to transpose it. Can you help me?

If you're running sed I'd argue that you are calling another script. So why not write it in Ruby if that's easier to write and maintain?
If you're worried about having multiple files, you could embed the Ruby code in the bash script as a here document (assuming Ruby can read a script from standard input).
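For example, a minimal sketch of that approach, assuming the procmail output file is passed as the first argument (the label/value regex is the one from the question):
#!/bin/bash
# The quoted here document feeds the Ruby program to ruby on standard input;
# "$1" (the procmail output file) is then read through ARGF inside Ruby.
ruby - "$1" <<'RUBY'
labels = []
values = []
ARGF.each_line do |line|
  if line =~ /^---(\S+): (\S+)$/
    labels << $1
    values << $2
  end
end
puts labels.join(",")
puts values.join(",")
RUBY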

you can just do everything in awk
awk 'BEGIN{
    FS=": "
}
{
    gsub("---","")
    label[++c] = $1
    num[++d] = $2
}
END{
    for(i=1;i<c;i++){
        printf label[i]","
    }
    print label[c]
    for(i=1;i<d;i++){
        printf num[i]","
    }
    print num[d]
}' file
Output:
# ./test.sh
label1,label2,label3,label4,label5
123456,654321,246810,135791,101010
Redirect the output to a CSV file as needed.
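For example, assuming the awk command is wrapped in the test.sh run above (output.csv is a hypothetical name):
./test.sh > output.csv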

Perhaps this is what you are looking for?
(adapted from unix.com; assumes data holds the comma-separated label,value lines produced by the sed command above)
num=$(awk -F"," 'NR==1 { print NF }' data)
echo $num
i=1
while (( i <= num )); do
    cut -d"," -f"$i" data | paste -sd"," - >> tmpdata
    (( i = i + 1 ))
done
mv tmpdata data

Finally I solved it using Ruby, as suggested by Dave Webb, but as a one-liner rather than a here document:
ruby -ne 'BEGIN{@head=[];@data=[]}; @head << $1 && @data << $2 if $_.match(/^---(\S+): (\S+)$/); END{puts @head.join(",");puts @data.join(",")}' $FILE
I didn't know that I could use BEGIN and END blocks to set up the variables and output the results.

You said you needed a bash script..
#!/bin/bash
labels=`awk '/---/{printf("%s,",$1)}' file.txt `
values=`awk '/---/{printf("%s,",$2)}' file.txt `
labels=`echo $labels|sed 's/---//g;s/\://g;s/\,$//'`
values=`echo $values|sed 's/\,$//'`
echo $labels
echo $values

You can do something like this from within the shell script:
cat demo | awk -F':' '{print $1}' | sed -e s/'---'// | tr '\n' ',' > csv_file
cat demo | awk '{print $2}' | tr '\n' ',' >> csv_file
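Note that tr leaves a trailing comma on each row; a variant of the same idea that avoids it (a sketch, using paste to join the lines):
awk -F': ' '{print $1}' demo | sed 's/^---//' | paste -sd, - > csv_file
awk -F': ' '{print $2}' demo | paste -sd, - >> csv_file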

Related

How to write a bash script that dumps itself out to stdout (for use as a help file)?

Sometimes I want a bash script that's mostly a help file. There are probably better ways to do things, but sometimes I want to just have a file called "awk_help" that I run, and it dumps my awk notes to the terminal.
How can I do this easily?
Another idea: use #!/bin/cat -- this literally answers the title of your question, since the shebang line will be displayed as well.
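A minimal sketch of that idea (awk_notes is a hypothetical file name):
#!/bin/cat
Nothing below the shebang is ever executed; /bin/cat simply prints the whole
file, shebang line included, when you run ./awk_notes.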
Turns out it can be done as pretty much a one-liner, thanks to @CharlesDuffy for the suggestions!
Just put the following at the top of the file, and you're done (both lines contain the EZREMOVEHEADER marker, so grep -v filters them out of the output, and the exit keeps bash from trying to execute the notes below):
cat "$BASH_SOURCE" | grep -v EZREMOVEHEADER
exit # EZREMOVEHEADER
So for my awk_help example, it'd be:
cat "$BASH_SOURCE" | grep -v EZREMOVEHEADER
# Basic form of all awk commands
awk search pattern { program actions }
# advanced awk
awk 'BEGIN {init} search1 {actions} search2 {actions} END { final actions }' file
# awk boolean example for matching "(me OR you) OR (john AND ! doe)"
awk '( /me|you/ ) || (/john/ && ! /doe/ )' /path/to/file
# awk - print # of lines in file
awk 'END {print NR,"coins"}' coins.txt
# Sum up gold ounces in column 2, and find out value at $425/ounce
awk '/gold/ {ounces += $2} END {print "value = $" 425*ounces}' coins.txt
# Print the last column of each line in a file, using a comma (instead of space) as a field separator:
awk -F ',' '{print $NF}' filename
# Sum the values in the first column and pretty-print the values and then the total:
awk '{s+=$1; print $1} END {print "--------"; print s}' filename
# functions available
length($0) > 72, toupper,tolower
# count the # of times the word PASSED shows up in the file /tmp/out
cat /tmp/out | awk 'BEGIN {X=0} /PASSED/{X+=1; print $1 X}'
# awk regex operators
https://www.gnu.org/software/gawk/manual/html_node/Regexp-Operators.html
I found another solution that works on Mac and Linux and does exactly what one would hope.
Just use the following as your "shebang" line, and it'll output everything from line 2 on down:
test.sh
#!/usr/bin/tail -n+2
hi there
how are you
Running this gives you what you'd expect:
$ ./test.sh
hi there
how are you
And another possible solution: just use less, so your file will open in a searchable pager
#!/usr/bin/less
and this way you can grep it for something too, e.g.
$ ./test.sh | grep something

for loop and if statements in awk

I am a biologist who is starting to have to learn some elementary scripting skills to deal with large DNA sequence data sets. So please go easy on me. I am doing this all in bash. I have a file with my data formatted like this:
CLocus_58919_Sample_25_Locus_33235_Allele_0
TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
CLocus_58919_Sample_9_Locus_54109_Allele_0
TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
What I need to do is loop through this file and write all the sequences from the same sample into their own file. Just to be clear, these sequences come from samples 25 and 9. So my idea was to use awk to reformat my file in the following way:
CLocus_58919_Sample_25_Locus_33235_Allele_0_TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
CLocus_58919_Sample_9_Locus_54109_Allele_0_TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
then pipe this into another awk if statement to say "if sample=$i then write out that entire line to a file named sample.$i". Here is my code so far:
#!/bin/bash
a=`ls /scratch/tkchafin/data/raw | wc -l`;
b=1;
c=$((a-b));
mkdir /scratch/tkchafin/data/phylogenetics
for ((i=0; i<=$((c)); i++)); do
awk 'ORS=NR%2?"_":"\n"' $1 | awk -F_ '{if($4==$i) print}' >> /scratch/tkchafin/data/phylogenetics/sample.$i
done;
I understand this is not working because $i is in single quotes so bash is not recognizing it. I know awk has a -v option for passing external variables to it, but I don't know how I would apply that in this case. I tried to move the for loop inside the awk statement but this does not produce the desired result either. Any help would be much appreciated.
You can have awk write directly to the desired output file, without a shell loop:
awk -F_ '(NR % 2) == 1 { line1 = $0; fn="/scratch/tkchafin/data/phylogenetics/sample."$4; }
(NR % 2) == 0 { print line1"_"$0 > fn; }' "$1"
But to show how you would use -v in your version, it would be:
for ((i=0; i<=$((c)); i++)); do
awk 'ORS=NR%2?"_":"\n"' $1 | awk -F_ -v i=$i '$4 == i' >> /scratch/tkchafin/data/phylogenetics/sample.$i
done;
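One caveat with the direct-write version above: print > fn keeps every output file open, so with very many samples some awks can run out of file descriptors. A sketch of the same logic that appends and closes each file as it goes:
awk -F_ '(NR % 2) == 1 { line1 = $0; fn="/scratch/tkchafin/data/phylogenetics/sample."$4 }
         (NR % 2) == 0 { print line1"_"$0 >> fn; close(fn) }' "$1"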

How to print variable value always as last column in CSV file

I have a list of CSV files, and I have to print a variable value (it changes dynamically) as the last column in each CSV file.
Here is the code:
addProgramtypeID () {
for csv in $1
do
file_name="$csv"
echo $file_name
f=`echo $file_name | cut -d '_' -f3 | cut -d '.' -f1`
echo $f
k=`grep -i $f Program_type.csv | cut -d ',' -f3`
echo $k
awk '{ print $0 "," "'"$k"'" }' "$csv" > tempfile && mv tempfile "$csv"
done
}
addProgramtypeID "T_H_EDCGO.csv"
As of now the variable value k is being printed in the 1st column of the CSV file, and it is also removing the first 2 characters of the first column. My requirement is that the variable value should always appear as the last column in the CSV file.
Input:
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
Suppose $k=2.
Output:
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,2
123,3,334,234,3,2
545,2,444,456,5,2
Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
Assuming there is nothing nasty in your CSV file, you can use awk as follows:
for csv_file in $ALL_MY_FILES
do
cat "$csv_file" | awk 'BEGIN{FS=","}; {print($(NF))}'
done
Or even just
cat $ALL_MY_FILES | awk 'BEGIN{FS=","}; {print($(NF))}'
Both of these will print the last column of every line of all the csv files. The results from each CSV are just appended together (is that really what you want?).
The difficulties are on the awk side. It is completely unaware of things like quoted strings or extra whitespace. My recommendation is to try the line above, see what goes wrong (if anything), and then start tweaking.
It looks like what you want is just:
$ cat tst.sh
addProgramtypeID () {
csv="$1"
awk -v csv="$csv" '
BEGIN{ FS=OFS=","; split(csv,csvA,/[_.]/); f=csvA[3] }
NR==FNR { if ($0 ~ f) { k = $3 }; next }
{ print $0, k }
' Program_type.csv "$csv" > tempfile && mv tempfile "$csv"
}
addProgramtypeID "T_H_EDC.csv"
$ cat Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
$ ./tst.sh
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,1
123,3,334,234,3,1
545,2,444,456,5,1
but it's hard to tell since your posted sample input could not produce your posted desired output so I had to make some up.
if ($0 ~ f) should probably just be if ($1 == f), I just copied what your original grep f <file> logic would do.

bash awk first 1st column and 3rd column with everything after

I am working on the following bash script:
# contents of dbfake file
1 100% file 1
2 99% file name 2
3 100% file name 3
#!/bin/bash
# cat out data
cat dbfake |
# select lines containing 100%
grep 100% |
# print the first and third columns
awk '{print $1, $3}' |
# echo out id and file name and log
xargs -rI % sh -c '{ echo %; echo "%" >> "fake.log"; }'
exit 0
This script works ok, but how do I print everything in column $3 and then all columns after?
You can use cut instead of awk in this case:
cut -f1,3- -d ' '
awk '{ $2 = ""; print }' # remove col 2
If you don't mind a little whitespace:
awk '{ $2="" }1'
But to avoid the UUOC (useless use of cat) and the grep:
< dbfake awk '/100%/ { $2="" }1' | ...
If you'd like to trim that whitespace:
< dbfake awk '/100%/ { $2=""; sub(FS "+", FS) }1' | ...
For fun, here's another way using GNU sed:
< dbfake sed -r '/100%/s/^(\S+)\s+\S+(.*)/\1\2/' | ...
All you need is:
awk 'sub(/.*100% /,"")' dbfake | tee "fake.log"
Others responded in various ways, but I want to point out that using xargs to multiplex output is a rather bad idea.
Instead, why don't you:
awk '$2=="100%" { sub("100%[[:space:]]*",""); print; print >>"fake.log"}' dbfake
That's all. You don't need grep, you don't need multiple pipes, and you definitely don't need to fork a shell for every line you're outputting.
You could do awk '...; print}' | tee fake.log, but there is not much point in forking tee if awk can handle it as well.
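For reference, a sketch of that tee variant (the same awk program as above, with tee writing the log):
awk '$2=="100%" { sub("100%[[:space:]]*",""); print }' dbfake | tee fake.log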

Explode to Array

I put together this shell script to do two things:
Change the delimiters in a data file ('::' to ',' in this case)
Select the columns and I want and append them to a new file
It works but I want a better way to do this. I specifically want to find an alternative method for exploding each line into an array. Using command line arguments doesn't seem like the way to go. ANY COMMENTS ARE WELCOME.
# Takes :: separated file as 1st parameters
SOURCE=$1
# create csv target file
TARGET=${SOURCE/dat/csv}
touch $TARGET
echo "#userId,itemId" > $TARGET
IFS=","
while read LINE
do
# Replaces all matches of :: with a ,
CSV_LINE=${LINE//::/,}
set -- $CSV_LINE
echo "$1,$2" >> $TARGET
done < $SOURCE
Instead of set, you can use an array:
arr=($CSV_LINE)
echo "${arr[0]},${arr[1]}"
The following would print columns 1 and 2 from infile.dat. Replace $1, $2 with a comma-separated list of the numbered columns you do want.
awk 'BEGIN { FS="::"; OFS="," } { print $1, $2 }' infile.dat > infile.csv
Perl probably has a one-liner to do it.
Awk can probably do it easily too.
My first reaction is a combination of awk and sed:
Sed to convert the delimiters
Awk to process specific columns
cat inputfile | sed -e 's/::/,/g' | awk -F, '{print $1, $2}'
# Or to avoid a UUOC award (and prolong the life of your keyboard by 3 characters)
sed -e 's/::/,/g' inputfile | awk -F, '{print $1, $2}'
awk is indeed the right tool for the job here, it's a simple one-liner.
$ cat test.in
a::b::c
d::e::f
g::h::i
$ awk -F:: -v OFS=, '{$1=$1;print;print $2,$3 >> "altfile"}' test.in
a,b,c
d,e,f
g,h,i
$ cat altfile
b,c
e,f
h,i
$
