Edit text format with shell script - bash

I am trying to make a script for text editing. In this case I have a text file named text.csv, which reads:
first;48548a;48954a,48594B
second;58757a;5875b
third;58756a;58576b;5867d;56894d;45864a
I want to reformat the text like this:
first;48548a
first;48954a
first;48594B
second;58757a
second;5875b
third;58756a
third;58576b
third;5867d
third;56894d
third;45864a
What command should I use to make this happen?

I'd do this in awk.
Assuming your first line should have a ; instead of a ,:
$ awk -F\; '{for(n=2; n<=NF; n++) { printf("%s;%s\n",$1,$n); }}' input.txt
Untested.

Here is a pure bash solution that handles both , and ;.
while IFS=';,' read -a data; do
id="${data[0]}"
data=("${data[@]:1}")
for item in "${data[@]}"; do
printf '%s;%s\n' "$id" "$item"
done
done < input.txt
UPDATED - alternate printing method based on chepner's suggestion:
while IFS=';,' read -a data; do
id="${data[0]}"
data=("${data[@]:1}")
printf "$id;%s\n" "${data[@]}"
done < input.txt

awk -v FS=';' -v OFS=';' '{for (i = 2; i <= NF; ++i) { print $1, $i }}'
Explanation: awk implicitly splits its input into records (by default separated by newlines, i.e. one line is one record), which are then split into numbered fields by the given field separator (FS is the input field separator and OFS the output field separator).
For each record, this script prints the first field (the record name) along with the i-th field, which is exactly what you need.
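A minimal sketch of the same idea, assuming the separators may be either ; or , as in the first line of the sample file. The function name expand_rows is made up for illustration and is not part of any answer above:

```shell
# expand_rows: hypothetical helper that reads records on stdin.
# FS is treated as a regex, so both ';' and ',' split fields.
# OFS is what print puts between $1 and $i on output.
expand_rows() {
    awk -F'[;,]' -v OFS=';' '{ for (i = 2; i <= NF; ++i) print $1, $i }'
}

printf '%s\n' 'first;48548a;48954a,48594B' 'second;58757a;5875b' | expand_rows
```

Because the separator is a bracket expression, the stray comma in the first sample line no longer needs to be fixed by hand.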

while IFS=';,' read -a data; do
id="${data[0]}"
data=("${data[@]:1}")
printf "$id;%s\n" "${data[@]}"
done < input.txt
or
awk -v FS=';' -v OFS=';' '{for (i = 2; i <= NF; ++i) { print $1, $i }}'
And
$ awk -F\; '{for(n=2; n<=NF; n++) { printf("%s;%s\n",$1,$n); }}' input.txt
Thanks, all, for your suggestions. They really taught me something new.

Related

Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[@]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly implement a shell variable in awk, but none of these have worked so far. However, when removing the shell variable and running the script it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong with implementing the bash variable in awk and would appreciate some help.
Thx in advance.
That specific error message is because you forgot -v in front of var= (it should be awk -v var=, not just awk var=), but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array, and shell and awk are two completely different tools, each with its own syntax, semantics, scopes, etc.
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
BEGIN {
split(xyz,tmp,RS)
for (i in tmp) {
var[tmp[i]]
}
for (idx in var) {
print "<" idx ">"
}
}
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks @EdMorton
awk '
FNR == NR {
if ( FNR > 1 )
var[$1]
next
}
FNR == 1 {
for (i = 1; i <= NF; i++)
heading[i] = $i
next
}
{
for (i = 2; i <= NF; i++)
if ( $i == "1" && heading[i] in var) {
outFile = heading[i] ".txt"
print ">kmer" (NR-1) "\n" $1 >> (outFile)
close(outFile)
}
}
' file.tsv input.txt
You might store the string in a variable, then use the split function to turn it into an array. Consider the following simple example. Let the content of file1.txt be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store the output of the first awk command in a shell variable, which is then used as the first argument to the split function in the second awk command. Disclaimer: this solution assumes all files involved use a delimiter compliant with default GNU AWK behavior, i.e. one or more whitespace characters always act as the delimiter.
(tested in gawk 4.2.1)

BASH How to get minimum value from each row

I have csv file like this:
-0.106992, -0.106992, -0.059528, -0.059528, -0.028184, -0.028184, 0.017793, 0.017793, 0.0, 0.220367
-0.094557, -0.094557, -0.063707, -0.063707, -0.020796, -0.020796, 0.003707, 0.003707, 0.200767, 0.200767
-0.106038, -0.106038, -0.056540, -0.056540, -0.015119, -0.015119, 0.032954, 0.032954, 0.237774, 0.237774
-0.049499, -0.049499, -0.006934, -0.006934, 0.026562, 0.026562, 0.067442, 0.067442, 0.260149, 0.260149
-0.081001, -0.081001, -0.039581, -0.039581, -0.008817, -0.008817, 0.029912, 0.029912, 0.222084, 0.222084
-0.046782, -0.046782, -0.000180, -0.000180, 0.030788, 0.030788, 0.075928, 0.075928, 0.266452, 0.266452
-0.082107, -0.082107, -0.026791, -0.026791, 0.001874, 0.001874, 0.052341, 0.052341, 0.249779, 0.249779
I want to get the minimum value from each row.
Expected output must be:
-0.106992
-0.094557
-0.106038
-0.049499
-0.081001
-0.046782
-0.082107
I tried to get it with awk, but awk doesn't return the minimum values:
awk command:
awk '{m=$1; for (i=2; i<=NF; i++) if ($i < m) m = $i; print m}' file_name
output:
-0.028184,
-0.020796,
-0.015119,
-0.006934,
-0.008817,
-0.000180,
-0.026791,
Perl makes short work of this:
perl -MList::Util=min -F', ' -E 'say min @F' file.csv
-0.106992
-0.094557
-0.106038
-0.049499
-0.081001
-0.046782
-0.082107
Using any awk in any shell on every Unix box whether you have blanks after each comma or not:
$ awk -F', *' '{min=$1; for (i=2;i<=NF;i++) if ($i<min) min=$i; print min}' file
-0.106992
-0.094557
-0.106038
-0.049499
-0.081001
-0.046782
-0.082107
with ruby :-D
ruby -F', ' -ane 'puts $F.map(&:to_f).min' file.csv
Your code is correct:
awk '{m=$1; for (i=2; i<=NF; i++) if ($i < m) m = $i; print m}' file_name
Except that you must add a comma to the field separator:
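awk -F '[[:blank:],]+' '{m=$1; for (i=2; i<=NF; i++) if ($i < m) m = $i; print m}' file_name
[[:blank:],]+ matches one or more spaces, tabs, or commas. The + matters: without it, the comma and the space after it count as two separators with an empty field between them.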
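To see why the original attempt went wrong, here is a small sketch. With the default separator every field keeps its trailing comma, so awk compares the fields as strings rather than as numbers. Once the separator consumes the commas and blanks, the fields look numeric and are compared numerically. The helper name row_min is an assumption for illustration:

```shell
# row_min: hypothetical helper, prints the smallest value of each row on stdin.
# The separator is a comma with optional blanks around it, so no field
# carries a stray comma and awk compares the fields as numbers.
row_min() {
    awk -F'[ \t]*,[ \t]*' '{
        m = $1
        for (i = 2; i <= NF; i++)
            if ($i < m) m = $i
        print m
    }'
}

printf '%s\n' '-0.10, -0.05, 0.01, 0.2' '3, 10, 2.5' | row_min
```

The second sample row is the telling one: a string comparison would rank 10 before 3, while the numeric comparison picks 2.5.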

How to split second column by ';' and append first column value

What is the best and the simplest way to do it?
I have tsv file with two columns:
id1<\tab>name1;name2;name3
id2<\tab>name11;name22;name3
id3<\tab>name111;name2;name3333
I want to swap the column order (names<\tab>id), split the names column by ';' and append the corresponding id to each resulting row. I mean something like this:
name1<\tab>id1
name2<\tab>id1
name3<\tab>id1
name11<\tab>id2
name22<\tab>id2
name3<\tab>id2
name111<\tab>id3
name2<\tab>id3
name3333<\tab>id3
Thank You for help!
Using any awk in any shell on every Unix box, one option would be to set the field separator to include both the tab character, and the semicolon.
awk -F'[\t;]' -v OFS='\t' '{for (i=2; i<=NF; i++) print $i, $1}' file
Sample run:
$ cat -A file
id1^Iname1;name2;name3$
id2^Iname11;name22;name3$
id3^Iname111;name2;name3333$
$ awk -F'[\t;]' -v OFS='\t' '{for (i=2; i<=NF; i++) print $i, $1}' file | cat -A
name1^Iid1$
name2^Iid1$
name3^Iid1$
name11^Iid2$
name22^Iid2$
name3^Iid2$
name111^Iid3$
name2^Iid3$
name3333^Iid3$
In the bash interpreter -
while IFS="$IFS;" read -a c; do for n in 1 2 3; do printf '%s\t%s\n' "${c[$n]}" "${c[0]}"; done; done<file
or
while IFS="$IFS;" read id n1 n2 n3; do printf "%s\t%s\n" $n1 $id $n2 $id $n3 $id; done<file
I could have said printf "%s\t$id\n" $n1 $n2 $n3 but it's usually a bad idea to embed a variable into a format string...
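To illustrate why embedding a variable in the format string is risky, here is a small sketch with a made-up id that contains a percent sign. Passing the id as an argument keeps it literal. The function name emit_pairs is hypothetical:

```shell
# emit_pairs: hypothetical helper; the first argument is the id, the rest are names.
# The id is passed as data and never becomes part of the format string,
# so any '%' or '\' in it is printed verbatim.
emit_pairs() {
    id=$1
    shift
    for name in "$@"; do
        printf '%s\t%s\n' "$name" "$id"
    done
}

emit_pairs 'id%1' name1 name2
```

Had the same id been spliced into the format string, printf would have tried to interpret the text after the percent sign as a conversion specification.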

awk field count arithmetic

I am trying to do a simple column addition of column $i and column $((i+33)), but I am not sure whether the syntax is correct.
The two files are first pasted together, and then a column addition across the two files is performed.
Thank you!
paste DOS.tmp DOS.tmp2 | awk '{ printf "%12.8f",$1 OFS; for(i=2; i<33; i++) printf "%12.8f",$i+$((i+33)) OFS; if(33) printf "%12.8f",$33+$66; printf ORS}' >| DOS.tmp3
In awk, unlike in bash, variable expansion does not require a dollar sign ($) in front of the variable name. Variables are defined like a = 2 and used like print a.
Dollar sign ($) is used to refer to (input) fields. So, print $1 will print the first field, and print $a will print the field referenced by variable a, in our case the second field. Similarly, print $a, $(a+3) will print the second and fifth field (separated by the OFS).
All this taken together, makes your program look like:
awk '{ out = sprintf("%12.8f", $1)
for (i=2; i<=33; i++) out = out sprintf("%s%12.8f", OFS, $i+$(i+33))
print out }' numbers
Notice we use sprintf to print all values to the output line variable out first, concatenating like out = out val, and then printing the complete output record with print.
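A small self-contained check of these rules, using a generated line instead of the asker's files. It uses three columns per half rather than 33, adds every column including the first, and prints integers, all purely to keep the sketch short. The names add_halves and half are assumptions:

```shell
# add_halves N: hypothetical helper. Each input line holds two halves of N
# columns; column i of the first half is added to column i+N of the second.
add_halves() {
    awk -v half="$1" '{
        out = ""
        for (i = 1; i <= half; i++)
            # $(i + half) is the field whose number is i + half
            out = out sprintf("%s%d", (i > 1 ? OFS : ""), $i + $(i + half))
        print out
    }'
}

echo '1 2 3 10 20 30' | add_halves 3
```

With this input the single output line is 11 22 33, built up in out and printed once per record.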
Are you trying to add column i in file_1 and file_2? In this case, I provide an example:
paste <(seq -s' ' 33) <(seq -s' ' 33) | awk '{ for(i=1; i<=33; i++) { printf "%f",$i+$((i+33)) ; if(i!=33) printf OFS;} printf ORS}'

Get next field/column width awk

I have a dataset of the following structure:
1234 4334 8677 3753 3453 4554
4564 4834 3244 3656 2644 0474
...
I would like to:
1) search for a specific value, eg 4834
2) return the following field (3244)
I'm quite new to awk, but realize it is a simple operation. I have created a bash-script that asks the user for the input, and attempts to return the following field.
But I can't seem to get around scoping in AWK. How do I pass the input value to awk?
#!/bin/bash
read input
cat data.txt | awk '{
for (i=1;i<=NF;i++) {
if ($i==input) {
print $(i+1)
}
}
}'
Cheers and thanks in advance!
UPDATE Sept. 8th 2011
Thanks for all the replies.
1) It will never happen that the last number of a row is picked - still I appreciate you pointing this out.
2) I have a more general problem with awk. Often I want to "do something" with the result found. In this case I would like to output it to xclip, an application which reads from standard input and copies it to the clipboard. E.g.:
$ echo Hi | xclip
Unfortunately, echo doesn't exist for awk, so I need to return the value and echo it. How would you go about this?
#!/bin/bash
read input
cat data.txt | awk '{
for (i=1;i<=NF;i++) {
if ($i=='$input') {
print $(i+1)
}
}
}'
Don't over think it!
You can create an array in awk with the split command:
split($0, ary)
This will split the line $0 into an array called ary. Now, you can use array syntax to find the particular fields:
awk '{
size = split($0, ary)
for (i=1; i <= size ;i++) {
print ary[i]
}
print "---"
}' data.txt
Now, when you find ary[x] as the field, you can print out ary[x+1].
In your example:
awk -v input="$input" '{
size = split($0, ary)
for (i=1; i < size ;i++) {
if (ary[i] == input) {
print ary[i+1]
}
}
}' data.txt
There is a way of doing this without creating an array, but it's simply much easier to work with arrays in situations like this.
By the way, you can eliminate the cat command by putting the file name after the awk statement and save creating an extraneous process. Everyone knows creating an extraneous process kills a kitten. Please don't kill a kitten.
You pass a shell variable to awk using the -v option. It's cleaner and nicer than having to juggle quotes.
awk -v input="$input" '{
for(i=1;i<=NF;i++){
if ($i == input ){
print "Next value: " $(i+1)
}
}
}' data.txt
And lose the useless cat.
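Putting the -v advice together with the follow-up about xclip, here is a hedged sketch. awk's print already writes to standard output, so the result can be piped or captured like any other command's output. The function name next_field is made up, and the xclip line is commented out because it assumes xclip is installed:

```shell
# next_field VALUE FILE: hypothetical helper; prints the field that follows
# the first field equal to VALUE, then stops reading.
next_field() {
    awk -v input="$1" '{
        for (i = 1; i < NF; i++)      # i < NF: the last field has no successor
            if ($i == input) { print $(i + 1); exit }
    }' "$2"
}

tmp=$(mktemp)
printf '%s\n' '1234 4334 8677 3753 3453 4554' '4564 4834 3244 3656 2644 0474' > "$tmp"
next_field 4834 "$tmp"              # prints 3244
# next_field 4834 "$tmp" | xclip    # same value, sent to the clipboard instead
rm -f "$tmp"
```

Note that awk compares numeric-looking fields as numbers, so a search for 474 would also match the field 0474 in this data.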
Here is my solution: delete everything up to (and including) the search field, then the field you want to print out is field #1 ($1):
awk '/4834/ {sub(/^.* * 4834 /, ""); print $1}' data.txt
