How to split second column by ';' and append first column value - bash

What is the best and simplest way to do this?
I have a TSV file with two columns:
id1<\tab>name1;name2;name3
id2<\tab>name11;name22;name3
id3<\tab>name111;name2;name3333
I want to swap the column order to (name<\tab>id), split the names column on ';', and append the corresponding id to each resulting row. I mean something like this:
name1<\tab>id1
name2<\tab>id1
name3<\tab>id1
name11<\tab>id2
name22<\tab>id2
name3<\tab>id2
name111<\tab>id3
name2<\tab>id3
name3333<\tab>id3
Thank you for the help!

Using any awk in any shell on every Unix box, one option would be to set the field separator to include both the tab character and the semicolon.
awk -F'[\t;]' -v OFS='\t' '{for (i=2; i<=NF; i++) print $i, $1}' file
Sample run:
$ cat -A file
id1^Iname1;name2;name3$
id2^Iname11;name22;name3$
id3^Iname111;name2;name3333$
$ awk -F'[\t;]' -v OFS='\t' '{for (i=2; i<=NF; i++) print $i, $1}' file | cat -A
name1^Iid1$
name2^Iid1$
name3^Iid1$
name11^Iid2$
name22^Iid2$
name3^Iid2$
name111^Iid3$
name2^Iid3$
name3333^Iid3$

In the bash interpreter -
while IFS="$IFS;" read -a c;do for n in 1 2 3; do echo "${c[$n]} ${c[0]}"; done<file
or
while IFS="$IFS;" read id n1 n2 n3; do printf "%s\t%s\n" $n1 $id $n2 $id $n3 $id; done<file
I could have said printf "%s\t$id\n" $n1 $n2 $n3 but it's usually a bad idea to embed a variable into a format string...
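To see the hazard, here is a contrived sketch (the %s id is made up purely for illustration): if a data value happens to contain a format directive, the embedded-variable version silently mangles the output, while passing data as arguments does not.
$ id='%s'                                    # contrived id containing a format directive
$ printf "%s\t$id\n" name1 name2             # format string becomes '%s\t%s\n'
name1	name2
$ printf '%s\t%s\n' name1 "$id" name2 "$id"  # data stays data
name1	%s
name2	%s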

Related

How to print keys from all key-value pairs

Text file looks like this:
key11=val1|key12=val2|key13=val3
key21=val1|key22=val2|key23=val3
How can I extract keys so that:
key11|key12|key13
key21|key22|key23
I have tried, unsuccessfully:
awk '{ gsub(/[^[|]=]+=/,"") }1' file.txt
which just gives back the original data:
key11=val1|key12=val2|key13=val3
key21=val1|key22=val2|key23=val3
Since you tagged bash
while IFS='=|' read -ra words; do
    n=${#words[@]}
    for ((i=1; i<n; i+=2)); do
        unset 'words[i]'
    done
    ( IFS='|'; echo "${words[*]}" )
done < file
gawk
This can be done with awk by setting FS and OFS:
kent$ awk -F'=[^|]*' -v OFS="" '$1=$1' file
key11|key12|key13
key21|key22|key23
or safer: awk -F.... '{$1=$1}1' file
substitution (by sed for example):
kent$ sed 's/=[^|]*//g' file
key11|key12|key13
key21|key22|key23
Here's one solution
echo "key11=val1|key12=val2|key13=val3" \
| awk -F'[=|]' '{
for (i=1;i<=NF;i+=2){
printf("%s%s", $i, (i<(NF-1))?"|":"")
}
print""
}'
output
key11|key12|key13
It should also work by passing in the filename as an argument to awk, i.e.
awk -F'[=|]' '{for (i=1;i<=NF;i+=2){printf("%s%s", $i, (i<(NF-1))?"|":"") }print""}' file1 [file_more_as_will_fit]
Discussion
We use a bracket expression as the value of FS (the field separator) so that each = and | character marks the beginning of a new field.
-F'[=|]'
Because we want to start the output with field 1 and then skip every other field, we use
for (i=1;i<=NF;i+=2)
printf formats the output as defined by the format string '%s%s'. There are a zillion options available for printf format strings, but here we only need the value of $i (the key picked out by the loop) and whether or not to print a | character.
printf("%s%s", $i ...)
And we use awk's ternary operator, which checks how far through the fields we are. As long as i has not reached the last key field, the | character is emitted.
(i<(NF-1))?"|":""
IHTH
sed
I did this with sed:
sed -r 's/([[:alnum:]]*)=[[:alnum:]]*/\1/g' < file.txt
I tested it and got:
key11|key12|key13
key21|key22|key23
s/<pattern>/<subst>/ means "replace <pattern> by <subst>", and with the g in the end it will do it for every pattern found in the line.
The [[:alnum:]]* is equivalent to [0-9a-zA-Z]*, and means any number of letters or digits.
The first pattern between parentheses corresponds to \1 in the substitution, the second to \2, and so on.
So, it will match every "key=value" and replace it by "key".
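As a minimal illustration of the capture group, on a made-up single pair:
$ echo 'key=value' | sed -r 's/([[:alnum:]]*)=[[:alnum:]]*/\1/'
key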
awk -F'[=|]' '{print $1,$3,$5}' OFS="|" file
key11|key12|key13
key21|key22|key23

Splitting csv file into multiple files with 2 columns in each file

I am trying to split a file (testfile.csv) that contains the following:
1,2,4,5,6,7,8,9
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
into a file
1,2
a,b
q,w
a,s
z,x
and another file
4,5
c,d
e,r
d,f
c,v
but I cannot seem to do that in awk using an iterative solution.
awk -F, '{print $1, $2}'
awk -F, '{print $3, $4}'
does it for me but I would like a looping solution.
I tried
awk -F, '{ for (i=1;i< NF;i+=2) print $i, $(i+1) }' testfile.csv
but it gives me a single column of pairs. It appears that I am iterating over the first row and then moving on to the second row, skipping every other element of that row.
You can use cut:
$ cut -d, -f1,2 file > file_1
$ cut -d, -f3,4 file > file_2
If you are going to use awk, be sure to set the OFS so that the output columns remain comma separated:
$ awk 'BEGIN{FS=OFS=","}
{print $1,$2 >"f1"; print $3,$4 > "f2"}' file
$ cat f1
1,2
a,b
q,w
a,s
z,x
$ cat f2
4,5
c,d
e,r
d,f
c,v
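Without OFS set, awk joins the printed fields with its default output separator, a space, so the files would no longer be CSV. A quick sketch of the difference:
$ awk -F, '{print $1,$2}' file | head -1
1 2
$ awk 'BEGIN{FS=OFS=","} {print $1,$2}' file | head -1
1,2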
Is there a quick and dirty way of naming the resulting files after the first row's values (so the first file would be 1.csv and the second file 4.csv)?
awk 'BEGIN{FS=OFS=","}
FNR==1 {n1=$1 ".csv"; n2=$3 ".csv"}
{print $1,$2 >n1; print $3,$4 > n2}' file
awk -F, -v OFS=, '{ for (i=1; i<NF; i+=2) print $i, $(i+1) > (i ".csv") }' testfile.csv
works for me (the parentheses around the output filename expression avoid an ambiguity in awk's grammar). I was trying to get the output in bash, which was all jumbled up.
It's do-able in bash, but it will be much slower than awk:
f=testfile.csv
IFS=, read -ra first < <(head -1 "$f")
for ((i = 0; i < (${#first[@]} + 1) / 2; i++)); do
    slice_file="${f%.csv}$((i+1)).csv"
    cut -d, -f"$((2 * i + 1))-$((2 * (i + 1)))" "$f" > "$slice_file"
done
with sed:
sed -r '
h
s/(.,.),.*/\1/w file1.txt
g
s/.,.,(.,.),.*/\1/w file2.txt' file.txt
(Note this relies on every field being a single character, as in the sample input.)

Sum all values in each column bash

I have a csv file which looks like this:
ID_X,1,2,7,8
ID_Y,6,9,3,5
ID_Z,7,12,4,4
My goal is to create a csv file with the sum of all the values in each column (from the second column on), so in this case that file will look like this:
SUM,14,23,14,17
So far, I am able to do it for one column at a time using awk. For instance, for the first column with numbers:
awk 'BEGIN {FS=OFS=","} ; {sum+=$2} END {print sum}' test.txt
14
Is there any way to achieve what I am looking for?
Many thanks!
You are almost there.
With awk you could say:
awk ' BEGIN {FS=OFS=","}
{for (i=2; i<=NF; i++) {sum[i]+=$i} len=NF}
END {$1="SUM"; for (i=2; i<=len; i++) $i=sum[i]; print}
' file.csv
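A sample run, assuming the input above is saved as file.csv:
$ awk 'BEGIN {FS=OFS=","} {for (i=2; i<=NF; i++) sum[i]+=$i; len=NF} END {$1="SUM"; for (i=2; i<=len; i++) $i=sum[i]; print}' file.csv
SUM,14,23,14,17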
Using datamash:
echo -n SUM,; datamash -t, sum 2,3,4,5 < file.csv
Using numsum:
printf 'SUM%.0s,%s,%s,%s,%s\n' `numsum -s, -c file.csv`
or, if the number of columns in file.csv is variable:
numsum -s, -c file.csv | sed 's/^0/SUM/;y/ /,/'
Output:
SUM,14,23,14,17

awk field count arithmetic

I am trying to do a simple column addition of column $i and column $((i+33)), but I am not sure whether the syntax is correct.
Two files are first pasted together, and then a column-wise addition across the two files is performed.
Thank you!
paste DOS.tmp DOS.tmp2 | awk '{ printf "%12.8f",$1 OFS; for(i=2; i<33; i++) printf "%12.8f",$i+$((i+33)) OFS; if(33) printf "%12.8f",$33+$66; printf ORS}' >| DOS.tmp3
In awk, unlike in bash, variable expansion does not require a dollar sign ($) in front of the variable name. Variables are defined like a = 2 and used like print a.
Dollar sign ($) is used to refer to (input) fields. So, print $1 will print the first field, and print $a will print the field referenced by variable a, in our case the second field. Similarly, print $a, $(a+3) will print the second and fifth field (separated by the OFS).
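For example, a throwaway one-liner (not part of the question's data):
$ echo 'f1 f2 f3 f4 f5' | awk '{ a = 2; print $a, $(a+3) }'
f2 f5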
All this taken together makes your program look like:
awk '{ out = sprintf("%12.8f", $1)
for (i=2; i<=33; i++) out = out sprintf("%s%12.8f", OFS, $i+$(i+33))
print out }' numbers
Notice we use sprintf to print all values into the output line variable out first, concatenating with out = out val, and then print the complete output record with print.
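The concatenation idiom in isolation, as a throwaway example:
$ awk 'BEGIN { out = sprintf("%12.8f", 1); out = out sprintf("%s%12.8f", ",", 2); print out }'
  1.00000000,  2.00000000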
Are you trying to add column i of file_1 to column i of file_2? In that case, here is an example (written with awk's $(i+33) rather than bash's $((i+33))):
paste <(seq -s' ' 33) <(seq -s' ' 33) | awk '{ for(i=1; i<=33; i++) { printf "%f", $i+$(i+33); if(i!=33) printf OFS } printf ORS }'

Edit text format with shell script

I am trying to make a script for text editing. In this case I have a text file named text.csv, which reads:
first;48548a;48954a,48594B
second;58757a;5875b
third;58756a;58576b;5867d;56894d;45864a
I want the text reformatted like this:
first;48548a
first;48954a
first;48594B
second;58757a
second;5875b
third;58756a
third;58576b
third;5867d
third;56894d
third;45864a
What command should I use to make this happen?
I'd do this in awk.
Assuming your first line should have a ; instead of a ,:
$ awk -F\; '{for(n=2; n<=NF; n++) { printf("%s;%s\n",$1,$n); }}' input.txt
Untested.
Here is a pure bash solution that handles both , and ;.
while IFS=';,' read -ra data; do
    id="${data[0]}"
    data=("${data[@]:1}")
    for item in "${data[@]}"; do
        printf '%s;%s\n' "$id" "$item"
    done
done < input.txt
UPDATED - alternate printing method based on chepner's suggestion:
while IFS=';,' read -ra data; do
    id="${data[0]}"
    data=("${data[@]:1}")
    printf "$id;%s\n" "${data[@]}"
done < input.txt
awk -v FS=';' -v OFS=';' '{for (i = 2; i <= NF; ++i) { print $1, $i }}'
Explanation: awk implicitly splits its input into records (by default separated by newlines, i.e. one line per record), which are then split into numbered fields by the given field separators (FS for input, OFS for output).
For each record this script prints the first field (the record name) along with the i-th field, and that's exactly what you need.
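A sample run against one line of the input:
$ echo 'second;58757a;5875b' | awk -v FS=';' -v OFS=';' '{for (i = 2; i <= NF; ++i) { print $1, $i }}'
second;58757a
second;5875b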
while IFS=';,' read -ra data; do
    id="${data[0]}"
    data=("${data[@]:1}")
    printf "$id;%s\n" "${data[@]}"
done < input.txt
or
awk -v FS=';' -v OFS=';' '{for (i = 2; i <= NF; ++i) { print $1, $i }}'
And
$ awk -F\; '{for(n=2; n<=NF; n++) { printf("%s;%s\n",$1,$n); }}' input.txt
Thanks all for your suggestions :D. They really gave me new knowledge.
