Bash command for splitting cell content by delimiter into multiple rows

Here is the task. I have a dataframe:
x y1;y2;y3 z1;z2;z3
a b1;b2 c1;c2
I need:
x y1 z1
x y2 z2
x y3 z3
a b1 c1
a b2 c2
Column 1 always holds a single value. The number of values in a cell can range from one to many, but it is always the same in columns 2 and 3. Thanks

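In awk, assuming every row has exactly three values per cell:
$ awk -F"(\t|;)" '{
    for(i=2;i<=4;i++)
        if($i!="")
            print $1, $i, $(i+3)
}' file
For the first sample row this prints:
x y1 z1
x y2 z2
x y3 z3
The offset of 3 is hardcoded, so the second sample row (two values per cell) is not paired correctly.
Edit: Another version, which derives the offset from NF and so handles any number of values per cell: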
$ awk -F"(\t+|;)" '{                  # FS: tabs or semicolon
    for(i=2;i<=int(NF/2)+1;i++)
        print $1,$i,$(i+int(NF/2))
}' file
x y1 z1
x y2 z2
x y3 z3
a b1 c1
a b2 c2
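A variant that does not treat the semicolon as a field separator is to split each cell explicitly with split(). The sketch below is not from the answer above; it assumes whitespace-separated columns and an equal number of values in columns 2 and 3, and the function name split_cells is made up for illustration.

```shell
# Hypothetical helper: expand "key a1;a2 b1;b2" rows into one row per pair.
split_cells() {
    awk '{
        n = split($2, left, ";")     # values from column 2
        split($3, right, ";")        # values from column 3 (same count assumed)
        for (i = 1; i <= n; i++)
            print $1, left[i], right[i]
    }' "$@"
}

printf 'x y1;y2;y3 z1;z2;z3\na b1;b2 c1;c2\n' | split_cells
```

Because the count comes from split(), the same code works for any number of values per cell.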

Something like this should do it:
declare -a cols=()                      # array for the individual fields of a line
IFS=' ;'                                # field separators
while read -r -a cols; do
    n=${#cols[@]}                       # number of fields in the current line
    if (( n < 3 || n % 2 != 1 )); then  # skip invalid lines
        printf "skipping invalid line: %s\n" "${cols[*]}"
        continue
    fi
    for (( i = 1; i <= n / 2; i += 1 )); do   # loop over pairs of fields
        printf "%s %s %s\n" "${cols[0]}" "${cols[i]}" "${cols[n/2+i]}"
    done
done < data.txt
Explanations:
IFS is the list of characters used by read to split a line into fields. In your case, spaces and ; seem to be the separators.
read -a cols assigns the fields of the read line to the cols array, starting at cell 0.
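To see the effect of IFS and read -a in isolation, here is a small bash sketch. The function name show_fields is hypothetical, and the here-string (<<<) is just a convenient way to feed one line to read.

```shell
# Hypothetical helper: split one line on spaces and semicolons, then report the fields.
show_fields() {
    local IFS=' ;'                # same separators as in the script above
    local -a cols
    read -r -a cols <<< "$1"      # fields land in cols[0], cols[1], ...
    echo "${#cols[@]} fields: ${cols[*]}"
}

show_fields 'x y1;y2;y3 z1;z2;z3'
```

Declaring IFS as local keeps the changed separators from leaking into the rest of the script.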
Example of run:
$ cat data.txt
x y1;y2;y3 z1;z2;z3
a b1;b2 c1;c2
$ ./foo.sh
x y1 z1
x y2 z2
x y3 z3
a b1 c1
a b2 c2

Related

Using awk to print data from a specific sequence of lines as data arranged as a row

I have a file organized like this:
a b c d
x1
x2
x3
e f g h
x4
x5
x6
and so on. I would like to use awk to write another file as follows:
x1 x2 x3
x4 x5 x6
and so on. I am struggling since I'm still beginning to learn awk and sed. Any suggestions?
I would use GNU AWK for this task in the following way. Let the content of file.txt be
a b c d
x1
x2
x3
e f g h
x4
x5
x6
then
awk 'BEGIN{ORS=" "}NR==1{next}NF==1{print $1}NF>1{printf "\n"}' file.txt
gives output
x1 x2 x3
x4 x5 x6
Explanation: I tell GNU AWK to use a space as the output record separator (ORS). The first row is skipped with next. If a row has exactly one field, I print that field ($1), which gets a trailing space rather than a newline because ORS is set to a space. If a row has more than one field, I just printf a newline. Note that printf, unlike print, does not append ORS. If you want to know more about ORS, NR, or NF, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
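If the trailing space on each output row is a problem, one alternative is to collect the fields in a variable and print the row when the next header arrives. This is a sketch rather than the answer's method; it assumes header lines always have more than one field, and rows_from_blocks is a made-up name.

```shell
# Hypothetical helper: join the single-field lines under each header into one row.
rows_from_blocks() {
    awk '
    NF > 1  { if (line != "") print line; line = ""; next }   # header: flush the row
    NF == 1 { line = (line == "" ? $1 : line " " $1) }         # data: append the field
    END     { if (line != "") print line }                     # flush the last row
    ' "$@"
}

printf 'a b c d\nx1\nx2\nx3\ne f g h\nx4\nx5\nx6\n' | rows_from_blocks
```

ORS is left at its default here, so every row ends in a plain newline.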

Combining 2 lines together but "interlaced"

I have 2 lines from an output as follow:
a b c
x y z
I would like to pipe both lines from the last command into a script that would combine them "interlaced", like this:
a x b y c z
The solution should work for a random number of columns from the output, such as:
a b c d e
x y z x y
Should result in:
a x b y c z d x e y
So far, I have tried using awk, perl, sed, etc... but without success. All I can do, is to put the output into one line, but it won't be "interlaced":
$ echo -e 'a b c\nx y z' | tr '\n' ' ' | sed 's/$/\n/'
a b c x y z
Keep fields of odd numbered records in an array, and update the fields of even numbered records using it. This will interlace each pair of successive lines in input.
prog | awk 'NR%2{split($0,a);next} {for(i in a)$i=(a[i] OFS $i)} 1'
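The same idea can be spelled out with an explicit counter, which may be easier to follow than assigning back into the fields. This is only a sketch: it assumes both lines of a pair have the same number of fields, it silently drops a trailing unpaired line, and the name interlace is invented here.

```shell
# Hypothetical helper: interlace the fields of each pair of input lines.
interlace() {
    awk '
    NR % 2 { n = split($0, first); next }      # odd line: remember its fields
    {
        out = ""
        for (i = 1; i <= n; i++)               # even line: pair field i with first[i]
            out = out (i > 1 ? OFS : "") first[i] OFS $i
        print out
    }' "$@"
}

printf 'a b c\nx y z\n' | interlace
```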
Here's a 3 step solution:
$ # get one argument per line
$ printf 'a b c\nx y z' | xargs -n1
a
b
c
x
y
z
$ # split numbers of lines by 2 and combine them side by side
$ printf 'a b c\nx y z' | xargs -n1 | pr -2ts' '
a x
b y
c z
$ # combine all input lines into single line
$ printf 'a b c\nx y z' | xargs -n1 | pr -2ts' ' | paste -sd' '
a x b y c z
$ printf 'a b c d e\nx y z 1 2' | xargs -n1 | pr -2ts' ' | paste -sd' '
a x b y c z d 1 e 2
Try the following; it joins every two lines in "interlaced" fashion.
awk '
FNR%2!=0 && FNR>1{                 # a new pair starts: flush the previous pair
    for(j=1;j<=NF;j++){
        printf("%s%s",a[j],j==NF?ORS:OFS)
    }
    delete a                       # clear only after the whole pair is printed
}
{
    for(i=1;i<=NF;i++){
        a[i]=(a[i]?a[i] OFS:"")$i  # append this line's field to column i
    }
}
END{
    for(j=1;j<=NF;j++){
        printf("%s%s",a[j],j==NF?ORS:OFS)
    }
}' Input_file
Here is a simple awk script.
script.awk
NR == 1 {split($0,inArr1)}   # read fields from the 1st line into inArr1
NR == 2 {split($0,inArr2);   # read fields from the 2nd line into inArr2
    for (i = 1; i <= NF; i++)   # output interlaced fields from inArr1 and inArr2
        printf("%s%s%s%s", inArr1[i], OFS, inArr2[i], (i < NF ? OFS : ORS));
}
input.txt
a b c d e
x y z x y
running:
awk -f script.awk input.txt
output:
a x b y c z d x e y
Multiline awk solution:
interlaced.awk
{
    a[NR] = $0
}
END {
    n = split(a[1], b)
    split(a[2], c)
    for (i = 1; i <= n; i++) {
        printf "%s%s %s", (i==1 ? "" : OFS), b[i], c[i]
    }
    print ""
}
Run it like this:
foo_program | awk -f interlaced.awk
Perl will do the job. It was invented for this type of task.
echo -e 'a b c\nx y z' | \
perl -MList::MoreUtils=mesh -e \
'@f=mesh @{[split " ", <>]}, @{[split " ", <>]}; print "@f"'
a x b y c z
You can of course print out the meshed output any way you want.
Check out http://metacpan.org/pod/List::MoreUtils#mesh
You could even make it into a shell function for easy use:
function meshy {
    perl -MList::MoreUtils=mesh -e \
    '@f=mesh @{[split " ", <>]}, @{[split " ", <>]}; print "@f"'
}
$ echo -e 'X Y Z W\nx y z w' | meshy
X x Y y Z z W w
Ain't Perl grand?
This might work for you (GNU sed):
sed -E 'N;H;x;:a;s/\n(\S+\s+)(.*\n)(\S+\s+)/\1\3\n\2/;ta;s/\n//;s// /;h;z;x' file
Process two lines at a time. Append the two lines in the pattern space to the hold space, which introduces a newline in front of them. Using pattern matching and back references, nibble away at the front of each of the two lines and place the pairs at the front. Eventually the pattern matching fails; then remove the first newline and replace the second with a space. Copy the amended line to the hold space, clean up the pattern space ready for the next pair of lines (if any), and print.

Get non-monotonically increasing fields in Bash

Let's say I have a file with multiple columns and I want to get several fields, but they may not be in increasing order. The field indexes are in an array; they can be in any order, or in no order at all, and the number of indexes is unknown. For example:
arr=(1 3 2) #indexes, unknown length
echo 'c1 c2 c3' | cut -d " " -f "${arr[*]}"
The output of that is
c1 c2 c3
but I want
c1 c3 c2
So it seems cut is sorting the fields before reading them, I don't want that. I am not restricted to cut, any other command can be used.
However, I am restricted to this, rather old, version of bash:
GNU bash, version 2.05b.0(1)-release (i586-suse-linux)
Copyright (C) 2002 Free Software Foundation, Inc.
EDIT Solved thanks to Benjamin W and Glenn Jackman
echo "1 2 3" | awk -v fields="${arr[*]}" 'BEGIN{ n = split(fields,f) } { for (i=1; i<=n; ++i) printf "%s%s", $f[i], (i<n?OFS:ORS) }'
It is important to reference the array with '*' instead of '@'.
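The reason is how the two forms expand inside double quotes: with * the whole array becomes a single word, so awk receives one -v assignment, whereas with @ every element becomes its own word and the extra words end up as separate awk arguments. A minimal bash sketch of the difference (count_args is a hypothetical helper, not part of the solution above):

```shell
arr=(1 3 2)

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

count_args "${arr[*]}"    # one word: "1 3 2"
count_args "${arr[@]}"    # three words: "1" "3" "2"
```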
This may or may not work with bash 2.05:
arr=(1 3 2)
set -f                        # disable filename generation
while read -r line; do
    set -- $line              # unquoted: taking advantage of word splitting,
                              # store the words as positional parameters
    for i in "${arr[@]}"; do
        printf "%s " "${!i}"  # indirect variable expansion
    done
    echo
done < file
Or, perl
$ cat file
c1 c2 c3
$ perl -slane '
    BEGIN {@a = map {$_ - 1} split " ", $arr}
    print join " ", @F[@a]
' -- -arr="${arr[*]}" file
c1 c3 c2
Using awk
$ arr=(1 3 2)
$ echo 'c1 c2 c3' | awk -v arr="${arr[*]}" '
BEGIN {
    n = split(arr, idx, " ")
}
{
    for (i = 1; i <= n; ++i)
        printf("%s%s", $idx[i], (i < n ? OFS : ORS))
}
'
First, split arr by ' ' and assign to idx
Then, print based on each index i
Using awk:
$ cat file
a b c d
a b c d
a b c d
a b c d
$ awk -v ord="1 4 3 2" 'BEGIN { split(ord, order, " ") }
{
split($0, line, FS)
for (i = 1; i <= length(order); ++i)
$i = line[order[i]]
print
}' file
a d c b
a d c b
a d c b
a d c b
The order is given by the ord variable passed on the command line. This variable is assumed to hold as many space-separated values as there are columns in the input file.
In the BEGIN block, an array, order, is created from ord by splitting it on spaces.
In the default block, the current input line is split into the array line on FS (whitespace by default). The fields are then rearranged according to the order array and then the re-constructed line is printed out.
No test is made that the passed-in value in ord is sane. If the input has N columns, it must contain all the integers from 1 to N in some order.
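If a sanity check is wanted, one possible shape is sketched below. It is an addition rather than part of the answer: the function name reorder, the error messages, and the decision to skip lines with the wrong field count are all assumptions, and writing to "/dev/stderr" relies on awk or the operating system providing that path.

```shell
# Hypothetical helper: reorder fields, rejecting an order that is not a permutation of 1..n.
reorder() {
    awk -v ord="$1" '
    BEGIN {
        n = split(ord, order, " ")
        for (i = 1; i <= n; i++) {
            k = order[i]
            if (k !~ /^[0-9]+$/ || k < 1 || k > n || (k in seen)) {
                print "bad field order: " ord > "/dev/stderr"
                exit 1                      # stop before reading any input
            }
            seen[k] = 1
        }
    }
    NF != n {                               # skip lines with the wrong field count
        print "line " NR ": expected " n " fields" > "/dev/stderr"
        next
    }
    {
        split($0, line)
        for (i = 1; i <= n; i++)
            $i = line[order[i]]
        print
    }'
}

printf 'a b c d\n' | reorder "1 4 3 2"
```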
Abstract your read from your print so you can name the parts and order them accordingly.
$: cat x
c1 c2 c3
c1 c2 c3
c1 c2 c3
c1 c2 c3
c1 c2 c3
$: while read -r a b c
> do printf '%s %s %s\n' "$a" "$c" "$b"
> done < x
c1 c3 c2
c1 c3 c2
c1 c3 c2
c1 c3 c2
c1 c3 c2
This puts the read loop into the bash interpreter, which isn't as fast as a binary, but doesn't require any tool you weren't already using.
I don't see much point in using awk if you have perl, so if the file is big enough you need a faster solution, try this:
perl -a -n -e 'print join " ", @F[0,2,1],"\n"' x
Assumes a lot, and adds a space before the newline, but should give you a working place to start.

complex line copying&modifying on-the-fly with grep or sed

Is there a way to do the following with either grep or sed: read each line of a file, copy it twice, and modify each copy?
Original line:
X Y Z
A B C
New lines:
Y M X
Y M Z
B M A
B M C
where X, Y, Z, M are all integers, and M is a fixed integer (i.e. 2) we inject while copying! I suppose a solution (if any) will be so complex that people (including me) will start bleeding after seeing it!
$ awk -v M=2 '{print $2,M,$1; print $2,M,$3;}' file
Y 2 X
Y 2 Z
B 2 A
B 2 C
How it works
-v M=2
This defines the variable M to have value 2.
print $2,M,$1
This prints the second column, followed by M, followed by the first column.
print $2,M,$3
This prints the second column, followed by M, followed by the third column.
Extended Version
Suppose that we want to handle an arbitrary number of columns in which we print all columns between first and last, followed by M, followed by the first, and then print all columns between first and last, followed by M, followed by the last. In this case, use:
awk -v M=2 '{for (i=2;i<NF;i++)printf "%s ",$i; print M,$1; for (i=2;i<NF;i++)printf "%s ",$i; print M,$NF;}' file
As an example, consider this input file:
$ cat file2
X Y1 Y2 Z
A B1 B2 C
The above produces:
$ awk -v M=2 '{for (i=2;i<NF;i++)printf "%s ",$i; print M,$1; for (i=2;i<NF;i++)printf "%s ",$i; print M,$NF;}' file2
Y1 Y2 2 X
Y1 Y2 2 Z
B1 B2 2 A
B1 B2 2 C
The key change to the code is the addition of the following command:
for (i=2;i<NF;i++)printf "%s ",$i
This loop prints every column from i=2, the column after the first, through i=NF-1, the column before the last. The rest of the code is the same as before.
Sure; you can write:
sed 's/\(.*\) \(.*\) \(.*\)/\2 M \1\n\2 M \3/'
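In that command M is a literal placeholder. To inject an actual value, the shell variable has to be spliced into the sed script. The sketch below is one way to do that, not the answer's own code: it assumes GNU sed (where \n in the replacement produces a newline), exactly three space-separated columns, and an integer M with no characters special to sed. The name dup_with_m is made up.

```shell
# Hypothetical helper: $1 is the fixed integer M; input lines are read from stdin.
dup_with_m() {
    # The single-quoted pieces are sed syntax; "$1" is spliced in between them.
    sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\)$/\2 '"$1"' \1\n\2 '"$1"' \3/'
}

printf '1 2 3\n4 5 6\n' | dup_with_m 2
```

Using [^ ]* instead of .* keeps each group to a single column, so the match does not depend on greedy backtracking.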
With bash builtin commands:
m=2; while read a b c; do echo "$b $m $a"; echo "$b $m $c"; done < file
Output:
Y 2 X
Y 2 Z
B 2 A
B 2 C

Print awk output into new column

I have a lot of modified files (after filtering) and I need to print NR and the character count for each new file into columns. Here is an example:
input files: x1, x2, x3, y1, y2, y3, z1, z2, z3 ...
script:
for i in x* y* z*
do awk -v h="$i" '{c+=length+1} END{print h "\t" NR "\t" c}' "$i" >> stats.txt
done
my output looks like:
x1 NR c
x2 NR c
x3 NR c
y1 NR c
y2 NR c
y3 NR c
z1 NR c
z2 NR c
z3 NR c
I need the output of each loop to go into a new column rather than a new line:
x1 NR c y1 NR c z1 NR c
x2 NR c y2 NR c z2 NR c
x3 NR c y3 NR c z3 NR c
to keep corresponding files (after filtration) on the same line. I hope I am clear. I need to do this in BASH and awk. Thank you for any help!!
EDITED:
The real output looks like:
x 0.457143 872484
y 0.527778 445759
z 0.416667 382712
x 0.457143 502528
y 0.5 575972
z 0.444444 590294
x 0.371429 463939
y 0.694444 398033
z 0.56565 656565
.
.
.
and I need:
x 0.457143 872484 0.457143 502528 0.371429 463939
y 0.527778 445759 0.5 575972 0.694444 398033
.
.
.
I hope it is clear..
Try this:
cat data | tr -d , | awk '{for (i = 1; i <= NF; i += 3) print $i " NR c " $(i+1) " NR c " $(i+2) " NR c"}'
Output:
x1 NR c x2 NR c x3 NR c
y1 NR c y2 NR c y3 NR c
z1 NR c z2 NR c z3 NR c
Same table but transposed (for your task variant):
cat data | tr -d , | awk '{for (i = 1; i <= NF/3; i += 1) print $i " NR c " $(i+3) " NR c " $(i+6) " NR c"}'
Output:
x1 NR c y1 NR c z1 NR c
x2 NR c y2 NR c z2 NR c
x3 NR c y3 NR c z3 NR c
For your task update check the following solution (using bash):
cat data | sort | while read L;
do
y=`echo $L | cut -f1 -d' '`;
{
test "$x" = "$y" && echo -n " `echo $L | cut -f2- -d' '`";
} ||
{
x="$y";echo -en "\n$L";
};
done
(from my solution for similar problem)
Updated script after comment:
sort data | while read L
do
y="`echo \"$L\" | cut -f1 -d' '`"
if [ "$x" = "$y" ]
then
echo -n " `echo \"$L\" | cut -f2- -d' '`"
else
x="$y"
echo -en "\n$L"
fi
done
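The grouping can also be done in a single awk pass, without sorting and without calling cut for every line. This is a sketch under the assumption that the first field is the key and that keys should appear in order of first occurrence; group_by_first is a hypothetical name.

```shell
# Hypothetical helper: append the remaining fields of every line to the row for its key.
group_by_first() {
    awk '
    NF == 0      { next }                                 # ignore empty lines
    !($1 in row) { order[++n] = $1; row[$1] = $1 }        # first time this key is seen
                 { for (i = 2; i <= NF; i++) row[$1] = row[$1] " " $i }
    END          { for (k = 1; k <= n; k++) print row[order[k]] }
    ' "$@"
}

printf 'x 0.457143 872484\ny 0.527778 445759\nx 0.457143 502528\ny 0.5 575972\n' | group_by_first
```

The whole result is held in memory until END, which should be fine for a stats file of this size.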
