Filter and sort column [closed]

I've learned awk and sed, but I'm stuck on this problem. Can anyone help me?
I have a table like this:
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
I want to split the values at the odd and even columns into two tables, like this:
table 1:
a1
a3
b1
b3
c1
c3
and table 2:
a2
a4
b2
b4
c2
c4
How can I do this?

It's easy to do in awk:
awk '{ for (i = 1; i <= NF; i += 2) print $i > "table.1"
for (i = 2; i <= NF; i += 2) print $i > "table.2" }' data
For each line, the first loop writes the odd fields to table.1 and the second loop writes the even fields to table.2. It will even work with different numbers of columns in each line if the input data is not wholly consistent. A single pass through the input data generates both output files.

If you know the maximum number of fields in advance (say, at most 100), you can just use cut:
$ echo 'a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4' | cut -d' ' -f $(seq -s, 2 2 100) | tr ' ' '\n'
a2
a4
b2
b4
c2
c4
and for the odd ones seq would just start at 1.
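For the odd columns it looks like this:
$ echo 'a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4' | cut -d' ' -f $(seq -s, 1 2 100) | tr ' ' '\n'
a1
a3
b1
b3
c1
c3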
Here's the same thing in awk (i=1 for the odd ones):
echo ... | awk '{ for (i = 2; i <= NF; i += 2) print $i }'

This might work for you (GNU sed):
sed 's/ \+/\n/g;s/^\n\|\n$//' file | sed -ne '1~2w table1' -e '2~2w table2'
Replace space(s) by newlines and remove leading or trailing newlines.
Pipe output into a second invocation of sed which directs odd lines to table1 and even lines to table2.
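If GNU sed's first~step addressing is new to you, here is a quick illustration (not part of the answer itself):
$ seq 6 | sed -n '1~2p'
1
3
5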
Or you may prefer to use:
paste -sd' ' file | tr -s ' ' '\n' | sed -ne '1~2w table1' -e '2~2w table2'

$ awk '{for (i=1; i<=NF; i++) print $i > ("table" (i+1)%2+1)}' file
$ head table*
==> table1 <==
a1
a3
b1
b3
c1
c3
==> table2 <==
a2
a4
b2
b4
c2
c4
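The output file name is computed per field: ("table" (i+1)%2+1) appends 1 when i is odd ((1+1)%2+1 = 1) and 2 when i is even ((2+1)%2+1 = 2), so odd fields land in table1 and even fields in table2.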

Related

Get non-monotonically increasing fields in Bash

Let's say I have a file with multiple columns and I want to get several fields, but they may not be in increasing order. The field indexes are in an array; the indexes can be in any order, or no order at all, and the number of indexes is unknown. For example:
arr=(1 3 2) #indexes, unknown length
echo 'c1 c2 c3' | cut -d " " -f "${arr[*]}"
The output of that is
c1 c2 c3
but I want
c1 c3 c2
So it seems cut sorts the fields before printing them; I don't want that. I am not restricted to cut; any other command can be used.
However, I am restricted to this, rather old, version of bash:
GNU bash, version 2.05b.0(1)-release (i586-suse-linux)
Copyright (C) 2002 Free Software Foundation, Inc.
EDIT: Solved, thanks to Benjamin W and Glenn Jackman:
echo "1 2 3" | awk -v fields="${arr[*]}" 'BEGIN{ n = split(fields,f) } { for (i=1; i<=n; ++i) printf "%s%s", $f[i], (i<n?OFS:ORS) }'
It is important to reference the array with '*' instead of '@'.
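The difference is easy to see with printf (a quick check, not from the original answer): with '*' the elements expand to a single word, with '@' they expand to separate words, and awk -v needs a single word.
$ arr=(1 3 2)
$ printf '<%s>\n' "${arr[*]}"
<1 3 2>
$ printf '<%s>\n' "${arr[@]}"
<1>
<3>
<2>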
This may or may not work with bash 2.05:
arr=(1 3 2)
set -f # disable filename generation
while read line; do
    set -- $line              # unquoted: taking advantage of word splitting,
                              # store the words as positional parameters
    for i in "${arr[@]}"; do
        printf "%s " "${!i}"  # indirect variable expansion
    done
    echo
done < file
Or, perl
$ cat file
c1 c2 c3
$ perl -slane '
    BEGIN { @a = map { $_ - 1 } split " ", $arr }
    print join " ", @F[@a]
' -- -arr="${arr[*]}" file
c1 c3 c2
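For reference: -a autosplits each line into @F, -n wraps the code in a read loop, -l handles line endings, and -s lets perl parse the trailing -arr="..." switch into $arr; the BEGIN block shifts the 1-based field numbers down to the 0-based indexes that @F uses.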
Using awk
$ arr=(1 3 2)
$ echo 'c1 c2 c3' | awk -v arr="${arr[*]}" '
BEGIN {
    split(arr, idx, " ")
}
{
    for (i = 1; i <= length(idx); ++i)
        printf("%s ", $idx[i])
}
END {
    printf("\n")
}'
First, arr is split on spaces into the array idx; then the fields selected by each index i are printed in turn. (Note that length() on an array is not POSIX awk, though GNU awk supports it.)
Using awk:
$ cat file
a b c d
a b c d
a b c d
a b c d
$ awk -v ord="1 4 3 2" 'BEGIN { split(ord, order, " ") }
{
split($0, line, FS)
for (i = 1; i <= length(order); ++i)
$i = line[order[i]]
print
}' file
a d c b
a d c b
a d c b
a d c b
The order is given by the ord variable passed on the command line. This variable is assumed to hold as many space-separated values as there are fields in each input line.
In the BEGIN block, an array, order, is created from ord by splitting it on spaces.
In the main block, the current input line is split into the array line on FS (whitespace by default). The fields are rearranged according to the order array, and the reconstructed line is printed.
No test is made that the passed-in value of ord is sane: if the input has N columns, it must contain all the integers from 1 to N in some order. A check along those lines is sketched below.
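If you do want such a guard, a minimal sanity check could go in the BEGIN block; here is a sketch (not part of the original answer) that rejects any ord that is not a permutation of 1..N:
awk -v ord="1 4 3 2" '
BEGIN {
    n = split(ord, order, " ")
    for (i = 1; i <= n; i++) seen[order[i]]++   # tally each requested index
    for (i = 1; i <= n; i++)
        if (seen[i] != 1) {                     # an index is missing or repeated
            print "ord must be a permutation of 1-" n > "/dev/stderr"
            exit 1
        }
}
{
    split($0, line, FS)
    for (i = 1; i <= n; i++)
        $i = line[order[i]]
    print
}' file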
Abstract your read from your print so you can name the parts and order them accordingly.
$: cat x
c1 c2 c3
c1 c2 c3
c1 c2 c3
c1 c2 c3
c1 c2 c3
$: cut -f 1,3,2 x |
> while read a b c
> do printf "$a $c $b\n"
> done
c1 c3 c2
c1 c3 c2
c1 c3 c2
c1 c3 c2
c1 c3 c2
This runs the read loop in the bash interpreter, which isn't as fast as a compiled tool, but it doesn't require anything beyond what you were already using.
I don't see much point in using awk if you have perl, so if the file is big enough that you need a faster solution, try this:
perl -a -n -e 'print join " ", @F[0,2,1], "\n"' x
Assumes a lot, and adds a space before the newline, but should give you a working place to start.

How to split lines that contain multiple records using bash scripting

I have a file of the form:
Heading1 Heading2 A1 A2 B1 B2
Heading3 Heading4 A3 A4 B3 B4 C1 C2
etc
Each line contains multiple records belonging to the same headings. What I'm trying to do is split these records while preserving their headings. From the example above I would like to produce the following:
Heading1 Heading2 A1 A2
Heading1 Heading2 B1 B2
Heading3 Heading4 A3 A4
Heading3 Heading4 B3 B4
Heading3 Heading4 C1 C2
My main problem is that the number of records per line is not constant.
Edit: Every line has 2 headings and N records, each record consisting of 2 fields, so a line has 2+2*N fields in total; the field count is always even.
Short awk solution:
awk '{ for(i=3;i<=NF;i+=2) print $1,$2,$i,$(i+1) }' file
The output:
Heading1 Heading2 A1 A2
Heading1 Heading2 B1 B2
Heading3 Heading4 A3 A4
Heading3 Heading4 B3 B4
Heading3 Heading4 C1 C2
for(i=3;i<=NF;i+=2) iterates through the fields starting from the 3rd; the i+=2 step takes the records pairwise.
awk '{for(i=3;i<=NF;i+=2)print $1,$2,$i,$(i+1)}' file
NF is the number of fields in the line, and $i is the field with number i.
Here's a solution in pure bash:
#!/bin/bash
while read -r; do
    read -r h1 h2 rest <<< "$REPLY"
    while [ -n "$rest" ]; do
        read -r x1 x2 rest <<< "$rest"
        printf '%s %s %s %s\n' "$h1" "$h2" "$x1" "$x2"
    done
done
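Saved as, say, split.sh (name assumed) and fed the sample input on stdin, it produces the requested output:
$ bash split.sh < file
Heading1 Heading2 A1 A2
Heading1 Heading2 B1 B2
Heading3 Heading4 A3 A4
Heading3 Heading4 B3 B4
Heading3 Heading4 C1 C2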
With GNU sed:
sed -E 's/^(([^ ]* ){2})(([^ ]* ){2})/\1\3\n\1/;P;D' file
While a line still holds more than one record, this rewrites it as the headings plus the first record, a newline, and the headings again in front of the remainder; P prints the first of those lines and D deletes it and restarts the cycle on what is left.

Reshaping from wide to long format

I am trying to use unix to transform a tab-delimited file from a short/wide format to a long format, similar to the reshape function in R. I hope to create three rows for each row in the starting file. Column 4 currently contains 3 values separated by commas. I hope to keep columns 1, 2, and 3 the same for each starting row, but have column 4 be one of the values from the initial column 4. This example probably makes it clearer than I can describe verbally:
current file:
A1 A2 A3 A4,A5,A6
B1 B2 B3 B4,B5,B6
C1 C2 C3 C4,C5,C6
goal:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
As someone just becoming familiar with this language, my initial thought was to use sed to find the commas and replace each with a newline:
sed 's/,/&\n/' data.frame
I am really not sure how to include the values for columns 1-3. I had low hopes of this working, but the only thing I could think of was to try inserting the column values with {print $1, $2, $3}.
sed 's/,/&\n{print $1, $2, $3}/' data.frame
Not to my surprise, the output looked like this:
A1 A2 A3 A4
{print $1, $2, $3} A5
{print $1, $2, $3} A6
B1 B2 B3 B4
{print $1, $2, $3} B5
{print $1, $2, $3} B6
C1 C2 C3 C4
{print $1, $2, $3} C5
{print $1, $2, $3} C6
It seems like an approach might be to store the values of columns 1-3 and then insert them. I am not really sure how to store the values; I think it may involve an adaptation of the following snippet, but I am having a hard time understanding all of its components.
NR==FNR{a[$1, $2, $3]=1}
Thanks in advance for your thoughts on this.
You can write a simple read loop for this and use parameter expansion to split the comma-delimited field:
#!/bin/bash
while read -r f1 f2 f3 c1; do
    # split the comma-delimited field 'c1' into its constituents
    for c in ${c1//,/ }; do
        printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$c"
    done
done < input.txt
Output:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
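The ${c1//,/ } parameter expansion replaces every comma in c1 with a space, which is what lets the for loop walk the individual values:
$ c1=A4,A5,A6
$ echo ${c1//,/ }
A4 A5 A6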
As a solution without calling any external program:
#!/bin/bash
data_file="d"
while IFS=" " read -r f1 f2 f3 r; do
    IFS="," read -r f4 f5 f6 <<<"$r"
    printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$f4"
    printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$f5"
    printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$f6"
done <"$data_file"
The great Miller has a nest verb to do exactly this. With
mlr --nidx --ifs "\t" nest --explode --values --across-records -f 4 --nested-fs "," input.tsv
you will have
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
If you don't need the output to be in any particular order within a group of the fourth column, the following awk one-liner might do:
awk '{split($4,a,","); for(i in a) print $1,$2,$3,a[i]}' input.txt
This works by splitting your 4th column into an array, then for each element of the array, printing the "new" four columns.
If order is important -- that is, A4 must come before A5, etc, then you can use a classic for loop:
awk '{split($4,a,","); for(i=1;i<=length(a);i++) print $1,$2,$3,a[i]}' input.txt
But that's awk. And you're asking about bash.
The following might work:
#!/usr/bin/env bash
mapfile -t arr < input.txt
for s in "${arr[#]}"; do
t=($s)
mapfile -t -d, u <<<"${t[3]}"
for v in "${u[#]}"; do
printf '%s %s %s %s\n' "${t[#]:0:3}" "${v%$'\n'}"
done
done
This copies your entire input file into the elements of an array, and then steps through that array, mapping each 4th-column into a second array. It then steps through that second array, printing the first three columns from the first array, along with the current field from the second array.
It's obviously similar in structure to the awk alternative, but much more cumbersome to read and code.
Note the ${v%$'\n'} on the printf line. This strips off the last field's trailing newline, which doesn't get stripped by mapfile because we're using an alternate delimiter.
Note also that there's no reason you have to copy all your input into an array; I just did it that way to demonstrate a little more of mapfile. You could of course use the old standard,
while read s; do
    ...
done < input.txt
if you prefer.

Using awk to shuffle text files

I am hoping to write a short bash script that uses awk to interleave the contents of three files line by line, adding an extra line with some text after every third line. For example:
File 1:
one.0
one.1
one.2
one.3
File 2:
two.0
two.1
two.2
two.3
File 3:
three.0
three.1
three.2
three.3
Desired Results:
one.0
two.0
three.0
sometext
one.1
two.1
three.1
sometext
one.2
two.2
three.2
sometext
one.3
two.3
three.3
sometext
Thanks for your help!
$ cat a
a0
a1
a2
a3
$ cat b
b0
b1
b2
b3
$ cat c
c0
c1
c2
c3
$ paste -d '\n' a b c | awk '1; NR % 3 == 0 {print "some text"}'
a0
b0
c0
some text
a1
b1
c1
some text
a2
b2
c2
some text
a3
b3
c3
some text
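Here paste -d '\n' interleaves the files line by line, the bare awk pattern 1 prints every input line, and NR % 3 == 0 appends the extra line after each group of three. With a fourth file (a hypothetical f4) you would bump the modulus to match:
$ paste -d '\n' a b c f4 | awk '1; NR % 4 == 0 {print "some text"}'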
This awk code (GNU awk 4+, as it uses arrays of arrays) does what you want:
awk 'f != FILENAME { ++idx; f = FILENAME }   # new input file: advance the file index
     { a[idx][FNR] = $0 }                    # store each line by file and line number
     END {
         rows = length(a[1])
         for (r = 1; r <= rows; r++) {
             for (i = 1; i <= idx; i++)
                 print a[i][r]               # one line from each file, in turn
             print "sometext"
         }
     }' f1 f2 f3
One in awk using getline:
$ awk 'BEGIN {
    while ((getline line < ARGV[i+1]) > 0) {   # read a line from the file ARGV[i+1] points at
        i = (i + 1) % (ARGC - 1)               # round-robin over the input files
        print line (i ? "" : "\nsome text")    # print the line, plus "some text" after each full round
    }
}' f1 f2 f3
one.0
two.0
three.0
some text
one.1
two.1
three.1
some text
one.2
two.2
three.2
some text
one.3
two.3
three.3
some text

substitute several lines in a file with lines from another file

I have two files (A.txt and B.txt):
A.txt:
many lines of text
marker
A1 A2 A3
A4 A5 A6
...
... AN
many line of text
B.txt:
some text
B1 B2 B3
B4 B5 B6
...
... BN
I want to substitute the A1-AN block in A.txt with the B1-BN block from B.txt.
The requested result is:
many lines of text
marker
B1 B2 B3
B4 B5 B6
...
... BN
many line of text
I know how to find the A block:
grep -n "marker" A.txt |cut -f1 -d:
I know how to get the B block:
sed -n '2,+6p' B.txt
I might even be able to write a script with something like:
for ...
var=$(sed -n '2p' B.txt)
sed -i "34s/.*/$var/" A.txt
But I'm looking for something simple and elegant.
You can start by getting the filtered content of B.txt:
sed -n '/B1/,$p' B.txt
Then you can append it after marker in A.txt with the r command:
sed "/marker/r"<(sed -n '/B1/,$p' B.txt) A.txt
And then you can delete from A1 until AN in A.txt:
sed "/A1/,/AN/d" A.txt
Altogether:
sed -e "/marker/r"<(sed -n '/B1/,$p' B.txt) -e "/A1/,/AN/d" A.txt
Example
$ sed -e "/marker/r"<(sed -n '/B1/,$p' B.txt) -e "/A1/,/AN/d" A.txt
many lines of text
marker
B1 B2 B3
B4 B5 B6
...
... BN
many line of text
If I understand you correctly and you know the length of the block in advance, something like this springs to mind:
START=$(grep -n "marker" A.txt | cut -f 1 -d:)
# Note: one more here than you want to skip, since tail and head both
# take numbers to mean including the line specified. The sed line below
# generates seven lines of output, so 8 here.
END=$(($START + 8))
head -n $START A.txt
sed -n '2,+6p' B.txt
tail -n +$END A.txt
Otherwise, you could use another grep statement to find the last line of the replaced block and do it just the same way; see the sketch below.
EDIT: There was a (sort of) sign error in the original code. +$END instead of -$END is required.
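A sketch of that grep-based variant, assuming the block's last line can be matched with a pattern such as 'AN$':
START=$(grep -n "marker" A.txt | cut -f1 -d:)
END=$(grep -n "AN$" A.txt | cut -f1 -d:)    # line number of the block's last line
head -n "$START" A.txt
sed -n '2,$p' B.txt
tail -n +"$((END + 1))" A.txt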
This awk (GNU awk, as it uses ARGIND) should work for blocks of any size, as long as A1/B1 is the first field and AN/BN is the last:
awk '$1=="B1"{x=1}x{ss=ss?ss"\n"$0:$0}$NF=="BN"{x=0}
$1=="A1"{y=1}$NF=="AN"{y=0;$0=ss}ARGIND==2&&!y' B.txt A.txt
Output:
many lines of text
marker
B1 B2 B3
B4 B5 B6
...
... BN
many line of text
