Merge header columns in a matrix in bash

I want to merge the headers of the matrix:
12 12 12 13
bb 2
cc 1
aa 5
ee 6
like this:
12 13
bb 2
cc 1
aa 5
ee 6
I tried this, and it didn't work (and it isn't really applicable to a larger matrix anyway):
merged_headers=()
for i in {1..3}; do
    header=$(head -1 unmerge.txt | awk -v col=$i '{print $col}')
    if [ -z "$header" ]; then
        header=${merged_headers[-1]}
    else
        merged_headers+=($header)
    fi
    sed -i "s/^[ \t]*$/$header/g" unmerge.txt
done

Assumptions:
columns are consolidated in a left-to-right order
if the column headers are 13 12 14 12 13 14 then the new column headers will be (left-to-right) 13 12 14 (as opposed to a numeric or string ordering that would generate 12 13 14)
the consolidated data set is expected to have at most one non-empty value per unique column header; otherwise we'll append the values together into a single string (if multiple values are guaranteed to be numeric, the code could be modified to sum them instead)
One awk idea:
awk '
BEGIN { FS=OFS="\t"
        newcolno=1
      }
      { printf "%s", $1                                    # print 1st column
        if (NR==1) {                                       # if header record ...
           for (old=2; old<=NF; old++) {                   # loop through columns ...
               if (! ($old in newcol)) {                   # looking for new header and if found ...
                  printf "%s%s", OFS, $old                 # print to stdout and ...
                  newcol[$old]= ++newcolno                 # make note of the new column number to map to
               }
               old2new[old]= newcol[$old]                  # map current column number to new column number
           }
        }
        else {                                             # non-header rows
           delete row                                      # clear our new output array
           for (old=2; old<=NF; old++)                     # loop through current columns ...
               row[old2new[old]]=row[old2new[old]] $old    # append value to new row/column
           for (new=2; new<=newcolno; new++)               # loop through new row/columns and ...
               printf "%s%s", OFS, row[new]                # print to stdout
        }
        print ""                                           # terminate current line
      }
' unmerge.txt
This generates:
12 13
bb 2
cc 1
aa 5
ee 6
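Note that the script keys everything off FS=OFS="\t", so the blank cells in unmerge.txt have to be real tab-delimited fields (consecutive tabs for an empty column). If you're not sure how the file is actually delimited, something like the following (GNU cat) will show the tabs as ^I so you can check the layout:
cat -A unmerge.txt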
Testing a larger file to demonstrate some of our assumptions:
$ cat unmerge2.txt
12 12 12 13 12 13
bb 2
cc 1
aa 5
ee 6
ff 17 87 # distinct headers so no problems
gg 100 -3 # both have header "13" so we'll append the strings
The awk script generates:
12 13
bb 2
cc 1
aa 5
ee 6
ff 87 17
gg 100-3
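If the values that land under the same merged header are guaranteed to be numeric and you'd rather sum them than concatenate them (per the assumptions above), only the append line needs to change; a minimal sketch (the empty-field guard is added so genuinely blank cells don't get turned into 0):
for (old=2; old<=NF; old++)                     # loop through current columns ...
    if ($old != "") row[old2new[old]]+=$old     # sum value into new row/column instead of appending
With that change the gg row above would show 97 (100 + -3) under header 13 instead of 100-3.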
Once OP is satisfied with the results, and assuming OP still wants to update/overwrite the input file with the new results:
if using GNU awk you can add -i inplace to facilitate an inplace update of the input file: awk -i inplace 'BEGIN {FS=OFS="\t"; newcolno=1}...' unmerge.txt
otherwise OP can direct the output to a tmp file and then overwrite the source file with the tmp file: awk 'BEGIN {FS=OFS="\t"; newcolno=1}...' unmerge.txt > tmpfile; mv tmpfile unmerge.txt

Related

Compare names and numbers in two files and output more

For example, there are 2 files:
$ cat file1.txt
e 16
a 9
c 14
b 9
f 25
g 7
$ cat file2.txt
a 10
b 12
c 15
e 8
g 7
Comparing these two files with the command grep -xvFf "$dir2" "$dir1" | tee "$dir3" (directory dir1 contains file1, directory dir2 contains file2), we get the following output in dir3:
$ cat file3.txt
e 16
a 9
c 14
b 9
f 25
Now I need to compare the output in file3 against file2 and write to file3 only those lines where the number next to the letter has become greater; if the number is equal to or less than the value in file2, those lines should not go into file3. That is, the contents of file3 should be:
$ cat file3.txt
e 16
f 25
{m,g}awk 'FNR < NR ? __[$!_]<+$NF : (__[$!_]=+$NF)<_' f2.txt f1.txt
e 16
f 25
If you really want to clump it all into one shot:
mawk '(__[$!!(_=NF)]+= $_ * (NR==FNR)) < +$_' f2.txt f1.txt
One awk idea:
awk '
FNR==NR { a[$1]=$2; next }   # 1st file: save 2nd field in array a[], indexed by 1st field
($1 in a) && ($2 > a[$1])    # 2nd file: print current line if 1st field is an index in array a[] *AND* 2nd field is greater than the corresponding value from array a[]
!($1 in a)                   # 2nd file: print current line if 1st field is not an index in array a[]
' file2.txt file1.txt
This generates:
e 16
f 25

how to set field name as file name bash/awk

I have a file with 500 columns and I need to split each column into a new file while keeping $1 common to all the files. Below is a sample file, and I managed to do this using the bash/awk solution below:
ID F1 F2 F4 F4
aa 1 2 3 4
bb 1 2 3 4
cc 1 2 3 4
dd 1 2 3 4
num=('1' '2' '3' '4')
for i in "${num[@]}"; do awk -F "\t" -v col="$i" '{print $1,$col}' OFS="\t" Input.txt > "${i}.txt"; done
which gives the required output as:
1.txt
ID ID
aa aa
bb bb
cc cc
dd dd
2.txt
ID F1
aa 1
bb 1
cc 1
dd 1
....
However, I could not track which file corresponds to which column because the output file name is the field number rather than the field name. Would it be possible to use the header of the field as the prefix of the output file name?
ID.txt
ID ID
aa aa
bb bb
cc cc
dd dd
F1.txt
ID F1
aa 1
bb 1
cc 1
dd 1
You can do it all in one awk script. When processing the first line, put all the column headings in an array. Then when you process lines you write to the file names from that array in a loop.
awk -F'\t' 'NR == 1 { split($0, filenames) }
    { for (col = 1; col <= NF; col++) {
          file = filenames[col] ".txt";
          print $1, $col >> file;
          close(file)
      }
    }' Input.txt
If I understand your requirement correctly, it seems like you're very close. Try
num=('1' '2' '3' '4')
for i in "${num[@]}"; do
    echo "i=$i"
    awk -F "\t" -v col="$i" -v OFS="\t" '
        NR==1{fName=$(col+1)".out";next}
        {print $1,$(col+1) > fName}' data.txt
done
1>cat F1.out
aa 1
bb 1
cc 1
dd 1
. . . .
1>cat F4.out
aa 4
bb 4
cc 4
dd 4
Edit
If you need to keep the headers as shown in your example output, just remove the ;next.
Edit 2
If you have multiple columns with the same name, you can append the data to the same file by using >> fName instead. One word of warning with this technique: when you use > fName, this "restarts" the file each time you rerun your script, but when using >> you will be appending to each file on every run. That can cause problems for downstream processes ;-) ... so you'd need to add code that cleans up the output of your previous run.
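For example, adding something like this before the loop clears out the previous run (a sketch; it assumes all of the output files end in .out and that nothing else in the directory does):
rm -f ./*.out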
Here we're relying on the fact that awk can also write its output to a file, using > fName (where fName has been set to the value of field col+1, to skip over the first column's values).
And, if you were going to do this thousands of times a day, it would be worth optimizing further per the comments above, having awk read the file once and create all the outputs from internal loops. But if you only need to do this a couple of times, then your 'use the tools of unix/linux to decompose the task into manageable parts' approach is perfectly appropriate.
IHTH

Extract column after pattern from file

I have a sample file which looks like this:
5 6 7 8
55 66 77 88
A B C D
1 2 3 4
2 4 6 8
3 8 12 16
E F G H
11 22 33 44
and so on...
I would like to enter a command in a bash script or just in a bash terminal to extract one of the columns independently of the others. For instance, I would like to do something like a grep/awk command with the pattern=C and get the following output:
C
3
6
12
How can I extract a specific column independently of the others, and also specify a number of lines to extract after the pattern, so that I don't get the column with the 7's or the G column in my output?
If it's always 3 records after the found term:
awk '{for(i=1;i<=NF;i++) {if($i=="C") col=i}} col>0 && rcount<=3 {print $col; rcount++}' test
This will look at each field in your record and if it finds a "C", it will capture the column number i. If the column number is greater than 0 then it will print the contents of the column. It counts up to 3 records and then stops printing.
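If the number of records after the header varies, the pattern and the count can be passed in as variables instead of being hard-coded; a sketch along the same lines (tgt and n are names I've assumed, not from the original):
awk -v tgt="C" -v n=3 '{for(i=1;i<=NF;i++) {if($i==tgt) col=i}} col>0 && rcount<=n {print $col; rcount++}' test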
$ cat tst.awk
!prevNF { delete f; for (i=1; i<=NF; i++) f[$i] = i }
NF && (tgt in f) { print $(f[tgt]) }
{ prevNF = NF }
$ awk -v tgt=C -f tst.awk file
C
3
6
12
$ awk -v tgt=F -f tst.awk file
F
22

Creating a mapping count

I have this data with two columns
Id Users
123 2
123 1
234 5
234 6
34 3
I want to create this count mapping from the given data like this
123 3
234 11
34 3
How can I do it in bash?
You have to use associative arrays, something like
declare -A newmap
newmap["123"]=2
newmap["123"]=$(( ${newmap["123"]} + 1))
Obviously you have to iterate through your input: if the entry already exists, add to it; otherwise initialize it.
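A minimal sketch of that loop, assuming bash 4+ and the two-column layout shown above (the Id Users header line is skipped explicitly):
declare -A newmap
while read -r id users; do
    [[ $id == Id ]] && continue     # skip the header line
    (( newmap[$id] += users ))      # creates the entry on first sight, adds to it afterwards
done < file
for id in "${!newmap[@]}"; do
    echo "$id ${newmap[$id]}"
done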
It will be easier with awk.
Solution 1: Doesn't expect the file to be sorted. Stores entire file in memory
awk '{a[$1]+=$2}END{for(x in a) print x,a[x]}' file
34 3
234 11
123 3
What we are doing here is using the first column as key and adding second column as value. In the END block we iterate over our array and print the key=value pair.
If you have the Id Users line in your input file and want to exclude it from the output, then add NR>1 condition by saying:
awk 'NR>1{a[$1]+=$2}END{for(x in a) print x,a[x]}' file
NR>1 is telling awk to skip the first line. NR contains the line number so we instruct awk to start creating our array from second line onwards.
Solution 2: Expects the file to be sorted. Does not store the file in memory.
awk '$1!=prev && NR>1{print prev,sum; sum=0} {prev=$1; sum+=$2} END{print prev,sum}' file
123 3
234 11
34 3
If you have the Id Users line in your input file and want to exclude it from the output, then add NR>1 condition by saying:
awk '$1!=prev && NR>2{print prev, sum; sum=0} NR>1{prev = $1; sum+=$2} END{print prev, sum}' file
123 3
234 11
34 3
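Since this version only needs the rows grouped by Id, input that isn't already grouped can be sorted first; a sketch (tail strips the Id Users header so the simpler awk from above can be reused on the sorted rows):
tail -n +2 file | sort -k1,1 | awk '$1!=prev && NR>1{print prev,sum; sum=0} {prev=$1; sum+=$2} END{print prev,sum}'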
A Bash (4.0+) solution:
declare -Ai count
while read a b ; do
    count[$a]+=b
done < "$infile"
for idx in "${!count[@]}"; do
    echo "${idx} ${count[$idx]}"
done
For a sorted output the last line should read
done | sort -n

printing variable number lines to output

I would like a script to modify some large text files (100k records) so that, for every input record, the output contains a number of lines equal to the difference between columns 3 and 2. Each output line should hold the record name (column 1) and one step of a step-wise walk between the numbers in columns 2 and 3.
Sample trivial input could be (tab separated data, if it makes a difference)
a 3 5
b 10 14
with the desired output (again, ideally tab separated)
a 3 4
a 4 5
b 10 11
b 11 12
b 12 13
b 13 14
It's a challenge sadly beyond my (very) limited abilities.
Can anyone provide a solution to the problem, or point me in the right direction? In an ideal world I would be able to integrate this into a bash script, but I'll take anything that works!
Bash solution:
while read h f t ; do
    for ((i=f; i<t; i++)) ; do
        printf "%s\t%d\t%d\n" $h $i $((i+1))
    done
done < input.txt
Perl solution:
perl -lape '$_ = join "\n", map join("\t", $F[0], $_, $_ + 1), $F[1] .. $F[2] - 1' input.txt
awk -F '\t' -v OFS='\t' '
    $2 >= $3 {print; next}                        # nothing to expand: pass the line through unchanged
    {for (i=$2; i<$3; i++) print $1, i, i+1}      # otherwise emit one line per step from $2 up to $3
' filename
With awk:
awk '$3!=$2 { while (($3 - $2) > 1) { print $1,$2,$2+1 ; $2++} }1' inputfile
Fully POSIX, and no unneeded loop variables:
$ while read h f t; do
      while test $f -lt $t; do
          printf "%s\t%d\t%d\n" "$h" $f $((f += 1))
      done
done < input.txt
a 3 4
a 4 5
b 10 11
b 11 12
b 12 13
b 13 14
