Split a column into separate columns based on value - bash

I have a tab delimited file that looks as follows:
cat file.txt
gives:
1 299
1 150
1 50
1 57
2 -45
2 62
3 515
3 215
3 -315
3 -35
3 3
3 6789
3 34
5 66
5 1334
5 123
I'd like to use Unix commands to get a tab-delimited file in which, based on the values in column 1, each column of the output file holds all the corresponding values of column 2
(I'm using the separator "|" here instead of tab only to illustrate my desired output file):
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |
The corresponding headers (1, 2, 3, 5; based on the column-1 values) would be a nice addition (as shown below), but the main request is to split the information of the first file into separate columns. Thanks!
1 | 2 | 3 | 5
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |

Here's a one-liner that matches your output. It builds a string $ARGS containing as many process substitutions as there are unique values in the first column; $ARGS is then used as the argument list for the paste command:
HEADERS=$(cut -f 1 file.txt | sort -n | uniq); ARGS=""; for h in $HEADERS; do ARGS+=" <(grep ^"$h"$'\t' file.txt | cut -f 2)"; done; echo $HEADERS | tr ' ' '|'; eval "paste -d '|' $ARGS"
Output:
1|2|3|5
299|-45|515|66
150|62|215|1334
50||-315|
57||-35|
||3|
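Setting the dynamic $ARGS/eval machinery aside, the underlying paste-with-process-substitutions idea can be seen in a minimal, self-contained sketch with two fixed groups (sample.txt is an assumed throwaway file name):

```shell
# Build a small tab-separated sample; sample.txt is an assumed name.
printf '1\t299\n1\t150\n2\t-45\n2\t62\n' > sample.txt

# One process substitution per group; awk prints column 2 where column 1 matches.
result=$(paste -d '|' \
  <(awk -F'\t' '$1 == "1" { print $2 }' sample.txt) \
  <(awk -F'\t' '$1 == "2" { print $2 }' sample.txt))
printf '%s\n' "$result"   # 299|-45 then 150|62
rm sample.txt
```

The one-liner above generates one such `<(...)` per unique key, which is why eval is needed: bash must parse the substitutions after the string is built.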

You can use gnu-awk (this relies on GNU extensions: arrays of arrays and PROCINFO["sorted_in"]):
awk '
BEGIN{max=0;}
{
d[$1][length(d[$1])+1] = $2;
if(length(d[$1])>max)
max = length(d[$1]);
}
END{
PROCINFO["sorted_in"] = "#ind_num_asc";
line = "";
flag = 0;
for(j in d){
line = line (flag?"\t|\t":"") j;
flag = 1;
}
print line;
for(i=1; i<=max; ++i){
line = "";
flag = 0;
for(j in d){
line = line (flag?"\t|\t":"") d[j][i];
flag = 1;
}
print line;
}
}' file.txt
you get
1 | 2 | 3 | 5
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |
Or you can use Python; for example, in split2Columns.py:
import sys
import collections
from itertools import zip_longest  # izip_longest on Python 2

# read whitespace-separated key-value pairs and group values by key
records = [line.split() for line in open(sys.argv[1])]
records_dict = collections.defaultdict(list)
for key, val in records:
    records_dict[key].append(val)

# header row, then data rows padded with "" where a key has fewer values
print("\t|\t".join(records_dict.keys()))
print("\n".join("\t|\t".join(l) for l in zip_longest(*records_dict.values(), fillvalue="")))
python split2Columns.py file.txt
and you get the same result.

@Jose Ricardo Bustos M. - thanks for your answer! Unfortunately I couldn't install gnu-awk on my Mac, but based on your suggestion I've done something similar using plain awk:
HEADERS=$(cut -f 1 try.txt | awk '!x[$0]++');
H=( $HEADERS );
MAXUNIQNUM=$(cut -f 1 try.txt |uniq -c|awk '{print $1}'|sort -nr|head -1);
awk -v header="${H[*]}" -v max=$MAXUNIQNUM \
'BEGIN {
split(header,headerlist," ");
for (q = 1;q <= length(headerlist); q++)
{counter[q]=1;}
}
{for (z = 1; z <= length(headerlist); z++){
if (headerlist[z] == $1){
arr[counter[z],headerlist[z]] = $2;
counter[z]++
};
}
}
END {
for (x = 1; x <= max; x++){
for (y = 1; y<= length(headerlist); y++){
printf "%s\t",arr[x,headerlist[y]];
}
printf "\n"
}
}' try.txt

This uses an array to keep track of the column headings, uses them to name temporary files, and pastes everything together at the end:
#!/bin/bash
infile=$1
filenames=()
idx=0
while read -r key value; do
if [[ "${filenames[$idx]}" != "$key" ]]; then
(( ++idx ))
filenames[$idx]="$key"
echo -e "$key\n----" > "$key"
fi
echo "$value" >> "$key"
done < "$infile"
paste "${filenames[@]}"
rm "${filenames[@]}"
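This works because paste pads short columns with empty fields, so groups with fewer values still line up. A quick illustration (c1 and c2 are assumed throwaway file names):

```shell
# c1 has three lines, c2 only one; paste fills the gap with empty fields.
printf 'a\nb\nc\n' > c1
printf 'x\n' > c2
out=$(paste c1 c2)
printf '%s\n' "$out"   # "a<TAB>x", then "b" and "c" with an empty second field
rm c1 c2
```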

Related

bash looping and extracting of the fragment of txt file

I am dealing with the analysis of a large number of dlg text files located within the workdir. Each file has a table (usually located at a different position in the log) in the following format:
File 1:
CLUSTERING HISTOGRAM
____________________
________________________________________________________________________________
| | | | |
Clus | Lowest | Run | Mean | Num | Histogram
-ter | Binding | | Binding | in |
Rank | Energy | | Energy | Clus| 5 10 15 20 25 30 35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
1 | -5.78 | 11 | -5.78 | 1 |#
2 | -5.53 | 13 | -5.53 | 1 |#
3 | -5.47 | 17 | -5.44 | 2 |##
4 | -5.43 | 20 | -5.43 | 1 |#
5 | -5.26 | 19 | -5.26 | 1 |#
6 | -5.24 | 3 | -5.24 | 1 |#
7 | -5.19 | 4 | -5.19 | 1 |#
8 | -5.14 | 16 | -5.14 | 1 |#
9 | -5.11 | 9 | -5.11 | 1 |#
10 | -5.07 | 1 | -5.07 | 1 |#
11 | -5.05 | 14 | -5.05 | 1 |#
12 | -4.99 | 12 | -4.99 | 1 |#
13 | -4.95 | 8 | -4.95 | 1 |#
14 | -4.93 | 2 | -4.93 | 1 |#
15 | -4.90 | 10 | -4.90 | 1 |#
16 | -4.83 | 15 | -4.83 | 1 |#
17 | -4.82 | 6 | -4.82 | 1 |#
18 | -4.43 | 5 | -4.43 | 1 |#
19 | -4.26 | 7 | -4.26 | 1 |#
_____|___________|_____|___________|_____|______________________________________
The aim is to loop over all the dlg files and take from each table the single line corresponding to the widest cluster (the one with the largest number of # characters in the Histogram column). In the above table this is the third line.
3 | -5.47 | 17 | -5.44 | 2 |##
Then I need to add this line to final_log.txt together with the name of the log file (which should precede the line). So in the end I should have something in the following format (for 3 different log files):
"Name of the file 1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
#!/bin/bash
for f in ./*.dlg
do
file_name2=$(basename "$f")
file_name="${file_name2/.dlg}"
echo "Processing of $f..."
# take the name of the file and save it in the log
echo "$file_name" >> "$PWD"/final_results.log
# search for the beginning of the table inside each file and save it after the name
grep 'CLUSTERING HISTOGRAM' "$f" >> "$PWD"/final_results.log
# check whether it works
gedit "$PWD"/final_results.log
done
Here I need to substitute combination of echo and grep in order to take selected parts of the table.
You can use this one; it should be fast enough. Extra lines in your files, besides the tables, are not expected to be a problem.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines, which are then sorted in reverse order by the last field, putting the lines with the most # characters on top; finally awk removes the duplicates, keeping the first line per file. Note that when grep parses more than one file, it prints the filename at the beginning of each line by default (as with -H), so if you test it on a single file, use grep -H.
Result should be like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
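The `!seen[$1]++` part is the standard first-occurrence filter: the first time a key (here the `file:` prefix) appears, the array value is 0, so the negation is true and the line prints; every later occurrence is suppressed. Isolated:

```shell
# Keep only the first line per value of the first field.
out=$(printf 'f1.dlg: a\nf1.dlg: b\nf2.dlg: c\n' | awk '!seen[$1]++')
printf '%s\n' "$out"   # f1.dlg: a, then f2.dlg: c
```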
Here is a modification that picks the first appearance in case a file contains several equally wide maximum lines:
grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'
We replaced sort's reverse flag with the tac command, which reverses the stream, so now for any equal lines the initial order is preserved.
Second solution
Here using only awk:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
Update: if you execute it from a different directory and want to keep only the basename of every file, remove the path prefix:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {sub(".*/","",i); print i ":" row[i]}}'
Probably makes more sense as an Awk script.
This picks the first line with the widest histogram in the case of a tie within an input file.
#!/bin/bash
awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
FNR < 9 { next }
length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This assumes the histograms are always the tenth whitespace-separated field; if your input format is even messier than the sample you show, adapt to taste.
In some more detail, the first line triggers on the first line of each input file. If we have collected a previous selection (meaning this is not the first input file), print it and start over; either way, initialize for the new file by setting sel to nothing and max to zero.
The second line skips lines 1-8 which contain the header.
The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.
The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
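Stripped of the per-file bookkeeping, the third rule is the classic keep-the-best awk idiom: compare the current line against the running maximum, and remember the line whenever it wins. For example:

```shell
# Select the line whose second field (the histogram) is longest.
best=$(printf 'a #\nb ###\nc ##\n' \
  | awk 'length($2) > max { max = length($2); sel = $0 } END { print sel }')
printf '%s\n' "$best"   # b ###
```

Because the comparison is strict (`>`), the first of several equally long lines wins, which is what gives the tie-breaking behaviour described above.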
If you mean that we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we should probably have more information about what the surrounding lines look like. Maybe something like this, though:
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
!looking { next }
looking > 1 && $1 != looking { looking = 0; nextfile }
$1 == looking && length($10) > max { max = length($10); sel = FILENAME ":" $0 }
$1 == looking { ++looking }
END { if (sel) print sel }' ./"$prot"/*.dlg
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts up through the cluster ranks; on the first line where the rank no longer matches, the table is over and we move on to the next file.
I would suggest processing using awk:
for i in $FILES
do
echo -n "\"$i\": "
awk 'BEGIN {
output="";
outputlength=0
}
/(^ *[0-9]+)/ { # process only lines that start with a number
if (length(substr($10, 2)) > outputlength) { # if line has more hashes, store it
output=$0;
outputlength=length(substr($10, 2))
}
}
END {
print output # output the resulting line
}' "$i"
done

Sort multiple tables inside Markdown file with text interspersed between them

There is a Markdown file with headings, text, and unsorted tables. I want to programmatically sort each table by ID, which is the 3rd column, in descending order, preferably using PowerShell or Bash. The table would remain in its place in the file.
# Heading
Text
| Col A | Col B | ID |
|---------|---------|----|
| Item 1A | Item 1B | 8 |
| Item 2A | Item 2B | 9 |
| Item 3A | Item 3B | 6 |
# Heading
Text
| Col A | Col B | ID |
|---------|---------|----|
| Item 4A | Item 4B | 3 |
| Item 5A | Item 5B | 2 |
| Item 6A | Item 6B | 4 |
I have no control over how the Markdown file is generated. Truly.
Ideally the file would remain in Markdown after the sort for additional processing. However, I explored these options without success:
Convert to JSON and sort (the solutions I tried didn't agree with tables)
Convert to HTML and sort (only found JavaScript solutions)
This script alone, while helpful, would need to be modified to parse through the Markdown file (having trouble finding understandable guidance on how to run a script on content between two strings)
The reason for command line (and not JavaScript on the HTML, for example) is that this transformation will take place in an Azure Release Pipeline. It is possible to add an Azure Function to the pipeline, which would allow me to run JavaScript code in the cloud, and I will pursue that if all else fails. I want to exhaust command-line options first because I am not very familiar with JavaScript or how to pass content between Functions and releases.
Thank you for any ideas.
By modifying the referred script, how about this:
flush() {
    printf "%s\n" "${lines[@]:0:2}"
    printf "%s\n" "${lines[@]:2}" | sort -t \| -nr -k 4
    lines=()
}
while IFS= read -r line; do
    if [[ ${line:0:1} = "|" ]]; then
        lines+=("$line")
    else
        (( ${#lines[@]} > 0 )) && flush
        echo "$line"
    fi
done < input.md
(( ${#lines[@]} > 0 )) && flush
Output:
# Heading
Text
| Col A | Col B | ID |
|---------|---------|----|
| Item 2A | Item 2B | 9 |
| Item 1A | Item 1B | 8 |
| Item 3A | Item 3B | 6 |
# Heading
Text
| Col A | Col B | ID |
|---------|---------|----|
| Item 6A | Item 6B | 4 |
| Item 4A | Item 4B | 3 |
| Item 5A | Item 5B | 2 |
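The heavy lifting in flush() is the sort invocation: -t '|' splits on pipes, and since each table row starts with a pipe, the ID lands in the fourth field; -n -r sorts it numerically, descending. Standalone, on a hypothetical two-row table:

```shell
# Numeric reverse sort on the 4th |-separated field (the ID column).
sorted=$(printf '| Item 1A | Item 1B | 8 |\n| Item 2A | Item 2B | 9 |\n' \
  | sort -t '|' -nr -k 4)
printf '%s\n' "$sorted"   # the ID 9 row comes first
```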
BTW, if Perl is your option, here is an alternative:
perl -ne '
sub flush {
    print splice(@ary, 0, 2);   # print the two header lines
    # sort the table rows on the ID field (Schwartzian transform)
    print map  { $_->[0] }
          sort { $b->[1] <=> $a->[1] }
          map  { [$_, (split(/\s*\|\s*/))[3] ] }
          @ary;
    @ary = ();
}
# main loop
if (/^\|/) {            # table section
    push(@ary, $_);
} else {                # other section
    &flush if @ary;     # flush any pending table first
    print;
}
END {
    &flush if @ary;
}
' input.md
Hope this helps.
If it's possible to identify the markdown tables, a small awk (or bash/python/perl) filter can do it. This assumes each table has 2 header lines.
awk -F'|' '
function cmp_id(i1, v1, i2, v2) {
    return v2 - v1          # numeric, descending by ID
}
function show() {
    asorti(k, d, "cmp_id")
    # Print the first 2 original header rows, followed by the sorted data lines
    print s[1] ; print s[2]
    for (i=1 ; i<=n; i++) if (d[i] >= 3) print s[d[i]]
    n = 0 ; delete s ; delete k
}
# Capture tables
/^\|/ { s[++n] = $0 ; k[n] = $4 ; next }
n > 0 { show() }
{ print }
END { if (n > 0) show() }
' input.md

awk command to print multiple columns using for loop

I have a single file in which the 1st and 2nd columns contain an item code and name, and the 3rd to 12th columns contain its consumption quantities for 10 consecutive days.
Now I need to split that into 10 different files. In each one, the 1st and 2nd columns should hold the same item code and item name, and the 3rd column should hold the consumption quantity of one day.
input file:
Code | Name | Day1 | Day2 | Day3 |...
10001 | abcd | 5 | 1 | 9 |...
10002 | degg | 3 | 9 | 6 |...
10003 | gxyz | 4 | 8 | 7 |...
I need the Output in different file as
file 1:
Code | Name | Day1
10001 | abcd | 5
10002 | degg | 3
10003 | gxyz | 4
file 2:
Code | Name | Day2
10001 | abcd | 1
10002 | degg | 9
10003 | gxyz | 8
file 3:
Code | Name | Day3
10001 | abcd | 9
10002 | degg | 6
10003 | gxyz | 7
and so on....
I wrote a code like this
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$3}' FILE_NAME > file1;
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$4}' FILE_NAME > file2;
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$5}' FILE_NAME > file3;
and so on...
Now I need to write it within a 'for' or 'while' loop, which would be faster...
I don't know the exact code; maybe something like this:
for (( i=3; i<=NF; i++)) ; do awk 'BEGIN { FS = "\t" } ; {print $1,$2,$i}' input.tsv > $i.tsv; done
Kindly help me to get the output as I explained.
If you absolutely need to use a loop in Bash, then your loop can be fixed like this:
for ((i = 3; i <= 10; i++)); do awk -v field=$i 'BEGIN { FS = "\t" } { print $1, $2, $field }' input.tsv > file$i.tsv; done
But it would be really better to solve this using pure awk, without shell at all:
awk -v FS='\t' '
NR == 1 {
for (i = 3; i < NF; i++) {
fn = "file" (i - 2) ".txt";
print $1, $2, $i > fn;
print "" >> fn;
}
}
NR > 2 {
for (i = 3; i < NF; i++) {
fn = "file" (i - 2) ".txt";
print $1, $2, $i >> fn;
}
}' inputfile
That is, when you're on the first record, create the output files by writing the header line and a blank line (as specified in your question).
For the 3rd and later records, append to the files.
Note that the code in your question suggests that the fields in the file are separated by tabs, but the example files seem to use | padded with a variable number of spaces. It's not clear which one is your actual case. If it's really tab-separated, then the above code will work; if in fact it's like the example input, then change the first line to this:
awk -v OFS=' | ' -v FS='[ |]+' '
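With FS='[ |]+' every run of spaces and pipes acts as one separator, so the padded pipe layout parses cleanly, while OFS=' | ' writes the output back in roughly the same shape. A quick check of the splitting, on a hypothetical one-line input:

```shell
# Each run of spaces/pipes is a single field separator.
out=$(echo '10001 | abcd | 5 | 1 | 9' | awk -v FS='[ |]+' '{ print $1, $2, $3 }')
printf '%s\n' "$out"   # 10001 abcd 5
```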
bash + cut solution:
input.tsv test content:
Code | Name | Day1 | Day2 | Day3
10001 | abcd | 5 | 1 | 9
10002 | degg | 3 | 9 | 6
10003 | gxyz | 4 | 8 | 7
day_splitter.sh script:
#!/bin/bash
n=$(head -1 "$1" | awk -F'|' '{print NF}') # total number of fields
for ((i=3; i<=n; i++))
do
fn="Day$((i-2))" # file name containing the `Day` number
cut -d'|' -f1,2,"$i" "$1" > "$fn.txt"
done
Usage:
bash day_splitter.sh input.tsv
Results:
$cat Day1.txt
Code | Name | Day1
10001 | abcd | 5
10002 | degg | 3
10003 | gxyz | 4
$cat Day2.txt
Code | Name | Day2
10001 | abcd | 1
10002 | degg | 9
10003 | gxyz | 8
$cat Day3.txt
Code | Name | Day3
10001 | abcd | 9
10002 | degg | 6
10003 | gxyz | 7
In pure awk:
$ awk 'BEGIN{FS=OFS="|"}{for(i=3;i<=NF;i++) {f="file" (i-2); print $1,$2,$i >> f; close(f)}}' file
Explained:
$ awk '
BEGIN {
FS=OFS="|" } # set delimiters
{
for(i=3;i<=NF;i++) { # loop the consumption fields
f="file" (i-2) # create the filename
print $1,$2,$i >> f # append to target file
close(f) } # close the target file
}' file

Check a Value and Tag according to that value and append in same row using shell

I have a file as
NUMBER|05-1-2016|05-2-2016|05-3-2016|05-4-2016|
0000000 | 0 | 225.993 | 0 | 324|
0003450 | 89| 225.993 | 0 | 324|
0005350 | 454 | 225.993 | 54 | 324|
In the example there are four dates in the header.
For each NUMBER row, I want to check the value under each date and append a tag for it in the same row, using shell:
for example, if the value is between 0 and 100, tag it 'L'; if it is greater than 100, tag it 'H'.
So the output should be like
NUMBER|05-1-2016|05-2-2016|05-3-2016|05-4-2016|05-1-2016|05-2-2016|05-3-2016|05-4-2016|
0000000 | 0 | 225.993 | 0 | 324| L | H | L | H|
0003450 | 89| 225.993 | 0 | 324|L | H | L | H|
0005350 | 454 | 225.993 | 54 | 324|H | H | L | H|
A quick and dirty example that:
sets the input and output field separators (-F and OFS below) to |,
prints the header (the record with NR==1),
for all other records prints fields 1-5, followed by the result of the function lh for fields 2-5,
defines the function lh, which returns L for values < 100 and H for all others.
Code:
awk -F \| '
BEGIN {OFS="|"}
NR==1 {print}
NR > 1 {print $1, $2, $3, $4, $5, lh($2), lh($3), lh($4), lh($5) }
function lh(val) { return (val < 100) ? "L" : "H"}
' file.txt
Alternative function lh:
function lh(val) {
result = "";
if (val < 100) {
result = "L";
} else {
result = "H";
}
return result;
}
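Both versions of lh implement the same threshold; awk's ternary operator just makes it a one-liner. For instance:

```shell
# Tag each field L (< 100) or H (>= 100) using the ternary operator inline.
tags=$(echo '89 454' | awk '{ print ($1 < 100 ? "L" : "H"), ($2 < 100 ? "L" : "H") }')
printf '%s\n' "$tags"   # L H
```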

Sum of Columns for multiple variables

Using a shell script (Bash), I am trying to sum the columns for each of the different variables in a list. Suppose I have the following input in a Test.tsv file:
Win Lost
Anna 1 1
Charlotte 3 1
Lauren 5 5
Lauren 6 3
Charlotte 3 2
Charlotte 4 5
Charlotte 2 5
Anna 6 4
Charlotte 2 3
Lauren 3 6
Anna 1 2
Anna 6 2
Lauren 2 1
Lauren 5 5
Lauren 6 6
Charlotte 1 3
Anna 1 4
And I want to sum up how much each participant has won and lost. So I want to get this as a result:
Sum Win Sum Lost
Anna 57 58
Charlotte 56 57
Lauren 53 56
What I would usually do is take the sum per person and per column, and repeat that process over and over. See below how I would do it for this example:
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f2 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f3 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
However, I would need to repeat these lines for every participant, which becomes a pain when there are many variables to sum over.
What would be the way to write this as a script?
Thanks!
This is pretty straightforward with awk. Using GNU awk:
awk -F '\t' 'BEGIN { OFS = FS } NR > 1 { won[$1] += $2; lost[$1] += $3 } END { PROCINFO["sorted_in"] = "#ind_str_asc"; print "", "Sum Win", "Sum Lost"; for(p in won) print p, won[p], lost[p] }' filename
-F '\t' makes awk split lines at tabs, then:
BEGIN { OFS = FS } # the output should be separated the same way as the input
NR > 1 { # From the second line forward (skip header)
won[$1] += $2 # tally up totals
lost[$1] += $3
}
END { # When done, print the lot.
# GNU-specific: sorted traversal of player names
PROCINFO["sorted_in"] = "#ind_str_asc"
print "", "Sum Win", "Sum Lost"
for(p in won) print p, won[p], lost[p]
}
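The core of this is summing into associative arrays keyed by the first field, one array per column being totalled. Stripped to its essentials:

```shell
# Accumulate per-name totals for columns 2 and 3 in one pass.
total=$(printf 'Anna\t1\t1\nLauren\t5\t5\nAnna\t6\t4\n' \
  | awk -F '\t' '{ won[$1] += $2; lost[$1] += $3 } END { print won["Anna"], lost["Anna"] }')
printf '%s\n' "$total"   # 7 5
```

Each input line does constant work, so unlike the grep/cut/paste/bc chain, the file is read only once no matter how many participants there are.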
