I have a tab file with two columns like that
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 205 284 307 406
2 10 13 40 47 58 2 13 40 87
and the desired output should be
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 14 27
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 6 209 299 305
2 10 13 23 40 47 58 87 10 23 40 58
I would like to change the numbers in 2nd column for random numbers in 1st column resulting in an output in 2nd column with the same number of numbers. I mean e.g. if there are four numbers in 2nd column for x row, the output must have four random numbers from 1st column for this row, and so on...
I'm try to create two arrays by AWK and split and replace every number in 2nd column for numbers in 1st column but not in a randomly way. I have seen the rand() function but I don't know exactly how joint these two things in a script. Is it possible to do in BASH environment or are there other better ways to do it in BASH environment? Thanks in advance
awk to the rescue!
$ awk -F'\t' 'function shuf(a,n)
{for(i=1;i<n;i++)
{j=i+int(rand()*(n+1-i));
t=a[i]; a[i]=a[j]; a[j]=t}}
function join(a,n,x,s)
{for(i=1;i<=n;i++) {x=x s a[i]; s=" "}
return x}
BEGIN{srand()}
{an=split($1,a," ");
shuf(a,an);
bn=split($2,b," ");
delete m; delete c; j=0;
for(i=1;i<=bn;i++) m[b[i]];
# pull elements from a upto required sample size,
# not intersecting with the previous sample set
for(i=1;i<=an && j<bn;i++) if(!(a[i] in m)) c[++j]=a[i];
cn=asort(c);
print $1 FS join(c,cn)}' file
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 85 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 20 205 294 295
2 10 13 23 40 47 58 87 10 13 47 87
shuffle (standard algorithm) the input array, sample required number of elements, additional requirement is no intersection with the existing sample set. Helper structure map to keep existing sample set and used for in tests. The rest should be easy to read.
Assuming that there is a tab delimiting the two columns, and each column is a space delimited list:
awk 'BEGIN{srand()}
{n=split($1,a," ");
m=split($2,b," ");
printf "%s\t",$1;
for (i=1;i<=m;i++)
printf "%d%c", a[int(rand() * n) +1], (i == m) ? "\n" : " "
}' FS=\\t input
Try this:
# This can be an external file of course
# Note COL1 and COL2 seprated by hard TAB
cat <<EOF > d1.txt
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 205 284 307 406
2 10 13 40 47 58 2 13 40 87
EOF
# Loop to read each line, not econvert TAB to:, though could have used IFS
cat d1.txt | sed 's/ /:/' | while read LINE
do
# Get the 1st column data
COL1=$( echo ${LINE} | cut -d':' -f1 )
# Get col1 number of items
NUM_COL1=$( echo ${COL1} | wc -w )
# Get col2 number of items
NUM_COL2=$( echo ${LINE} | cut -d':' -f2 | wc -w )
# Now split col1 items into an array
read -r -a COL1_NUMS <<< "${COL1}"
COL2=" "
# THis loop runs once for each COL2 item
COUNT=0
while [ ${COUNT} -lt ${NUM_COL2} ]
do
# Generate a random number to use as teh random index for COL1
COL1_IDX=${RANDOM}
let "COL1_IDX %= ${NUM_COL1}"
NEW_NUM=${COL1_NUMS[${COL1_IDX}]}
# Check for duplicate
DUP_FOUND=$( echo "${COL2}" | grep ${NEW_NUM} )
if [ -z "${DUP_FOUND}" ]
then
# Not a duplicate, increment loop conter and do next one
let "COUNT = COUNT + 1 "
# Add the random COL1 item to COL2
COL2="${COL2} ${COL1_NUMS[${COL1_IDX}]}"
fi
done
# Sort COL2
COL2=$( echo ${COL2} | tr ' ' '\012' | sort -n | tr '\012' ' ' )
# Print
echo ${COL1} :: ${COL2}
done
Output:
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 :: 88 95
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 :: 20 299 304 305
2 10 13 40 47 58 :: 2 10 40 58
How do I change the code if I have more than two columns. Let's say the data is like this
ifile.dat
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
How do I achieve this?
ofile.dat
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
I mean the max of each column by comparing first column. Thanks.
Here is what I have tried(on a sample file with 13columns). But the highest value is not coming up this way.
cat input.txt | sort -k1,1 -k2,2nr -k3,3nr -k4,4nr -k5,5nr -k6,6nr -k7,7nr -k8,8nr -k9,9nr -k10,10nr -nrk11,11 -nrk12,12 -nrk13,13 | sort -k1,1 -u
Instead of relying on sort, you could switch over to something more robust like awk:
awk 'BEGIN{PROCINFO["sorted_in"] = "#val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' input.txt
That's a mouthful. It breaks down like:
Before processing the file set the PROCINFO setting of sorted_in to #val_num_asc since we will be sorting the contents of an array by its index which will be numeric: (BEGIN{PROCINFO["sorted_in"] = "#val_num_asc"})
Loop through each field in the record for(i=2;i<=NF;++i)
Test to see if the current value is greater than the stored value in the array which has as its index the first field's value. If it is greater, then store it in place of whatever is currently held in the array for that index at that fields position: (if (a[$1][i]<$i){a[$1][i]=$i})
When done processing all of the records, sort the array into new array asorted: (END{n=asorti(a, asorted);)
Iterate through the array and print each element (for(col1 in asorted){print col1, a[col1][2], a[col1][3]})
There may be a more elegant way to do this in awk, but this will do the trick.
Example:
:~$ cat test
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
:~$ awk 'BEGIN{PROCINFO["sorted_in"] = "#val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' test
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
I've managed to extract data (from an html page) that goes into a table, and I've isolated the columns of said table into a text file that contains the lines below:
[30,30,32,35,34,43,52,68,88,97,105,107,107,105,101,93,88,80,69,55],
[28,6,6,50,58,56,64,87,99,110,116,119,120,117,114,113,103,82,6,47],
[-7,,,43,71,30,23,28,13,13,10,11,12,11,13,22,17,3,,-15,-20,,38,71],
[0,,,3,5,1.5,1,1.5,0.5,0.5,0,0.5,0.5,0.5,0.5,1,0.5,0,-0.5,-0.5,2.5]
Each bracketed list of numbers represents a column. What I'd like to do is turn these lists into actual columns that I can work with in different data formats. I'd also like to be sure to include that blank parts of these lists too (i.e., "[,,,]")
This is basically what I'm trying to accomplish:
30 28 -7 0
30 6
32 6
35 50 43 3
34 58 71 5
43 56 30 1.5
52 64 23 1
. . . .
. . . .
. . . .
I'm parsing data from a web page, and ultimately planning to make the process as automated as possible so I can easily work with the data after I output it to a nice format.
Anyone know how to do this, have any suggestions, or thoughts on scripting this?
Since you have your lists in python, just do it in python:
l=[["30", "30", "32"], ["28","6","6"], ["-7", "", ""], ["0", "", ""]]
for i in zip(*l):
print "\t".join(i)
produces
30 28 -7 0
30 6
32 6
awk based solution:
awk -F, '{gsub(/\[|\]/, ""); for (i=1; i<=NF; i++) a[i]=a[i] ? a[i] OFS $i: $i}
END {for (i=1; i<=NF; i++) print a[i]}' file
30 28 -7 0
30 6
32 6
35 50 43 3
34 58 71 5
43 56 30 1.5
52 64 23 1
..........
..........
Another solution, but it works only for file with 4 lines:
$ paste \
<(sed -n '1{s,\[,,g;s,\],,g;s|,|\n|g;p}' t) \
<(sed -n '2{s,\[,,g;s,\],,g;s|,|\n|g;p}' t) \
<(sed -n '3{s,\[,,g;s,\],,g;s|,|\n|g;p}' t) \
<(sed -n '4{s,\[,,g;s,\],,g;s|,|\n|g;p}' t)
30 28 -7 0
30 6
32 6
35 50 43 3
34 58 71 5
43 56 30 1.5
52 64 23 1
68 87 28 1.5
88 99 13 0.5
97 110 13 0.5
105 116 10 0
107 119 11 0.5
107 120 12 0.5
105 117 11 0.5
101 114 13 0.5
93 113 22 1
88 103 17 0.5
80 82 3 0
69 6 -0.5
55 47 -15 -0.5
-20 2.5
38
71
Updated: or another version with preprocessing:
$ sed 's|\[||;s|\][,]\?||' t >t2
$ paste \
<(sed -n '1{s|,|\n|g;p}' t2) \
<(sed -n '2{s|,|\n|g;p}' t2) \
<(sed -n '3{s|,|\n|g;p}' t2) \
<(sed -n '4{s|,|\n|g;p}' t2)
If a file named data contains the data given in the problem (exactly as defined above), then the following bash command line will produce the output requested:
$ sed -e 's/\[//' -e 's/\]//' -e 's/,/ /g' <data | rs -T
Example:
cat data
[30,30,32,35,34,43,52,68,88,97,105,107,107,105,101,93,88,80,69,55],
[28,6,6,50,58,56,64,87,99,110,116,119,120,117,114,113,103,82,6,47],
[-7,,,43,71,30,23,28,13,13,10,11,12,11,13,22,17,3,,-15,-20,,38,71],
[0,,,3,5,1.5,1,1.5,0.5,0.5,0,0.5,0.5,0.5,0.5,1,0.5,0,-0.5,-0.5,2.5]
$ sed -e 's/[//' -e 's/]//' -e 's/,/ /g' <data | rs -T
30 28 -7 0
30 6 43 3
32 6 71 5
35 50 30 1.5
34 58 23 1
43 56 28 1.5
52 64 13 0.5
68 87 13 0.5
88 99 10 0
97 110 11 0.5
105 116 12 0.5
107 119 11 0.5
107 120 13 0.5
105 117 22 1
101 114 17 0.5
93 113 3 0
88 103 -15 -0.5
80 82 -20 -0.5
69 6 38 2.5
55 47 71