find all values sequentially from given to target value in columns, bash

find all values sequentially from given to target value in columns, bash - bash

I have a data.txt file
1 2 3 4 5 6
cat data.txt
17 245 1323 17.7777 10.2222 61.1111
19 232 1232 19.9999 19.9999 68.8888
13 133 1233 13.3333 13.3333 63.3333
17 177 1678 17.7777 17.7777 69.9999
12 122 2325 12.2222 11.333 64.4444
18 245 1323 18.8888 12.4444 68.8888
12 222 1222 12.2222 19.9999 61.1111
14 245 1323 14.4444 13.5555 68.8888
I would like the find all the values sequentially from 12.2222 in column 4 to 18.8888.
Answer:
echo ${minValsCol4[#]}
12.2222 13.3333 14.4444 17.7777 18.8888
And the values sequentially from 63.3333 in column 6 to 68.8888.
Answer:
echo ${minValsCol6[#]}
63.3333 64.4444 68.8888
Any solution in awk?
Thanks.

Using awk and sort -nu:
awk -v col=4 -v start=12.2222 -v end=18.8888 '$col>=start && $col<=end{
print $col}' data.txt | sort -nu
12.2222
13.3333
14.4444
17.7777
18.8888

Related

Sort text file after the 16th column with p-values in bash

I have a tab delimited textfile with 18 column and more than 300000 rows. I have also a header line and I would sort the whole text file by the 16th column, which contains p-values. So I would like to sort it, having the lowest p-values above and also leaving the headline as it is.
I already have a code, it doesn't give me any error message, but it only shows the header line in the output file, and nothing else.
Here is my file:
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
Output should look like this:
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
Here is my code:
awk 'NR==1; NR > 1 {print $0 | "sort -g -rk 16,16"}' file.txt > file_out.txt

I'm guessing your sort doesn't have a -g option and so it's failing and not producing any output. Try this instead just using POSIX options:
$ awk 'NR==1; NR > 1 {print | "sort -nrk 16,16"}' file
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103

Would you please try the following:
cat <(head -n 1 file.txt) <(tail -n +2 file.txt | sort -nk16,16) > file_out.txt

Using GNU awk (for array sorting):
awk 'NR==1 { print;next } { map[$3][$16]=$0 } END { PROCINFO["sorted_in"]="#ind_num_asc";for(i in map) { for(j in map[i]) { print map[i][j] } } }' file
Explanation
awk 'NR==1 {
print;next # Header record, print and skip to the next line
}
{
map[$3][$16]=$0 # None header line - create a two dimensional array indexed by ID (assuming that it is unique in the file) and by 16th field with the line as the value
}
END { PROCINFO["sorted_in"]="#ind_num_asc"; # Set the array sorting to index number ascending
for(i in map) {
for(j in map[i]) {
print map[i][j] # Loop through the array printing the values
}
}
}' file

I suggest you to try next script:
#!/bin/bash
head -n 1 file.txt > file_out.txt
tail -n +2 file.txt | sort -k 16 >> file_out.txt
This definitely works, according to your output sample, when you convert the blanks into tabs, obviously.

awk to the rescue!
$ awk 'NR==1; NR>1{print | "sort -k16n"}' file | column -t
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571

Reshape table and complete voids with NA (or -999) using bash

I'm trying to create a table based on the ASCII bellow. What I need is to arrange the numbers from the 2nd column in a matrix. The first and third columns of the ASCII give columns and rows in the new matrix. The new matrix needs to be fully populated, so it is necessary to complete missing positions on the new table with NA (or -999).
This is what I have
$ cat infile.txt
1 68 2
1 182 3
1 797 4
2 4 1
2 70 2
2 339 3
2 1396 4
3 12 1
3 355 3
3 1854 4
4 7 1
4 85 2
4 333 3
5 9 1
5 68 2
5 182 3
5 922 4
6 10 1
6 70 2
and what I would like to have:
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
I can only use standard UNIX commands (e.g. awk, sed, grep, etc).
So What I have so far...
I can mimic a 2d array in bash
irows=(`awk '{print $1 }' infile.txt`) # rows positions
jcols=(`awk '{print $3 }' infile.txt`) # columns positions
values=(`awk '{print $2 }' infile.txt`) # values
declare -A matrix # the new matrix
nrows=(`sort -k3 -n in.txt | tail -1 | awk '{print $3}'`) # numbers of rows
ncols=(`sort -k1 -n in.txt | tail -1 | awk '{print $1}'`) # numbers of columns
nelem=(`echo "${#values[#]}"`) # number of elements I want to pass to the new matrix
# Creating a matrix (i,j) with -999
for ((i=0;i<=$((nrows-1));i++)) do
for ((j=0;j<=$((ncols-1));j++)) do
matrix[$i,$j]=-999
done
done
and even print on the screen
for ((i=0;i<=$((nrows-1));i++)) do
for ((j=0;j<=$((ncols-1));j++)) do
printf " %i" ${matrix[$i,$j]}
done
echo
done
But when I tried to assign the elements, something gets wrong
for ((i=0;i<=$((nelem-1));i++)) do
matrix[${irows[$i]},${jcols[$i]}]=${values[$i]}
done
Thanks in advance for any help with this, really.

A solution in plain bash by simulating a 2D array with an associative array could be something like that (Notice that row and column counts are not hard coded and the code works with any permutation of input lines provided that each line has the format specified in the question):
$ cat printmat
#!/bin/bash
declare -A mat
nrow=0
ncol=0
while read -r col elem row; do
mat[$row,$col]=$elem
if ((row > nrow)); then nrow=$row; fi
if ((col > ncol)); then ncol=$col; fi
done
for ((row = 1; row <= nrow; ++row)); do
for ((col = 1; col <= ncol; ++col)); do
elem=${mat[$row,$col]}
if [[ -z $elem ]]; then elem=NA; fi
if ((col == ncol)); then elem+=$'\n'; else elem+=$'\t'; fi
printf "%s" "$elem"
done
done
$ ./printmat < infile.txt prints out
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA

Any time you find yourself writing a loop in shell just to manipulate text you have the wrong approcah. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for many of the reasons why.
Using any awk in any shell on every UNIX box:
$ cat tst.awk
{
vals[$3,$1] = $2
numRows = ($3 > numRows ? $3 : numRows)
numCols = $1
}
END {
OFS = "\t"
for (rowNr=1; rowNr<=numRows; rowNr++) {
for (colNr=1; colNr<=numCols; colNr++) {
val = ((rowNr,colNr) in vals ? vals[rowNr,colNr] : "NA")
printf "%s%s", val, (colNr < numCols ? OFS : ORS)
}
}
}
.
$ awk -f tst.awk infile.txt
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA

here is one way to get you started. Note that this is not intended to be "the" answer but to encourage you to try to learn the toolkit.
$ join -a1 -e NA -o2.2 <(printf "%s\n" {1..4}"_"{1..6}) \
<(awk '{print $3"_"$1,$2}' file | sort -n) |
pr -6at
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
works, however, row and column counts are hard coded, which is not the proper way to do it.
Preferred solution will be filling up an awk 2D array with the data and print it in matrix form at the end.

How to replace list of numbers in column for random numbers in other column in BASH environment

I have a tab file with two columns like that
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 205 284 307 406
2 10 13 40 47 58 2 13 40 87
and the desired output should be
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 14 27
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 6 209 299 305
2 10 13 23 40 47 58 87 10 23 40 58
I would like to change the numbers in 2nd column for random numbers in 1st column resulting in an output in 2nd column with the same number of numbers. I mean e.g. if there are four numbers in 2nd column for x row, the output must have four random numbers from 1st column for this row, and so on...
I'm try to create two arrays by AWK and split and replace every number in 2nd column for numbers in 1st column but not in a randomly way. I have seen the rand() function but I don't know exactly how joint these two things in a script. Is it possible to do in BASH environment or are there other better ways to do it in BASH environment? Thanks in advance

awk to the rescue!
$ awk -F'\t' 'function shuf(a,n)
{for(i=1;i<n;i++)
{j=i+int(rand()*(n+1-i));
t=a[i]; a[i]=a[j]; a[j]=t}}
function join(a,n,x,s)
{for(i=1;i<=n;i++) {x=x s a[i]; s=" "}
return x}
BEGIN{srand()}
{an=split($1,a," ");
shuf(a,an);
bn=split($2,b," ");
delete m; delete c; j=0;
for(i=1;i<=bn;i++) m[b[i]];
# pull elements from a upto required sample size,
# not intersecting with the previous sample set
for(i=1;i<=an && j<bn;i++) if(!(a[i] in m)) c[++j]=a[i];
cn=asort(c);
print $1 FS join(c,cn)}' file
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 85 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 20 205 294 295
2 10 13 23 40 47 58 87 10 13 47 87
shuffle (standard algorithm) the input array, sample required number of elements, additional requirement is no intersection with the existing sample set. Helper structure map to keep existing sample set and used for in tests. The rest should be easy to read.

Assuming that there is a tab delimiting the two columns, and each column is a space delimited list:
awk 'BEGIN{srand()}
{n=split($1,a," ");
m=split($2,b," ");
printf "%s\t",$1;
for (i=1;i<=m;i++)
printf "%d%c", a[int(rand() * n) +1], (i == m) ? "\n" : " "
}' FS=\\t input

Try this:
# This can be an external file of course
# Note COL1 and COL2 seprated by hard TAB
cat <<EOF > d1.txt
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 205 284 307 406
2 10 13 40 47 58 2 13 40 87
EOF
# Loop to read each line, not econvert TAB to:, though could have used IFS
cat d1.txt | sed 's/ /:/' | while read LINE
do
# Get the 1st column data
COL1=$( echo ${LINE} | cut -d':' -f1 )
# Get col1 number of items
NUM_COL1=$( echo ${COL1} | wc -w )
# Get col2 number of items
NUM_COL2=$( echo ${LINE} | cut -d':' -f2 | wc -w )
# Now split col1 items into an array
read -r -a COL1_NUMS <<< "${COL1}"
COL2=" "
# THis loop runs once for each COL2 item
COUNT=0
while [ ${COUNT} -lt ${NUM_COL2} ]
do
# Generate a random number to use as teh random index for COL1
COL1_IDX=${RANDOM}
let "COL1_IDX %= ${NUM_COL1}"
NEW_NUM=${COL1_NUMS[${COL1_IDX}]}
# Check for duplicate
DUP_FOUND=$( echo "${COL2}" | grep ${NEW_NUM} )
if [ -z "${DUP_FOUND}" ]
then
# Not a duplicate, increment loop conter and do next one
let "COUNT = COUNT + 1 "
# Add the random COL1 item to COL2
COL2="${COL2} ${COL1_NUMS[${COL1_IDX}]}"
fi
done
# Sort COL2
COL2=$( echo ${COL2} | tr ' ' '\012' | sort -n | tr '\012' ' ' )
# Print
echo ${COL1} :: ${COL2}
done
Output:
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 :: 88 95
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 :: 20 299 304 305
2 10 13 40 47 58 :: 2 10 40 58

Remove rows that have a specific numeric value in a field

I have a very bulky file about 1M lines like this:
4001 168991 11191 74554 60123 37667 125750 28474
8 145 25 101 83 51 124 43
2985 136287 4424 62832 50788 26847 89132 19184
3 129 14 101 88 61 83 32 1 14 10 12 7 13 4
6136 158525 14054 100072 134506 78254 146543 41638
1 40 4 14 19 10 35 4
2981 112734 7708 54280 50701 33795 75774 19046
7762 339477 26805 148550 155464 119060 254938 59592
1 22 2 12 10 6 17 2
6 136 16 118 184 85 112 56 1 28 1 5 18 25 40 2
1 26 2 19 28 6 18 3
4071 122584 14031 69911 75930 52394 89733 30088
1 9 1 3 4 3 11 2 14 314 32 206 253 105 284 66
I want to remove rows that have a value less than 100 in the second column.
How to do this with sed?

I would use awk to do this. Example:
awk ' $2 >= 100 ' file.txt
this will only display every row from file.txt that has a column $2 greater than 100.

Use the following approach:
sed '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' -E /tmp/test.txt
(replace /tmp/test.txt with your current file path)
([0-9]{1,2}|[0][0-9]+) - will match either digits from 0 to 99 OR a digits with leading zero (ex. 012, 00982)
d - delete the pattern space;
-E(--regexp-extended) - Use extended regular expressions rather than basic regular expressions
To remove matched lines in place use -i option:
sed -i -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt

awk do math between lines

i'd like to put an empty line between lines which $1 differ > then 1.
thats the code sample:
104 9
110 8
111 5
116 6
117 7
130 11
131 16
132 15
133 10
134 6
146 8
147 8
148 8
the try was:
awk '{a=$1; b=$2; getline; c=$1; d=$2; if (c-a>1) print a"\t"b"\n"c"\t"d;else print "\n"}' file
but the result is mixed up:
110 8
111 5
116 6
117 7
130 11
131 16
132 15
133 10
147 8
148 8
what i'm missing?

This awk should work:
awk 'NR>1 && $1>p+1{print ""} {p=$1} 1' file
104 9
110 8
111 5
116 6
117 7
130 11
131 16
132 15
133 10
134 6
146 8
147 8
148 8

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

find all values sequentially from given to target value in columns, bash - bash

Using awk and sort -nu: awk -v col=4 -v start=12.2222 -v end=18.8888 '$col>=start && $col<=end{ print $col}' data.txt | sort -nu 12.2222 13.3333 14.4444 17.7777 18.8888

Related

Sort text file after the 16th column with p-values in bash

Reshape table and complete voids with NA (or -999) using bash

How to replace list of numbers in column for random numbers in other column in BASH environment

Remove rows that have a specific numeric value in a field

awk do math between lines

Categories

Resources