Renumbering numbers in a text file based on an unique mapping - bash

I have a big txt file with 2 columns and more than 2 million rows. Every value represents an id and there may be duplicates. There are about 100k unique ids.
1342342345345 34523453452343
0209239498238 29349203492342
2349234023443 99203900992344
2349234023443 182834349348
2923000444 9902342349234
I want to identify each id and re-number all of them starting from 1. It should re-number duplicates also using the same new id. If possible, it should be done using bash.
The output could be something like:
123 485934
34 44834
167 34564
167 2345
2 34564

Doing this in pure bash will be really slow. I'd recommend:
tr -s '[:blank:]' '\n' <file |
sort -un |
awk '
NR == FNR {id[$1] = FNR; next}
{for (i=1; i<=NF; i++) {$i = id[$i]}; print}
' - file
4 8
3 7
5 9
5 2
1 6

With bash and sort:
#!/bin/bash
shopt -s lastpipe
declare -A hash # declare associative array
index=1
# read file and fill associative array
while read -r a b; do
echo "$a"
echo "$b"
done <file | sort -nu | while read -r x; do
hash[$x]="$((index++))"
done
# read file and print values from associative array
while read -r a b; do
echo "${hash[$a]} ${hash[$b]}"
done < file
Output:
4 8
3 7
5 9
5 2
1 6
See: man bash and man sort

Pure Bash, with a single read of the file:
declare -A hash
index=1
while read -r a b; do
[[ ${hash[$a]} ]] || hash[$a]=$((index++)) # assign index only if not set already
[[ ${hash[$b]} ]] || hash[$b]=$((index++)) # assign index only if not set already
printf '%s %s\n' "${hash[$a]}" "${hash[$b]}"
done < file > file.indexed
Notes:
the index is assigned in the order read (not based on sorting)
we make a single pass through the file (not two as in other solutions)
Bash's read is slower than awk; however, if the same logic is implemented in Perl or Python, it will be much faster
this solution is more CPU bound because of the hash lookups
Output:
1 2
3 4
5 6
5 7
8 9

Just keep a monotonic counter and a table of seen numbers; when you see a new id, give it the value of the counter and increment:
awk '!a[$1]{a[$1]=++N} {$1=a[$1]} !a[$2]{a[$2]=++N} {$2=a[$2]} 1' input

awk 'NR==FNR { ids[$1] = ++c; next }
{ print ids[$1], ids[$2] }
' <( { cut -d' ' -f1 renum.in; cut -d' ' -f2 renum.in; } | sort -nu ) renum.in
join the two columns into one then sort the that into numerical order (-n), and make unique (-u), before using awk to use this sequence to generate an array of mappings between old to new ids.
Then for each line in input, swap ids and print.

Related

Reading a log file into 2D/1D array in bash script

I have a file log.txt as below
1 0.694003 5.326995 7.500997 6.263974 0.633941 36.556128
2 2.221990 4.422010 4.652992 5.964420 0.660997 51.874905
3 4.376005 7.440002 6.260000 6.238917 0.728308 10.927455
4 1.914000 5.451991 0.668012 6.355688 0.634081 106.733134
5 2.530005 0.000000 8.084005 3.916278 0.687023 2252.538670
6 1.997993 1.406001 7.977006 3.923551 0.517551 37.611894
7 0.971998 1.823007 8.804005 4.110159 0.567905 905.995133
8 0.480005 3.109009 8.711002 4.060954 0.508963 553.712280
9 1.015001 3.996992 7.781004 3.547329 0.396635 16.883011
I want to read 6th column of this file into an array myArray so that it will give below:
echo ${myArray[9]} = 0.396635
Thank you.
Here's a way to do this (bash 4+), assuming log.txt's first column starts at 1 and doesn't skip any numbers.
readarray -t myArray < <(tr -s ' ' < log.txt | cut -d' ' -f6)
echo ${myArray[8]}
tr -s ' ' collapses the whitespace, for easier manipulation
cut -d' ' -f6 selects the 6th space separated column
<(...) turns the subcommand into a temporary file
readarray reads lines from the file into the variable myArray
Note that the array is 8 indexed, so I've selected [8] instead of [9].
Assumption:
first column of file is an integer
first column of file may not be sequential
OP wants (needs?) the array index to match the value in the first column
Sample data file:
$ cat log.txt
3 4.376005 7.440002 6.260000 6.238917 0.728308 10.927455
5 2.530005 0.000000 8.084005 3.916278 0.687023 2252.538670
7 0.971998 1.823007 8.804005 4.110159 0.567905 905.995133
9 1.015001 3.996992 7.781004 3.547329 0.396635 16.883011
23 0.480005 3.109009 8.711002 4.060954 0.508963 553.712280
One idea using awk (to parse the input file):
$ awk '{print $1,$6}' log.txt
3 0.728308
5 0.687023
7 0.567905
9 0.396635
23 0.508963
We can then feed this into a while loop to build the array:
unset myArray
while read -r ndx value
do
myArray["${ndx}"]="${value}"
done < <(awk '{print $1,$6}' log.txt)
Verify contents of array:
$ typeset -p myArray
declare -a myArray=([3]="0.728308" [5]="0.687023" [7]="0.567905" [9]="0.396635" [23]="0.508963")
$ for ndx in "${!myArray[#]}"
do
echo "index = ${ndx} ; value = ${myArray[${ndx}]}"
done
index = 3 ; value = 0.728308
index = 5 ; value = 0.687023
index = 7 ; value = 0.567905
index = 9 ; value = 0.396635
index = 23 ; value = 0.508963
Another approach using just bash4+ builtins. (if acceptable)
#!/usr/bin/env bash
mapfile -t rows < log.txt
read -ra column <<< "${rows[8]}"
echo "${column[5]}"

while loops in parallel with input from splited file

I am stuck on that. So I have this while-read loop within my code that is taking so long and I would like to run it in many processors. But, I'd like to split the input file and run 14 loops (because I have 14 threads), one for each splited file, in parallel. Thing is that I don't know how to tell the while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file in 14 files and run 14 while loops in parallel, one for each splited file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But neither I was able to tell which file would be the input to the loop so parallel would understand "take each of those xa* files and place into individual loops in parallel"
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequences alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence acessions, reads it line-by-line and look for each correspondent full taxonomy for those acessions in a databank (~45000 lines). Then, it does a few sums/divisions. Output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreads to thousands. Only columns 1 and 2 matters here. Column1 is the name of the sequence and col2 is the name of all sequences it matched in the databnk. It looks like that:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies correspondent to the DNA sequences. Each taxonomy is in one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output whis has all the accessions to sequences that it matched. What the loop does is that it finds the taxonomies for all those sequences and calculates the "statistics" I mentionded previously. Code does it for all the DNA-sequences-accesions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing so it takes about ~12h running in one processor for doing it to all the 45000 DNA sequence accessions. The, I would like to split input_file and do it in all the processors I have (14) because it would the time spend in that.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said I think your real problem is spawning too many programs: If you rewrite doit in Perl or Python I will expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for You, I am not familiar with parallel instead using native bash spawning processes &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait

How to find the max value in array without using sort cmd in shell script

I have a array=(4,2,8,9,1,0) and I don't want to sort the array to find the highest number in the array because I need to get the index value of the highest number as it is, so I can use it for further reference.
Expected output:
9 index value => 3
Can somebody help me to achieve this?
Slight variation with a loop using the ternary conditional operator and no assumptions about range of values:
arr=(4 2 8 9 1 0)
max=${arr[0]}
maxIdx=0
for ((i = 1; i < ${#arr[#]}; ++i)); do
maxIdx=$((arr[i] > max ? i : maxIdx))
max=$((arr[i] > max ? arr[i] : max))
done
printf '%s index => values %s\n' "$maxIdx" "$max"
The only assumption is that array indices are contiguous. If they aren't, it becomes a little more complex:
arr=([1]=4 [3]=2 [5]=8 [7]=9 [9]=1 [11]=0)
indices=("${!arr[#]}")
maxIdx=${indices[0]}
max=${arr[maxIdx]}
for i in "${indices[#]:1}"; do
((arr[i] <= max)) && continue
maxIdx=$i
max=${arr[i]}
done
printf '%s index => values %s\n' "$maxIdx" "$max"
This first gets the indices into a separate array and sets the initial maximum to the value corresponding to the first index; then, it iterates over the indices, skipping the first one (the :1 notation), checks if the current element is a new maximum, and if it is, stores the index and the maximum.
Without using sort, you can use a simple loop in shell. Here is a sample bash code:
#!/usr/bin/env bash
array=(4 2 8 9 1 0)
for i in "${!array[#]}"; do
[[ -z $max ]] || (( ${array[i]} > $max )) && { max="${array[i]}"; maxind=$i; }
done
echo "max=$max, maxind=$maxind"
max=9, maxind=3
arr=(4 2 8 9 1 0)
paste <(printf "%s\n" "${arr[#]}") <(seq 0 $((${#arr[#]} - 1)) ) |
sort -k1,1 |
tail -n1 |
sed 's/\t/ index value => /'
Print each array element on a newline with printf
Print array indexes with seq
Join both streams using paste
Numerically sort the lines using the first fields (ie. array value) sort
Print the last line tail -n1
The array value and result is separated by a tab. Substitute tab with the output string you want using sed. One could use ex. cut -d, -f2 to get only the index or use read a b <( ... ) to read the numbers into variables, etc.
Using Perl
$ export data=4,2,8,9,1,0
$ echo $data | perl -ne ' map{$i++; if($_>$x) {$x=$_;$id=$i} } split(","); print "max=$x", " index=",--${id},"\n" '
max=9 index=3
$

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
And what I'm hoping to achieve is to sort the numbers in the first column in numerical order such that I can pull all the similar lines (eg. lines that start with the same number) into new text files "cluster${i}.txt". From there I want to sort the fourth column of ("cluster${i}.txt") files in numerical order. After sorting I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample output of "cluster1.txt" would like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is a somewhat less awful.
for (( i=0; i<=20; i++ )); do
awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.

Sorting and printing a file in bash UNIX

I have a file with a bunch of paths that look like so:
7 /usr/file1564
7 /usr/file2212
6 /usr/file3542
I am trying to use sort to pull out and print the path(s) with the most occurrences. Here it what I have so far:
cat temp| sort | uniq -c | sort -rk1 > temp
I am unsure how to only print the highest occurrences. I also want my output to be printed like this:
7 1564
7 2212
7 being the total number of occurrences and the other numbers being the file numbers at the end of the name. I am rather new to bash scripting so any help would be greatly appreciated!
To emit only the first line of output (with the highest number, since you're doing a reverse numeric sort immediately prior), pipe through head -n1.
To remove all content which is not either a number or whitespace, pipe through tr -cd '0-9[:space:]'.
To filter for only the values with the highest number, allowing there to be more than one:
{
read firstnum name && printf '%s\t%s\n' "$firstnum" "$name"
while read -r num name; do
[[ $num = $firstnum ]] || break
printf '%s\t%s\n' "$num" "$name"
done
} < temp
If you want to avoid sort and you are allowed to use awk, then you can do this:
awk '{
if($1>maxcnt) {s=$1" "substr($2,10,4); maxcnt=$1} else
if($1==maxcnt) {s=s "\n"$1" "substr($2,10,4)}} END{print s}' \
temp

Resources