Listing skipped numbers in a large txt file using bash - bash

I need to find a way to display the missing numbers from a large txt file. It's a web graph that has 875,713 vertices. However, when I sort the file the largest number that is displayed at the end is 916,427. So there are some numbers not being used for vertex index. Is there a bash command I could use to do this?
I found this after searching around some other threads but I'm not entirely sure if its correct:
awk 'NR != $1 { for (i = prev + 1; i < $1; i++) {print i} } { prev = $1 + 1 }' file

Assuming the 'number' of each vertex is in the first column, you can use:
awk '{a[$1]} END{for(i = 1; i <= 916427; i++){if(!(i in a)){print i}}}' file
E.g.
# create some example data and remove "10"
seq 916427 | sed '10d' > test.txt
head test.txt
1
2
3
4
5
6
7
8
9
11
awk '{a[$1]} END { for (i = 1; i <= 916427; i++) { if (!(i in a)) {print i}}}' test.txt
10

If you don't want to store the array in memory (otherwise #jared_mamrot solution would work), you can use
awk 'NR==1 {p=$1; next} {for (i=p+1; i<$1; i++) {print i}; p=$1}' < <( sort -n file)
which sorts the file first.

Just because you tagged your question bash, I'll provide a bash solution. :)
# sample data as jared suggested, with 10 removed...
seq 916427 | sed '10d' > test.txt
# read sample data into an array...
mapfile -t a < test.txt
# reverse the $a array into $b
for i in "${a[#]}"; do b[$i]=1; done
# step through list of possible numbers, testing if each one is an index of $b
for ((i=1; i<${a[((${#a[#]}-1))]}; i++)); do [[ -z ${b[i]} ]] && echo $i; done
The line noise in the last line (${a[((${#a[#]}-1))]}) simply means "the value of the last array element", and the -1 is there because without instructions otherwise, mapfile starts numbering things at zero.
This takes a little longer to run than awk, because awk is awesome. But it runs in bash without calling any external tools. Aside from the ones generating our sample data, of course!
Note that the last line verifies $b array membership with a string comparison. You might get a very slight performance increase by doing a math comparison instead ((( ${b[i]} )) || echo $i) but the improvement would be so small that it's not even worth mentioning. Oh dang.
Note also that both this and the awk solution involve creating very large arrays in memory, then stepping through those arrays. Be careful of your memory, and don't waste array space with unnecessary data. You will probably want to pull just your indices out of your original dataset for this comparison, rather than loading everything into a bash or awk array.

Related

How to find the max value in array without using sort cmd in shell script

I have a array=(4,2,8,9,1,0) and I don't want to sort the array to find the highest number in the array because I need to get the index value of the highest number as it is, so I can use it for further reference.
Expected output:
9 index value => 3
Can somebody help me to achieve this?
Slight variation with a loop using the ternary conditional operator and no assumptions about range of values:
arr=(4 2 8 9 1 0)
max=${arr[0]}
maxIdx=0
for ((i = 1; i < ${#arr[#]}; ++i)); do
maxIdx=$((arr[i] > max ? i : maxIdx))
max=$((arr[i] > max ? arr[i] : max))
done
printf '%s index => values %s\n' "$maxIdx" "$max"
The only assumption is that array indices are contiguous. If they aren't, it becomes a little more complex:
arr=([1]=4 [3]=2 [5]=8 [7]=9 [9]=1 [11]=0)
indices=("${!arr[#]}")
maxIdx=${indices[0]}
max=${arr[maxIdx]}
for i in "${indices[#]:1}"; do
((arr[i] <= max)) && continue
maxIdx=$i
max=${arr[i]}
done
printf '%s index => values %s\n' "$maxIdx" "$max"
This first gets the indices into a separate array and sets the initial maximum to the value corresponding to the first index; then, it iterates over the indices, skipping the first one (the :1 notation), checks if the current element is a new maximum, and if it is, stores the index and the maximum.
Without using sort, you can use a simple loop in shell. Here is a sample bash code:
#!/usr/bin/env bash
array=(4 2 8 9 1 0)
for i in "${!array[#]}"; do
[[ -z $max ]] || (( ${array[i]} > $max )) && { max="${array[i]}"; maxind=$i; }
done
echo "max=$max, maxind=$maxind"
max=9, maxind=3
arr=(4 2 8 9 1 0)
paste <(printf "%s\n" "${arr[#]}") <(seq 0 $((${#arr[#]} - 1)) ) |
sort -k1,1 |
tail -n1 |
sed 's/\t/ index value => /'
Print each array element on a newline with printf
Print array indexes with seq
Join both streams using paste
Numerically sort the lines using the first fields (ie. array value) sort
Print the last line tail -n1
The array value and result is separated by a tab. Substitute tab with the output string you want using sed. One could use ex. cut -d, -f2 to get only the index or use read a b <( ... ) to read the numbers into variables, etc.
Using Perl
$ export data=4,2,8,9,1,0
$ echo $data | perl -ne ' map{$i++; if($_>$x) {$x=$_;$id=$i} } split(","); print "max=$x", " index=",--${id},"\n" '
max=9 index=3
$

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
And what I'm hoping to achieve is to sort the numbers in the first column in numerical order such that I can pull all the similar lines (eg. lines that start with the same number) into new text files "cluster${i}.txt". From there I want to sort the fourth column of ("cluster${i}.txt") files in numerical order. After sorting I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample output of "cluster1.txt" would like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is a somewhat less awful.
for (( i=0; i<=20; i++ )); do
awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.

Sorting and printing a file in bash UNIX

I have a file with a bunch of paths that look like so:
7 /usr/file1564
7 /usr/file2212
6 /usr/file3542
I am trying to use sort to pull out and print the path(s) with the most occurrences. Here it what I have so far:
cat temp| sort | uniq -c | sort -rk1 > temp
I am unsure how to only print the highest occurrences. I also want my output to be printed like this:
7 1564
7 2212
7 being the total number of occurrences and the other numbers being the file numbers at the end of the name. I am rather new to bash scripting so any help would be greatly appreciated!
To emit only the first line of output (with the highest number, since you're doing a reverse numeric sort immediately prior), pipe through head -n1.
To remove all content which is not either a number or whitespace, pipe through tr -cd '0-9[:space:]'.
To filter for only the values with the highest number, allowing there to be more than one:
{
read firstnum name && printf '%s\t%s\n' "$firstnum" "$name"
while read -r num name; do
[[ $num = $firstnum ]] || break
printf '%s\t%s\n' "$num" "$name"
done
} < temp
If you want to avoid sort and you are allowed to use awk, then you can do this:
awk '{
if($1>maxcnt) {s=$1" "substr($2,10,4); maxcnt=$1} else
if($1==maxcnt) {s=s "\n"$1" "substr($2,10,4)}} END{print s}' \
temp

Array arithmetic bash shell

I have numerous(nTotal) number of files each with one column of length L of float numbers, I want to add entry in line i_th of all these file and at the end. Compute its average and standard deviation.
I first read each file. Then I try to add this array to a an array, which gives me a syntax error: (standard_in) 2: syntax error. I expect that suma[i] contains sum of all the entries on line i_th of all the files now. Then I find the average
Edit I changed for loops as suggested.
for (( n= 1 ; n < $nTotal; n++ ))
do
IFS=$'\n'
arr1=($(./a.out filename | sed 's/:.*//'))
for (( i= 1 ; i < $L; i++ ))
do
sum[i]=`echo "${sum[i]} - ${arr1[i]}" | bc`
done
done
for (( i= 1 ; i < $L; i++ ))
do
ya=$(echo -1*${sum[i]} | bc)
aveSum=$(echo $ya/$nTotal | bc -l)
done
Edit: ./a.out produces files with one column of float numbers.
To find standard deviation though, I again read data files and store them in arrays (I'm sure this is not the smartest way of doing it but I couldn't think of anything else.). I also could not find the standard deviation using:
for (( i= 1 ; i < $L; i++ ))
do
ya=$(echo -1*${sum[i]} | bc)
ta=$(echo $ya/$nTotal | bc -l)
tempval=`echo "${arr1[i]} - $ta * ${arr1[i]} - $ta" | bc`
val[i]=`echo "${val[i]} - $tempval" | bc`
done
Here I get zero for val[i] elements, I can't figure what is wrong. I would really appreciate it if you can guide me for this problem.
Bash might not be exactly the easiest for this problem, particularly since it doesn't implement non-integer arithmetic. I'd use awk:
awk '{ n[FNR]++;
delta = $1 - mean[FNR];
mean[FNR] += delta / n[FNR];
m2[FNR] += delta * ($1 - mean[FNR]);
}
END {for (i=1; i in n; ++i)
print mean[i], sqrt(m2[i]/(n[i]-1));
}' file1 file2 ...
The math is taken directly from the well-known "online" mean and variance algorithms. The program assumes that all files have exactly L lines, but if a few have more or less, the missing data will just be ignored; you might want to do a better validity test. In the particular case that only one file has too many lines, the standard deviation computation will trap a divide-by-zero; in one reading, that doesn't matter since the correct data will already have been printed, but you might want to fix that, too.
The program makes use of a couple of awk features: first, arrays are automatically (and lazily) initialized to 0 (if used as numbers); second, FNR is the line number in the current file. (NR is the line number in the input as a whole, but in this case FNR is more useful.)

Inserting if loop within awk

I had a problem solved in a previous post using the awk, but now I want to put an if loop in it, but I am getting an error.
Here's the problem:
I had a lot of files that looked like this:
Header
175566717.000
175570730.000
175590376.000
175591966.000
175608932.000
175612924.000
175614836.000
.
.
.
175680016.000
175689679.000
175695803.000
175696330.000
And I wanted to extract the first 2000 lines (line 1 to 2000), then extract the lines 1500 to 3500, then 3000 to 5000 and so on... What I mean is: extract a window of 2000 lines with an overlap of 500 lines between contiguous windows until the end of the file.
This is the awk command used for it:
awk -v i=1 -v t=2000 -v d=501 'NR>1{a[NR-1]=$0}END{
while(i<NR-1){
++n;
for(k=i;k<i+t;k++)print a[k] > "win"n".txt";
close("_win"n".txt")
i=i+t-d
}
}' myfile.txt
done
And I get several files with names win1.txt , win2.txt , win3.txt , etc...
My problem now is that because the file was not a multiple of 2000, my last window has less than 2000 lines. How can I put an if loop that would do this: if the last window had less than 2000 digital numbers, the previous window should had all the lines until the end of the file.
EXTRA INFO
When the windows are created, there is a line break at the end.That is why I needed the if loop to take into account a window of less than 2000 digital numbers, and not just lines.
If you don't have to use awk for some other reason, try the sed approach
#!/bin/bash
file="$(sed '/^\s*$/d' myfile.txt)"
sed -n 1,2000p <<< "$file"
first=1500
last=3500
max=$(wc -l <<< "$file" | awk '{print $1}')
while [[ $max -ge 2000 && $last -lt $((max+1500)) ]]; do
sed -n "$first","$last"p <<< "$file"
((first+=1500))
((last+=1500))
done
Obviously this is going to be less fast than awk and more error prone for gigatic files, but should work in most cases.
Change the while condition to make it stop earlier:
while (i+t <= NR) {
Change the end condition of the for loop to compensate for the last output file being potentially bigger:
for (k = i; k < (i+t+t-d <= NR ? i+t : NR); k++)
The rest of your code can stay the same; although I took the liberty of removing the close statement (why was that?), and to set d=500, to make the output files really overlap by 500 lines.
awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}END{
while (i+t <= NR) {
++n;
for (k=i; k < (i+t+t-d <= NR ? i+t : NR); k++) print a[k] > "win"n".txt";
i=i+t-d
}
}' myfile.txt
I tested it with small values of t and d, and it seems to work as requested.
One final remark: for big input files, I wouldn't encourage storing the whole thing in array a.

Resources