Bash script order files size - bash

I would like to order a list of files by how close their size is to a specific number (another file's size), that is, by the absolute difference between the two sizes.
This has to be done with a bash script.
For instance:
Size to compare: 5
List of files sizes: { 1, 2, 6, 10, 5 }
Result: {5, 6, 2, 1, 10 }
I am far from being an expert in bash coding, so I would appreciate some help here.

size=5
source=(1 2 6 10 5)
for i in "${source[@]}"; do j=$((i-size)); echo "${j/-/}" "$i"; done | sort -n | cut -d " " -f 2 | tr "\n" " "
Output:
5 6 2 1 10
This solution also uses the Schwartzian transform mentioned by chepner.

Use a Schwartzian Transform:
printf "%d\n" 1 2 6 10 5 |
# Decorate
perl -ne 'printf "%d %d\n", abs($_ - 5), $_' |
# Sort
sort -k1,1n |
# Undecorate
awk '{print $2}'
I'm only using Perl because it's the shortest way I could think to access an absolute value function.
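If Perl is not at hand, the decorate step can also be sketched with awk. awk has no abs() built in, so the sign is flipped by hand. The function name sort_by_distance is made up for this illustration:

```bash
#!/usr/bin/env bash
# sort_by_distance TARGET SIZE...
# Hypothetical helper: prints the sizes ordered by |size - TARGET|.
sort_by_distance() {
    local target=$1
    shift
    printf '%d\n' "$@" |
        # Decorate: prefix each size with its absolute distance to the target
        awk -v t="$target" '{ d = $1 - t; if (d < 0) d = -d; print d, $1 }' |
        # Sort by distance, breaking ties by the size itself
        sort -k1,1n -k2,2n |
        # Undecorate: keep only the original size
        cut -d ' ' -f 2
}

sort_by_distance 5 1 2 6 10 5
```

With the sample data from the question this prints 5, 6, 2, 1, 10, one per line.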

Perl can be called from a bash script, and it is installed on most Unix-like systems.
perl -e '$n=shift; @A=split/,/,(shift); print join ", ", sort {abs($a-$n)<=>abs($b-$n)} @A' 5 1,2,6,10,5
output:
5, 6, 2, 1, 10
$n is set to your number 5 using shift
Array @A is set by splitting the input string using commas as a delimiter
The array is printed using a custom sort function sort {abs($a-$n)<=>abs($b-$n)}
Variation assuming your input file sizes are on separate lines:
printf "%d\n" 1 2 6 10 5 | perl -ne 'BEGIN{$n=shift} push @A, $_; END{print join "", sort {abs($a-$n)<=>abs($b-$n)} @A}' 5
output:
5
6
2
1
10
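The answers above work on bare numbers. Applied to actual files, the same decorate/sort/undecorate idea might look like the sketch below. It assumes sizes are taken in bytes with wc -c and that file names contain no newline characters. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# sort_files_by_size_distance TARGET FILE...
# Hypothetical helper: lists FILEs ordered by |size in bytes - TARGET|.
# Assumes file names contain no newline characters.
sort_files_by_size_distance() {
    local target=$1 f size d
    shift
    for f in "$@"; do
        size=$(wc -c < "$f") || continue   # skip files that cannot be read
        size=$((size))                      # drop padding that some wc versions add
        d=$(( size > target ? size - target : target - size ))
        printf '%d\t%s\n' "$d" "$f"         # decorate: distance, tab, name
    done | sort -k1,1n | cut -f 2-          # sort by distance, then undecorate
}
```

A call such as sort_files_by_size_distance "$(wc -c < reference.bin)" *.log (both names are placeholders) would list the log files whose size is closest to that of reference.bin first.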

Related

How to print 1-10,11-20 and so on number of rows of a file in loop using shell? [closed]

I have a file consisting of 4000 rows. I need to iterate over the records of that file in a shell script, extract the first 10 rows and send those rows to my java code, which I have already written, then the next 10 rows, and so on.
To pass 10 lines at a time as arguments to your script:
< file xargs -d$'\n' -n 10 myscript
To pipe 10 lines at a time as input to your script:
< file xargs -d$'\n' -n 10 sh -c 'printf "%s\n" "$@" | myscript' {}
Assuming your input is in a file named file which I'm creating with 30 instead of 4000 lines of input:
$ seq 30 > file
and modifying to have some lines that contain spaces, some that contain shell variables, and some that contain regexp and globbing chars to show no type of shell expansion is being done:
$ head -10 file
1
here is a multi-field line
3
4
$HOME
6
.*
8
9
10
Here's 10 args at a time being passed to an awk script:
$ < file xargs -d$'\n' -n 10 awk 'BEGIN{for (i=1; i<ARGC; i++) print i, "<" ARGV[i] ">"; exit} END{print "---"}'
1 <1>
2 <here is a multi-field line>
3 <3>
4 <4>
5 <$HOME>
6 <6>
7 <.*>
8 <8>
9 <9>
10 <10>
---
1 <11>
2 <12>
3 <13>
4 <14>
5 <15>
6 <16>
7 <17>
8 <18>
9 <19>
10 <20>
---
1 <21>
2 <22>
3 <23>
4 <24>
5 <25>
6 <26>
7 <27>
8 <28>
9 <29>
10 <30>
---
and here's 10 lines of input at a time being passed to an awk script:
$ < file xargs -d$'\n' -n 10 sh -c 'printf "%s\n" "$@" | awk '\''{print NR, "<" $0 ">"} END{print "---"}'\''' {}
1 <1>
2 <here is a multi-field line>
3 <3>
4 <4>
5 <$HOME>
6 <6>
7 <.*>
8 <8>
9 <9>
10 <10>
---
1 <11>
2 <12>
3 <13>
4 <14>
5 <15>
6 <16>
7 <17>
8 <18>
9 <19>
10 <20>
---
1 <21>
2 <22>
3 <23>
4 <24>
5 <25>
6 <26>
7 <27>
8 <28>
9 <29>
10 <30>
---
Assuming OP wants to pass the lines as arguments to OP's code, could you please try the following (I haven't tested it, since I don't have OP's java code).
awk '
FNR%10==0{
system("your_java_code " value OFS $0)
value=""
next # skip the accumulator block so this line is not carried into the next batch
}
{
value=(value?value OFS:"")$0
}
END{
if(value){
system("your_java_code " value)
}
}
' Input_file
OR
awk '
{
value=(value?value OFS:"")$0
}
FNR%10==0{
system("your_java_code " value)
value=""
}
END{
if(value){
system("your_java_code " value)
}
}
' Input_file
PS: To be on the safe side, I kept the END section of the awk code so that if there are leftover lines (say the total number of lines is not evenly divisible by 10), it calls the java program with the remaining lines.
This might work for you (GNU parallel):
parallel -kN10 javaProgram :::: file
This will pass the lines 1-10, 11-20, ... as arguments to program javaProgram
If you want to pass 10 lines at a time, use:
parallel -kN10 --cat javaProgram :::: file
Sounds to me like you want to slice out rows from a file, then pipe those rows to java. This interpretation differs from the other answers, so let me know if I'm not understanding you:
$ file=/etc/services
$ count=$(wc -l < "${file}")
$ start=1
$ stride=10
$ for ((i=start; i<=count; i+=stride)); do
awk -v i="${i}" -v stride="${stride}" \
'NR > (i+stride) { exit } NR >= i && NR < (i + stride)' "${file}" \
| java ...
done
file holds the path to the data rows. count is the total count of rows in that file. start is the first row, stride is how many you want to slice out in each iteration.
The for loop then performs the stride addition, while awk slices out the rows so numbered. We pipe them to the java program on standard in.
Assuming that you are passing the 10 lines groups from your file to your script as command line arguments, this is an answer:
rows=4000 # the number of rows in file
groupsize=10 # the size of lines groups
OIFS="$IFS"; IFS=$'\n' # use newline as input field separator to avoid `for` splitting on spaces
groups=$((rows / groupsize)) # the number of groups of lines
for i in $(seq 1 "$groups"); do # loop through each group of lines
from=$(((i * groupsize) - groupsize + 1))
to=$((i * groupsize))
arguments="" # start each group with an empty argument list
# build the arguments for each script invocation by concatenating each group of lines
for line in $(sed -n -e "${from},${to}p" file); do # 'file' is your input file name
arguments="$arguments \"$line\""
done
echo script $arguments # remove echo and change 'script' with your script name
done
IFS="$OIFS" # restore original input field separator
Like this :
for ((i=1; i<=4000; i+=10)); do
arr=( ) # create a new empty array
for ((j=i; j<i+10; j++)); do
arr+=( "$j" ) # add id to array
done
printf '%s\n' "${arr[@]}" # or execute command with all the id
done
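For completeness, batching can also be sketched in bash alone with the mapfile builtin (bash 4 or later), which reads at most a given number of lines per call. The function name run_in_batches is invented for this example, and the command it runs stands in for the java program:

```bash
#!/usr/bin/env bash
# run_in_batches FILE SIZE COMMAND [ARG...]
# Hypothetical helper: runs COMMAND once per group of SIZE lines from FILE,
# passing the lines of the group as arguments. Requires bash 4+ for mapfile.
run_in_batches() {
    local file=$1 size=$2
    shift 2
    local -a batch
    # Read from file descriptor 3 so COMMAND keeps its own standard input
    while mapfile -t -n "$size" -u 3 batch && (( ${#batch[@]} > 0 )); do
        "$@" "${batch[@]}"
    done 3< "$file"
}
```

Something like run_in_batches file 10 java MyProgram (MyProgram is a placeholder) would then receive lines 1-10, 11-20, and so on as arguments. To feed the lines on standard input instead, the loop body could be changed to printf '%s\n' "${batch[@]}" | "$@".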

Renumbering numbers in a text file based on an unique mapping

I have a big txt file with 2 columns and more than 2 million rows. Every value represents an id and there may be duplicates. There are about 100k unique ids.
1342342345345 34523453452343
0209239498238 29349203492342
2349234023443 99203900992344
2349234023443 182834349348
2923000444 9902342349234
I want to identify each id and re-number all of them starting from 1. It should re-number duplicates also using the same new id. If possible, it should be done using bash.
The output could be something like:
123 485934
34 44834
167 34564
167 2345
2 34564
Doing this in pure bash will be really slow. I'd recommend:
tr -s '[:blank:]' '\n' <file |
sort -un |
awk '
NR == FNR {id[$1] = FNR; next}
{for (i=1; i<=NF; i++) {$i = id[$i]}; print}
' - file
4 8
3 7
5 9
5 2
1 6
With bash and sort:
#!/bin/bash
shopt -s lastpipe
declare -A hash # declare associative array
index=1
# read file and fill associative array
while read -r a b; do
echo "$a"
echo "$b"
done <file | sort -nu | while read -r x; do
hash[$x]="$((index++))"
done
# read file and print values from associative array
while read -r a b; do
echo "${hash[$a]} ${hash[$b]}"
done < file
Output:
4 8
3 7
5 9
5 2
1 6
See: man bash and man sort
Pure Bash, with a single read of the file:
declare -A hash
index=1
while read -r a b; do
[[ ${hash[$a]} ]] || hash[$a]=$((index++)) # assign index only if not set already
[[ ${hash[$b]} ]] || hash[$b]=$((index++)) # assign index only if not set already
printf '%s %s\n' "${hash[$a]}" "${hash[$b]}"
done < file > file.indexed
Notes:
the index is assigned in the order read (not based on sorting)
we make a single pass through the file (not two as in other solutions)
Bash's read is slower than awk; however, if the same logic is implemented in Perl or Python, it will be much faster
this solution is more CPU bound because of the hash lookups
Output:
1 2
3 4
5 6
5 7
8 9
Just keep a monotonic counter and a table of seen numbers; when you see a new id, give it the value of the counter and increment:
awk '!a[$1]{a[$1]=++N} {$1=a[$1]} !a[$2]{a[$2]=++N} {$2=a[$2]} 1' input
awk 'NR==FNR { ids[$1] = ++c; next }
{ print ids[$1], ids[$2] }
' <( { cut -d' ' -f1 renum.in; cut -d' ' -f2 renum.in; } | sort -nu ) renum.in
Join the two columns into one, then sort that into numerical order (-n) and make it unique (-u), before using awk to turn this sequence into an array of mappings from old to new ids.
Then for each line in input, swap ids and print.

Extract substring passing a variable to cut

Is it possible to do something like this in bash with cut:
strLf="JJT9879YGTT"
strZ=(2, 3, 5, 6, 9, 11)
numZ=${#strZ[@]}
for ((ctZ=0; ctZ<${numZ}; ctZ++))
do
lenThis=${strZ[${ctZ}]}
fetch=$(echo "${strLf}" | cut -c 1-${lenThis})
done
Through successive loops, I want ${fetch} to contain "JJ" "JJT" "JJT98" "JJT987" "JJT9879YG" "JJT9879YGTT", etc, according to the indexes given by strZ.
Or is there some other way I need to be doing this?
You can use ${string:position:length} to get length characters of $string starting at position.
$ s="JJT9879YGTT"
$ echo ${s:0:2}
JJ
$ echo ${s:0:3}
JJT
And also using variables:
$ t=5
$ echo ${s:0:$t}
JJT98
So if you put all these values in an array, you can loop through them and use its value as a length argument saying ${string:0:length}:
strLf="JJT9879YGTT"
strZ=(2 3 5 6 9 11)
for i in "${strZ[@]}"; do
echo "${strLf:0:$i}"
done
For your given string it returns this to me:
$ for i in "${strZ[@]}"; do echo "${strLf:0:$i}"; done
JJ
JJT
JJT98
JJT987
JJT9879YG
JJT9879YGTT
words="JJT9879YGTT"
strZ=(2 3 5 6 9 11)
for i in "${strZ[@]}"
do
echo ${words:0:$i}
done
Output:
JJ
JJT
JJT98
JJT987
JJT9879YG
JJT9879YGTT
Realised my mistake - I was using commas in the array to identify elements.
So instead of what I was using:
strZ=(2, 3, 5, 6, 9, 11)
I have to use:
strZ=(2 3 5 6 9 11)
It works for cut as well.
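A minimal sketch of the cut variant with the corrected array follows. The function name prefixes_with_cut is made up for illustration. The last two lines show why the commas were a problem: bash separates array elements by whitespace only, so each comma ends up inside an element.

```bash
#!/usr/bin/env bash
# prefixes_with_cut STRING LENGTH...
# Hypothetical helper: prints the first LENGTH characters of STRING
# for each LENGTH given, using cut rather than parameter expansion.
prefixes_with_cut() {
    local s=$1 len
    shift
    for len in "$@"; do
        printf '%s\n' "$s" | cut -c "1-$len"
    done
}

strLf="JJT9879YGTT"
strZ=(2 3 5 6 9 11)               # elements separated by whitespace only
prefixes_with_cut "$strLf" "${strZ[@]}"

bad=(2, 3)                        # the commas become part of the elements
printf '<%s>' "${bad[@]}"; echo   # prints <2,><3,>
```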

How to sort a content of a text file using a shell script

I am new to shell scripting. I would like to know how to sort the contents of a file using shell scripting.
Here is an example:
fap0089-josh.baker
fap00233-adrian.edwards
fap00293-bob.boyle
fap00293-bob.jones
fap002-brian.lopez
fap00293-colby.morris
fap00293-cole.mitchell
psf0354-SKOWALSKI
psf0354-SLEE
psf0382-SLOWE
psf0391-SNOMURA
psf0354-SPATEL
psf0364-SRICHARDS
psf0354-SSEIBERT
psf0354-SSIRAH
bsi0004-STRAN
bsi0894-STURBIC
unit054-SUNDERWOOD
Considering the data above (this is a small set, I have more than 5.5 records), I would like to sort it like this:
Number of entries starting with fap,psf,bsi,unit etc...
The total number of environments for each type, i.e: each numeric after the word, 0004,0382,054 etc are environments. e.g: psf has 4 unique environments.
The sum total
Here's a Schwartzian transform to sort by 1) leading letters, then 2) digits
sed -r 's/^([[:alpha:]]+)([[:digit:]]+)/\1 \2 /' filename |
sort -t ' ' -k 1,1 -k 2,2n |
sed 's/ //; s/ //'
output:
bsi0004-STRAN
bsi0894-STURBIC
fap002-brian.lopez
fap0089-josh.baker
fap00233-adrian.edwards
fap00293-bob.boyle
fap00293-bob.jones
fap00293-colby.morris
fap00293-cole.mitchell
psf0354-SKOWALSKI
psf0354-SLEE
psf0354-SPATEL
psf0354-SSEIBERT
psf0354-SSIRAH
psf0364-SRICHARDS
psf0382-SLOWE
psf0391-SNOMURA
unit054-SUNDERWOOD
To generate the metrics you mention, I'd use perl:
perl -nE '
/^([[:alpha:]]+)(\d+)/ or next;
$count{$1}++;
$nenv{$1}{$2}=1;
$total+=$2
}
END {
say "Counts:";
say "$_ => $count{$_}" for sort keys %count;
say "Number of environments";
say "$_ => ", scalar keys %{$nenv{$_}} for sort keys %nenv;
say "Total = $total";
' filename
Counts:
bsi => 2
fap => 7
psf => 8
unit => 1
Number of environments
bsi => 2
fap => 4
psf => 4
unit => 1
Total = 5355
Without using perl, it's less efficient because you have to read the file multiple times.
echo Counts:
sed 's/[0-9].*//' filename | sort | uniq -c
echo Number of environments:
sed -r 's/^([a-z]+)([0-9]*).*/\1 \2/' filename | sort -u | cut -d" " -f1 | uniq -c
echo Total:
{ printf "%d+" $(sed -r 's/^[a-z0]+([0-9]*).*/\1/' filename); echo 0; } | bc
Counts:
2 bsi
7 fap
8 psf
1 unit
Number of environments:
2 bsi
4 fap
4 psf
1 unit
Total:
5355
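If Perl is unavailable but a single pass is still wanted, awk can collect the same three metrics in one read. This is only a sketch: the function name and the output layout (prefix, entry count, distinct environments) are choices made for this example, not something taken from the answer above.

```bash
#!/usr/bin/env bash
# summarize_prefixes FILE
# Hypothetical helper: one awk pass that prints, per alphabetic prefix,
# the number of entries and the number of distinct environments,
# followed by the sum of all environment numbers.
summarize_prefixes() {
    awk '
    match($0, /^[A-Za-z]+/) {
        prefix = substr($0, 1, RLENGTH)
        rest = substr($0, RLENGTH + 1)
        if (!match(rest, /^[0-9]+/)) next      # no digits after the letters
        env = substr(rest, 1, RLENGTH)
        count[prefix]++
        # count each (prefix, environment) pair only once
        if (!((prefix, env) in seen)) { seen[prefix, env] = 1; nenv[prefix]++ }
        total += env
    }
    END {
        for (p in count) print p, count[p], nenv[p] | "sort"
        close("sort")                           # flush sorted lines before the total
        print "Total", total
    }' "$1"
}
```

Environments are compared as strings, so 0089 and 089 would count as two different environments, matching the hash keys used in the Perl version.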

Finding gaps in sequential numbers

I don’t do this stuff for a living so forgive me if it’s a simple question (or more complicated than I think). I‘ve been digging through the archives and found a lot of tips that are close but being a novice I’m not sure how to tweak for my needs or they are way beyond my understanding.
I have some large data files that I can parse out to generate a list of coordinates that are mostly sequential
5
6
7
8
15
16
17
25
26
27
What I want is a list of the gaps
1-4
9-14
18-24
I don’t know perl, SQL or anything fancy but thought I might be able to do something that would subtract one number from the next. I could then at least grep the output where the difference was not 1 or -1 and work with that to get the gaps.
With awk :
awk '$1!=p+1{print p+1"-"$1-1}{p=$1}' file.txt
explanations
$1 is the first column from current input line
p is the value of the first column on the previous line
so ($1!=p+1) is a condition: if $1 differs from the previous value + 1, then
this part is executed: {print p+1 "-" $1-1}: it prints the previous value + 1, the - character and the first column - 1
{p=$1} is executed for every line: p is assigned the current first column
interesting question.
sputnick's awk one-liner is nice. I cannot write a simpler one than his. I just add another way using diff:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
the output with your example would be:
1,4
9,14
18,24
I know the output uses a comma instead of -. You could replace the grep with sed to get -, since grep cannot change the input text, but the idea is the same.
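One way the sed replacement could look is sketched below. It assumes the file holds sorted, unique, positive integers, one per line. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# gaps_with_diff FILE
# Sketch assuming FILE holds sorted, unique, positive integers, one per line.
gaps_with_diff() {
    seq "$(tail -n 1 "$1")" |
        diff - "$1" |
        # Keep only the "N,Md..." or "Nd..." hunk headers and drop everything from "d" on
        sed -n 's/^\([0-9][0-9,]*\)d.*/\1/p' |
        # diff writes ranges as N,M; turn the comma into a hyphen
        tr ',' '-'
}
```

A gap of a single number shows up as just that number, because diff writes a one-line deletion without a range.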
hope it helps.
A Ruby Answer
Perhaps someone else can give you the Bash or Awk solution you asked for. However, I think any shell-based answer is likely to be extremely localized for your data set, and not very extendable. Solving the problem in Ruby is fairly simple, and provides you with flexible formatting and more options for manipulating the data set in other ways down the road. YMMV.
#!/usr/bin/env ruby
# You could read from a file if you prefer,
# but this is your provided corpus.
nums = [5, 6, 7, 8, 15, 16, 17, 25, 26, 27]
# Find gaps between zero and first digit.
nums.unshift 0
# Create array of arrays containing missing digits.
missing_nums = nums.each_cons(2).map do |array|
(array.first.succ...array.last).to_a unless
array.first.succ == array.last
end.compact
# => [[1, 2, 3, 4], [9, 10, 11, 12, 13, 14], [18, 19, 20, 21, 22, 23, 24]]
# Format the results any way you want.
puts missing_nums.map { |ary| "#{ary.first}-#{ary.last}" }
Given your current corpus, this yields the following on standard output:
1-4
9-14
18-24
Just remember the previous number and verify that the current one is the previous plus one:
#! /bin/bash
previous=0
while read n ; do
if (( n != previous + 1 )) ; then
echo $(( previous + 1 ))-$(( n - 1 ))
fi
previous=$n
done
You might need to add some checking to prevent lines like 28-28 for single number gaps.
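A sketch of that extra check, wrapped in a function (the name find_gaps is invented here) that reads the numbers from standard input. It assumes sorted positive integers without leading zeros, since bash arithmetic would treat a leading zero as octal.

```bash
#!/usr/bin/env bash
# find_gaps: reads sorted positive integers from standard input, one per line,
# and prints the missing ranges. A gap of one number is printed as "N",
# not as "N-N".
find_gaps() {
    local previous=0 n first last
    while read -r n; do
        if (( n > previous + 1 )); then
            first=$(( previous + 1 ))
            last=$(( n - 1 ))
            if (( first == last )); then
                echo "$first"
            else
                echo "$first-$last"
            fi
        fi
        previous=$n
    done
}

printf '%s\n' 5 6 7 8 15 16 17 25 26 27 | find_gaps
```

For the sample list this prints 1-4, 9-14 and 18-24, one per line.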
Perl solution similar to awk solution from StardustOne:
perl -ane 'if ($F[0] != $p+1) {printf "%d-%d\n",$p+1,$F[0]-1}; $p=$F[0]' file.txt
These command-line options are used:
-n loop around every line of the input file, do not automatically print every line
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace. Fields are indexed starting with 0.
-e execute the perl code
Given input file, use the numinterval util and paste its output beside file, then munge it with tr, xargs, sed and printf:
gaps() { paste <(echo; numinterval "$1" | tr 1 '-' | tr -d '[02-9]') "$1" |
tr -d '[:blank:]' | xargs echo |
sed 's/ -/-/g;s/-[^ ]*-/-/g' | xargs printf "%s\n" ; }
Output of gaps file:
5-8
15-17
25-27
How it works. The output of paste <(echo; numinterval file) file looks like:
5
1 6
1 7
1 8
7 15
1 16
1 17
8 25
1 26
1 27
From there we mainly replace things in column #1, and tweak the spacing. The 1s are replaced with -s, and the higher numbers are blanked. Remove some blanks with tr. Replace runs of hyphens like "5-6-7-8" with a single hyphen "5-8", and that's the output.
This one lists the numbers that break the sequence in a list.
Idea taken from @choroba but done with a for loop.
#! /bin/bash
previous=0
n=$( cat listaNums.txt )
for number in $n
do
numListed=$(($number - 1))
if [ $numListed != $previous ] && [ $number != 2147483647 ]; then
echo $numListed
fi
previous=$number
done
