read file line by line and sum each line individually - bash

I'm trying to make a script that creates a file, say file01.txt, with a number on each line:
001
002
...
998
999
Then I want to read the file line by line, sum the digits of each line, and say whether the sum is even or odd.
For example, 0+0+1 = 1, which is odd,
and 9+9+8 = 26, which is even:
001 odd
002 even
..
998 even
999 odd
I tried
while IFS=read -r line; do sum+=line >> file02.txt; done <file01.txt
but that sums the whole file, not each line.

You can do this fairly easily in bash itself, using built-in parameter expansions to trim the leading zeros from each line before summing the digits for the odd / even test.
To read either from a named file or from stdin by default, you can use a parameter expansion with a default value: the first argument (positional parameter) is used as the filename if given, and if not, the script just reads from stdin, e.g.
#!/bin/bash
infile="${1:-/dev/stdin}" ## read from file provided as $1 or stdin
You then use infile as the redirection for your while loop, e.g.
while read -r line; do ## loop reading each line
    ...
done < "$infile"
To trim the leading zeros, first obtain the substring of leading zeros trimming all digits from the right until only zeros remain, e.g.
leading="${line%%[1-9]*}" ## get leading 0's
Now, using the same type of parameter expansion with # instead of %%, trim the leading-zeros substring from the front of line, saving the resulting number in value, e.g.
value="${line#$leading}" ## trim from front
Now zero your sum and loop over the digits in value to obtain the sum of digits:
for ((i=0;i<${#value};i++)); do ## loop summing digits
    sum=$((sum + ${value:$i:1}))
done
All that remains is your even / odd test. Putting it all together in a short example script that intentionally outputs the sum of digits in addition to your wanted "odd" / "even" output, you could do:
#!/bin/bash
infile="${1:-/dev/stdin}" ## read from file provided as $1 or stdin
while read -r line; do ## read each line
    [ "$line" -eq "$line" 2>/dev/null ] || continue ## validate integer
    leading="${line%%[1-9]*}" ## get leading 0's
    value="${line#$leading}" ## trim from front
    sum=0 ## zero sum
    for ((i=0;i<${#value};i++)); do ## loop summing digits
        sum=$((sum + ${value:$i:1}))
    done
    printf "%s (sum=%d) - " "$line" "$sum" ## output line w/sum
                                           ## (temporary output)
    if ((sum % 2 == 0)); then ## check odd / even
        echo "even"
    else
        echo "odd"
    fi
done < "$infile"
(Note: you can actually loop over the digits in line directly and skip removing the leading-zeros substring. The removal only ensures that if the whole value is used in arithmetic it isn't interpreted as an octal value -- up to you.)
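For instance, a minimal sketch of that variant, summing the digits of line directly (each ${line:$i:1} is a single digit, so no octal misinterpretation is possible):
sum=0
for ((i=0;i<${#line};i++)); do ## loop over the digits of line directly
    sum=$((sum + ${line:$i:1})) ## a single digit can never be read as octal
done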
Example Use/Output
Using a quick process substitution to provide input of 001 - 020 on stdin you could do:
$ ./sumdigitsoddeven.sh < <(printf "%03d\n" {1..20})
001 (sum=1) - odd
002 (sum=2) - even
003 (sum=3) - odd
004 (sum=4) - even
005 (sum=5) - odd
006 (sum=6) - even
007 (sum=7) - odd
008 (sum=8) - even
009 (sum=9) - odd
010 (sum=1) - odd
011 (sum=2) - even
012 (sum=3) - odd
013 (sum=4) - even
014 (sum=5) - odd
015 (sum=6) - even
016 (sum=7) - odd
017 (sum=8) - even
018 (sum=9) - odd
019 (sum=10) - even
020 (sum=2) - even
You can simply remove the output of "(sum=X)" when you have confirmed it operates as you expect and redirect the output to your new file. Let me know if I understood your question properly and if you have further questions.
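For example, once the temporary output is removed, the redirection could look like this (using the script name from the example above):
$ ./sumdigitsoddeven.sh file01.txt > file02.txt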

Would you please try the bash version:
parity=("even" "odd")
while IFS= read -r line; do
    mapfile -t ary < <(fold -w1 <<< "$line")
    sum=0
    for i in "${ary[@]}"; do
        (( sum += i ))
    done
    echo "$line" "${parity[sum % 2]}"
done < file01.txt > file02.txt
fold -w1 <<< "$line" breaks the string $line into lines of one character each
(one digit per line).
mapfile assigns the lines fed by the fold command to the elements of the array ary.
Please note the bash script is not efficient in time and is not suitable
for large inputs.

With GNU awk:
awk -vFS='' '{sum=0; for(i=1;i<=NF;i++) sum+=$i;
print $0, sum%2 ? "odd" : "even"}' file01.txt
The FS awk variable defines the field separator. If it is set to the empty string (this is what the -vFS='' option does) then each character is a separate field.
The rest is trivial: the block between curly braces is executed for each line of the input. It computes the sum of the fields with a for loop (NF is another awk variable; its value is the number of fields of the current record). It then prints the original line ($0) followed by the string even if the sum is even, else odd.
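For reference, feeding a few sample lines through the one-liner (GNU awk assumed, since an empty FS splitting into characters is a GNU extension) would give:
$ printf '%03d\n' 1 2 998 | awk -vFS='' '{sum=0; for(i=1;i<=NF;i++) sum+=$i; print $0, sum%2 ? "odd" : "even"}'
001 odd
002 even
998 even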

pure awk:
BEGIN {
    for (i=1; i<=999; i++) {
        printf ("%03d\n", i) > ARGV[1]
    }
    close(ARGV[1])
    ARGC = 2
    FS = ""
    result[0] = "even"
    result[1] = "odd"
}
{
    printf("%s: %s\n", $0, result[($1+$2+$3) % 2])
}
Processing a file line by line, and doing math, is a perfect task for awk.
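Assuming the script above is saved as solution.awk (the name used in the comparison below), you run it with the path for the generated number list as its single argument and redirect stdout for the results:
$ awk -f solution.awk numlist-awk > result-awk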
pure bash:
set -e
printf '%03d\n' {1..999} > "${1:?no path provided}"
result=(even odd)
mapfile -t num_list < "$1"
for i in "${num_list[@]}"; do
    echo "$i: ${result[(${i:0:1} + ${i:1:1} + ${i:2:1}) % 2]}"
done
A similar method can be applied in bash, but it's slower.
comparison:
bash is about 10x slower.
$ cd ./tmp.Kb5ug7tQTi
$ bash -c 'time awk -f ../solution.awk numlist-awk > result-awk'
real 0m0.108s
user 0m0.102s
sys 0m0.000s
$ bash -c 'time bash ../solution.bash numlist-bash > result-bash'
real 0m0.931s
user 0m0.929s
sys 0m0.000s
$ diff --report-identical result*
Files result-awk and result-bash are identical
$ diff --report-identical numlist*
Files numlist-awk and numlist-bash are identical
$ head -n 5 *
==> numlist-awk <==
001
002
003
004
005
==> numlist-bash <==
001
002
003
004
005
==> result-awk <==
001: odd
002: even
003: odd
004: even
005: odd
==> result-bash <==
001: odd
002: even
003: odd
004: even
005: odd
read is a bottleneck in a while IFS= read -r line loop.
mapfile (combined with for loop) can be slightly faster, but still slow (it also copies all the data to an array first).
Both solutions create a number list in a new file (which was in the question), and print the odd/even results to stdout. The path for the file is given as a single argument.
In awk, you can set the field separator to empty (FS="") to process individual characters.
In bash it can be done with substring expansion (${var:index:length}).
Modulo 2 (number % 2) to get odd or even.

Related

How to find if a series of non-float numbers are missing in a string

In Bash,
I want to find a range of non-float numbers in a string.
If I have a string like so:
"1.4.jpg 2.005.jpg 003: Blah.jpg Blah4.jpg 4.5.jpg"
And I want to find if numbers firstNum-lastNum are missing. Say, if
firstNum=1
lastNum=5
the function would return
"1 is missing, 2 is missing, 5 is missing"
It's relatively easy to find non-float numbers in a string, but what confuses my script is the "2.005.jpg" part of the string. My script doesn't recognize that the 5 belongs to the float 2.005 and therefore should be ignored.
I would just say if the number has leading zeros or has "[0-9]." in front of it, ignore it. But unfortunately, I need support for numbers with any amount of leading zeros.
If you're not against using awk, you can use this script:
echo "1.4.jpg 2.005.jpg 003: Blah.jpg Blah4.jpg" | \
awk -v min=1 -v max=5 -v RS="[^0-9. ]+" '
($0+0)!~/\./&&/[0-9]+/{a[$0+0]}
END{for(i=min;i<=max;i++)if(!(i in a))print i " is missing"}'
This is a GNU awk script that relies on a regex record separator RS to split the input into records containing only numbers (possibly floats).
The trick is to add 0 to the found number and check that the result contains no dot (.), i.e. that it is an integer. If so, the number is stored as a key of the array a.
The END block loops through all integers from min (1) to max (5) and prints a message for each number that is not a key of the array a.
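With the sample string above, this prints:
1 is missing
2 is missing
5 is missing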
The POSIX-compliant alternative is the following:
echo "1.4.jpg 2.005.jpg 003: Blah.jpg Blah4.jpg" | \
awk -v min=1 -v max=5 '
{
    split($0,n,"[^0-9. ]+");
    for(i in n){
        if((n[i]+0)!~/\./&&n[i]~/[0-9]+/){
            a[n[i]+0]
        }
    }
}
END{for(i=min;i<=max;i++)if(!(i in a))print i " is missing"}'
The main difference is the use of the split() function in place of the regex RS. split breaks the input string and puts the numbers into the array n. Each element is then checked and, if it is an integer, stored as a key of the array a.
Take a look at this extglob pattern:
find_missing() {
    shopt -s extglob
    for(( i = $2; i <= $3; i++ )); do
        [[ $1 = !(*[0-9]|*[0-9].)*(0)"$i"!(.[0-9]*|[0-9]*) ]] || printf '<%s> missing!\n' "$i"
    done
}
Consider $i to be 4:
"$i": match the number
"$i"!(.[0-9]*|[0-9]*): match the number if it's not followed by either .<number>, which would make it a float number (4.1 for example), or simply followed by another number which would make it a different number (it would falsely consider 41 to be 4 for example)
*(0)"$i"!(.[0-9]*|[0-9]*): allow leading 0s
!(*[0-9]|*[0-9].)*(0)"$i"!(.[0-9]*|[0-9]*): match the number if it's not prefixed by <number>., which would make it a float number (1.4 for example), or prefixed by another number which would make it a different number (it would falsely consider 24 to be 4 for example)
shopt -s extglob: enable extended globbing
Test run:
$ find_missing "1.4.jpg 2.005.jpg 003: Blah.jpg Blah4.jpg" 1 5
<1> missing!
<2> missing!
<5> missing!
$ find_missing "1.4.jpg 2.005.jpg 003: Blah.jpg Blah4.jpg" 1 2
<1> missing!
<2> missing!
$ find_missing "001 3.002 A.4A" 1 4
<2> missing!
<3> missing!
Possible Answer:
Here's a bash function that gives the expected output value on the provided test case in a (hopefully) reasonable way:
function check_missing {
    prefix=""
    for i in {1..5}; do
        # make sure that $i is present,
        # with optional leading zeroes,
        # but with at least one non-number
        # *before* the zeroes and *after* $i
        if ! [[ "$1" =~ .*[^0-9\.]0*"$i"\.?[^0-9\.].* ]]; then
            echo -n "${prefix}${i} is missing"
            prefix=", "
        fi
    done
    echo
}
I'm not sure how well this will generalize to the other inputs you have (or how important the output formatting is), but hopefully it at least gives an idea for how to solve the problem.
Sample output:
> check_missing "001.004.jpg 2.005.jpg 003.jpg Blah4.jpg"
1 is missing, 2 is missing, 5 is missing
> check_missing "1.4.jpg 2.005.jpg 003: Blah.jpg Blah4.jpg"
1 is missing, 2 is missing, 5 is missing

Random word Bash script: if a number is supplied as the first command line argument, select only from words with that many characters

I am trying to create a Bash script that
- prints a random word
- if a number is supplied as the first command line argument then it will select from only words with that many characters.
This is my go at the first section (print a random word):
C=$(sed -n "$RANDOM p" /usr/share/dict/words)
echo $C
I am really stuck with the second section. Can anyone help?
This might help someone coming from Ryan's tutorial:
#!/bin/bash
charlen=$1
grep -E "^.{$charlen}$" $PWD/words.txt | shuf -n 1
You have to use a while loop to read every single line of that file and check whether the length of the word (including apostrophes) equals the specified number. On my OS the file is 99171 lines.
#!/usr/bin/env bash
readWords() {
    declare -i int="$1"
    (( int == 0 )) && {
        printf "%s\n" "$int is 0, can't find 0 words"
        return 1
    }
    while read -r getWords; do
        if [[ ${#getWords} -eq $int ]]; then
            printf "%s\n" "$getWords"
        fi
    done < /usr/share/dict/words
}
readWords 20
This function takes a single argument. The declare -i command coerces the argument into an integer; a non-numeric string is coerced to 0. Since there are no 0-character words, the function returns early if the specified argument (number) is 0 (or a string coerced to 0).
It then reads every single line in /usr/share/dict/words, gets the length of each line with ${#getWords} (the ${#...} expansion gives the length of a string, the number of positional parameters, or the size of an array, depending on what follows the #), and checks whether it equals the specified argument (number).
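As a quick illustration of the coercion and the length expansion (the values shown are what a fresh interactive bash session would print):
$ declare -i n=hello; echo "$n" ## a non-numeric string (no variable of that name set) coerces to 0
0
$ w=supercalifragilistic; echo "${#w}" ## ${#w} is the length of the string in w
20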
A loop is not required, you can do something like
CH=$1; # how many characters the word must have
WordFile=/usr/share/dict/words; # file to read from
# find how many words that matches that length
TOTW=$(grep -Ec "^.{$CH}$" $WordFile);
# pick a random one, if you expect more than 32767 hits you
# need to do something like ($RANDOM+1)*($RANDOM+1)
RWORD=$(($RANDOM%$TOTW+1));
#show that word
grep -E "^.{$CH}$" $WordFile|sed -n "$RWORD p"
Depending on your needs you probably want to add checks: that $1 is a reasonable number, that the file exists, that TOTW is > 0, and so on.
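For example, those checks might look like this (a sketch, not exhaustive):
case $CH in
    ''|*[!0-9]*) echo "usage: $0 wordlength" >&2; exit 1;; # $1 must be a number
esac
[ -r "$WordFile" ] || { echo "cannot read $WordFile" >&2; exit 1; }
[ "$TOTW" -gt 0 ] || { echo "no words of length $CH in $WordFile" >&2; exit 1; }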
This code would achieve what you want:
awk -v n="$1" 'length($0) == n' /usr/share/dict/words > /tmp/wordsHolder
shuf -n 1 /tmp/wordsHolder
Some comments: "$RANDOM" (as used in your original script attempt) generates an integer in the range 0 - 32767, which could be more (or less) than the number of matching words (lines) available for the desired word length -- thus a potential source of errors.
To avoid that, shuf -n 1 retrieves a (pseudo)randomly picked word (line) from the file's entire range (from line 1 to the last line).

Splitting files by line across two files equally without pre-defined chunk length - Unix

I have two files of equal length (i.e. no. of lines):
text.en
text.cs
I want to incrementally split the files into 12 parts, and on each iteration add one more of the first ten parts to the training set.
Let's say the files contain 100 lines; I need some sort of loop that does:
#!/bin/bash
F1=text.en
F2=text.cs
for i in `seq 0 9`;
do
    split -n l/12 -d text.en
    cat x10 > dev.en
    cat x11 > test.en
    echo "" > train.en
    for j in `seq 0 $i`; do
        cat x0$j >> train.en
    done
    split -n l/12 -d text.cs
    cat x10 > dev.cs
    cat x11 > test.cs
    echo "" > train.cs
    for j in `seq 0 $i`; do
        cat x0$j >> train.cs
    done
    wc -l train.en train.cs
    echo "############"
done
[out]:
55632 train.en
55468 train.cs
111100 total
############
110703 train.en
110632 train.cs
221335 total
############
165795 train.en
165011 train.cs
330806 total
############
It's giving me unequal chunks between the files.
Also, when I use split, it's splitting into unequal chunks:
alvas@ubi:~/workspace/cvmt$ split -n l/12 -d text.en
alvas@ubi:~/workspace/cvmt$ wc -l x*
55631 x00
55071 x01
55092 x02
54350 x03
54570 x04
54114 x05
55061 x06
53432 x07
52685 x08
52443 x09
52074 x10
52082 x11
646605 total
I don't know the no. of lines of the file before hand, so I can't use the split -l option.
How do I split a file into equal size by no. of lines given that I don't know how many lines are there in the files beforehand? Should I do some sort of pre-calculation with wc -l?
How do I ensure that the split across two files are of equal size in for every chunk?
(Note that the solution needs to split the file at the end of the lines, i.e. don't split up any lines, just split the file by line).
It's not entirely clear what you're trying to achieve, but here are a few pointers:
split -n l/12 splits into 12 chunks of roughly equal byte size, not number of lines.
split -n r/12 will try to distribute the line count evenly, but if the chunk size is not a divisor of the total line count, you'll still get (slightly) varying line counts: the extra lines are distributed round-robin style.
E.g., with 100 input lines split into 12 chunks, you'll get line counts of 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8: 100 / 12 = 8 (integer division), and 100 % 12 = 4, so all files get at least 8 lines, with the extra 4 lines distributed among the first 4 output files.
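You can verify that distribution quickly (GNU split assumed; -d for numeric suffixes, as used elsewhere here):
$ seq 100 > f && split -n r/12 -d f && wc -l x*
9 x00
9 x01
9 x02
9 x03
8 x04
8 x05
8 x06
8 x07
8 x08
8 x09
8 x10
8 x11
100 total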
So, yes, if you want a fixed line count for all files (except for the last, if the chunk size is not a divisor), you must calculate the total line count up front, perform integer division to get the fixed line count, and use split -l with that count:
totalLines=$(wc -l < text.en)
linesPerFile=$(( totalLines / 12 ))
split -l "$linesPerFile" text.en # with 100 lines, yields 12 files with 8 lines and 1 with 4
Additional observations:
With a small, fixed iteration count, it is easier and more efficient to use brace expansion (e.g., for i in {0..9} rather than for i in `seq 0 9`).
If a variable must be used, or with larger numbers, use an arithmetic expression:
n=9; for (( i = 0; i <= $n; i++ )); do ...; done
While you cannot do cat x0{0..$i} directly (because Bash doesn't support variables in brace expansions), you can emulate it by combining seq -f and xargs:
You can replace
echo "" > train.en
for j in `seq 0 $i`; do
cat x0$j >> train.en
done
with the following:
seq -f 'x%02.f' 0 "$i" | xargs cat > train.en
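For example, with $i as 3 that produces the filenames x00 through x03, one per line:
$ seq -f 'x%02.f' 0 3
x00
x01
x02
x03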
Since you control the value of $i, you could even simplify to:
eval "cat x0{0..$i}" > train.en # !! Only do this if you trust $i to contain a number.

Iterate through URLs

If I wanted to use ffmpeg to download a bunch of .ts files from a website, and the url format is
http://example.com/video-1080Pxxxxx.ts
Where the xxxxx is a number from 00000 to 99999 (required zero padding), how would I iterate through that in bash so that it tries every integer starting at 00000, 00001, 00002, etc.?
Loop over the integer values from 0 to 99999, and use printf to pad to 5 digits.
for x in {0..99999}; do
    zx=$(printf '%05d' $x) # zero-pad to 5 digits
    url="http://example.com/video-1080P${zx}.ts"
    ... # Do something with url
done
In pure bash:
$ n=99999 ; for ((i=0; i<=n; i++)) { s=$(printf "%05d" $i); echo $s ; }
or with a utility:
$ seq -w 0 99999
$ seq --help
Usage: seq [OPTION]... LAST
or: seq [OPTION]... FIRST LAST
or: seq [OPTION]... FIRST INCREMENT LAST
Print numbers from FIRST to LAST, in steps of INCREMENT.
Mandatory arguments to long options are mandatory for short options too.
-f, --format=FORMAT use printf style floating-point FORMAT
-s, --separator=STRING use STRING to separate numbers (default: \n)
-w, --equal-width equalize width by padding with leading zeroes
Why not do something with a for loop:
for i in 0000{0..9} 000{10..99} 00{100..999} 0{1000..9999} {10000..99999}
do
    # Curl was used since some minimal installs of linux do not have wget
    curl -O http://example.com/video-1080P"$i".ts
    sleep 1
done
(I am sure that there is a much better way to do this but it is not presenting itself to me at the moment)
My Bash (4.3) can do this:
$ echo {001..010}
001 002 003 004 005 006 007 008 009 010
So you could just do
for i in {00000..99999}; do
    url="http://example.com/video-1080P${i}.ts"
    # Use url
done

Shell: How to append characters at the end of a string?

I need to write a shell script that appends characters to each line in a text file so that all lines have the same length. For example, if the input is:
Line 1 has 25 characters.
Line two has 27 characters.
Line 3: all lines must have the same number of characters.
Here "Line 3" has 58 characters (not including the newline character) so I have to append 33 characters to "Line 1" and 31 characters to "Line 2". The output should look like:
Line 1 has 25 characters.000000000000000000000000000000000
Line two has 27 characters.0000000000000000000000000000000
Line 3: all lines must have the same number of characters.
We can assume the max length (58 in the above example) is known.
Here is one way of doing it:
while read -r; do # Read from the file one line at a time
    printf "%s" "$REPLY" # Print the line without the newline
    for (( i=1; i<=((58 - ${#REPLY})); i++ )); do # Find the difference in length to iterate
        printf "%s" "0" # Pad 0s
    done
    printf "\n" # Add the newline
done < file
Output:
Line 1 has 25 characters.000000000000000000000000000000000
Line two has 27 characters.0000000000000000000000000000000
Line 3: all lines must have the same number of characters.
Of course this is easy if you know the max line length. If you don't, then you need to read the file into an array, keeping track of each line's length and storing the length of the longest line in a variable. Once you have read the whole file, you iterate over the array and apply the same for loop shown above.
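A rough sketch of that two-pass approach (read the file into an array with mapfile, find the longest line, then pad):
#!/bin/bash
mapfile -t lines < file          ## read the whole file into an array
max=0
for l in "${lines[@]}"; do       ## pass 1: find the longest line
    (( ${#l} > max )) && max=${#l}
done
for l in "${lines[@]}"; do       ## pass 2: print each line padded with 0s
    printf "%s" "$l"
    for (( i = ${#l}; i < max; i++ )); do printf "0"; done
    printf "\n"
done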
awk '{print length($0)}' <file_name> | sort -nr | head -1
This way you would not need a loop to find the longest line length.
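Combining that with the loop from the answer above, you could capture the maximum first and use it in place of the hard-coded 58:
max=$(awk '{print length($0)}' file | sort -nr | head -1) ## longest line length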
Here's a cryptic one:
perl -lpe '$_.="0"x(58-length)' file
(-l handles the newlines, -p prints each line after the code runs, and the code appends 58-length copies of "0" to the line.)
