I am having trouble in Bash looping over a text file of ~20k lines.
Here is my (minimised) code:
LINE_NB=0
while IFS= read -r LINE; do
LINE_NB=$((LINE_NB+1))
CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< ${LINE})
echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'"
done <"${FILE}"
The while loop ends prematurely after a few hundred iterations. However, the loop works correctly if I remove the CMD=$(sed...) part. So, evidently, there is some interference I cannot spot.
As I read here, I also tried:
LINE_NB=0
while IFS= read -r -u4 LINE; do
LINE_NB=$((LINE_NB+1))
CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< ${LINE})
echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'"
done 4<"${FILE}"
but nothing changes. Any explanation for this behaviour, and help on how I can solve it?
Thanks!
To clarify the situation for user1934428 (thanks for your interest!), I have now created a minimal script and added "set -x". The full script is as follows:
#!/usr/bin/env bash
set -x
FILE="$1"
LINE_NB=0
while IFS= read -u "$file_fd" -r LINE; do
LINE_NB=$((LINE_NB+1))
CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< "${LINE}")
echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'" #, TIME='${TIME}' "
done {file_fd}<"${FILE}"
echo "Done."
The input file is a list of ~20k lines of the form:
S1 0.018206
L1 0.018966
F1 0.006833
S2 0.004212
L2 0.008005
I8R190 18.3791
I4R349 18.5935
...
The while loop ends prematurely at (seemingly) random points. One possible output is:
+ FILE=20k/ir-collapsed.txt
+ LINE_NB=0
+ IFS=
+ read -u 10 -r LINE
+ LINE_NB=1
++ sed 's/\([^ ]*\) .*/\1/'
+ CMD=S1
+ echo '[1] S1 0.018206: CMD='\''S1'\'''
[1] S1 0.018206: CMD='S1'
...[snip]...
+ echo '[6510] S1514 0.185504: CMD='\''S1514'\'''
[6510] S1514 0.185504: CMD='S1514'
+ IFS=
+ read -u 10 -r LINE
+ echo Done.
Done.
As you can see, the loop ends prematurely after line 6510, while the input file is ~20k lines long.
Yes, making a stable copy of the file first is a good start.
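Something along these lines, for example (a sketch only; the temp-file handling is my assumption, not part of the original script):
tmp=$(mktemp) || exit 1
cp -- "${FILE}" "${tmp}"     # snapshot the file so nothing can rewrite it mid-loop
# ... run the while/read loop with:  done <"${tmp}"  ...
rm -f -- "${tmp}"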
Learning awk and/or perl is still well worth your time. It's not as hard as it looks. :)
Aside from that, a couple of optimizations - try never to run an external program inside a loop when you can avoid it. For a 20k line file, that's 20k seds, which really adds up unnecessarily. Instead you could just use parameter expansion for this one.
# don't use all caps.
# cmd=$(sed "s/\([^ ]*\) .*/\1/" <<< "${line}") becomes
cmd="${cmd%% *}" # strip everything from the first space
Using read to handle the splitting is even better, since you were already using it anyway - but don't spawn another one if you can avoid it. As much as I love it, read is pretty inefficient; it has to do a lot of fiddling to handle all its options.
while read -u "$file_fd" -r cmd timeval; do
echo "[$((++line_nb))] CMD='${cmd}' TIME='${timeval}'"
done {file_fd}<"${file}"
or
while read -u "$file_fd" -r -a tok; do
echo "[$((++line_nb))] LINE='${tok[*]}' CMD='${tok[0]}' TIME='${tok[1]}'"
done {file_fd}<"${file}"
(This will sort of rebuild the line, but if there were tabs or extra spaces, etc, it will only pad with the 1st char of $IFS, which is a space by default. Shouldn't matter here.)
awk would have made short work of this, though, and been a lot faster, with better tools already built in.
awk '{printf "NR=[%d] LINE=[%s] CMD=[%s] TIME=[%s]\n",NR,$0,$1,$2 }' 20k/ir-collapsed.txt
Run some time comparisons - with and without the sed, with one read vs two, and then compare each against the awk. :)
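For example, something like this (a sketch; the three script names are made-up placeholders for the variants above):
for script in loop-with-sed.sh loop-param-expansion.sh loop-read-fields.sh; do
    printf '== %s ==\n' "$script"
    time bash "$script" 20k/ir-collapsed.txt >/dev/null
done
printf '== awk ==\n'
time awk '{printf "NR=[%d] LINE=[%s] CMD=[%s] TIME=[%s]\n",NR,$0,$1,$2}' 20k/ir-collapsed.txt >/dev/null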
The more things you have to do with each line, and the more lines there are in the file, the more it will matter. Make it a habit to do even small things as neatly as you can - it will pay off well in the long run.
Related
I'm writing a script in bash and I would like to know if there is an alternative way to write these sed commands (without using sed):
sed '1,11d;$d' "${SOTTOCARTELLA}"/file
sed '1,11!d' "${SOTTOCARTELLA}"/file
sed '1,11d' -i "${SOTTOCARTELLA}"/file1
With sed '1,11!d' "${SOTTOCARTELLA}"/file you are asking for the first 11 lines of the file;
With sed '1,11d' -i "${SOTTOCARTELLA}"/file1 you are asking for the entire file except for the first 11 lines.
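For reference, the usual external-tool equivalents would look roughly like this (a sketch; head -n -1 needs GNU coreutils, and these only print - they don't rewrite the file in place as -i does):
head -n 11 "${SOTTOCARTELLA}/file"                 # same output as sed '1,11!d'
tail -n +12 "${SOTTOCARTELLA}/file"                # same output as sed '1,11d'
tail -n +12 "${SOTTOCARTELLA}/file" | head -n -1   # same output as sed '1,11d;$d'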
If you don't want to use head, tail or other binaries as suggested, you can achieve the same results using read and some support variables.
For example, let's try sed '1,11!d' "${SOTTOCARTELLA}"/file.
You will need a start point and an end point (and of course, the file).
start=1
end=11
counter="$((start - 1))";
file="${SOTTOCARTELLA}/file"
exec 3<"${file}" ### Create file descriptor 3
while IFS= read -r line <&3; do ### Read file line by line
if [ "${counter}" -lt "${end}" ]; then ### If I'm in my "bundaries"
printf "%s\n" "${line}" ### Print the line
fi
counter="$((counter + 1))"
done
exec 3>&- ### Close file descriptor 3
Note that this piece of code could be much better (e.g. adding a check on the counter in the while condition, as in the sketch after this list), but it is the minimum you need in order to understand two things:
sed, head, tail, awk, etc. were born so that you don't have to rewrite the same routines over and over, and also for performance reasons; this is why everyone, including me, will tell you to use them.
This kind of code is useful only for portability concerns, which is why I wrote this piece in a POSIX-compliant way.
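A sketch of that improvement: checking the counter in the loop condition means the rest of the file is never read at all (still plain POSIX sh).
start=1
end=11
counter="$((start - 1))"
file="${SOTTOCARTELLA}/file"
exec 3<"${file}"                                      ### Create file descriptor 3
while [ "${counter}" -lt "${end}" ] && IFS= read -r line <&3; do
    printf "%s\n" "${line}"                           ### Print the line
    counter="$((counter + 1))"
done
exec 3>&-                                             ### Close file descriptor 3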
This question already has answers here:
Bash while read loop extremely slow compared to cat, why?
When I do this with awk it's relatively fast, even though it's Row By Agonizing Row (RBAR). I tried to make a quicker, more elegant, bug-resistant solution in Bash that would only have to make far fewer passes through the file. It takes probably 10 seconds to get through the first 1,000 lines with bash using this code. I can make 25 passes through all million lines of the file with awk in about the same time! How come bash is several orders of magnitude slower?
while read line
do
FIELD_1=`echo "$line" | cut -f1`
FIELD_2=`echo "$line" | cut -f2`
if [ "$MAIN_REF" == "$FIELD_1" ]; then
#echo "$line"
if [ "$FIELD_2" == "$REF_1" ]; then
((REF_1_COUNT++))
fi
((LINE_COUNT++))
if [ "$LINE_COUNT" == "1000" ]; then
echo $LINE_COUNT;
fi
fi
done < temp/refmatch
Bash is slow. That's just the way it is; it's designed to oversee the execution of specific tools, and it was never optimized for performance.
All the same, you can make it less slow by avoiding obvious inefficiencies. For example, read will split its input into separate words, so it would be both faster and clearer to write:
while read -r field1 field2 rest; do
# Do something with field1 and field2
instead of
while read line
do
FIELD_1=`echo "$line" | cut -f1`
FIELD_2=`echo "$line" | cut -f2`
Your version sets up two pipelines and creates four children (at least) for every line of input, whereas using read the way it was designed requires no external processes whatsoever.
If you are using cut because your lines are tab-separated and not just whitespace-separated, you can achieve the same effect with read by setting IFS locally:
while IFS=$'\t' read -r field1 field2 rest; do
# Do something with field1 and field2
Even so, don't expect it to be fast. It will just be less agonizingly slow. You would be better off fixing your awk script so that it doesn't require multiple passes. (If you can do that with bash, it can be done with awk and probably with less code.)
Note: I set three variables rather than two, because read puts the rest of the line into the last variable. If there are only two fields, no harm is done; setting a variable to an empty string is something bash can do reasonably rapidly.
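For example, the whole counting job from the original script can be done in a single awk pass (just a sketch; MAIN_REF and REF_1 are assumed to be shell variables handed to awk via -v):
awk -F'\t' -v main="$MAIN_REF" -v ref1="$REF_1" '
    $1 == main { line_count++; if ($2 == ref1) ref1_count++ }
    END { print line_count+0, ref1_count+0 }
' temp/refmatch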
As #codeforester points out, the original bash script spawns too many subprocesses.
Here's the modified version to minimize the overheads:
#!/bin/bash
while IFS=$'\t' read -r FIELD_1 FIELD_2 others; do
if [[ "$MAIN_REF" == "$FIELD_1" ]]; then
#echo "$line"
if [[ "$FIELD_2" == "$REF_1" ]]; then
let REF_1_COUNT++
fi
let LINE_COUNT++
echo "$LINE_COUNT"
if [[ "$LINE_COUNT" == "1000" ]]; then
echo "$LINE_COUNT"
fi
fi
done < temp/refmatch
It runs more than 20 times faster than the original one, but I'm afraid that may be the limit of what a bash script can do.
I am reading a long log file and splitting the columns into variables using bash.
cd $LOGDIR
IFS=","
while read LogTIME name md5
do
LogTime+="$(echo $LogTIME)"
Name+="$(echo $name)"
LOGDatamd5+="$(echo $md5)"
done < LOG.txt
But this is really slow and I don't need all the lines. The last 100 lines are enough (but the log file itself needs all the other lines for different programs).
I tried tail -n 10 LOG.txt | while read LogTIME name md5, but that also takes a really long time and gave me no output at all.
Another way I tested without success was:
cd $LOGDIR
foo="$(tail -n 10 LOG.txt)"
IFS=","
while read LogTIME name md5
do
LogTime+="$(echo $LogTIME)"
Name+="$(echo $name)"
LOGDatamd5+="$(echo $md5)"
done < "$foo"
But that gives me only the output of foo in total. Nothing was written into the variables inside the while loop.
There is probably a really easy way to do this, that I can't see...
Cheers,
BallerNacken
Process substitution is the common pattern here. (A plain pipeline such as tail -n 100 LOG.txt | while read ... runs the loop in a subshell, so everything assigned to the variables there is lost once the pipeline ends.)
while IFS="," read -r LogTIME name md5 ; do
LogTime+=$LogTIME
Name+=$name
LogDatamd5+=$md5
done < <(tail -n100 LOG.txt)
Note that you don't need "$(echo $var)", you can assign $var directly.
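As a quick sanity check (a sketch), the accumulated variables are still visible after the loop, because no subshell was involved:
echo "collected ${#LogDatamd5} characters of md5 data"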
Many people have shown how to keep spaces when reading a line in bash. But I have a character-based algorithm which needs to process each and every character separately - spaces included. Unfortunately I am unable to get bash read to read a single space character from the input.
while read -r -n 1 c; do
printf "[%c]" "$c"
done <<< "mark spitz"
printf "[ ]\n"
yields
[m][a][r][k][][s][p][i][t][z][][ ]
I've hacked my way around this, but it would be nice to figure out how to read any single character.
Yep, tried setting IFS, etc.
Just set the input field separator(a) so that it doesn't treat space (or any other character) as a delimiter; that works just fine:
printf 'mark spitz' | while IFS="" read -r -n 1 c; do
printf "[%c]" "$c"
done
echo
That gives you:
[m][a][r][k][ ][s][p][i][t][z]
You'll notice I've also slightly changed how you're getting the input there: <<< appends an extraneous character (a trailing newline) at the end and, while it's not important to the input method itself, I thought it best to change that to avoid any confusion.
(a) Yes, I'm aware that you said you've tried setting IFS but, since you didn't actually show how you'd tried this, and it appears to work fine the way I do it, I have to assume you may have just done something wrong.
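If you would rather keep the here-string, here is a sketch that simply skips the trailing newline that <<< appends:
while IFS= read -r -n 1 c; do
    [ -n "$c" ] || continue    # read -n 1 returns an empty c for the newline
    printf "[%c]" "$c"
done <<< "mark spitz"
echo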
I have a file which has very long rows of data. When I try to read it using a shell script, the data comes out on multiple lines, i.e., it breaks at certain points.
Example row:
B_18453583||Active|917396140129|405819121107402|Active|7396140129||7396140129|||||||||18-MAY-10|||||18-MAY-10|405819121107402|Outgoing International Calls,Outgoing Calls,WAP,Call Waiting,MMS,Data Service,National Roaming-Voice,Outgoing International Calls except home country,Conference Call,STD,Call Forwarding-Barr,CLIP,Incoming Calls,INTSNS,WAPSNS,International Roaming-Voice,ISD,Incoming Calls When Roaming Internationally,INTERNET||For You Plan||||||||||||||||||
All this is the content of a single line.
I use a normal read like this:
var=`cat pranay.psv`
for i in $var; do
echo $i
done
The output comes as:
B_18453583||Active|917396140129|405819121107402|Active|7396140129||7396140129|||||||||18- MAY-10|||||18-MAY-10|405819121107402|Outgoing
International
Calls,Outgoing
Calls,WAP,Call
Waiting,MMS,Data
Service,National
Roaming-Voice,Outgoing
International
Calls
except
home
country,Conference
Call,STD,Call
Forwarding-Barr,CLIP,Incoming
Calls,INTSNS,WAPSNS,International
Roaming-Voice,ISD,Incoming
Calls
When
Roaming
Internationally,INTERNET||For
You
Plan||||||||||||||||||
How do I print it all on a single line?
Please help.
Thanks
This is because of word splitting. An easier way to do this (which also dispenses with the useless use of cat) is this:
while IFS= read -r -d $'\n' -u 9
do
echo "$REPLY"
done 9< pranay.psv
To explain in detail:
$'...' can be used to create human readable strings with escape sequences. See man bash.
IFS= is necessary to prevent characters in IFS from being stripped from the start and end of $REPLY.
-r avoids interpreting backslashes in the text specially.
-d $'\n' splits lines by the newline character.
Reading from file descriptor 9 instead of standard input avoids having greedy commands (like cat) inside the loop eat all of the data.
You need proper quoting. In your case, you should use the command read:
while read line ; do
echo "$line"
done < pranay.psv
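For completeness, combining this with the flags discussed in the previous answer gives a slightly more defensive variant (it also preserves leading whitespace and backslashes):
while IFS= read -r line ; do
    printf "%s\n" "$line"
done < pranay.psv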