How to reduce run time of shell script? [duplicate] - shell

This question already has answers here:
Take nth column in a text file
(6 answers)
Closed 2 years ago.
I have written a simple script that takes data from a text file (which has space-separated columns and 1.5 million rows) and writes out the specified column. But this code takes more than an hour to execute. Can anyone help me optimize the runtime?
a=0
cat 1c_input.txt/$1 | while read p
do
    IFS=" "
    for i in $p
    do
        a=`expr $a + 1`
        if [ $a -eq $2 ]
        then
            echo "$i"
        fi
    done
    a=0
done >> ./1.c.$2.column.freq
some lines of sample input:
1 ib Jim 34
1 cr JoHn 24
1 ut MaRY 46
2 ti Jim 41
2 ye john 6
2 wf JoHn 22
3 ye jOE 42
3 hx jiM 21
some lines of sample output if the second argument entered is 3:
Jim
JoHn
MaRY
Jim
john
JoHn
jOE
jiM

I guess you are trying to print just one column; in that case do something like:
#! /bin/bash
awk -v c="$2" '{print $c}' "1c_input.txt/$1" >> "./1.c.$2.column.freq"
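To try it against the question's sample rows, a quick demo (the scratch directory and the sample.txt name are just for illustration):

```shell
cd "$(mktemp -d)"   # scratch directory for the demo
printf '%s\n' '1 ib Jim 34' '1 cr JoHn 24' '1 ut MaRY 46' > sample.txt
awk -v c=3 '{print $c}' sample.txt   # prints Jim, JoHn, MaRY, one per line
```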

If you just want something faster, use a utility like cut. To extract the third
field from a single-space-delimited file bigfile, do:
cut -d ' ' -f 3 bigfile
To optimize the shell code in the question, using only builtin shell
commands, do something like:
while read a b c d; do echo "$c"; done < bigfile
...if the field to be printed is a command-line parameter, there are
several shell methods, but they're all based on that line.
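If the field number itself must come from a positional parameter, one builtin-only sketch reads each line into an array (print_col is a made-up helper name, and whitespace-separated fields are assumed):

```shell
# print_col FILE N -- print the Nth whitespace-separated field of
# every line in FILE, using only bash builtins (fields count from 1).
print_col() {
    local file=$1 col=$2
    local -a fields
    while read -ra fields; do
        printf '%s\n' "${fields[col-1]}"
    done < "$file"
}
```

Expect this to remain far slower than cut or awk on a 1.5-million-row file; the win over the original loop is only that it avoids one expr fork per field.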

Related

Return the read cursor to the start of the file

I'm trying to read two files (name, number) at the same time and get the value of each possible pair.
The two files look like this:
*name1
John
*name2
Paul
*number1
25
*number2
45
What I'm trying to obtain are labels and results like:
*name1 *number1 John 25
*name1 *number2 John 45
*name2 *number1 Paul 25
*name2 *number2 Paul 45
Since I come from Python, I tried to do it with two nested loops like this:
name=/home/davide/name.txt
number=/home/davide/number.txt
while read name; do
    if [[ ${name:0:1} == "*" ]]; then
        n=$(echo $name)
    else
        while read number; do
            if [[ ${number:0:1} == "*" ]]; then
                echo $number $n
            else
                echo $name $number
            fi
        done < $number
    fi
done < $name
I have the first two pairs, so my guess is that I need a command to go back to the beginning of number again (like seek(0) in Python), but I haven't found a similar one for bash.
I also get an "ambiguous redirect" error and I don't understand why.
After setting up your input files:
printf >name.txt '%s\n' '*name1' John '*name2' Paul
printf >number.txt '%s\n' '*number1' 25 '*number2' 45
...the following code:
#!/usr/bin/env bash
name_file=name.txt
number_file=number.txt
while IFS= read -r name1 && IFS= read -r value1; do
    while IFS= read -r name2 && IFS= read -r value2; do
        printf '%s\n' "$name1 $name2 $value1 $value2"
    done <"$number_file"
done <"$name_file"
...properly outputs:
*name1 *number1 John 25
*name1 *number2 John 45
*name2 *number1 Paul 25
*name2 *number2 Paul 45
What changed?
We stopped using name and number both for the filenames and for the values read from them. Because of this, when you ran <$number, it no longer had the filename number.txt in it after the first iteration; likewise for $name.
We started quoting all expansions ("$foo", not $foo). See the http://shellcheck.net/ warning SC2086, and BashPitfalls #14, explaining why even echo $foo is buggy.
Running read with the -r argument and IFS set to an empty value prevents it from consuming literal backslashes or pruning leading and trailing whitespace.
Using two reads inside the condition of each while loop lets us read two lines at a time from each file (as is appropriate, given the intent to process content in pairs).
Bash operates more easily on "streams", not on the data itself.
First, substitute every second newline (starting from the first) with a tab, a space, or another separator.
Then "paste" the files together.
Then rearrange the columns, from *name1 John *number1 25 to *name1 *number1 John 25:
cat >name.txt <<EOF
*name1
John
*name2
Paul
EOF
cat <<EOF >number.txt
*number1
25
*number2
45
EOF
paste <(<name.txt sed 'N;s/\n/\t/') <(<number.txt sed 'N;s/\n/\t/') |
awk '{print $1,$3,$2,$4}'
will output:
*name1 *number1 John 25
*name2 *number2 Paul 45
First, in your example you overwrite the variable $number. So you have issues reading the file $number beginning from the second loop run.
Solution with paste
Command paste can combine multiple files, and with option -d line-by-line.
#!/usr/bin/env bash
name=/home/davide/name.txt
number=/home/davide/number.txt
# combine both files line-by-line
paste $'-d\n' "$name" "$number" |
while read nam
do
    # after reading the name into 'nam', read the number into 'num':
    read num
    # print both
    echo "$nam $num"
done
If you want TABs or any other separator and no other processing, you don't need the while loop. Examples:
paste "$name" "$number"
paste -d: "$name" "$number"
paste -d\| "$name" "$number"
$ cat tst.awk
NR==FNR {
    if ( NR%2 ) {
        tags[++numPairs] = $0
    }
    else {
        vals[numPairs] = $0
    }
    next
}
!(NR%2) {
    for (pairNr=1; pairNr<=numPairs; pairNr++) {
        print prev, tags[pairNr], $0, vals[pairNr]
    }
}
{ prev = $0 }
$ awk -f tst.awk number.txt name.txt
*name1 *number1 John 25
*name1 *number2 John 45
*name2 *number1 Paul 25
*name2 *number2 Paul 45
In your script, you use the variable name for both the file path and the while-loop variable. This causes the "ambiguous redirect" error. Two lines need fixing, e.g.:
name_file=/home/davide/name.txt
done < $name_file
No need for seek(0) in shell scripts. Just process the file again, e.g.:
while read line ; do
    echo "== $line =="
done < /some/file

while read line ; do
    echo "--> ${line:0:1}"
done < /some/file
This is less efficient and less flexible than a more real programming language where you can seek(). But that's about differences, advantages and disadvantages between shell scripting and programming.
By the way, this line:
n=$(echo $name)
... is merely an awkward way of just doing:
n=$name
It can also cause your script to behave quite unpredictably when $name contains a special character like *. And since $name is read from a text file, this is not unlikely to happen. (Thanks to Charles Duffy for making this point.)
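A minimal demonstration of that hazard (the scratch directory and file names are invented for the demo):

```shell
cd "$(mktemp -d)"      # empty scratch directory
touch file_a file_b
name='*'
n=$(echo $name)        # unquoted: the shell glob-expands '*' first
echo "$n"              # prints: file_a file_b
n=$name                # plain assignment: no globbing, no word splitting
echo "$n"              # prints: *
```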

Bash for loop through array

I'm writing a little bash script for work. But now I'm stuck. Let me just show you the code and explain:
# I have an `array` with names
NAMES=(Skypper Lampart Shepard Ryan Dean Jensen)
Now I wanna iterate through the names:
for (( i = 0; i < 6; i++ )); do
    COMMAND="sed -i ${i+2}s/.*/${NAMES[${i}]}"
    ${COMMAND} config.txt
done
config.txt is a file with 2 numbers and names and I just wanna replace the names.
1
2
Name 1
Name 2
Name 3
Name 4
Name 5
Name 6
My problem is in the for loop: how can I compute $i + 2? So if $i is 1, it should be 3.
Expected output:
1
2
Skypper
Lampart
Shepard
Ryan
Dean
Jensen
Bash is good at reading arrays (something you could have easily searched for).
Try something like:
for idx in "${!NAMES[@]}"
do
    sed -i "$((idx + 2))s/.*/${NAMES[idx]}/" config.txt
done
You will find that placing commands inside variables can also come unstuck unless you know what you are doing, so just use the command as intended :)
You might also need to remember that indexes start at zero and not 1
If I understood what you want to accomplish (replace "Name" with a string from the NAMES array, the problem being that the array index starts from 0 while you want to start on the 3rd line) - a dirty and quick solution is to add 2 empty strings to the beginning of your array and start your loop from the position you want.
Use this:
NAMES=(Skypper Lampart Shepard Ryan Dean Jensen)
line=2 # Need to skip first 2 lines
for name in "${NAMES[@]}"
do
    ((line++))
    sed -i "${line}s/.*/$name/g" config.txt
done
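Since each sed -i rewrites config.txt in full, six names mean six rewrites. One way to do it in a single pass is to build the whole sed script first; a sketch (it recreates the question's sample file so the demo is self-contained, and uses GNU sed's -i as the answers here do):

```shell
cd "$(mktemp -d)"   # scratch directory so the demo file is disposable
printf '%s\n' 1 2 'Name 1' 'Name 2' 'Name 3' 'Name 4' 'Name 5' 'Name 6' > config.txt

NAMES=(Skypper Lampart Shepard Ryan Dean Jensen)
script=""
line=2                                # names start on line 3
for name in "${NAMES[@]}"; do
    ((line++))
    script+="${line}s/.*/$name/;"     # e.g. "3s/.*/Skypper/;"
done
sed -i "$script" config.txt           # one rewrite instead of six
cat config.txt
```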
You can try something like this:
NAMES=(Skypper Lampart Shepard Ryan Dean Jensen)
for (( i = 0; i < 6; i++ )); do
    b=$(( $i + 2 ))
    COMMAND="sed -i ${b}s/.*/${NAMES[${i}]}/"
    echo "$COMMAND"
    # ${COMMAND} config.txt
done
Which gives me something like the following output:
# sh test.sh
sed -i 2s/.*/Skypper/
sed -i 3s/.*/Lampart/
sed -i 4s/.*/Shepard/
sed -i 5s/.*/Ryan/
sed -i 6s/.*/Dean/
sed -i 7s/.*/Jensen/
A bit late answer... :)
In your code you call sed n times, which is inefficient. So here is a different solution, using ed instead of sed (as in the good old times 30 years ago in BSD 2.9 :) ).
The approach:
first create the commands for ed
then execute them in one editor invocation
# it is good practice not to use UPPERCASE variables
# as they could collide with ENV variables
names=(Skypper Lampart Shepard Ryan Dean Jensen)
file="config.txt"
#create an array of commands for the "ed"
declare -a cmd
for name in "${names[@]}"; do
    cmd+=("/Name/s//$name/")
done
cmd+=(w q)
echo "=== [$file before] ==="
cat "$file"
echo "=== [commands for execution ]==="
printf "%s\n" "${cmd[#]}"
#execute the prepared command in the "ed"
printf "%s\n" "${cmd[#]}" | ed -s "$file"
echo "===[ $file after ]==="
cat "$file"
output from the above
=== [config.txt before] ===
1
2
Name 1
Name 2
Name 3
Name 4
Name 5
Name 6
=== [commands for execution ]===
/Name/s//Skypper/
/Name/s//Lampart/
/Name/s//Shepard/
/Name/s//Ryan/
/Name/s//Dean/
/Name/s//Jensen/
w
q
===[ config.txt after ]===
1
2
Skypper 1
Lampart 2
Shepard 3
Ryan 4
Dean 5
Jensen 6
A variant which replaces by line numbers:
names=(Skypper Lampart Shepard Ryan Dean Jensen)
file="config.txt"
#create an array of commands for the "ed"
declare -a cmd
n=3
for name in "${names[@]}"; do
    cmd+=("${n}s/.*/$name/")
    let n++
done
cmd+=(w q)
echo "=== [$file before] ==="
cat "$file"
echo "=== [commands for execution ]==="
printf "%s\n" "${cmd[#]}"
#execute the prepared command in the "ed"
printf "%s\n" "${cmd[#]}" | ed -s "$file"
echo "===[ $file after ]==="
cat "$file"
output
=== [config.txt before] ===
1
2
Name 1
Name 2
Name 3
Name 4
Name 5
Name 6
=== [commands for execution ]===
3s/.*/Skypper/
4s/.*/Lampart/
5s/.*/Shepard/
6s/.*/Ryan/
7s/.*/Dean/
8s/.*/Jensen/
w
q
===[ config.txt after ]===
1
2
Skypper
Lampart
Shepard
Ryan
Dean
Jensen

How to get the length of each word in a column without AWK, sed or a loop? [duplicate]

This question already has answers here:
Length of string in bash
(11 answers)
Closed 6 years ago.
Is it even possible? I currently have a one-liner to count the number of words in a file. If I output what I currently have it looks like this:
3 abcdef
3 abcd
3 fec
2 abc
This is all done in one line without loops, and I was wondering if I could add a column with the length of each word. I was thinking I could use wc -m to count the characters, but I don't know if I can do that without a loop?
As seen in the title, no AWK, sed, perl.. Just good old bash.
What I want:
3 abcdef 6
3 abcd 4
3 fec 3
2 abc 3
Where the last column is length of each word.
while read -r num word; do
    printf '%s %s %s\n' "$num" "$word" "${#word}"
done < file
You can do something like this also:
File
> cat test.txt
3 abcdef
3 abcd
3 fec
2 abc
Bash script
> cat test.txt.sh
#!/bin/bash
while read line; do
    items=($line)        # split the line
    strlen=${#items[1]}  # get the 2nd item's length
    echo $line $strlen   # print the line and the length
done < test.txt
Results
> bash test.txt.sh
3 abcdef 6
3 abcd 4
3 fec 3
2 abc 3

How to sum a row of numbers from text file-- Bash Shell Scripting

I'm trying to write a bash script that calculates the average of numbers by rows and columns. An example of a text file that I'm reading in is:
1 2 3 4 5
4 6 7 8 0
There is an unknown number of rows and unknown number of columns. Currently, I'm just trying to sum each row with a while loop. The desired output is:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
And so on and so forth with each row. Currently this is the code I have:
while read i
do
    echo "num: $i"
    (( sum=$sum+$i ))
    echo "sum: $sum"
done < $2
To call the program it's stats -r test_file ("-r" indicates rows; I haven't started columns quite yet). My current code actually just takes the first number of each column and adds them together, and then the rest of the numbers error out as a syntax error. It says the error comes from line 16, which is the (( sum=$sum+$i )) line, but I honestly can't figure out what the problem is. I should tell you I'm extremely new to bash scripting, and I have googled and searched high and low for the answer and can't find it. Any help is greatly appreciated.
You are reading the file line by line, and a whole line is not a single number, so the arithmetic fails. Try this:
while read i
do
    sum=0
    for num in $i
    do
        sum=$(($sum + $num))
    done
    echo "$i Sum = $sum"
done < $2
Just split out each number from every line using the inner for loop. I hope this helps.
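The same idea can let read do the splitting into an array; a builtin-only sketch (sum_rows is a made-up helper name):

```shell
# sum_rows FILE -- print each row of whitespace-separated integers
# followed by "Sum = <total>", using only bash builtins.
sum_rows() {
    local -a nums
    local sum n
    while read -ra nums; do
        sum=0
        for n in "${nums[@]}"; do
            sum=$(( sum + n ))
        done
        printf '%s Sum = %s\n' "${nums[*]}" "$sum"
    done < "$1"
}
```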
Another, non-bash way (con: the OP asked for bash; pro: does not depend on bashisms and works with floats):
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}'
Another way (not pure bash):
while read line
do
    sum=$(sed 's/[ ]\+/+/g' <<< "$line" | bc -q)
    echo "$line Sum = $sum"
done < filename
Using the numsum -r util covers the row addition, but the output format needs a little glue, by inefficiently paste-ing a few utils:
paste "$2" \
<(yes "Sum =" | head -$(wc -l < "$2") ) \
<(numsum -r "$2")
Output:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
Note -- to run the above line on a given file foo, first initialize $2 like so:
set -- "" foo
paste "$2" <(yes "Sum =" | head -$(wc -l < "$2") ) <(numsum -r "$2")

Shell script numbering lines in a file

I need to find a faster way to number lines in a file in a specific way using tools like awk and sed. I need the first character on each line to be numbered in this fashion: 1,2,3,1,2,3,1,2,3 etc.
For example, if the input was this:
line 1
line 2
line 3
line 4
line 5
line 6
line 7
The output needs to look like this:
1line 1
2line 2
3line 3
1line 4
2line 5
3line 6
1line 7
Here is a chunk of what I have. $lines is the number of lines in the data file divided by 3. So for a file of 21000 lines I process this loop 7000 times.
export i=0
while [ $i -le $lines ]
do
export start=`expr $i \* 3 + 1`
export end=`expr $start + 2`
awk NR==$start,NR==$end $1 | awk '{printf("%d%s\n", NR,$0)}' >> data.out
export i=`expr $i + 1`
done
Basically this grabs 3 lines at a time, numbers them, and adds to an output file. It's slow...and then some! I don't know of another, faster, way to do this...any thoughts?
Try the nl command.
See https://linux.die.net/man/1/nl (or run man nl at a shell prompt).
The nl utility reads lines from the named file, or the standard input if the file argument is omitted, applies a configurable line numbering filter operation, and writes the result to the standard output.
edit: No, that's wrong, my apologies. The nl command doesn't have an option for restarting the numbering every n lines, it only has an option for restarting the numbering after it finds a pattern. I'll make this answer a community wiki answer because it might help someone to know about nl.
It's slow because you are reading the same lines over and over. Also, you are starting up an awk process only to shut it down and start another one. Better to do the whole thing in one shot:
awk '{print ((NR-1)%3)+1 $0}' $1 > data.out
If you prefer to have a space after the number:
awk '{print ((NR-1)%3)+1, $0}' $1 > data.out
Perl comes to mind:
perl -pe '$_ = (($.-1)%3)+1 . $_'
should work. No doubt there is an awk equivalent. Basically, ((line# - 1) MOD 3) + 1.
This might work for you:
sed 's/^/1/;n;s/^/2/;n;s/^/3/' input
Another way is just to use grep and match everything. For example this will enumerate files:
grep -n '.*' <<< "$(ls -1)"
Output will be:
1:file.a
2:file.b
3:file.c
awk '{printf "%d%s\n", ((NR-1) % 3) + 1, $0;}' "$#"
Python
import sys
for count, line in enumerate(sys.stdin):
    sys.stdout.write("%d%s" % (1 + (count % 3), line))
You don't need to leave bash for this:
i=0; while read; do echo "$((i++ % 3 + 1))$REPLY"; done < input
This should solve the problem. In awk, _ is just an uninitialized variable (numerically zero), so $_ evaluates to $0, the whole line:
awk '{print ((NR-1)%3+1) $_}' < input
1line 1
2line 2
3line 3
1line 4
2line 5
3line 6
1line 7
# cat input
line 1
line 2
line 3
line 4
line 5
line 6
line 7
