Which of the two is better for file manipulation? - bash

I have a file 'tbook1' with a lot of numerical values (2M+ lines). I have to perform the following in bash (Solaris / RHEL):
Remove the 1st line and the last 2 lines
Remove (,") and (")
Substitute (, ) with (,)
I can do it using two methods:
Method 1:
sed -e 1d -e 's/,"//g' -e 's/, /,/g' -e 's/"//g' -e 'N;$!P;$!D;$d' tbook1 > tbook1.3
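For reference, the trailing expression 'N;$!P;$!D;$d' is the standard sed idiom for deleting the last two lines: it keeps a sliding two-line window in the pattern space, printing the first line of the window until end of input, then deleting whatever remains. You can check it on a toy input:
$ printf 'a\nb\nc\nd\n' | sed -e 'N;$!P;$!D;$d'
a
b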
Method 2:
tail -n +2 tbook1 | head -n -2 > tbook1.1
sed -e 's/,"//g' -e 's/, /,/g' -e 's/"//g' tbook1.1 > tbook1.2
I want to know which one is better, i.e. faster and more efficient in resource usage.

Method 1 would usually be more efficient, mainly because method 2 has an extra pipe and an intermediate file that must be written and then read again.

Method one scans the file only once and writes one result (but please store the result in a file with a different name).
Method two scans the original file and the intermediate result, and writes both the intermediate and the final result. It is bound to be about twice as slow.

I think head and tail are more efficient for this line-elimination task than pure sed, but the other two answers are also right: you should avoid running several passes.
You can improve the second method by chaining the commands together:
tail -n +2 book.txt | head -n -2 | sed -e 's/,"//g' -e 's/, /,/g' -e 's/"//g'
Then head and tail are faster. Try it yourself (on a reasonably sized file):
#!/usr/bin/env bash
target=/dev/null
run_test() {    # renamed from "test" to avoid shadowing the shell builtin
    mode=$1
    start=$(date +%s)
    if [ "$mode" = 1 ]; then
        sed -e 1d -e 's/,"//g' -e 's/, /,/g' -e 's/"//g' -e 'N;$!P;$!D;$d' book.txt > "$target"
    elif [ "$mode" = 2 ]; then
        tail -n +2 book.txt | head -n -2 | sed -e 's/,"//g' -e 's/, /,/g' -e 's/"//g' > "$target"
    else
        cat book.txt > /dev/null    # baseline: raw read speed
    fi
    echo "$(( $(date +%s) - start )) seconds"
}
echo "cat > /dev/null"
run_test 0
echo "sed > $target"
run_test 1
echo "tail/head > $target"
run_test 2
My results:
cat > /dev/null
0 seconds
sed > /dev/null
5 seconds
tail/head > /dev/null
3 seconds


How to find all non-dictionary words in a file in bash/zsh?

I'm trying to find all words in a file that don't exist in the dictionary. If I look for a single word, the following works:
b=ther; look $b | grep -i "^$b$" | ifne -n echo $b => ther
b=there; look $b | grep -i "^$b$" | ifne -n echo $b => [no output]
However, if I try to run a "while read" loop
while read a; do look $a | grep -i "^$a$" | ifne -n echo "$a"; done < <(tr -s '[[:punct:][:space:]]' '\n' <lotr.txt |tr '[:upper:]' '[:lower:]')
the output seems to contain all (?) words in the file. Why doesn't this loop only output non-dictionary words?
Regarding ifne
If stdin is non-empty, ifne -n reprints stdin to stdout. From the manpage:
-n Reverse operation. Run the command if the standard input is empty
Note that if the standard input is not empty, it is passed through
ifne in this case.
strace on ifne confirms this behavior. So whenever $a is a dictionary word, grep produces non-empty output, ifne passes that output straight through, and the word is printed anyway; that is why you see (nearly) every word in the file.
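A quick way to see both modes (assuming moreutils' ifne is installed):
$ echo hello | ifne -n echo "stdin was empty"
hello
$ ifne -n echo "stdin was empty" < /dev/null
stdin was empty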
Alternative
Perhaps, as an alternative:
#!/bin/bash -e

export PATH=/bin:/sbin:/usr/bin:/usr/sbin

while read a; do
    look "$a" | grep -qi "^$a$" || echo "$a"   # print the word only if look finds no exact match
done < <(
    tr -s '[[:punct:][:space:]]' '\n' < lotr.txt \
        | tr '[A-Z]' '[a-z]' \
        | sort -u \
        | grep .    # drop empty lines
)

What does the following Linux script on page 70 of the book "Designing Data-Intensive Applications" by Martin Kleppmann mean?

#!/bin/bash
db_set () {
    echo "$1,$2" >> database
}
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
What does db_get() do?
Especially the sed -e "s/^$1,//" part.
db_get() prints the last value stored for the key $1.
$1 and $2 are the arguments passed to the function, e.g. $1=money, $2=34.
grep "^$1," database lists all lines starting with $1 followed by a comma.
sed -e "s/^$1,//" then strips the "key," prefix, so that only the values remain.
tail -n 1 prints only the last line, i.e. the most recently written value.
You can try this out yourself, e.g.:
$ cat database
jack,5
gill,6
jack,3
$ key=jack
$ grep "^$key," database
jack,5
jack,3
$ grep "^$key," database | sed -e "s/^$key,//"
5
3
$ grep "^$key," database | sed -e "s/^$key,//" | tail -n 1
3
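To complete the round trip, define both functions in your shell, append a new value for the key, and read it back; the last write wins:
$ db_set () { echo "$1,$2" >> database; }
$ db_get () { grep "^$1," database | sed -e "s/^$1,//" | tail -n 1; }
$ db_set jack 7
$ db_get jack
7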

Multiple strings in for loop grep -wc

How do I search the same files for several strings, printing a separate count for each?
Currently:
#!/bin/bash
for log in filename.log.201[45]-*-*.gz; do
    printf '%s:' "$log"
    zcat "$log" | grep -wc 'dollar for dollars'
done
Desired result:
#!/bin/bash
for log in filename.log.201[45]-*-*.gz; do
    printf '%s:' "$log"
    echo "count for dollar for dollars"
    zcat "$log" | grep -wc 'dollar for dollars'
    echo "count for pollar for pollars"
    zcat "$log" | grep -wc 'pollar for pollars'
done
You can use a nested loop for this:
for count in 'dollar for dollars' 'pollar for pollars'; do
    for log in filename.log.201[45]-*-*.gz; do
        printf '%s:%s:' "$log" "$count"
        zcat "$log" | grep -wc "$count"
    done
done
You probably would be better off using an actual programming language, like awk.
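For instance, here is a minimal gawk sketch of that idea (the output format is an assumption, and the \< \> word boundaries are a GNU extension standing in for grep's -w); it reads each file once and counts the lines matching each string:
for log in filename.log.201[45]-*-*.gz; do
    zcat "$log" | awk -v file="$log" '
        /\<dollar for dollars\>/ { d++ }
        /\<pollar for pollars\>/ { p++ }
        END {
            printf "%s:dollar for dollars:%d\n", file, d
            printf "%s:pollar for pollars:%d\n", file, p
        }'
done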
Alternatively, if you want a count of the total number of occurrences of each pattern (which can be more than the number of matching lines, when a pattern appears several times on one line), you can use grep's -o option to output the actual matches and then build the final report with sort | uniq -c, which counts the occurrences of each distinct line in a stream. That also lets you supply multiple patterns to a single grep command using the -e option:
for log in filename.log.201[45]-*-*.gz; do
    zcat "$log" |
        grep -ow -e "pattern 1" -e "pattern 2" |  # -o: print each match on its own line; -w: whole words
        sort | uniq -c |                          # count occurrences of each distinct match
        xargs -d\\n printf "${log//%/%%}:%s\n"    # prefix each line with the filename (% escaped for printf)
done
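The output would look something like this (counts and dates are illustrative):
filename.log.2014-03-01.gz:      3 pattern 1
filename.log.2014-03-01.gz:      1 pattern 2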

Getting error: sed: -e expression #1, char 2: unknown command: `.'

EDIT: FIXED. Now I'm concerned with optimizing the code.
I am writing a script to separate data from one file into multiple files. When I run the script, I get the error "sed: -e expression #1, char 2: unknown command: `.'" without any line number, which makes it somewhat hard to debug. I have checked the lines in which I use sed individually, and they work without problem. Any ideas? I realize that I did a lot of things somewhat unconventionally and that there are faster ways of doing some of them (I'm sure there's a way to avoid repeatedly re-reading somefile), but right now I'm just trying to understand this error. Here is the code:
x1=$(sed -n '1p' < somefile | cut -f1)
y1=$(sed -n '1p' < somefile | cut -f2)
p='p'
for i in 1..$(seq 1 $(cat "somefile" | wc -l))
do
    x2=$(sed -n $i$p < somefile | cut -f1)
    y2=$(sed -n $i$p < somefile | cut -f1)
    if [ "$x1" = "$x2" ] && [ "$y1" = "$y2" ];
    then
        x1=$x2
        y1=$x2
    fi
    s="$(sed -n $i$p < somefile | cut -f3) $(sed -n $i$p < somefile | cut$
    echo $s >> "$x1-$y1.txt"
done
The problem is in the following line:
for i in 1..$(seq 1 $(cat "somefile" | wc -l))
If somefile had 3 lines, this would result in the following values of i:
1..1
2
3
Clearly, something like sed -n 1..1p < filename would produce exactly the error you are observing: sed: -e expression #1, char 2: unknown command: `.'
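You can reproduce it directly:
$ printf 'a\nb\n' | sed -n 1..1p
sed: -e expression #1, char 2: unknown command: `.'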
Instead, you want:
for i in $(seq 1 $(cat "somefile" | wc -l))
This is the cause of the problem:
for i in 1..$(seq 1 $(cat "somefile" | wc -l))
Try just
for i in $(seq 1 $(wc -l < somefile))
However, you are reading your file many, many times too often with all those sed commands. Read it just once:
read x1 y1 < <(sed 1q somefile)
while read x2 y2 f3 f4; do
    if [[ $x1 = $x2 && $y1 = $y2 ]]; then
        x1=$x2
        y1=$x2
    fi
    echo "$f3 $f4"
done < somefile > "$x1-$y1.txt"
The line where you construct the s variable is truncated -- I'm assuming you have 4 fields per line.
Note: a problem with cut-and-paste coding is that you introduce errors: you assign y2 from the same field as x2 (cut -f1 both times).
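If y is meant to track the second field, the corresponding fixes in the original script would be (a guess at the intended logic):
y2=$(sed -n $i$p < somefile | cut -f2)   # was: cut -f1
y1=$y2                                   # was: y1=$x2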

Bash - How to count C source file function calls

I want to find, for each function defined in a C source file, how many times it's called and on which lines.
Should I search for patterns that look like C function definitions and then count how many times each function name occurs? If so, how can I do it? Regular expressions?
Any help will be highly appreciated!
#!/bin/bash
if [ -r "$1" ]; then
    # ??????
else
    echo "The file \"$1\" does NOT exist"
fi
The final result is below (please report any bugs):
if [ -r "$1" ]; then
    functs=`grep -n -e "\(void\|double\|char\|int\) \w*(.*)" $1 | sed 's/^.*\(void\|double\|char\|int\) \(\w*\)(.*$/\2/g'`
    for f in $functs; do
        echo -n "$f() is called:"
        grep -n $f $1 > temp.txt
        echo -n `grep -c -v -e "\(void\|double\|char\|int\) $f(.*)" -e "//" temp.txt`
        echo " times"
        echo -n "on lines: "
        echo -n `grep -v -e "\(void\|double\|char\|int\) $f(.*)" -e "//" temp.txt | sed -n 's/^\([0-9]*\)[:].*/\1/p'`
        echo
        echo
    done
else
    echo "The file \"$1\" does not exist"
fi
This might sort of work. The first bit finds function definitions like
<datatype> <name>(<stuff>)
and pulls out the <name>. Then grep for that string. There are loads of situations where this won't work, but it might be a good place to start if you're trying to make a simple shell script that works on some programs.
functions=`grep -e "\(void\|double\|int\) \w*(.*)$" input.c | sed 's/^.*\(void\|double\|int\) \(\w*\)(.*$/\2/g'`
for func in $functions
do
    echo "Counting references for $func:"
    grep "$func" input.c | wc -l
done
You can try this regex:
(^|[^\w\d])?(functionName(\s)*\()
For example, to find all printf occurrences:
(^|[^\w\d])?(printf(\s)*\()
To use this expression with grep you have to pass the -E option, like this:
grep -E "(^|[^\w\d])?(printf(\s)*\()" the_file.txt
One caveat: this solution does not skip occurrences inside comment blocks.
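A quick demonstration, including the comment caveat (the file contents are made up):
$ cat the_file.txt
int main() {
    printf("hi");
    // old printf("debug") call
}
$ grep -cE "(^|[^\w\d])?(printf(\s)*\()" the_file.txt
2
Both the real call and the commented-out one are counted.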
