Different ways of grepping for large amounts of data - bash

So I have a huuuuge file and a big list of items that I want to grep out of that file.
For the sake of this example, the two files can be created like this:
seq 1 10000 > file.txt #file.txt contains numbers from 1 to 10000
seq 1 5 10000 > list #list contains every fifth number from 1 to 10000
My question is: which is the best way to grep out the lines corresponding to 'list' from 'file.txt'?
I tried it in two ways:
time while read i ; do grep -w "$i" file.txt ; done < list > output
That command took real 0m1.300s
time grep -wf list file.txt > output
This one was slower, clocking in at real 0m1.402s.
Is there a better (faster) way to do this? Is there a best way that I'm missing?

You're comparing apples and oranges.
This command greps the words from list in file.txt:
time for i in `cat list`; do grep -w "$i" file.txt ; done > output
This command greps the patterns from file.txt in list:
time grep -f file.txt list > output
You need to fix one file as the source of strings to match and the other file as the target data in which to match the strings, and use the same grep options (like -w or -F) in both cases.
It sounds like list is the source of patterns and file.txt is the target data file. Here are my timings for the adjusted original commands plus one awk and two sed solutions; the sed solutions differ in whether the patterns are given as separate sed commands or as one extended regex.
timings
one grep
real 0m0.016s
user 0m0.001s
sys 0m0.001s
2000 output1
loop grep
real 0m10.120s
user 0m0.060s
sys 0m0.212s
2000 output2
awk
real 0m0.022s
user 0m0.007s
sys 0m0.000s
2000 output3
sed
real 0m4.260s
user 0m4.211s
sys 0m0.022s
2000 output4
sed -r
real 0m0.144s
user 0m0.085s
sys 0m0.047s
2000 output5
script
n=10000
seq 1 $n >file.txt
seq 1 5 $n >list
echo "one grep"
time grep -Fw -f list file.txt > output1
wc -l output1
echo "loop grep"
time for i in `cat list`; do grep -Fw "$i" file.txt ; done > output2
wc -l output2
echo "awk"
time awk 'ARGIND==1 {list[$1]; next} $1 in list' list file.txt >output3
wc -l output3
echo "sed"
sed 's/^/\/^/;s/$/$\/p/' list >list.sed
time sed -n -f list.sed file.txt >output4
wc -l output4
echo "sed -r"
tr '\n' '|' <list|sed 's/^/\/^(/;s/|$/)$\/p/' >list.sedr
time sed -nr -f list.sedr file.txt >output5
wc -l output5
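For reference, the two generated sed programs look like this (a quick check, assuming the script above has just run; the middle of the alternation is elided here): list.sed holds one command per pattern, list.sedr a single extended regex.
$ head -3 list.sed
/^1$/p
/^6$/p
/^11$/p
$ cat list.sedr
/^(1|6|11| ... |9991|9996)$/p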

You can try awk:
awk 'NR==FNR{a[$1];next} $1 in a' file.txt list
On my system, awk is faster than grep with the sample data.
Test:
$ time grep -f file.txt list > out
real 0m1.231s
user 0m1.056s
sys 0m0.175s
$ time awk 'NR==FNR{a[$1];next} $1 in a' file.txt list > out1
real 0m0.068s
user 0m0.067s
sys 0m0.001s
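For readers not familiar with the idiom: NR==FNR is only true while awk reads the first file named on the command line, so the first block loads that file's first column into an array and the second file is then filtered against it. The same command, just spread out with comments:
awk 'NR==FNR {      # true only while reading the first file (file.txt here)
         a[$1]      # remember its first field as an array key
         next       # skip the filter below for these lines
     }
     $1 in a        # second file: print lines whose first field was seen
' file.txt list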

Faster or not, you have a useless use of cat up there.
Why not:
grep -f list file.txt # Aren't the files meant to be the other way around?
Or use a slightly more customized awk:
awk 'NR==FNR{a[$1];next} $1 in a{print $1;next}' list file.txt

Related

Adding numbers with a while loop using piped output

So I am running a script that can receive several arguments ($1 and $2, not shown) and then does something with the arguments passed...
With the 3rd argument, I am supposed to search for $3 (or not $3) in file1 and add the number of instances of each to file2...
This works fine:
cat file1 | grep $3 | wc -l | while read line1; do echo $3 $line1 > file2; done
cat file1 | grep -v $3 | wc -l | while read line2; do echo not $3 $line2 >> file2; done
Now I am trying to read file2, which holds the counts from the search. I want to get the numbers in the file, compute their sum, and then append it to file2. So, for example, if $3 was "baby":
file2 would contain:
baby 30
not baby 20
and then I want to get the sum of 30 and 20 and append it to that same file2, so that it looks like:
baby 30
not baby 20
total 50
This is what I have at the moment:
cat file2 | grep -o '[0-9]*' | while read num ; do sum=$(($sum + $num));echo "total $sum" >> file2; done
My file2 ends up with two total lines, where only one of them is what I need:
baby 30
not baby 20
total 30
total 50
What did I miss here?
This is happening because your echo is within your while loop.
The obvious solution would be to move it outside the loop, but if you try that you will find that $sum is not set. This is because a while loop on the receiving end of a pipe runs in its own subshell process. You can solve this by using braces ({ }) to group your commands so that the echo runs in the same subshell:
cat file2 | grep -o '[0-9]*' | { while read num ; do sum=$(($sum + $num)); done; echo "total $sum" >> file2; }
Other answers do point out better ways of doing this, but this hopefully helps you understand what is happening.
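For completeness, another way to keep $sum visible in the current shell is to feed the loop from a process substitution instead of a pipe (bash-specific; a sketch assuming the same file2 layout as above):
sum=0
while read num ; do
    sum=$((sum + num))       # accumulate every number found in file2
done < <(grep -o '[0-9]*' file2)
echo "total $sum" >> file2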
cat file1 | grep $3 | wc -l | while read line1; do echo $3 $line1 > file2; done
If you want to count the instances of $3, you can use grep's -c option, avoiding a pipe to wc(1). Moreover, it would be better to quote the $3. Finally, you don't need a loop to read the count (whether it comes from wc or grep): it is a single line! So your code above could be written like this:
count=$(grep -c "$3" file1)
echo $count $3 >file2
The second grep is just the same, with -v added and "not" written into the output:
count=$(grep -vc "$3" file1)
echo $count not $3 >>file2
Now you should have the intermediate result:
30 baby
20 not baby
Note that I reversed the two terms, count and pattern; this is because we know that the count is a single word, but the pattern could be several words. By writing the count first, we have a well-defined format: "count, then all the rest".
The third loop can be written like this:
sum=0
while read num string; do
    # string is filled with all the rest on the line
    let "sum = $sum + $num"
done < file2
echo "$sum total" >> file2
There are other ways to sum up the total; if needed, you could also swap the terms of the final file back to your original order, for example by writing to yet another temporary file.
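Putting those pieces together, the whole thing could look roughly like this (a sketch only, with file1, file2 and $3 as in the question):
count=$(grep -c "$3" file1)       # lines matching $3
echo "$count $3" > file2
count=$(grep -vc "$3" file1)      # lines not matching $3
echo "$count not $3" >> file2
sum=0
while read num rest; do           # count first, everything else lands in "rest"
    sum=$((sum + num))
done < file2
echo "$sum total" >> file2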

One line command with variable, word count and zcat

I have many files on a server, each containing many lines:
201701010530.contentState.csv.gz
201701020530.contentState.csv.gz
201701030530.contentState.csv.gz
201701040530.contentState.csv.gz
With a one-line command, I would like to get this result:
170033|20170101
169865|20170102
170010|20170103
170715|20170104
The goal is to get the number of lines in each file, keeping only the date that is already in the filename.
I tried this, but the result comes out on two lines instead of one...
for f in $(ls -1 2017*gz);do zcat $f | wc -l;echo $f | awk '{print substr($0,1,8)}';done
Thanks in advance guys.
Just use zcat file | wc -l to get the number of lines.
For the name, I understand it is enough to extract the first 8 characters:
$ t="201701030530.contentState.csv.gz"
$ echo "${t:0:8}"
20170103
All together:
for file in 2017*gz
do
    lines=$(zcat "$file" | wc -l)
    printf "%s|%s\n" "$lines" "${file:0:8}"
done > myresult.csv
Note the use of for file in 2017*gz to go through the files matching the 2017*gz pattern: this suffices; there is no need to parse ls!
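One optional refinement (not part of the original answer, just a bash feature worth knowing): with nullglob set, the loop body is skipped entirely when nothing matches, instead of running once with the literal string 2017*gz:
shopt -s nullglob        # an unmatched glob expands to nothing instead of itself
for file in 2017*gz
do
    lines=$(zcat "$file" | wc -l)
    printf "%s|%s\n" "$lines" "${file:0:8}"
done > myresult.csv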
Use zgrep -c ^ file to count the lines, here encapsulated in awk:
$ awk 'FNR==1{ "zgrep -c ^ " FILENAME | getline s; print s "|" substr(FILENAME,1,8) }' *.gz
12|20170101
The whole "zgrep -c ^ " FILENAME should probably be in a var (s) and then s | getline s.

How to delete all characters starting from the nth position for every word using bash?

I have a file containing 1,700,000 words. I want to do naive stemming of the words: if a word's length is more than 6 characters, I delete all characters after the 6th position. For example:
Input:
Everybody is around
Everyone keeps talking
Output:
Everyb is around
Everyo keeps talkin
I wrote the following script:
INPUT=train.txt
while read line; do
    for word in $line; do
        new="$(echo $word | awk '{print substr($0,1,6);exit}')"
        echo -n $new >> train_stem_6.txt
        echo -n ' ' >> train_stem_6.txt
    done
    echo ' ' >> train_stem_6.txt
done < "$INPUT"
This answers the question perfectly, but it is extremely slow, and since I have 1,700,000 words, it takes forever.
Is there a faster way to do this with a bash script?
Thanks a lot,
You can use this GNU awk command with a custom RS (RT holds the text that matched RS, so setting ORS=RT preserves the original whitespace between words):
awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file
Everyb is around
Everyo keeps talkin
Timings of 3 commands on 11 MB input file:
sed:
time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' file >/dev/null
real 0m2.913s
user 0m2.878s
sys 0m0.020s
awk command by @andlrc:
time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' file >/dev/null
real 0m1.191s
user 0m1.174s
sys 0m0.011s
My suggested awk command:
time awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file >/dev/null
real 0m1.926s
user 0m1.905s
sys 0m0.013s
So both awk commands take pretty much the same time to finish the job, and sed tends to be slower on bigger files.
The same 3 commands on a 167 MB file:
$ time awk -v RS='[[:space:]]+' 'RT{ORS=RT} {$1=substr($1, 1, 6)} 1' test > /dev/null
real 0m29.070s
user 0m28.898s
sys 0m0.060s
$ time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' test >/dev/null
real 0m13.897s
user 0m13.805s
sys 0m0.036s
$ time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' test > /dev/null
real 0m40.525s
user 0m40.323s
sys 0m0.064s
Would you consider using sed?
sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g'
You can use awk for this:
awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' train.txt
Breakdown:
{
for(i=1;i<=NF;i++) { # Iterate over each word
$i = substr($i, 1, 6); # Shrink it to a maximum of 6 characters
}
}
1 # Print the row
This will, however, treat "Awesome," (including the trailing comma) as a single word and therefore remove "e,".
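For example (just a demonstration of that caveat, not data from the question):
$ echo "Awesome, the chapter is awesome" | awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1'
Awesom the chapte is awesom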
Pure bash (i.e. not POSIX), as a one-liner:
while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done < train.txt
...and the same code reformatted for clarity:
while read x ; do
    set -- $x
    for f in $* ; do
        echo -n ${f:0:6}" "
    done
    echo
done < train.txt
Note: repeated whitespace becomes a single space.
Test run: first make a function from the above code, reading standard input:
len6() { while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done ; }
Invoke:
COLUMNS=90 man bash | tail | head -n 5 | len6
Output:
gracef when proces suspen is attemp When a proces is stoppe the
shell immedi execut the next comman in the sequen It suffic to
place the sequen of comman betwee parent to force it into a subshe
which may be stoppe as a unit.

awk shell variable with field separator

I am trying to create a hash:
awk -F ';' '/DHCP/ {for(i=1; i<=5; i++) {getline; print $2$1}}' file \
| awk '{print $1"=>\"0000:0000:0000:1000::"$2"/64\""}'
returns the following:
host1=>"0000:0000:0000:1000::2/64"
host2=>"0000:0000:0000:1000::3/64"
host3=>"0000:0000:0000:1000::4/64"
host4=>"0000:0000:0000:1000::5/64"
host5=>"0000:0000:0000:1000::6/64"
This is all fine, but notice the 5 in the for loop in awk. How can I retrieve the total number of lines of the file into that for loop?
I can put the output of wc -l into a variable, but how do I use the shell variable and the field separator ; together with awk?
ADD
This is what the file looks like :
#rest are dynamically assigned by DHCP server
2 ; host1 ; server1 ; ; ;
3 ; host2 ; sX ;;
4 ; host3 ; plic ;;
5 ; host4 ; cluc ;;
6 ; host6 ; blah ;;
awk -F'[ \t]*;[ \t]*' 'NR > 1 && NF > 1 { print $2"=>\"0000:0000:0000:1000::"$1"/64\"" }' file
I've gotten rid of the check for DHCP -- I just test if we're past the first line. And NF > 1 makes sure that we don't do anything on a blank line.
I combined the two uses of awk into one by using a more elaborate field separator. It matches ; and any whitespace around it.
awk -v IT=$(cat file1|wc -l) -F ';' '/DHCP/ {for(i=1; i<=IT; i++) {getline; print $2$1}}' file \
| awk '{print $1"=>\"0000:0000:0000:1000::"$2"/64\""}'
The -v flag passes external variables to awk.
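As a minimal illustration of -v (the names lines and total here are just examples), a count computed in the shell becomes an ordinary awk variable:
lines=$(wc -l < file)
awk -v total="$lines" -F ';' 'END { print "awk saw " FNR " of " total " lines" }' file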
@Ed Morton, on my system catting executed faster:
(precise)cronkilla#localhost:/tmp$ time wc -l < file1
4
real 0m0.003s
user 0m0.001s
sys 0m0.002s
(precise)cronkilla#localhost:/tmp$ time cat file1 | wc -l
4
real 0m0.003s
user 0m0.001s
sys 0m0.001s

Add prefix to every line in text in bash

Suppose there is a text file a.txt e.g.
aaa
bbb
ccc
ddd
I need to add a prefix (e.g. myprefix_) to every line in the file:
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd
I can do that with awk:
awk '{print "myprefix_" $0}' a.txt
Now I wonder if there is another way to do that in shell.
With sed:
$ sed 's/^/myprefix_/' a.txt
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd
This replaces the beginning of every line (matched by ^) with myprefix_. Since ^ matches an empty position at the start of the line, nothing is removed: the prefix is simply inserted, which is what lets you add content to the beginning of each line.
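If the prefix comes from a shell variable, it can be interpolated into the sed expression the same way (a small sketch; it assumes the value contains no characters special to sed, such as / or &):
prefix="myprefix_"
sed "s/^/$prefix/" a.txt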
You can make your awk version shorter with:
$ awk '$0="myprefix_"$0' a.txt
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd
or passing the value:
$ prefix="myprefix_"
$ awk -v prefix="$prefix" '$0=prefix$0' a.txt
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd
It can also be done with nl (the line numbers take up 6 characters by default, so cut -c7- strips them again, leaving only the separator, used here as the prefix, plus the original line):
$ nl -s "prefix_" a.txt | cut -c7-
prefix_aaa
prefix_bbb
prefix_ccc
prefix_ddd
Finally, as John Zwinck explains, you can also do:
paste -d'' <(yes prefix_) a.txt | head -n $(wc -l < a.txt)
on OS X:
paste -d '\0' <(yes prefix_) a.txt | head -n $(wc -l < a.txt)
Here yes prefix_ produces an endless stream of prefix_ lines, paste glues them to the lines of a.txt, and head cuts the result back down to the number of lines in a.txt.
Pure bash:
while read line
do
    echo "prefix_$line"
done < a.txt
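A slightly more defensive variant of the same loop (an optional tweak: IFS= and -r keep leading whitespace and backslashes in the input intact):
while IFS= read -r line
do
    echo "prefix_$line"
done < a.txt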
For reference, regarding the speed of the awk, sed, and bash solutions to this question:
Generate an 800 KB input file in bash:
line="12345678901234567890123456789012345678901234567890123456789012345678901234567890"
rm a.txt
for i in {1..10000} ; do
echo $line >> a.txt
done
Then consider the bash script timeIt
if [ -e b.txt ] ; then
rm b.txt
fi
echo "Bash:"
time bashtest
rm b.txt
echo
echo "Awk:"
time awktest
rm b.txt
echo
echo "Sed:"
time sedtest
where bashtest is
while read line
do
echo "prefix_$line" >> b.txt
done < a.txt
awktest is:
awk '$0="myprefix_"$0' a.txt > b.txt
and sedtest is:
sed 's/^/myprefix_/' a.txt > b.txt
I got the following result on my machine:
Bash:
real 0m0.401s
user 0m0.340s
sys 0m0.048s
Awk:
real 0m0.009s
user 0m0.000s
sys 0m0.004s
Sed:
real 0m0.009s
user 0m0.000s
sys 0m0.004s
It seems like the bash solution is much slower.
You can also use the xargs utility:
cat file | xargs -d "\n" -L1 echo myprefix_
The -d option makes every newline a hard delimiter, so input lines with trailing blanks are still treated as single lines (trailing blanks would otherwise cause logical line continuation with -L).
