How to grep against a huge .gz file fast? - performance

A very simple grep question, but I'm looking for tricks to improve the performance!
I am using the following command to grep a list of IDs in a file against another huge compressed .gz file (~20 GB).
zcat my.gz | grep -wFf my.list > output.txt
Since this decompresses the .gz file first and then greps my list of 100k IDs against the decompressed stream, it takes too much time!
Is there any way to process the job faster?
Thanks!
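A couple of commonly suggested tweaks, sketched below and not guaranteed to help on every system: force the C locale so grep compares plain bytes, and decompress with pigz if it is available (gzip decompression itself is single-threaded, but pigz offloads reading, writing and checksumming to extra threads, which is usually a bit faster).
# assumption: pigz is installed; otherwise keep zcat
pigz -dc my.gz | LC_ALL=C grep -wFf my.list > output.txt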

Related

How to merge multiple large .gz files into one in an efficient way?

I'm trying to combine multiple (29) compressed files (.gz), one after the other, into one file.
The compressed files are around 500MB and in their uncompressed format ~30GB. All the files start with a header that I don't want in the final file.
I have tried to do it using zcat and gzip, but it takes a lot of time (more than 3 hours):
zcat file*.gz | tail -n +2 | gzip -c >> all_files.txt.gz
I have also tried it with pigz:
unpigz -c file*.gz | tail -n +2 | pigz -c >> all_files_pigz.txt.gz
However, I'm working on a cluster where they don't have this command, and I can't install anything.
The last thing I have tried is to merge all with cat:
cat file*.gz > all_files_cat.txt.gz
It doesn't take a lot of time, but when I try to read it, at some point the following message appears:
gzip: unexpected end of file
How could I deal with this?
If you want to remove the first line of every uncompressed file, and concatenate them all into one compressed file, you'll need a loop. Something like
for f in file*.gz; do
zcat "$f" | tail -n +2
done | gzip -c > all_files_cat.txt.gz
If there are lots of big files, then yes, it can take a while. Maybe use a lower compression level than the default (at the expense of a larger file), or use a different compression program than gzip; there are lots of options, each with its own speed and compression-ratio tradeoffs.
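As a concrete sketch of the lower-compression-level idea (gzip -1 is the fastest level, at the cost of a somewhat larger output file):
for f in file*.gz; do
zcat "$f" | tail -n +2
done | gzip -1 -c > all_files.txt.gz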

Logging how many times a certain file is requested on Apache2

Looking for some advice here.
I know this can be done using AWStats or something similar, but that seems like overkill for what I want to do here.
I have a directory in my webroot that contains thousands of XML files.
These are all loaded by calls to a single swf file using GET requests in the URL.
eg :
https://www.example.com/myswf.swf?url=https://www.example.com/xml/1234567.xml
The URLs are built dynamically and there are thousands of them, all pointing to the same swf file but pulling in a different XML file from the XML directory.
What I'm looking to do is log how many times each of those individual XML files is requested to a text file.
As I know the target directory, is there a bash script or something that I can run that will monitor the XML directory and log each hit with a timestamp?
eg :
1234567.xml | 1475496840
7878332.xml | 1481188213
etc etc
Any suggestions?
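One way to produce the requested "name | epoch" lines is to watch the web server's access log as it grows. This is only a sketch: it assumes a standard access log at /var/log/apache2/access.log and GNU awk (for systime and fflush), and the timestamp it records is the moment the log line is seen, not Apache's own timestamp.
# hypothetical paths; adjust the log location and the output file to your setup
tail -F /var/log/apache2/access.log | gawk 'match($0, /\/xml\/[0-9A-Za-z_-]+\.xml/) { print substr($0, RSTART+5, RLENGTH-5) " | " systime(); fflush() }' >> xml_hits.txt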
Simpler, more direct approach:
sort requests.txt | uniq -c
Where I'm assuming all your request URLs are in a file called requests.txt (uniq only counts adjacent duplicates, hence the sort).
Better formatted output:
awk -F/ '{print $8}' requests.txt | sort | uniq -c
This is an ugly way since it uses loops to process text rather than an elegant awk array, but it should work (slowly). Optimisation is definitely required.
I'm assuming all your request URLs are in a file called requests.txt
#Put all the unique URLs in an index file
awk -F/ '{print $8}' requests.txt | sort -u > index
#Look through the file to count the number of occurrences of each item.
while read i
do
echo -n "$i | "
grep -c -w "$i" requests.txt
done < index
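For comparison, the awk-array approach alluded to above might look like this (same assumption: the request URLs are in requests.txt and the XML file name is the eighth /-separated field):
awk -F/ '{count[$8]++} END {for (f in count) print f " | " count[f]}' requests.txt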

How to view the content of a gzipped file “abc.gz” without actually extracting it?

How to view the content of a gzipped file abc.gz without extracting it?
I tried to find a way to see the content without unzipping, but I did not find one.
You can use the command below to see the file without replacing it with the decompressed content:
gunzip -c filename.gz
Just use zcat to see content without extraction.
zcat abc.gz
From the manual:
zcat is identical to gunzip -c. (On some systems, zcat may be
installed as gzcat to preserve the original link to compress.)
zcat uncompresses either a list of files on the command line or its
standard input and writes the uncompressed data on standard output.
zcat will uncompress files that have the correct magic number whether
they have a .gz suffix or not.
Plain text:
cat abc
Gzipped text:
zcat abc.gz
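If the file is large and you only want to page through it, zless (shipped alongside zcat on most systems) avoids dumping everything to the terminal at once:
zless abc.gz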

My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?

I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (20GB-sized text file with tens of millions of lines) named 1.txt.
Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
The resulting .txt files must not exceed a predefined limit, say, one million lines.
That is, if there are 3.5M lines in 1.txt that match those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the latter will contain 500K lines), that's all.
I tried to make use of something like
gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.
You can perhaps use zgrep.
zgrep [ grep_options ] [ -e ] pattern filename.gz ...
NOTE: zgrep is a wrapper script (installed with the gzip package), which essentially uses the same command internally as mentioned in the other answers.
However, it is more readable in a script and easier to type manually.
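Applied to the question (the regex placeholders and the part prefix are only illustrative), it might look like:
zgrep -P 'my regex' path/to/test/file.gz | grep -vP 'other regex' | split -l 1000000 - part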
I'm afraid it's impossible; quoting from the gzip manual:
If you wish to create a single archive file with multiple members so
that members can later be extracted independently, use an archiver
such as tar or zip.
UPDATE: After the edit, if the .gz only contains one file, a one-step tool like awk should be fine:
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1} /my regex/ {count+=1; print $0 > ("part" global ".txt"); if (count==1000000) {count=0; global+=1}}'
split is also a good choice, but you will have to rename the files afterwards.
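If the second condition from the question (lines that must not match another regex) has to be handled in the same pass, the awk approach might be extended like this (same placeholder regexes and part-file naming):
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1} /my regex/ && !/other regex/ {count+=1; print $0 > ("part" global ".txt"); if (count==1000000) {close("part" global ".txt"); count=0; global+=1}}'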
Your solution is almost good. The problem is that you need to tell gzip what to do: to decompress, use -d. So try:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But with this you will get a bunch of files like xaa, xab, xac, ... I suggest using the PREFIX and numeric-suffix features to create better-named output:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file
In this case the result files will look like file00, file01, file02, etc.
If you also want to filter out lines matching another Perl-style regex, you can try something like this:
gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file
I hope this helps.

combine multiple text files and remove duplicates

I have around 350 text files (and each file is around 75MB). I'm trying to combine all the files and remove duplicate entries. The file is in the following format:
ip1,dns1
ip2,dns2
...
I wrote a small shell script to do this
#!/bin/bash
for file in data/*
do
cat "$file" >> dnsFull
done
sort dnsFull > dnsSorted
uniq dnsSorted dnsOut
rm dnsFull dnsSorted
I'm doing this processing often and was wondering if there is anything I could do to improve the processing next time when I run it. I'm open to any programming language and suggestions. Thanks!
First off, you're not using the full power of cat. The loop can be replaced by just
cat data/* > dnsFull
assuming that dnsFull is initially empty.
Then there's all those temporary files that force programs to wait for hard disks (commonly the slowest parts in modern computer systems). Use a pipeline:
cat data/* | sort | uniq > dnsOut
This is still wasteful since sort alone can do what you're using cat and uniq for; the whole script can be replaced by
sort -u data/* > dnsOut
If this is still not fast enough, then realize that sorting takes O(n lg n) time while deduplication can be done in linear time with Awk:
awk '{if (!a[$0]++) print}' data/* > dnsOut
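If the awk array grows too large for memory (it keeps one entry per unique line), the sort-based variant can still be tuned a bit; a sketch assuming GNU sort, with an illustrative buffer size and thread count:
LC_ALL=C sort -u -S 2G --parallel=4 data/* > dnsOut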
