Logging how many times a certain file is requested on Apache2 - bash

Looking for some advice here.
I know this can be done using AWStats or something similar, but that seems like overkill for what I want to do here.
I have a directory in my webroot that contains thousands of XML files.
These are all loaded by calls to a single swf file using GET requests in the url.
eg :
https://www.example.com/myswf.swf?url=https://www.example.com/xml/1234567.xml
The urls are built dynamically and there are thousands of them. All pointing to the same swf file, but pulling in a different XML file from the XML directory.
What I'm looking to do is log how many times each of those individual XML files is requested to a text file.
As I know the target directory, is there a bash script or something that I can run that will monitor the XML directory and log each hit with a timestamp?
eg :
1234567.xml | 1475496840
7878332.xml | 1481188213
etc etc
Any suggestions?

Simpler, more direct approach-
uniq -c requests.txt
Where I'm assuming all your request URLs are in a file called requests.txt.
Better formatted output-
awk -F/ '{print $8}' requests.txt | uniq -c

This is an ugly way since it uses loops to process text rather that an elegant awk array but it should work (slowly). Optimisation is definitely required.
I'm assuming all your request URLs are in a file called requests.txt
#Put all the unique URLs in an index file
awk -F/ '{print $8}' requests.txt | sort -u > index
#Look through the file to count the number of occurrences of each item.
while read i
do
echo -n "$i | "
grep -c -w "$i" requests.txt
done < index

Related

Is it possible to grep using an array as pattern?

TL;DR
How to filter an ls/find output using grep
with an array as a pattern?
Background story:
I have a pipeline which I have to rerun for datasets which run into an error.
Which datasets are run into an error is saved in a tab separated file.
I want to delete the files where the pipeline has run into an error.
To do so I extracted the dataset names from another file containing the finished dataset and saved them in a bash array {ds1 ds2 ...} but now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.
This is the folder structure (X=1-30):
datasets/dsX/results/dsX.tsv
Not excluding the finished datasets, meaning deleting the folders of the failed and the finished datasets works like a charm
#1. move content to a trash folder
ls /datasets/*/results/*|xargs -I '{}' mv '{}' ./trash/
#2. delete the empty folders
find /datasets/*/. -type d -empty -delete
But since I want to exclude the finished datasets I thought it would be clever to save them in an array:
#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' $path/$log_pf)
echo ${finished[#]}
which works as expected but now I am stuck in filtering the ls output using that array:
*pseudocode
#trying to ignore the dataset in the array - not working
ls -I${finished[#]} -d /datasets/*/
#trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}
What do you think about my current ideas?
Is this possible using bash only? I guess in python I could do that easily
but for training purposes, I want to do it in bash.
grep can get the patterns from a file using the -f option. Note that file names containing newlines will cause problems.
If you need to process the input somehow, you can use process substitution:
grep -f <(process the input...)
I must admit I'm confused about what you're doing but if you're just trying to produce a list of files excluding those stored in column 2 of some other file and your file/directory names can't contain spaces then that'd be:
find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -
If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.

Search and write line of a very large file in bash

I have a big csv file containing 60210 lines. Those lines contains hashes, paths and file names, like so:
hash | path | number | hash-2 | name
459asde2c6a221f6... | folder/..| 6 | 1a484efd6.. | file.txt
777abeef659a481f... | folder/..| 1 | 00ab89e6f.. | anotherfile.txt
....
I am filtering this file regarding a list of hashes, and to the facilitate the filtering process, I create and use a reduced version of this file, like so:
hash | path
459asde2c6a221f6... | folder/..
777abeef659a481f... | folder/..
The filtered result contains all the lines that have a hash which is not present in my reference hash base.
But to make a correct analysis of the filtered result, I need the previous data that I removed. So my idea was to read the filtered result file, search for the hash field, and write it in an enhanced result file that will contain all the data.
I use a loop to do so:
getRealNames() {
originalcontent="$( cat $originalfile)"
while IFS='' read -r line; do
hash=$( echo "$line" | cut -f 1 -d " " )
originalline=$( echo "$originalcontent" |grep "$hash" )
if [ ! -z "$originalline" ]; then
echo "$originalline" > "$resultenhanced"
fi
done < "$resultfile"
}
But in real usage, it is highly inefficient: for the previous file, this loop takes approximately 3 hours to run on a 4Go RAM, Intel Centrino 2 system, and it seems to me way too long for this kind of operation.
Is there any way I can improve this operation?
Given the nature of your question, it is hard to understand why you would prefer using the shell to process such a huge file given specialized tools like awk or sed to process them. As Stéphane Chazelas points out in the wonderful answer in Unix.SE.
Your problem becomes easy to solve once you use awk/perl which speeds up the text processing. Also you are consuming the whole file into RAM by doing originalcontent="$( cat $originalfile)" which is not desirable at all.
Assuming in the both the original and reference file, the hash starts at the first column and the columns are separated by |, you need to use awk as
awk -v FS="|" 'FNR==NR{ uniqueHash[$1]; next }!($1 in uniqueHash)' ref_file orig_file
The above attempts takes into memory only the first column entries from your reference file, the original file is not consumed at all. Once we consume the entries in $1 (first column) of the reference file, we do filter on the original file by selecting those lines that are not in the array(uniqueHash) we created.
Change your locale settings to make it even faster by setting the C locale as LC_ALL=C awk ...
Your explanation of what you are trying to do unclear because it describes two tasks: filtering data and then adding missing values back to the filtered data. Your sample script addresses the second, so I'll assume that's the what you are trying to solve here.
As I read it, you have a filtered result that contains hashes and paths, and you need to lookup those hashes in the original file to get the other field values. Rather than loading the original file into memory, just let grep process the file directly. Assuming a single space (as indicated by cut -d " ") is your field separator, you can extract the hash in your read command, too.
while IFS=' ' read -r hash data; do
grep "$hash" "$originalfile" >> "$resultenhanced"
done < "$resultfile"

My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?

I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (20GB-sized text file with tens of millions of lines) named 1.txt.
Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
The resulting .txt files must not exceed a predefined limit, say, one million lines.
That is, if there are 3.5M lines in 1.txt that match those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the latter will contain 500K lines), that's all.
I tried to make use of something like
gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.
You can perhaps use zgrep.
zgrep [ grep_options ] [ -e ] pattern filename.gz ...
NOTE: zgrep is a wrapper script (installed with gzip package), which essentially uses the same command internally as mentioned in other answers.
However, it looks more readable in the script & easier to write the command manually.
I'm afraid It's imposible, quote from gzip man:
If you wish to create a single archive file with multiple members so
that members can later be extracted independently, use an archiver
such as tar or zip.
UPDATE: After de edit, if the gz only contains one file , a one step tool like awk shoul be fine:
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1}/my regex/{count+=1;print $0 >"part"global".txt";if (count==1000000){count=0;global+=1}}'
split is also a good choice but you will have to rename files after it.
Your solution is almost good. The problem is that You should specify for gzip what to do. To decompress use -d. So try:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But with this you will have a bunch of files like xaa, xab, xac, ... I suggest to use the PREFIX and numeric suffixes features to create better output:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file
In this case the result files will look like: file01, file02, fil03 etc.
If You want to filter out some not matching perl style regex, you can try something like this:
gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file
I hope this helps.

Extract .co.uk urls from HTML file

Need to extract .co.uk urls from a file with lots of entries, some .com .us etc.. i need only the .co.uk ones. any way to do that?
pd: im learning bash
edit:
code sample:
32
<tr><td id="Table_td" align="center">23<a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>
note some repeat
important: i need all links, broken or 404 too
found this code somwhere in the net:
cat file.html | tr " " "\n" | grep .co.uk
output:
href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"
think im close
thanks!
The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable faced with CDATA sections or other syntax which is hard to parse:
links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
| tac \
| sed -e '/^Links:/,$ d' \
-e 's/[0-9]\+.[[:space:]]//' \
| grep '^http://[^/]\+[.]co[.]uk'
It works as follows:
links (a text-based web browser) actually retrieves the site.
Using -dump causes the rendered page to be emitted to stdout.
Using -html-numbered-links requests a numbered table of links.
Using -anonymous tweaks defaults for added security.
tac reverses the output from Links in a line-ordered list
sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
sed -e 's/[0-9]\+.[[:space:]]//' removes the numbered headings from the individual links.
grep '^https\?://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
One way using awk:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html
output:
http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2
If you are only interested in unique urls, pipe the output into sort -u
HTH
Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from thos URLs, but by denying wget time for it lookup via DNS, it will not resolve anything and just print the URLs. You can then grep on those URLs that have .co.uk in them. The whole story becomes:
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"
If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
If you do not have wget, then you can get it here

Using sed to dynamically generate a file name

I have a CSV file that I'd like to split up based on a field in the file. Essentially, there can be two brands, GVA and HBVL. I'd like to split the file into a file for each brand before I import it into a database.
Sample of the CSV file
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
Part of the problem is the size of the file. It's about 39mb. My original attempt at this looked like this:
while read line ; do
name=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\2/ p' | tr [:upper:] [:lower:] `
info=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\1\3/ p'`
echo "${info}" >> ${BASEDIR}/${today}/${name}.txt
done < ${file}
After about 2.5 hours, only about 1/2 of the file had been processed. I have another file that could potentially be up to 250 mb in size and I can't imagine how long that would take.
What I'd like to do is pull out the brand out of the line and write the line to a file named after the brand. I can remove the brand, but I don't now how to use it to create a file. I've started in sed, but I'm not above using another language if it's more appropriate.
The original while loop with multiple commands per line is DIRE!
sed -e '/"GVA"/w gva.file' -e '/"HBVL"/w hbvl.file' -n $file
The sed script says:
write lines that match the GVA tag to gva.file
write lines that match the HBVL tag to hbvl.file
and don't print anything else ('-n')
Note that different versions of sed can handle different numbers of auxilliary files. If you need more than, say, twenty output files at once, you may need to look at other technology (but test what the limit is on your machine). If the file is sorted so that all the GVA records appear together followed by all the HBVL records, you could consider using csplit. Alternatively, a scripting language like Perl could handle more. If you exceed the number of file descriptors allowed to your process, it becomes hard to do the splitting in a single pass over the data file.
grep '"GVA"' $file >GVA.txt
grep '"HVBL"' $file >HVBL.txt
# awk -F"," '{o=$5;gsub(/\"/,"",o);print $0 > o}' OFS="," file
# more GVA
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
# more HBVL
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0

Resources