s3cmd count lines with zcat and grep - shell

I need to count the number of entries containing certain characters in zipped (.gz) files from an S3 bucket. How can I do it?
Specifically, my S3 bucket is s3://mys3.com/. Under that, there are thousands of buckets like the following:
s3://mys3.com/bucket1/
s3://mys3.com/bucket2/
s3://mys3.com/bucket3/
...
s3://mys3.com/bucket2000/
In each of these buckets, there are hundreds of zipped (.gz) JSON files like the following:
s3://mys3.com/bucket1/file1.gz
s3://mys3.com/bucket1/file2.gz
s3://mys3.com/bucket1/file3.gz
...
s3://mys3.com/bucket1/file100.gz
Each zipped file contains about 20,000 JSON objects (each JSON object is one line). Within the JSON objects, certain fields contain the word "request". I want to count how many JSON objects in bucket1 contain the word "request". I tried this, but it did not work:
zcat s3cmd --recursive ls s3://mys3.com/bucket1/ | grep "request" | wc -l
I do not have a lot of shell experience, so could anyone help me with that? Thanks!

In case anyone is interested:
s3cmd ls --recursive s3://mys3.com/bucket1/ | awk '{print $4}' | grep '.gz' | xargs -I# s3cmd get # - | zgrep 'request' | wc -l
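For anyone adapting this, here is the same pipeline spelled out with comments. It is only a sketch: the bucket path and the word "request" come from the question, and it uses {} as the xargs placeholder because an unquoted # can be taken as a comment start by bash.
s3cmd ls --recursive s3://mys3.com/bucket1/ |   # list every object under the prefix
    awk '{print $4}' |                          # keep only the object-URI column
    grep '\.gz$' |                              # keep only the .gz objects
    xargs -I{} s3cmd get {} - |                 # stream each object to stdout
    zgrep 'request' |                           # decompress and keep lines containing "request"
    wc -l                                       # count the matching JSON lines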

Related

How to count occurrences of a list of words in the first n lines of multiple files?

I currently have a list of terms, words.txt, with each term on one line, and I want to count how many total occurrences of all those terms exist in the first 500 lines of multiple CSV files in the same directory.
I currently have something like this:
grep -Ff words.txt /some/directory |wc -l
How exactly can I get the program to display, for each file, the count for just the first 500 lines? Do I have to create new files with the 500 lines? How can I do that for a large number of original files? I'm very new to coding and working on a dataset for research, so any help is much appreciated!
Edit: I want it to display something like this but for each file:
grep -Ff words.txt list1.csv |wc -l
/Users/USER/Desktop/FILE/list1.csv:28
This works for me.
head -n 100 /some/directory/* | grep -Ff words.txt | wc -l
Sample Result: 38
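If what you actually need is a separate count per file, limited to the first 500 lines of each (as in the edit above), here is a minimal sketch, assuming the CSV files live in /some/directory and words.txt is in the current directory. Like grep -c, it counts matching lines rather than every individual occurrence:
for f in /some/directory/*.csv; do
    count=$(head -n 500 "$f" | grep -cFf words.txt)   # matches within the first 500 lines only
    printf '%s:%s\n' "$f" "$count"                    # e.g. /some/directory/list1.csv:28
done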

Concatenate files with same partial id BASH

I have a directory with many fq.gz files. I want to loop over the filenames and concatenate any files that share the same partial ID. For example, out of the 1000 files in the directory, these files need to be concatenated into a single file (as they share the same ID from "L1" onwards):
141016-FC012-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141031-FC01229-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141020-FC01209-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141027-FC013-L1-N707-S504--123V_pre--Hs--R1.fq.gz
141023-FC01219-L1-N707-S504--123V_pre--Hs--R1.fq.gz
Can anyone help??
Probably not the best way, but this might do what you need:
while IFS= read -r -d '' id; do
    cat *"$id" > "/some/location/${id%.fq.gz}_grouped.fq.gz"
done < <(printf '%s\0' *.fq.gz | cut -zd- -f3- | sort -uz)
This will create files with the following format:
<ID>_grouped.fq.gz
L1-N707-S504--123V_pre--Hs--R1_grouped.fq.gz
...
...
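For comparison, a sketch of the same idea that loops over the files directly and appends instead of globbing per ID; /some/location/ is the answer's placeholder output directory, and it assumes the shared ID always starts at the third '-'-separated field:
for f in *.fq.gz; do
    id=${f#*-}     # drop the run-date field (e.g. 141016-)
    id=${id#*-}    # drop the flow-cell field (e.g. FC012-)
    cat -- "$f" >> "/some/location/${id%.fq.gz}_grouped.fq.gz"
done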

How to list files (with spaces) in an S3 bucket using a shell script (bash)?

I am listing all the files in an S3 bucket and writing them to a text file. For example, my bucket has the following list of files:
text.zip
fixed.zip
hello.zip
good test.zip
I use the following code:
fileList=$(aws s3 ls s3://$inputBucketName/ | awk '{print $4}')
if [ ! -z "$fileList" ]
then
    $AWS_CLI s3 ls s3://$inputBucketName/ | awk '{print $1,$2,$4}' > s3op.txt
    sort -k1,1 -k2 s3op.txt > s3op_srt.txt
    awk '{print $3}' s3op_srt.txt > filesOrder.txt
fi
cat filesOrder.txt;
After this, I iterate over the file names from the file I created (I delete each file from S3 at the end of the loop, so it won't be processed again):
fileName=`head -1 filesOrder.txt`
the files are listed like below:
text.zip
fixed.zip
hello.zip
good
So the problem is that the listing does not handle file names with spaces correctly.
As the file name is returned as "good" and not as "good test.zip", the script is not able to delete the file from S3.
Expected Result is
text.zip
fixed.zip
hello.zip
good test.zip
I used following command to delete files in S3:
aws s3 rm s3://$inputBucketName/$fileName
Put the full file path in double quotes.
For example:
aws s3 rm "s3://test-bucket/good test.zip"
In your case, it would be:
aws s3 rm "s3://$inputBucketName/$fileName"
Here, even if fileName has spaces, the file will be deleted.
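Beyond quoting the path, the listing step itself can be made space-safe, since awk '{print $4}' keeps only the first word of a name like "good test.zip". A sketch, assuming the usual four-column aws s3 ls output (date, time, size, key):
aws s3 ls "s3://$inputBucketName/" |
    awk '{ $1 = $2 = $3 = ""; sub(/^ +/, ""); print }' > filesOrder.txt   # keep the whole key, spaces included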

Logging how many times a certain file is requested on Apache2

Looking for some advice here.
I know this can be done using AWStats or something similar, but that seems like overkill for what I want to do here.
I have a directory in my webroot that contains thousands of XML files.
These are all loaded by calls to a single swf file using GET requests in the url.
eg :
https://www.example.com/myswf.swf?url=https://www.example.com/xml/1234567.xml
The urls are built dynamically and there are thousands of them. All pointing to the same swf file, but pulling in a different XML file from the XML directory.
What I'm looking to do is log to a text file how many times each of those individual XML files is requested.
As I know the target directory, is there a bash script or something that I can run that will monitor the XML directory and log each hit with a timestamp?
eg :
1234567.xml | 1475496840
7878332.xml | 1481188213
etc etc
Any suggestions?
A simpler, more direct approach:
sort requests.txt | uniq -c
Where I'm assuming all your request URLs are in a file called requests.txt.
Better formatted output:
awk -F/ '{print $8}' requests.txt | sort | uniq -c
This is an ugly way, since it uses loops to process text rather than an elegant awk array, but it should work (slowly). Optimisation is definitely required.
I'm assuming all your request URLs are in a file called requests.txt.
#Put all the unique URLs in an index file
awk -F/ '{print $8}' requests.txt | sort -u > index
#Look through the file to count the number of occurrences of each item.
while read -r i
do
    echo -n "$i | "
    grep -c -w "$i" requests.txt
done < index
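For reference, a sketch of the awk-array version alluded to above; it makes the same assumption that the XML file name is the 8th '/'-separated field of each request URL:
awk -F/ '{ count[$8]++ } END { for (f in count) print f, "|", count[f] }' requests.txt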

My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?

I have a file.gz (not a .tar.gz!) or a file.zip file. It contains one file named 1.txt (a 20 GB text file with tens of millions of lines).
Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
The resulting .txt files must not exceed a predefined limit, say, one million lines.
That is, if there are 3.5M lines in 1.txt that match those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the latter will contain 500K lines), that's all.
I tried to make use of something like
gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.
You can perhaps use zgrep.
zgrep [ grep_options ] [ -e ] pattern filename.gz ...
NOTE: zgrep is a wrapper script (installed with the gzip package) that essentially runs the same command internally as mentioned in the other answers.
However, it reads better in a script and is easier to type by hand.
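Applied to this question, a sketch (assuming your grep supports -P for Perl-style regexes; 'my regex' and 'other regex' are the question's placeholders):
zgrep -P 'my regex' path/to/test/file.gz | grep -vP 'other regex' | split -d -l 1000000 - part
split then writes part00, part01, and so on, each at most one million lines.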
I'm afraid it's impossible; quoting from the gzip man page:
If you wish to create a single archive file with multiple members so
that members can later be extracted independently, use an archiver
such as tar or zip.
UPDATE: After the edit, if the .gz only contains one file, a one-step tool like awk should be fine:
gzip -cd path/to/test/file.gz | awk 'BEGIN { global = 1 } /my regex/ { count += 1; print $0 > ("part" global ".txt"); if (count == 1000000) { count = 0; global += 1 } }'
split is also a good choice, but you will have to rename the files afterwards.
Your solution is almost good. The problem is that you have to tell gzip what to do: to decompress, use -d. So try:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But with this you will have a bunch of files like xaa, xab, xac, ... I suggest using the PREFIX and numeric-suffix features to create better output:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file
In this case the result files will look like file00, file01, file02, and so on.
If you also want to filter out lines matching another Perl-style regex, you can try something like this:
gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file
I hope this helps.
