Pipe logs to csv in bash script

I'm at a bit of a loss on how to proceed with the task of piping log file contents to a csv file based on certain criteria.
Essentially, the problem is something like this:
Write a script that receives HTTP logs (or any arbitrary .log file) via pipe
input, and outputs a summarized csv of the number of hits per URL per day.
Example: executing the pipe command
cat access.log|some filter commands|./your_script > summary.csv
creates a text file called summary.csv with the content:
" Action and path. 2015-01-01, 2015-01-02. 2015-01-03
GET /index.php, 34, 53, 65
POST /administrator, 32, 59, 39
..."
and so forth.
The problem I'm facing at the moment is figuring out how to identify and execute specific parts of the pipe input command, and apply filters, before feeding it to the output pipe.
From what I'm familiar with, an array of command parameters (such as "cat", "gedit", ">", "|", etc.) might work, but this leaves the problem of identifying them and executing them as a pipe command would, instead of just one after the other.
I've searched quite thoroughly, but have so far found nothing even remotely helpful, aside from the suggestion to divide the pipe command into separate instructions and execute them one by one.
If anyone can suggest an easier and more effective way to do this, or any advice on this particular problem, it'd be much appreciated. Thanks in advance.

Perhaps you need the tee command. You can use it to "fork the pipe", which means that an output
file can be captured after any specific stage of the pipeline. It is very useful when looking for errors.
For example:
cat access.log | some filter commands | tee out01.txt \
| some other filter | tee out02.txt | ./your_script > summary.csv
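As for the summarizing script itself: it does not need to parse or execute any part of the pipeline; it only reads whatever arrives on its standard input. A minimal sketch in awk, assuming the upstream filter commands have already reduced each log line to three fields, date, method and path (e.g. 2015-01-01 GET /index.php); the field layout and the name summarize.awk are assumptions, not something given in the question:

#!/usr/bin/awk -f
# summarize.awk -- count hits per "method path" per day, print as csv.
# Dates become columns (in first-seen order), actions/paths become rows.
{
    if (!($1 in seen)) { seen[$1] = 1; dates[++n] = $1 }  # collect each date once
    key = $2 " " $3                                       # e.g. "GET /index.php"
    keys[key]++
    hits[key, $1]++                                       # hits per key per day
}
END {
    header = "Action and path"
    for (i = 1; i <= n; i++) header = header ", " dates[i]
    print header
    for (key in keys) {
        row = key
        for (i = 1; i <= n; i++)
            row = row ", " (hits[key, dates[i]] + 0)      # 0 when no hits that day
        print row
    }
}

It would then be invoked exactly as the task describes:
cat access.log | some filter commands | ./summarize.awk > summary.csv
(The order of "for (key in keys)" is unspecified, so pipe the result through sort if row order matters.)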

Related

Difference between cat file_name | sort, sort < file_name, sort file_name in bash

Although they do give the same results, I wonder if there is some difference between them, and which is the most appropriate way to sort something contained in a file.
Another thing that intrigues me is the use of delimiters. I noticed that the sort filter only works if you separate the strings with a new line; is there any way to do this without having to write the new strings on a separate line?
The sort(1) command reads lines of text, analyzes and sorts them, and writes out the result. The command is intended to read lines, and lines in unix/linux are terminated by a new line.
The command takes its first non-option argument as the file to read; if none is given, it reads standard input. So:
sort file_name
is a command line with such an argument. The other two examples, "... | sort" and "sort < ...", do not specify the file to read directly to sort(1), but use its standard input. The effect, as far as sort(1) is concerned, is the same.
ways to do this without having to write the new strings in a separate line
Ultimately no. But if you want, you can feed sort using another filter (a program) which reads the non-linefeed-separated file and creates lines to pass to sort. If such a program exists and is named "myparse", you can do:
myparse non-linefeed-separated-file | sort
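For example, if the items are separated by commas rather than newlines, the standard tr utility can play the role of that filter (the comma is just an assumed separator here):
tr ',' '\n' < non-linefeed-separated-file | sort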
The solution using cat involves creating a second process unnecessarily. This could be a performance issue if you perform many such operations in a loop.
When doing input redirection to your file, the shell is setting up the association of file with std input. If the file would not exist, the shell complains about the file being missing.
When passing the file name as an explicit argument, the sort process has to take care of opening the file, and it reports an error if there is an accessibility problem with it.
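A quick way to see who reports the error in each of the three forms (no-such-file is assumed not to exist):
sort no-such-file         # sort opens the file itself and reports the error
sort < no-such-file       # the shell reports the error; sort never runs
cat no-such-file | sort   # cat reports the error; sort sees empty input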

Running sed ON a variable in bash script

Apologies for a seemingly inane question. But I have spent the whole day trying to figure it out and it drives me up the wall. I'm trying to write a seemingly simple bash script that would take a list of files in the directory from ls, replace part of the file names using sed, get unique names from the list and pass them on to some command. Like so:
inputs=`ls *.ext`
echo $inputs
test1_R1.ext test1_R2.ext test2_R1.ext test2_R2.ext
Now I would like to put it through sed to replace 1.ext and 2.ext with * to get test1_R* etc. Then I'd like to remove the resulting duplicates by running sort -u, to arrive at the following $outputs variable:
echo $outputs
test1_R* test2_R*
And pass this on to a command, like so:
cat $outputs
I can do something like this in a command line:
ls *.ext | sed s/..ext/\*/g | sort -u
But if I try to assign the above to a variable in the script, it just returns the output from the ls. I have tried several ways to do it: including the whole pipe in the script; running each command separately and assigning its output to a variable, then passing that variable to the next command; and writing the outputs to files, then passing the file to the next command. But so far none of this has managed to achieve what I was aiming for. I think my problem lies in (besides general cluelessness around bash scripting) an inability to run sed on a variable within the script. There seems to be a lot of advice around on how to pass variables to the pattern or replacement string in sed, but it all seems to take files as input. But I understand that it might not be the proper way of doing it anyway. Therefore I would really appreciate it if someone could suggest an elegant way to achieve what I'm trying to do.
Many thanks!
Update 2/06/2014
Hi Barmar, thanks for your answer. Can't say it solved the problem, but it helped pin-point it. Seems like the problem is in my use of the asterisk. I have to say, I'm very puzzled. The actual file names I've got are:
test1_R1.fastq.gz test1_R2.fastq.gz test2_R1.fastq.gz test2_R2.fastq.gz
If I'm using the code you suggested, which seems to me the right way to do it:
ins=$(ls *.fastq.gz | sed 's/..fastq.gz/\*/g' | sort -u)
Sed doesn't seem to do anything and I'm getting the output of ls:
test1_R1.fastq.gz test1_R2.fastq.gz test2_R1.fastq.gz test2_R2.fastq.gz
Now if I replace that backslash with anything else, the sed works, but it also returns whatever character I put in front of (or after) the asterisk:
ins=$(ls *.fastq.gz | sed 's/..fastq.gz/"*/g' | sort -u)
test1_R"* test2_R"*
That's odd enough, but surely I can just put an "R" in front of the asterisk and then include the R in the search pattern string, right? Wrong! If I do that in whichever way: 's/R..fastq.gz/R*/g', 's/...fastq.gz/R*/g', 's/[A-Z]..fastq.gz/R*/g', I'm back to the original names! And even if I end up with something like test1_RR* test2_RR* and try to run it through sed again to replace "_R" with "_" or "RR" with "R", I have no luck and I'm back to the original names. And yet I can replace the rest of the file name no problem; I just can't get the test1_R* I need.
I have a feeling I should be escaping that * in some very clever way, but nothing I've tried seems to work. Thanks again for your help!
This is how you capture the result of the whole pipeline in a variable:
var=$(ls *.ext | sed s/..ext/\*/g | sort -u)
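As for the update above: sed most likely did do its job, and the unquoted expansion is the likely culprit. When the variable is echoed without quotes, the shell glob-expands each test1_R* word against the files in the directory, turning it right back into the original file names; with "* in the replacement, the glob test1_R"* matches nothing, so the literal text shows through. Quoting the expansion reveals the variable's real contents:

ins=$(ls *.fastq.gz | sed 's/..fastq.gz/*/g' | sort -u)
echo "$ins"    # prints test1_R* and test2_R* literally
echo $ins      # unquoted: the globs expand back to the .fastq.gz file names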

redirecting email text from procmail into bash script

I am trying to redirect emails that match a particular pattern to a shell script which will create files containing the texts, with datestamped filenames.
First, here is the routine from .procmailrc that hands the emails off to the script:
:0c:
* Subject: ^Ingest_q.*
| /home/myname/procmail/process
and here is the script 'process':
#!/bin/bash
DATE=`date +%F_%N`
FILE=/home/myname/procmail/${DATE}_email.txt
while read line
do
echo "$line" 1>>"$FILE";
done
I have gotten very frustrated with this because I can pipe text to this script on the command line and it works fine:
mybox-248: echo 'foo' | process
mybox-249: ls
2013-07-31_856743000_email.txt process
The file contains the word 'foo'.
I have been trying to get an email text to get output as a date-stamped file for hours now, and nothing has worked.
(I've also turned logging on in my .procmailrc and that isn't working either -- I'm not trying to ask a second question by mentioning that, just wondering if that might provide some hint as to what I might be doing wrong ...).
Thanks,
GB
Quoting your attempt:
:0c:
* Subject: ^Ingest_q.*
| /home/myname/procmail/process
The regex is wrong: ^ only matches at the beginning of a line, so it cannot occur after Subject:. Try this instead:
:0c:process.lock
* ^Subject: Ingest_q
| /home/myname/procmail/process
I also specified a named lockfile; I do not believe Procmail can infer a lock file name from just a script name. As you might have multiple email messages being delivered at the same time, and you don't want their logging intermingled in the log file, using a lock file is required here.
Finally, the trailing .* in the regex is completely redundant, so I removed it.
(The olde Procmail mini-FAQ also addresses both of these issues.)
I realize your recipe is probably just a quick test before you start on something bigger, but the entire recipe invoking the process script can be completely replaced by something like
MAILDIR=/home/myname/procmail
DATE=`date +%F_%N`
:0c:
${DATE}_email.txt
This will generate Berkeley mbox format, i.e. each message should have a From_ pseudo-header before the real headers. If you are not sure whether this is already the case, you should probably use procmail -Yf- to make it so (otherwise there is really no way to tell where one message ends and another begins; this applies both to your original solution and to this replacement).
Because Procmail sees the file name you are delivering to, it can infer a lockfile name now, as a minor bonus.
Using MAILDIR to specify the directory is the conventional way to do this, but you can specify a complete path to an mbox file if you prefer, of course.
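If you want to dry-run a recipe like this without waiting for real mail to arrive, procmail accepts an rcfile and variable assignments on the command line and reads the message from standard input. A rough sketch (test.rc and msg.txt are hypothetical names):

procmail VERBOSE=yes DEFAULT=/dev/null ./test.rc < msg.txt

VERBOSE=yes turns on diagnostic logging, and DEFAULT=/dev/null keeps the test message out of your real mailbox.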

Is this (simple) for loop doing what I want it to?

I have pretty much no experience with cygwin & UNIX but need to use it for extracting a large set of data from an even larger set of files...
I had some help yesterday with this short script, but (after running for ~7-8 hours) the script simply wrote to the same output file 22 times. At least that's what I think happened.
I've now changed the code to this (see below) but it would be really awesome if someone who knows how this is done properly could tell me if it's likely to work before I waste another 8 hours...
for chr in {1..22}
do
zcat /cygdrive/g/data/really_long_filename$chr | sed '/^#/d' | cut -f1-3 >> db_to_rs_$chr
done
I want it to read files 1..22, remove rows starting with #, and send columns 1 to 3 to a file ending with the same number 1..22.
Yesterday the last part was just ...-f1-3 >> db_to_rs, which I suspect just rewrote that file 22 times?
Help is much appreciated
~L
Yes, the code would work as expected.
When the command ended in ...-f1-3 >> db_to_rs, it essentially appended all the output to the file db_to_rs.
Saying ... >> db_to_rs_$chr would create filenames ending in 1 through 22.
However, note that saying >> would append the output to a file. So if db_to_rs1 already exists, the output would be appended. If you want to create a new file instead, say > instead of >>.
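Putting that together, a version of the loop that truncates each output file on every run (so that re-running the script does not stack duplicate data) would be:

for chr in {1..22}
do
zcat /cygdrive/g/data/really_long_filename$chr | sed '/^#/d' | cut -f1-3 > db_to_rs_$chr
done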

Grep -f and only return the first match

I'm working with a large CSV that follows a basic process:
1. Backup the working original.
2. Generate a skeleton CSV.
3. Read from another CSV, format the contents, and then append it to the skeleton.
4. Append the data from the backup to the new one.
The issue I'm running into is that when I read in the contents from the backup, I'm using grep -Ev -f with a file containing regexes to exclude undesired data in the backup from being included in the next revision. This currently presents a problem because grep appears to evaluate each regex in the file against every line from STDIN, which causes duplicates. The simple solution would be to pipe it through sort | uniq and call it a day, but that would mess with the formatting of the csv currently in use. I can elaborate if needed, but the short of it is that I run a script to bulk-process IP addresses, but there is also manual editing of the file by other people, and with the current form of the script the final output will be all of the automated content with the manual entries at the bottom of the file.
So, is there any way, without some ugly looping of grep, to tell it to stop evaluating a line after a pattern is matched? Using -m 1 will stop grep after the first match in the whole stream, where I need it to stop after each new line.
For the task you want to accomplish, it would be best in my opinion to use AWK. You can find an excellent tutorial for AWK at http://www.grymoire.com/Unix/Awk.html. You basically need to change the input field separator for awk with
awk -F',' -f foo.awk bar.dat
As far as the problem with sorting is concerned, follow this: http://www.linuxquestions.org/questions/linux-general-1/how-to-use-awk-to-sort-243177/
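For the "first match per pattern" behaviour the title asks about, here is a sketch of how it could look in awk (patterns.txt and data.csv are hypothetical names): the lines of the first file are collected as regexes, and each regex is retired as soon as it has matched once:

awk 'NR == FNR { pats[$0] = 1; next }
{ for (p in pats) if ($0 ~ p) { print; delete pats[p]; next } }' patterns.txt data.csv

Invert the action (skip the print, or print only non-matching lines) if, like the grep -Ev in the question, you want to drop rather than keep the matching lines.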
