bash: sed and/or grep having problems with specific line

For a course called 'Programming Techniques', I have to scan a file with lines having the following format:
[IP-Address] - - [[Date and time]] "GET [some URL]" [HTML reply code] [some non-interesting number]
An example:
129.232.223.206 - - [30/Apr/1998:22:00:02 +0000] "GET /images/home_intro.anim.gif HTTP/1.0" 200 60349
My task is to scan all lines and extract from it the HTTP reply code only if this code is not equal to 200.
We have to use the command line. The following almost works:
cat file.out | sed 's/^.*\"[[:space:]]//' | sed 's/[[:space:]].*//' | grep -v '200' | sort | uniq 1> result1.txt
First, read in the file, remove everything up until the second " and the space after it, remove everything from the first space to the end, remove lines with 200, sort the numbers, remove duplicates, and send the remaining numbers to a file.
This produces the following output:
-
206
26.146.85.150ÀüŒÛ/ HTTP/1.0" 404 305
302
304
400
404
500
As we can see, it almost works. There is one line causing trouble:
26.146.85.150 - - [01/May/1998:16:47:28 +0000] "GET /images/home_fr_phra><HR><H3>\C0\FC\BC\DB/ HTTP/1.0" 404 305
This line causes the weird third output-line. What is wrong with this line? The only thing I can think of is the part \C0\FC\BC\DB. Backslashes always seem to cause trouble. So, what part of my command conflicts with this line?
Also, I noticed that if I switched sort and uniq, the file does get sorted, but duplicates do not get removed. Why?
(By the way, I'm relatively new to using the command line for the purposes described above.)

So, this looks like an encoding SNAFU. If I'm not mistaken, what's happening is:
You're using a UTF-8 locale,
The input file does not contain valid UTF-8,
sed attempts to read the file as UTF-8 because of the aforementioned locale, and
sed breaks because of this (in particular, . does not match the offending bytes).
The stuff with the backslashes denotes a series of four bytes by their hex values, that is C0 FC BC DB. This is not valid UTF-8-encoded data (see footnote 1 below).
Given a UTF-8 locale, (GNU) sed interprets input as UTF-8, and . matches a valid UTF-8 character. It does not match invalid byte sequences. You can see this by running
echo -e '\xc0\xfc\xbc\xdb' | sed 's/.//g'
in a UTF-8 locale and noticing that the output is not empty. I am inclined to agree that this behavior is a bit of a nuisance, but here we are.
Since you don't seem to rely on any Unicode features, the solution could be to run sed with a non-UTF-8 locale, such as C. In your case:
cat file.out | LC_ALL=C sed 's/^.*\"[[:space:]]//' \
| LC_ALL=C sed 's/[[:space:]].*//' \
| grep -v '200' \
| sort \
| uniq \
> result1.txt
(line breaks added for readability). By the way, you could combine the two sed commands into a single one as follows:
LC_ALL=C sed 's/^.*\"[[:space:]]//; s/[[:space:]].*//'
1 c0 would be the lead byte of a two-byte UTF-8 sequence whose uppermost five payload bits are zero, which already makes no sense since the character could then be encoded as plain ASCII, and fc does not begin with the bits 10 that UTF-8 requires of a continuation byte. So, although I am unsure what the actual encoding is, it is definitely not UTF-8.
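On the side question about swapping sort and uniq, which the answer above does not address: uniq only collapses duplicate lines that are adjacent, so it needs sorted input to remove all duplicates. A minimal sketch, where codes is a hypothetical helper printing made-up status codes rather than data from the real log:

```shell
# Hypothetical sample of status codes, not taken from the real log file.
codes() { printf '%s\n' 404 304 404 302 304; }

# uniq before sort: no two equal lines are adjacent yet, so nothing is dropped.
codes | uniq | sort

# sort before uniq: equal lines end up next to each other and collapse.
codes | sort | uniq

# sort -u sorts and removes duplicates in one step.
codes | sort -u
```

The first pipeline prints five lines, the other two print three.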

Related

How can I mask 200 characters of each line in a file with 3000 long lines?

I have a fixed width text data file. Each line is 3000 characters long. I need to mask (change to "X") all the characters between position 1000 and 1200. There are no delimiters in the file, each field is known by its position in the line.
If I only needed to change 10 characters I could use sed:
sed -i -r 's/^(.{999}).{10}(.*)/\1XXXXXXXXXX\2/'
But writing a sed command with 200 X's does not seem like a good idea.
I tried using awk, but it returns different values for some lines because of spaces in the data.
"But writing a sed command with 200 X's does not seem like a good idea."
Let's do it anyway, but script it:
sed -E 's/^(.{999}).{200}/\1'"$(yes X | head -n200 | tr -d '\n')"'/'
Because it just so happens that 1000 % 200 == 0, I think we also could:
sed -E 's/.{200}/'"$(yes X | head -n200 | tr -d '\n')"'/6'
My go-to tools are, in order of increasing ability to get stuff done, sed, awk and python. You may want to consider stepping up :-)
In any case, this can be done in awk with some initial setup, something like:
BEGIN {x="XXXXXXXXXX"; x=x""x""x""x""x; x=x""x""x""x}
which gives you (10, then 50, then) 200 X's.
Then you can just fiddle with $0, which is the whole line regardless of spacing. Depending on what you actually meant by "between positions 1000 and 1200", the numbers below may be slightly different but you should get the idea:
{ print substr($0,1,999)""x""substr($0,1200) }
You can see how this will behave in the following snippet, replacing character positions 3 through 6 on each line:
pax> printf "hello there\ngoodbye\n" | awk '
...> BEGIN {x="X";x=x""x;x=x""x}
...> {print substr($0,1,2)""x""substr($0,7)}'
heXXXXthere
goXXXXe
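The same idea can be wrapped so that the start position and mask length are parameters instead of hard-coded string doublings. This is only a sketch: mask_range is a hypothetical helper name, and it assumes positions are 1-based and that the mask covers exactly LEN characters.

```shell
# mask_range START LEN: overwrite LEN characters, beginning at the 1-based
# position START, with X on every input line.
mask_range() {
  awk -v start="$1" -v len="$2" '
    BEGIN { for (i = 0; i < len; i++) mask = mask "X" }   # build the mask once
    { print substr($0, 1, start - 1) mask substr($0, start + len) }
  '
}

# Same demonstration as above: positions 3 through 6.
printf 'hello there\ngoodbye\n' | mask_range 3 4
```

For the 3000-character file this would be called as mask_range 1000 200 (or 1001 200, depending on where the field really starts).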
This might work for you (GNU sed):
sed -E '1{x;:a;/^x{200}/!s/^/x/;ta;x};G;s/^(.{999}).{200}(.*)\n(.*)/\1\3\2/' file
Prime the hold space with a string containing 200 x's. Append the hold space to the current line and using substitution replace the intended string with the mask.

Can MQSC command output be 'unformatted?'

I'm trying to write automation routines for IBM MQ (on an IBM i to make life that little more difficult) and struggling with processing the rather over-formatted output.
It seems to insist on two fixed-width columns for all output and as these columns aren't wide enough for some values (notably SYSTEM.* queue names) the output per entry can be on different numbers of lines.
I'd like to avoid writing a parser just to fetch basic values from MQ. Can I force the output to a single (long) line, or specify column widths? I've got enough Unix-fu to combine pairs of lines and even strip out text with the likes of grep, sed, paste, but when the number of lines changes I'm tearing my hair out.
Well, I managed to tame sed and grep enough to get a working solution that can handle the two-or-three line output. It's very specific to this situation but the concepts could be applied to similar scenarios.
In short, I did not find a way to influence the output format of the display command, but did find a way to process it.
The following QShell command (run it with STRQSH) gives me a CSV of queue, current depth, maximum depth. I then use CPYFRMIMPF to move this into a DB2 file for processing.
CHGVAR VAR(&QSH) VALUE('+
echo "dis qlocal(*) curdepth maxdepth"
| /QSYS.LIB/QMQM.LIB/RUNMQSC.PGM ''' |< &QMGR |< ''' +
| grep ''[A-Z]\{4,8\}('' +
| sed -e ''/QUEUE([-A-Za-z0-9._\/]*)$/{N;s/\n//;}'' +
-e ''/TYPE([-A-Za-z0-9._\/]*)$/{N;s/\n//;}'' +
-e ''/CURDEPTH([0-9]*)$/{N;s/\n//;}'' +
-e ''s/^\ \ *//'' +
-e ''s/\ \ */,/g'' +
-e ''s/QUEUE[(]\([-A-Za-z0-9._\/]*\)[)]/"\1"/'' +
-e ''s/TYPE[(][-A-Za-z0-9._\/]*[)],//'' +
-e ''s/CURDEPTH[(]\([0-9]*\)[)]/\1/'' +
-e ''s/MAXDEPTH[(]\([0-9]*\)[)]/\1/'' +
| grep -v ''SYSTEM.'' +
> /tmp/mqqueuests.csv+
')
It allows for queue names with alphanumeric and . - _ / characters.
The fundamental solution to the problem of the variable numbers of lines lies in finding lines that do not end with MAXDEPTH( ) and removing the subsequent linefeed by means of the N command in sed. This pulls the next line of the file into the pattern buffer, where the linefeed can be stripped.
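To see the N-based joining in isolation, here is a rough sketch that loops until the record ends in MAXDEPTH(...), rather than using the three separate patterns from the command above. The sample text is invented to resemble the two-column output and is not captured from a real queue manager; sample_output and join_records are hypothetical names.

```shell
# Made-up sample shaped like "dis qlocal(*) curdepth maxdepth" output;
# the queue names and depths are invented for illustration.
sample_output() {
  cat <<'EOF'
   QUEUE(APP.ORDERS)                       TYPE(QLOCAL)
   CURDEPTH(3)                             MAXDEPTH(5000)
   QUEUE(SYSTEM.ADMIN.ACCOUNTING.QUEUE)
   TYPE(QLOCAL)                            CURDEPTH(0)
   MAXDEPTH(3000)
EOF
}

# Keep appending the next line (N) until the record ends in MAXDEPTH(n),
# then trim leading blanks and squeeze runs of blanks to a single space.
join_records() {
  sed -e ':a' \
      -e '/MAXDEPTH([0-9]*) *$/!{' \
      -e 'N' \
      -e 's/\n */ /' \
      -e 'ba' \
      -e '}' \
      -e 's/^ *//' \
      -e 's/  */ /g'
}

sample_output | join_records
```

Both the two-line and the three-line entry come out as one line each, which is the property the CSV conversion relies on.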
Just ran into the same problem - wanted to have runmqsc output one line per object, without extra messages. I did it this way - on AIX:
echo "DISPLAY CHSTATUS(*) STATUS CHSTATI CHSTADA CURRENT LOCLADDR " | runmqsc MY_MQ_SERVER | sed 's/^[^ ].*$/%/g' | tr -s " " | tr -d "\n" | tr "%" "\n"
So, first of all I created a "separator" out of the unnecessary lines, that is, the lines beginning in the first column (the sed command). Then all runs of spaces are shrunk to one space and all newline characters are deleted. Finally, the last tr command converts the "separators" to newlines, thus creating exactly one line of output per object.
--Trifo

Using terminal to find PDF size

This is a normal output using pdfinfo
Creator: Pages
Producer: Mac OS X 10.10.1 Quartz PDFContext
CreationDate: Tue Mar 3 01:26:34 2015
ModDate: Tue Mar 3 01:26:34 2015
Tagged: no
Form: none
Pages: 5
Encrypted: no
Page size: 612 x 792 pts (letter) (rotated 0 degrees)
File size: 242463 bytes
Optimized: no
PDF version: 1.3
So I know I can do something like this to grab the amount of pages:
pdfinfo document.pdf | grep Pages: | awk '{print $2}'
I am trying to get the page size and output something like 612 x 792.
At the moment I am trying things like grep "Page size:" but it's obviously not the right way. Could anyone point me in the right direction?
grep/sed work:
pdfinfo document.pdf | \
grep "Page size:" | \
sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]]pts.*//'
This uses grep to reduce the text to just the line you are interested in, then sed to chop off the beginning and end of that line (for the example you showed).
In this example, there are two sed options (each is a script). Both change characters matching a given pattern to nothing, e.g.,
s/old/new/
but here new is an empty string.
The "^" character at the beginning is an "anchor", matching the beginning of the line. The "[^:]" uses "^" differently, matching any character except ":" (and the "*" says zero-or-more). So given "Page size:", the pattern "^[^:]*:" matches the whole thing. After the ":" on your line, there is some whitespace (which may be spaces or tabs). The POSIX character class "[:space:]" matches either, and is put inside brackets as you see: "[[:space:]]". Finally, the ".*" in the second option matches any character (".") zero or more times ("*").
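The grep and the two substitutions can also be folded into a single sed call. A small sketch, with extract_page_size as a hypothetical helper name and the sample line copied from the pdfinfo output in the question:

```shell
# extract_page_size: read pdfinfo output on stdin and print only the
# dimensions from the "Page size:" line.
extract_page_size() {
  sed -n \
    -e '/^Page size:/{' \
    -e 's/^[^:]*:[[:space:]]*//' \
    -e 's/[[:space:]]pts.*//' \
    -e 'p' \
    -e '}'
}

# Demonstration on text from the question instead of a real PDF.
printf '%s\n' 'Pages: 5' \
  'Page size:      612 x 792 pts (letter) (rotated 0 degrees)' \
  | extract_page_size
```

With a real file this would be used as pdfinfo document.pdf | extract_page_size.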

grep pipe searching for one word, not line

For some reason I cannot get this to output just the version of this line. I suspect it has something to do with how grep interprets the dash.
This command:
admin#DEV:~/TEMP$ sendemail
Yields the following:
sendemail-1.56 by Brandon Zehm
More output below omitted
The first line is of interest. I'm trying to store the version to variable.
TESTVAR=$(sendemail | grep '\s1.56\s')
Does anyone see what I am doing wrong? Thanks
TESTVAR is just empty. Even without TESTVAR, the output is empty.
I just tried the following too, thinking this might work.
sendemail | grep '\<1.56\>'
I just tried it again while editing, and I think I have another issue. Perhaps I'm not handling the output correctly. It's outputting the entire line, but I can see that grep is finding 1.56 because it highlights it in the line.
$ TESTVAR=$(echo 'sendemail-1.56 by Brandon Zehm' | grep -Eo '1\.56')
$ echo "$TESTVAR"
1.56
The point is grep -Eo '1\.56' (the dot is escaped so that it matches only a literal dot)
from grep man page:
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output
line.
Your regular expression doesn't match the form of the version. You have specified that the version is surrounded by spaces, yet in front of it you have a dash.
Replace the first \s with the capitalized form \S, or explicit set of characters and it should work.
I'm wondering: In your example you seem to know the version (since you grep for it), so you could just assign the version string to the variable. I assume that you want to obtain any (unknown) version string there. The regular expression for this in sed could be (using POSIX character classes):
sendemail |sed -n -r '1 s/sendemail-([[:digit:]]+\.[[:digit:]]+).*/\1/ p'
The -n suppresses the normal default output of every line; -r enables extended regular expressions; the leading 1 tells sed to only work on line 1 (I assume the version appears in the first line). I anchored the version number to the telltale string sendemail- so that potential other numbers elsewhere in that line are not matched. If the program name changes or the hyphen goes away in future versions, this wouldn't match any longer though.
Both the grep solution above and this one have the disadvantage of reading the whole output, which (as emails go these days) may be long. In addition, grep would find all other lines in the program's output which contain the pattern (if it's indeed emails, somebody might discuss this problem in them, with examples!). If the version is indeed on the first line, piping through head -n 1 first would be efficient and prudent.
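Putting the head suggestion and the anchored pattern together could look roughly like this. extract_version is a hypothetical helper, and the sketch assumes the banner is the first line of the output and has the form sendemail-MAJOR.MINOR as in the question:

```shell
# extract_version: print the dotted version number that follows "sendemail-"
# on the first input line, and nothing if that line has a different shape.
extract_version() {
  head -n 1 | sed -n -E 's/^sendemail-([0-9]+(\.[0-9]+)*).*/\1/p'
}

# Demonstration on the two lines shown in the question.
printf '%s\n' 'sendemail-1.56 by Brandon Zehm' 'More output below omitted' \
  | extract_version
```

To store the result: TESTVAR=$(sendemail | extract_version).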
jayadevan#jayadevan-Vostro-2520:~$ echo $sendmail
sendemail-1.56 by Brandon Zehm
jayadevan#jayadevan-Vostro-2520:~$ echo $sendmail | cut -f2 -d "-" | cut -f1 -d" "
1.56

I have some trouble with "grep"and collating symbols

Here is my problem. I have an existing file named data.f, and I want to match "48" in it using the collating symbol "48" in a bracket expression:
grep '[[.48.]]' data.f
but I get this error:
grep: Invalid collation character
There is no problem, however, with character classes in bracket expressions:
grep "[[:alpha:]]" data.f
if you want to grep 48
grep 48 file
if you want to grep "48"
grep '"48"' file
// to avoid discussion in comments I extend my post with more examples
if you want to grep n occurrences of "48" in one line you should use regular expressions
cat file | grep '\(.*"48"\)\{n\}' | grep -v '\(.*"48"\)\{n+1\}'
basically you grep lines with at least n occurrences, and then with invert-match you exclude lines with n+1 or more occurrences of the string, so you are left with lines that have exactly n occurrences
in your comment you mentioned you wanted to grep lines with 5 occurrences of "48", that CAN be separated by other characters (that's the reason I put .* before "48")
so here is the sample
cat file | grep '\(.*"48"\)\{5\}' | grep -v '\(.*"48"\)\{6\}'
Wouldn't grep '48' data.f work?
I have no idea what you mean by “I use collating symbol "48"” (I know what collation classes are, which is what grep expects to see in your input, but I don't know what a collation symbol would be), but from one of your comments, it seems you're actually looking for the exact string [[.48.]] in your file. Here's two ways of doing just that:
grep -F '[[.48.]]' data.f
grep '\[\[\.48\.]]' data.f
In one of your other comments, you asked for how to ask grep for lines with at least five occurrences of “48” on them. That's a pretty clear regex question:
grep -E '(.*48){5}' data.f
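If the goal is exactly five occurrences rather than at least five, counting the matches per line is an alternative to chaining two greps. This is a sketch using awk instead of grep, and lines_with_exactly is a hypothetical helper name:

```shell
# lines_with_exactly N: print the input lines that contain exactly N
# non-overlapping occurrences of the string 48.
lines_with_exactly() {
  # gsub returns the number of replacements it made; replacing each match
  # with itself ("&") on a copy of the line just counts the matches.
  awk -v n="$1" '{ line = $0; if (gsub(/48/, "&", line) == n) print }'
}

# Made-up input: five, two and six occurrences respectively.
printf '%s\n' '48 48 48 48 48' '48 48' '48a48b48c48d48e48' \
  | lines_with_exactly 5
```

Only the first of the three sample lines is printed.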
