Filtering output from wget using sed - bash

How would I go about taking the output of a POST call made with wget and filtering out everything but the string I want, using sed? In other words, let's say I have some wget call that returns (as part of some larger string):
'userPreferences':'some stuff' }
How would I get the string "some stuff" such that the command would look something like:
sed whatever-command-here | wget my-post-parameters some-URL
Also is that the proper way to chain the two as one line?

You want the output of wget to go to sed, so the order would be wget foo | sed bar
wget -q -O - someurl | sed ...
The -q flag will silence most of wget's output and -O - will write to standard output, so you can then pipe everything to sed.
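Putting it together, a minimal sketch (the URL and POST data are placeholders; the sed expression assumes the 'userPreferences':'...' line shown in the question):
wget -q -O - --post-data='key=value' http://example.com/api \
  | sed -n "s/.*'userPreferences':'\([^']*\)'.*/\1/p"
The -n flag together with the p modifier prints only lines where the substitution matched, so everything else in the response is discarded.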

The pipe works the other way around: it chains the left command's output to the right command's input:
wget ... | sed -n "/'userPreferences':/{s/[^:]*://;s/}$//p}" # keeps quotes
The filtering might be easier to express with GNU grep though:
wget ... | grep -oP "(?<='userPreferences':').*(?=' })" # strips the quotes, too
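You can try either filter without hitting the network by echoing the sample line from the question through it:
echo "'userPreferences':'some stuff' }" | grep -oP "(?<='userPreferences':').*(?=' })"
# prints: some stuff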

If you are on a system that supports named pipes (FIFOs) or the /dev/fd method of naming open files, you could avoid a pipe and use < <(...)
sed whatever-command-here < <(wget my-post-parameters some-URL)

Related

Strip characters from cURL command output

I am looking to capture the download progress of a cURL command by taking only the first few characters of its progress bar output. Normally, I would use ${string:position:length}, but that doesn't seem to work in this situation.
Here's what I'm working with:
curl -O https://file.download.link/ > output.txt 2>&1
As you can see, I'm redirecting the output of the cURL command to the file output.txt, but let's say I want to only store the first three characters. Using what I just suggested returns a 'bad substitution' error:
echo ${curl -O https://file.download.link/:0:3} > output.txt 2>&1
so I'm out of my depth here.
If you'd like some more context, I was hoping to then change the command to output to a named pipe, so that it would change the progress of a CocoaDialog progress bar. I'm basically giving a GUI representation of the cURL download progress bar.
I would really appreciate any help or advice you could offer, so thank you in advance.
... and my apologies if this is a 'bad' question. I'm fairly new to bash, and scripting in general for that matter.
Here are two methods to get the first three characters of every line/update that curl produces. Note that, after curl prints its header and first output line, each subsequent line/update of output is preceded not by a newline character but by a carriage return, \r. On a terminal, this gives the output its nice update-in-place look. In our case, we have to add, as shown below, a little bit of special handling to interpret the \r correctly.
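You can verify the carriage returns yourself by dumping curl's progress output through od (same URL as in the sample output further down):
curl -O http://www.google.com/index.html 2>&1 | od -c | head -n 5
# look for \r between the progress updates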
Using tr and grep
curl -O https://file.download.link/ 2>&1 | tr '\r' '\n' | grep -o '^...' >output.txt
Using awk
curl -O https://file.download.link/ 2>&1 | awk -v RS='\r' '{print substr($0,1,3)}' >output.txt
Sample Output
$ curl -O http://www.google.com/index.html 2>&1 | awk -v RS='\r' '{print substr($0,1,3)}'
%
0
100
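And for the CocoaDialog angle mentioned in the question, a rough sketch of feeding those percentages to a named pipe (the FIFO path /tmp/progress is an assumption, and a reader such as CocoaDialog must already have the pipe open, otherwise the shell blocks on the redirect):
mkfifo /tmp/progress    # hypothetical FIFO path
curl -O https://file.download.link/ 2>&1 \
  | tr '\r' '\n' | grep -o '^...' > /tmp/progress &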

bash script grep using variable fails to find result that actually does exist

I have a bash script that iterates over a list of links, curls down an HTML page per link, greps for a particular string format (the syntax is CVE-####-####), removes the surrounding HTML tags (this is a consistent format, no special-case handling necessary), searches a changelog file for the resulting string ID, and finally does stuff based on whether the string ID was found or not.
The found string ID is set as a variable. The issue is that when grepping for the variable there are no results, even though I positively know there should be for some of the ID's. Here is the relevant portion of the script:
for link in $(cat links.txt); do
    curl -s "$link" | grep 'CVE-' | sed 's/<[^>]*>//g' | while read cve; do
        echo "$cve"
        grep "$cve" ./changelog.txt
    done
done
If I hardcode a known ID in the grep command, the script finds the ID and returns things as expected. I've tried many variations of grepping on this variable (e.g. exporting it and doing command expansion, cat'ing the changelog and piping to grep, setting variable directly via command expansion of the curl chain, single and double quotes surrounding variables, half a dozen other things).
Am I missing something nuanced with the outputted variable from the curl | grep | sed chain? When it is echo'd to stdout or >> to a file, things look fine (a single ID with no odd characters or carriage returns etc.).
Any hints or alternate solutions would be much appreciated. Thanks!
FYI:
OSX: $ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Edit:
The html file that I was curl'ing was chock full of carriage returns. Running the script with set -x was helpful because it revealed the true string being grepped: $'CVE-2011-2716\r'.
+ read -r link
+ curl -s http://localhost:8080/link1.html
+ sed -n '/CVE-/s/<[^>]*>//gp'
+ read -r cve
+ grep -q -F $'CVE-2011-2716\r' ./kernelChangelog.txt
Also investigating from another angle, opening the curled file in vim showed ^M and doing a printf %s "$cve" | xxd also showed the carriage return hex code 0d appended to the grep'd variable. Relying on 'echo' stdout was a wrong way of diagnosing things. Writing a simple html page with a valid CVE-####-####, but then adding a carriage return (in vim insert mode just type ctrl-v ctrl-m to insert the carriage return) will create a sample file that fails with the original script snippet above.
This is pretty standard string sanitization stuff that I should have figured out. The solution is to remove carriage returns, piping to tr -d '\r' is one method of doing that. I'm not sure there is a specific duplicate on SO for this series of steps, but in any case here is my now working script:
while read -r link; do
    curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | tr -d '\r' | while read -r cve; do
        if grep -q -F "$cve" ./changelog.txt; then
            echo "FOUND: $cve"
        else
            echo "NOT FOUND: $cve"
        fi
    done
done < links.txt
HTML files can contain carriage returns at the ends of lines, you need to filter those out.
curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | tr -d '\r' | while read cve; do
Notice that there's no need to use grep; the sed command can do the filtering itself with an address pattern. (You could remove the carriage returns within sed as well, but doing that for \r is cumbersome, so I piped to tr instead.)
It should look like this:
# First: Care about quoting your variables!
# Use read to read the file line by line
while read -r link; do
    # No grep required. sed can do that.
    curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | tr -d '\r' | while read -r cve; do
        echo "$cve"
        # grep -F searches for fixed strings instead of patterns
        grep -F "$cve" ./changelog.txt
    done
done < links.txt
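A self-contained demo of the failure and the fix (the CVE id is taken from the trace above; the changelog line is made up for the demo):
echo 'CVE-2011-2716: fixed hostname parsing' > changelog.txt
cve=$'CVE-2011-2716\r'                  # simulated value with a stray CR
grep -q -F "$cve" changelog.txt || echo 'NOT FOUND (CR still attached)'
cve=$(printf '%s' "$cve" | tr -d '\r')  # strip the carriage return
grep -q -F "$cve" changelog.txt && echo 'FOUND'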

wget grep sed to extract links and save them to a file

I need to download all page links from http://en.wikipedia.org/wiki/Meme and save them to a file all with one command.
First time using the command line, so I'm unsure of the exact commands, flags, etc. to use. I only have a general idea of what to do and had to search around to find out what href means.
wget http://en.wikipedia.org/wiki/Meme -O links.txt | grep 'href=".*"' | sed -e 's/^.*href=".*".*$/\1/'
The output of the links in the file does not need to be in any specific format.
Using GNU grep:
grep -Po '(?<=href=")[^"]*' links.txt
or with wget:
wget http://en.wikipedia.org/wiki/Meme -q -O - | grep -Po '(?<=href=")[^"]*'
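To also save the links to a file in one command, as asked, just redirect the output:
wget http://en.wikipedia.org/wiki/Meme -q -O - | grep -Po '(?<=href=")[^"]*' > links.txt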
You could use wget's spider mode. See this SO answer for an example.
wget http://en.wikipedia.org/wiki/Meme -q -O - | sed -n 's/.*href="\([^"]*\)".*/\1/p' > links.txt
but this only takes one href per line; if there is more than one, the others are lost (same as in your original line). You also forgot to include a group (\( -> \)) in your original sed pattern, so \1 refers to nothing.
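One way to stay with sed and still catch multiple hrefs on a line is to split the input on tag boundaries first (a sketch, not the only approach):
wget http://en.wikipedia.org/wiki/Meme -q -O - \
  | tr '>' '\n' | sed -n 's/.*href="\([^"]*\)".*/\1/p' > links.txt
Splitting on '>' puts each tag on its own line, so the one-match-per-line limit no longer loses links.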

Use pipe of commands as argument for diff

I am having trouble with this simple task:
cat file | grep -E ^[0-9]+$ > file_grep
diff file file_grep
Problem is, I want to do this without file_grep
I have tried:
diff file `cat file | grep -E ^[0-9]+$`
and
diff file "`cat file | grep -E ^[0-9]+$`"
and a few other combinations :-) but I can't get it to work.
I always get an error, where diff receives extra arguments which are the content of the file filtered by grep.
Something similar has always worked for me when I wanted to echo command output from within a script, like this (using backticks):
echo `ls`
Thanks
If you're using bash:
diff file <(grep -E '^[0-9]+$' file)
The <(COMMAND) sequence expands to the name of a pseudo-file (such as /dev/fd/63) from which you can read the output of the command.
But for this particular case, ruakh's solution is simpler. It takes advantage of the fact that - as an argument to diff causes it to read its standard input. The <(COMMAND) syntax becomes more useful when both arguments to diff are command output, such as:
diff <(this_command) <(that_command)
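For instance, to compare two command outputs directly without any temporary files (the file names here are placeholders):
diff <(sort file1) <(sort file2)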
The simplest approach is:
grep -E '^[0-9]+$' file | diff file -
The hyphen - as the filename is a specific notation that tells diff "use standard input"; it's documented in the diff man-page. (Most of the common utilities support the same notation.)
The reason that backticks don't work is that they capture the output of a command and pass it as an argument. For example, this:
cat `echo file`
is equivalent to this:
cat file
and this:
diff file "`cat file | grep -E ^[0-9]+$`"
is equivalent to something like this:
diff file "123
234
456"
That is, it actually tries to pass 123, 234, and 456 (newlines included) as a single filename, rather than as the contents of a file. Technically, you could achieve the latter by using Bash's "process substitution" feature that actually creates a sort of temporary file:
diff file <(cat file | grep -E '^[0-9]+$')
but in your case it's not needed, because of diff's support for -.
grep -E '^[0-9]+$' file | diff - file
where - means "read from standard input".
Try process substitution:
$ diff file <(grep -E "^[0-9]+$" file)
From the bash manpage:
Process Substitution
Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files. It takes the form of <(list) or >(list). The process list is run with its input or output connected to a FIFO or some file in /dev/fd. The name of this file is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list. If the <(list) form is used, the file passed as an argument should be read to obtain the output of list.
In bash, the syntax is
diff file <(cat file | grep -E ^[0-9]+$)

bash grep newline

Hi, I need to extract from the file:
first
second
third
using the grep command, the following line:
second
third
What should the grep command look like?
Instead of grep, you can use pcregrep which supports multiline patterns
pcregrep -M 'second\nthird' file
-M allows the pattern to match more than one line.
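If you don't have pcregrep but your grep is GNU grep built with PCRE support, the -z flag gets a similar effect by treating the whole file as one "line" (note that the output is NUL-terminated):
grep -Pzo 'second\nthird' file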
Your question title, "bash grep newline", implies that you want to match on the second\nthird sequence of characters, i.e. something containing a newline within it.
Since grep works on "lines", and these two are different lines, you would not be able to match it this way.
So, I'd split it into several tasks:
you match the line that contains "second" and output the line that has matched and the subsequent line:
grep -A 1 "second" testfile
you translate every other newline into a sequence that is guaranteed not to occur in the input. I think the simplest way to do that would be using perl:
perl -npe '$x=1-$x; s/\n/##UnUsedSequence##/ if $x;'
you do a grep on these lines, this time searching for string ##UnUsedSequence##third:
grep "##UnUsedSequence##third"
you unwrap the unused sequences back into newlines; sed might be the simplest:
sed -e 's/##UnUsedSequence##/\n/'
So the resulting pipe command to do what you want would look like:
grep -A 1 "second" testfile | perl -npe '$x=1-$x; s/\n/##UnUsedSequence##/ if $x;' | grep "##UnUsedSequence##third" | sed -e 's/##UnUsedSequence##/\n/'
Not the most elegant by far, but should work. I'm curious to know of better approaches, though - there should be some.
I don't think grep is the way to go on this.
If you just want to strip the first line from any file (to generalize your question), I would use sed instead.
sed '1d' INPUT_FILE_NAME
This will send the contents of the file to standard output with the first line deleted.
Then you can redirect the standard output to another file to capture the results.
sed '1d' INPUT_FILE_NAME > OUTPUT_FILE_NAME
That should do it.
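A quick check against the sample data from the question:
printf 'first\nsecond\nthird\n' | sed '1d'
# second
# third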
If you have to use grep and just don't want to display the line with first on it, then try this:
grep -v first INPUT_FILE_NAME
By passing the -v switch, you are telling grep to show you everything but the expression that you are passing. In effect show me everything but the line(s) with first in them.
However, the downside is that a file with multiple occurrences of first in it will not show those other lines either, which may not be the behavior you are expecting.
To shunt the results into a new file, try this:
grep -v first INPUT_FILE_NAME > OUTPUT_FILE_NAME
Hope this helps.
I don't really understand what you want to match. I would not use grep, but one of the following:
tail -2 file # to get last two lines
tail -n +2 file # to get all but the first line
sed -e '2,3p;d' file # to get lines from second to third
(not sure how standard it is, it works in GNU tools for sure)
So you just don't want the line containing "first"? -v inverts the grep results.
$ echo -e "first\nsecond\nthird\n" | grep -v first
second
third
Line? Or lines?
Try
grep -E -e '(second|third)' filename
Edit: grep is line-oriented, so you're going to have to use Perl, sed, or awk to perform the pattern match across lines.
BTW, -E tells grep that the regexp is an extended RE.
grep -A1 "second" | grep -B1 "third" works nicely, and if you have multiple matches it will even get rid of the original -- match delimiter
grep -E '(second|third)' /path/to/file
egrep -w 'second|third' /path/to/file
you could use
$ grep -1 third filename
This will print the matching line and one line before and after. Since "third" is on the last line, you get the last two lines.
I like notnoop's answer, but building on AndrewY's answer (which is better for those without pcregrep, but way too complicated), you can just do:
RESULT=`grep -A1 -s -m1 '^\s*second\s*$' file | grep -s -B1 -m1 '^\s*third\s*$'`
grep -v '^first' filename
Where the -v flag inverts the match.
