wget grep sed to extract links and save them to a file - bash

I need to download all page links from http://en.wikipedia.org/wiki/Meme and save them to a file all with one command.
First time using the command line, so I'm unsure of the exact commands, flags, etc. to use. I only have a general idea of what to do and had to search around to learn what href means.
wget http://en.wikipedia.org/wiki/Meme -O links.txt | grep 'href=".*"' | sed -e 's/^.*href=".*".*$/\1/'
The output of the links in the file does not need to be in any specific format.

Using gnu grep:
grep -Po '(?<=href=")[^"]*' links.txt
or with wget
wget http://en.wikipedia.org/wiki/Meme -q -O - |grep -Po '(?<=href=")[^"]*'
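Since you wanted the links saved to a file, just redirect the output (same command as above, with a redirection added):
wget http://en.wikipedia.org/wiki/Meme -q -O - | grep -Po '(?<=href=")[^"]*' > links.txt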

You could use wget's spider mode. See this SO answer for an example.

wget http://en.wikipedia.org/wiki/Meme -q -O - | sed -n 's/.*href="\([^"]*\)".*/\1/p' > links.txt
but this only takes one href per line; if a line contains more than one, the others are lost (same as with your original line). You also forgot a group (\( ... \)) in your original sed pattern, so \1 refers to nothing.
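One way around the one-href-per-line limitation, assuming GNU sed (the \n in the replacement is a GNU extension): first break the input so each href starts its own line, then extract.
wget http://en.wikipedia.org/wiki/Meme -q -O - \
  | sed 's/href="/\n&/g' \
  | sed -n 's/^href="\([^"]*\)".*/\1/p' > links.txt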

Related

I want to pipe grep output to sed for input

I'm trying to pipe the output of grep to sed so it will only edit specific files. I don't want sed to rewrite files it wouldn't actually change, since that updates their modified date.
I'm searching with grep and writing with sed. That's it.
The thing I am trying to change is a dash, but not the normal kind: "-" is a normal hyphen, while "–" is an en dash.
The code I currently have:
sed -i 's/– foobar/- foobar/g' * ; perl-rename 's/– foobar/- foobar/' *'– foobar'*
Sorry about the trouble, I'm inexperienced.
Are you sure about what you want to achieve? Let me explain:
grep "string_in_file" <filelist> | sed <sed_script>
This first shows the "string_in_file" matches, each preceded by its filename.
If you launch sed on this, it will just show you the result of that sed script on screen, but it will not change the files themselves. In order to do that, you need something like the following:
grep -l "string_in_file" <filelist> | xargs sed -i <sed_script_on_file>
The grep -l shows you only the matching filenames; xargs turns those names into arguments for sed, whose script then reads each file and alters it in place.
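To make the difference between the two grep forms concrete (file names made up for illustration):
$ grep "old" *
notes.txt:this old line
$ grep -l "old" *
notes.txt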
Thank you all for helping; I'm sorry I wasn't faster in responding.
After a bit of fiddling with the command, I got it:
grep -l 'old' * | xargs -d '\n' sed -i 's/old/new/'
This should only touch files that contain old and leave all other files.
This might be what you're trying to do if your file names don't contain newlines:
grep -l -- 'old' * | xargs sed -i 's/old/new/'
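As a quick illustration of why word splitting matters here (GNU xargs assumed; the filename is made up):
printf 'old stuff\n' > 'my file.txt'                    # a filename containing a space
grep -l -- 'old' * | xargs sed -i 's/old/new/'          # breaks: sed receives "my" and "file.txt"
grep -l -- 'old' * | xargs -d '\n' sed -i 's/old/new/'  # works: arguments split on newlines only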

bash: cURL from a file, increment filename if duplicate exists

I'm trying to curl a list of 7000+ URLs to aggregate the tabular data on them. The URLs are in a .txt file. My goal was to cURL each line and save each page to a local folder, after which I would grep and parse out the HTML tables.
Unfortunately, because of the format of the URLs in the file (example.com/State/City.html), duplicates exist. When I ran a short while loop, I got back fewer than 5500 files, so there are at least 1500 dupes in the list. As a result, I tried to grep the "/State/City.html" section of each URL and pipe it to sed to remove the leading / and substitute a hyphen for the inner /, to use with curl -O.
Here's a sample of what I tried:
while read line
do
FILENAME=$(grep -o -E '\/[A-z]+\/[A-z]+\.htm' | sed 's/^\///' | sed 's/\//-/')
curl $line -o '$FILENAME'
done < source-url-file.txt
It feels like I'm missing something fairly straightforward. I've scanned the man page because I worried I had confused -o and -O which I used to do a lot.
When I run the loop in the terminal, the output is:
Warning: Failed to create the file State-City.htm
I think you don't need a multitude of seds and greps; a single sed should suffice:
urls=$(echo -e 'example.com/s1/c1.html\nexample.com/s1/c2.html\nexample.com/s1/c1.html')
for u in $urls
do
FN=$(echo "$u" | sed -E 's/^(.*)\/([^\/]+)\/([^\/]+)$/\2-\3/')
if [[ ! -f "$FN" ]]
then
touch "$FN"
echo "$FN"
fi
done
This script should work and also takes care of not downloading the same file multiple times.
Just replace the touch command with your curl one.
First: you didn't pass the URL to grep.
Second: try this line instead:
FILENAME=$(echo $line | egrep -o '\/[^\/]+\/[^\/]+\.html' | sed 's/^\///' | sed 's/\//-/')
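Putting both fixes together, a sketch of the corrected loop (untested; it assumes every URL ends in /State/City.html and skips names that were already downloaded):
while read -r line
do
    # derive State-City.html from the URL's last two path components
    FILENAME=$(echo "$line" | egrep -o '/[^/]+/[^/]+\.html' | sed 's|^/||; s|/|-|')
    # skip duplicates: only fetch if the file does not exist yet
    [ -f "$FILENAME" ] || curl -s "$line" -o "$FILENAME"
done < source-url-file.txt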

Filtering output from wget using sed

How would I go about taking the output from a POST call made using wget and filtering out everything but a string I want using sed? In other words, let's say I have some wget call that returns (as part of some larger string):
'userPreferences':'some stuff' }
How would I get the string "some stuff" such that the command would look something like:
sed whatever-command-here | wget my-post-parameters some-URL
Also is that the proper way to chain the two as one line?
You want the output of wget to go to sed, so the order would be wget foo | sed bar
wget -q -O - someurl | sed ...
The -q flag will silence most of wget's output and -O - will write to standard output, so you can then pipe everything to sed.
The pipe works the other way around. They chain the left command's output to the right command's input:
wget ... | sed -n "/'userPreferences':/{s/[^:]*://;s/}$//p}" # keeps quotes
The filtering might be easier to express with GNU grep though:
wget ... | grep -oP "(?<='userPreferences':').*(?=' })" # strips the quotes, too
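You can sanity-check either filter against the sample string before wiring in the real wget call:
echo "'userPreferences':'some stuff' }" | grep -oP "(?<='userPreferences':').*(?=' })"
# prints: some stuff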
If you are on a system that supports named pipes (FIFOs) or the /dev/fd method of naming open files, you could avoid a pipe and use < <(...)
sed whatever-command-here < <(wget my-post-parameters some-URL)

In bash, how to find files containing the string "test", excluding binary files

find . -type f | xargs grep string | awk -F":" '{print $1}' | uniq
The command above gets the names of all files that contain the string "test", but the result includes binary files.
The problem is how to exclude binary files.
Thank you all.
If I understand properly, you want to get the name of all the files in the directory and its subdirectories that contain the string string, excluding binary files.
Reading grep's friendly manual, I was able to catch this:
-I Process a binary file as if it did not contain matching data;
this is equivalent to the --binary-files=without-match option.
Amazing!
Now how about I get rid of find. Is this possible with just grep? Oh, two lines below, still in the funky manual, I read this:
-R, -r, --recursive
Read all files under each directory, recursively; this is
equivalent to the -d recurse option.
That seems great, doesn't it?
How about getting only the file name? Still in grep's funny manual, I read:
-l, --files-with-matches
Suppress normal output; instead print the name of each input
file from which output would normally have been printed. The
scanning will stop on the first match. (-l is specified by
POSIX.)
Yay! I think we're done:
grep -IlR 'string' .
Remarks.
I also tried to find make me a sandwich in the manual, but my version of grep doesn't seem to support it. YMMV.
The manual is located at man grep.
As William Pursell rightly comments, the -R and -I switches are not available in all implementations of grep. If your grep possesses the make me a sandwich option, it will very likely support the -R and -I switches. YMMV.
The version of Unix I work with does not support grep -I or -R.
I tried this command instead:
file `find ./` | grep text | cut -d: -f1 | xargs grep "test"
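For what it's worth, a variant of the same idea that survives spaces in file names (GNU find and xargs assumed; names containing colons or newlines would still break the cut):
find . -type f -print0 \
  | xargs -0 file \
  | grep text \
  | cut -d: -f1 \
  | xargs -d '\n' grep -l "test"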

grepping string from long text

The command below in OSX checks whether an account is disabled (or not).
I'd like to grep the string "isDisabled=X" to create a report of disabled users, but am not sure how to do this since the output is on three lines, and I'm interested in the first 12 characters of line three:
bash-3.2# pwpolicy -u jdoe -getpolicy
Getting policy for jdoe /LDAPv3/127.0.0.1
isDisabled=0 isAdminUser=1 newPasswordRequired=0 usingHistory=0 canModifyPasswordforSelf=1 usingExpirationDate=0 usingHardExpirationDate=0 requiresAlpha=0 requiresNumeric=0 expirationDateGMT=12/31/69 hardExpireDateGMT=12/31/69 maxMinutesUntilChangePassword=0 maxMinutesUntilDisabled=0 maxMinutesOfNonUse=0 maxFailedLoginAttempts=0 minChars=0 maxChars=0 passwordCannotBeName=0 validAfter=01/01/70 requiresMixedCase=0 requiresSymbol=0 notGuessablePattern=0 isSessionKeyAgent=0 isComputerAccount=0 adminClass=0 adminNoChangePasswords=0 adminNoSetPolicies=0 adminNoCreate=0 adminNoDelete=0 adminNoClearState=0 adminNoPromoteAdmins=0
Your ideas/suggestions are most appreciated! Ultimately this will be part of a Bash script. Thanks.
This is how you would use grep to match "isDisabled=X":
grep -o "isDisabled=."
Explanation:
grep: invoke the grep command
-o: Use the --only-matching option for grep. (From the grep manual: "Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.")
"isDisabled=.": This is the search pattern you give to grep. The . is part of the regular expression, it means "match any character except for newline".
Usage:
This is how you would use it as part of your script:
pwpolicy -u jdoe -getpolicy | grep -oE "isDisabled=."
This is how you can save the result to a variable:
status=$(pwpolicy -u jdoe -getpolicy | grep -oE "isDisabled=.")
If your command was run some time prior, and the results from the command was saved to a file called "results.txt", you use it as input to grep as follows:
grep -o "isDisabled=." results.txt
You can use sed as well:
sed -n 's/.*isDisabled=\(.\).*/\1/p' results.txt
This will print just the value of isDisabled.
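Since the end goal is a report of disabled users, here is a minimal sketch of the surrounding loop (untested; usernames are read from a hypothetical users.txt, one per line, and the pwpolicy output format is taken from the question):
while read -r user
do
    status=$(pwpolicy -u "$user" -getpolicy 2>/dev/null | grep -o 'isDisabled=.')
    [ "$status" = "isDisabled=1" ] && echo "$user is disabled"
done < users.txt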
