So I have a file, let's call it "page.html". Within this file there are some links/file paths I want to extract. I've been working in Bash trying to get this right but can't seem to do it. The words/links/paths I want to grab all start with "/funny/hello/there/". The goal is for all these words to go to the terminal so I can use them.
This is kinda what I've tried so far, with no luck:
grep -E '^/funny/hello/there/' page.html
and
grep -Po '/funny/hello/there/.*?' page.html
Any help would be greatly appreciated, Thanks.
Here is sample data from the file:
`<td data-title="Blah" class="Blah" >
fdsksldjfah
</td>`
My output gives me all the different lines that look like this:
fdsksldjfah
The "/fkljaskdjfl" parts are all something different, though.
What I want the output to look like:
/funny/hello/there/fkljaskdjfl
/funny/hello/there/kfjasdflas
/funny/hello/there/kdfhakjasa
You can use this grep command:
grep -o "/funny/hello/there/[^'\"[:blank:]]*" page.html
However, one should avoid parsing HTML with shell utilities and use a dedicated HTML DOM parser instead.
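To see the grep work end to end, here is a minimal self-contained demo; the sample HTML and the hrefs in it are invented for illustration (only the path prefix comes from the question):

```shell
# Invented sample resembling the page; the hrefs are made up.
cat > page.html <<'EOF'
<td data-title="Blah" class="Blah" >
<a href="/funny/hello/there/fkljaskdjfl">fdsksldjfah</a>
</td>
<a href='/funny/hello/there/kfjasdflas'>other link</a>
EOF

# -o prints only the matched text; the bracket expression stops the match at a
# single quote, double quote, or blank, so the closing attribute quote is excluded.
grep -o "/funny/hello/there/[^'\"[:blank:]]*" page.html
# Prints:
# /funny/hello/there/fkljaskdjfl
# /funny/hello/there/kfjasdflas
```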
I am trying to use a list that looks like this:
List file:
1mAF
2mAF
4mAF
7mAF
9mAF
10mAF
11mAF
13mAF
18mAF
27mAF
33mAF
36mAF
37mAF
38mAF
39mAF
40mAF
41mAF
45mAF
46mAF
47mAF
49mAF
57mAF
58mAF
60mAF
61mAF
62mAF
63mAF
64mAF
67mAF
82mAF
86mAF
87mAF
95mAF
96mAF
to grab lines that contain a word-level match from a tab-delimited file that looks like this:
File_of_interest:
11mAF-NODE_111-g7687-JEFF-tig00000037_arrow-g7396-AFLA_058530 11mAF cluster63
17mAF-NODE_343-g9350 17mAF cluster07
18mAF-NODE_34-g3647-JEFF-tig00000037_arrow-g7396-AFLA_058530 18mAF cluster20
22mAF-NODE_36-g3735 22mAF cluster28
36mAF-NODE_107-g7427 36mAF cluster77
45mAF-NODE_151-g9067 45mAF cluster14
47mAF-NODE_30-g3242-JEFF-tig00000037_arrow-g7396-AFLA_058530 47mAF cluster21
67mAF-NODE_54-g4372 67mAF cluster06
69mAF-NODE_27-g2754 69mAF cluster39
71mAF-NODE_44-g4178 71mAF cluster25
73mAF-NODE_47-g4895 73mAF cluster57
78mAF-NODE_4-g688 78mAF cluster53
but when I do grep -w -f list file_of_interest, these are the only ones I get:
18mAF-NODE_34-g3647-JEFF-tig00000037_arrow-g7396-AFLA_058530 18mAF cluster20
36mAF-NODE_107-g7427 36mAF cluster77
45mAF-NODE_151-g9067 45mAF cluster14
and this misses a bunch of the values that are in the original list. For example, note that "67mAF" is in the list and in the file, but it isn't returned.
I have tried removing everything after "mAF" in the list and trying again -- no change. I have rewritten the list in a completely new file, to no avail. Oddly, I get more of them if I "sort" the list into a new file and then do the grep, but I still don't get all of them. I have also removed all invisible characters using sed (sed $'s/[^[:print:]\t]//g'). No change.
I am on OSX and both files were created on OSX, but normally grep -f -w works in the fashion I'm describing above.
I am completely flummoxed. I thought grep -w -f would look for all word-level matches of items in the file in the target file... am I wrong?
Thanks!
My guess is that at least one of these files originates from a Windows machine and has CRLF line endings. file(1) can tell you. If that is the case, do:
fromdos FILE
or, alternatively:
dos2unix FILE
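To make the failure mode concrete, here is a sketch with invented file contents; tr -d '\r' does the same job as fromdos/dos2unix if neither tool is installed:

```shell
printf '67mAF\r\n18mAF\r\n' > list                 # CRLF endings, as written on Windows
printf '67mAF-NODE_54-g4372 67mAF cluster06\n' > file_of_interest

grep -c -w -f list file_of_interest || true        # prints 0: each pattern ends in a stray \r
tr -d '\r' < list > list.unix                      # strip the carriage returns
grep -c -w -f list.unix file_of_interest           # prints 1
```

This also explains why sorting seemed to help sometimes: any pattern that happened to lose its carriage return would start matching again.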
I have the following Linux command which I am using to extract data from one very large log file.
sed -n "/<trade>/,/<\/trade>/p" Large.log > output.xml
However, the output is generated in a single file, output.xml. My intention is to create a new file every time "/<trade>/,/<\/trade>/p" is matched. Each new file should be named after the <id> tag which is inside the <trade> </trade> tags.
Something like this...
sed -n "/<trade>/,/<\/trade>/p" Large.log > "/<id>/,/<\/id>/p".xml
However, that, of course, does not work and I am not sure how to apply a regex as a naming rule.
P.S. At this point, I am also not sure whether I should use sed, or whether I should try achieving this with awk.
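sed isn't a comfortable fit for per-match output files; awk handles it more naturally. A sketch, assuming <trade>, <id>, and </trade> each sit on their own line and every trade block contains exactly one <id> (the sample log below is invented):

```shell
# Invented stand-in for Large.log.
cat > Large.log <<'EOF'
noise before
<trade>
<id>4711</id>
<price>5</price>
</trade>
noise between
<trade>
<id>4712</id>
</trade>
EOF

awk '
  /<trade>/          { buf = ""; intrade = 1 }   # start collecting a new block
  intrade            { buf = buf $0 ORS }        # accumulate the current line
  intrade && /<id>/  {                           # remember the id for the file name
      id = $0
      gsub(/.*<id>|<\/id>.*/, "", id)
  }
  intrade && /<\/trade>/ {
      printf "%s", buf > (id ".xml")             # write the block to <id>.xml
      close(id ".xml")
      intrade = 0
  }
' Large.log

ls *.xml   # now 4711.xml and 4712.xml exist
```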
I'm trying to make a script in bash that locates URLs in a text file (example.com, example.eu, etc.) and copies them over to another text file using egrep. My current output gives me the URLs that I want, but unfortunately also a lot more that I don't want, such as 123.123 or example.3xx.
My script currently looks like this:
egrep -o '\w*\.[^\d\s]\w{2,3}\b' trace.txt > url.txt
I tried using some regex checker sites, but the regex on the site gives me more of a correct answer than my own results.
Any help is appreciated.
If you know the domain suffixes, you can use a regex that looks for *.(com|eu|org)
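For example, with an invented trace.txt and an assumed whitelist of suffixes (extend the alternation as needed):

```shell
# Invented input containing both wanted and unwanted matches.
printf 'visit example.com or 123.123 or example.3xx or sub.example.eu\n' > trace.txt

# One or more dot-separated labels, then a whitelisted suffix at a word boundary.
grep -Eo '([[:alnum:]_-]+\.)+(com|eu|org)\b' trace.txt
# Prints:
# example.com
# sub.example.eu
```

123.123 and example.3xx are rejected because their last label is not in the whitelist.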
Based on https://stackoverflow.com/a/2183140/939457 (and https://www.rfc-editor.org/rfc/rfc2181#section-11), a domain name is a series of dot-separated labels, each of which can contain any character except a dot. Since you want only valid TLDs, you can use https://data.iana.org/TLD/tlds-alpha-by-domain.txt to generate a list of patterns:
grep -i -E -f <(curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt | sed 's/^/([^.]{1,63}\\\.){1,4}/') <<'EOF'
aaa.ali.bab.yandex
fsfdsa.d.s
alpha flkafj
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
EOF
Result:
aaa.ali.bab.yandex
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
Note: this is a memory killer; the above example took 2 GB. The list of TLDs is huge, so you might consider searching for a list of commonly used TLDs and using that instead.
Apologies for a seemingly inane question, but I have spent the whole day trying to figure it out and it drives me up the wall. I'm trying to write a seemingly simple bash script that would take a list of files in the directory from ls, replace part of the file names using sed, get the unique names from the list, and pass them on to some command. Like so:
inputs=`ls *.ext`
echo $inputs
test1_R1.ext test1_R2.ext test2_R1.ext test2_R2.ext
Now I would like to put it through sed to replace 1.ext and 2.ext with * to get test1_R* etc. Then I'd like to remove the resulting duplicates by running sort -u to arrive at the following $outputs variable:
echo $outputs
test1_R* test2_R*
And pass this on to a command, like so:
cat $outputs
I can do something like this in a command line:
ls *.ext | sed s/..ext/\*/g | sort -u
But if I try to assign the above to a variable in the script, it just returns the output from the ls. I have tried several ways to do it: including the whole pipe in the script; running each command separately, assigning it to a variable, then passing that variable to the next command; and writing the outputs to files, then passing the file to the next command. But so far none of this achieved what I aimed for. I think my problem lies (apart from general cluelessness around bash scripting) in an inability to run sed on a variable within the script. There seems to be a lot of advice around on how to pass variables to the pattern or replacement string in sed, but it all seems to take files as input. I understand that this might not be the proper way of doing it anyway, so I would really appreciate it if someone could suggest an elegant way to achieve what I'm trying to do.
Many thanks!
Update 2/06/2014
Hi Barmar, thanks for your answer. Can't say it solved the problem, but it helped pin-pointing it. Seems like the problem is in me using the asterisk. I have to say, I'm very puzzled. The actual file names I've got are:
test1_R1.fastq.gz test1_R2.fastq.gz test2_R1.fastq.gz test2_R2.fastq.gz
If I'm using the code you suggested, which seems to me the right way do to it:
ins=$(ls *.fastq.gz | sed 's/..fastq.gz/\*/g' | sort -u)
Sed doesn't seem to do anything and I'm getting the output of ls:
test1_R1.fastq.gz test1_R2.fastq.gz test2_R1.fastq.gz test2_R2.fastq.gz
Now if I replace that backslash with anything else, sed works, but it also returns whatever character I put in front of (or after) the asterisk:
ins=$(ls *.fastq.gz | sed 's/..fastq.gz/"*/g' | sort -u)
test1_R"* test2_R"*
That's odd enough, but surely I can just put an "R" in front of the asterisk and then include the R in the search pattern string, right? Wrong! If I do that whichever way: 's/R..fastq.gz/R*/g' 's/...fastq.gz/R*/g' 's/[A-Z]..fastq.gz/R*/g' I'm back to the original names! And even if I end up with something like test1_RR* test2_RR* and try to run it through sed again and replace "_R" with "_" or "RR" with "R", I have no luck and I'm back to the original names. And yet I can replace the rest of the file name no problem, just not in a way that gets me the test1_R* I need.
I have a feeling I should be escaping that * in some very clever way, but nothing I've tried seems to work. Thanks again for your help!
This is how you capture the result of the whole pipeline in a variable:
var=$(ls *.ext | sed s/..ext/\*/g | sort -u)
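The surprises described in the update are almost certainly glob expansion rather than sed: once the variable holds test1_R*, an unquoted $ins lets the shell expand those patterns right back into the matching file names, which looks exactly like "sed did nothing". Quoting the variable prevents the expansion. A sketch with made-up files:

```shell
touch test1_R1.fastq.gz test1_R2.fastq.gz test2_R1.fastq.gz test2_R2.fastq.gz

ins=$(ls *.fastq.gz | sed 's/..fastq.gz/*/' | sort -u)

echo "$ins"   # quoted: prints the patterns
# test1_R*
# test2_R*

echo $ins     # unquoted: the shell expands the patterns back to the file names
# test1_R1.fastq.gz test1_R2.fastq.gz test2_R1.fastq.gz test2_R2.fastq.gz
```

So when the variable is passed on to a command like cat, whether it is quoted decides whether the command sees two glob patterns or four file names.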
I have an xml file which has the following structure that contains numerous <Episodes></Episodes> to which the structure looks like this:
<Episode>
<id>4195462</id>
<Combined_episodenumber>8</Combined_episodenumber>
<Combined_season>2</Combined_season>
<DVD_chapter></DVD_chapter>
<DVD_discid></DVD_discid>
<DVD_episodenumber></DVD_episodenumber>
<DVD_season></DVD_season>
<Director>Jay Karas</Director>
<EpImgFlag>2</EpImgFlag>
<EpisodeName>Karl's Wedding</EpisodeName>
<EpisodeNumber>8</EpisodeNumber>
<FirstAired>2011-11-08</FirstAired>
<GuestStars>Katee Sackhoff|Carla Gallo</GuestStars>
<IMDB_ID></IMDB_ID>
<Language>en</Language>
<Overview>Karl Hevacheck, aka the Human Genius, gets married.</Overview>
<ProductionCode>209</ProductionCode>
<Rating>7.6</Rating>
<RatingCount>20</RatingCount>
<SeasonNumber>2</SeasonNumber>
<Writer>Kevin Etten</Writer>
<absolute_number></absolute_number>
<filename>episodes/211751/4195462.jpg</filename>
<lastupdated>1362547148</lastupdated>
<seasonid>471254</seasonid>
<seriesid>211751</seriesid>
</Episode>
I've figured out how to pull the information between a single tag like so
value=$(grep -m 1 "<Rating>" path_to_file | sed 's/<.*>\(.*\)<\/.*>/\1/')
but I can't find a way to verify that I am looking at the correct episode, i.e. to check that this is the correct branch, the one with <Combined_season>2</Combined_season> and <EpisodeNumber>8</EpisodeNumber>, before saving the values for specific attributes. I know this can somehow be done using a combination of sed and awk but can't seem to figure it out. Any help on how I can do this would be greatly appreciated.
Use a proper XML parser, not sed or awk. You can still call the XML parser from your bash script just as you would sed or awk. It's a bad idea to use sed or awk because XML is a structured format, while sed and awk typically work with line-oriented files. You will just give yourself a headache by using the wrong tool for the job. I suggest using a dedicated tool, or a language such as PHP, Python or Perl (or any other language not starting with p) that has libraries for parsing XML.
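As a concrete illustration, xmllint (part of libxml2) can do the branch check with a single XPath expression; the sample below invents a <Data> root element around two episodes, which is an assumption about the real file:

```shell
# Invented sample; the <Data> root is an assumption.
cat > episodes.xml <<'EOF'
<Data>
<Episode>
<id>1</id>
<Combined_season>1</Combined_season>
<EpisodeNumber>8</EpisodeNumber>
<Rating>5.0</Rating>
</Episode>
<Episode>
<id>4195462</id>
<Combined_season>2</Combined_season>
<EpisodeNumber>8</EpisodeNumber>
<Rating>7.6</Rating>
</Episode>
</Data>
EOF

# XPath selects the Episode whose season and number both match, then takes its Rating.
xmllint --xpath 'string(//Episode[Combined_season="2" and EpisodeNumber="8"]/Rating)' episodes.xml
```

The predicate in square brackets replaces all the grep/sed branch-checking: only the episode where both conditions hold is selected.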