Extracting data from a txt file? - DOS

Extract data from a text file, the file consists of the following, say:
<img src="a.jpg" alt="abc" height="12px" width="12px">
<div class="ab3" id="1122">
<img src="b.jpg" alt="abc" height="12px" width="12px">
<div class=cd5" id="9876">
I want to extract the "id" value from the above shown text file...
the output should be:
1122
9876
I tried using findstr, find, etc. (DOS commands), but I was not able to find the right regular expression for this.
Is there any other way? Any help?

I agree with @izogfif, you should consider some other tools for this task.
But, to answer what you asked for, I got this regex:
id="[0-9]+"
It will give you output like this:
id="1122"
id="9876"
From there you can save those results (or use a pipe, however you do that in DOS), and then this regex:
[0-9]*
Will give you this output:
1122
9876
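For example, a rough and untested batch sketch (assuming the text is in input.txt) could combine findstr with a for /f loop to print just the numbers:
@echo off
rem Sketch only: findstr keeps lines containing " id=" followed by a digit,
rem then for /f splits each matching line on = and > so that the third token
rem is the quoted id value; the ~ modifier strips the surrounding quotes.
for /f "tokens=3 delims==>" %%A in ('findstr /r /c:" id=.[0-9]" input.txt') do echo %%~A
On a file like the one above this should print 1122 and 9876, one per line.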

Use the following code:
( id=")[^"]*"
This will match any id attribute's value.
You can replace id with any attribute you are searching for.
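If GNU grep is available (an assumption beyond plain DOS, e.g. via Git Bash or Cygwin), the same capture idea can print just the value directly:
grep -oP ' id="\K[^"]*' input.txt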

Related

How to delete a part of a string between two patterns including one of them

I have a CSV document like this:
RL|S1|C19.concoct_part_0 concoct.26
RL|S1|C26.concoct_part_4 concoct.7
RL|S1|C26.concoct_part_5 concoct.7
I want it to be like this:
RL|S1|C19 concoct.26
RL|S1|C26 concoct.7
RL|S1|C26 concoct.7
How do I do it in Vim?
Thanks to @oguz ismail, the solution is the following:
:%s/\.[^\t]*//
Using command line mode you can run:
:%norm f.dt<space><enter>
Note: you must type an actual space and press Enter there, rather than writing the literal text <space> and <enter>.
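If you ever need the same cleanup outside Vim, say in a shell script, a roughly equivalent sed one-liner (assuming the part to drop never contains whitespace) would be:
sed 's/\.[^[:space:]]*//' input.csv
With GNU sed, add -i to edit the file in place.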

Filter CSV lines

I have a huge CSV file (4 GB) and I need to filter the lines that contain specific strings. For example, I need the lines that contain the text "McDonalds", "Burger King", or "KFC".
I need to match multiple strings, like an OR.
Something like:
array_of_names = ["McDonalds", "Burger King", "KFC"]
foreach line in csv
  if line.contains_any_of(array_of_names)
    output << line
  end
end
I think that I can do something with grep but I honestly don't have an idea. I guess I need a shell script.
Can anyone help me?
You could do it with grep like this:
grep "McDonalds\|Burger King\|KFC" your_file.csv

Search in a webpage using bash

I am trying to retrieve a webpage, search it for some pattern, retrieve that value, and do some calculations with it. My problem is, I can't seem to figure out how to search for the pattern in a given string.
Let's say I retrieve a page like this:
content=$(curl -L http://google.com)
Now I want to search for a value I'm interested in, which is basically an HTML tag:
<div class="digits">123,456,789</div>
Now I did try to find this by using sed. My attempt looked like this:
n=$(echo "$content"|sed '<div class=\"digits\">(\\d\\d,\\d\\d\\d,\\d\\d\\d)</div>')
I want to pull that value every, let's say, 10 minutes, save it, and estimate when 124,xxx,xxx will be reached.
My problem is I don't really know how to save those values, but I think I can figure that out on my own. I'm more interested in how to retrieve that substring, as I always get an error because of the "<".
I hope someone is able and willing to help me :)
Better to use a proper parser with XPath:
xmllint --html --xpath '//*[@class="digits"]' http://domain.tld/
But it seems that the example URL you gave in the comments doesn't contain this class name. You can check that by first running:
curl -Ls url | grep -oP '<div\s+class="digits">\K[^<]+'
It's best to use a proper parser as @sputnick suggested.
Or you can try something like this:
curl -L url | perl -ne 'print "$1\n" if /<div class="digits">([\d,]+)<\/div>/'
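As for saving the values over time, a minimal polling sketch (the URL and file names are placeholders) could fetch the page every 10 minutes and append a timestamped record, which you can later use for the estimate:
#!/usr/bin/env bash
# Minimal sketch: fetch the page, extract the digits, log them with a timestamp.
url="http://example.com/page"   # placeholder, replace with the real page
log="digits.log"
while true; do
  value=$(curl -Ls "$url" | grep -oP '<div\s+class="digits">\K[^<]+')
  printf '%s\t%s\n' "$(date +%s)" "$value" >> "$log"
  sleep 600   # 10 minutes
done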

Remove all lines from a given text file based on a given list of IDs

I have a list of IDs like so:
11002
10995
48981
And a tab delimited file like so:
11002 Bacteria;
10995 Metazoa
I am trying to delete all lines in the tab delimited file containing one of the IDs from the ID list file. For some reason the following won't work and just returns the same complete tab delimited file without any line removed whatsoever:
grep -v -f ID_file.txt tabdelimited_file.txt > New_tabdelimited_file.txt
I also tried numerous other combinations with grep, but currently I'm drawing a blank here.
Any idea why this is failing?
Any help would be greatly appreciated
Since you tagged this with awk, here is one way of doing it:
awk 'BEGIN{FS=OFS="\t"}NR==FNR{ids[$1]++;next}!($1 in ids)' idFile tabFile > new_tabFile
BTW, your grep command is correct. Just double-check that your files are not formatted with Windows (CRLF) line endings.
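One way to check for and strip Windows line endings, if that turns out to be the problem (dos2unix would work just as well):
file ID_file.txt tabdelimited_file.txt    # reports "CRLF line terminators" if present
tr -d '\r' < ID_file.txt > ids.unix.txt
tr -d '\r' < tabdelimited_file.txt > tab.unix.txt
grep -v -f ids.unix.txt tab.unix.txt > New_tabdelimited_file.txt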

Using bash in order to extract data from an HTML forum list

I'm looking to create a quick script, but I've run into some issues.
<li type="square"> Y </li>
I'm basically using wget to download an HTML file, and then trying to search the file for the above snippet. Y is dynamic and changes each time, so in one it might be "Dave", and in another "Chris". So I'm trying to get the bash script to find
<li type="square"> </li>
and tell me what is in between the two. The general formatting of the file is very messy:
<html stuff tags><li type="square">Dave</li><more html stuff>
<br/><html stuff>
<br/><br/><li type="square">Chris</li><more html stuff><br/>
I've been unable to come up with anything that works for parsing the file, and would really appreciate someone giving me a push in the right direction.
EDIT -
<div class="post">
<hr class="hrcolor" width="100%" size="1" />
<div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
</div>
is the block of code that I'm looking to extract the names from. The "-" symbol is something added onto the list to minimize its scope, so I just get that list. The problem I'm having is that:
awk '{print $2}' FS='(<[^>]*>)+-' 4287022.html > output.txt
only outputs the first list item, and not the rest.
You generally should not use regex to parse html files.
Instead you can use my Xidel to perform pattern matching on it:
xidel 4287022.html -e '<li type="square">{.}</li>*'
Or with traditional XPath:
xidel 4287022.html -e '//li[@type="square"]'
You could use grep -Eo "<li type=\"square\">-?(\w+)</li>" ./* for this.
Using sed:
sed -n 's/.*<li type="square"> *\([^<]*\).*/\1/p' input.html
awk '{print $2,$3,$4,$5}' FS='(<[^>]*>)+' 4287022.html
This presents the HTML page as a table. However, instead of runs of whitespace acting as the field separator, runs of HTML tags are the field separator. The first field in this case is the empty string at the beginning of the line; fields two through five are the names, so we print those.
Result
-dave -chris -sarah -amber
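If the number of list items varies from page to page, a variation on the same idea (relying on the "-" marker you added) could loop over all fields instead of hard-coding $2 through $5:
awk 'BEGIN{FS="(<[^>]*>)+"} {for (i = 1; i <= NF; i++) if ($i ~ /^-/) print $i}' 4287022.html
This prints each marked name on its own line.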
