I want to extract a string from a text file, MODIS_list.txt, downloaded with:
wget https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/6/MOD09GA/2018/062/ -O MODIS_list.txt
then extract the name of the MODIS file:
less MODIS_list.txt | grep -o -P '(?<=hdf">).*(?<=(MOD09GA.A2018062.h18v04.006)).*(?=</a>)'
which gives as output
MOD09GA.A2018062.h18v04.006.2018064030133.hdf
Let's say I would like to loop over more files, changing, for example, the date or the product:
prod_var=MOD09GA
prod_date=2018062
How can I insert these two variables into the grep command?
I tried the following syntax, but it does not work:
less MODIS_list.txt | grep -o -P '(?<=hdf">).*(?<=($prod_var.A$prod_date.h18v04.006)).*(?=</a>)'
Instead of using a monster regex, I suggest converting your HTML file into XML and selecting the node you want with an XPath expression, as follows:
tidy -q -f /dev/null -asxml --numeric-entities yes MODIS_list.txt | /usr/bin/xpath -q -e "//a[contains(@href,'$prod_var.A$prod_date.h18v04.006.2018064030133.hdf')]/text()"
The command you want to execute is:
grep -o -P "(?<=hdf\\\">).*(?<=($prod_var.A$prod_date.h18v04.006)).*(?=</a>)" MODIS_list.txt
As wolfrevokcats' comment says, you have to change the single quotes into double quotes so the shell expands the variables. The remaining problem is the quote character after the string hdf, which has to be escaped twice: once for the shell and once for grep. Another solution, which avoids escaping the quote to the right of 'hdf' altogether, is to use a '.' (match any single character) in its place:
grep -o -P "(?<=hdf.>).*(?<=($prod_var.A$prod_date.h18v04.006)).*(?=</a>)" MODIS_list.txt
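To come back to the looping goal: once the pattern is in double quotes, you can iterate over products and dates. A minimal sketch (the second product and the date list are made-up example values; the tile h18v04 and collection 006 come from the question):
#!/bin/bash
# Example values only -- substitute your own products and dates
for prod_var in MOD09GA MYD09GA; do
  for prod_date in 2018062 2018063; do
    grep -o -P "(?<=hdf.>).*(?<=($prod_var.A$prod_date.h18v04.006)).*(?=</a>)" MODIS_list.txt
  done
done
In practice you would also re-download the directory listing for each date, since each MODIS_list.txt only covers one day.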
While grepping, you may concatenate a constant string and a variable.
Example:
Dumpy:~ admin$ cat /tmp/file.txt
user is john
user is pol
user is bob
user is mark
user is mike
Dumpy:~ admin$ export usrname='john'
Dumpy:~ admin$ grep --color 'user is '$usrname /tmp/file.txt
user is john
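An equivalent form puts the whole pattern in double quotes, which also keeps the match intact if the variable ever contains spaces:
grep --color "user is $usrname" /tmp/file.txt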
I have a bash variable which has the following content:
SSH exit status 255 for i-12hfhf578568tn
i-12hdfghf578568tn is able to connect
i-13456tg is not able to connect
SSH exit status 255 for 1.2.3.4
I want to search for the strings starting with i- and then extract only that instance ID. So, for the above input, I want output like below:
i-12hfhf578568tn
i-12hdfghf578568tn
i-13456tg
I am open to use grep, awk, sed.
I am trying to achieve this with the following command, but it gives me the whole line:
grep -oE 'i-.*'<<<$variable
Any help?
You can just change your grep command to:
grep -oP 'i-[^\s]*' <<<$variable
Tested on your input:
$ cat test
SSH exit status 255 for i-12hfhf578568tn
i-12hdfghf578568tn is able to connect
i-13456tg is not able to connect
SSH exit status 255 for 1.2.3.4
$ var=`cat test`
$ grep -oP 'i-[^\s]*' <<<$var
i-12hfhf578568tn
i-12hdfghf578568tn
i-13456tg
grep is exactly what you need for this task; sed would be more suitable if you had to reformat the input, and awk would shine if you had to reformat a string or compute something from fields in the rows and columns.
Explanation:
-P is to use Perl-compatible regex (PCRE)
i-[^\s]* is a regex that matches a literal i- followed by zero or more non-space characters. You could change the * to a + to require at least one character after the -, or use the {min,max} syntax to impose a range, as in the example below.
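For instance, to require between 5 and 20 characters after the i- (the bounds here are arbitrary):
grep -oP 'i-\S{5,20}' <<<$var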
Let me know if there is something unclear.
Bonus:
Following Sundeep's comment, you can use one of these improved versions of the regex (the first uses PCRE, the second POSIX regex):
grep -oP 'i-\S*' <<<$var
or
grep -o 'i-[^[:blank:]]*' <<<$var
You could use the following too (I tested it with GNU awk). Setting RS to match a space or a newline splits the input into one token per record, and /^i-/ prints the records that start with i-:
echo "$var" | awk -v RS='[ \n]' '/^i-/'
You can also use this (tested on Unix):
echo "$var" | grep -o "i-[0-z]*"
Here,
-o # prints only the matching part of the lines
i-[0-z]* # matches i- followed by any run of characters in the ASCII range 0 through z, which covers digits and letters (plus a handful of punctuation characters that fall between them)
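Since the question allows sed as well, here is a hedged sed equivalent (GNU sed, since \+ is a GNU extension) that prints only the captured ID from lines that contain one:
sed -n 's/.*\(i-[0-9a-z]\+\).*/\1/p' <<<"$var"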
Using wget, a webpage is downloaded as a .txt file. The saved file is named using part of the URL of the webpage, e.g. wget http://www.example.com/page/12345/ -O 12345.txt, for convenience.
I am running the commands from a shell script .sh file, as it can execute multiple commands, one line at a time, e.g.
After a file is downloaded, I use sed to parse for the text/characters I want to keep. Part of the text I want includes blah blah Product ID a5678.
What I want is to use sed to find a5678 and use it to rename the file 12345.txt to a5678.txt.
# script.sh
wget http://www.example.com/page/12345/ -O 12345.txt
sed -i '' 's/pattern/replace/g' 12345.txt
# sed command to find a5678 in the line 'blah blah Product ID a5678'
# some more sed commands
mv 12345.txt a5678.txt   # (or use a variable, $var.txt?)
How do I do this?
I may also want to use this same ID a5678 to create a folder with the same name, a5678, so that the .txt file ends up inside the folder, like /a5678/a5678.txt.
mkdir a5678 && cd a5678   # (or mkdir $var?)
I've searched for answers for half a day but can't find any. The closest I found is Find instance of word in files and change it to the filename, but it is the exact opposite of what I want. I've also thought about using variables, e.g. https://askubuntu.com/questions/76808/how-do-i-use-variables-in-a-sed-command, but I don't know how to save the found characters into a variable.
I very much look forward to some help! Thank you! I am on a Mac running Sierra.
Trying to minimize, so fit this into your logic.
in=12345.txt
out=$( grep ' Product ID ' "$in" | sed 's/.* Product ID \([^ ]*\).*/\1/' )
mkdir -p "$out"
mv "$in" "$out/$out.txt"
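Put back into the script from the question, a minimal sketch could look like this (the URL and the 'Product ID' line are the question's placeholders):
#!/bin/bash
in=12345
wget "http://www.example.com/page/$in/" -O "$in.txt"
# grab the ID that follows 'Product ID' in the downloaded page
out=$( grep ' Product ID ' "$in.txt" | sed 's/.* Product ID \([^ ]*\).*/\1/' )
mkdir -p "$out"
mv "$in.txt" "$out/$out.txt"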
Thank you all! With your inspiration, I solved my problem (without using grep) like this:
in=12345
out=$(sed -n '/pattern/ s/.*ID *//p' $in.txt)
mv $in.txt $out.txt
cd ..
mv $in $out
I can't read or apply other commands like cat or strings to .txt files because they are not allowed. I need to read a file named flag.txt, but that name is also on the blacklist. So, is there any way to read *.txt using the head command? The head command is allowed.
blacklist=\
'flag\|<\|$\|"\|'"'"'\|'\
'cat\|tac\|*\|?\|less\|more\|pico\|nano\|edit\|hexdump\|xxd\|'\
'sed\|tail\|diff\|grep\|paste\|strings\|bas64\|sort\|uniq\|cut\|awk\|'\
'bzip\|gzip\|xz\|tar\|ar\|'\
'mv\|cp\|ln\|nl\|'\
'python\|perl\|sh\|cc\|g++\|php\|hd\|g++\|gcc\|curl\|tcp\|udp\|'\
'scp\|sftp\|wget\|nc\|netcat'
Thanks
Do you want some alternative to the command head *.txt? If so, ls/find and xargs will help, but they cannot single out .txt files; they will read every file in the directory.
ls -1 | xargs head
You can use the ` (backtick) in the following way:
head `ls -1`
Backtick has a very special meaning: everything you type between backticks is evaluated (executed) by the shell before the main command.
So the command will do the following:
`ls -1` - will produce the file names
head - will show the start of each file listed by ls -1
More info about backtick can be found in this answer
If you need a glob that matches flag.txt but can use neither * nor the string flag, you can use fl[a]g.txt instead. Then, to print the entire file using head, use -c and pass it the size of the file:
head -c $(stat -c '%s' fl[a]g.txt) fl[a]g.txt
Another approach would be to use the shell to read the file:
while IFS= read -r c; do echo "$c"; done < fl[a]g.txt
You could also just use paste:
paste fl[a]g.txt
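Note that the blacklist above also forbids paste, $, <, and single quotes, which (depending on how the filter is applied) would block all three of these suggestions. If the file is short, a workaround using only non-blacklisted characters is simply to ask head for more lines than the file has:
head -n 100 fl[a]g.txt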
I have created a small program consisting of a couple of shell scripts that work together; it is almost finished and everything seems to work fine, except for one thing that I'm not really sure how to do and that I need in order to finish this project. There seem to be many routes that can be taken, but I just can't get there.
I have some curl results with lots of unused data, including different links, and among all that data there is a bunch of similar links.
I only need to get (into a variable) the link with the highest number (without the always-same text).
The links are all similar and have this structure:
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
I was thinking about something like:
content="$(curl -s "$url/$param")"
# pseudocode: fill linksArray with every link from $content whose text is "always same text"
linksArray=( ... )
highestnumber=0
for file in "${linksArray[@]}"
do
  href=${file##*/}
  fullname=${href%.html}
  OIFS="$IFS"
  IFS='_'
  read -a nameparts <<< "${fullname}"
  IFS="$OIFS"
  if (( nameparts[1] > highestnumber ))
  then
    highestnumber=${nameparts[1]}
  fi
done
echo "${nameparts[0]}_${highestnumber}.html"
result:
https://always/same/link/unique-name_19.html
This was just my guess; any working code that can be run from a bash script is fine...
thanks...
Update
I found this nice program, Xidel; it is easily installed by:
# 64bit version
wget -O xidel/xidel_0.9-1_amd64.deb https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9/xidel_0.9-1_amd64.deb/download
apt-get -y install libopenssl
apt-get -y install libssl-dev
apt-get -y install libcrypto++9
dpkg -i xidel/xidel_0.9-1_amd64.deb
It looks awesome, but I'm not really sure how to tweak it to my needs.
Based on that link and the answer below, I guess a possible solution would be:
use Xidel, or use sed -n 's/.*href="\([^"]*\)".*/\1/p' file as suggested in this link, but tweak it to keep the link together with its HTML tags, like:
<a href="https://always/same/link/same-name_17.html">always same text</a>
then filter out everything that doesn't end with ">always same text</a>",
and then use the grep/sort approach mentioned below.
Continuing from the comment, you can use grep, sort and tail to isolate the highest number from your list of similar links without too much trouble. For example, if your list of links is as you have described (I've saved them in a file dat/links.txt for the purpose of the example), you can easily isolate the highest number in a variable:
Example List
$ cat dat/links.txt
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
Parsing the Highest Numbered Link
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort | tail -n1); \
echo "myvar : '$myvar'"
myvar : 'https://always/same/link/same-name_19.html'
(note: the command above is all one line, separated by the line-continuation '\')
Applying Directly to Results of curl
Whether your list is in a file, or returned by curl -s, you can apply the same approach to isolate the highest number link in the returned list. You can use process substitution with the curl command alone, or you can pipe the results to grep. E.g. as noted in my original comment,
$ myvar=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1); \
echo "myvar : '$myvar'"
or pipe the result of curl to grep,
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"
(same line continuation note.)
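One caveat: plain sort compares lexicographically, so same-name_9.html would sort after same-name_19.html and wrongly win the tail -n1. If the numbers can vary in width and GNU coreutils is available, sort -V (version sort) compares the embedded numbers numerically:
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort -V | tail -n1)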
Why not use Xidel with XQuery to sort the links and return the last one?
xidel -q links.txt --xquery '(for $i in //@href order by $i return $i)[last()]' --input-format xml
(Note the single quotes, so the shell does not expand $i.) The --input-format parameter makes sure you don't need any HTML tags at the start and end of your txt file.
If I'm not mistaken, in the latest Xidel the -q (quiet) param is replaced by -s (silent).
I have this bash script and it works:
DIRECTORY='1.20_TRUNK/mips-tuxbox-oe1.6'
# Download the HTML page and save it as ump.tmp
wget -O 'ump.tmp' "http://download.oscam.cc/index.php?&direction=0&order=mod&directory=$DIRECTORY&"
ft='index.php?action=downloadfile&filename=oscam-svn'
st="-webif-Distribution.tar.gz&directory=$DIRECTORY&"
The file ump.tmp contains e.g. three links.
I need to find only the number 10082 in the first "a" link of the page, but this number changes: run the script a month later, e.g., and it may be different.
I do not have the "cat" command; I have a receiver, not Linux. The receiver runs the Enigma system and "cat" isn't implemented.
I tried comparing with "sed", but it does not work:
sed -n "/filename=oscam-svn/,/-mips-tuxbox-webif/p" ump.tmp
Using a proper XHTML parser:
$ xmllint --html --xpath '//a/@href[contains(., "downloadfile")]' ump.tmp 2>/dev/null |
grep -oP "oscam-svn\K\d+"
But that string is not in the given HTML file.
"Find" is kind of vague, but you can use grep to get the link with the number 10082 in it from the temp file.
$ grep "10082" ump.tmp
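Since the number changes from run to run, you probably don't want to hard-code 10082. A hedged sketch (assuming the receiver's grep supports -o) that captures whatever revision number follows oscam-svn in the first matching link:
rev=$(grep -o 'oscam-svn[0-9][0-9]*' ump.tmp | head -n 1 | grep -o '[0-9][0-9]*')
echo "$rev"   # e.g. 10082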