bash script grep using variable fails to find result that actually does exist - bash

I have a bash script that iterates over a list of links, curl's down an html page per link, greps for a particular string format (syntax is: CVE-####-####), removes the surrounding html tags (this is a consistent format, no special case handling necessary), searches a changelog file for the resulting string ID, and finally does stuff based on whether the string ID was found or not.
The found string ID is set as a variable. The issue is that when grepping for the variable there are no results, even though I positively know there should be for some of the ID's. Here is the relevant portion of the script:
for link in $(cat links.txt); do
curl -s "$link" | grep 'CVE-' | sed 's/<[^>]*>//g' | while read cve; do
echo "$cve"
grep "$cve" ./changelog.txt
done
done
If I hardcode a known ID in the grep command, the script finds the ID and returns things as expected. I've tried many variations of grepping on this variable (e.g. exporting it and doing command expansion, cat'ing the changelog and piping to grep, setting variable directly via command expansion of the curl chain, single and double quotes surrounding variables, half a dozen other things).
Am I missing something nuanced with the outputted variable from the curl | grep | sed chain? When it is echo'd to stdout or >> to a file, things look fine (a single ID with no odd characters or carriage returns etc.).
Any hints or alternate solutions would be much appreciated. Thanks!
FYI:
OSX:$bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Edit:
The html file that I was curl'ing was chock full of carriage returns. Running the script with set -x was helpful because it revealed the true string being grepped: $'CVE-2011-2716\r'.
+ read -r link
+ curl -s http://localhost:8080/link1.html
+ sed -n '/CVE-/s/<[^>]*>//gp'
+ read -r cve
+ grep -q -F $'CVE-2011-2716\r' ./kernelChangelog.txt
Also investigating from another angle, opening the curled file in vim showed ^M and doing a printf %s "$cve" | xxd also showed the carriage return hex code 0d appended to the grep'd variable. Relying on 'echo' stdout was a wrong way of diagnosing things. Writing a simple html page with a valid CVE-####-####, but then adding a carriage return (in vim insert mode just type ctrl-v ctrl-m to insert the carriage return) will create a sample file that fails with the original script snippet above.
This is pretty standard string sanitization stuff that I should have figured out. The solution is to remove carriage returns, piping to tr -d '\r' is one method of doing that. I'm not sure there is a specific duplicate on SO for this series of steps, but in any case here is my now working script:
while read -r link; do
curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | tr -d '\r' | while read -r cve; do
if grep -q -F "$cve" ./changelog.txt; then
echo "FOUND: $cve";
else
echo "NOT FOUND: $cve";
fi;
done
done < links.txt

HTML files can contain carriage returns at the ends of lines, you need to filter those out.
curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | tr -d '\r' | while read cve; do
Notice that there's no need to use grep, you can use a regular expression filter in the sed command. (You can also use the tr command in sed to remove characters, but doing this for \r is cumbersome, so I piped to tr instead).

It should look like this:
# First: Care about quoting your variables!
# Use read to read the file line by line
while read -r link ; do
# No grep required. sed can do that.
curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | while read -r cve; do
echo "$cve"
# grep -F searches for fixed strings instead of patterns
grep -F "$cve" ./changelog.txt
done
done < links.txt

Related

For Loop Issues with CAT and tr

I have about 700 text files that consist of config output which uses various special characters. I am using this script to remove the special characters so I can then run a different script referencing an SED file to remove the commands that should be there leaving what should not be in the config.
I got the below from Remove all special characters and case from string in bash but am hitting a wall.
When I run the script it continues to loop and writes the script into the output file. Ideally, it just takes out the special characters and creates a new file with the updated information. I have not gotten to the point to remove the previous text file since it probably wont be needed. Any insight is greatly appreciated.
for file in *.txt for file in *.txt
do
cat * | tr -cd '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]' >> "$file" >> "$file".new_file.txt
done
A less-broken version of this might look like:
#!/usr/bin/env bash
for file in *.txt; do
[[ $file = *.new_file.txt ]] && continue ## skip files created by this same script
tr -cd '[:alnum:]\n\r' <"$file" \
| tr '[:upper:]' '[:lower:]' \
>> "$file".new_file.txt
done
Note:
We're referring to the "$file" variable being set by for.
We aren't using cat. It slows your script down with no compensating benefits whatsoever. Instead, using <"$file" redirects from the specific input file being iterated over at present.
We're skipping files that already have .new_file.txt extensions.
We only have one output redirection (to the new_file.txt version of the file; you can't safely write to the file you're using as input in the same pipeline).
Using GNU sed:
sed -i 's/[^[:alnum:]\n\r]//g;s/./\l&/g' *.txt

Getting rid of some special symbol while reading from a file

I am writing a small script which is getting some configuration options from a settings file with a certain format (option=value or option=value1 value2 ...).
settings-file:
SomeOption=asdf
IFS=HDMI1 HDMI2 VGA1 DP1
SomeOtherOption=ghjk
Script:
for VALUE in $(cat settings | grep IFS | sed 's/.*=\(.*\)/\1/'); do
echo "$VALUE"x
done
Now I get the following output:
HDMI1x
HDMI2x
VGA1x
xP1
Expected output:
HDMI1x
HDMI2x
VGA1x
DP1x
I obviously can't use the data like this since the last read entry is mangled up somehow. What is going on and how do I stop this from happening?
Regards
Generally you can use awk like this:
awk -F'[= ]' '$1=="IFS"{for(i=2;i<=NF;i++)print $i"x"}' settings
-F'[= ] splits the line by = or space. The following awk program checks if the first field, the variable name equals IFS and then iterates trough column 2 to the end and prints them.
However, in comments you said that the file is using Windows line endings. In this case you need to pre-process the file before using awk. You can use tr to remove the carriage return symbols:
tr -d '\r' settings | awk -F'[= ]' '$1=="IFS"{for(i=2;i<=NF;i++)print $i"x"}'
The reason is likely that your settings file uses DOS line endings.
Once you've fixed that (with dos2unix for example), your loop can also be modified to the following, removing two utility invocations:
for value in $( sed -n -e 's/^IFS.*=\(.*\)/\1/p' settings ); do
echo "$value"x
done
Or you can do it all in one go, removing the need to modify the settings file at all:
tr -d '\r' <settings |
for value in $( sed -n -e 's/^IFS.*=\(.*\)/\1/p' ); do
echo "$value"x
done

bash: cURL from a file, increment filename if duplicate exists

I'm trying to curl a list of URLs to aggregate the tabular data on them from a set of 7000+ URLs. The URLs are in a .txt file. My goal was to cURL each line and save them to a local folder after which I would grep and parse out the HTML tables.
Unfortunately, because of the format of the URLs in the file, duplicates exist (example.com/State/City.html. When I ran a short while loop, I got back fewer than 5500 files, so there are at least 1500 dupes in the list. As a result, I tried to grep the "/State/City.html" section of the URL and pipe it to sed to remove the / and substitute a hyphen to use with curl -O. cURL was trying to grab
Here's a sample of what I tried:
while read line
do
FILENAME=$(grep -o -E '\/[A-z]+\/[A-z]+\.htm' | sed 's/^\///' | sed 's/\//-/')
curl $line -o '$FILENAME'
done < source-url-file.txt
It feels like I'm missing something fairly straightforward. I've scanned the man page because I worried I had confused -o and -O which I used to do a lot.
When I run the loop in the terminal, the output is:
Warning: Failed to create the file State-City.htm
I think you dont need multitude seds and grep, just 1 sed should suffice
urls=$(echo -e 'example.com/s1/c1.html\nexample.com/s1/c2.html\nexample.com/s1/c1.html')
for u in $urls
do
FN=$(echo "$u" | sed -E 's/^(.*)\/([^\/]+)\/([^\/]+)$/\2-\3/')
if [[ ! -f "$FN" ]]
then
touch "$FN"
echo "$FN"
fi
done
This script should work and also take care of downloading same files multiple files.
just replace the touch command by your curl one
First: you didn't pass the url info to grep.
Second: try this line instead:
FILENAME=$(echo $line | egrep -o '\/[^\/]+\/[^\/]+\.html' | sed 's/^\///' | sed 's/\//-/')

Bash variables not acting as expected

I have a bash script which parses a file line by line, extracts the date using a cut command and then makes a folder using that date. However, it seems like my variables are not being populated properly. Do I have a syntax issue? Any help or direction to external resources is very appreciated.
#!/bin/bash
ls | grep .mp3 | cut -d '.' -f 1 > filestobemoved
cat filestobemoved | while read line
do
varYear= $line | cut -d '_' -f 3
varMonth= $line | cut -d '_' -f 4
varDay= $line | cut -d '_' -f 5
echo $varMonth
mkdir $varMonth'_'$varDay'_'$varYear
cp ./$line'.mp3' ./$varMonth'_'$varDay'_'$varYear/$line'.mp3'
done
You have many errors and non-recommended practices in your code. Try the following:
for f in *.mp3; do
f=${f%%.*}
IFS=_ read _ _ varYear varMonth varDay <<< "$f"
echo $varMonth
mkdir -p "${varMonth}_${varDay}_${varYear}"
cp "$f.mp3" "${varMonth}_${varDay}_${varYear}/$f.mp3"
done
The actual error is that you need to use command substitution. For example, instead of
varYear= $line | cut -d '_' -f 3
you need to use
varYear=$(cut -d '_' -f 3 <<< "$line")
A secondary error there is that $foo | some_command on its own line does not mean that the contents of $foo gets piped to the next command as input, but is rather executed as a command, and the output of the command is passed to the next one.
Some best practices and tips to take into account:
Use a portable shebang line - #!/usr/bin/env bash (disclaimer: That's my answer).
Don't parse ls output.
Avoid useless uses of cat.
Use More Quotes™
Don't use files for temporary storage if you can use pipes. It is literally orders of magnitude faster, and generally makes for simpler code if you want to do it properly.
If you have to use files for temporary storage, put them in the directory created by mktemp -d. Preferably add a trap to remove the temporary directory cleanly.
There's no need for a var prefix in variables.
grep searches for basic regular expressions by default, so .mp3 matches any single character followed by the literal string mp3. If you want to search for a dot, you need to either use grep -F to search for literal strings or escape the regular expression as \.mp3.
You generally want to use read -r (defined by POSIX) to treat backslashes in the input literally.

Reading a file line by line in ksh

We use some package called Autosys and there are some specific commands of this package. I have a list of variables which i like to pass in one of the Autosys commands as variables one by one.
For example one such variable is var1, using this var1 i would like to launch a command something like this
autosys_showJobHistory.sh var1
Now when I launch the below written command, it gives me the desired output.
echo "var1" | while read line; do autosys_showJobHistory.sh $line | grep 1[1..6]:[0..9][0..9] | grep 24.12.2012 | tail -1 ; done
But if i put the var1 in a file say Test.txt and launch the same command using cat, it gives me nothing. I have the impression that command autosys_showJobHistory.sh does not work in that case.
cat Test.txt | while read line; do autosys_showJobHistory.sh $line | grep 1[1..6]:[0..9][0..9] | grep 24.12.2012 | tail -1 ; done
What I am doing wrong in the second command ?
Wrote all of below, and then noticed your grep statement.
Recall that ksh doesn't support .. as an indicator for 'expand this range of values'. (I assume that's your intent). It's also made ambiguous by your lack of quoting arguments to grep. If you were using syntax that the shell would convert, then you wouldn't really know what reg-exp is being sent to grep. Always better to quote argments, unless you know for sure that you need the unquoted values. Try rewriting as
grep '1[1-6]:[0-9][0-9]' | grep '24.12.2012'
Also, are you deliberately using the 'match any char' operator '.' OR do you want to only match a period char? If you want to only match a period, then you need to escape it like \..
Finally, if any of your files you're processing have been created on a windows machine and then transfered to Unix/Linux, very likely that the line endings (Ctrl-MCtrl-J) (\r\n) are causing you problems. Cleanup your PC based files (or anything that was sent via ftp) with dos2unix file [file2 ...].
If the above doesn't help, You'll have to "divide and conquer" to debug your problem.
When I did the following tests, I got the expected output
$ echo "var1" | while read line ; do print "line=${line}" ; done
line=var1
$ vi Test.txt
$ cat Test.txt
var1
$ cat Test.txt | while read line ; do print "line=${line}" ; done
line=var1
Unrelated to your question, but certain to cause comment is your use of the cat commnad in this context, which will bring you the UUOC award. That can be rewritten as
while read line ; do print "line=${line}" ; done < Test.txt
But to solve your problem, now turn on the shell debugging/trace options, either by changing the top line of the script (the shebang line) like
#!/bin/ksh -vx
Or by using a matched pair to track the status on just these lines, i.e.
set -vx
while read line; do
print -u2 -- "#dbg: Line=${line}XX"
autosys_showJobHistory.sh $line \
| grep 1[1..6]:[0..9][0..9] \
| grep 24.12.2012 \
| tail -1
done < Test.txt
set +vx
I've added an extra debug step, the print -u2 -- .... (u2=stderror, -- closes option processing for print)
Now you can make sure no extra space or tab chars are creeping in, by looking at that output.
They shouldn't matter, as you have left your $line unquoted. As part of your testing, I'd recommend quoting it like "${line}".
Then I'd comment out the tail and the grep lines. You want to see what step is causing this to break, right? So does the autosys_script by itself still produce the intermediate output you're expecting? Then does autosys + 1 grep produce out as expected, +2 greps, + tail? You should be able to easily see where you're loosing your output.
IHTH

Resources