This is a crawler from GitHub that I am to implement on myself but am unable to read the bash since am a novice. Can this be explained in answer
#!/bin/bash
# Create an array files that contains list of filenames
files=($(< url.txt))
cities=($(< city.txt))
url="http://www.grotal.com/"
citycodes=($(<citycode.txt))
# Read through the url.txt file and execute wget command for every filename
while IFS='=| ' read -r param uri; do
for file in "${files[#]}"; do
for city in "${cities[#]}"; do
mkdir "${city}"
mkdir "${city}/${file}"
wget -O "${city}/${file}/${file}${citycodes[#]}" "${uri}${url}${city}/${file}-${citycodes[#]}/"
done
done
done < url.txt
specifically these (even if u choose to downvote...)
while IFS='=| ' read -r param uri;
and then this:
done < url.txt
Let's break this down into pieces:
read, unless given a non-default -d argument to specify a terminator to use in place of the newline, reads a single line from stdin (that is, reads up to the next newline); splits that line on IFS characters, and writes each field into a different variable. If it stops being able to read more data before reaching a newline, then it emits a nonzero exit status, even if it successfully populated the variables given. (The -r argument prevents read from treating backslashes as continuation characters rather than literals; unless you have a specific reason to have continuation characters available in the context at hand, you should make a habit of passing -r to read by default).
< url.txt redirects a read handle on url.txt into stdin for the command (including a compound command such as a while loop) to which it's appended.
A while loop runs the conditional command it's given, checks whether that conditional reports success or failure, and then proceeds to run the body and restart on success, or exit on failure.
Thus, if you have IFS='=| ' read -r param uri, it will read a single line from stdin; assign everything up to the first =, | or space to the variable named param, and assign what's left to the variable uri.
If you put that in the conditional part of a while loop, then the loop will operate until that read fails -- as it will if there isn't more content (up to and including a newline character) available to be read.
For more in-depth discussion of the idiom and its uses, see BashFAQ #1.
Some asides:
Using mkdir -p -- "${city}/${file}" will let you have only a single mkdir command that creates both directories (and avoids generating error messages if they already exist).
Using readarray -t files < url.txt is a more robust way to read the contents of url.txt into an array named files, though it requires bash 4.0 or newer. For older versions of the shell, consider IFS=$'\n' read -r -d '' -a files <url.txt || (( ${#files[#]} )). These will behave far better than the original idiom if you have wildcards, whitespace, or other unexpected content in your input files.
Related
I have a file ('list') which contains a large list of filenames, with all kinds of character combinations such as:
-sensei
I am using the following script to process this list of files:
#!/bin/bash
while read -r line
do
html2text -o ./text/$line $line
done < list
Which is giving me 'Cannot open input file' errors.
What is the correct way of dealing with these filenames, to prevent any errors?
I have changed the example list above to now include only one filename (out of many) which does not work, no matter how I quote or don't quote it.
#!/bin/bash
while read -r line
do
html2text -o "./text/$line" "$line"
done < list
The error I get is:
Unrecognized command line option "-sensei", try "-help".
As such this question does not resolve this issue.
Something like this should fix your issues (unless the file list has CRLF line endings):
while IFS='' read -r file
do
html2text -o ./text/"$file" -- "$file"
done < filelist.txt
notes:
IFS='' read -r is mandatory when you want to capture a line accurately
most commands support -- for signaling the end of options; whatever the following arguments might be, they will not be treated as options. BTW, an other common work-around for filenames that start with - is to prepend ./ to them.
Are you sure the files are present? Usually there should not be a problem with your script. For example:
#!/bin/bash
while read -r line
do
touch ./text/$line
done < $1
ls -l /tmp/text
works perfectly fine for me (with your example input). Bash should escape them for you. Are you in the right directory? If you are sure the files are present, there is a problem with html2text.
Also make sure not to have a trailing blank line in your input file.
i trying to make a script to organize a pair of list i have, and process with other programs, but im a little bit stuck now.
I want from a List in Txt process every line first creating a folder to each line in the list and then process due to different scripts i have.
But my problem is is the list i give to the script is like 3-4 elements works great and create there own directory, but if i put a list with +1000 lines, then my script process only a few elements thru the scripts.
EDIT: the process are like 30-35 scripts, different language python,bash,python and golang
Any suggestions?
cat $STORES+NEW.txt | while read NEWSTORES
do
cd $STORES && mkdir $NEWSTORES && cd $NEWSTORES && mkdir .Files
python3 checkstatus.py -n $NEWSTORES
checkemployes $NEWSTORES -status
storemanagers -s $NEWSTORES -o $NEWSTORES+managers.txt
curl -s https://redacted.com/store?=$NEWSTORES | grep -vE "<|^[\*]*[\.]*$NEWSTORES" | sort -u | awk 'NF' > $NEWSTORES+site.txt
..
..
..
..
..
..
cd ../..
done
I'm not supposed to give an answer yet but I mistakenly answered my what should be a comment reply. Anyway here a few things I can suggest:
Avoid unnecessary use of cat.
Open your input file using another FD to prevent commands that read input inside the loop from eating the input: IFS= read -ru 3 NEWSTORES; do ...; done 3< "$STORES+NEW.txt" or { IFS= read -ru "$FD" NEWSTORES; do ...; done; } {FD}< "$STORES+NEW.txt". Also see https://stackoverflow.com/a/28837793/445221.
Not completely related but don't use while loop in a pipeline since it will execute in a subshell. In the future if you try to alter a variable and expect it to be saved outside the loop, it won't. You can use lastpipe to avoid it but it's unnecessary most of the time.
Place your variable expansions around double quotes to prevent unwanted word splitting and filename expansion.
Use -r option unless you want backslashes to escape characters.
Specify IFS= before read to prevent stripping of leading and trailing spaces.
Using readarray or mapfile makes it more convenient: readarray -t ALL_STORES_DATA < "$STORES+NEW.txt"; for NEWSTORES IN "${ALL_STORES_DATA[#]}"; do ...; done
Use lowercase characters on your variables when you don't use them in a global manner to avoid conflict with bash's variables.
I have a situation where a Bash script runs and parses a user-supplied JSON file using jq. Since it's supplied by the user, it's possible for them to include values in the JSON to perform an injection attack.
I'd like to know if there's a way to overcome this. Please note, the setup of: 'my script parsing a user-supplied JSON file' cannot be changed, as it's out of my control. Only thing I can control is the Bash script.
I've tried using jq with and without the -r flag, but in each case, I was successfully able to inject.
Here's what the Bash script looks like at the moment:
#!/bin/bash
set -e
eval "INCLUDES=($(cat user-supplied.json | jq '.Include[]'))"
CMD="echo Includes are: "
for e in "${INCLUDES[#]}"; do
CMD="$CMD\\\"$e\\\" "
done
eval "$CMD"
And here is an example of a sample user-supplied.json file that demonstrates an injection attack:
{
"Include": [
"\\\";ls -al;echo\\\""
]
}
The above JSON file results in the output:
Includes are: ""
, followed by a directory listing (an actual attack would probably be something far more malicious).
What I'd like instead is something like the following to be outputted:
Includes are: "\\\";ls -al;echo\\\""
Edit 1
I used echo as an example command in the script, which probably wasn’t the best example, as then the solution is simply not using eval.
However the actual command that will be needed is dotnet test, and each array item from Includes needs to be passed as an option using /p:<Includes item>. What I was hoping for was a way to globally neutralise injection regardless of the command, but perhaps that’s not possible, ie, the technique you go for relies heavily on the actual command.
You don't need to use eval for dotnet test either. Many bash extensions not present in POSIX sh exist specifically to make eval usage unnecessary; if you think you need eval for something, you should provide enough details to let us explain why it isn't actually required. :)
#!/usr/bin/env bash
# ^^^^- Syntax below is bash-only; the shell *must* be bash, not /bin/sh
include_args=( )
IFS=$'\n' read -r -d '' -a includes < <(jq -r '.Include[]' user-supplied.json && printf '\0')
for include in "${includes[#]}"; do
include_args+=( "/p:$include" )
done
dotnet test "${include_args[#]}"
To speak a bit to what's going on:
IFS=$'\n' read -r -d '' -a arrayname reads up to the next NUL character in stdin (-d specifies a single character to stop at; since C strings are NUL-terminated, the first character in an empty string is a NUL byte), splits on newlines, and puts the result into arrayname.
The shorter way to write this in bash 4.0 or later is readarray -t arrayname, but that doesn't have the advantage of letting you detect whether the program generating the input failed: Because we have the && printf '\0' attached to the jq code, the NUL terminator this read expects is only present if jq succeeds, thus causing the read's exit status to reflect success only if jq reported success as well.
< <(...) is redirecting stdin from a process substitution, which is replaced with a filename which, when read from, returns the output of running the code ....
The reason we can set include_args+=( "/p:$include" ) and have it be exactly the same as include_args+=( /p:"$include" ) is that the quotes are read by the shell itself and used to determine where to perform string-splitting and globbing; they're not persisted in the generated content (and thus later passed to dotnet test).
Some other useful references:
BashFAQ #50: I'm trying to put a command in a variable, but the complex cases always fail! -- explains in depth why you can't store commands in strings without using eval, and describes better practices to use instead (storing commands in functions; storing commands in arrays; etc).
BashFAQ #48: Eval command and security issues -- Goes into more detail on why eval is widely frowned on.
You don't need eval at all.
INCLUDES=( $(jq '.Include[]' user-supplied.json) )
echo "Includes are: "
for e in "${INCLUDES[#]}"; do
echo "$e"
done
The worst that can happen is that the unquoted command substitution may perform word-splitting or pathname expansion where you don't want it (which is a problem in your original as well), but there's no possibility for arbitrary command execution.
I'm reading through a lookup file and performing a set of actions for each line in the file. However the while loop only reads the first line in the file and exits. Here's the current code that I have.
sql_from_lkp $lookup
function sql_from_lkp {
lkp=$1
while read line; do
sql_from_columns ${line}
echo ${line}
done < ${lkp}
}
function sql_from_columns {
table_name=$1
table_column_info_file=${table_name}_columns
line_count=`cat $table_info_file | wc -l`
....
}
By selectively commenting the code, I found that if I comment the line_count line, the while loop goes through every line in the file and works fine. So the input is getting consumed by the cat statement.
I've checked other answers and understood that ssh usually consumes the file inputs inside while loops if -n option is not used. But not sure how to fix this case. Need some help.
You've mistyped a variable name: $table_info_file should be $table_column_info_file.
If you correct that, your problem will go away.
By referring to a non-extant variable - the mistyped $table_info_file - you're essentially executing cat | wc -l (no filename argument passed to cat) in sql_from_columns(), which makes cat read from stdin.
Therefore - after having read the 1st line in the while loop - the cat command in sql_from_columns() consumes the entire rest of your input (< ${lkp}), which is why the while loop exits after the 1st iteration.
Generally,
You should double-quote all your variable references so as not to subject their values to word-splitting and globbing.
Bash won't allow you to call functions before they're defined, so as presented in your question, your code fundamentally couldn't work.
While the legacy `...` syntax for command substitutions is still supported, it has pitfalls that can be avoided with the modern $(...) syntax.
A more efficient way to count lines is to pass the input file to wc -l via < rather than via cat and a pipeline (wc also accepts filename operands directly, but it then prints the input filename after the counts).
Incidentally, you probably would have caught your mistyped variable reference more easily had you done that, as Bash would have reported an ambiguous redirect error in the absence of a filename following <.
Here's a reformulation that addresses all the issues:
function sql_from_lkp {
lkp=$1
while read line; do
sql_from_columns "${line}"
echo "${line}"
done < "${lkp}"
}
function sql_from_columns {
table_name=$1
table_column_info_file=${table_name}_columns
line_count=$(wc -l < "$table_column_info_file")
# ...
}
sql_from_lkp "$lookup"
Note that I've only added double quotes where strictly needed to make the command robust; it wouldn't hurt to add them whenever a parameter (variable) is referenced.
I have a delimited (|) input file (TableInfo.txt) that has data as shown below
dbName1|Table1
dbName1|Table2
dbName2|Table3
dbName2|Table4
...
I have a shell script (LoadTables.sh) that parses each line and calls a executable passing args from the line like dbName, TableName. This process reads data from a SQL Server and loads it into HDFS.
while IFS= read -r line;do
fields=($(printf "%s" "$line"|cut -d'|' --output-delimiter=' ' -f1-))
query=$(< ../sqoop/"${fields[1]}".sql)
sh ../ProcessName "${fields[0]}" "${fields[1]}" "$query"
done < ../TableInfo.txt
Right now my process is running in sequential for each line in the file and its time consuming based on the number of entries in the file.
Is there any way I can execute the process in parallel? I have heard about using xargs/GNU parallel/ampersand and wait options. I am not familiar on how to construct and use it. Any help is appreciated.
Note:I don't have GNU parallel installed on the Linux machine. So xargs is the only option as I have heard some cons on using ampersand and wait option.
Put an & on the end of any line you want to move to the background. Replacing the silly (buggy) array-splitting method used in your code with read's own field-splitting, this looks something like:
while IFS='|' read -r db table; do
../ProcessName "$db" "$table" "$(<"../sqoop/${table}.sql")" &
done < ../TableInfo.txt
...FYI, re: what I meant about "buggy" --
fields=( $(foo) )
...performs not only string-splitting but also globbing on the output of foo; thus, a * in the output is replaced with a list of filenames in the current directory; a name such as foo[bar] can be replaced with files named foob, fooa or foor; the globfail shell option can cause such an expansion to result in a failure, the nullglob shell option can cause it to result in an empty result; etc.
If you have GNU xargs, consider the following:
# assuming you have "nproc" to get the number of CPUs; otherwise, hardcode
xargs -P "$(nproc)" -d $'\n' -n 1 bash -c '
db=${1%|*}; table=${1##*|}
query=$(<"../sqoop/${table}.sql")
exec ../ProcessName "$db" "$table" "$query"
' _ < ../TableInfo.txt