store txt files separately for each subcategory - bash

I have several experiments. Each experiment has several replicate files. I want to place all these replicate files into text files in the following way.
Let's say there are 3 experiments and each experiment has 2 replicate files. (The number of experiments and replicates can be larger than this.)
/home/data/study1/EXP1_30/EXP1_replicate_1_30.txt
/home/data/study1/EXP1_30/EXP1_replicate_2_30.txt
/home/data/study1/EXP1_60/EXP1_replicate_1_60.txt
/home/data/study1/EXP1_60/EXP1_replicate_2_60.txt
/home/data/study1/EXP2_30/EXP2_replicate_1_30.txt
/home/data/study1/EXP2_30/EXP2_replicate_2_30.txt
/home/data/study1/EXP2_60/EXP2_replicate_1_60.txt
/home/data/study1/EXP2_60/EXP2_replicate_2_60.txt
/home/data/study1/EXP3_30/EXP3_replicate_1_30.txt
/home/data/study1/EXP3_30/EXP3_replicate_2_30.txt
/home/data/study1/EXP3_60/EXP3_replicate_1_60.txt
/home/data/study1/EXP3_60/EXP3_replicate_2_60.txt
output file1.txt will look like
/home/data/study1/EXP1/EXP1_replicate_1_30.txt,/home/data/study1/EXP1/EXP1_replicate_2_30.txt \
/home/data/study1/EXP2/EXP2_replicate_1_30.txt,/home/data/study1/EXP2/EXP2_replicate_2_30.txt \
/home/data/study1/EXP3/EXP3_replicate_1_30.txt,/home/data/study1/EXP3/EXP3_replicate_2_30.txt
output file2.txt will look like
/home/data/study1/EXP1/EXP1_replicate_1_60.txt,/home/data/study1/EXP1/EXP1_replicate_2_60.txt \
/home/data/study1/EXP2/EXP2_replicate_1_60.txt,/home/data/study1/EXP2/EXP2_replicate_2_60.txt \
/home/data/study1/EXP3/EXP3_replicate_1_60.txt,/home/data/study1/EXP3/EXP3_replicate_2_60.txt
....
My code with for loops:
ID=(30 60)
exp=("EXP1" "EXP2" "EXP3")
d=""
for txtfile in /home/data/study1/${exp[0]}/${exp[0]}*_${ID[0]}.txt
do
printf "%s%s" "$d" "$txtfile"
d=","
done
printf " \\"
printf "\n"
d=""
for txtfile in /home/data/study1/${exp[1]}/${exp[1]}*_${ID[0]}.txt
do
printf "%s%s" "$d" "$txtfile"
d=","
done
printf " \\"
printf "\n"
d=""
for txtfile in /home/data/study1/${exp[2]}/${exp[2]}*_${ID[0]}.txt
do
printf "%s%s" "$d" "$txtfile"
d=","
done
I am using for loops with index numbers for each experiment and replicate, which is very time-consuming. Is there an easier way?

I think that this does what you want:
#!/bin/bash
ids=( 30 60 )
dir=/home/data/study1
# join glob on comma, add slash at end
# modified from http://stackoverflow.com/a/3436177/2088135
join() { local IFS=,; echo "$* "'\'; } #' <- to fix syntax highlighting
i=0
for id in "${ids[#]}"; do
s=$(for exp in "$dir"/EXP*"$id"; do join "$exp/"*"$id".txt; done)
# trim off final backslash and output to file
echo "${s%?}" > file$((++i)).txt
done
Output (note that when testing, I set dir=.):
$ cat file1.txt
./EXP1_30/EXP1_replicate_1_30.txt,./EXP1_30/EXP1_replicate_2_30.txt \
./EXP2_30/EXP2_replicate_1_30.txt,./EXP2_30/EXP2_replicate_2_30.txt \
./EXP3_30/EXP3_replicate_1_30.txt,./EXP3_30/EXP3_replicate_2_30.txt
$ cat file2.txt
./EXP1_60/EXP1_replicate_1_60.txt,./EXP1_60/EXP1_replicate_2_60.txt \
./EXP2_60/EXP2_replicate_1_60.txt,./EXP2_60/EXP2_replicate_2_60.txt \
./EXP3_60/EXP3_replicate_1_60.txt,./EXP3_60/EXP3_replicate_2_60.txt
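On its own, the join helper above simply joins its arguments with commas and appends the trailing backslash:
$ join a b c
a,b,c \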

You can use the following bash script:
#!/bin/bash
i=0; n=0; files=""
sort -t_ -k5 files.txt | while read line ; do
files="$files $line"
i=$((i+1))
if [ $((i%6)) -eq 0 ] ; then
n=$((n+1))
cat $files > "$n.txt"
files=""
fi
done

You can also make use of a subshell and do it from the command line (your data in dat/experiment.txt) with:
$ ( first=0; cnt=0; grep 30 dat/experiment.txt | sort | while read line; do \
[ "$first" = 0 ] && first=1 || { [ "$cnt" = 0 ] && echo ' \'; }; echo -n $line; \
((cnt++)); [ "$cnt" = 1 ] && echo -n ","; [ "$cnt" = 2 ] && cnt=0; done; \
echo "" ) >outfile1.txt
$ ( first=0; cnt=0; grep 60 dat/experiment.txt | sort | while read line; do \
[ "$first" = 0 ] && first=1 || { [ "$cnt" = 0 ] && echo ' \'; }; echo -n $line; \
((cnt++)); [ "$cnt" = 1 ] && echo -n ","; [ "$cnt" = 2 ] && cnt=0; done; \
echo "" ) >outfile2.txt
Admittedly, the one-liner ended up longer than originally anticipated in order to match your line continuations exactly. If you omit the line continuations in the output files, it reduces to (e.g.):
$ (cnt=0; grep 30 dat/experiment.txt | sort | while read line; do echo -n $line; \
((cnt++)); [ "$cnt" = 1 ] && echo -n ","; [ "$cnt" = 2 ] && echo "" && cnt=0; \
done ) >outfile1.txt
output:
$ cat outfile1.txt
/home/data/study1/EXP1_30/EXP1_replicate_1_30.txt,/home/data/study1/EXP1_30/EXP1_replicate_2_30.txt \
/home/data/study1/EXP2_30/EXP2_replicate_1_30.txt,/home/data/study1/EXP2_30/EXP2_replicate_2_30.txt \
/home/data/study1/EXP3_30/EXP3_replicate_1_30.txt,/home/data/study1/EXP3_30/EXP3_replicate_2_30.txt \
$ cat outfile2.txt
/home/data/study1/EXP1_60/EXP1_replicate_1_60.txt,/home/data/study1/EXP1_60/EXP1_replicate_2_60.txt \
/home/data/study1/EXP2_60/EXP2_replicate_1_60.txt,/home/data/study1/EXP2_60/EXP2_replicate_2_60.txt \
/home/data/study1/EXP3_60/EXP3_replicate_1_60.txt,/home/data/study1/EXP3_60/EXP3_replicate_2_60.txt \
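If you want to avoid repeating the one-liner for each ID, you can wrap the shorter variant in a loop (a sketch; the outfile_${id}.txt naming is just an example):
for id in 30 60; do
( cnt=0; grep "$id" dat/experiment.txt | sort | while read line; do echo -n $line; \
((cnt++)); [ "$cnt" = 1 ] && echo -n ","; [ "$cnt" = 2 ] && echo "" && cnt=0; \
done ) > "outfile_${id}.txt"
done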

Related

how will I fix the echo when counting chars in a string

I have an issue with the echo in the for loop: I want to count the characters of a string, especially "*", but instead it prints all the files in the current directory.
clearvar() {
int=0
str=0
uniqchar=0
}
countstring(){
for c in $(echo "${1}" | fold -w1); do
echo "$c"
if [[ $c == [0-9] ]];then
int=$(( $int + 1 ))
elif [[ $c == [a-Z] ]];then
str=$(( $str + 1 ))
else
uniqchar=$(( $uniqchar + 1 ))
fi
done
}
while [ $# -gt 0 ];do
echo "Argument input: $1"
read -p "Input: " string
rmws=$(echo $string | tr -d " ")
mashed=$rmws$1
countstring $mashed
echo -e "int: $int\nstr: $str\nuniquechar: $uniqchar\nWhole string: $mashed"
clearvar
shift
done
Example output:
Argument input: io1
Input: fj^*23
f
j
^
file1
file2
file3
2
3
i
o
1
int: 3
str: 4
uniquechar: 4
Whole string: fj^*2wio1
It interprets it as echo * instead of echo "*", so I expect it not to print out the file names.
rmws=$(echo $string | tr -d " ")
If string=* this just executes echo * and expands the *.
The same happens in:
countstring $mashed
Both of these expansions are unquoted. Quote them in double quotes. As a rule of thumb: always use double quotes.
Also the same happens in the for loop:
for c in $(echo "${1}" | fold -w1)
the expansion, as elsewhere, is unquoted, so * expands. You have to quote. That's why for i in $(..) is considered bad style - because such bugs happen. You can't write for i in "$(...)" either, because then you would iterate over a single element. To iterate over lines or elements in a stream, use a while IFS= read -r loop. You can print every character on a separate line with e.g. sed 's/./&\n/g' and iterate over lines, or use the bash extension read -n1 to read one character at a time.
while IFS= read -r -n1 c; do
..
done <<<"$1"
The <<<"$1" is a bash's "here string".
You don't need $ in arithmetic expansion. Just:
int=$(( int + 1 ))
str=$(( str + 1 ))
uniqchar=$(( uniqchar + 1 ))
or in bash you can even do:
(( int++ ))
# and so on
Your script could become:
clearvar() {
int=0
str=0
uniqchar=0
}
countstring(){
while IFS= read -r -n1 c; do
echo "$c"
if [[ $c == [0-9] ]];then
(( int++ ))
elif [[ $c == [a-Z] ]];then
(( str++ ))
else
(( uniqchar++ ))
fi
done <<<"$1"
}
while [ $# -gt 0 ]; do
echo "Argument input: $1"
read -p "Input: " string
rmws="$(echo "$string" | tr -d " ")"
mashed="$rmws$1"
countstring "$mashed"
echo "int: $int"
echo "str: $str"
echo "uniquechar: $uniqchar"
echo "Whole string: $mashed"
clearvar
shift
done
Notes:
echo has portability issues. Prefer to use printf instead.
I prefer while (($#)); do in place of while [ $# -gt 0 ]; do.
PS. I would use tr:
countstring() {
printf "%s" "$1" | tr -cd '0-9' | wc -c       # count digits
printf "%s" "$1" | tr -cd 'a-zA-Z' | wc -c    # count letters
printf "%s" "$1" | tr -d '0-9a-zA-Z' | wc -c  # count everything else
}
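A quick check, assuming the mashed string fj^*23io1 from the example input (output shown for GNU wc):
$ countstring "fj^*23io1"
3
4
2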

Bash - assign variables to yad values - sed usage in for loop

In the code below I am attempting to assign variables to the two yad values Radius and Amount.
This can be done with awk by printing the yad values to a file, but I want to avoid this if I can.
The string (that is, both yad values) is assigned to a variable and trimmed of characters, as required, using sed. However, the script stops at this line:
radius=$(sed 's|[amount*,]||g')
Two questions
is there a better way of tackling this; and
why is the script not completing? I have not been able to figure out the syntax.
EDIT: don't need the loop and working on the sed syntax
#!/bin/bash
#ifs.sh
values=`yad --form --center --width=300 --title="Test" --separator=' ' \
--button=Skip:1 \
--button=Apply:0 \
--field="Radius":NUM \
'0!0..30!1!0' \
--field="Amount":NUM \
'0!0..5!0.01!2'`
radius=$(echo "$values" | sed 's|[amount*,]||g')
amount=$(echo "$values" | sed 's/.a://')
if [ $? = 1 ]; then
echo " " >/dev/null 2>&1; else
echo "Radius = $radius"
echo "Amount = $amount"
fi
exit
Alternatives
# with separator
# radius="${values%????????}"
# amount="${values#????????}"
# without separator
# radius=$(echo "$values" | sed s'/........$//')
# amount=$(echo "$values" | sed 's/^........//')
It's easier than you think:
$ values=( $(echo '7.000000 0.100000 ') )
$ echo "${values[0]}"
7.000000
$ echo "${values[1]}"
0.100000
Replace $(echo '7.000000 0.100000 ') with yad ... so the script would be:
values=( $(yad --form --center --width=300 --title="Test" --separator=' ' \
--button=Skip:1 \
--button=Apply:0 \
--field="Radius":NUM \
'0!0..30!1!0' \
--field="Amount":NUM \
'0!0..5!0.01!2') )
if [ $? -eq 0 ]; then
echo "Radius = ${values[0]}"
echo "Amount = ${values[1]}"
fi
EDIT: Changed answer based on @Ed Morton's comment
#!/bin/bash
#ifs.sh
values=($(yad --form --center --width=300 --title="Test" --separator=' ' \
--button=Skip:1 \
--button=Apply:0 \
--field="Radius":NUM \
'0!0..30!1!0' \
--field="Amount":NUM \
'0!0..5!0.01!2'))
if [ $? -eq 0 ]; then
radius="${values[0]}"
amount="${values[1]}"
fi
exit
bash -x Output
+ '[' 0 -eq 0 ']'
+ radius=7.000000
+ amount=1.000000
+ exit
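If you prefer to avoid word-splitting an unquoted command substitution into an array, a read-based variant should also work (a sketch using the same yad options as above):
#!/bin/bash
if values=$(yad --form --center --width=300 --title="Test" --separator=' ' \
--button=Skip:1 \
--button=Apply:0 \
--field="Radius":NUM \
'0!0..30!1!0' \
--field="Amount":NUM \
'0!0..5!0.01!2'); then
# yad prints both fields on one line, e.g. "7.000000 0.100000 "
read -r radius amount <<< "$values"
echo "Radius = $radius"
echo "Amount = $amount"
fi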

How to pipe aws s3 cp to gzip to be used with "$QUERY" | psql utility

I have the following command:
"$QUERY" | psql -h $DB_HOST -p $DB_PORT -U $DB_USERNAME $DB_NAME
Where $QUERY is a command that loads files from a bucket, unzips them, and puts them into the database. It looks like the following:
COPY my_table
FROM PROGRAM 'readarray -t files <<<"$(aws s3 ls ${BUCKET_PATH} | tr [:space:] "\n")"; for (( n = ${#files[@]} - 1; n >= 0; n--)); do if [[ ${files[$n]} =~ .csv.gz$ ]]; then aws s3 cp ${BUCKET_PATH}${files[$n]} >(gzip -d -c); break; fi done'
WITH DELIMITER ',' CSV
Here is formatted bash code:
#!/usr/bin/env bash
raw_files=`aws s3 ls ${BUCKET_PATH} | tr [:space:] "\n"`
readarray -t files <<<"$raw_files"
for (( n = ${#files[@]} - 1; n >= 0; n--)); do
if [[ ${files[$n]} =~ .csv.gz$ ]];
then aws s3 cp ${BUCKET_PATH}${files[$n]} >(gzip -d -c);
break; # for test purposes, load just one file instead of all of them
fi
done
aws-CLI version
#: aws --version
#: aws-cli/1.11.13 Python/3.5.2 Linux/4.13.0-43-generic botocore/1.4.70
This script works. But when I try to use it with psql, it fails, and I cannot understand why.
How can I fix it?
Here is a script that loads data from the s3 bucket and merges it into one fat file:
#!/usr/bin/env bash
bucket_path=$1
limit_files=$2
target_file_name=$3
echo "Source bucket $bucket_path"
if [ -z $target_file_name ]; then
target_file_name="fat.csv.gz"
echo "Default target file $target_file_name"
fi
echo "Total files $(aws s3 ls $bucket_path | wc -l)"
readarray -t files <<<"$(aws s3 ls $bucket_path | tr [:space:] "\n")"
for (( n = ${#files[@]} - 1, i=1; n >= 0; n--)); do
if [[ ${files[$n]} =~ .csv.gz$ ]]; then
aws s3 cp --quiet $bucket_path${files[$n]} >(cat >> "$target_file_name");
echo "$((i++)), ${files[$n]}, current size: $(du -sh $target_file_name)"
if [ ! -z $limit_files ] && [ $i -gt $limit_files ]; then
echo "Final size $(du -sh $target_file_name)"
exit 0
fi
fi
done
exit 0
It works correctly.
But when I try to pipe this fat.csv.gz to the psql db using the following code
echo "COPY my_table
FROM PROGRAM 'gzip -d -c fat.csv.gz'
WITH DELIMITER ',' CSV" | psql -h $DB_HOST -p $DB_PORT -U $DB_USERNAME $DB_NAME
I am getting the error:
ERROR: must be superuser to COPY to or from a file
It looks like this is specific to how PostgreSQL works (I guess it's due to security reasons) - link
So the problem now is that I don't know how to rework my script so that it can pipe the fat.csv.gz. I cannot get such a privilege and have to find a workaround.
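The core of the workaround is to feed the data to COPY ... FROM STDIN through a pipe instead of using FROM PROGRAM, since COPY FROM STDIN does not require superuser rights. A minimal sketch with the connection variables from the question (this is essentially what the script below does for each chunk):
gzip -d -c fat.csv.gz | \
psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USERNAME" "$DB_NAME" \
-c "COPY my_table FROM STDIN WITH DELIMITER ',' CSV"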
I finally wrote the following bash script: it downloads files from s3, merges them into 50MB archives, and pipes them to pg in subprocesses. Hope it will be helpful for somebody:
get_current_timestamp() (
date '+%s.%N'
)
execute_sql() (
write_log "Importing data from s3 to pg..."
import_data_from_s3 "$EVENTS_PATH"
write_log "Importing data from s3 to pg...done"
)
columns() (
local columns=`echo "SELECT array_to_string(
array(SELECT column_name::text
FROM information_schema.columns
WHERE table_name ILIKE '${TMP_TABLE}'
AND column_name NOT ILIKE '${DATE_FIELD}'), ',')" | \
psql --tuples-only -h $DB_HOST -p $DB_PORT -U $DB_USERNAME $DB_NAME`
echo -n "${columns}"
)
get_timestamp_difference() (
FROM=$1
TO=$2
echo $FROM $TO | awk '{
diff = $2-$1
if (diff >= 86400) {
printf "%i days ", diff/86400
}
if (diff >= 3600) {
printf "%i hours ", (diff/3600)%24
}
if (diff >= 60) {
printf "%i mins ", (diff/60)%60
}
printf "%f secs", diff%60
}'
)
pretty_size() (
if [ ! -z $1 ]; then
local size=$1;
else
local size=`cat <&0`;
fi
echo "${size}" | \
awk '{ \
split( "B KB MB GB" , v ); \
s=1; \
while( $1>=1024 ) { \
$1/=1024; s++ \
} \
printf "%.1f%s", $1, v[s] \
}' | \
add_missing_eol >&1
)
import_data_from_s3() (
local bucket_path=$1
local limit_files=$2
local target_file_name=$3
write_log "Source bucket $bucket_path"
if [ -z ${target_file_name} ]; then
target_file_name="fat.csv.gz"
write_log "Default target file $target_file_name"
fi
if [ ! -z ${limit_files} ]; then
write_log "Import ${limit_files} files"
else
write_log "Import all files"
fi
write_log "Total files $(aws s3 ls $bucket_path | wc -l)"
readarray -t files <<<"$(aws s3 ls $bucket_path | tr [:space:] "\n")"
write_log "Remove old data files..."
find . -maxdepth 1 -type f -name "*${target_file_name}" -execdir rm -f {} +;
write_log "Remove old data files...done"
TMP_TABLE_COLUMNS=$(columns)
write_log "Importing columns: ${DW_EVENTS_TMP_TABLE_COLUMNS}"
declare -A pids
local total_data_amount=0
local file_size_bytes=0
local size_limit=$((50*1024*1024))
for (( n = ${#files[@]} - 1, file_counter=1, fat_file_counter=1; n >= 0; n--)); do
if [[ ! ${files[$n]} =~ .csv.gz$ ]]; then continue; fi
file="${fat_file_counter}-${target_file_name}"
aws s3 cp --quiet ${bucket_path}${files[$n]} >(cat >> "${file}");
file_size_bytes=$(stat -c%s "$file")
if [ $file_size_bytes -gt $size_limit ]; then
import_zip "${file}" "$(pretty_size ${file_size_bytes})" & pids["${file}"]=$!;
total_data_amount=$((total_data_amount+file_size_bytes))
write_log "Files read: ${file_counter}, total size(zipped): $(pretty_size ${total_data_amount})"
((fat_file_counter++))
fi
# write_log "${file_counter}, ${files[$n]}, current size: $(du -sh $file)"
if [ ! -z ${limit_files} ] && [ ${file_counter} -gt ${limit_files} ]; then
write_log "Final size $(du -sh ${file})"
if [ ! ${pids["${file}"]+0} ]; then
import_zip "${file}" "$(pretty_size ${file_size_bytes})" & pids["${file}"]=$!;
fi
break;
fi
((file_counter++))
done
# import the last file, which can be smaller than the size limit
if [ ! ${pids["${file}"]+0} ]; then
import_zip "${file}" "$(pretty_size ${file_size_bytes})" & pids["${file}"]=$!;
fi
write_log "Waiting for all pids: ${pids[*]}"
for pid in ${pids[*]}; do
wait $pid
done
write_log "All sub process have finished. Total size(zipped): $(pretty_size ${total_data_amount})"
)
import_zip() (
local file=$1
local size=$2
local start_time=`get_current_timestamp`
write_log "pid: $!, size: ${size}, importing ${file}...";
gzip -d -c ${file} | \
psql --quiet -h ${DB_HOST} -p ${DB_PORT} -U ${DB_USERNAME} ${DB_NAME} \
-c "COPY ${TMP_TABLE}(${TMP_TABLE_COLUMNS})
FROM STDIN
WITH DELIMITER ',' CSV";
rm $file;
local end_time=`get_current_timestamp`
write_log "pid: $!, time: `get_timestamp_difference ${start_time} ${end_time}`, size: ${size}, importing ${file}...done";
)
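As a usage sketch (the bucket path is a placeholder; the script also assumes that the write_log and add_missing_eol helpers and the DB_*, TMP_TABLE and DATE_FIELD variables are defined elsewhere):
# import at most 10 .csv.gz files from the bucket, merged into ~50MB chunks named *-fat.csv.gz
import_data_from_s3 "s3://my-bucket/events/" 10 "fat.csv.gz"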

Best way to merge two lines with same pattern

I have a text file like below
Input:
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,REDIRECTED_CALLS,0
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,CALLS_TREATED,0
I am wondering about the best way to merge the two lines into:
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,REDIRECTED_CALLS,0,CALLS_TREATED,0
With this as the input file:
$ cat file
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,REDIRECTED_CALLS,0
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,CALLS_TREATED,0
We can get the output you want with:
$ awk -F, -v OFS=, 'NR==1{first=$0;next;} {print first,$6,$7;}' file
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,REDIRECTED_CALLS,0,CALLS_TREATED,0
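If the file contains many such pairs, a keyed variant (a sketch that assumes the first five comma-separated fields identify a pair) does not depend on the two lines of a pair being adjacent:
$ awk -F, -v OFS=, '{k=$1 OFS $2 OFS $3 OFS $4 OFS $5} k in seen{print seen[k],$6,$7; delete seen[k]; next} {seen[k]=$0}' file
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,REDIRECTED_CALLS,0,CALLS_TREATED,0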
This is a more general solution that reads both lines, item by item, where items are separated by commas. After the first mismatch, the remaining items from the first line are appended to the output, followed by the remaining items from the second line.
The most complicated tool this uses is sed. Looking at it again, even sed can be replaced.
#!/bin/bash
inFile="$1"
tmp=$(mktemp -d)
sed -n '1p' <"$inFile" | tr "," "\n" > "$tmp/in1"
sed -n '2p' <"$inFile" | tr "," "\n" > "$tmp/in2"
{ while true; do
read -r f1 <&3; r1=$?
read -r f2 <&4; r2=$?
[ $r1 -ne 0 ] || [ $r2 -ne 0 ] && break
[ $r1 -ne 0 ] && echo "$f2"
[ $r2 -ne 0 ] && echo "$f1"
if [ "$f1" == "$f2" ]; then
echo "$f1"
else
while echo "$f1"; do
read -r f1 <&3 || break
done
while echo "$f2"; do
read -r f2 <&4 || break
done
fi
done; } 3<"$tmp/in1" 4<"$tmp/in2" | tr '\n' ',' | sed 's/.$/\n/'
rm -rf "$tmp"
Assuming your input file looks like this:
$ cat in.txt
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,REDIRECTED_CALLS,0
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,CALLS_TREATED,0
You can then run the script as:
$ ./merge.sh in.txt
05-29-2015,03:15:00,SESM1_0,ABC,interSesm,REDIRECTED_CALLS,0,CALLS_TREATED,0

How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file

I have a file that is HTML, and it has about 150 anchor tags. I need only the links from these tags, i.e. from something like <a href="http://www.google.com">Google</a> I want to get only the http://www.google.com part.
When I run a grep,
cat website.htm | grep -E '<a href=".*">' > links.txt
this returns to me the entire line that it found, not just the link I want, so I tried using a cut command:
cat drawspace.txt | grep -E '<a href=".*">' | cut -d’”’ --output-delimiter=$'\n' > links.txt
Except that it is wrong, and it doesn't work; it gives me some error about wrong parameters... So I assume that the file was supposed to be passed along too. Maybe like cut -d’”’ --output-delimiter=$'\n' grepedText.txt > links.txt.
But I wanted to do this in one command if possible... So I tried doing an AWK command.
cat drawspace.txt | grep '<a href=".*">' | awk '{print $2}’
But this wouldn't run either. It was asking me for more input, because I wasn't finished....
I tried writing a batch file, and it told me FINDSTR is not an internal or external command... So I assume my environment variables were messed up and rather than fix that I tried installing grep on Windows, but that gave me the same error....
The question is, what is the right way to strip out the HTTP links from HTML? With that I will make it work for my situation.
P.S. I've read so many links/Stack Overflow posts that showing my references would take too long.... If example HTML is needed to show the complexity of the process then I will add it.
I also have a Mac and PC which I switched back and forth between them to use their shell/batch/grep command/terminal commands, so either or will help me.
I also want to point out I'm in the correct directory
HTML:
<tr valign="top">
<td class="beginner">
B03
</td>
<td>
<a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
<a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B05
</td>
<td>
<a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B06
</td>
<td>
<a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>
Expected output:
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.
$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
You can use grep for this:
grep -Po '(?<=href=")[^"]*' file
It prints everything after href=" until a new double quote appears.
With your given input it returns:
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Note that it is not necessary to write cat drawspace.txt | grep '<a href=".*">', you can get rid of the useless use of cat with grep '<a href=".*">' drawspace.txt.
Another example
$ cat a
<a href="httafasdf">hello</a> asdas
<a href="hello">hello</a> asdas
other things
$ grep -Po '(?<=href=")[^"]*' a
httafasdf
hello
My guess is your PC or Mac will not have the lynx command installed by default (it's available for free on the web), but lynx will let you do things like this:
$lynx -dump -image_links -listonly /usr/share/xdiagnose/workloads/youtube-reload.html
Output:
References
file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html
http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1
It is then a simple matter to grep for the http: lines. And there even may be lynx options to print just the http: lines (lynx has many, many options).
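For example, something along these lines should work (a sketch; page.html and the grep pattern are only illustrative):
lynx -dump -listonly page.html | grep -o 'http[s]*://[^ ]*'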
Use grep to extract all the lines with links in them and then use sed to pull out the URLs:
grep -o '<a href=".*">' *.html | sed 's/\(<a href="\|\">\)//g' > link.txt;
As per the comment of triplee, using regex to parse HTML or XML files is essentially not done. Tools such as sed and awk are extremely powerful for handling text files, but when it boils down to parsing complex structured data (such as XML, HTML, JSON, ...) they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. For handling such delicate files, you need a bit more finesse, using a more targeted set of tools.
In case of parsing XML or HTML, one can easily use xmlstarlet.
In the case of an XHTML file, you can use:
xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
-t -m '//x:a/@href' -v . -n
where -N gives the XHTML namespace, if any, which is recognized by
<html xmlns="http://www.w3.org/1999/xhtml">
However, as HTML pages are often not well-formed XML, it might be handy to clean them up a bit using tidy. In the example case above this then gives:
$ tidy -q -numeric -asxhtml --show-warnings no <file.html> \
| xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
-t -m '//x:a/@href' -v . -n
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Assuming a well-formed HTML document with only 1 href link per line, here's one awk approach that doesn't need backreferences to regex capturing groups:
{m,g}awk 'NF*=2<NF' OFS= FS='^.*<[Aa] [^>]*[Hh][Rr][Ee][Ff]=\"|\".*$'
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Here is a (more general) dash script, which can compare the URLs (delimited by ://) in two files (call this script with the --help flag to find out how to use it):
#!/bin/dash
PrintURLs () {
extract_urls_command="$insert_NL_after_URLs_command|$strip_NON_URL_text_command"
if [ "$domains_flag" = "1" ]; then
extract_urls_command="$extract_urls_command|$get_domains_command"
fi
{
eval path_to_search=\"\$$1\"
current_file_group="$2"
if [ ! "$skip_non_text_files_flag" = "1" ]; then
printf "\033]0;%s\007" "Loading non text files from group [$current_file_group]...">"$print_to_screen"
eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.docx' \\\) "$find_params" -exec unzip -q -c '{}' 'word/_rels/document.xml.rels' \\\;
eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.xlsx' \\\) "$find_params" -exec unzip -q -c '{}' 'xl/worksheets/_rels/*' \\\;
eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.pptx' -o -name '*.ppsx' \\\) "$find_params" -exec unzip -q -c '{}' 'ppt/slides/slide1.xml' \\\;
eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.odt' -o -name '*.ods' -o -name '*.odp' \\\) "$find_params" -exec unzip -q -c '{}' 'content.xml' \\\;
eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.pdf' \\\) "$find_params" -exec pdftotext '{}' '-' \\\;
fi
eval find \"\$path_to_search\" ! -type d ! -path '.' "$find_params"|{
count=0
while IFS= read file; do
if [ ! "$(file -bL --mime-encoding "$file")" = "binary" ]; then
count=$((count+1))
printf "\033]0;%s\007" "Loading text files from group [$current_file_group] - file $count...">"$print_to_screen"
cat "$file"
fi
done
}
printf "\033]0;%s\007" "Extracting URLs from group [$current_file_group]...">"$print_to_screen"
} 2>/dev/null|eval "$extract_urls_command"
}
StoreURLsWithLineNumbers () {
count_all="0"
mask="00000000000000000000"
#For <file group 1>: initialise next variables:
file_group="1"
count=0
dff_command_text=""
if [ ! "$dff_command_flag" = "0" ]; then
dff_command_text="Step $dff_command_flag - "
fi
for line in $(PrintURLs file_params_1 1; printf '%s\n' "### Sepparator ###"; for i in $(seq 2 $file_params_0); do PrintURLs file_params_$i 2; done;); do
if [ "$line" = "### Sepparator ###" ]; then
eval lines$file_group\1\_0=$count
eval lines$file_group\2\_0=$count
#For <file group 2>: initialise next variables:
file_group="2";
count="0"
continue;
fi
printf "\033]0;%s\007" "Storing URLs into memory [$dff_command_text""group $file_group]: $((count + 1))...">"$print_to_screen"
count_all_prev=$count_all
count_all=$((count_all+1))
count=$((count+1))
if [ "${#count_all_prev}" -lt "${#count_all}" ]; then
mask="${mask%?}"
fi
number="$mask$count_all"
eval lines$file_group\1\_$count=\"\$number\"
eval lines$file_group\2\_$count=\"\$line\" #URL
done;
eval lines$file_group\1\_0=$count
eval lines$file_group\2\_0=$count
}
trap1 () {
CleanUp
#if not running in a subshell: print "Aborted"
if [ "$dff_command_flag" = "0" ]; then
printf "\nAborted.\n">"$print_to_screen"
fi
#kill all children processes ("-$$": "-" = all processes in the process group with ID "$$" (current shell ID)), suppressing "Terminated" message (sending signal SIGPIPE ("PIPE") instead of SIGTERM ("INT") suppresses the "Terminated" message):
kill -s PIPE -- -$$
exit
}
CleanUp () {
#Restore "INTERRUPT" (CTRL-C) and "TERMINAL STOP" (CTRL-Z) signals:
trap - INT
trap - TSTP
#Clear the title:
printf "\033]0;%s\007" "">"$print_to_screen"
#Restore initial IFS:
#IFS="$initial_IFS"
unset IFS
#Restore initial directory:
cd "$initial_dir"
DestroyArray flag_params
DestroyArray file_params
DestroyArray find_params
DestroyArray lines11
DestroyArray lines12
DestroyArray lines21
DestroyArray lines22
##Kill current shell with PID $$:
#kill -INT $$
}
DestroyArray () {
eval eval array_length=\'\$$1\_0\'
if [ -z "$array_length" ]; then array_length=0; fi
for i in $(seq 1 $array_length); do
eval unset $1\_$i
done
eval unset $1\_0
}
PrintErrorExtra () {
{
printf '%s\n' "Command path:"
printf '%s\n' "$current_shell '$current_script_path'"
printf "\n"
#Flag parameters are printed non-quoted:
printf '%s\n' "Flags:"
for i in $(seq 1 $flag_params_0); do
eval current_param="\"\$flag_params_$i\""
printf '%s\n' "$current_param"
done
if [ "$flag_params_0" = "0" ]; then printf '%s\n' "<none>"; fi
printf "\n"
#Path parameters are printed quoted with '':
printf '%s\n' "Paths:"
for i in $(seq 1 $file_params_0); do
eval current_param="\"\$file_params_$i\""
printf '%s\n' "'$current_param'"
done
if [ "$file_params_0" = "0" ]; then printf '%s\n' "<none>"; fi
printf "\n"
#Find parameters are printed quoted with '':
printf '%s\n' "'find' parameters:"
for i in $(seq 1 $find_params_0); do
eval current_param="\"\$find_params_$i\""
printf '%s\n' "'$current_param'"
done
if [ "$find_params_0" = "0" ]; then printf '%s\n' "<none>"; fi
printf "\n"
}>"$print_error_messages"
}
DisplayHelp () {
printf '%s\n' ""
printf '%s\n' " uniql.sh - A script to compare URLs ( containing '://' ) in a file compared to a group of files"
printf '%s\n' " "
printf '%s\n' " Usage:"
printf '%s\n' " "
printf '%s\n' " dash '/path/to/this/script.sh' <flags> '/path/to/file1' ... '/path/to/fileN' [ --find-parameters <find_parameters> ]"
printf '%s\n' " where:"
printf '%s\n' " - The group 1: '/path/to/file1' and the group 2: '/path/to/file2' ... '/path/to/fileN' - are considered the two groups of files to be compared"
printf '%s\n' " "
printf '%s\n' " - <flags> can be:"
printf '%s\n' " --help"
printf '%s\n' " - displays this help information"
printf '%s\n' " --different or -d"
printf '%s\n' " - find URLs that differ"
printf '%s\n' " --common or -c"
printf '%s\n' " - find URLs that are common"
printf '%s\n' " --domains"
printf '%s\n' " - compare and print only the domains (plus subdomains) of the URLs for: the group 1 and the group 2 - for the '-c' or the '-d' flag"
printf '%s\n' " --domains-full"
printf '%s\n' " - compare only the domains (plus subdomains) of the URLs but print the full URLs for: the group 1 and the group 2 - for the '-c' or the '-d' flag"
printf '%s\n' " --preserve-order or -p"
printf '%s\n' " - preserve the order and the occurences in which the links appear in group 1 and in group 2"
printf '%s\n' " - Warning: when using this flag - process substitution is used by this script - which does not work with the \"dash\" shell (throws an error). For this flag, you can use other \"dash\" syntax compatible shells, like: bash, zsh, ksh"
printf '%s\n' " --skip-non-text"
printf '%s\n' " - skip non-text files from search (does not look into: .docx, .xlsx, .pptx, .ppsx, .odt, .ods, .odp and .pdf files)"
printf '%s\n' " --find-parameters <find_parameters>"
printf '%s\n' " - all the parameters given after this flag, are considered 'find' parameters"
printf '%s\n' " - <find_parameters> can be: any parameters that can be passed to the 'find' utility (which is used internally by this script) - such as: name/path filters"
printf '%s\n' " -h"
printf '%s\n' " - also look in hidden files"
printf '%s\n' " "
printf '%s\n' " Output:"
printf '%s\n' " - '<' - denote URLs from the group 1: '/path/to/file1'"
printf '%s\n' " - '>' - denote URLs from the group 2: '/path/to/file2' ... '/path/to/fileN'"
printf '%s\n' " "
printf '%s\n' " Other commands that might be useful:"
printf '%s\n' " "
printf '%s\n' " - filter results - print lines containing string (highlight):"
printf '%s\n' " ...|grep \"string\""
printf '%s\n' " "
printf '%s\n' " - filter results - print lines not containing string:"
printf '%s\n' " ...|grep -v \"string\""
printf '%s\n' " "
printf '%s\n' " - filter results - print lines containing: string1 or string2 or ... stringN:"
printf '%s\n' " ...|awk '/string1|string2|...|stringN/'"
printf '%s\n' " "
printf '%s\n' " - filter results - print lines not containing: string1 or string2 or ... stringN:"
printf '%s\n' " ...|awk '"'!'"/string1|string2|...|stringN/'"
printf '%s\n' " "
printf '%s\n' " - filter results - print lines in '/file/path/2' that are in '/file/path/1':"
printf '%s\n' " grep -F -f '/file/path/1' '/file/path/2'"
printf '%s\n' " "
printf '%s\n' " - filter results - print lines in '/file/path/2' that are not in '/file/path/1':"
printf '%s\n' " grep -F -vf '/file/path/1' '/file/path/2'"
printf '%s\n' " "
printf '%s\n' " - filter results - print columns <1> and <2> from output:"
printf '%s\n' " awk '{print \$1, \$2}'"
printf '%s\n' ""
}
# Print to "/dev/tty" = Print error messages to screen only
print_to_screen="/dev/tty"
#print_error_messages='&2'
print_error_messages="$print_to_screen"
initial_dir="$PWD" #Store initial directory value
initial_IFS="$IFS" #Store initial IFS value
NL2=$(printf '%s' "\n\n") #Store New Line for use with sed
insert_NL_after_URLs_command='sed -E '"'"'s/([a-zA-Z]*\:\/\/)/'"\\${NL2}"'\1/g'"'"
strip_NON_URL_text_command='sed -n '"'"'s/\(\(.*\([^a-zA-Z+]\)\|\([a-zA-Z]\)\)\)\(\([a-zA-Z]\)*\:\/\/\)\([^ ^\t^>^<]*\).*/\4\5\7/p'"'"
get_domains_command='sed '"'"'s/.*:\/\/\(.*\)/\1/g'"'"'|sed '"'"'s/\/.*//g'"'"
prepare_for_output_command='sed -E '"'"'s/ *([0-9]*)[\ *](<|>) *([0-9]*)[\ *](.*)/\2 \4 \1/g'"'"
remove_angle_brackets_command='sed -E '"'"'s/(<|>) (.*)/\2/g'"'"
find_params=""
#Process parameters:
different_flag="0"
common_flag="0"
domains_flag="0"
domains_full_flag="0"
preserve_order_flag="0"
dff_command1_flag="0"
dff_command2_flag="0"
dff_command3_flag="0"
dff_command4_flag="0"
dff_command_flag="0"
skip_non_text_files_flag="0"
find_parameters_flag="0"
hidden_files_flag="0"
help_flag="0"
flag_params_count=0
file_params_count=0
find_params_count=0
for param; do
if [ "$find_parameters_flag" = "0" ]; then
case "$param" in
"--different" | "-d" | "--common" | "-c" | "--domains" | \
"--domains-full" | "--preserve_order" | "-p" | "--dff_command1" | "--dff_command2" | \
"--dff_command3" | "--dff_command4" | "--skip-non-text" | "--find-parameters" | "-h" | \
"--help" )
flag_params_count=$((flag_params_count+1))
eval flag_params_$flag_params_count=\"\$param\"
case "$param" in
"--different" | "-d" )
different_flag="1"
;;
"--common" | "-c" )
common_flag="1"
;;
"--domains" )
domains_flag="1"
;;
"--domains-full" )
domains_full_flag="1"
;;
"--preserve_order" | "-p" )
preserve_order_flag="1"
;;
"--dff_command1" )
dff_command1_flag="1"
dff_command_flag="1"
;;
"--dff_command2" )
dff_command2_flag="1"
dff_command_flag="2"
;;
"--dff_command3" )
dff_command3_flag="1"
dff_command_flag="3"
;;
"--dff_command4" )
dff_command4_flag="1"
dff_command_flag="4"
;;
"--skip-non-text" )
skip_non_text_files_flag="1"
;;
"--find-parameters" )
find_parameters_flag="1"
;;
"-h" )
hidden_files_flag="1"
;;
"--help" )
help_flag="1"
;;
esac
;;
* )
file_params_count=$((file_params_count+1))
eval file_params_$file_params_count=\"\$param\"
;;
esac
elif [ "$find_parameters_flag" = "1" ]; then
find_params_count=$((find_params_count+1))
eval find_params_$find_params_count=\"\$param\"
fi
done
flag_params_0="$flag_params_count"
file_params_0="$file_params_count"
find_params_0="$find_params_count"
if [ "$help_flag" = "1" -o \( "$file_params_0" = "0" -a "$find_params_0" = "0" \) ]; then
DisplayHelp
exit 0
fi
#Check if any of the necessary utilities is missing:
error="false"
man -f find >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'find' utility is not installed!"; error="true"; }
man -f file >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'file' utility is not installed!"; error="true"; }
man -f kill >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'kill' utility is not installed!"; error="true"; }
man -f seq >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'seq' utility is not installed!"; error="true"; }
man -f ps >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'ps' utility is not installed!"; error="true"; }
man -f sort >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'sort' utility is not installed!"; error="true"; }
man -f uniq >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'uniq' utility is not installed!"; error="true"; }
man -f sed >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'sed' utility is not installed!"; error="true"; }
man -f grep >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'grep' utility is not installed!"; error="true"; }
if [ "$skip_non_text_files_flag" = "0" ]; then
man -f unzip >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'unzip' utility is not installed!"; error="true"; }
man -f pdftotext >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'pdftotext' utility is not installed!"; error="true"; }
fi
if [ "$error" = "true" ]; then
printf "\n"
CleanUp; exit 1
fi
#Process parameters/flags and check for errors:
find_params="$(for i in $(seq 1 $find_params_0;); do eval printf \'\%s \' "\'\$find_params_$i\'"; done;)"
if [ -z "$find_params" ]; then
find_params='-name "*"'
fi
if [ "$hidden_files_flag" = "1" ]; then
hidden_files_string=""
elif [ "$hidden_files_flag" = "0" ]; then
hidden_files_string="\( "'! -path '"'"'*/.*'"'"" \)"
fi
find_params="$hidden_files_string"" -a ""$find_params"
current_shell="$(ps -p $$ 2>/dev/null)"; current_shell="${current_shell##*" "}"
current_script_path=$(cd "${0%/*}" 2>/dev/null; printf '%s' "$(pwd -P)/${0##*/}")
error="false"
if [ "$different_flag" = "0" -a "$common_flag" = "0" ]; then
error="true"
printf '\n%s\n' "ERROR: Expected either -c or -d flag!">"$print_error_messages"
elif [ "$different_flag" = "1" -a "$common_flag" = "1" ]; then
error="true"
printf '\n%s\n' "ERROR: The '-c' flag cannot be used together with the '-d' flag!">"$print_error_messages"
fi
if [ "$preserve_order_flag" = "1" -a "$common_flag" = "1" ]; then
error="true"
printf '\n%s\n' "ERROR: The '-p' flag cannot be used together with the '-c' flag!">"$print_error_messages"
fi
if [ "$preserve_order_flag" = "1" -a "$current_shell" = "dash" ]; then
error="true"
printf '\n%s\n' "ERROR: When using the '-p' flag, the \"process substitution\" feature is needed, which is not available in the dash shell (it is available in shells like: bash, zsh, ksh)!">"$print_error_messages"
fi
eval find \'/dev/null\' "$find_params">/dev/null 2>&1||{
error="true"
printf '\n%s\n' "ERROR: Invalid parameters for the 'find' command!">"$print_error_messages"
}
if [ "$error" = "true" ]; then
printf "\n"
PrintErrorExtra
CleanUp; exit 1;
fi
#Check if the file paths given as parameters do exist:
error="false"
for i in $(seq 1 $file_params_0); do
eval current_file=\"\$file_params_$i\"
# If current <file> does not exist:
if [ ! -e "$current_file" ]; then # If current file does not exist:
printf '\n%s\n' "ERROR: File '$current_file' does not exist or is not accessible!">"$print_error_messages"
error="true"
elif [ ! -r "$current_file" ]; then # If current file is not readable:
printf '\n%s\n' "ERROR: File <$i> = '$current_file' is not readable!">"$print_error_messages"
error="true"
fi
done
if [ "$error" = "true" ]; then
printf "\n"
PrintErrorExtra
CleanUp; exit 1;
fi
#Proceed to finding and comparing URLs:
IFS='
'
#Trap "INTERRUPT" (CTRL-C) and "TERMINAL STOP" (CTRL-Z) signals:
trap 'trap1' INT
trap 'trap1' TSTP
if [ "$domains_full_flag" = "0" -o ! "$dff_command_flag" = "0" ]; then
StoreURLsWithLineNumbers
fi
if [ "$domains_full_flag" = "0" ]; then
if [ "$preserve_order_flag" = "0" ]; then
{
for i in $(seq 1 $lines11_0); do
printf "\033]0;%s\007" "Processing group [1] - URL: $i...">"$print_to_screen"
eval printf \'\%s\\\n\' \"\< \$lines11_$i \$lines12_$i\"
done|sort -k 3|uniq -c -f 2
for i in $(seq 1 $lines21_0); do
printf "\033]0;%s\007" "Processing group [2] - URL: $i...">"$print_to_screen"
eval printf \'\%s\\\n\' \"\> \$lines21_$i \$lines22_$i\"
done|sort -k 3|uniq -c -f 2
}|sort -k 4|{
if [ "$different_flag" = "1" ]; then
uniq -u -f 3|sort -k 3|eval "$prepare_for_output_command"
elif [ "$common_flag" = "1" ]; then
uniq -d -f 3|sort -k 3|eval "$prepare_for_output_command"|eval "$remove_angle_brackets_command"
fi
}
elif [ "$preserve_order_flag" = "1" ]; then
if [ "$different_flag" = "1" ]; then
{
URL_count=0
current_line=""
for line in $(eval diff \
\<\(\
count1=0\;\
for i in \$\(seq 1 \$lines11_0\)\; do\
count1=\$\(\(count1 + 1\)\)\;\
eval URL=\\\"\\\$lines12_\$i\\\"\;\
printf \'\%s\\n\' \"File group: 1 URL: \$count1\"\;\
printf \'\%s\\n\' \"\$URL\"\;\
done\;\
printf \'\%s\\n\' \"\#\#\# Sepparator 1\"\;\
\) \
\<\(\
count2=0\;\
for i in \$\(seq 1 $lines21_0\)\; do\
count2=\$\(\(count2 + 1\)\)\;\
eval URL=\\\"\\\$lines22_\$i\\\"\;\
printf \'\%s\\n\' \"File group: 2 URL: \$count2\"\;\
printf \'\%s\\n\' \"\$URL\"\;\
done\;\
printf \'\%s\\n\' \"\#\#\# Sepparator 2\"\;\
\) \
); do
URL_count=$((URL_count + 1))
previous_line="$current_line"
current_line="$line"
#if ( current line starts with "<" and previous line starts with "<" ) OR ( current line starts with ">" and previous line starts with ">" ):
if [ \( \( ! "${current_line#"<"}" = "${current_line}" \) -a \( ! "${previous_line#"<"}" = "${previous_line}" \) \) -o \( \( ! "${current_line#">"}" = "${current_line}" \) -a \( ! "${previous_line#">"}" = "${previous_line}" \) \) ]; then
printf '%s\n' "$previous_line"
fi
done
}
fi
fi
elif [ "$domains_full_flag" = "1" ]; then
# Command to find common domains:
uniql_command1="$current_shell '$current_script_path' -c --domains $(for i in $(seq 1 $file_params_0); do eval printf \'%s \' \\\'\$file_params_$i\\\'; done)"
# URLs that are only in first parameter file (file group 1):
uniql_command2="$current_shell '$current_script_path' -d '$file_params_1' \"/dev/null\""
# Command to find common domains:
uniql_command3="$current_shell '$current_script_path' -c --domains $(for i in $(seq 1 $file_params_0); do eval printf \'%s \' \\\'\$file_params_$i\\\'; done)"
# URLs that are only in 2..N parameter files (file group 2):
uniql_command4="$current_shell '$current_script_path' -d \"/dev/null\" $(for i in $(seq 2 $file_params_0); do eval printf \'%s \' \\\'\$file_params_$i\\\'; done)"
#Store one <command substitution> at a time (synchronously):
uniql_command1_output="$(eval $uniql_command1 --dff_command1 --find-parameters "$find_params"|sed 's/\([^ *]\) \(.*\)/\1/')"
uniql_command2_output="$(eval $uniql_command2 --dff_command2 --find-parameters "$find_params")"
uniql_command3_output="$(eval $uniql_command3 --dff_command3 --find-parameters "$find_params"|sed 's/\([^ *]\) \(.*\)/\1/')"
uniql_command4_output="$(eval $uniql_command4 --dff_command4 --find-parameters "$find_params")"
if [ "$different_flag" = "1" ]; then
# Find URLs (second escaped process substitution: \<\(...\)) that are not in the common domains list (first escaped process substitution: \<\(...\)):
# URLs in the first file given as parameter (second escaped process substitution: \<\(...\)):
eval grep \-F \-vf \<\( printf \'\%s\' \"\$uniql_command1_output\"\; \) \<\( printf \'\%s\' \"\$uniql_command2_output\"\; \)
# URLs in the files 2..N - given as parameters (second escaped process substitution: \<\(...\)):
eval grep \-F \-vf \<\( printf \'\%s\' \"\$uniql_command3_output\"\; \) \<\( printf \'\%s\' \"\$uniql_command4_output\"\; \)
elif [ "$common_flag" = "1" ]; then
# Find URLs (second escaped process substitution: \<\(...\)) that are in the common domains list (first escaped process substitution: \<\(...\)):
# URLs in the first file given as parameter (second escaped process substitution: \<\(...\)):
eval grep \-F \-f \<\( printf \'\%s\' \"\$uniql_command1_output\"\; \) \<\( printf \'\%s\' \"\$uniql_command2_output\"\; \)
# URLs in the files 2..N - given as parameters (second escaped process substitution: \<\(...\)):
eval grep \-F \-f \<\( printf \'\%s\' \"\$uniql_command3_output\"\; \) \<\( printf \'\%s\' \"\$uniql_command4_output\"\; \)
fi
# grep flags explained:
# -F = do not interpret pattern string (treat string literally)
# -v = select non-matching lines
fi
CleanUp
For the asked question - this should do it:
dash '/path/to/the/above/script.sh' -d '/path/to/file1/containing/URLs.txt' '/dev/null'
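And, per the --help text above, listing the URLs that are common to a first file and a group of other files would look something like this (the paths are placeholders):
dash '/path/to/the/above/script.sh' -c '/path/to/file1.txt' '/path/to/file2.txt' '/path/to/file3.txt'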
