Trying to make a bash script that extracts html code from files - bash

I'm making a script to extract the HTML from .html files in a directory; the files happen to contain non-HTML text outside the html tags. I want the output to overwrite the source files.
Here is what I have so far, but I'm having trouble getting it to work.
#!/bin/bash
for f in `ls .`; do
if [[ $f =~ \.html$ ]]
then
cat $f | tr "\n" "|" | grep -o '<html>.*</html>' | sed 's/|/\n/g' > $f
fi
done

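#!/bin/bash
# "> $f" truncates the file before cat can read it, so write to a temp file and then move it into place.
for f in *.html; do
[ -e "$f" ] || continue
cat "$f" | tr "\n" "|" | grep -o '<html>.*</html>' | sed 's/|/\n/g' > "$f.temp"
mv "$f.temp" "$f"
done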

You can replace the whole script with:
sed -i '/<[Hh][Tt][Mm][Ll]/,/<\/[Hh][Tt][Mm][Ll]/!d' *.html
Or if you don't need it to be case-insensitive:
sed -i '/<html/,/<\/html/!d' *.html
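The original attempt fails because the shell truncates $f for the `> $f` redirection before cat gets to read it. Below is a minimal sketch of the same range extraction for systems where sed has no -i option. It assumes the opening and closing tags are lowercase and sit on different lines; extract_html_block is a hypothetical helper name, not something from the answers above.

```bash
#!/usr/bin/env bash
# extract_html_block FILE  (hypothetical helper name)
# Keeps only the lines from the first one containing "<html" through the
# next one containing "</html", then writes the result back to FILE.
extract_html_block() {
    local f=$1 tmp
    tmp=$(mktemp) || return 1
    # Write to a temp file first: redirecting straight to "$f" would
    # truncate it before sed gets to read it.
    if sed -n '/<html/,/<\/html/p' "$f" > "$tmp"; then
        cat "$tmp" > "$f"      # cat keeps the permissions of the original file
    fi
    rm -f -- "$tmp"
}

# Apply it to every .html file in the current directory.
for f in ./*.html; do
    [ -e "$f" ] || continue    # the glob matched nothing
    extract_html_block "$f"
done
```

Copying the temp file back with cat rather than mv is a deliberate choice here: mktemp creates files readable only by the owner, and mv would carry those permissions over to the .html file.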

Related

Shell: Add string to the end of each line, which match the pattern. Filenames are given in another file

I'm still new to the shell and need some help.
I have a file stapel_old.
Also I have in the same directory files like english_old_sync, math_old_sync and vocabulary_old_sync.
The content of stapel_old is:
english
math
vocabulary
The content of e.g. english is:
basic_grammar.md
spelling.md
orthography.md
I want to manipulate all files which are given in stapel_old like in this example:
take the first line of stapel_old 'english', (after that math, and so on)
convert in this case english to english_old_sync, (or after that what is given in second line, e.g. math to math_old_sync)
search in english_old_sync line by line for the pattern '.md'
And append to each line after .md :::#a1
The result should be e.g. of english_old_sync:
basic_grammar.md:::#a1
spelling.md:::#a1
orthography.md:::#a1
of math_old_sync:
geometry.md:::#a1
fractions.md:::#a1
and so on. stapel_old should stay unchanged.
How can I realize that?
I tried with sed -n, while loop (while read -r line), and I'm feeling it's somehow the right way - but I still get errors and not the expected result after 4 hours inspecting and reading.
Thank you!
EDIT
Here is the working code (The files are stored in folder 'olddata'):
clear
echo -e "$(tput setaf 1)$(tput setab 7)Learning directories:$(tput sgr 0)\n"
# put here directories which should not become flashcards, command: | grep -v 'name_of_directory_which_not_to_learn1' | grep -v 'directory2'
ls ../ | grep -v 00_gliederungsverweise | grep -v 0_weiter | grep -v bibliothek | grep -v notizen | grep -v Obsidian | grep -v z_nicht_uni | tee olddata/stapel_old
# count folders
echo -ne "\nHow much different folders: " && wc -l olddata/stapel_old | cut -d' ' -f1 | tee -a olddata/stapel_old
echo -e "Are this learning directories correct? [j ODER y]--> yes; [Other]-->no\n"
read lernvz_korrekt
if [ "$lernvz_korrekt" = j ] || [ "$lernvz_korrekt" = y ];
then
read -n 1 -s -r -p "Learning directories correct. Press any key to continue..."
else
read -n 1 -s -r -p "Learning directories not correct, please change in line 4. Press any key to continue..."
exit
fi
echo -e "\n_____________________________\n$(tput setaf 6)$(tput setab 5)Found cards:$(tput sgr 0)$(tput setaf 6)\n"
#GET && WRITE FOLDER NAMES into olddata/stapel_old
anzahl_zeilen=$(cat olddata/stapel_old |& tail -1)
#GET NAMES of .md files of every stapel and write All to 'stapelname'_old_sync
i=0
name="var_$i"
for (( num=1; num <= $anzahl_zeilen; num++ ))
do
i="$((i + 1))"
name="var_$i"
name=$(cat olddata/stapel_old | sed -n "$num"p)
find ../$name/ -name '*.md' | grep -v trash | grep -v Obsidian | rev | cut -d'/' -f1 | rev | tee olddata/$name"_old_sync"
done
(tput sgr 0)
I tried to add:
input="olddata/stapel_old"
while IFS= read -r line
do
sed -n "$line"p olddata/stapel_old
done < "$input"
The code to change only the english_old_sync is:
lines=$(wc -l olddata/english_old_sync | cut -d' ' -f1)
for ((num=1; num <= $lines; num++))
do
content=$(sed -n "$num"p olddata/english_old_sync)
sed -i "s/"$content"/""$content":::#a1/g"" olddata/english_old_sync
done
So now this needs to be an inner for-loop inside an outer for-loop that holds the variable for english, right?
stapel_old should stay unchanged.
You could try a while + read loop and embed sed inside the loop.
#!/usr/bin/env bash
while IFS= read -r files; do
echo cp -v "$files" "${files}_old_sync" &&
echo sed '/^.*\.md$/s/$/:::#a1/' "${files}_old_sync"
done < olddata/stapel_old
convert in this case english to english_old_sync, (or after that what is given in second line, e.g. math to math_old_sync)
cp copies the file under a new name. If the goal is to rename the original files listed in stapel_old, change cp to mv.
The -n and -i flags of sed were omitted; include them if needed.
The script also assumes that there are no empty/blank lines in stapel_old. If there are, add an additional test after the line containing do.
[[ -n $files ]] || continue
It also assumes that the entries in stapel_old are existing files. Just in case, add an additional test.
[[ -e $files ]] || { printf >&2 '%s no such file or directory.\n' "$files"; continue; }
Or an if statement.
if [[ ! -e $files ]]; then
printf >&2 '%s no such file or directory\n' "$files"
continue
fi
See also help test
See also help continue
Combining them all together should be something like:
#!/usr/bin/env bash
while IFS= read -r files; do
[[ -n $files ]] || continue
[[ -e $files ]] || {
printf >&2 '%s no such file or directory.\n' "$files"
continue
}
echo cp -v "$files" "${files}_old_sync" &&
echo sed '/^.*\.md$/s/$/:::#a1/' "${files}_old_sync"
done < olddata/stapel_old
Remove the echos once you're satisfied with the output, so that the script actually copies/renames and edits the files.
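If the *_old_sync files already exist, as in the question's EDIT, the copy step can be skipped and each file edited in place. A minimal sketch under that assumption, using GNU sed's -i as the question already does; append_marker is a hypothetical helper name. The question's script appends a folder count as the last line of stapel_old, so entries without a matching file are reported and skipped.

```bash
#!/usr/bin/env bash
# append_marker LIST DIR  (hypothetical helper name)
# For every name in LIST, append ":::#a1" to each line of DIR/<name>_old_sync
# that ends in ".md". LIST itself is only read, never modified.
append_marker() {
    local list=$1 dir=$2 name target
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue                 # skip blank lines
        target="$dir/${name}_old_sync"
        [ -f "$target" ] || {
            printf >&2 '%s: no such file, skipping\n' "$target"
            continue
        }
        # "&" stands for the matched text, i.e. the ".md" at the end of the line.
        sed -i 's/\.md$/&:::#a1/' "$target"
    done < "$list"
}

# Usage, with the layout from the question:
#   append_marker olddata/stapel_old olddata
```

Because the pattern is anchored to the end of the line, running the function a second time does not append the marker twice.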

Looping through each file in directory - bash

I'm trying to perform a certain operation on each file in a directory, but there is a problem with the order in which the script goes through them. It should handle one file at a time. The long line (unzipping, grepping, zipping) works fine on a single file without a script, so the problem is in the loop. Any ideas?
The script should grep through each zipped file and look for word1 or word2. If at least one of them exists, then:
unzip file
grep word1 and word2 and save it to file_done
remove unzipped file
zip file_done to /donefiles/ with original name
remove file_done from original directory
#!/bin/bash
for file in *.gz; do
counter=$(zgrep -c 'word1\|word2' $file)
if [[ $counter -gt 0 ]]; then
echo $counter
for file in *.gz; do
filenoext=${file::-3}
filedone=${filenoext}_done
echo $file
echo $filenoext
echo $filedone
gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip -f -c $filedone > /donefiles/$file | rm -f $filedone
done
else
echo "nothing to do here"
fi
done
The code snippet you've provided has a few problems, e.g. an unneeded nested for loop and an erroneous pipeline
(the whole line gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip...).
Note also that your code will work correctly only if the *.gz file names contain no spaces (or special characters).
Also, zgrep -c 'word1\|word2' will match strings like line_starts_withword1_orword2_ as well.
Here is the working version of the script:
#!/bin/bash
for file in *.gz; do
counter=$(zgrep -c -E 'word1|word2' "$file") # counter is the number of lines in $file that match word1 or word2
if [[ $counter -gt 0 ]]; then
name=$(basename $file .gz)
zcat $file | grep -E 'word1|word2' > ${name}_done
gzip -f -c ${name}_done > /donefiles/$file
rm -f ${name}_done
else
echo 'nothing to do here'
fi
done
What we can improve here:
since we decompress the file anyway to check for word1|word2, we can write the matches to a temp file and avoid decompressing twice
we don't need to count how many times word1 or word2 occur in the file; we only need to check for their presence
${name}_done can be a temp file that is cleaned up automatically
we can use a while loop to handle file names with spaces
#!/bin/bash
tmp=$(mktemp /tmp/gzip_demo.XXXXXX) # create temp file for us
trap 'rm -f "$tmp"' EXIT INT TERM QUIT HUP # clean $tmp upon exit or termination
find . -maxdepth 1 -mindepth 1 -type f -name '*.gz' | while IFS= read -r f; do
# quotes around $f are now required in case of spaces in it
s=$(basename "$f") # short name w/o dir
gunzip -f -c "$f" | grep -P '\b(word1|word2)\b' > "$tmp"
[ -s "$tmp" ] && gzip -f -c "$tmp" > "/donefiles/$s" # create archive if anything is found
done
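The same idea can be sketched with a plain glob loop, which copes with spaces in file names without needing find. This is only an illustration: filter_gz is a hypothetical helper name, and the source directory, destination directory and pattern are passed as arguments rather than hard-coded.

```bash
#!/usr/bin/env bash
# filter_gz SRC DEST PATTERN  (hypothetical helper name)
# For every SRC/*.gz containing a line that matches PATTERN (an extended
# regex), write only the matching lines, recompressed, to DEST/<same name>.
# The original archives are left untouched.
filter_gz() {
    local src=$1 dest=$2 pattern=$3 f tmp
    tmp=$(mktemp) || return 1
    for f in "$src"/*.gz; do
        [ -e "$f" ] || continue                    # the glob matched nothing
        # Decompress once; grep exits non-zero when nothing matched,
        # in which case no archive is created for this file.
        if gzip -dc -- "$f" | grep -E -- "$pattern" > "$tmp"; then
            gzip -c < "$tmp" > "$dest/${f##*/}"
        fi
    done
    rm -f -- "$tmp"
}

# Usage, with the names from the question:
#   filter_gz . /donefiles 'word1|word2'
```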
It looks like you have an inner loop inside the outer one:
#!/bin/bash
for file in *.gz; do
counter=$(zgrep -c 'word1\|word2' $file)
if [[ $counter -gt 0 ]]; then
echo $counter
for file in *.gz; do #<<< HERE
filenoext=${file::-3}
filedone=${filenoext}_done
echo $file
echo $filenoext
echo $filedone
gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip -f -c $filedone > /donefiles/$file | rm -f $filedone
done
else
echo "nothing to do here"
fi
done
The inner loop goes through all the files in the directory whenever one of them contains word1 or word2. You probably want this:
#!/bin/bash
for file in *.gz; do
counter=$(zgrep -c 'word1\|word2' $file)
if [[ $counter -gt 0 ]]; then
echo $counter
filenoext=${file::-3}
filedone=${filenoext}_done
echo $file
echo $filenoext
echo $filedone
gunzip "$file"
grep 'word1\|word2' "$filenoext" > "$filedone"
rm -f "$filenoext"
gzip -f -c "$filedone" > "/donefiles/$file"
rm -f "$filedone"
else
echo "nothing to do here"
fi
done

Renaming files using their content

I have several files which all start with this line:
CREATE PROCEDURE **CHANGING_NAME**
I want to be able to pull the name of the procedure and use it to rename the file. There is content to each file below this first line.
Has anyone done something like this before?
Thanks
Assuming you have all files in one directory:
#!/bin/bash
for i in *.extension
do
# Assuming 3rd column of the first line is the new name of the file
# And **CHANGING_NAME** doesn't contain any space or meta characters
newname=$(awk 'NR==1 && /PROCEDURE/ {print $3}' "$i")
if [ "$newname" == "" ]; then
echo "There is no PROCEDURE in the first line";
echo "No new name for file $i";
else
mv "$i" "$newname"
fi
done
With a lot of care and pretending that the **CHANGING_NAME** is well-formed:
for file in *.files; do mv -i -- "$file" "$(awk '{print $3; exit}' "$file")" ; done
The -i option is to prevent accidentally overwriting existing files.
This version works with spaces (and many other strange characters except for /):
for file in *.files; do mv -i -- "$file" "$(sed -n '1s/^CREATE\ PROCEDURE\ \(.*\)$/\1/p' "$file")"; done
Since I was never great with awk I might suggest:
#! /bin/bash
#
for i in *.extension
do echo $i
newname=$(head -1 "${i}" | cut -d ' ' -f3)
mv -i "${i}" "${newname}"
done
This assumes all files you're looking for have the same extension. If not, and you need the extension, you could use:
#! /bin/bash
#
for i in *
do echo $i
ext="${i##*.}"
newname=$(head -1 "${i}" | cut -d ' ' -f3)
mv -i "${i}" "${newname}"."${ext}"
done
Both assume all the files are in a single directory.
You can try the following:
perl -lanE 'if($.==1&&/PROCEDURE/){close ARGV;say "$ARGV,$F[2]"}' files*
and if satisfied, change it to
perl -lanE 'if($.==1&&/PROCEDURE/){close ARGV;rename $ARGV,$F[2]}' files*
mv myfile "$(sed -n '1s/.*PROCEDURE\s*//p' myfile)"
(The sed command works on the first line only: it deletes everything up to and including PROCEDURE plus any whitespace after it, however much there is, and prints what is left. The command substitution runs sed first, so its output is used as the new file name for mv.)
To move them all and add the extension .ext:
for f in *.ext; do mv -- "$f" "$(sed -n '1s/.*PROCEDURE\s*//p' "$f").ext"; done
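The answers above differ mainly in how much they check before calling mv. A more defensive variant could look like the sketch below; rename_by_procedure is a hypothetical helper name, and it assumes the procedure name is the third whitespace-separated word of the first line and contains no spaces.

```bash
#!/usr/bin/env bash
# rename_by_procedure FILE  (hypothetical helper name)
# Renames FILE to the procedure name found on its first line, staying in
# the same directory. Returns non-zero and leaves FILE alone when the first
# line has no usable name or the target already exists.
rename_by_procedure() {
    local f=$1 dir name
    # Third field of line 1, only if the line starts with CREATE PROCEDURE;
    # awk stops after the first line either way.
    name=$(awk 'NR == 1 && $1 == "CREATE" && $2 == "PROCEDURE" { print $3 } { exit }' "$f")
    case $name in
        '' | */*) printf >&2 'skipping %s: no usable name\n' "$f"; return 1 ;;
    esac
    dir=$(dirname -- "$f")
    if [ -e "$dir/$name" ]; then
        printf >&2 'skipping %s: %s already exists\n' "$f" "$dir/$name"
        return 1
    fi
    mv -- "$f" "$dir/$name"
}

# Usage (the *.sql extension is an assumption):
#   for f in *.sql; do rename_by_procedure "$f"; done
```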

Simplest Bash code to find what files from a defined list don't exist in a directory?

This is what I came up with. It works perfectly -- I'm just curious if there's a smaller/crunchier way to do it. (wondering if possible without a loop)
files='file1|file2|file3|file4|file5'
path='/my/path'
found=$(find "$path" -regextype posix-extended -type f -regex ".*\/($files)")
for file in $(echo "$files" | tr '|', ' ')
do
if [[ ! "$found" =~ "$file" ]]
then
echo "$file"
fi
done
You can do this without invoking any external tools:
IFS="|"
for file in $files
do
[ -f "$path/$file" ] || printf "%s\n" "$file"
done
Your code will break if you have file names containing whitespace. This is how I would do it, which is a bit more concise.
echo "$files" | tr '|' '\n' | while IFS= read -r file; do
[ -e "$path/$file" ] || echo "$file"
done
You can probably play around with xargs if you want to get rid of the loop altogether.
$ eval "ls $path/{${files//|/,}} 2>&1 1>/dev/null | awk '{print \$4}' | tr -d :"
Or use awk
$ echo -n $files | awk -v path=$path -v RS='|' '{printf("! [[ -e %s ]] && echo %s\n", path"/"$0, path"/"$0) | "bash"}'
without whitespace in filenames:
files=(mbox todo watt zoff xorf)
for f in ${files[@]}; do test -f $f || echo $f ; done
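The array approach also works for names containing whitespace once the expansions are quoted. A minimal sketch that keeps the question's '|'-separated list and its path variable; missing_files is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# missing_files DIR NAME...  (hypothetical helper name)
# Prints every NAME that does not exist inside DIR, one per line.
missing_files() {
    local dir=$1 name
    shift
    for name in "$@"; do
        [ -e "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Split the question's '|'-separated string into an array. IFS is changed
# for this one read only, so the rest of the script is unaffected.
files='file1|file2|file3|file4|file5'
path='/my/path'
IFS='|' read -r -a names <<< "$files"
missing_files "$path" "${names[@]}"
```

This still loops, but it starts no external process and does not change IFS globally.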

how to list file names into a function

I have a folder with a bunch of files; each file contains only a URL, e.g.
http://itunes.apple.com/us/app/keynote/id361285480?mt=8
Here is my code. How can I get it to do this for each url in each file?
var='{"object":"App","action":"scrape","args":{"itunes_url":"!!!!HERE!!!!"}}'
string=$(echo "$var" | sed -e 's/"/\\"/g')
string='{"request":"'"$string"'"}'
api="http://api.lewis.com"
output=$(curl -s -d "request=$string" "$api")
code=$(echo "$output" | tr '{', '\n' | sed -n "2p" | sed -e 's/:/ /' | awk '{print $2}')
if [ "${code:0:1}" -ne "2" ]; then
# :(
echo "Error: response code $code was returned, "
else
string=$(echo "$output" | tr '{', '\n' | sed -e '/"signature":\(.*\)/d;/"data":\(.*\)/d;/"signature":\(.*\)/d;/"code":\(.*\)/d' |sed -e 's/\\"//g;s/\\\\\\\//\//g;s/\\//g' | tr '}', '\n' | sed -e 's/"//' | sed '/^$/d')
echo "$string"
fi
Use a for loop:
for filename in folder/*; do
# your code where you do something using "$filename"
done
Or, if you prefer to give the file names as arguments to the script:
for filename do
# your code where you do something using "$filename"
done
Then run your script followed by the files:
./script.sh folder/*
You could do:
for file in *; do
for line in $(cat "$file"); do
# Stuff goes here
done
done
Or even just:
for line in $(cat *); do
# Stuff goes here
done
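One caveat with `for line in $(cat ...)`: the unquoted expansion is split on whitespace and then treated as a glob pattern, and the example URL contains a ?. A while/read loop avoids both. The sketch below assumes one URL per line; process_url is a hypothetical placeholder for the curl/sed code from the question, and for_each_url is likewise a made-up name.

```bash
#!/usr/bin/env bash
# process_url URL  (hypothetical placeholder)
# Replace the body with the request code from the question, using "$1"
# where the question has !!!!HERE!!!!.
process_url() {
    printf 'would scrape: %s\n' "$1"
}

# for_each_url DIR  (hypothetical helper name)
# Calls process_url once for every non-empty line of every regular file in DIR.
for_each_url() {
    local dir=$1 file url
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        # "|| [ -n "$url" ]" also handles a last line without a trailing newline.
        while IFS= read -r url || [ -n "$url" ]; do
            [ -n "$url" ] || continue
            process_url "$url"
        done < "$file"
    done
}

# Usage:
#   for_each_url folder
```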
