Combining all files that contain the same word into a new text file, leaving newlines between individual files - shell

This is my first question here. I have a folder called "materials", which has 40 text files in it. I am basically trying to combine the text files that contain the word "carbon" (in both capitalized and lowercase form) into a single file, leaving newlines between them. I used `grep -w carbon *` to identify the files that contain the word carbon. I just don't know what to do after this point. I really appreciate all your help!

grep -il carbon materials/*.txt | while read -r line; do
echo ">> Adding $line";
cat "$line" >> result.out;
echo >> result.out;
done
Explanation
grep searches for the string in the files. -i ignores case for the searched string. -l prints only the names of the files containing the string.
The while loop iterates over the files containing the string.
cat with >> appends each file's contents to result.out.
The bare echo >> adds a newline after each file's content in result.out.
Execution
$ ls -1 materials/*.txt
materials/1.txt
materials/2.txt
materials/3.txt
$ grep -i carbon materials/*.txt
materials/1.txt:carbon
materials/2.txt:CARBON
$ grep -il carbon materials/*.txt | while read -r line; do echo ">> Adding $line"; cat "$line" >> result.out; echo >> result.out; done
>> Adding materials/1.txt
>> Adding materials/2.txt
$ cat result.out
carbon
CARBON

Try this (assuming your filenames don't contain newline characters):
grep -iwl carbon ./* |
while IFS= read -r f; do cat "$f"; echo; done > /tmp/combined
If it is possible that your filenames may contain newline characters and your shell is bash, then:
grep -iwlZ carbon ./* |
while IFS= read -r -d '' f; do cat "$f"; echo; done > /tmp/combined
grep is assumed to be GNU grep (for the -w and -Z options). Note that these will leave a trailing newline character in the file /tmp/combined.
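To sidestep filename parsing altogether, you can also loop over the glob directly and let grep -q test each file; this handles arbitrary filenames (newlines included) in plain POSIX shell. A sketch with made-up sample files:

```shell
# Demo setup (hypothetical file contents)
mkdir -p materials
printf 'carbon data\n'   > materials/a.txt
printf 'CARBON sample\n' > materials/b.txt
printf 'nothing here\n'  > materials/c.txt

# grep -q only sets the exit status, so no grep output ever needs parsing
for f in materials/*.txt; do
    if grep -iqw carbon "$f"; then
        cat "$f"
        echo        # blank separator line between files
    fi
done > combined.txt
```

Only -w is a (widely supported) extension here; everything else is POSIX.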


How to write a command line script that will loop through every line in a text file and append a string at the end of each? [duplicate]

How do I add a string after each line in a file using bash? Can it be done using the sed command, if so how?
If your sed allows in place editing via the -i parameter:
sed -e 's/$/string after each line/' -i filename
If not, you have to make a temporary file:
typeset TMP_FILE=$( mktemp )
cp -p filename "${TMP_FILE}"
sed -e 's/$/string after each line/' "${TMP_FILE}" > filename
rm -f "${TMP_FILE}"
I prefer echo. Using pure bash:
while IFS= read -r line; do echo "${line}${string}"; done < file
I prefer using awk.
$0 refers to the whole line; if you only want to append after the last column, replace $0 with it.
One way,
awk '{print $0, "string to append after each line"}' file > new_file
or this,
awk '$0=$0"string to append after each line"' file > new_file
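A quick sanity check of both forms on a throwaway file (filename and suffix are made up). Note that the comma in the first form inserts awk's output field separator, a space, while plain concatenation does not:

```shell
printf 'alpha\nbeta\n' > demo.txt

# Comma form: OFS (a space) appears between the line and the suffix
awk '{print $0, "-suffix"}' demo.txt
# -> alpha -suffix
#    beta -suffix

# Concatenation form: no separator; the assignment result is a non-empty
# string, so the default print action fires on every line
awk '$0=$0"-suffix"' demo.txt > new_demo.txt
cat new_demo.txt
# -> alpha-suffix
#    beta-suffix
```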
If you have it, the lam (laminate) utility can do it, for example:
$ lam filename -s "string after each line"
Pure POSIX shell and sponge:
suffix=foobar
while IFS= read -r l ; do printf '%s%s\n' "$l" "${suffix}" ; done < file |
sponge file
xargs and printf (note that xargs word-splits its input, so this only behaves for lines without whitespace or quote characters):
suffix=foobar
xargs -L 1 printf "%s${suffix}\n" < file | sponge file
Using join:
suffix=foobar
join file file -e "${suffix}" -o 1.1,2.99999 | sponge file
Shell tools using paste, yes, head & wc:
suffix=foobar
paste file <(yes "${suffix}" | head -$(wc -l < file) ) | sponge file
Note that paste inserts a Tab char before $suffix.
Of course sponge can be replaced with a temp file, afterwards mv'd over the original filename, as with some other answers...
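If the Tab is unwanted, POSIX paste lets '\0' in the delimiter list stand for an empty string, so the columns can be joined with no separator at all. A sketch (suffix name is made up; yes is fed to paste via stdin to avoid process substitution):

```shell
printf 'one\ntwo\n' > file
suffix=foobar

# -d '\0' means "empty delimiter" to paste, per POSIX
yes "${suffix}" | head -n "$(wc -l < file)" |
    paste -d '\0' file - > file.out
cat file.out
# -> onefoobar
#    twofoobar
```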
This just adds on, using the echo command to append a string at the end of each line in a file:
while IFS= read -r line; do echo "${line}string to add" >> output-file; done < input-file
The >> redirects the modified lines to the output file.
Sed is a little ugly; you could do it like so (note that the for loop word-splits, so this only works when the lines contain no whitespace):
hendry@i7 tmp$ cat foo
bar
candy
car
hendry@i7 tmp$ for i in `cat foo`; do echo ${i}bar; done
barbar
candybar
carbar

mv: Cannot stat - No such file or directory

I have piped the output of ls command into a file. The contents are like so:
[Chihiro]_Grisaia_no_Kajitsu_-_01_[1920x816_Blu-ray_FLAC][D2B961D6].mkv
[Chihiro]_Grisaia_no_Kajitsu_-_02_[1920x816_Blu-ray_FLAC][38F88A81].mkv
[Chihiro]_Grisaia_no_Kajitsu_-_03_[1920x816_Blu-ray_FLAC][410F74F7].mkv
My attempt to rename these episodes according to episode number is as follows:
cat grisaia | while read line;
#get the episode number
do EP=$(echo $line | egrep -o "_([0-9]{2})_" | cut -d "_" -f2)
if [[ $EP ]]
#escape special characters
then line=$(echo $line | sed 's/\[/\\[/g' | sed 's/\]/\\]/g')
mv "$line" "Grisaia_no_Kajitsu_${EP}.mkv"
fi
done
The mv commands exit with code 1 with the following error:
mv: cannot stat
'\[Chihiro\]_Grisaia_no_Kajitsu_-01\[1920x816_Blu-ray_FLAC\]\[D2B961D6\].mkv':
No such file or directory
What I really don't get is that if I copy the file that could not be stat and attempt to stat the file, it works. I can even take the exact same string that is output and execute the mv command individually.
If you surround your variable ($line) with double quotes (") you don't need to escape those special characters. So you have two options there:
Remove the following assignment completely:
then # line=$(echo $line | sed 's/\[/\\[/g' | sed 's/\]/\\]/g')
or
Remove the double quotes in the following line:
mv $line "Grisaia_no_Kajitsu_${EP}.mkv"
Further considerations
Parsing the output of ls is never a good idea. Think about filenames with spaces. See this document for more information.
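A two-line demonstration of the pitfall (directory and file names invented for the demo): the unquoted command substitution splits one filename into two words:

```shell
mkdir -p lsdemo
touch 'lsdemo/one file.mkv'

# Word splitting breaks the single filename "one file.mkv" in two
for f in $(ls lsdemo); do printf '<%s>\n' "$f"; done
# -> <one>
#    <file.mkv>
```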
The cat here is unnecessary:
cat grisaia | while read line;
...
done
Use this instead to avoid an unnecessary pipe:
while read line;
...
done < grisaia
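Putting the two fixes together, here is a sketch of the whole rename loop: a glob instead of ls, double quotes instead of escaping, and the episode number extracted with parameter expansion (the `_-_NN_` pattern is assumed from the filenames in the question):

```shell
# Demo setup with dummy files in a scratch directory
mkdir -p renamedemo
touch 'renamedemo/[Chihiro]_Grisaia_no_Kajitsu_-_01_[1920x816_Blu-ray_FLAC][D2B961D6].mkv'
touch 'renamedemo/[Chihiro]_Grisaia_no_Kajitsu_-_02_[1920x816_Blu-ray_FLAC][38F88A81].mkv'

for f in renamedemo/*.mkv; do
    ep=${f#*_-_}            # strip everything through "_-_"  -> 01_[...].mkv
    ep=${ep%%_*}            # keep the leading episode number -> 01
    case $ep in
        [0-9][0-9]) mv -- "$f" "renamedemo/Grisaia_no_Kajitsu_${ep}.mkv" ;;
    esac
done
ls renamedemo
```

The double quotes around "$f" make the sed escaping from the question unnecessary, and the case pattern stands in for the original egrep test.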
Why is it good to avoid pipes in some scenarios? (answering a comment)
Pipes create subshells (which are expensive), and they can also lead to mistakes like the following:
last=""
cat grisaia | while read line; do
last=$line
done
echo $last # surprise!! it outputs an empty string
The reason is that $last inside the loop belongs to another subshell.
Now, see the same approach without pipes:
while read line; do
last=$line
done < grisaia
echo $last # it works as expected and prints the last line
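If you want to keep the pipe in bash specifically, shopt -s lastpipe (bash ≥ 4.2, non-interactive shells only) runs the last pipeline stage in the current shell, so the variable survives. A sketch, written to a file and run with bash to make the non-interactive requirement explicit:

```shell
cat > lastpipe_demo.sh <<'EOF'
shopt -s lastpipe    # run the last stage of a pipeline in this shell, not a subshell
last=""
printf 'first\nsecond\nthird\n' | while IFS= read -r line; do
    last=$line
done
echo "$last"
EOF
bash lastpipe_demo.sh
# -> third
```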

Bash: Filter directory when piping from `ls` to `tee`

(background info)
Writing my first bash pseudo-program. The program downloads a bunch of files from the network, stores them in a sub-directory called ./network-files/, then removes all the files it downloaded. It also logs the result to several log files in ./logs/.
I want to log the filenames of each file deleted.
Currently, I'm doing this:
echo -e "$(date -u) >>> Removing files: $(ls -1 "$base_directory"/network-files/* | tr '\n' ' ')" | tee -a $network_files_log $verbose_log $network_log
($base_directory is a variable defining the base directory for the app, $network_files_log etc are variables defining the location of various log files)
This produces some pretty grody and unreadable output:
Tue Jun 21 04:55:46 UTC 2016 >>> Removing files: /home/vagrant/load-simulator/network-files/207822218.png /home/vagrant/load-simulator/network-files/217311040.png /home/vagrant/load-simulator/network-files/442119100.png /home/vagrant/load-simulator/network-files/464324101.png /home/vagrant/load-simulator/network-files/525787337.png /home/vagrant/load-simulator/network-files/581100197.png /home/vagrant/load-simulator/network-files/640387393.png /home/vagrant/load-simulator/network-files/650797708.png /home/vagrant/load-simulator/network-files/827538696.png /home/vagrant/load-simulator/network-files/833069509.png /home/vagrant/load-simulator/network-files/8580204.png /home/vagrant/load-simulator/network-files/858174053.png /home/vagrant/load-simulator/network-files/998266826.png
Any good way to strip out the /home/vagrant/load-simulator/network-files/ part from each of those file paths? I suspect there's something I should be doing with sed or grep, but haven't had any luck so far.
You might also consider using find. It's perfect for walking directories, removing files and producing customized output with -printf:
find $PWD/x -type f -printf "%f\n" -delete >>$YourLogFile.log
Don't use ls at all; use a glob to populate an array with the desired files. You can then use parameter expansion to shorten each array element.
d=$base_directory/network-files
files=( "$d"/* )
printf '%s >>> Removing files: %s\n' "$(date -u)" "${files[*]#$d/}" | tee ...
You could do it a couple of ways. To directly answer the question, you could use sed to do it with the substitution command like:
echo -e "$(date -u) >>> Removing files: $(ls -1 "$base_directory"/network-files/* | tr '\n' ' ')" | sed -e "s,$base_directory/network-files/,," | tee -a $network_files_log $verbose_log $network_log
which adds sed -e "s,$base_directory/network-files/,," to the pipeline. It will substitute the string found in base_directory with the empty string, so long as there are not any commas in base_directory. If there are you could try a different separator for the parts of the sed command, like underscore: sed -e "s_$base_directory/network-files__"
Instead though, you could just have the subshell cd to that directory and then the string wouldn't be there in the first place:
echo -e "$(date -u) >>> Removing files: $(cd "$base_directory/network-files/"; ls -1 | tr '\n' ' ')" | tee -a "$network_files_log" "$verbose_log" "$network_log"
Or you could avoid some potential pitfalls with echo and use printf like
{ printf '%s >>>Removing files: '; printf '%s ' "$(cd "$base_directory/network-files"; ls -1)"; printf '\n'; } | tee -a ...
testdata="/home/vagrant/load-simulator/network-files/207822218.png /home/vagrant/load-simulator/network-files/217311040.png"
echo -e $testdata | sed -e 's/\/[^ ]*\///g'
Pipe your output to sed and replace the matched pattern with nothing.
The regex: \/[^ ]*\/
It starts at a /, matches everything that is not a space, and ends at the last / before the filename.
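You can verify the pattern quickly (same sample data as above, with echo -e swapped for printf to avoid its portability quirks):

```shell
testdata="/home/vagrant/load-simulator/network-files/207822218.png /home/vagrant/load-simulator/network-files/217311040.png"
printf '%s\n' "$testdata" | sed -e 's/\/[^ ]*\///g'
# -> 207822218.png 217311040.png
```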

Using cut on stdout with tabs

I have a file which contains one line of text with tabs
echo -e "foo\tbar\tfoo2\nx\ty\tz" > file.txt
I'd like to get the first column with cut. It works if I do
$ cut -f 1 file.txt
foo
x
But if I read it in a bash script
while read line
do
new_name=`echo -e $line | cut -f 1`
echo -e "$new_name"
done < file.txt
Then I get instead
foo bar foo2
x y z
What am I doing wrong?
/edit: My script looks like that right now
while IFS=$'\t' read word definition
do
clean_word=`echo -e "$word" | external-command`
echo -e "$clean_word\t<b>$word</b><br>$definition" >> $2
done < $1
External command removes diacritics from a Greek word. Can the script be optimized any further without changing external-command?
What is happening is that you did not quote $line when reading the file. Without the quotes the original tab-delimited format is lost: spaces, not tabs, end up between the words. And since cut's default delimiter is a Tab, it finds none and prints the whole line.
So quoting works:
while read line
do
new_name=`echo -e "$line" | cut -f 1`
#----------------^^^^^^^
echo -e "$new_name"
done < file.txt
Note, however, that you could have used IFS to set the tab as field separator and read more than one parameter at a time:
while IFS=$'\t' read name rest;
do
echo "$name"
done < file.txt
returning:
foo
x
And, again, note that awk is even faster for this purpose:
$ awk -F"\t" '{print $1}' file.txt
foo
x
So, unless you want to call some external command while looping the file, awk (or sed) is better.
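For completeness, a sed equivalent that deletes everything from the first Tab onward (GNU sed understands \t in the pattern; with other seds, type a literal Tab instead):

```shell
printf 'foo\tbar\tfoo2\nx\ty\tz\n' > file.txt
sed 's/\t.*//' file.txt
# -> foo
#    x
```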

How to delete all text on a line appearing after a particular symbol?

I have a file, file1.txt, like this:
This is some text.
This is some more text. ② This is a note.
This is yet some more text.
I need to delete any text appearing after "②", including the "②" and any single space appearing immediately before, if such a space is present. E.g., the above file would become file2.txt:
This is some text.
This is some more text.
This is yet some more text.
How can I delete the "②", anything coming after, and any preceding single space?
The solutions at How can I remove all text after a character in bash? do not seem to work, perhaps because "②" is not an ordinary character.
The file is saved in UTF-8.
A Perl solution:
$ perl -CS -i~ -p -E's/ ②.*//' file1.txt
You'll end up with the correct data in file1.txt and a backup of the original file in file1.txt~.
I hope you realize that most classic Unix utilities do not work with Unicode. I assume your input is in UTF-8; if not, you will have to adjust accordingly.
#!/bin/bash
function px {
local a="$1"
local i=0
while [ $i -lt ${#a} ]
do
printf \\x${a:$i:2}
i=$(($i+2))
done
}
(iconv -f UTF8 -t UTF16 | od -x | cut -b 9- | xargs -n 1) |
if read utf16header
then
echo -e $utf16header
out=''
while read line
do
if [ "$line" == "000a" ]
then
out="$out $line"
echo -e $out
out=''
else
out="$out $line"
fi
done
if [ "$out" != '' ] ; then
echo -e $out
fi
fi |
(perl -pe 's/( 0020)* 2461 .*$/ 000a/;s/ *//g') |
while read line
do
px $line
done | (iconv -f UTF16 -t UTF8 )
sed -e "s/[[:space:]]②[^\.]*\.//"
However, I am not sure that the ② symbol is parsed correctly. Maybe you have to use its UTF-8 byte codes or something like that.
Try this:
sed -e '/②/ s/[ ]*②.*$//'
/②/ looks only at the lines containing the magic symbol;
[ ]* matches any number (including none) of spaces before the magic symbol;
.*$ matches everything else up to the end of the line.
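Applied to the sample file (the file must be UTF-8, as in the question):

```shell
cat > file1.txt <<'EOF'
This is some text.
This is some more text. ② This is a note.
This is yet some more text.
EOF

sed -e '/②/ s/[ ]*②.*$//' file1.txt > file2.txt
cat file2.txt
# -> This is some text.
#    This is some more text.
#    This is yet some more text.
```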
