Merging, then splitting files - bash

Using a for loop, I can merge all of the files in a directory that end with *.txt:
for filename in *.txt; do
cat "${filename}"
echo
done > output.txt
After doing this, I will run output.txt through various scripts, in which the text will be changed considerably. After that, I want to split the files, at the same places at which they were merged, into different files (output01.txt, output02.txt, etc.).
How can I split the files at the same place they were merged?
This cannot be based on line number, because the scripts will add \t in places.
I think a solution that might work is to place "#########" at the end of each of the initial *.txt files before merging them, but I don't know how to get BASH to split the files again at that mark.

Instead of that for loop for concatenating, you can just use cat *.txt.
Anyway, why don't you just perform the scripts on each file independently within the for loop?
If you really want to combine and re-segregate, you can use:
for filename in *.txt; do
cat "${filename}"
echo "#####"
done > output.txt
# Pass output.txt through whatever
awk 'BEGIN { fileno = 1; file = sprintf("output%02d.txt", fileno) }
{
  if ($1 ~ /#####/) {
    fileno++
    file = sprintf("output%02d.txt", fileno)
    next
  }
  else print > file
}' output.txt

The canonical answer would be:
tar cf - *.txt > output.txt
You could split/unmerge them exactly by doing
tar xf output.txt # in the current directory
tar x -C /tmp/splitfiles/ -f output.txt # or extract into another directory
Now if you really want to do stuff like that in a loop and extract to stdout/a pipe, you could:
while read -r fname
do
# extract the named member to stdout and pipe it into your program
tar -xOf output.txt "$fname" | myprogram "$fname"
done < <(tar tf output.txt)
However, that would possibly not be very efficient. You could consider just doing
while read -r fname
do
# handle extracted file
myprogram "/tmp/splitfiles/$fname"
unlink "/tmp/splitfiles/$fname" # drop the temp file
done < <(tar x -v -C /tmp/splitfiles/ -f output.txt)
This runs the extraction and the per-file processing concurrently (so if extraction, or even the transmission of the archive, is slow, the first files can already be processed while waiting for more data to arrive).
See also my other answer https://stackoverflow.com/a/8341221/85371 (look for the older answer part, since that question was changed to be very specific later)

As Fredrik wrote here, you can use csplit to split your merged file.
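For example, a minimal sketch assuming the "#####" marker from the loop above and GNU csplit (the --suppress-matched and --elide-empty-files options are GNU extensions):
csplit --suppress-matched --elide-empty-files --prefix=output --suffix-format='%02d.txt' output.txt '/#####/' '{*}'
This drops the marker lines and skips the empty final piece; note that the numbering starts at output00.txt rather than output01.txt.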

Related

Writing a Bash script that takes a text file as input and pipes the text file through several commands

I keep text files with definitions in a folder. I like to convert them to spoken word so I can listen to them. I already do this manually by running a few commands to insert some pre-processing codes into the text files and then convert the text to spoken word like so:
sed 's/\..*$/[[slnc 2000]]/' input.txt # inserts a control code after the first period
sed 's/$/[[slnc 2000]]/' input.txt # inserts a control code at the end of each line
cat input.txt | say -v Alex -o input.aiff
Instead of having to retype these each time, I would like to create a Bash script that pipes the output of these commands to the final product. I want to call the script with the script name, followed by an input file argument for the text file. I want to preserve the original text file so that if I open it again, none of the control codes are actually inserted, as the only purpose of the control codes is to insert pauses in the audio file.
I've tried writing
#!/bin/bash
FILE=$1
sed 's/$/ [[slnc 2000]]/' FILE -o FILE
But I get hung up immediately as it says sed: -o: No such file or directory. Can anyone help out?
If you just want to use foo.txt to generate foo.aiff with control characters, you can do:
#!/bin/sh
for file; do
test "${file%.txt}" = "${file}" && continue
sed -e 's/\..*$/[[slnc 2000]]/' "$file" |
sed -e 's/$/[[slnc 2000]]/' |
say -v Alex -o "${file%.txt}".aiff
done
Call the script with your .txt files as arguments (eg, ./myscript *.txt) and it will generate the .aiff files. Be warned, if say overwrites files, then this will as well. You don't really need two sed invocations, and the sed that you're calling can be cleaned up, but I don't want to distract from the core issue here, so I'm leaving that as you have it.
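For what it's worth, a hedged sketch of that single-invocation cleanup, using the same two expressions inside the same loop (so "$file" comes from the surrounding for):
sed -e 's/\..*$/[[slnc 2000]]/' -e 's/$/[[slnc 2000]]/' "$file" | say -v Alex -o "${file%.txt}".aiff
The two -e expressions are applied in order to each line, so the output matches the piped version above.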
This will:
a) Make a list of your text files to process in the current directory, with find.
b) Apply your sed commands to each text file in the list, but only for the current use, allowing you to preserve them intact.
c) Call "say" with the edited files.
I don't have say, so I can't test that or the control codes; but as long as you have Ed, the loop works. I've used it many times. I learned it as a result of exposure to FORTH, which is a language that still permits unterminated loops. I used to have problems with remembering to invoke next at the end of the script in order to start it, but I got over that by defining my words (functions) first, in FORTH style, and then always placing my single-use commands at the end.
#!/bin/sh
next() {
[ -s stack ] && main # keep going while the stack file still has entries
end
}
main() {
line=$(ed -s stack < edprint+.txt) # read the first filename on the stack
infile=$(sed 's/\..*$/[[slnc 2000]]/' "${line}" | sed 's/$/[[slnc 2000]]/')
say -v Alex -o input.aiff "${infile}" # note: each pass rewrites input.aiff
ed -s stack < edpop+.txt # pop that filename off the stack
next
}
end() {
rm -v ./stack
rm -v ./edprint+.txt
rm -v ./edpop+.txt
exit 0
}
find *.txt -type f > stack # build the stack of filenames to process
# ed script that prints line 1 of the stack
cat >> edprint+.txt << EOF
1
q
EOF
# ed script that deletes line 1 of the stack and saves
cat >> edpop+.txt << EOF
1d
wq
EOF
next

Script to copy directory of filenames to .txt file

This should be fairly easy, and I understand the logic of it, but my shell scripting is rather beginner-level.
Basically, I have a directory with a hundred files or so, and I want to copy their filenames to a .txt file, one line per filename. I know I'd want a loop over all the files in the directory that copies each name to the text file and repeats until there are no more files, but I'm not sure how to write that out in a .sh file.
(Also, just out of pure curiosity, how would I omit the file extensions? In this case they're all the same extension, but potentially in the future they may not be, and while I need the extensions right now I may not later. I'm assuming there might be a flag for this, or would I use '.' as a delimiter to stop copying at that point?)
Thanks in advance!
It could be very easy with ls:
ls -1 [directory] > filename.txt
Note the -1 flag: it tells ls to output filenames one per line regardless of where the output goes. Usually ls acts like ls -C if stdout is a tty and like ls -1 otherwise; specifying -1 explicitly forces one filename per line in every case.
If you want to do it manually, this is an example:
#!/bin/sh
cd [directory]
for i in *
do
echo "$i"
done > filename.txt
To omit extensions, you can use string replacement:
echo "${i%.*}"
For the first part, you can do
ls <dirname> > files.txt
I alias ls to ls -F, so to avoid any extraneous characters in the output, you would do
printf "%s\n" * > ../filename.txt
I put the output txt file in a different directory so the list of files does not include "filename.txt"
If you want to omit file extensions:
printf "%s\n" * | sed 's/\.[^.]*$//' > ../filename.txt

Script that lists all file names in a folder, along with some text after each name, into a txt file

I need to create a text file that lists all the files in a folder, each followed by a comma and the number 15. For example:
My folder has video.mp4, video2.mp4, picture1.jpg, picture2.jpg, picture3.png
I need the text file to read as follows:
video.mp4,15
video2.mp4,15
picture1.jpg,15
picture2.jpg,15
picture3.png,15
No spaces, just filename.ext,15 on each line. I am using a Raspberry Pi. I am aware that the command ls > filename.txt would put all the file names into a file, but how would I get a ,15 after every line?
Thanks
bash one-liner:
for f in *; do echo "$f,15" >> filename.txt; done
To avoid opening the output file on each iteration, you can redirect the entire loop's output with > filename.txt instead:
for f in *; do echo "$f,15"; done > filename.txt
$ printf '%s,15\n' *
picture1.jpg,15
picture2.jpg,15
picture3.png,15
video.mp4,15
video2.mp4,15
This will work if those are the only files in the directory. The format specifier %s,15\n will be applied to each of printf's arguments (the names in the current directory), so each name is printed with ,15 appended, followed by a newline.
If there are other files, you can instead list the names explicitly; the following works regardless of whether files with those names actually exist:
$ printf '%s,15\n' video.mp4 video2.mp4 picture1.jpg picture2.jpg "whatever this is"
video.mp4,15
video2.mp4,15
picture1.jpg,15
picture2.jpg,15
whatever this is,15
Or, on all MP4, PNG and JPEG files:
$ printf '%s,15\n' *.mp4 *.jpg *.png
video.mp4,15
video2.mp4,15
picture1.jpg,15
picture2.jpg,15
picture3.png,15
Then redirect this to a file with printf ...as above... >output.txt.
If you're using Bash, then this will not make use of any external utility, as printf is built into the shell.
You need to do something like this:
#!/bin/bash
for i in $(ls folder_name); do
echo $i",15" >> filename.txt;
done
It's possible to do this in one line; however, if you want to create a script, consider code readability in the long run.
Edit 1: better solution
As @CristianRamon-Cortes suggested in the comments below, you should not rely on the output of ls because of the problems explained in this discussion: why not parse ls. As such, here's how you should write the script instead:
#!/bin/bash
cd folder_name
for i in *; do
echo $i",15" >> filename.txt;
done
You can skip the part cd folder_name if you are already in the folder.
Edit 2: Enhanced solution:
As suggested by @kusalananda, it's better to do the redirection after done, to avoid opening the file in each iteration of the for loop, so the script will look like this:
#!/bin/bash
cd folder_name
for i in *; do
echo $i",15";
done > filename.txt
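If GNU find is available, a hedged loop-free alternative in the same spirit (it also avoids parsing ls and copes with arbitrary filenames; -printf and %f are GNU extensions):
find folder_name -maxdepth 1 -type f -printf '%f,15\n' > filename.txt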
Just one command line, using two msr commands to recursively (-r) search specific files:
msr -rp your-dir1,dir2,dirN -l -f "\.(mp4|jpg|png)$" -PAC | msr -t .+ -o '$0,15' -PIC > save-file.txt
If you want to sort by time, add --wt to the first command, like: msr --wt -l -rp your-dirs
To sort by size, add --sz; but only the prior one is effective if you use both --sz and --wt.
If you want to exclude some directories, add something like: --nd "^(test|garbage)$"
To remove trailing \r\n in save-file.txt: msr -p save-file.txt -S -t "\s+$" -o "" -R
See msr.exe / msr.gcc48 etc. in the tools directory of my open project https://github.com/qualiu/msr.
A solution without a loop:
ls | xargs -I{} echo {},15 > filename.txt

Split large gzip file while adding header line to each split

I want to automate the process of splitting a large gzip file into smaller gzip files, each split containing 10000000 lines (the last split will hold whatever is left over and will have fewer than 10000000 lines).
Here is how I am doing it at the moment; I repeat these commands manually, calculating the number of leftover lines each time.
gunzip -c large_gzip_file.txt.gz | tail -n +10000001 | head -n 10000000 > split1_.txt
gzip split1_.txt
gunzip -c large_gzip_file.txt.gz | tail -n +20000001 | head -n 10000000 > split2_.txt
gzip split2_.txt
I continue repeating this, as shown, all the way to the end. Then I open each split and manually add the header line. How can this be automated?
I searched online and saw awk and other solutions, but didn't see anything for gzip or for a scenario similar to this one.
I would approach it like this:
gunzip the file
use head to get the first line and save it off to another file
use tail to get the rest of the file and pipe it to split to produce files of 10,000,000 lines each
use sed to insert the header into each file, or just cat the header with each file
gzip each file
You'll want to wrap this in a script or a function to make it easier to rerun at a later time. Here's an attempt at a solution, lightly tested:
#!/bin/bash
set -euo pipefail
LINES=10000000
file=$(basename "$1" .gz)
gunzip -k "${file}.gz"
head -n 1 "$file" >header.txt
tail -n +2 "$file" | split -l "$LINES" - "${file}.part."
rm -f "$file"
for part in ${file}.part.* ; do
[[ $part == *.gz ]] && continue # ignore partial results of previous runs
gzip -c header.txt "$part" >"${part}.gz"
rm -f "$part"
done
rm -f header.txt
To use:
$ ./splitter.sh large_gzip_file.txt.gz
I would further improve this by using a temporary directory (mktemp -d) for the intermediate files and ensuring the script cleans up after itself at exit (with a trap). Ideally it would also sanity check the arguments, possibly accepting a second argument indicating the number of lines per part, and inspect the contents of the current directory to ensure it doesn't clobber any preexisting files.
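A hedged skeleton of that temporary-directory idea (only the scaffolding is shown; the header/split/gzip steps from the script above would go where the comment sits):
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
# ...write header.txt and the ${file}.part.* pieces under "$tmpdir" instead of the current directory...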
I don't think awk is the right tool for splitting a gzip file into smaller pieces; it's for text processing. Below is my way to solve your issue; hope it helps:
step1:
gunzip -c large_gzip_file.txt.gz | split -l 10000000 - split_file_
The split command can split a file into pieces; you can specify the size of each piece and also provide a prefix for the output names.
The large gzip file will be split into multiple files with the name prefix split_file_.
step2:
Save the header content into the file header_file.csv.
step3:
for f in split_file*; do
cat header_file.csv "$f" > "$f.new"
mv "$f.new" "$f"
done
Here I assume you are working in the directory containing the split files; if not, replace split_file* with the absolute path, for example /path/to/split_file*. This iterates over all files matching the name pattern split_file* and adds the header content to the beginning of each matching file.
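Since the original goal was smaller gzip files, a hedged extension of the same loop that recompresses each piece after prepending the header (assuming the split_file* pieces and header_file.csv from the steps above):
for f in split_file*; do
cat header_file.csv "$f" | gzip > "${f}.gz"
rm "$f"
done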

Write a shell script that replaces multiple strings in multiple files

I need to search through many files in a directory for a list of keywords and add a prefix to all of them. For example, if various files in my directory contained the terms foo, bar, and baz, I would need to change all instances of these terms to: prefix_foo, prefix_bar, and prefix_baz.
I'd like to write a shell script to do this so I can avoid doing the search one keyword at a time in SublimeText (there are a lot of them). Unfortunately, my shell-fu is not that strong.
So far, following this advice, I have created a file called "replace.sed" with all of the terms formatted like this:
s/foo/prefix_foo/g
s/bar/prefix_bar/g
s/baz/prefix_baz/g
The terminal command it suggests to use with this list is:
sed -f replace.sed < old.txt > new.txt
I was able to adapt this to replace instances within the file (instead of creating a new file) by setting up the following script, which I called inline.sh:
#!/bin/sh -e
in=${1?No input file specified}
mv $in ${bak=.$in.bak}
shift
"$#" < $bak > $in
Putting it all together, I ended up with this command:
~/inline.sh old.txt sed -f replace.sed
I tried this and it works, for one file at a time. How would I adapt this to search and replace through all of the files in my entire directory?
for f in *; do
[[ -f "$f" ]] && ~/inline.sh "$f" sed -f ~/replace.sed
done
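If your sed supports in-place editing, a hedged alternative that skips the inline.sh wrapper entirely (shown with GNU sed; BSD/macOS sed needs -i '' instead of -i):
for f in *; do
[[ -f "$f" ]] && sed -i -f ~/replace.sed "$f"
done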
In a script:
#!/bin/bash
files=$(ls -1 your_directory | egrep keyword)
for i in $files; do
cp "$i" "prefix_$i"
done
This will, of course, leave the originals where they are.
