Concatenate specific number of files - bash

I have a bunch of files named uv_set_XXXXXXXX where the 8 Xs stand for the usual year, month and day format. Imagine I have 325 files of this type. I would like to concatenate them in groups of 50 files, so in the end I have 7 files (6 files of 50 and 1 of 25).
I have been thinking of using cat, but I can't see an option to select a number of files from a list. I could do this with Python, but I'm just wondering if some Unix command-line utility does it more directly.
Thanks.

With GNU parallel you can use the following command
parallel -n50 "cat {} > out{#}" ::: uv_set_*
This will merge the first 50 files into out1, the next 50 files into out2, and so on.

I would just break down and do this in Awk.
awk 'FNR==1 && (++i%50 == 1) {
    if (NR>1) close(p);
    p = "dest_" ++j }
  { print >p }' uv_set_????????
This creates files dest_1 through dest_7, the first 6 with 50 files in each and the last with the remainder.
Closing the previous file is necessary because the system only allows Awk to have a limited number of open file handles (though the limit is typically higher than 7 so it's probably not important in your example).
Thinking out loud dept, just to prevent anyone else from wasting time on repeating this dead end.
You could use xargs -L 50 cat to concatenate 50 files at a time, but there is no simple way to pass in a new redirection for standard output for each invocation. You could try to hack your way around that with something like
# XXX Do not use: incomplete
printf '%s\n' uv_set_???????? |
xargs -L 50 sh -c 'cat "$@" > ... something' _
but I can't come up with an elegant way to have a different something each time.
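For what it's worth, here is a hedged sketch of a different route around the per-batch redirection problem, letting GNU split drive the batching instead of xargs (assumes GNU coreutils split with --filter, and whitespace-free names like these):
# Sketch only: split batches the name list 50 lines at a time and runs the
# filter once per batch; $FILE is that batch's output name (outaa, outab, ...).
printf '%s\n' uv_set_???????? |
split -l 50 --filter='xargs cat > "$FILE"' - out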

Related

Using Bash to concatenate the first 10 bytes of many files

I am trying to write a simple Bash loop to concatenate the first 10 bytes of all the files in a directory.
So far, I have the code block:
for filename in /content/*.bin;
do
cat -- (`head --bytes 10 $filename`) > "file$i.combined"
done
However, the syntax is clearly incorrect here. I know the inner command:
head --bytes 10 $filename
...returns what I need; the first 10 bytes of the passed filename. And when I use:
cat -- $filename > "file$i.combined"
...the code works, only it concats the entire file contents.
How can I combine the two functions so that my loop concatenates the first 10 bytes of all the looped files?
The loop will do the concatenation for you (or rather, the outputs of the sequential executions of head are written, in order, to the standard output of the loop itself); you don't need to use cat.
for filename in /content/*.bin; do
head --bytes 10 "$filename"
done > "file$i.combined"
With head from GNU coreutils:
head -qc 10 /content/*.bin > combined
For the very unlikely case you run into problems with ARG_MAX (either because of very long filenames, or a huge number of files) you can use
find /content -maxdepth 1 -name '*.bin' -exec head -qc 10 {} + > combined
You should use chepner's answer, but to show how you could use cat...
Here we have a recursive function that rolls everything up into a single copy of cat.
catHeads() {
local first="$1"; shift || return
case $# in
0) head --bytes 10 "$first" ;;
1) cat <(head --bytes 10 "$first") <(head --bytes 10 "$1");;
*) cat <(head --bytes 10 "$first") <(catHeads "$@");;
esac
}
catHeads /content/*.bin >"file$i.combined"
Don't ever actually do this. Use chepner's answer (or Socowi's) instead.

bash: split ascii file into n parts; iterate over ONLY those files

I have an ASCII file of a few thousand lines, processed one line at a time by a bash script. Because the processing is embarrassingly parallel, I'd like to split the file into parts of roughly the same size, preserving line breaks, one part per CPU core. Unfortunately the file suffixes made by split -n r/numberOfCores aren't easily iterated over.
split --numeric-suffixes=1 -n r/42 ... makes files foo.01, foo.02, ..., foo.42, which can be iterated over with for i in `seq -w 1 42` (because -w adds a leading zero). But if the 42 changes to something smaller than 10, the files still have the leading zero but the seq output doesn't, so it fails. This concern is valid, because nowadays some PCs have fewer than 10 cores and some have more. A ghastly workaround:
[[ $numOfCores -lt 10 ]] && optionForSeq="" || optionForSeq="-w"
The naive solution for f in foo.* is risky: the wildcard might match files other than the ones that split made.
An ugly way to make the suffixes seq-friendly, but with the same risk:
split -n r/numOfCores infile foo.
for i in `seq 1 $numOfCores`; do
mv `ls foo.* | head -1` newPrefix.$i
done
for i in `seq 1 $numOfCores`; do
... newPrefix.$i ...
done
Is there a cleaner, robust way of splitting the file into n parts, where 1<=n<=64 isn't known until runtime, and then iterating over those parts? Or should I split only into a freshly created directory?
(Edit: To clarify "if the 42 changes to something smaller than 10," the same code should work on a PC with 8 cores and on another PC with 42 cores.)
A seq-based solution is clunky. A wildcard-based solution is risky. Is there an alternative to split? (csplit with line numbers would be even clunkier.) A gawk one-liner?
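One hedged sketch of the fresh-directory idea floated above (GNU split assumed; process_part stands in for whatever handles one part):
# Split into a directory created just now, so the wildcard can only ever
# match the parts that split made.
workdir=$(mktemp -d) || exit 1
split -n r/"$numOfCores" infile "$workdir/part."
for f in "$workdir"/part.*; do
    process_part "$f" &
done
wait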
How about using a format string with seq?
$ seq -f '%02g' 1 4
01
02
03
04
$ seq -f '%02g' 1 12
01
02
03
...
09
10
11
12
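A hedged sketch of how that plugs into the original split invocation (GNU split and seq assumed; process_part is a placeholder; the same code works for 8 or 42 cores as long as numOfCores stays below 100):
# Two-digit suffixes on both sides, so the names and the loop always match.
split --numeric-suffixes=1 -n r/"$numOfCores" infile foo.
for i in $(seq -f '%02g' 1 "$numOfCores"); do
    process_part foo."$i"
done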
With GNU bash 4:
Use printf to format your numbers:
for ((i=1;i<=4;i++)); do printf -v num "%02d" $i; echo "$num"; done
Output:
01
02
03
04
Are you sure this is not a job for GNU Parallel?
cat file | parallel --pipe -N1 myscript_that_reads_one_line_from_stdin
This way you do not need to have the temporary files at all.
If your script can read more than one line (so it is in practice a UNIX filter), then this should be very close to optimal:
parallel --pipepart -k --roundrobin -a file myscript_that_reads_from_stdin
It will spawn one job per core and split file into one part per core on the fly. If some lines are harder to process than others (i.e. you can get "stuck" for a while on a single line), then this solution might be better:
parallel --pipepart -k -a file myscript_that_reads_from_stdin
It will spawn one job per core and split file into 10 parts per core on the fly, thus running on average 10 jobs per core in total.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
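As a generic illustration (not from the original answer; run_one_job.sh and the input names are placeholders):
# One job per core by default; a new job is started as soon as one finishes.
parallel ./run_one_job.sh ::: input_*.dat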
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Get the filenames with ls and then use a regex:
for n in $(ls foo.* | grep "^foo\.[0-9][0-9]*$"); do
    ... "$n" ...
done

How to use the awk command in a loop to produce several thinned data files

I have several large data files with 8 columns and 120,000 rows. Now I want to keep 1 line every 200 lines starting from the 100th line. I have the script file thin.sh as:
awk '(NR%200==100)' original_file > thinned_file
However, now I have 30 original files, which means I would have to revise the command 30 times, and the original files have similar names:
data.0000.dat, data.0001.dat data.0002.dat, ..., data.0029.dat
I suppose there must be some way to embed the awk command into a loop to accomplish my goal, maybe something like:
for(i=0;i<30;i++);
do
awk '(NR%200==100)' data.$i.dat > data.$i_thinned.dat
done
But I realize there are two leading zeros ('00') in front of $i in the file names. Can I use sprintf("%s") or something? If so, how do I arrange the order of awk and sprintf?
I use Ubuntu and bash.
With seq:
for i in $(seq -f %04g 0 29); do
awk 'NR % 200 == 100' "data.${i}.dat" > "data.${i}_thinned.dat"
done
Alternatively with bash:
for i in {0000..0029}; do    # same loop body as above
The quotes are not strictly necessary in the first snippet because we know $i does not contain anything nefarious, but it's better to be paranoid about expansion in shell scripts. The braces in "data.${i}_thinned.dat" are necessary so the shell doesn't look for a variable $i_thinned to use. They are not strictly necessary in "data.${i}.dat" because shell variable names cannot have . in them, but consistency is nice.
All you need is:
awk 'FNR==1{close(out); out=FILENAME; sub(/\.dat/,"_thinned&",out)} (FNR%200==100){print > out}' data.[0-9][0-9][0-9][0-9].dat
I used data.[0-9][0-9][0-9][0-9].dat as the file name globbing pattern instead of data.*.dat in case you rerun the script in the same dir where you previously generated all of the "_thinned" files.
Ingredients (GAWK)
1 FNR - The record number in the current file.
1 match - Matches a regex string and can capture groups into an array.
1 print - Prints the following data (if none is provided, it defaults to the current record).
1 *.dat - All files ending with .dat in the current directory.
Instructions
In the condition block, check whether the current record number in the current file, when divided by 200, leaves a remainder of 100.
If it does, then run the next block {..}.
Take the current file name and match up to the last dot, capturing everything before it with (.*) into array a.
Print into a file named after the captured text a[1] with the extension _thinned.dat.
Finally, add *.dat to the end to read all .dat files in the current directory.
Resulting code
gawk '(FNR%200==100){match(FILENAME,/(.*)\./,a);print >(a[1]"_thinned.dat")}' *.dat

How to remove the path part from a list of files and copy it into another file?

I need to accomplish the following things with bash scripting in FreeBSD:
Create a directory.
Generate 1000 unique files whose names are taken from other random files in the system.
Each file must contain information about the original file whose name it has taken - name and size without the original contents of the file.
The script must show information about the speed of its execution in ms.
What I could accomplish was to take the names and paths of 1000 unique files with the commands find and grep and put them in a list. Then I just can't imagine how to remove the path part and create the files in the other directory with names taken from the list of random files. I tried a for loop with the basename command in it, but somehow I can't get it to work, and I don't know how to do the other tasks either...
[Update: I've wanted to come back to this question to try to make my response more useful and portable across platforms (OS X is a Unix!) and $SHELLs, even though the original question specified bash and zsh. Other responses assumed a temporary file listing of "random" file names since the question did not show how the list was constructed or how the selection was made. I show one method for constructing the list in my response using a temporary file. I'm not sure how one could randomize the find operation "inline" and hope someone else can show how this might be done (portably). I also hope this attracts some comments and critique: you never can know too many $SHELL tricks. I removed the perl reference, but I hereby challenge myself to do this again in perl and - because perl is pretty portable - make it run on Windows. I will wait a while for comments and then shorten and clean up this answer. Thanks.]
Creating the file listing
You can do a lot with GNU find(1). The following would create a single listing with three tab-separated columns of the data you want (name of file, location, size in kilobytes).
find / -type f -fprintf tmp.txt '%f\t%h/%f\t%k \n'
I'm assuming that you want to be random across all filenames (i.e. no links) so you'll grab the entries from the whole file system. I have 800000 files on my workstation but a lot of RAM, so this doesn't take too long to do. My laptop has ~ 300K files and not much memory, but creating the complete listing still only took a couple minutes or so. You'll want to adjust by excluding or pruning certain directories from the search.
A nice thing about the -fprintf flag is that it seems to take care of spaces in file names. By examining the file with vim and sed (i.e. looking for lines with spaces) and comparing the output of wc -l and uniq you can get a sense of your output and whether the resulting listing is sane or not. You could then pipe this through cut, grep or sed, awk and friends in order to create the files in the way you want. For example from the shell prompt:
~/# touch `cat tmp.txt |cut -f1`
~/# for i in `cat tmp.txt|cut -f1`; do cat tmp.txt | grep $i > $i.dat ; done
I'm giving the files we create a .dat extension here to distinguish them from the files to which they refer, and to make it easier to move them around or delete them; you don't have to do that: just leave off the extension ($i > $i).
The bad thing about the -fprintf flag is that it is only available with GNU find and is not a POSIX standard flag, so it won't be available on OS X or BSD find(1) (though GNU find may be installed on your Unix as gfind or gnufind). A more portable way to do this is to create a straight-up list of files with find / -type f > tmp.txt (this takes about 15 seconds on my system with 800k files and many slow drives in a ZFS pool; coming up with something more efficient should be easy for people to do in the comments!). From there you can create the data values you want using standard utilities to process the file listing, as Florin Stingaciu shows in his answer.
#!/bin/sh
# portably get a random number (OS X, BSD, Linux and $SHELLs w/o $RANDOM)
randnum=`od -An -N 4 -D < /dev/urandom` ; echo $randnum
for file in `cat tmp.txt`
do
name=`basename $file`
size=`wc -c $file |awk '{print $1}'`
# Uncomment the next line to see the values on STDOUT
# printf "Location: $name \nSize: $size \n"
# Uncomment the next line to put data into the respective .dat files
# printf "Location: $file \nSize: $size \n" > $name.dat
done
# vim: ft=sh
If you've been following along this far you'll realize that this will create a lot of files: on my workstation it would create 800k .dat files, which is not what we want! So, how do we randomly select 1000 files from our listing of 800k for processing? There are several ways to go about it.
Randomly selecting from the file listing
We have a listing of all the files on the system (!). Now in order to select 1000 files we just need to randomly select 1000 lines from our listing file (tmp.txt). We can set an upper limit of the line number to select by generating a random number using the cool od technique you saw above - it's so cool and cross-platform that I have this aliased in my shell ;-) - then performing modulo division (%) on it using the number of lines in the file as the divisor. Then we just take that number and select the line in the file to which it corresponds with awk or sed (e.g. sed -n <$RANDOMNUMBER>p filelist), iterate 1000 times and presto! We have a new list of 1000 random files. Or not ... it's really slow! While looking for a way to speed up awk and sed I came across an excellent trick using dd from Alex Lines that searches the file by bytes (instead of lines) and translates the result into a line using sed or awk.
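For reference, here is a sketch of that slow-but-simple sed variant as just described (assumes the tmp.txt listing from above; the dd-based script below is the faster replacement):
# Sketch only: pick a random line number between 1 and $lines and print
# that line with sed, 1000 times. Slow, but it shows the idea.
lines=$(wc -l < tmp.txt)
i=1
while [ $i -le 1000 ]
do
    n=$(( $(od -An -N 4 -D < /dev/urandom) % lines + 1 ))
    sed -n "${n}p" tmp.txt >> randlist.txt
    true $((i=i+1))
done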
See Alex's blog for the details. My only problems with his technique came with setting the count= switch to a high enough number. For mysterious reasons (which I hope someone will explain), perhaps because my locale is LC_ALL=en_US.UTF-8, dd would spit incomplete lines into randlist.txt unless I set count= to a much higher number than the actual maximum line length. I think I was probably mixing up characters and bytes. Any explanations?
So after the above caveats and hoping it works on more than two platforms, here's my attempt at solving the problem:
#!/bin/sh
IFS='
'
# We create tmp.txt with
# find / -type f > tmp.txt # tweak as needed.
#
files="tmp.txt"
# Get the number of lines and maximum line length for later
bytesize=`wc -c < $files`
# wc -L is not POSIX and we need to multiply so:
linelenx10=`awk '{if(length > x) {x=length; y = $0} }END{print x*10}' $files`
# A function to generate a random number modulo the
# number of bytes in the file. We'll use this to find a
# random location in our file where we can grab a line
# using dd and sed.
genrand () {
echo `od -An -N 4 -D < /dev/urandom` ' % ' $bytesize | bc
}
rm -f randlist.txt
i=1
while [ $i -le 1000 ]
do
# This probably works but is way too slow: sed -n `genrand`p $files
# Instead, use Alex Lines' dd seek method:
dd if=$files skip=`genrand` ibs=1 count=$linelenx10 2>/dev/null |awk 'NR==2 {print;exit}'>> randlist.txt
true $((i=i+1)) # Bourne shell equivalent of $i++ iteration
done
for file in `cat randlist.txt`
do
name=`basename $file`
size=`wc -c <"$file"`
echo -e "Location: $file \n\n Size: $size" > $name.dat
done
# vim: ft=sh
What I could accomplish was to take the names and paths of 1000 unique files with the commands "find" and "grep" and put them in a list
I'm going to assume that there is a file that holds, on each line, a full path to each file (FULL_PATH_TO_LIST_FILE). Since there isn't much to the statistics associated with this process, I omitted that part; you can add your own, however.
cd WHEREVER_YOU_WANT_TO_CREATE_NEW_FILES
for file_path in `cat FULL_PATH_TO_LIST_FILE`
do
## This extracts only the file name from the path
file_name=`basename $file_path`
## This grabs the files size in bytes
file_size=`wc -c < $file_path`
## Create the file and place info regarding original file within new file
echo -e "$file_name \nThis file is $file_size bytes "> $file_name
done
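The question also asks for the execution time in milliseconds; here is a hedged sketch of one way to add that (gdate is GNU date from the FreeBSD coreutils package, since the base-system date(1) has no sub-second format; fall back to whole seconds if you don't have it):
# Millisecond timing sketch: wrap the loop above between the two timestamps.
start_ms=$(( $(gdate +%s%N) / 1000000 ))
# ... the for loop from above goes here ...
end_ms=$(( $(gdate +%s%N) / 1000000 ))
echo "Execution time: $((end_ms - start_ms)) ms"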

Very slow loop using grep or fgrep on large datasets

I'm trying to do something pretty simple: grep, from a list, an exact match for each string, in the files in a directory:
#try grep each line from the files
for i in $(cat /data/datafile); do
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done
The file with the strings to grep for has 20 million lines, and the directory has ~600 files, with a total of ~40 million lines.
I can see that this is going to be slow, but we estimated it would take 7 years. Even if I use 300 cores on our HPC, splitting the job by files to search, it looks like it could take over a week.
There are similar questions (Loop Running VERY Slow and Very slow foreach loop), and although they are on different platforms, I think possibly if/else might help me.
Or fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now).
Can anyone see a faster way to do this?
Thank you in advance
sounds like the -f flag for grep would be suitable here:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
so grep can already do what your loop is doing, and you can replace the loop with:
grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt
Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way so it's probably significantly faster.
As Martin has already said in his answer, you should use the -f option instead of looping. I think it should be faster than looping.
Also, this looks like an excellent use case for GNU parallel. Check out this answer for usage examples. It looks difficult, but is actually quite easy to set up and run.
Other than that, 40 million lines should not be a very big deal for grep if there was only one string to match. It should be able to do it in a minute or two on any decent machine. I tested that 2 million lines takes 6 s on my laptop. So 40 mil lines should take 2 mins.
The problem is with the fact that there are 20 million strings to be matched. I think it must be running out of memory or something, especially when you run multiple instances of it on different directories. Can you try splitting the input match-list file? Try splitting it into chunks of 100000 words each for example.
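A sketch of that chunking idea (the chunk file names are illustrative):
# Split the 20M-line pattern list into 100000-line chunks, then run one
# grep -F -f pass per chunk.
split -l 100000 /data/datafile /data/patterns.
for p in /data/patterns.*; do
    grep -F -r -f "$p" /data/filestosearch >> /data/output.txt
done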
EDIT: Just tried parallel on my machine. It is amazing. It automatically takes care of splitting the grep on to several cores and several machines.
Here's one way to speed things up:
while read i
do
LOOK=$(echo $i)
fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done < /data/datafile
When you do for i in $(cat /data/datafile), you first spawn another process, and that process must cat out all of those lines before running the rest of the script. Plus, there's a good possibility that you'll overload the command line and lose some of the files at the end.
By using a while read loop and redirecting the input from /data/datafile, you eliminate the need to spawn a shell. Plus, your script will immediately start reading through the while loop without first having to cat out the entire /data/datafile.
If each $i is a directory and you are interested in the files underneath, I wonder if find might be a bit faster than fgrep -r.
while read i
do
LOOK=$(echo $i)
find $i -type f | xargs fgrep $LOOK >> /data/output.txt
done < /data/datafile
The xargs will take the output of find, and run as many files as possible under a single fgrep. The xargs can be dangerous if file names in those directories contain whitespace or other strange characters. You can try (depending upon the system), something like this:
find $i -type f -print0 | xargs --null fgrep $LOOK >> /data/output.txt
On the Mac it's
find $i -type f -print0 | xargs -0 fgrep $LOOK >> /data/output.txt
As others have stated, if you have the GNU version of grep, you can give it the -f flag and include your /data/datafile. Then, you can completely eliminate the loop.
Another possibility is to switch to Perl or Python, which will actually run faster than the shell and give you a bit more flexibility.
Since you are searching for simple strings (and not regexp) you may want to use comm:
comm -12 <(sort find_this) <(sort in_this.*) > /data/output.txt
It takes up very little memory, whereas grep -f find_this can gobble up 100 times the size of 'find_this'.
On an 8-core machine this takes 100 seconds on these files:
$ wc find_this; cat in_this.* | wc
3637371 4877980 307366868 find_this
16000000 20000000 1025893685
Be sure to have a reasonably new version of sort. It should support --parallel.
You can write a perl/python script that will do the job for you. It saves all the forks you need to do when you do this with external tools.
Another hint: you can combine the strings that you are looking for into one regular expression.
In this case grep will do only one pass over the file for all the combined strings.
Example:
Instead of
for i in ABC DEF GHI JKL
do
grep $i file >> results
done
you can do
egrep "ABC|DEF|GHI|JKL" file >> results
