Using bash, how can I awk/grep from the middle of a given file, skipping the first 1 GB for instance? In other words, I don't want awk/grep to search through the first 1 GB of the file; I want to start my search in the middle of the file.
You can use dd like this:
# make a 10GB file of zeroes
dd if=/dev/zero bs=1G count=10 > file
# read it, skipping first 9GB and count what you get
dd if=file bs=1G skip=9 | wc -c
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.84402 s, 582 MB/s
1073741824
Note that I am just demonstrating the concept of how easily you can skip 9 GB. In practice, you may prefer to use a 100 MB memory buffer and skip 90 of them rather than allocating a whole gigabyte:
dd if=file bs=100M skip=90 | wc -c
Note also that I am piping to wc rather than awk because my test data is not line oriented - it is just zeros.
Or, if your record size is 30kB and you want to skip a million records and discard diagnostic output:
dd if=file bs=30K skip=1000000 2> /dev/null | awk ...
Note that:
your line numbers will be "wrong" in awk (because awk didn't "see" them), and
your first line may be incomplete (because dd isn't "line oriented"), but I guess that doesn't matter; if it does, you can simply drop that first partial line, as in the sketch below.
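If that partial first line does matter, here is a small sketch (reusing the 100 MB-buffer example above and a placeholder pattern) that simply discards it before searching:
# skip ~9 GB, drop the (possibly partial) first line, then search the rest
dd if=file bs=100M skip=90 2>/dev/null | tail -n +2 | grep 'yourfilter'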
Note also that it is generally very advantageous to use a large block size. So, if you want 8MB, you will do much better with bs=1M count=8 than with bs=8 count=1000000, which will cause a million reads and writes of 8 bytes each.
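If you want to see the difference on your own machine (timings will obviously vary), a quick comparison is:
# 8 MiB copied in 8 large blocks vs. a million tiny 8-byte blocks
time dd if=/dev/zero of=/dev/null bs=1M count=8
time dd if=/dev/zero of=/dev/null bs=8 count=1000000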
Note also that if you like processing very large files, you can get GNU Parallel to divide them up for processing in parallel by multiple subprocesses. So, for example, the following code takes the 10GB file we made at the start and starts 10 parallel jobs counting the bytes in each 1GB chunk:
parallel -a file --recend "" --pipepart --block 1G wc -c
If you know the full size of the file (let's say 5 million lines) you can do this:
tail -n 2000000 filename | grep "yourfilter"
This way, whatever filtering or printing you do starts after the first 3 million lines.
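If you don't want to hard-code the 2,000,000, a small sketch (the 3,000,000 skip count and the pattern are just this example's placeholders) can derive the tail count from the file's actual length:
total=$(wc -l < filename)                      # one full pass just to count lines
tail -n "$((total - 3000000))" filename | grep "yourfilter"
Note that wc -l itself still has to read the whole file once in order to count the lines.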
I have not tested the performance on very large files compared to tail | grep, but you could try GNU sed:
sed -n '3000001,$ {/your regex/p}' file
This skips the first 3 million lines and then prints all lines matching the regular expression your regex. The same with awk:
awk 'NR>3000000 && /your regex/' file
My requirement is to chop off the header and trailer records from a large file. I'm using a file of size 2.5GB with 1.8 million records. To do so, I'm executing:
head -n $((count-1)) largeFile | tail -n $((count-2)) > outputFile
Whenever I select count >= 725,000 records (size = 1,063,577,322 bytes), the command returns an error:
tail:unable to malloc memory
I assumed that the pipe buffer was full and tried:
head -n 1000000 largeFile | tail -n 720000 > outputFile
which should also fail since I'm passing count > 725,000 to head, but it generated the output.
Why is that so? Since head is generating the same amount of data (or more), both commands should fail, yet whether they fail depends on the tail count. Isn't it the case that head first writes into the pipe and then tail uses the pipe as its input? If not, how is parallelism supported here, since tail works from the end, which is not known until head completes execution? Please correct me; I've assumed a lot of things here.
PS: For the time being I've used grep to remove the header and trailer. Also, ulimit on my machine returns:
pipe (512 byte) 64 {32 KB}
Just do this instead:
awk 'NR>2{print prev} {prev=$0}' largeFile > outputFile
It'll only store one line in memory at a time, so there's no need to worry about memory issues.
Here's the result:
$ seq 5 | awk 'NR>2{print prev} {prev=$0}'
2
3
4
I did not test this with a large file, but it will avoid a pipe.
sed '1d;$d' largeFile > outputFile
Ed Morton and Walter A have already given workable alternatives; I'll take a stab at explaining why the original is failing. It's because of the way tail works: tail reads from the file (or pipe), starting at the beginning. It stores the last N lines seen, and when it reaches the end of the input, it outputs the stored lines. That means that when you use tail -n 725000, it needs to keep the last 725,000 lines in memory so it can print them when it reaches the end of the file. If those 725,000 lines (about 1 GB of your 2.5 GB file) won't fit in memory, you get a malloc ("memory allocate") error.
Solution: use a process that doesn't have to buffer most of the file before outputting it, as both Ed and Walter's solutions do. As a bonus, they both trim the first line in the same process.
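For completeness, here is a third streaming variant that also avoids buffering most of the file; it is just a sketch assuming GNU coreutils, where head accepts a negative line count:
# drop the last line, then the first; neither command holds more than a line or two in memory
head -n -1 largeFile | tail -n +2 > outputFile
head only withholds the single line it might have to suppress, and tail -n +2 streams from line 2 onwards, so memory use stays flat regardless of file size.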
Goal
Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.
Input File
name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)
Hardware
36 GB Memory
16 CPUs
HPCC environment (I'm not admin)
Code
Version 1
zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.
Version 2
# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.
zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:
parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.
Questions
How can my code be improved?
Is there a faster way to accomplish this goal?
Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.
You tell that to GNU Parallel with -L 4.
In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.
To do that efficiently you use --pipepart, except --pipepart does not work with compressed files and does not work with -L, so you have to settle for --pipe.
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
This will pass a block to 16 children; a block defaults to 1 MB and is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that with round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
Now if you had the fastq file uncompressed we could do it even faster, but we would have to cheat a little: often in fastq files you will have a seqname that starts with the same string:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
Here it is @EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.
We can use that to our advantage, because now you can use \n followed by @EAS54_6_R as a record separator, and then we can use --pipepart. The added benefit is that the order will remain the same. Here you would have to set the block size to 1/16 of the size of file1.fastq:
parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipepart --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
If you use GNU Parallel 20161222 then GNU Parallel can do that computation for you. --block -1 means: Choose a block-size so that you can give one block to each of the 16 jobslots.
parallel -a file1.fastq --block -1 -j16 --pipepart --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.
It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:
parallel -a file1.fastq --pipepart --block -1 -j16 \
  --regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]' \
  my_command
Here we assume that the lines will start like this:
@<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>
Even if you have a few quality lines starting with '@', they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with @.
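If you want to convince yourself that the record-start regexp cannot fire in the wrong place on your data, one quick sanity check (just a sketch) is to verify that the file really is strict 4-line FASTQ with every record starting with '@':
# every 1st-of-4 line should be a seqname line starting with '@';
# print how many records violate that assumption (0 means the regexp is safe)
awk 'NR % 4 == 1 && !/^@/ { bad++ } END { print bad+0, "malformed header lines" }' file1.fastq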
You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:
You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte has been read (and the first gzip will probably be done by then).
By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.
Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.
split -n r/16 is very efficient for doing simple round-robin. It does, however, not support multiline records. So we insert \0 as a record separator after every 4th line, which we remove after the splitting. --filter runs a command on the input, so we do not need to save the uncompressed data:
doit() { perl -pe 's/\0//' | gzip > $FILE.gz; }
export -f doit
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.
The output files will be named big.aa.gz .. big.ap.gz.
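Because split -n r/16 distributes records deterministically (record n always goes to child number (n-1) mod 16, and order within each child is preserved), you can keep paired-end files in sync by running the exact same pipeline on the mate file. A sketch, where big.r1.gz and big.r2.gz are just hypothetical names:
# same helper as above: strip the \0 markers and compress each child
doit() { perl -pe 's/\0//' | gzip > $FILE.gz; }
export -f doit
zcat big.r1.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.r1.
zcat big.r2.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.r2.
Record k of big.r1.aa.gz is then the mate of record k of big.r2.aa.gz, and so on, provided both inputs contain the same number of records.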
I am using Oracle's UTL_FILE package to generate some files. The file names have a certain format like <name>_<date>_<time>_<sequence> where sequence starts from 000.
Now I want to split the files if the original file is greater than X MB. In that case I will get (in the case of 2 files):
<name>_<date>_<time>_001
<name>_<date>_<time>_002
where 001 is 10 MB (max) and 002 is < 10 MB.
The only way I see to do this is to count the bytes of every line that will be written by the UTL_FILE.put command and then decide whether to keep writing or to split.
This seems to me a very CPU-consuming process.
Is there a way to do this differently in PL/SQL?
That's interesting and challenging. But why do you want to do this in PL/SQL only? You can easily achieve this task by writing a shell script.
Let us say the file name is File1 and the size is 5.6 GB. Then the file should be split into 3 files, named File1, File2, File3.
You can use du -BG <file> to get the size in GB.
size=$(du -BG your_file | cut -dG -f1)
then
[ $size -ge 3 ] && split -d -b2G your_file file
The output will be file00, file01, ... (a 5.6 GB input yields three pieces: file00, file01, file02).
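Putting the two pieces together, a minimal wrapper might look like this (your_file, the 3 GB threshold and the 2 GB chunk size are just this example's assumptions):
#!/usr/bin/env bash
file=your_file
size=$(du -BG "$file" | cut -dG -f1)        # size rounded up to whole GB
if [ "$size" -ge 3 ]; then
    split -d -b2G "$file" "${file}_part_"   # produces ${file}_part_00, _01, ...
else
    echo "$file is below the threshold; leaving it in one piece"
fi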
I'm trying to remove the first 37 lines from a very, very large file. I started trying sed and awk, but they seem to require copying the data to a new file. I'm looking for a "remove lines in place" method that, unlike sed -i, does not make copies of any kind, but rather just removes lines from the existing file.
Here's what I've done...
awk 'NR > 37' file.xml > 'f2.xml'
sed -i '1,37d' file.xml
Both of these seem to do a full copy. Is there any other simple CLI that can do this quickly without a full document traversal?
There's no simple way to do inplace editing using UNIX utilities, but here's one inplace file modification solution that you might be able to modify to work for you (courtesy of Robert Bonomi at https://groups.google.com/forum/#!topic/comp.unix.shell/5PRRZIP0v64):
bytes=$(head -37 "$file" |wc -c)
dd if="$file" bs="$bytes" skip=1 conv=notrunc of="$file"
The dd above shifts everything after the first $bytes bytes to the front of the file but leaves the file at its original size, so to finish we must remove the final $bytes bytes (the goal, after all, was to make the file $bytes bytes smaller). We're using conv=notrunc above to make sure that the file doesn't get completely emptied rather than just shifted (see below for an example of what happens without it). On a GNU system such as Linux, the truncation afterwards can be accomplished by:
truncate -s "-$bytes" "$file"
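Put together as a reusable helper, it could look like this; just a sketch assuming GNU coreutils (for truncate) and that the file actually has more than N lines:
# remove the first N lines of FILE in place
remove_first_lines() {
    local n=$1 file=$2 bytes
    bytes=$(head -n "$n" "$file" | wc -c)
    dd if="$file" bs="$bytes" skip=1 conv=notrunc of="$file" 2>/dev/null
    truncate -s "-$bytes" "$file"
}
# usage: remove_first_lines 37 file.xml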
For example to delete the first 5 lines from this 12-line file
$ wc -l file
12 file
$ cat file
When chapman billies leave the street,
And drouthy neibors, neibors, meet;
As market days are wearing late,
And folk begin to tak the gate,
While we sit bousing at the nappy,
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
First use dd to remove the target 5 lines (really "$bytes" bytes) from the start of the file by shifting the rest of the file to the front, leaving the trailing "$bytes" bytes as-is:
$ bytes=$(head -5 file |wc -c)
$ dd if=file bs="$bytes" skip=1 conv=notrunc of=file
1+1 records in
1+1 records out
253 bytes copied, 0.0038458 s, 65.8 kB/s
$ wc -l file
12 file
$ cat file
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
s, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
and then use truncate to remove those leftover bytes from the end:
$ truncate -s "-$bytes" "file"
$ wc -l file
7 file
$ cat file
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
If we had tried the above dd without conv=notrunc:
$ wc -l file
12 file
$ bytes=$(head -5 file |wc -c)
$ dd if=file bs="$bytes" skip=1 of=file
dd: file: cannot skip to specified offset
0+0 records in
0+0 records out
0 bytes copied, 0.0042254 s, 0.0 kB/s
$ wc -l file
0 file
See the google groups thread I referenced for other suggestions and info.
Unix file semantics do not allow truncating the front part of a file.
All solutions will be based on either:
Reading the file into memory and then writing it back (ed, ex, other editors). This should be fine if your file is <1GB or if you have plenty of RAM.
Writing a second copy and optionally replacing the original (sed -i, awk/tail > foo). This is fine as long as you have enough free diskspace for a copy, and don't mind the wait.
If the file is too large for any of these to work for you, you may be able to work around it depending on what's reading your file.
Perhaps your reader skips comments or blank lines? If so, you can craft a message the reader ignores, make sure it has the same number of bytes as the first 37 lines in your file, and overwrite the start of the file with dd if=yourdata of=file conv=notrunc.
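For example, if the reader is an XML parser that ignores comments (and nothing in those 37 lines, such as the XML declaration, is actually needed), a hypothetical way to blank them out without changing the file size could look like this:
# build a comment pad with exactly the same byte count as the first 37 lines,
# then overwrite the start of the file in place
bytes=$(head -n 37 file.xml | wc -c)
{ printf '<!--'; head -c "$((bytes - 8))" /dev/zero | tr '\0' ' '; printf -- '-->\n'; } > pad.xml
dd if=pad.xml of=file.xml conv=notrunc
The 8 accounts for the "<!--", "-->" and trailing newline, so pad.xml comes out exactly $bytes bytes long.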
ed is the standard editor:
ed -s file <<< $'1,37d\nwq'
The copy will have to be created at some point - so why not create it at the time the "modified" file is read, streaming the altered copy instead of storing it?
What I'm thinking - create a named pipe "file2" that is the output of that same awk 'NR > 37' file.xml or whatever; then whoever reads file2 will not see the first 37 lines.
The drawback is that it will run awk each time the file is processed, so it's feasible only if it's read rarely.
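A minimal sketch of that idea (the consumer command is hypothetical):
mkfifo file2.xml                         # create the named pipe once
awk 'NR > 37' file.xml > file2.xml &     # writer blocks until a reader opens the pipe
some_consumer file2.xml                  # the consumer never sees the first 37 lines
rm file2.xml                             # a FIFO can be removed like any other file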
I have a file, say all, with 2000 lines, and I hope it can be split into 4 small files with line numbers 1~500, 501~1000, 1001~1500, and 1501~2000.
Perhaps, I can do this using:
cat all | head -500 >small1
cat all | tail -1500 | head -500 >small2
cat all | tail -1000 | head -500 >small3
cat all | tail -500 >small4
But this way involves calculating line numbers, which may cause errors when the number of lines is not a round number, or when we want to split the file into many small files (e.g. a file all with 3241 lines that we want to split into 7 files, each with 463 lines).
Is there a better way to do this?
When you want to split a file, use split:
split -l 500 all all
will split the file into several files that each have 500 lines. If you want to split the file into 4 files of roughly the same size, use something like:
split -l $(( $( wc -l < all ) / 4 + 1 )) all all
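For the 3241-line, 7-file case mentioned in the question, the same idea works without any manual arithmetic:
seq 3241 > all                                   # sample input with 3241 lines
split -l $(( $(wc -l < all) / 7 + 1 )) all part_
# yields part_aa .. part_ag: six files of 464 lines and one of 457
Since 3241 happens to divide evenly by 7, split -l 463 would give exactly seven 463-line files; the +1 only matters when there is a remainder.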
Look into the split command; it should do what you want (and more):
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names.
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic.
FROM changes the start value (default 0).
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines per output file
-n, --number=CHUNKS generate CHUNKS output files. See below
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE is an integer and optional unit (example: 10M is 10*1024*1024). Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines
l/K/N output Kth of N to stdout without splitting lines
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
Like the others have already mentioned, you could use split. The complicated command substitution that the accepted answer mentions is not necessary. For reference I'm adding the following commands, which accomplish almost what has been requested. Note that when using the -n command-line argument to specify the number of chunks, the small* files do not contain exactly 500 lines.
$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
583 small1
528 small2
445 small3
444 small4
2000 total
Alternatively, you could use GNU parallel:
$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
500 small1
500 small2
500 small3
500 small4
2000 total
As you can see, this incantation is quite complex. GNU Parallel is actually most often used for parallelizing pipelines; IMHO it is a tool worth looking into.