Append to the top of a large file: bash

I have a nearly 3 GB file that I would like to add two lines to the top of. Every time I try to manually add these lines, vim and vi freeze up on the save (I let them try to save for about 10 minutes each). I was hoping there would be a way to just append to the top, the same way you would append to the bottom of the file. The only approaches I have seen so far, however, involve a temporary file, which I suspect would be slow given the file size.
I was hoping something like:
grep -top lineIwant >> fileIwant
Does anyone know a good way to append to the top of the file?

Try
cat file_with_new_lines file > newfile
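For example, a minimal sketch of the full round trip if you want to keep the original file name (assuming there is enough free space for a second ~3 GB copy while newfile is written):
printf 'first new line\nsecond new line\n' > file_with_new_lines
cat file_with_new_lines file > newfile && mv newfile file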

I did some benchmarking to compare using sed with in-place edit (as suggested here) to cat (as suggested here).
~3GB bigfile filled with dots:
$ head -n3 bigfile
................................................................................
................................................................................
................................................................................
$ du -b bigfile
3025635308 bigfile
File newlines with two lines to insert on top of bigfile:
$ cat newlines
some data
some other data
$ du -b newlines
26 newlines
Benchmark results using dumbbench v0.08:
cat:
$ dumbbench -- sh -c "cat newlines bigfile > bigfile.new"
cmd: Ran 21 iterations (0 outliers).
cmd: Rounded run time per iteration: 2.2107e+01 +/- 5.9e-02 (0.3%)
sed with redirection:
$ dumbbench -- sh -c "sed '1i some data\nsome other data' bigfile > bigfile.new"
cmd: Ran 23 iterations (3 outliers).
cmd: Rounded run time per iteration: 2.4714e+01 +/- 5.3e-02 (0.2%)
sed with in-place edit:
$ dumbbench -- sh -c "sed -i '1i some data\nsome other data' bigfile"
cmd: Ran 27 iterations (7 outliers).
cmd: Rounded run time per iteration: 4.464e+01 +/- 1.9e-01 (0.4%)
So sed seems to be much slower (80.6%) when doing an in-place edit on large files, probably due to moving the intermediary temp file to the location of the original file afterwards. Using I/O redirection, sed is only 11.8% slower than cat.
Based on these results I would use cat as suggested in this answer.

Try doing this, using sed:
sed -i '1i NewLine' file
Or using ed:
ed -s file <<EOF
1i
NewLine
.
w
q
EOF
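Since the original question needs two lines at the top, a hedged GNU sed variant (the inserted text is only an example):
sed -i '1i first new line\nsecond new line' file
Keep in mind that -i still rewrites the whole 3 GB file through a temporary copy, so it won't be faster than the cat approach.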

The speed of such an operation depends greatly on the underlying file system. To my knowledge there isn't an FS optimized for this particular operation. Most FS organize files using full disk blocks, except for the last one, which may be only partially filled by the end of the file. Indeed, a file of size N takes N/S full blocks, where S is the block size, plus one more block for the remaining part of the file (of size N%S, % being the remainder operator) if N is not divisible by S.
Usually, these blocks are referenced by their indices on the disk (or partition), and these indices are stored within the FS metadata, attached to the file entry which allocates them.
From this description, you can see that it could be possible to prepend content whose size is a multiple of the block size, by just updating the metadata with the new list of blocks used by the file. However, if the prepended content doesn't fill an exact number of blocks, then the existing data would have to be shifted by the excess amount.
Some FS may support partially used blocks within a file's block list (and not only as the last entry), but this is not a trivial thing to do.
See these other SO questions for further details:
Prepending Data to a file
Is there a file system with a low level prepend operation
At a higher level, even if that operation is supported by the FS driver, it is still possible that programs don't use the feature.
For the instance of that problem you are trying to solve, the best way is probably a program capable of concatenating the new content and the existing data into a new file.
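To see the block accounting described above for a real file, a quick sketch assuming GNU coreutils stat (format letters differ on BSD/OS X):
stat -c 'apparent size: %s bytes, allocated: %b blocks of %B bytes' bigfile
# note that %b reports allocation units (typically 512 bytes), not necessarily the file system's logical block size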

cat file
Unix
linux
This appends two lines after the first line of the file in a single command:
sed -i '1a C\njava' file
cat file
Unix
C
java
linux
If you want to INSERT before a line, use i instead of a; to replace (change) a line, use c.

Related

Effective way to delete the first N lines of a large 100M-line file in the Linux shell

Suppose I have a large text file with 100M lines. What is the most effective way to delete the first N lines of it, say 50M lines?
From what I've tried, merely opening this file with vim takes several minutes.
Are there any more effective ways to accomplish this?
tail -n +50000000 file > tmp && mv tmp file
If you don't have the storage to almost duplicate the input file, then to edit it truly in place (i.e. without a temp file, which all of the command-line "inplace" editing tools like sed, perl, etc. use, and without a buffer the size of your input file, which ed uses):
# count the bytes occupied by the first 50M lines
bytes=$(head -50000000 file | wc -c)
# overwrite the start of the file with everything after those bytes, in place
dd if=file bs="$bytes" skip=1 conv=notrunc of=file
# chop off the now-duplicated tail
truncate -s "-$bytes" file
See https://stackoverflow.com/a/17331179/1745001 for more info on that last script.

Is it possible to split a huge text file (based on number of lines) while unpacking a .tar.gz archive, if I cannot extract that file as a whole?

I have a .tar.gz file. It contains one 20GB-sized text file with 20.5 million lines. I cannot extract this file as a whole and save to disk. I must do either one of the following options:
Specify a number of lines in each file - say, 1 million, - and get 21 files. This would be a preferred option.
Extract a part of that file based on line numbers, that is, say, from 1000001 to 2000001, to get a file with 1M lines. I will have to repeat this step 21 times with different parameters, which is very bad.
Is it possible at all?
This answer - bash: extract only part of tar.gz archive - describes a different problem.
To extract a file from f.tar.gz and split it into files, each with no more than 1 million lines, use:
tar Oxzf f.tar.gz | split -l1000000
The above will name the output files by the default method. If you prefer the output files to be named prefix.nn where nn is a sequence number, then use:
tar Oxzf f.tar.gz |split -dl1000000 - prefix.
Under this approach:
The original file is never written to disk. tar reads from the .tar.gz file and pipes its contents to split which divides it up into pieces before writing the pieces to disk.
The .tar.gz file is read only once.
split, through its many options, has a great deal of flexibility.
Explanation
For the tar command:
O tells tar to send the output to stdout. This way we can pipe it to split without ever having to save the original file on disk.
x tells tar to extract the file (as opposed to, say, creating an archive).
z tells tar that the archive is in gzip format. On modern tars, this is optional.
f tells tar to use, as input, the file name specified.
For the split command:
-l tells split to split files limited by number of lines (as opposed to, say, bytes).
-d tells split to use numeric suffixes for the output files.
- tells split to get its input from stdin.
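As a quick sanity check after either command above (default split output names start with x; with the prefix. form they start with prefix.):
wc -l x* | tail -n 1   # the total should match the line count of the extracted file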
You can use the --to-stdout (or -O) option in tar to send the output to stdout.
Then use sed to specify which set of lines you want.
#!/bin/bash
l=1
inc=1000000
p=1
while test $l -lt 21000000; do
  e=$(($l + $inc - 1))
  tar -xzf myfile.tar.gz --to-stdout file-to-extract.txt |
    sed -n -e "$l,$e p" > part$p.txt
  l=$(($l + $inc))
  p=$(($p + 1))
done
Here's a pure Bash solution for option #1, automatically splitting lines into multiple output files.
#!/usr/bin/env bash
set -eu
mkdir -p out
filenum=1
chunksize=1000000
ii=0
while IFS= read -r line
do
  if [ "$ii" -ge "$chunksize" ]
  then
    ii=0
    filenum=$((filenum + 1))
    > "out/file.$filenum"
  fi
  printf '%s\n' "$line" >> "out/file.$filenum"
  ii=$((ii + 1))
done
This will take any lines from stdin and create files like out/file.1 with the first million lines, out/file.2 with the second million lines, etc. Then all you need is to feed the input to the above script, like this:
tar xfzO big.tar.gz | ./split.sh
This will never save any intermediate file on disk, or even in memory. It is entirely a streaming solution. It's somewhat wasteful of time, but very efficient in terms of space. It's also very portable, and should work in shells other than Bash, and on ancient systems with little change.
You can use:
sed -n '1,20p' /Your/file/Path
Here you give the first and the last line numbers of the range you want, so it could look like:
sed -n '1,20p' /Your/file/Path >> file1
You can also keep the start and end line numbers in variables and use them accordingly.
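For instance, a sketch with the range in variables (variable names are mine; archive and member names follow the earlier examples):
start=1000001
end=2000000
tar -xzf f.tar.gz --to-stdout file-to-extract.txt | sed -n "${start},${end}p;${end}q" > part.txt
# the trailing ${end}q makes sed stop reading once the range has been printed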

Bash: how to optimize/parallelize a search through two large files to replace strings?

I'm trying to figure out a way to speed up a pattern search and replace between two large text files (>10 MB). File1 has two columns with unique names in each row. File2 has one column that contains one of the shared names from File1, in no particular order, with some text underneath that spans a variable number of lines. They look something like this:
File1:
uniquename1 sharedname1
uniquename2 sharedname2
...
File2:
>sharedname45
dklajfwiffwf
flkewjfjfw
>sharedname196
lkdsjafwijwg
eflkwejfwfwf
weklfjwlflwf
My goal is to use File1 to replace the sharedname variables with their corresponding uniquename, as follows:
New File2:
>uniquename45
dklajfwiffwf
flkewjfjfw
>uniquename196
lkdsjafwijwg
eflkwejfwfwf
weklfjwlflwf
This is what I've tried so far:
while read -r uniquenames sharednames; do
  sed -i "s/$sharednames/$uniquenames/g" $File2
done < $File1
It works but it's ridiculously slow, trudging through those big files. The CPU usage is the rate-limiting step, so I was trying to parallelize the modification to use the 8 cores at my disposal, but couldn't get it to work. I also tried splitting File1 and File2 into smaller chunks and running them in batches simultaneously, but I couldn't get that to work, either. How would you implement this in parallel? Or do you see a different way of doing it?
Any suggestions would be welcomed.
UPDATE 1
Fantastic! Great answers, thanks to #Cyrus and #JJoao, and suggestions by other commenters. I implemented both in my script, on #JJoao's recommendation to compare the compute times, and it's an improvement (~3 hours instead of ~5). However, I'm just doing text file manipulation, so I don't see why it should take more than a couple of minutes. I'm still working on making better use of the available CPUs and tinkering with the suggestions to see if I can speed it up further.
UPDATE 2: correction to UPDATE 1
I incorporated the modifications into my script and ran it as such, but a chunk of my own code was slowing it down. Instead, I ran the suggested bits of code individually on the target intermediary files. Here's what I saw:
Time for #Cyrus' sed to complete
real 70m47.484s
user 70m43.304s
sys 0m1.092s
Time for #JJoao's Perl script to complete
real 0m1.769s
user 0m0.572s
sys 0m0.244s
Looks like I'll be using the Perl script. Thanks for helping, everyone!
UPDATE 3
Here's the time taken by #Cyrus' improved sed command:
time sed -f <(sed -E 's|(.*) (.*)|s/^\2/>\1/|' File1 | tr "\n" ";") File2
real 21m43.555s
user 21m41.780s
sys 0m1.140s
With GNU sed and bash:
sed -f <(sed -E 's|(.*) (.*)|s/>\2/>\1/|' File1) File2
Update:
An attempt to speed it up:
sed -f <(sed -E 's|(.*) (.*)|s/^>\2/>\1/|' File1 | tr "\n" ";") File2
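To see the sed program that the inner command generates for the sample rows in the question, you can run just that part on its own:
sed -E 's|(.*) (.*)|s/^>\2/>\1/|' File1 | tr "\n" ";"
# -> s/^>sharedname1/>uniquename1/;s/^>sharedname2/>uniquename2/;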
#!/usr/bin/perl
use strict;
use warnings;

my $file1 = shift;
my %dic = ();
open(F1, $file1) or die("can't find replacement file\n");
while (<F1>) {                        # slurp File1 into the dictionary
    if (/(\S+)\s+(\S+)/) { $dic{$2} = $1 }
}
while (<>) {                          # for all File2 lines
    s/(?<=>)(.*)/ $dic{$1} || $1 /e;  # substitute ">id" with ">dic{id}"
    print;
}
I prefer #Cyrus' solution, but if you need to do this often you can use the previous Perl script (chmod + install) as a dict-replacement command.
Usage: dict-replacement File1 File* > output
It would be nice if you could tell us the time of the various solutions...

How to remove the path part from a list of files and copy it into another file?

I need to accomplish the following things with bash scripting in FreeBSD:
Create a directory.
Generate 1000 unique files whose names are taken from other random files in the system.
Each file must contain information about the original file whose name it has taken - name and size without the original contents of the file.
The script must show information about the speed of its execution in ms.
What I could accomplish was to take the names and paths of 1000 unique files with the commands find and grep and put them in a list. Then I just can't figure out how to remove the path part and create the files in the other directory with names taken from the list of random files. I tried a for loop with the basename command in it, but somehow I can't get it to work, and I don't know how to do the other tasks either...
[Update: I've wanted to come back to this question to try to make my response more useful and portable across platforms (OS X is a Unix!) and $SHELLs, even though the original question specified bash and zsh. Other responses assumed a temporary file listing of "random" file names since the question did not show how the list was constructed or how the selection was made. I show one method for constructing the list in my response using a temporary file. I'm not sure how one could randomize the find operation "inline" and hope someone else can show how this might be done (portably). I also hope this attracts some comments and critique: you never can know too many $SHELL tricks. I removed the perl reference, but I hereby challenge myself to do this again in perl and - because perl is pretty portable - make it run on Windows. I will wait a while for comments and then shorten and clean up this answer. Thanks.]
Creating the file listing
You can do a lot with GNU find(1). The following would create a single file with the file names and three, tab-separated columns of the data you want (name of file, location, size in kilobytes).
find / -type f -fprintf tmp.txt '%f\t%h/%f\t%k \n'
I'm assuming that you want to be random across all filenames (i.e. no links) so you'll grab the entries from the whole file system. I have 800000 files on my workstation but a lot of RAM, so this doesn't take too long to do. My laptop has ~ 300K files and not much memory, but creating the complete listing still only took a couple minutes or so. You'll want to adjust by excluding or pruning certain directories from the search.
A nice thing about the -fprintf flag is that it seems to take care of spaces in file names. By examining the file with vim and sed (i.e. looking for lines with spaces) and comparing the output of wc -l and uniq you can get a sense of your output and whether the resulting listing is sane or not. You could then pipe this through cut, grep or sed, awk and friends in order to create the files in the way you want. For example from the shell prompt:
~/# touch `cat tmp.txt |cut -f1`
~/# for i in `cat tmp.txt|cut -f1`; do cat tmp.txt | grep $i > $i.dat ; done
I'm giving the files we create a .dat extension here to distinguish them from the files to which they refer, and to make it easier to move them around or delete them; you don't have to do that: just leave off the extension ($i > $i).
The bad thing about the -fprintf flag is that it is only available with GNU find and is not a POSIX standard flag so it won't be available on OS X or BSD find(1) (though GNU find may be installed on your Unix as gfind or gnufind). A more portable way to do this is to create a straight up list of files with find / -type f > tmp.txt (this takes about 15 seconds on my system with 800k files and many slow drives in a ZFS pool. Coming up with something more efficient should be easy for people to do in the comments!). From there you can create the data values you want using standard utilities to process the file listing as Florin Stingaciu shows above.
#!/bin/sh
# portably get a random number (OS X, BSD, Linux and $SHELLs w/o $RANDOM)
randnum=`od -An -N 4 -D < /dev/urandom` ; echo $randnum
for file in `cat tmp.txt`
do
  name=`basename $file`
  size=`wc -c $file | awk '{print $1}'`
  # Uncomment the next line to see the values on STDOUT
  # printf "Location: $name \nSize: $size \n"
  # Uncomment the next line to put data into the respective .dat files
  # printf "Location: $file \nSize: $size \n" > $name.dat
done
# vim: ft=sh
If you've been following this far you'll realize that this will create a lot of files - on my workstation this would create 800k of .dat files which is not what we want! So, how to randomly select 1000 files from our listing of 800k for processing? There's several ways to go about it.
Randomly selecting from the file listing
We have a listing of all the files on the system (!). Now in order to select 1000 files we just need to randomly select 1000 lines from our listing file (tmp.txt). We can set an upper limit of the line number to select by generating a random number using the cool od technique you saw above - it's so cool and cross-platform that I have this aliased in my shell ;-) - then performing modulo division (%) on it using the number of lines in the file as the divisor. Then we just take that number and select the line in the file to which it corresponds with awk or sed (e.g. sed -n <$RANDOMNUMBER>p filelist), iterate 1000 times and presto! We have a new list of 1000 random files. Or not ... it's really slow! While looking for a way to speed up awk and sed I came across an excellent trick using dd from Alex Lines that searches the file by bytes (instead of lines) and translates the result into a line using sed or awk.
See Alex's blog for the details. My only problems with his technique came with setting the count= switch to a high enough number. For mysterious reasons (which I hope someone will explain) - perhaps because my locale is LC_ALL=en_US.UTF-8 - dd would spit incomplete lines into randlist.txt unless I set count= to a much higher number than the actual maximum line length. I think I was probably mixing up characters and bytes. Any explanations?
So after the above caveats and hoping it works on more than two platforms, here's my attempt at solving the problem:
#!/bin/sh
IFS='
'
# We create tmp.txt with
# find / -type f > tmp.txt # tweak as needed.
#
files="tmp.txt"
# Get the file size in bytes and maximum line length for later
bytesize=`wc -c < $files`
# wc -L is not POSIX and we need to multiply so:
linelenx10=`awk '{if(length > x) {x=length; y = $0} }END{print x*10}' $files`
# A function to generate a random number modulo the
# number of bytes in the file. We'll use this to find a
# random location in our file where we can grab a line
# using dd and sed.
genrand () {
  echo `od -An -N 4 -D < /dev/urandom` ' % ' $bytesize | bc
}
rm -f randlist.txt
i=1
while [ $i -le 1000 ]
do
  # This probably works but is way too slow: sed -n `genrand`p $files
  # Instead, use Alex Lines' dd seek method:
  dd if=$files skip=`genrand` ibs=1 count=$linelenx10 2>/dev/null | awk 'NR==2 {print;exit}' >> randlist.txt
  true $((i=i+1)) # Bourne shell equivalent of $i++ iteration
done
for file in `cat randlist.txt`
do
  name=`basename $file`
  size=`wc -c <"$file"`
  echo -e "Location: $file \n\n Size: $size" > $name.dat
done
# vim: ft=sh
What I could accomplish was to take the names and paths of 1000 unique files with the commands "find" and "grep" and put them in a list
I'm going to assume that there is a file that holds, on each line, a full path to each file (FULL_PATH_TO_LIST_FILE). Since there's not much in the way of statistics involved in this process, I omitted that part; you can add your own, however.
cd WHEREVER_YOU_WANT_TO_CREATE_NEW_FILES
for file_path in `cat FULL_PATH_TO_LIST_FILE`
do
  ## This extracts only the file name from the path
  file_name=`basename $file_path`
  ## This grabs the file's size in bytes
  file_size=`wc -c < $file_path`
  ## Create the file and place info regarding the original file within the new file
  echo -e "$file_name \nThis file is $file_size bytes " > $file_name
done
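The question also asks for the execution time in ms. A hedged sketch for that part (date +%s%N needs GNU date; stock FreeBSD date has no %N, so you may need gdate from coreutils or another timer):
start=$(date +%s%N)
# ... run the loop above ...
end=$(date +%s%N)
echo "Execution time: $(( (end - start) / 1000000 )) ms"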

Shell one-liner to add a line to a sorted file

I want to add a line to a text file so that the result is sorted, where the text file was originally sorted. For example:
cp file tmp; echo "new line" >> tmp; sort tmp > file; rm -f tmp
I'd REALLY like to do it w/o the temp file and w/o the semicolons (using pipes instead?); using sed would be acceptable. Is this possible, and if so, how?
echo "New Line" | sort -o file - file
The -o file means write result to file (and it is explicitly safe to have any of the input files as the output file). The - on its own means 'read standard input' which contains the new line of information. The file at the end means 'also read file'. This would work with any Unix sort from (at least) 7th Edition UNIX™ circa 1978 onwards, and possibly even before that. There are no temporary files or dependencies on other utilities.
Given that a single line is 'sorted' and the file is also in sorted order, you can probably speed the process up by just merging the two sorted inputs:
echo "New Line" | sort -o file -m - file
That also would have worked with even really old sort commands.
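If you want to guard against the file not actually being in sorted order before merging, sort -c (a standard POSIX option) checks the order and exits non-zero without writing anything, so a cautious sketch is:
sort -c file && echo "New Line" | sort -m -o file - file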
This is the shortest one liner I can think of without any temporary files:
$ echo "something" >> file; sort file -o file
Yep, you'll either need to re-sort, or comm the two files together if they're already presorted (and assuming they contain no tabs), which saves you the sort (which can produce temp files and overhead depending on file size).
Alternative:
comm -3 file <(echo "new line") |tr -d '\t'
This might be the "shortest":
sort -m file <(echo "new line")
You can do this without any semicolons and without a temp file, but probably not without depending on some utilities that might not be everywhere (like awk with in-place file modification, or perl).
Why don't you want to use temp files or semicolons?
Edit: since semicolons are ok, how about:
val=$(cat file); { echo "$val"; echo "new line"; } | sort > file
Large files / performance:
Convert your file to an SQLite database with a single indexed column and query that.
Or re-implement a file-based B-tree or hash map yourself, which is how SQLite implements indexes...
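A minimal sketch of the SQLite route using the sqlite3 command-line shell (database, table, and column names here are mine, and it assumes the lines contain no '|' characters, the default .import separator):
sqlite3 lines.db "CREATE TABLE t(line TEXT); CREATE INDEX t_idx ON t(line);"
sqlite3 lines.db ".import file t"
sqlite3 lines.db "INSERT INTO t(line) VALUES ('new line');"
sqlite3 lines.db "SELECT line FROM t ORDER BY line;" > file.sorted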
I think it is impossible to insert into sorted text files efficiently: even if you do a binary search, you still have to copy everything that comes after the insertion point, and that disk operation will be the bottleneck: https://unix.stackexchange.com/questions/87772/add-lines-to-the-beginning-and-end-of-the-huge-file
For search, sgrep might work: https://askubuntu.com/questions/423886/efficiently-search-sorted-file/701237#701237
