How can I remove the first line of a text file using bash/sed script? - bash

I need to repeatedly remove the first line from a huge text file using a bash script.
Right now I am using sed -i -e "1d" $FILE - but it takes around a minute to do the deletion.
Is there a more efficient way to accomplish this?

Try tail:
tail -n +2 "$FILE"
-n x: print just the last x lines. tail -n 5 would give you the last 5 lines of the input. The + sign inverts the argument, making tail print everything but the first x-1 lines: tail -n +1 prints the whole file, tail -n +2 everything but the first line, and so on.
GNU tail is much faster than sed. tail is also available on BSD and the -n +2 flag is consistent across both tools. Check the FreeBSD or OS X man pages for more.
The BSD version can be much slower than sed, though. I wonder how they managed that; tail should just read a file line by line while sed does pretty complex operations involving interpreting a script, applying regular expressions and the like.
Note: You may be tempted to use
# THIS WILL GIVE YOU AN EMPTY FILE!
tail -n +2 "$FILE" > "$FILE"
but this will give you an empty file. The reason is that the redirection (>) happens before tail is invoked by the shell:
1. The shell truncates the file $FILE
2. The shell creates a new process for tail
3. The shell redirects stdout of the tail process to $FILE
4. tail reads from the now empty $FILE
If you want to remove the first line inside the file, you should use:
tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"
The && will make sure that the file doesn't get overwritten when there is a problem.

You can use -i to update the file without using the '>' operator. The following command will delete the first line from the file and save the result back to the same file (sed uses a temp file behind the scenes).
sed -i '1d' filename

For those who are on SunOS which is non-GNU, the following code will help:
sed '1d' test.dat > tmp.dat
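If the result should end up back under the original name, a possible follow-up (assuming tmp.dat is safe to use as a scratch file) is:
sed '1d' test.dat > tmp.dat && mv tmp.dat test.dat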

You can easily do this with:
cat filename | sed 1d > filename_without_first_line
on the command line; or to remove the first line of a file permanently, use the in-place mode of sed with the -i flag:
sed -i 1d <filename>

No, that's about as efficient as you're going to get. You could write a C program which could do the job a little faster (less startup time and processing arguments) but it will probably tend towards the same speed as sed as files get large (and I assume they're large if it's taking a minute).
But your question suffers from the same problem as so many others in that it pre-supposes the solution. If you were to tell us in detail what you're trying to do rather than how, we may be able to suggest a better option.
For example, if this is a file A that some other program B processes, one solution would be to not strip off the first line, but modify program B to process it differently.
Let's say all your programs append to this file A and program B currently reads and processes the first line before deleting it.
You could re-engineer program B so that it didn't try to delete the first line but maintains a persistent (probably file-based) offset into the file A so that, next time it runs, it could seek to that offset, process the line there, and update the offset.
Then, at a quiet time (midnight?), it could do special processing of file A to delete all lines currently processed and set the offset back to 0.
It will certainly be faster for a program to open and seek a file rather than open and rewrite. This discussion assumes you have control over program B, of course. I don't know if that's the case but there may be other possible solutions if you provide further information.
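A minimal shell sketch of that offset-based design, assuming a hypothetical process_line stand-in for program B's per-line work and a sidecar file A.offset holding the byte offset:
#!/bin/sh
# Hypothetical sketch: consume the next unprocessed line of A on each run,
# tracking progress as a byte offset in A.offset instead of deleting lines.
offset=$(cat A.offset 2>/dev/null || echo 0)
# Read the next line starting just past the saved offset.
line=$(tail -c +"$((offset + 1))" A | head -n 1)
[ -z "$line" ] && exit 0          # nothing new to process yet
process_line "$line"              # stand-in for program B's real work
# Advance the offset past the line and its trailing newline.
echo "$((offset + ${#line} + 1))" > A.offset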

The sponge util avoids the need for juggling a temp file:
tail -n +2 "$FILE" | sponge "$FILE"

If you want to modify the file in place, you could always use the original ed instead of its streaming successor sed:
ed "$FILE" <<<$'1d\nwq\n'
The ed command was the original UNIX text editor, before there were even full-screen terminals, much less graphical workstations. The ex editor, best known as what you're using when typing at the colon prompt in vi, is an extended version of ed, so many of the same commands work. While ed is meant to be used interactively, it can also be used in batch mode by sending a string of commands to it, which is what this solution does.
The sequence <<<$'1d\nwq\n' takes advantage of modern shells' support for here-strings (<<<) and ANSI quotes ($'...') to feed input to the ed command consisting of two lines: 1d, which deletes line 1, and then wq, which writes the file back out to disk and then quits the editing session.
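For shells without here-string or ANSI-quote support, the same two ed commands can be fed in with printf instead (an equivalent form, not a different approach):
printf '1d\nwq\n' | ed -s "$FILE"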

As Pax said, you probably aren't going to get any faster than this. The reason is that there are almost no filesystems that support truncating from the beginning of the file so this is going to be an O(n) operation where n is the size of the file. What you can do much faster though is overwrite the first line with the same number of bytes (maybe with spaces or a comment) which might work for you depending on exactly what you are trying to do (what is that by the way?).
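A rough sketch of that overwrite-in-place idea (an illustration under assumptions, not something from the question): blank out line 1 with the same number of bytes so the rest of the file never gets rewritten, and have whatever reads the file skip blank lines:
# Overwrite the first line of "$FILE" with spaces of equal length, in place.
len=$(head -n 1 "$FILE" | wc -c)      # bytes in line 1, newline included
printf '%*s' "$((len - 1))" '' |
    dd of="$FILE" bs=1 count="$((len - 1))" conv=notrunc 2>/dev/null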

You can edit the files in place: Just use perl's -i flag, like this:
perl -ni -e 'print unless $. == 1' filename.txt
This makes the first line disappear, as you ask. Perl will need to read and copy the entire file, but it arranges for the output to be saved under the name of the original file.

This shows all lines except the first:
cat textfile.txt | tail -n +2

Could use vim to do this:
vim -u NONE +'1d' +'wq!' /tmp/test.txt
This should be faster, since vim won't read the whole file while processing.

How about using csplit?
man csplit
csplit -k file 1 '{1}'

This one liner will do:
echo "$(tail -n +2 "$FILE")" > "$FILE"
It works because the command substitution runs tail and captures its entire output before the redirection truncates the file, hence no need for a temp file.

Since it sounds like I can't speed up the deletion, I think a good approach might be to process the file in batches like this:
while [ -s file1 ]; do
    head -n 1000 file1 > file2
    # process file2
    sed -i -e '1,1000d' file1
done
The drawback of this is that if the program gets killed in the middle (or if there's some bad sql in there - causing the "process" part to die or lock-up), there will be lines that are either skipped, or processed twice.
(file1 contains lines of sql code)

tail +2 path/to/your/file
works for me, no need to specify the -n flag. For reasons, see Aaron's answer.

You can use the sed command to delete arbitrary lines by line number:
# create a multi-line txt file
echo "1. first
2. second
3. third" > file.txt
Deleting lines and printing to stdout:
$ sed '1d' file.txt
2. second
3. third
$ sed '2d' file.txt
1. first
3. third
$ sed '3d' file.txt
1. first
2. second
# delete multiple lines
$ sed '1,2d' file.txt
3. third
# delete the last line
$ sed '$d' file.txt
1. first
2. second
use the -i option to edit the file in-place
$ cat file.txt
1. first
2. second
3. third
$ sed -i '1d' file.txt
$ cat file.txt
2. second
3. third
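One portability caveat worth noting: BSD/macOS sed requires an explicit (possibly empty) backup suffix with -i, while GNU sed takes -i on its own.
sed -i '' '1d' file.txt    # BSD/macOS sed
sed -i '1d' file.txt       # GNU sed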

If what you are looking to do is recover after failure, you could just build up a file that has what you've done so far.
if [[ -f "$tmpf" ]] ; then
    rm -f "$tmpf"
fi
while IFS= read -r line ; do
    # process line
    echo "$line" >> "$tmpf"
done < "$srcf"
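If a run died part-way, a possible way to resume on the next run (skipping the initial rm, and assuming $tmpf still holds the lines already processed) is to skip that many lines of the source:
done_count=$(wc -l < "$tmpf" 2>/dev/null || echo 0)
tail -n +"$((done_count + 1))" "$srcf" | while IFS= read -r line ; do
    # process line
    echo "$line" >> "$tmpf"
done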

Based on 3 other answers, I came up with this syntax that works perfectly in my Mac OS X bash shell:
line=$(head -n1 list.txt && echo "$(tail -n +2 list.txt)" > list.txt)
Test case:
~> printf "Line #%2d\n" {1..3} > list.txt
~> cat list.txt
Line # 1
Line # 2
Line # 3
~> line=$(head -n1 list.txt && echo "$(tail -n +2 list.txt)" > list.txt)
~> echo $line
Line # 1
~> cat list.txt
Line # 2
Line # 3

Would using tail on N-1 lines and directing that into a file, followed by removing the old file, and renaming the new file to the old name do the job?
If I were doing this programmatically, I would read through the file and remember the file offset after reading each line, so I could seek back to that position to read the file with one less line in it.

Related

Crop Lines from multiple CSV files using bash

I have a directory of 40 or so csv's. Each csv file has an extra 10 lines at the top that I don't need. I'm new to bash commands, but I have found that I can use
tail -n +10 oldfile.csv > newfile.csv
to cut 10 lines from a file one at a time. How can I do this across all csv's in the directory? I have tried doing this:
for filename in *foo*; do echo tail -n +10 \"$filename\" > \"${filename}\"; done
From what I've read, I thought this would pass in every csv containing foo in its name, run the formula, and leave the filename alone. Where am I going wrong?
You cannot use the same file as input and output.
With sed, you can edit the file in place with the -i flag:
for f in *.csv; do
sed -i '1,10d' "$f"
done
or as one-liner for the command line:
for f in *.csv; do sed -i '1,10d' "$f"; done
As a side note, your tail should be tail -n +11 to output from the 11th line to the end of the file.
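For completeness, a corrected tail-based variant over all the CSVs (using a temp file per input, since tail has no in-place mode) might look like:
for f in *.csv; do
    tail -n +11 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done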
Use a proper loop as below. I am using the native ex editor, which Vim uses internally, for in-place editing, so you don't have to move the files back again with mv or any other command.
for file in *.csv
do
ex -sc '1d10|x' "$file"
done
The command moves to the first line, deletes 10 lines starting there, and then saves and closes the file.
As a command-line friendly one-liner:
for file in *.csv; do ex -sc '1d10|x' "$file"; done
The ex command is POSIX compatible and can work on all major platforms and distros.
In awk:
$ awk 'FNR>10{ print > "new-" FILENAME }' files*
Explained:
FNR>10 if current record number in current file is greater than 10, condition is true.
print well, output
> "new-" FILENAME redirect output to a new file named new-file, for example.
Edited to writing output to multiple files. Original which just outputed to screen was awk 'FNR>10' files*
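If the goal is to overwrite the originals rather than create new- copies, GNU awk 4.1 or later also has an in-place extension (assuming gawk is available):
gawk -i inplace 'FNR>10' *.csv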

Optimize sed for multiple replacements

I have a file, users.txt, with words like,
user1
user2
user3
I want to find these words in another file, data.txt and add a prefix to it. data.txt has nearly 500K lines. For example, user1 should be replaced with New_user1 and so on. I have written simple shell script like
for user in `cat users.txt`
do
sed -i 's/'${user}'/New_&/' data.txt
done
For ~1000 words, this program is taking minutes to process, which surprised me because sed is very fast when it comes to find and replace. I tried to refer to Optimize shell script for multiple sed replacements, but still not much improvement was observed.
Is there any other way to make this process faster?
sed is known to be very fast (probably second only to C).
Instead of sed 's/X/Y/g' input.txt, try sed '/X/ s/X/Y/g' input.txt. The latter is known to be faster.
Since sed works with one-line-at-a-time semantics, you could run it with parallel (on multi-core CPUs) like this:
cat huge-file.txt | parallel --pipe sed -e '/xxx/ s/xxx/yyy/g'
If you are working with plain ascii files, you could speed it up by using "C" locale:
LC_ALL=C sed -i -e '/xxx/ s/xxx/yyy/g' huge-file.txt
You can turn your users.txt into sed commands like this:
$ sed 's|.*|s/&/New_&/|' users.txt
s/user1/New_user1/
s/user2/New_user2/
s/user3/New_user3/
And then use this to process data.txt, either by writing the output of the previous command to an intermediate file, or with process substitution:
sed -f <(sed 's|.*|s/&/New_&/|' users.txt) data.txt
Your approach goes through all of data.txt for every single line in users.txt, which makes it slow.
If you can't use process substitution, you can use
sed 's|.*|s/&/New_&/|' users.txt | sed -f - data.txt
instead.
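If data.txt should be modified in place rather than written to stdout, the generated script can also be combined with -i (assuming GNU sed):
sed -i -f <(sed 's|.*|s/&/New_&/|' users.txt) data.txt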
Or, in one go, we can do something like this. Let us say we have a data file with 500k lines.
$> wc -l data.txt
500001 data.txt
$> ls -lrtha data.txt
-rw-rw-r--. 1 gaurav gaurav 16M Oct 5 00:25 data.txt
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a test file maybe
1|This is a test file maybe
499999|This is a test file maybe
500000|This is a test file maybe
Let us say that our users.txt has 3-4 keywords, which are to be prefixed with "ab_", in the file "data.txt"
$> cat users.txt
file
maybe
test
So we want to read users.txt and for every word, we want to change that word to a new word. For ex., "file" to "ab_file", "maybe" to "ab_maybe"..
We can run a while loop, reading the input words to be prefixed one by one, and run a perl command over the file with each word stored in a variable. In the example below, the read word is passed to the perl command as $word.
I timed this task and it completes fairly quickly. I did it on a CentOS 7 VM hosted on my Windows 10 machine.
time cat users.txt |while read word; do perl -pi -e "s/${word}/ab_${word}/g" data.txt; done
real 0m1.973s
user 0m1.846s
sys 0m0.127s
$> head -2 data.txt ; echo ; tail -2 data.txt
0|This is a ab_test ab_file ab_maybe
1|This is a ab_test ab_file ab_maybe
499999|This is a ab_test ab_file ab_maybe
500000|This is a ab_test ab_file ab_maybe
In the above code, we read the words test, file, and maybe, and changed them to ab_test, ab_file, and ab_maybe in the data.txt file. The head and tail output confirms the operation.

Remove commas post second occurrence of comma only in last line and check for flag

I have a bunch of files in a specified path, and in each of them I want to remove all commas after the second occurrence of a comma, in the last line only, in an efficient way.
I don't want the process to read each line; instead it should go directly to the last line and remove all commas after the second comma.
Also, I want a check to be made whether the last line contains EOF or not; if it does not, no changes are to be applied and we move to the next file.
Sample file:
A,111,aaa,A
B,222,bbb,B
X,EOF,,,,x,X
Output:
A,111,aaa,A
B,222,bbb,B
X,EOF,xX
Example:
for i in $(ls /mypath/*.csv); do
sed '$s/,$//' < $i
done
This should do what you are looking for.
Note: apparently sed does not provide the -i option on all platforms. If this is the case for your platform, you have to use a temporary file.
Note also (thanks to glenn jackman's comment on this): this might only work with the GNU sed implementation. You might need to adapt the solution for other implementations.
for i in /mypath/*.csv; do
    if [[ $(tail -n 1 "$i" | sed -n '/EOF/p') != '' ]]; then
        sed -i '$s/,//3g' "$i"
    fi
done
Use head to copy everything except the last line to a temporary file. Get the last line with tail, process it with sed and append it to the temporary file. Last but not least, replace the original file with the processed one.
for FILE in /mypath/*.csv ; do
    TMP_FILE="${FILE}.processed"
    head -n "-1" "$FILE" > "$TMP_FILE"
    tail -n "1" "$FILE" | sed 's/,//3g' >> "$TMP_FILE"
    mv -f "$TMP_FILE" "$FILE"
done
There is probably a more efficient inplace solution, but it does the job.

Passing input to sed, and sed info to a string

I have a list of files (~1000) and there is 1 file per line in my text file named: 'files.txt'
I have a macro that looks something like the following:
#!/bin/sh
b=$(sed '${1}q;d' files.txt)
cat > MyMacro_${1}.C << +EOF
myFile = new TFile("/MYPATHNAME/$b");
+EOF
and I use this input script by doing
./MakeMacro.sh 1
and later I want to do
./MakeMacro.sh 2
./MakeMacro.sh 3
...etc
So that it reads the n'th line of my files.txt and feeds that string to my created .C macro.
So that it reads the n'th line of my files.txt and feeds that string to my created .C macro.
Given this statement and your tags, I'm going to answer using shell tools and not really address the issue of the .c macro.
The first line of your script contains a sed script. There are numerous ways to get the Nth line from a text file. The simplest might be to use head and tail.
$ head -n "${i}" files.txt | tail -n 1
This takes the first $i lines of files.txt, and shows you the last line of that set.
$ sed -ne "${i}p" files.txt
This use of sed uses -n to avoid printing by default, then prints the $ith line. For better performance, try:
$ sed -ne "${i}{p;q;}" files.txt
This does the same, but quits after printing the line, so that sed doesn't bother traversing the rest of the file.
$ awk -v i="$i" 'NR==i' files.txt
This passes the shell variable $i into awk, then evaluates an expression that tests whether the number of records processed is the same as that variable. If the expression evaluates true, awk prints the line. For better performance, try:
$ awk -v i="$i" 'NR==i{print;exit}' files.txt
Like the second sed script above, this will quit after printing the line, so as to avoid traversing the rest of the file.
Plenty of ways you could do this by loading the file into an array as well, but those ways would take more memory and perform less well. I'd use one-liners if you can. :)
To take any of these one-liners and put it into your script, you already have the notation:
if expr "$i" : '[0-9][0-9]*$' >/dev/null; then
b=$(sed -ne "${i}{p;q;}" files.txt)
else
echo "ERROR: invalid line number" >&2; exit 1
fi
If I am understanding you correctly, you can do a for loop in bash to call the script multiple times with different arguments.
for i in `seq 1 n`; do ./MakeMacro.sh $i; done
Based on the OP's comment, it seems that he wants to submit the generated files to Condor. You can modify the loop above to include the condor submission.
for i in `seq 1 n`; do ./MakeMacro.sh $i; condor_submit <OutputFile> ; done
i=0
while IFS= read -r file
do
    ((i++))
    cat > "MyMacro_${i}.C" <<-EOF
myFile = new TFile("$file");
EOF
done < files.txt
Beware: <<- only strips leading tabs, so if you indent the here-document body or the EOF line, use tabs, not spaces.
I'm puzzled about why this is the way you want to do the job. You could have your C++ code read files.txt at runtime and it would likely be more efficient in most ways.
If you want to get the Nth line of files.txt into MyMacro_N.C, then:
{
    echo
    sed -n -e "${1}{s/.*/myFile = new TFile(\"&\");/p;q;}" files.txt
    echo
} > "MyMacro_${1}.C"
Good grief. The entire script should just be (untested):
awk -v nr="$1" 'NR==nr{printf "\nmyFile = new TFile(\"/MYPATHNAME/%s\");\n\n",$0 > ("MyMacro_"nr".C")}' files.txt
You can throw in a ;exit before the } if performance is an issue but I doubt if it will be.

How to proceed once a file contains something in shell

I am writing some BASH shell script that will continuously check a file to see if the file already contains "Completed!" before proceeding. (Of course, assume the file is being updated and will eventually contain the phrase "Completed!")
I am not sure how to do this. Thank you for your help.
You can do something like:
while ! grep -q -e 'Completed!' file ; do
sleep 1 # Or some other number of seconds
done
# At this point the file contains "Completed!"
Amongst the standard utilities, tail has an option to keep reading from a file: tail -f. So filter the output of tail -f.
<some_file tail -f -n +1 | grep 'Completed!' | head -n 1 >/dev/null
There may be a delay due to buffering. You can at least reduce the delay by using fewer tools in the pipeline. In fact, some implementations of tail never buffer when you do tail -f, so the following snippet will return as soon as Completed! is written to the file.
<some_file tail -f -n +1 | sed -e '/Completed!/ q'
This assumes that the file is being appended to by some other tool. If the file is overwritten by the data-producing program after you start tail, this solution won't work. You can search the file periodically. On some systems you can call a notification mechanism to know whenever the file changes, e.g. with inotifywait under Linux.
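A small sketch of the inotifywait variant (assuming the inotify-tools package is installed and the file is modified in place):
while ! grep -q 'Completed!' some_file; do
    # block until the file changes, then re-check
    inotifywait -qq -e modify some_file
done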
I've done this in Kornshell:
tail -f somefile | while read line
do
    echo "$line"
    [[ $line == *Completed!* ]] && break
done
Note no quotes around the *Completed!* string. This allows the double square brackets to do glob pattern matching instead of string matching.
This seems to work in BASH too. However, the line with the Completed must end in a NL. Otherwise, it'll take an extra line before it breaks the loop.
You can use grep too:
tail -f somefile | while read line
do
    echo "$line"
    echo "$line" | grep -iq "Completed!" && break
done
The -q parameter means quiet. If your grep doesn't take the -q parameter, you might have to redirect its output to /dev/null. The -i is ignore case. Whether you want to do that is up to you.
The advantage is that you aren't doing any processing unless there's a line to read. Using sleep may mean you miss the line, or that you're processing when no line has been added to the file.
When using grep in a pipe, you can turn on line-buffered mode by adding the --line-buffered option.
