Copying part of a large file using command line - shell

I've a text file with 2 million lines. Each line has some transaction information.
e.g.
23848923748, sample text, feild2 , 12/12/2008
etc
What I want to do is create a new file from a certain unique transaction number onwards. So I want to split the file at the line where this number exists.
How can I do this form the command line?
I can find the line by doing this:
cat myfile.txt | grep 23423423423

use sed like this
sed '/23423423423/,$!d' myfile.txt
Just confirm that the unique transaction number cannot appear as a pattern in some other part of the line (especially, before the correctly matching line) in your file.
There is already a 'perl' answer here, so, i'll give one more AWK way :-)
awk '{BEGIN{skip=1} /number/ {skip=0} // {if (skip!=1) print $0}' myfile.txt

On a random file in my tmp directory, this is how I output everything from the line matching popd onwards in a file named tmp.sh:
tail -n+`grep -n popd tmp.sh | cut -f 1 -d:` tmp.sh
tail -n+X matches from that line number onwards; grep -n outputs lineno:filename, and cut extracts just lineno from grep.
So for your case it would be:
tail -n+`grep -n 23423423423 myfile.txt | cut -f 1 -d:` myfile.txt
And it should indeed match from the first occurrence onwards.

It's not a pretty solution, but how about using -A parameter of grep?
Like this:
mc#zolty:/tmp$ cat a
1
2
3
4
5
6
7
mc#zolty:/tmp$ cat a | grep 3 -A1000000
3
4
5
6
7
The only problem I see in this solution is the 1000000 magic number. Probably someone will know the answer without using such a trick.

You can probably get the line number using Grep and then use Tail to print the file from that point into your output file.
Sorry I don't have actual code to show, but hopefully the idea is clear.

I would write a quick Perl script, frankly. It's invaluable for anything like this (relatively simple issues) and as soon as something more complex rears its head (as it will do!) then you'll need the extra power.
Something like:
#!/bin/perl
my $out = 0;
while (<STDIN>) {
if /23423423423/ then $out = 1;
print $_ if $out;
}
and run it using:
$ perl mysplit.pl < input > output
Not tested, I'm afraid.

Related

Sed through files without using for loop?

I have a small script which basically generates a menu of all the scripts in my ~/scripts folder and next to each of them displays a sentence describing it, that sentence being the third line within the script commented out. I then plan to pipe this into fzf or dmenu to select it and start editing it or whatever.
1 #!/bin/bash
2
3 # a script to do
So it would look something like this
foo.sh a script to do X
bar.sh a script to do Y
Currently I have it run a for loop over all the files in the scripts folder and then run sed -n 3p on all of them.
for i in $(ls -1 ~/scripts); do
echo -n "$i"
sed -n 3p "~/scripts/$i"
echo
done | column -t -s '#' | ...
I was wondering if there is a more efficient way of doing this that did not involve a for loop and only used sed. Any help will be appreciated. Thanks!
Instead of a loop that is parsing ls output + sed, you may try this awk command:
awk 'FNR == 3 {
f = FILENAME; sub(/^.*\//, "", f); print f, $0; nextfile
}' ~/scripts/* | column -t -s '#' | ...
Yes there is a more efficient way, but no, it doesn't only use sed. This is probably a silly optimization for your use case though, but it may be worthwhile nonetheless.
The inefficiency is that you're using ls to read the directory and then parse its output. For large directories, that causes lots of overhead for keeping that list in memory even though you only traverse it once. Also, it's not done correctly, consider filenames with special characters that the shell interprets.
The more efficient way is to use find in combination with its -exec option, which starts a second program with each found file in turn.
BTW: If you didn't rely on line numbers but maybe a tag to mark the description, you could also use grep -r, which avoids an additional process per file altogether.
This might work for you (GNU sed):
sed -sn '1h;3{H;g;s/\n/ /p}' ~/scripts/*
Use the -s option to reset the line number addresses for each file.
Copy line 1 to the hold space.
Append line 3 to the hold space.
Swap the hold space for the pattern space.
Replace the newline with a space and print the result.
All files in the directory ~/scripts will be processed.
N.B. You may wish to replace the space delimiter by a tab or pipe the results to the column command.

Utilising variables in tail command

I am trying to export characters from a reference file in which their byte position is known. To do this, I have a long list of numbers stored as a variable which have been used as the input to a tail command.
For example, the reference file looks like:
ggaaatgcattcaaacatgc
And the list looks like:
5
10
7
15
I have tried using this code:
list=$(<pos.txt)
echo "$list"
cat ref.txt | tail -c +"list" | head -c1 > out.txt
However, it keeps returning "invalid number of bytes: '+5\n10\n7\n15...'"
My expected output would be
a
t
g
a
...
Can anybody tell me what I'm doing wrong? Thanks!
It looks like you are trying to access your list variable in your tail command. You can access it like this: $list rather than just using quotes around it.
Your logic is flawed even after fixing the variable access. The list variable includes all lines of your list.txt file. Including the newline character \n which is invisible in many UIs and programs, but it is of course visible when you are manually reading single bytes. You need to feed the lines one by one to make it work properly.
Also unless those numbers are indexes from the end, you need to feed them to head instead of tail.
If I understood what you are attempting to do correctly, this should work:
while read line
do
head -c $line ref.txt | tail -c 1 >> out.txt
done < pos.txt
The reason for your command failure is simple. The variable list contains a multi-line string stored from the pos.txt files including newlines. You cannot pass not more than one integer value for the -c flag.
Your attempts can be fixed quite easily with removing calls to cat and using a temporary variable to hold the file content
while IFS= read -r lineNo; do
tail -c "$lineNo" ref.txt | head -c1
done < pos.txt
But then if your intentions is print the desired output in a new-line every time, head does not output that way. It just forms a string atga for your given input in a single line and not across multiple lines with one character at each line.
As Gordon mentions in one of the comments, for much more efficient FASTA files processing, you could just use one invocation of awk though (skipping multiple forks to head/tail). Your provided input does not involve any headers to skip which would be straightforward as
awk ' FNR==NR{ n = split($0,arr,""); for(i=1;i<=n;i++) hash[i] = arr[i] }
( $0 in hash ){ print hash[$0] } ' ref.txt pos.txt
You could use cut instead of tail:
pos=$(<pos.txt)
cut -c ${pos//$'\n'/,} --output-delimiter=$'\n' ref.txt
Or just awk:
awk -F '' 'NR==FNR{c[$0];next} {for(i in c) print $i}' pos.txt ref.txt
both yield:
a
g
t
a

Delete everything after a certain line in bash

I was wondering if there was a way to delete everything after a certain line of a text file in bash. So say there's a text file with 10 lines, and I want to delete every line after line number 4, so only the first 4 lines remained, how would I go about doing that?
You can use GNU sed:
sed -i '5,$d' file.txt
That is, 5,$ means the range line 5 until the end, and d means to delete.
Only the first 4 lines will remain.
The -i flag tells sed to edit the file in-place.
If you have only BSD sed, then the -i flag requires a backup file suffix:
sed -i.bak '5,$d' file.txt
As #ephemient pointed out, while this solution is simple,
it's inefficient because sed will still read the input until the end of the file, which is unnecessary.
As #agc pointed out, the inverse logic of my first proposal might be actually more intuitive. That is, do not print by default (-n flag),
and explicitly print range 1,4:
sed -ni.bak 1,4p file.txt
Another simple alternative, assuming that the first 4 lines are not excessively long and so they easily fit in memory, and also assuming that the 4th line ends with a newline character,
you can read the first 4 lines into memory and then overwrite the file:
lines=$(head -n 4 file.txt)
echo "$lines" > file.txt
Minor refinements on Janos' answer, ephemient's answer, and cdark's comment:
Simpler (and faster) sed code:
sed -i 4q file
When a filter util can't directly edit a file, there's
sponge:
head -4 file | sponge file
Most efficient for Linux might be truncate -- coreutils sibling util to fallocate, which offers the same minimal I/O of ephemient's more portable, (but more complex), dd-based answer:
truncate -s `head -4 file | wc -c` file
The sed method that #janos is simple but inefficient. It will read every line from the original file, even ones it could ignore (although that can be fixed using 4q), and -i actually creates a new file (which it renames to replace the original file). And there's the annoying bit where you need to use sed -i '5,$d' file.txt with GNU sed but sed -i '' '5,$d' file.txt with BSD sed in order to remove the existing file instead of leaving a backup.
Another method that performs less I/O:
dd bs=1 count=0 if=/dev/null of=file.txt \
seek=$(grep -b ^ file.txt | tail -n+5 | head -n1 | cut -d: -f1)
grep -b ^ file.txt prints out byte offsets on each line, e.g.
$ yes | grep -b ^
0:y
2:y
4:y
...
tail -n+5 skips the first 4 lines, outputting the 5th and subsequent lines
head -n1 takes only the next line (e.g. only the 5th line)
After head reads the one line, it will exit. This causes tail to exit because it has nowhere to output to anymore. This causes grep to exit for the same reason. Thus, the rest of file.txt does not need to be examined.
cut -d: -f1 takes only the first part before the : (the byte offset)
dd bs=1 count=0 if=/dev/null of=file.txt seek=N
using a block size of 1 byte, seek to block N of file.txt
copy 0 blocks of size 1 byte from /dev/null to file.txt
truncate file.txt here (because conv=notrunc was not given)
In short, this removes all data on the 5th and subsequent lines from file.txt.
On Linux there is a command named fallocate which can similarly extend or truncate a file, but that's not portable.
UNIX filesystems support efficiently truncating files in-place, and these commands are portable. The downside is that it's more work to write out.
(Also, dd will print some unnecessary stats to stderr, and will exit with an error if the file has fewer than 5 lines, although in that case it will leave the existing file contents in place, so the behavior is still correct. Those can be addressed also, if needed.)
If I don't know the line number, merely the line content (I need to know that there is nothing below the line containing 'knowntext' that I want to preserve.), then I use.
sed -i '/knowntext/,$d' inputfilename
to directly alter the file, or to be cautious
sed '/knowntext/,$d' inputfilename > outputfilename
where inputfilename is unaltered, and outputfilename contains the truncated version of the input.
I am not competent to comment on the efficiency of this, but I know that files of 20kB or so are dealt with faster than I can blink.
Using GNU awk (v. 4.1.0+, see here). First we create a test file (NOTICE THE DISCLAIMER):
$ seq 1 10 > file # THIS WILL OVERWRITE FILE NAMED file WITH TEST DATA
Then the code and validation (WILL MODIFY THE ORIGINAL FILE NAMED file):
$ awk -i inplace 'NR<=4' file
$ cat file
1
2
3
4
Explained:
$ awk -i inplace ' # edit is targetted to the original file (try without -i ...)
NR<=4 # output first 4 records
' file # file
You could also exit on line NR==5 which would be quicker if you redirected the output of the program to a new file (remove # for action) which would be the same as head -4 file > new_file:
$ awk 'NR==5{exit}1' file # > new_file
When testing, don't forget the seq part first.

How to quickly check a .gz file without unzip? [duplicate]

How to get the first few lines from a gziped file ?
I tried zcat, but its throwing an error
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing
them.
-d --decompress --uncompress
Decompress.
On some systems (e.g., Mac), you need to use gzcat.
On a mac you need to use the < with zcat:
zcat < CONN.20111109.0057.gz|head
If a continuous range of lines needs be, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where the lines between 5th and 10th lines (both inclusive) of file.gz are extracted into a new subFile. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p;6q' > subFile
which extracts the 1st line and jumps over 4 lines and picks the 5th line and so on.
If you want to use zcat, this will show the first 10 rows
zcat your_filename.gz | head
Let's say you want the 16 first row
zcat your_filename.gz | head -n 16
This awk snippet will let you show not only the first few lines - but a range you can specify. It will also add line numbers which i needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one liner above. In awk NR is a built-in variable (Number of records found so far) which usually is equivalent to a line number. the from and to variable are picked up from the command line via the -v options.
NR>=from {
print NR,$0;
if (NR>=to)
exit 1
}

'grep +A': print everything after a match [duplicate]

This question already has answers here:
How to get the part of a file after the first line that matches a regular expression
(12 answers)
Closed 7 years ago.
I have a file that contains a list of URLs. It looks like below:
file1:
http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....
I want to get all the records after: http://www.yahoo.com, results looks like below:
file2:
http://www.baidu.com
http://www.yandex.com
....
I know that I could use grep to find the line number of where yahoo.com lies using
grep -n 'http://www.yahoo.com' file1
3 http://www.yahoo.com
But I don't know how to get the file after line number 3. Also, I know there is a flag in grep -A print the lines after your match. However, you need to specify how many lines you want after the match. I am wondering is there something to get around that issue. Like:
Pseudocode:
grep -n 'http://www.yahoo.com' -A all file1 > file2
I know we could use the line number I got and wc -l to get the number of lines after yahoo.com, however... it feels pretty lame.
AWK
If you don't mind using AWK:
awk '/yahoo/{y=1;next}y' data.txt
This script has two parts:
/yahoo/ { y = 1; next }
y
The first part states that if we encounter a line with yahoo, we set the variable y=1, and then skip that line (the next command will jump to the next line, thus skip any further processing on the current line). Without the next command, the line yahoo will be printed.
The second part is a short hand for:
y != 0 { print }
Which means, for each line, if variable y is non-zero, we print that line. In AWK, if you refer to a variable, that variable will be created and is either zero or empty string, depending on context. Before encounter yahoo, variable y is 0, so the script does not print anything. After encounter yahoo, y is 1, so every line after that will be printed.
Sed
Or, using sed, the following will delete everything up to and including the line with yahoo:
sed '1,/yahoo/d' data.txt
This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is
START , STOP COMMAND
except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1); a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes, meaning "the first line that matches this regexp". (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
So, you can do what you want like so:
sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2
The -n means "don't print anything unless specifically told to", and the -e directive means "from the first appearance of a line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."
This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:
sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2
which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
awk '/yahoo/ ? c++ : c' file1
Or golfed
awk '/yahoo/?c++:c' file1
Result
http://www.baidu.com
http://www.yandex.com
This is most easily done in Perl:
perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file
In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.
Using this script:
# Get index of the "yahoo" word
index=`grep -n "yahoo" filepath | cut -d':' -f1`
# Get the total number of lines in the file
totallines=`wc -l filepath | cut -d' ' -f1`
# Subtract totallines with index
result=`expr $total - $index`
# Gives the desired output
grep -A $result "yahoo" filepath

Resources