How to quickly check a .gz file without unzip? [duplicate] - bash

How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.

zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
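
If you want to confirm which tools are actually installed before switching, a quick check (a minimal sketch; gzip --version is supported by both GNU and BSD gzip, though the output format varies):
command -v gzip gzcat zcat    # see which of these exist on this system
gzip --version | head -n 1    # confirm gzip itself is present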

On some systems (e.g., Mac), you need to use gzcat.

On a Mac you need to use input redirection (<) with zcat:
zcat < CONN.20111109.0057.gz|head

If a continuous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new file, subFile, and sed quits as soon as line 11 is reached. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p' > subFile
which extracts line 1, jumps over 4 lines, picks line 6, and so on. The first~step address is a GNU sed extension; without GNU sed, see the awk sketch below.
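An equivalent awk sketch (no GNU sed required) handles both cases, and also stops reading early for the continuous range:
gunzip -c file.gz | awk 'NR>10{exit} NR>=5' > subFile    # lines 5 through 10
gunzip -c file.gz | awk 'NR % 5 == 1' > subFile          # lines 1, 6, 11, ...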

If you want to use zcat, this will show the first 10 lines:
zcat your_filename.gz | head
Say you want the first 16 lines:
zcat your_filename.gz | head -n 16
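To browse the file interactively instead of taking a fixed number of lines, zless (shipped alongside gzip on most systems) pages through the decompressed stream:
zless your_filename.gz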

This awk snippet will let you show not only the first few lines, but any range you specify. It will also add line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (number of records read so far), which is usually equivalent to the line number. The from and to variables are picked up from the command line via the -v options.
NR>=from {
print NR,$0;
if (NR>=to)
exit 1
}
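
If you use this often, it can be wrapped in a small shell function (a hypothetical helper named gzrange, sketched here):
# usage: gzrange file.gz FROM TO
gzrange() {
    gunzip -c "$1" | awk -v from="$2" -v to="$3" \
        'NR>=from { print NR, $0; if (NR>=to) exit }'
}
gzrange file.gz 10 20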

Related

Trying to sort a large csv file but the output is not being written to another file

I have a (very) large CSV file, around 70 GB, which I am trying to sort using the sort command. Try as I might, the output is not being written to the file. Here is what I tried:
sort -T /data/data/.tmp -t "," -k 38 /data/data/raw/KKR.csv > /data/data/raw/KKR_38.csv
sort -T /data/data/.tmp -t "," -k 38 /data/data/raw/KKR.csv -o /data/data/raw/KKR-38.csv
What happens is that the KKR_38.csv file is created and its size is the same as the KKR.csv file but there is nothing inside it. When I do
head -n 100 /data/data/raw/KKR_38.csv
It prints out 100 empty lines.
If you sort, it is quite normal that the empty lines come first. Try this:
tail -100 /data/data/raw/KKR_38.csv
You can use the following commands if you want to ignore the empty lines:
cat -s /data/data/raw/KKR_38.csv | less    # squeeze successive empty lines into one
or if you want to remove them:
sed '/^$/d' /data/data/raw/KKR_38.csv | less
You can redirect the output of those commands to create another file without the empty lines (watch out for the space this takes on your file system).
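To gauge how much of the output is actually empty lines before deciding whether the sort went wrong, counting them is cheap:
grep -c '^$' /data/data/raw/KKR_38.csv     # number of empty lines
grep -vc '^$' /data/data/raw/KKR_38.csv    # number of non-empty lines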

Delete everything after a certain line in bash

I was wondering if there was a way to delete everything after a certain line of a text file in bash. So say there's a text file with 10 lines, and I want to delete every line after line number 4, so only the first 4 lines remained, how would I go about doing that?
You can use GNU sed:
sed -i '5,$d' file.txt
That is, 5,$ means the range line 5 until the end, and d means to delete.
Only the first 4 lines will remain.
The -i flag tells sed to edit the file in-place.
If you have only BSD sed, then the -i flag requires a backup file suffix:
sed -i.bak '5,$d' file.txt
As @ephemient pointed out, while this solution is simple,
it's inefficient because sed will still read the input until the end of the file, which is unnecessary.
As @agc pointed out, the inverse logic of my first proposal might actually be more intuitive. That is, do not print by default (-n flag),
and explicitly print range 1,4:
sed -ni.bak 1,4p file.txt
Another simple alternative, assuming that the first 4 lines are not excessively long and so they easily fit in memory, and also assuming that the 4th line ends with a newline character,
you can read the first 4 lines into memory and then overwrite the file:
lines=$(head -n 4 file.txt)
echo "$lines" > file.txt
Minor refinements on Janos' answer, ephemient's answer, and cdark's comment:
Simpler (and faster) sed code:
sed -i 4q file
When a filter util can't directly edit a file, there's
sponge:
head -4 file | sponge file
Most efficient on Linux might be truncate, the coreutils sibling of fallocate, which offers the same minimal I/O as ephemient's more portable (but more complex) dd-based answer:
truncate -s `head -4 file | wc -c` file
The sed method that @janos suggests is simple but inefficient. It will read every line from the original file, even ones it could ignore (although that can be fixed using 4q), and -i actually creates a new file (which it renames to replace the original file). And there's the annoying bit where you need to use sed -i '5,$d' file.txt with GNU sed but sed -i '' '5,$d' file.txt with BSD sed in order to avoid leaving a backup file.
Another method that performs less I/O:
dd bs=1 count=0 if=/dev/null of=file.txt \
seek=$(grep -b ^ file.txt | tail -n+5 | head -n1 | cut -d: -f1)
grep -b ^ file.txt prints out byte offsets on each line, e.g.
$ yes | grep -b ^
0:y
2:y
4:y
...
tail -n+5 skips the first 4 lines, outputting the 5th and subsequent lines
head -n1 takes only the next line (e.g. only the 5th line)
After head reads the one line, it will exit. This causes tail to exit because it has nowhere to output to anymore. This causes grep to exit for the same reason. Thus, the rest of file.txt does not need to be examined.
cut -d: -f1 takes only the first part before the : (the byte offset)
dd bs=1 count=0 if=/dev/null of=file.txt seek=N
using a block size of 1 byte, seek to block N of file.txt
copy 0 blocks of size 1 byte from /dev/null to file.txt
truncate file.txt here (because conv=notrunc was not given)
In short, this removes all data on the 5th and subsequent lines from file.txt.
On Linux there is a command named fallocate which can similarly extend or truncate a file, but that's not portable.
UNIX filesystems support efficiently truncating files in-place, and these commands are portable. The downside is that it's more work to write out.
(Also, dd will print some unnecessary stats to stderr, and will exit with an error if the file has fewer than 5 lines, although in that case it will leave the existing file contents in place, so the behavior is still correct. Those can be addressed also, if needed.)
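For example, a sketch that suppresses dd's statistics and skips the truncation entirely when the file has fewer than 5 lines (2>/dev/null is the portable way to silence dd; GNU dd also accepts status=none):
offset=$(grep -b ^ file.txt | tail -n+5 | head -n1 | cut -d: -f1)
if [ -n "$offset" ]; then
    dd bs=1 count=0 if=/dev/null of=file.txt seek="$offset" 2>/dev/null
fi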
If I don't know the line number, merely the line content (and I know that there is nothing below the line containing 'knowntext' that I want to preserve), then I use
sed -i '/knowntext/,$d' inputfilename
to directly alter the file, or to be cautious
sed '/knowntext/,$d' inputfilename > outputfilename
where inputfilename is unaltered, and outputfilename contains the truncated version of the input.
I am not competent to comment on the efficiency of this, but I know that files of 20kB or so are dealt with faster than I can blink.
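If the line containing knowntext should itself be kept and only everything after it deleted, sed's q command prints up to and including the first match and then quits:
sed '/knowntext/q' inputfilename > outputfilename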
Using GNU awk (v. 4.1.0+, which provides the inplace extension). First we create a test file (NOTICE THE DISCLAIMER):
$ seq 1 10 > file # THIS WILL OVERWRITE FILE NAMED file WITH TEST DATA
Then the code and validation (WILL MODIFY THE ORIGINAL FILE NAMED file):
$ awk -i inplace 'NR<=4' file
$ cat file
1
2
3
4
Explained:
$ awk -i inplace ' # edit is targeted to the original file (try without -i ...)
NR<=4 # output first 4 records
' file # file
You could also exit at line NR==5, which would be quicker if you redirect the output of the program to a new file (remove the # to do so); that would be the same as head -4 file > new_file:
$ awk 'NR==5{exit}1' file # > new_file
When testing, don't forget the seq part first.

Iterating grep over and over. How can I make my script faster?

I have to find a string of numbers from one file in another file.
My code is this:
#!/bin/sh
IFS="F"
while read f1 f2
do
LC_ALL=C fgrep -m 1 "$f1" BC_Tel.inp
done < telephonelist.txt
The string of numbers are located in telephonelist.txt. The format of this text file is as follows:
8901040000001304669F 370040000130466
8901040000001317380F 370040000131738
8901040000001330045F 370040000133004
8901040000001330052F 370040000133005
8901040000001330060F 370040000133006
I'm looking for the lines with the above numbers delimited by 'F' in BC_Tel.inp, which has the following format:
981040000030289765F1 655F370D1E86260ED550A2D6F80EFF96 01000045384136453332440303FFFFFFFFFFFFFFFF0000 01000037333643383234380303FFFFFFFFFFFFFFFF0000 083907400030289765 00000031323334FFFFFFFF030334303733323638310AFF 01000034383532FFFFFFFF030334333738333137320AFF 0020 01007F107FD2266C31249530FC531B474F6D44482C007F007F007F007F007F007F007F007F007F007F007F007F007F007F007F107F97AB34277D5378AEC893716281F99ABC007F007F007F007F007F007F007F007F007F007F007F007F007F007F007F107F6608B51E4378BE23072E843D6741A184007F007F007F007F007F007F007F007F007F007F007F007F007F007F 636C8D46973FAE4C1BD181BB4E0D4DA2A5E0455E86406CCF40F309F63470CE07 000003817826FF0187494010083A65626501586519104106 083907400030289765636C8D46973FAE4C1BD181BB4E0D4DA2 080900000000101003636C8D46973FAE4C1BD181BB4E0D4DA2 8901040000038279561 40732681
telephonelist.txt and BC_Tel.inp are huge files with over a million lines. The script works fine but I want to make it faster. I'm basically running over the txt file once, but I'm grepping over and over the .inp file. How do I go about making this process faster?
tl;dr
I want to optimize my code so it runs faster.
A single grep will do it:
cut -d"F" -f1 telephonelist.txt | grep -F -m1 -f- BC_Tel.inp
The -f option to grep reads the patterns from a file; here the filename - means standard input. One caveat: with a single grep invocation, -m 1 stops after the first matching line overall rather than one match per pattern (as the original per-number fgrep calls did), so drop -m1 if you expect a match for every number.

unix delete rows from multiple files using input from another file

I have multiple (1086) files (.dat) and in each file I have 5 columns and 6384 lines.
I have a single file named "info.txt" which contains 2 columns and 6883 lines. First column gives the line numbers (to delete in .dat files) and 2nd column gives a number.
1 600
2 100
3 210
4 1200
etc...
I need to read in info.txt, find every line number whose value in the 2nd column is less than 300 (so 2 and 3 in the above example), then feed those line numbers to sed, awk, or grep and delete those lines from each .dat file. (So in the above example I will delete the 2nd and 3rd row of every .dat file.)
A more general form of the question would be (I suppose):
How do I read numbers as input from a file, then use them as the row numbers to delete from multiple files?
I am using bash but ksh help is also fine.
sed -i "$(awk '$2 < 300 { print $1 "d" }' info.txt)" *.dat
The awk script creates a simple sed script to delete the selected lines; that script is then run on all the *.dat files.
(If your sed lacks the -i option, you will need to write to a temporary file in a loop. On OSX and some *BSD you need -i "" with an empty argument.)
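To see what the generated script looks like before applying it to the .dat files, run the awk step on its own; with the sample info.txt above it produces one delete command per qualifying line:
$ awk '$2 < 300 { print $1 "d" }' info.txt
2d
3d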
This might work for you (GNU sed):
sed -rn 's/^(\S+)\s*([1-9]|[1-9][0-9]|[12][0-9][0-9])$/\1d/p' info.txt |
sed -i -f - *.dat
This builds a script of the lines to delete from the info.txt file and then applies it to the .dat files.
N.B. the regexp is for numbers ranging from 1 to 299, as per the OP's request.
# create action list (read the file directly: piping into while would run
# the loop in a subshell and ActionReq would be lost)
ActionReq=""
while read LineRef Index
do
    if [ "${Index}" -lt 300 ]
    then
        ActionReq="${ActionReq}${LineRef} b
"
    fi
done < info.txt
# apply action on files
for EachFile in YourListSelectionOf.dat
do
    sed -i -n -e "${ActionReq}p" "${EachFile}"
done
(not tested, no linux here). sed has a limitation here compared to the request: the numeric test of the second value against 300 has to be done in the shell loop; awk is more efficient at that kind of comparison.
I use sed in the second loop to avoid reading/writing each file once per line to delete. I think the second loop could be avoided entirely by giving sed the whole list of files at once instead of going file by file.
This should create new .dat files with a _new.dat suffix, but I haven't tested it:
awk 'FNR==NR{if($2<300)a[$1]=$1;next}
     !(FNR in a){print > (FILENAME"_new.dat")}' info.txt *.dat

Copying part of a large file using command line

I've a text file with 2 million lines. Each line has some transaction information.
e.g.
23848923748, sample text, feild2 , 12/12/2008
etc
What I want to do is create a new file from a certain unique transaction number onwards. So I want to split the file at the line where this number exists.
How can I do this from the command line?
I can find the line by doing this:
cat myfile.txt | grep 23423423423
Use sed like this:
sed '/23423423423/,$!d' myfile.txt
Just confirm that the unique transaction number cannot appear as a pattern in some other part of the line (especially, before the correctly matching line) in your file.
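Given the sample format (the transaction number at the start of the line, followed by a comma), anchoring the pattern reduces the chance of matching the number somewhere else on an earlier line, a sketch assuming that layout:
sed '/^23423423423,/,$!d' myfile.txt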
There is already a Perl answer here, so I'll give one more awk way :-)
awk 'BEGIN{skip=1} /number/{skip=0} {if (skip!=1) print $0}' myfile.txt
where number stands for the transaction number you are looking for.
On a random file in my tmp directory, this is how I output everything from the line matching popd onwards in a file named tmp.sh:
tail -n+`grep -n popd tmp.sh | cut -f 1 -d:` tmp.sh
tail -n+X prints from line number X onwards; grep -n prefixes each matching line with its line number (lineno:line), and cut extracts just the line number from that.
So for your case it would be:
tail -n+`grep -n 23423423423 myfile.txt | cut -f 1 -d:` myfile.txt
And it should indeed match from the first occurrence onwards.
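Adding -m 1 to grep stops the scan at the first match instead of reading all 2 million lines (supported by both GNU and BSD grep):
tail -n+`grep -n -m1 23423423423 myfile.txt | cut -f 1 -d:` myfile.txt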
It's not a pretty solution, but how about using the -A parameter of grep?
Like this:
mc@zolty:/tmp$ cat a
1
2
3
4
5
6
7
mc@zolty:/tmp$ cat a | grep 3 -A1000000
3
4
5
6
7
The only problem I see in this solution is the 1000000 magic number. Probably someone will know the answer without using such a trick.
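One way to avoid the magic number is to use the file's own line count as the context size, which is always large enough (a small sketch; the command substitution is left unquoted so the shell strips the whitespace wc may print):
cat a | grep -A $(wc -l < a) 3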
You can probably get the line number using Grep and then use Tail to print the file from that point into your output file.
Sorry I don't have actual code to show, but hopefully the idea is clear.
I would write a quick Perl script, frankly. It's invaluable for anything like this (relatively simple issues) and as soon as something more complex rears its head (as it will do!) then you'll need the extra power.
Something like:
#!/usr/bin/perl
my $out = 0;
while (<STDIN>) {
    $out = 1 if /23423423423/;
    print $_ if $out;
}
and run it using:
$ perl mysplit.pl < input > output
Not tested, I'm afraid.
