Find unique items in a column of a gzip file - sorting

I want to return the number of unique items in one column of my gzip file.
To sort on a normal file I know you can use something like:
sort -u -t, -k1,1 filename | wc -l
but when I run this on a gzip file I get:
?BC??\ks?ʑ???
Is it possible to adapt this command to find the unique items in a column of a gzip file?

Okay so I actually figured it out!
gzcat vcf_del.vcf.gz | cut -f 2 | sort | uniq | wc -l (and zcat also works)
Then, if there are parts of the file you do not want (for instance, VCF files have a series of header lines starting with "#"), you can simply filter them out first:
gzcat vcf_del.vcf.gz | awk '!/^#/{print $0}' | cut -f 2 | sort | uniq | wc -l
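As an aside, sort | uniq can be collapsed into sort -u (and awk's print is implicit), so an equivalent, slightly shorter form is:
gzcat vcf_del.vcf.gz | awk '!/^#/' | cut -f 2 | sort -u | wc -l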

The gzip package comes with the zcat program, which works just like cat but on gzipped files.
zcat filename | sort -u -t, -k1,1 | wc -l

You cannot run search and sort commands on a compressed file directly; you have to decompress it first and run your commands on the output of gzip.
You can try the command below:
gunzip -c filename | sort -u -t, -k1,1 | wc -l

Related

sort and get unique files after removing extension of filename

I am trying to strip everything after the second underscore from each filename and get the unique results. I saw many answers and put together a script. It works fine up to the cut command, but it does not give unique filenames. I have tried the following command but I am not getting the desired output.
script used:
for filename in ${path/to/files}/*.gz;
do
fname=$(basename ${filename} | cut -f 1-2 -d "_" | sort | uniq)
echo "${fname}"
done
file example:
filename1_00_1.gz
filename1_00_2.gz
filename2_00_1.gz
filename2_00_2.gz
Required output:
filename1_00
filename2_00
So, with all of that said, how can I get a unique list of files in the required output format?
Thanks a lot in advance.
Apply uniq and sort after you are done printing the files (the glob already expands in sorted order, so uniq can collapse the duplicates before the final sort):
for filename in ${path/to/files}/*.gz;
do
fname=$(basename ${filename} | cut -f 1-2 -d "_" );
echo "${fname}";
done | uniq | sort
Or just do
for filename in ${path/to/files}/*.gz; do echo ${filename%_*.gz}; done | uniq | sort
for f in *.gz; do echo ${f%_*.gz}; done | sort | uniq
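For completeness, a loop-free variant (run from the directory holding the .gz files, and assuming the names contain no newlines):
printf '%s\n' *.gz | cut -f 1-2 -d '_' | sort -u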

sort -R is not an option in my OS

I have a couple of OSes that do not have sort -R, which I want to use to generate a random list from a txt file I have. For example, I am trying to use the following command:
sort -R file | head -20000 > newfile
I looked up the man pages on these OSes and, sure enough, the -R option is not listed.
What is an alternative that can generate a random list from a file and print to a new file?
CentOS 5
Try:
shuf file | head -n 20000 > newfile
or:
cat file | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);'
You can use the shuf command, if it is installed.
shuf can either take a file as its input
shuf file | head -n 20000 > newfile
or read from stdin
cat file | shuf | head -n 20000 > newfile
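If your shuf supports it, the -n option does the sampling itself, so the head isn't needed:
shuf -n 20000 file > newfile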
cat file | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2 | head -20000 > newfile
This is working out for me.
cat ALLEMAILS.txt | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2 | head -20000 | tee 20000random.txt
The tee at the end lets you watch the output as it is produced while still saving it to 20000random.txt.

Get most popular domains from log

I am trying to get the most popular domains from a log file.
The log format is like this:
197.123.43.59, 27/May/2015:01:00:11 -0600, https://m.facebook.com/
I am interested only in the domain, and I want output as follows:
XXXX facebook.com
where XXXX is the number of matching entries in the log.
A one-liner unix command, anyone?
Edit
I tried the following
grep -i * sites.log | sort | uniq -c | sort -nr | head -10 &> popular.log
but popular.log is empty, implying that the command is wrong.
perl -nle '$d{$1}++ if m!//([^/]+)!; END {foreach(sort {$d{$b} <=> $d{$a}} keys(%d)) {print "$d{$_}\t$_"};}' your.log
if you don't mind Perl. (The comparator sorts the domains by count, most popular first.)
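Run against the sample log line above, this should print something like:
1	m.facebook.com
Note that the regex keeps the full host, subdomain included, rather than collapsing m.facebook.com down to facebook.com.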
uniq -c counts unique occurrences, but requires sorted input.
sort sorts a stream of data.
grep has a flag, -o, which prints only the part of each line that matched the regex.
These three parts put together are what you need to perform a map/reduce on this data and get the result you want.
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c
The grep extracts just the top-level domain and the domain name, the entries are sorted by sort, and uniq counts the occurrences.
Adding a reverse numeric sort at the end puts the most frequent entries at the top:
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c | sort -nr
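If you only want the top few, adding a head (as in the other answers here) trims the list, e.g. for the ten most popular:
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c | sort -nr | head -10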

How to find most frequent string in file

I have a question about a bash script. Let's say there is a file which contains lines; each line has a path to a file and a date. The problem is how to find the most frequent path.
Thanks in advance.
Here's a suggestion
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
# \____________________/   \__/   \_____/   \______/   \______/
# select the file column   sort   print     sort on    print top
#                          files  counts    count      result
Example use:
$ cat file.txt
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileB jan:17:13:46:27:2015
/home/admin/fileC jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
3 /home/admin/fileA
You can strip the leading count (the 3) from the final result with another cut (or awk).
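One way to do that stripping, if the paths contain no spaces, is an awk at the end instead of a second cut (uniq -c pads the count with spaces, which makes cut fiddly):
cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1 | awk '{print $2}'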
Reverse the lines, cut off the beginning (which is now the date), reverse them again, then sort and count unique lines:
cat file.txt | rev | cut -b 22- | rev | sort | uniq -c
If you're absolutely sure you won't have whitespace in your paths, you can avoid rev altogether:
cat file.txt | cut -d " " -f 1 | sort | uniq -c
If the output is too long to inspect visually, aioobe's suggestion of following this with sort -rn | head -n1 will serve you well.
It's worth noting, as aioobe mentioned, that many unix commands optionally take a file argument. By using it you can avoid the extra cat at the beginning and supply the file directly to the next command:
cat file.txt | rev | ... vs rev file.txt | ...
While I personally find the first option both easier to remember and understand, the second is preferred by many (most?) people, as it saves system resources (specifically, the memory and process overhead of an extra process) and can perform better in some specific use cases. Wikipedia's cat article discusses this in detail.
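For example, the first pipeline above with the leading cat dropped:
rev file.txt | cut -b 22- | rev | sort | uniq -c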

Append xargs argument number as prefix

I want to analyze the most frequently occurring entries in (a column of) a logfile. To write the detailed results, I am creating new directories from the output of something along the lines of:
cat logs| cut -d',' -f 6 | sort | uniq -c | sort -rn | head -10 | \
awk '{print $2}' |xargs mkdir -p
Is there a way to create the directories with the sequence number of the argument, as processed by xargs, as a prefix? For example, if "oranges" is the most frequent entry in the column, the directory created should be named "1.oranges", and so on.
A quick (and dirty?) solution could be to pipe your directory names through cat -n in their proper order and then remove the whitespace separating the line number from the directory name, before passing them to xargs.
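A rough sketch of that route (the sed collapses cat -n's padded line number and the tab after it into an "N." prefix):
cat logs | cut -d',' -f 6 | sort | uniq -c | sort -rn | head -10 | \
awk '{print $2}' | cat -n | sed 's/^[[:space:]]*\([0-9]*\)[[:space:]]*/\1./' | xargs mkdir -p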
A better solution would be to modify your awk command:
... | awk '{ print NR "." $2 }' | xargs mkdir -p
The NR variable contains the record (i.e. line) number.
