sorting with terminal after grepping

I was hoping someone might be able to shed some light on how I could sort a set of grepped values in Unix.
For example, if I have a list such as:
qp_1_v2
qp_50_v1
qp_51_v4
qp_52_v1
qp_53_v1
qp_54_v2
qp_2_v1
Is there a way to sort numerically using the wildcard, i.e. sort qp_*_v1, where * would be read as a number and sorted on that (ignoring anything that comes before and after the *)? The problem I'm finding currently is that qp_52_v1 is always read as a string, so I have to cut away qp_ and _v1 to leave only the number and then sort.
I hope this makes sense...
Thanks in advance.
edit: a little addition that would be nice if anyone knows how to do it: grep and list only the values with the highest version, i.e. if qp_50 exists 3 times with the suffixes _v1, _v2 and _v3, only qp_50_v3 is listed. As such the list will still consist of files with various versions, but only the highest version of each file will be output to the terminal.

ls | cut -d '_' -f 2 | sort -n
in your case substitute ls for your grep command
Edit: in the example above, the output is only the cut field; if you want the original name of the file, use this:
ls | sort -k2,2g -t '_'
k is the number of the field to compare
g means general numeric sort, so the field is compared as a number rather than as a string
t is the delimiter
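For the edit about listing only the highest version of each file: a possible sketch, assuming GNU sort (for the V version-sort modifier) and names that always look like qp_<number>_v<version>, is to sort by the number ascending and the version descending, then let awk keep only the first line it sees for each number:
ls | sort -t '_' -k2,2n -k3,3Vr | awk -F '_' '!seen[$2]++'
Because the highest version comes first within each number, !seen[$2]++ prints exactly one line per qp_<number>, and the overall output stays sorted numerically.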

Related

Bash function to "use" the most recent "dated" file in a dir

I have a dir with a crap load (hundreds) of log files from over time. In certain cases I want to make a note regarding the most recent log (by date in the filename, not by creation time), or I just need some piece of info from it and want to view it quickly; I just know it was (usually) the last one created but (always) the one with the newest date. So I wanted to make a "simple" function in my bashrc to overcome this problem. Basically what I want is a function that goes to a specific dir, finds the latest log by date (always in the same format) and opens it with less or whatever pager I want.
The logs are formatted like this:
typeoflog-short-description-$(date "+%-m-%-d-%y")
basically the digits in between the last 3 dashes are what I'm interested in, for example(s):
update-log-2-24-18
removed-cuda-opencl-nvidia-12-2-19
whatever-changes-1-18-19
Now, if it were January 20, 2019 and this was the last log added to the dir, I need a way to see what the highest number is in the last two digits of the filename (that I don't really have a problem with), then check for the highest month, which would be two "dashes" back from that last set of digits (whether the month is 2 digits or 1), then do the same thing for the day of the month, set that as a local variable, and use it like the following example.
Something like this:
viewlatestlog(){
    local loc=~/.logdir
    local name=$(echo $loc/*-19 | ...) # awk or cut or sort, or I could even loop from 1-31 and 1-12 for the days and months
    # I have ideas, but I know there has to be a better way to do this and it's not coming to me, maybe with expr or a couple of sort commands; I'm not sure. It would have been easier if I had made it so that each date number always had 2 digits... But I didn't.
    ## But the ultimate goal is that I can run something like this command at the end:
    less $loc/$name
}
PS. For bonus points you could also tell me if there is a way to automatically copy the filename (with the location and all or without, I don't really care) to my linux clipboard, so when I'm making my note I can "link" to the log file if I ever need to go back to it...
Edit: Cleaned up the post a little bit; I tend to make my questions way too wordy, I apologize.
GNU sort can sort by fields:
$ find . -name whatever-changes-\* | sort -n -t- -k5 -k3 -k4
./whatever-changes-3-01-18
./whatever-changes-1-18-19
./whatever-changes-2-12-19
./whatever-changes-11-01-19
The option -t specifies the field delimiter and the option -k selects the sort key fields (numbered starting with 1). The option -n specifies numeric sort.
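Since the goal is to open the newest log, a sketch building on this (still assuming the fixed whatever-changes- prefix, so the number of dashes is constant) is to take the last line of the sorted list:
less "$(find . -name whatever-changes-\* | sort -n -t- -k5 -k3 -k4 | tail -n 1)"
With an ascending sort the newest date ends up last, so tail -n 1 picks it.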
Assuming your filenames do not contain tabs or newlines, how about:
loc="~/.logdir"
for f in "$loc"/* ; do
if [[ $f =~ -([0-9]{1,2})-([0-9]{1,2})-([0-9]{2})$ ]]; then
mm=${BASH_REMATCH[1]}
dd=${BASH_REMATCH[2]}
yy=${BASH_REMATCH[3]}
printf "%02d%02d%02d\t%s\n" "$yy" "$mm" "$dd" "$f"
fi
done | sort -r | head -n 1 | cut -f 2
First extract the month, day, and year from the filename.
Then build a date string formatted as "YYMMDD" and prepend it to the
filename, delimited by a tab character.
Then you can run the sort command on the list.
Finally you can obtain the desired (latest) filename by extracting it with head and cut.
Hope this helps.
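For the clipboard bonus: one option, assuming an X11 session with xclip (or xsel) installed and that the pipeline above has been captured into a variable, here called latest, is:
latest=$( ... )                    # the pipeline above
printf '%s' "$latest" | xclip -selection clipboard    # with xsel: xsel --clipboard --input
less "$latest"
That puts the path on the clipboard right before the pager opens, so it can be pasted into the note afterwards.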

Sort a list of strings in Bash numerically according to specific substring

I have a list of strings which unfortunately doesn't seem to lend itself to be sorted with sort --key=???.
This is the string:
Original 40.101 s 40.556 s
User XYZ 3.389 s 3.261 s
User XYZ/User ABC 5.342 s 5.300 s
Somebody else 32.531 s 32.154 s
My friend Tony the Pony 5.905 s 5.639 s
L33t 27.007 s 26.893 s
Serial port 7.871 s 7.738 s
Unknown user 2.815 s 2.700 s
I'd like it to be sorted according to the first number; ascending or descending doesn't really matter, although it would be great to know a solution that can in principle do both.
I tried sort --key=2 <<HERE ... HERE but unsurprisingly this just leads to a random order.
Assuming your input file is correctly indented with spaces, use the -k option of the sort command:
sort -n -k1.30 file
or the reverse way:
sort -nr -k1.30 file
1.30 means the key starts at character 30 of field number one, i.e. the first 29 characters of the line are skipped.
The -n switch sorts numerically instead of lexicographically.
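If the columns are not actually aligned, a delimiter-independent sketch, relying only on the lines ending in "<time> s <time> s" so that the first number is always the fourth field from the end, is to prepend that field, sort on it, and cut it off again:
awk '{ print $(NF-3) "\t" $0 }' file | sort -n | cut -f 2-
Add -r to the sort for descending order.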

How to get frequency counts of unique values in a list using UNIX?

I have a file that has a couple thousand domain names in a list. I easily generated a list of just the unique names using the uniq command. Now, I want to go through and find how many times each of the items in the uniques list appears in the original, non-unique list. I thought this should be pretty easy to do with this loop, but I'm running into trouble:
for name in 'cat uniques.list'; do grep -c $name original.list; done > output.file
For some reason, it's spitting out a result that shows some count of something (honestly not sure what) for the uniques file and the original file.
I feel like I'm overlooking something really simple here. Any help is appreciated.
Thanks!
Simply use uniq -c on your file:
-c, --count
prefix lines by the number of occurrences
The command to get the final output :
sort original.list | uniq -c
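If you also want the names ordered by how often they occur, a common variation is to sort the counts numerically in reverse afterwards:
sort original.list | uniq -c | sort -nr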

Find same words in two text files

I have two text files and each contains more than 50,000 lines. I need to find the words that appear in both text files. I tried the COMM command but I got the answer that "file 2 is not in sorted order". I tried to sort the file with the SORT command but it didn't work. I'm working on Windows. It doesn't have to be solved on the command line; it can be solved in some program or something else. Thank you for every idea.
If you want to sort the files, you will have to use an external sort (like merge sort) so that you have enough memory. Another way: go through the first file, find all the words, and store them in a hash table; then go through the second file and check for repeated words. If the words are actual words and not gibberish, the second method will work and be easier. Since the files are so large, you may not want to use a scripting language, but it might work.
If the words are not on their own line, then comm can not help you.
If you have a set of Unix utilities handy, like Cygwin (you mentioned comm, so you may have others as well), you can do:
$ tr -cs "[:alpha:]" "\n" < firstFile | sort > firstFileWords
$ tr -cs "[:alpha:]" "\n" < secondFile | sort > secondFileWords
$ comm -12 firstFileWords secondFileWords > commonWords
The first two lines convert the words in each file into a single word on each line and also sort the files.
If you're only interested in individual words, you can change sort to sort -u to get the unique set.
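If case differences should not matter ("Word" vs "word"), a sketch of the same pipeline with everything lower-cased first:
$ tr -cs "[:alpha:]" "\n" < firstFile | tr "[:upper:]" "[:lower:]" | sort -u > firstFileWords
$ tr -cs "[:alpha:]" "\n" < secondFile | tr "[:upper:]" "[:lower:]" | sort -u > secondFileWords
$ comm -12 firstFileWords secondFileWords > commonWords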

awk: how to remove duplicated lines in a file and output them in another file at the same time?

I am currently working on a script which processes csv files, and one of the things it does is remove duplicate lines in the files while keeping a note of them. My current method is to run uniq -d once to display all duplicates, then run uniq again without any options to actually remove the duplicates.
Having said that, I was wondering if it would be possible to perform this in one action instead of having to run uniq twice. I've found a bunch of different examples of using awk to remove duplicates out there, but I have not been able to find any that both display the duplicates and remove them at the same time.
If anyone could offer advice or help with this I would really appreciate it. Thanks!
Here's something to get you started:
awk 'seen[$0]++{print|"cat>&2";next}1' file > tmp && mv tmp file
The above will print any duplicated lines to stderr at the same time as removing them from your input file. If you need more, tell us more....
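If you would rather collect the duplicates in a file than on stderr, the same idea can be written as follows (the file names dups.txt and deduped.csv are just placeholders):
awk 'seen[$0]++ { print > "dups.txt"; next } 1' file.csv > deduped.csv
Every repeated occurrence of a line goes to dups.txt; the first occurrence of each line goes to deduped.csv.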
In general, the size of your input should be your guide. If you're processing GBs of data, you often have no choice other than relying on sort and uniq, because these tools support external (on-disk) sorting.
That said, here's the AWK way:
If your input is sorted, you can keep track of duplicate items in AWK easily by comparing line i to line i-1 with O(1) state: if line i equals line i-1, you have a duplicate.
If your input is not sorted, you have to keep track of all lines, requiring O(c) state, where c is the number of unique lines. You can use a hash table in AWK for this purpose.
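As a sketch of the sorted-input case (sorted.csv is a placeholder, and /dev/stderr support is assumed, as in gawk or mawk), compare each line to the previous one, sending repeats to stderr and everything else to stdout:
awk 'NR > 1 && $0 == prev { print > "/dev/stderr"; next } { prev = $0; print }' sorted.csv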
This solution does not use awk but it does produce the result you need. In the command below replace sortedfile.txt with your csv file.
cat sortedfile.txt | tee >(uniq -d > duplicates_only.txt) | uniq > unique.txt
tee sends the output of the cat command both to the uniq -d process substitution (which writes the duplicated lines to duplicates_only.txt) and on to the final uniq, which writes the de-duplicated lines to unique.txt.
