Pull random unique samples within sorted categories in bash - bash

I have a large unsorted CSV file (>4M records). Each record has a category, which is described in the first three columns. The rest of the record is address data which may or may not be unique.
A, 1, c, address1 # the category for this record is A1c
A, 1, c, address2
C, 3, e, address3 # the category for this record is C3e
B, 2, a, address4
I would like to pull a random sample of unique records within each category (so 5 unique records in category A1c, 5 unique records from C3e, etc.). I put together a partial solution using sort, but it only pulls one non-random record from each category:
sort -u -t, -k1,3
Is there a way to pull several random sample records within each category?
I think there must be a way to do this by using a combination of pipes, uniq, awk or shuf, but haven't been able to figure it out. I would prefer a command-line solution since I'm interested in knowing if this is possible using only bash.

If I understand right, here is a simple, not very efficient bash solution:
csvfile="./ca.txt"
while read -r cat
do
    grep "^$cat," "$csvfile" | sort -uR | head -5
done < <(cut -d, -f1-3 < "$csvfile" | sort -u)
Decomposition:
cut -d, -f1-3 < "$csvfile" - extract the "categories" (the first 3 fields) from every line
sort -u - get the sorted unique categories
for each unique category (the while read ... loop):
grep "^$cat," "$csvfile" - find all lines belonging to this category
sort -uR - sort them randomly by hash (note: duplicates have the same hash, so only unique lines are kept)
head -5 - print the first 5 records from the randomly sorted list
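The question also mentions shuf; if GNU shuf is available, the random sort plus head step can be replaced with it (a sketch of the same loop, not benchmarked):
csvfile="./ca.txt"
while read -r cat
do
    # deduplicate first, then let shuf pick 5 random unique records
    grep "^$cat," "$csvfile" | sort -u | shuf -n 5
done < <(cut -d, -f1-3 < "$csvfile" | sort -u)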

Inspired by the use of sort -R in the answer by jm666. This is a GNU extension to sort, so it may not work on non-GNU systems.
Here, we use sort to sort the entire file once, with the non-category fields sorted in a random order. Since the category fields are the primary key, the result is in category order with random order of the following fields.
From there, we need to find the first five entries in each category. There are probably hackier ways to do this, but I went with a simple awk program.
sort -ut, -k1,3 -k4R "$csvfile" | awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=5'
If your sort doesn't randomise, then the random sample can be extracted with awk:
# Warning! Only slightly tested :)
sort -ut, "$csvfile" | awk -F, '
BEGIN { srand() }                      # seed, so each run gives a different sample
function sample() {
    # reduce the n collected lines to at most 5, chosen randomly, and print them
    for (; n > 5; --n) v[int(n*rand())+1] = v[n];
    for (; n; --n) print v[n]
}
a != $1$2$3 { a = $1$2$3; sample() }   # category changed: flush the previous one
{ v[++n] = $0 }
END { sample() }'
It would also be possible to keep all the entries in awk to avoid the sort, but that's likely to be a lot slower and it will use an exorbitant amount of memory.
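For completeness, a minimal sketch of that all-in-awk variant (per-category reservoir sampling of 5 records, with a seen table for uniqueness, so memory grows with the whole file as noted above; the csvfile variable is reused from the snippets above):
awk -F, '
BEGIN { srand() }
{
    if ($0 in seen) next                 # skip exact duplicate records
    seen[$0] = 1
    key = $1 FS $2 FS $3                 # category = first 3 fields
    n[key]++
    if (n[key] <= 5)
        r[key, n[key]] = $0              # fill the reservoir
    else {
        i = int(rand() * n[key]) + 1
        if (i <= 5) r[key, i] = $0       # replace with probability 5/n
    }
}
END {
    for (k in n) {
        m = (n[k] < 5) ? n[k] : 5
        for (i = 1; i <= m; i++) print r[k, i]
    }
}' "$csvfile"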

Related

Bash to count columns and registers

I am trying to get the number of columns of a dataset in bash (taking into account that it has a header) using the following code, but I am not convinced by it. Do you have any other ideas?
On the other hand, could you help me determine the type of data of each attribute (integer, date, string...) and count how many records (registers) there are in the dataset, taking into account that it has a header?
The header of my archive is: "A","B","C","D","F","G","H","I"
head -n1 data | awk '{print NF-1}'
Thank you
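Since the header shown is comma-separated, a minimal sketch for the two counts (assuming the dataset is in a file called data, as in the command above) would be:
head -n1 data | awk -F, '{print NF}'   # number of columns = fields in the header line
tail -n +2 data | wc -l                # number of records, header excluded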

Parsing/sorting/de-duplicating large matrix of info in UTF-8 form

I have a large file in UTF-8 form (I've encoded it from iso-8859-1 form) that I have opened in terminal on mac.
I've been trying to use the parse.date function to convert the data in one of the column fields to date form.
I also need to filter all of the rows (each row represents a company, each column represents different data field for each company: i.e. founder, location, year created, etc.) on a certain column field.
As a bonus I would like to de-duplicate the data as well.
Then finally, I'd like to run analysis on this data by sorting the data via different column fields and working with survival curves.
I've been scouring the internet for the appropriate terminal commands to approach this with. Could anyone give me direction on how to get started?
The first problem is separating the fields;
I assume the fields are TAB-separated:
cat file.txt | sort -t$'\t' -k 2
If there are tabs and spaces mixed together,
I would assume there are no successive spaces inside a field,
so I would write it this way:
cat file.txt | sed -e 's/\s\+/\t/g' | sort -t$'\t' -k 2
This will sort file.txt according to the 2nd column.
If column 2 is numeric, add the -n option.
If you want a stable sort (which keeps the previous ordering whenever possible), add the -s option.
If you want to eliminate duplicates, add the -u option:
cat file.txt | sort -t$'\t' -k 2 -n -s -u
For more details:
man sort
(I don't know about the parse.date function.)

Bash script - Construct a single line out of many lines having duplicates in a single column

I have an instrumented log file that has 6 lines with the same (duplicated) first column, as below.
//SC001#1/1/1#1/1,get,ClientStart,1363178707755
//SC001#1/1/1#1/1,get,TalkToSocketStart,1363178707760
//SC001#1/1/1#1/1,get,DecodeRequest,1363178707765
//SC001#1/1/1#1/1,get-reply,EncodeReponse,1363178707767
//SC001#1/1/1#1/2,get,DecodeRequest,1363178708765
//SC001#1/1/1#1/2,get-reply,EncodeReponse,1363178708767
//SC001#1/1/1#1/2,get,TalkToSocketEnd,1363178708770
//SC001#1/1/1#1/2,get,ClientEnd,1363178708775
//SC001#1/1/1#1/1,get,TalkToSocketEnd,1363178707770
//SC001#1/1/1#1/1,get,ClientEnd,1363178707775
//SC001#1/1/1#1/2,get,ClientStart,1363178708755
//SC001#1/1/1#1/2,get,TalkToSocketStart,1363178708760
Note: , (comma) is the delimiter here
Likewise, there are many duplicate first-column values (IDs) in the log file (the above example has only two IDs: //SC001#1/1/1#1/1 and //SC001#1/1/1#1/2). I need to consolidate the log records into the format below.
ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd
//SC001#1/1/1#1/1,1363178707755,1363178707760,1363178707765,1363178707767,1363178707770,1363178707775
//SC001#1/1/1#1/2,1363178708755,1363178708760,1363178708765,1363178708767,1363178708770,1363178708775
I suppose a bash script is needed for this exercise and would appreciate expert support. I hope there is a sed or awk solution, which would be more efficient.
Thanks much
One way:
sort -t, -k4n,4 file | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}'
The sort command sorts the file on the basis of the last (4th) column, the timestamp. awk then takes the sorted input and builds an array keyed on the 1st field, where each value is the comma-joined combination of that key's last-column values.
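If the header line and a deterministic ID order are also wanted, the same pipeline can be wrapped, for example (a sketch; consolidated.csv is just an assumed output name):
{
    echo "ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd"
    sort -t, -k4n,4 file |
        awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}' |
        sort -t, -k1,1     # order the consolidated lines by ID
} > consolidated.csv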

How to sort file rows with vi?

I have to edit multiple files with multiple rows, and also everything is in three columns, like this:
#file
save get go
go save get
rest place reset
Columns are separated with tabs. Is there any way to sort the rows based on the second or third column using vi?
sort by the 2nd col:
:sor /\t/
sort by the 3rd col:
:sor /\t[^\t]*\t/
Second column:
:sort /\%9c/
Third column:
:sort /\%16c/
\%16c means "column 16".
Highlight the rows you want to sort with the "V" command.
Use a shell command with "!" to work on the selection, like:
!sort -k 2
where the number is the field number of the column to sort by (sort splits fields on blanks and tabs by default).
vi will replace the selection with the output of the sort command, which is given the original selection as its input.
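For the tab-separated sample above, a concrete invocation might look like this (the '<,'> range is what vi inserts for a visual selection; -k2,2 limits the key to the second field):
:'<,'>!sort -k2,2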
You can specify a pattern for sort. For example:
:sort /^\w*\s*/
will sort on the second column (sorting starts with whatever follows the matched pattern).
Likewise,
:sort /^\w*\s*\w*\s*/
should sort on the third column.
Delimit the columns using some character; here I have the | symbol as the delimiter. Once that is done, you can use the command below to sort on a specific column. Use -n if you want a numeric sort. It works on some versions of vi, but not on Ubuntu's vi :(
/|.*|/ | sort

using awk to grab random lines and append to a new column?

So I have a document "1", which is one column. I have 3 files with one column each, and I want to append a randomly selected line from each of those files onto each line of document 1.
So like
awk 'NR==10' moves.txt 'NR==1' propp_tasks.txt
prints out
10.Qg3 Bb4+
First function of the donor
when I want it to be:
10 Qg3 Bb4+ First function of the donor
Is there a good way to do this with awk? I had been trying to set up a bash script with a for loop, but I didn't know how to cycle the indices so that on line n of document 1, columns 2, 3 and 4 would be appended there. I feel like this should be really, really simple...
paste 1 <(cat 2 3 4 | sort -R)
If the length of the first file and the length of the combination of the other 3 files are different, then some more work is required.
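A variant that puts one randomly chosen line from each of the three files into its own column (a sketch, assuming GNU shuf and that files 2, 3 and 4 each have at least as many lines as 1):
paste 1 <(shuf 2) <(shuf 3) <(shuf 4)
If the other files are longer than 1, the extra rows can be trimmed afterwards with head -n "$(wc -l < 1)".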
