Parsing/sorting/de-duplicating large matrix of info in UTF-8 form

I have a large file in UTF-8 form (I've encoded it from iso-8859-1 form) that I have opened in Terminal on a Mac.
I've been trying to use the parse.date function to convert the data in one of the column fields to date form.
I also need to filter all of the rows (each row represents a company, each column represents a different data field for each company: i.e. founder, location, year created, etc.) on a certain column field.
As a bonus I would like to de-duplicate the data as well.
Then finally, I'd like to run analysis on this data by sorting the data via different column fields and working with survival curves.
I've been scouring the internet for the appropriate terminal commands to approach this with. Could anyone give me direction on how to get started?

The first problem is separating the fields;
I assume the fields are TAB-separated:
cat file.txt | sort -t$'\t' -k 2
If TABs and spaces are mixed together,
I would assume there are no successive spaces inside a field.
So I would write it this way:
cat file.txt | sed -e 's/\s\+/\t/g' | sort -t$'\t' -k 2
(Note that \s, \+ and \t are GNU sed extensions; with the BSD sed on macOS you may need [[:space:]] with -E and a literal tab, or install GNU sed.)
This will sort file.txt according to the 2nd column.
If column 2 is numeric, add the -n option.
If you want a stable sort (which keeps the previous ordering whenever possible), add the -s option.
If you want to eliminate duplicates, add the -u option.
cat file.txt | sort -t$'\t' -k 2 -n -s -u
For more details:
man sort
(I don't know about the parse.date function.)
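For the filtering and de-duplication steps from the question, a minimal sketch (the column number 3 and the value "California" are made-up placeholders; adjust them to your data, which is assumed here to be TAB-separated):
awk -F'\t' '$3 == "California"' file.txt | sort -u > filtered.txt
For a pattern match instead of an exact match, use $3 ~ /California/ instead; the sort -u at the end drops exact duplicate rows.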

Related

To find the latest entry for a particular record in a Unix file

I have a file which has multiple entries for a single record. For example:
abc~20160120~120
abc~20160125~150
xyz~20160201~100
abc~20160205~200
xyz~20160202~90
pqr~20160102~250
The first column is the record name, the second column is the date, and the third column is the entry for that particular date.
Now what I want to display in my file is the latest entry for a particular record. This is how my output should look:
abc~20160205~200
xyz~20160202~90
pqr~20160102~250
Can anybody help with a shell script for this? Keep in mind that I have a great many records, which need to be sorted first by record name and then the latest one for each record taken according to date.
Sort the lines by record name and date reversed, then use the -u unique flag of sort to output only the first entry for each record:
sort -t~ -k1,2r < input-file | sort -t~ -k1,1 -u
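An awk alternative (a sketch; it assumes the YYYYMMDD dates compare correctly as plain numbers) that keeps the newest entry per record and preserves the order in which records first appear:
awk -F'~' '
!($1 in max) { order[++n] = $1 }                # remember the first appearance of each record
$2 > max[$1] { max[$1] = $2; line[$1] = $0 }    # keep the line with the latest date
END { for (i = 1; i <= n; i++) print line[order[i]] }
' input-file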

Pull random unique samples within sorted categories in bash

I have a large unsorted CSV file (>4M records). Each record has a category, which is described in the first three columns. The rest of the record is address data which may or may not be unique.
A, 1, c, address1 # the category for this record is A1c
A, 1, c, address2
C, 3, e, address3 # the category for this record is C3e
B, 2, a, address4
I would like to pull a random sample of unique records within each category (so 5 unique records in category A1c, 5 unique records from C3e, etc.). I put together a partial solution using sort. However, it only pulls one non-random record in each category:
sort -u -t, -k1,3
Is there a way to pull several random sample records within each category?
I think there must be a way to do this by using a combination of pipes, uniq, awk or shuf, but haven't been able to figure it out. I would prefer a command-line solution since I'm interested in knowing if this is possible using only bash.
If I understand right, a simple, not very efficient bash solution:
csvfile="./ca.txt"
while read -r cat
do
grep "^$cat," "$csvfile" | sort -uR | head -5
done < <(cut -d, -f1-3 < "$csvfile" |sort -u)
Decomposition:
cut -d, -f1-3 < "$csvfile" - extracts the "categories" (the first 3 fields) from every line
sort -u - gets the sorted unique categories
for each unique category (the while read... loop):
grep "^$cat," "$csvfile" - finds all lines from this category
sort -uR - sorts them randomly by hash (note: duplicates have the same hash, so -u keeps only unique lines)
head -5 - prints the first 5 records (from the randomly sorted list)
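If GNU shuf is available, the same loop can take the random sample without relying on sort -R (a sketch reusing the ca.txt example above):
csvfile="./ca.txt"
while read -r cat
do
    grep "^$cat," "$csvfile" | sort -u | shuf -n 5    # de-duplicate, then pick 5 lines at random
done < <(cut -d, -f1-3 < "$csvfile" | sort -u)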
Inspired by the use of sort -R in the answer by jm666. This is a GNU extension to sort, so it may not work on non-GNU systems.
Here, we use sort to sort the entire file once, with the non-category fields sorted in a random order. Since the category fields are the primary key, the result is in category order with random order of the following fields.
From there, we need to find the first five entries in each category. There are probably hackier ways to do this, but I went with a simple awk program.
sort -ut, -k1,3 -k4R "$csvfile" | awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=5'
If your sort doesn't randomise, then the random sample can be extracted with awk:
# Warning! Only slightly tested :)
sort -ut, "$csvfile" | awk -F, '
BEGIN{srand()}   # seed the RNG, otherwise every run produces the same "random" sample
function sample(){
for(;n>5;--n)v[int(n*rand())+1]=v[n];   # discard one random element until only 5 remain
for(;n;--n)print v[n]
}
a!=$1$2$3{a=$1$2$3;sample()}   # category changed: sample and print the previous group
{v[++n]=$0}
END {sample()}'
It would also be possible to keep all the entries in awk to avoid the sort, but that's likely to be a lot slower and it will use an exorbitant amount of memory.
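For completeness, a sketch of that keep-everything-in-awk variant (per-category reservoir sampling, reusing the $csvfile variable from the answer above; it avoids sort entirely but keeps every distinct record in memory):
awk -F, '
BEGIN { srand() }
$0 in seen { next }                      # drop exact duplicate records
{
    seen[$0] = 1
    key = $1 FS $2 FS $3                 # the category is the first 3 fields
    if (++cnt[key] <= 5)
        keep[key, cnt[key]] = $0         # fill the per-category reservoir
    else if ((j = int(rand() * cnt[key]) + 1) <= 5)
        keep[key, j] = $0                # replace a slot with decreasing probability
}
END {
    for (k in cnt) {
        m = (cnt[k] < 5 ? cnt[k] : 5)
        for (i = 1; i <= m; i++) print keep[k, i]
    }
}' "$csvfile"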

Advanced concatenation of lines based on a specific number of compared columns in CSV

This question is based on a previously solved problem.
I have the following type of .csv files (they aren't all sorted!, but the column structure is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns are always populated; the other columns are not always, except the last one, the category.
An empty space between "," delimiters means there is no data for that particular line or name.
If nameX has addressY rather than addressX, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk, or maybe bash (though that would be a little slower on bigger files [hundreds of MB+]), that will take the first 4 columns (in this case), compare them, and if they match, merge every category with a ";" delimiter, keeping the structure and as much data as possible in the other columns of those matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
If that is not possible, a solution could be to retain the data from the first line of the duplicated data (the one with categoryX_1). Example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
Does the .csv have to be sorted before using the script?
Thank you again!
sed -n 's/.*/²&³/;H
$ { g
:cat
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
s/~~ ~~//g
b cat
:clean
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so --posix with GNU sed), and without sorting your file first.
Two recursive loops after loading the full file into the buffer, adding markers for easier manipulation, and a lot of fun with sed group substitution (it just about reaches the maximum number of groups available).
One loop appends the categories (one line after the other, as needed by the next loop on each field) per line and builds a big temporary sub-field structure (2 groups of fields from the 2 concatenated lines; fields 5 to 9 form 1 group).
The other loop moves the sub-fields back to their original place.
Finally, remove the markers and the first newline.
This assumes the characters ², ³ and ~ do not occur in the data, because they are used as markers (you can use other markers and adapt the script accordingly).
Note:
For performance on a file of hundreds of MB, I guess awk will be a lot more efficient.
Sorting the data beforehand would certainly help performance by reducing the amount of data to manipulate after each category loop.
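As that note suggests, an awk approach should be much faster on files of hundreds of MB. A rough sketch of the merging logic (assuming 10 columns as in the sample, with the category always in the last column; for each group of identical first-4-column values it keeps the first non-empty value seen in each middle column and joins the categories with ";"; it needs no pre-sorting and keeps the input order of the groups):
awk -F, -v OFS=, '
{
    key = $1 OFS $2 OFS $3 OFS $4             # the first 4 columns identify a record
    if (key in cat) {
        cat[key] = cat[key] ";" $NF           # append this line's category
        n = split(rest[key], r, OFS)          # fill in middle columns that are still empty
        for (i = 5; i < NF; i++)
            if (r[i-4] == "" && $i != "") r[i-4] = $i
        s = r[1]
        for (i = 2; i <= n; i++) s = s OFS r[i]
        rest[key] = s
    } else {
        order[++cnt] = key                    # remember the input order of the groups
        cat[key] = $NF
        s = $5
        for (i = 6; i < NF; i++) s = s OFS $i
        rest[key] = s
    }
}
END {
    for (j = 1; j <= cnt; j++)
        print order[j], rest[order[j]], cat[order[j]]
}' file.csv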
I found that this particular problem is processed faster through a DB...
SQL - GROUP BY to combine/concat a column
DB: MySQL through WAMP

Bash script - Construct a single line out of many lines having duplicates in a single column

I have an instrumented log file that has 6 lines per duplicated first-column value, as below.
//SC001#1/1/1#1/1,get,ClientStart,1363178707755
//SC001#1/1/1#1/1,get,TalkToSocketStart,1363178707760
//SC001#1/1/1#1/1,get,DecodeRequest,1363178707765
//SC001#1/1/1#1/1,get-reply,EncodeReponse,1363178707767
//SC001#1/1/1#1/2,get,DecodeRequest,1363178708765
//SC001#1/1/1#1/2,get-reply,EncodeReponse,1363178708767
//SC001#1/1/1#1/2,get,TalkToSocketEnd,1363178708770
//SC001#1/1/1#1/2,get,ClientEnd,1363178708775
//SC001#1/1/1#1/1,get,TalkToSocketEnd,1363178707770
//SC001#1/1/1#1/1,get,ClientEnd,1363178707775
//SC001#1/1/1#1/2,get,ClientStart,1363178708755
//SC001#1/1/1#1/2,get,TalkToSocketStart,1363178708760
Note: , (comma) is the delimiter here
Likewise, there are many duplicate first-column values (IDs) in the log file (the example above has only two IDs: //SC001#1/1/1#1/1 and //SC001#1/1/1#1/2). I need to consolidate the log records into the format below.
ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd
//SC001#1/1/1#1/1,1363178707755,1363178707760,1363178707765,1363178707767,1363178707770,1363178707775
//SC001#1/1/1#1/2,1363178708755,1363178708760,1363178708765,1363178708767,1363178708770,1363178708775
I suppose a bash script is needed for this exercise and I'd appreciate expert support. I hope there may be a sed or awk solution, which would be more efficient.
Thanks much
One way:
sort -t, -k4n,4 file | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}'
The sort command sorts the file on the basis of the last (4th) column. awk takes the sorted input and forms an array where the 1st field is the key and the value is the combination of the last-column values.
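A variant that also prints a header line and keeps the IDs in the order they first appear (a sketch; the header text is copied from the desired output above, and it still relies on the timestamp sort putting the events into that column order):
sort -t, -k4n,4 file | awk -F, '
!($1 in a) { order[++n] = $1 }               # remember the first-seen order of the IDs
{ a[$1] = a[$1] ? a[$1] "," $NF : $NF }      # append this line's timestamp to its ID
END {
    print "ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd"
    for (i = 1; i <= n; i++) print order[i] "," a[order[i]]
}'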

How to sort file rows with vi?

I have to edit multiple files with multiple rows, and also everything is in three columns, like this:
#file
save get go
go save get
rest place reset
Columns are separated with tabs. Is there any way to sort the rows based on the second or third column using vi?
sort by the 2nd col:
:sor /\t/
sort by the 3rd col:
:sor /\t[^\t]*\t/
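Applied to the sample file above, sorting on the 2nd column orders the keys get < place < save, giving:
save get go
rest place reset
go save get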
Second column:
:sort /\%9c/
Third column:
:sort /\%16c/
\%16c means "column 16".
Highlight the rows you want to sort with the "V" command.
Use a shell command with "!" to work on the selection, like:
!sort -k2
where the number is the (whitespace-separated) field number of the column you want to sort on.
vi will replace the selection with the output of the sort command, which is given the original selection as its input.
You can specify a pattern for sort. For example:
:sort /^\w*\s*/
will sort on the second column (the sort key is whatever follows the matched pattern).
Likewise,
:sort /^\w*\s*\w*\s*/
should sort on the third column.
Delimit the columns using some character; here I have the | symbol as the delimiter. Once that is done, you can use the command below to sort on a specific column; use -n if you want a numeric sort. It works on some versions of vi but not on Ubuntu's vi :(
/|.*|/ | sort
