Bash to count columns and registers

I am trying to get the number of columns of a dataset in bash (taking into account that it has a header) using the following code, but I am not convinced by it. Do you have any other idea?
On the other hand, could you help me determine the type of data of each attribute (integer, date, string...) and count how many records (registers) are in the dataset, again taking into account that it has a header?
The header of my file is: "A","B","C","D","F","G","H","I"
head -n1 data | awk '{print NF-1}'
Thank you
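A sketch of one possible approach, assuming the file is called data, that the fields are comma-separated (as the quoted header suggests), and that dates look like YYYY-MM-DD; the type detection only inspects the first data row, so treat it as a rough guess:
# number of columns: count the fields of the header line itself
head -n1 data | awk -F, '{print NF}'
# number of records, excluding the header line
tail -n +2 data | wc -l
# rough type guess per column, based on the first data row only
sed -n 2p data | awk -F, '{
    for (i = 1; i <= NF; i++) {
        f = $i
        gsub(/^"|"$/, "", f)    # strip surrounding quotes
        if (f ~ /^-?[0-9]+$/)
            t = "integer"
        else if (f ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/)
            t = "date"
        else
            t = "string"
        print "column " i ": " t
    }
}'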

Related

How can I output my textfile to uppercase hash?

I have a large text file consisting of two columns that I'm looking to convert to an uppercase hash. The text file is too large for me to open in Excel (over 1 million rows), so I was trying to do this through the command line if possible.
I'm hoping for just column B to be hashed, but it's fine if the rest of the file is hashed too.
Edit: Essentially, I have a text file with column A as first names and column B as email addresses. I was hoping to use something like this to convert column B into an uppercase hash, so that it is encrypted for transfer and someone else can convert it back.
I saw this code, but I wasn't sure where I'd specify the file name and the column I wanted to convert to a hash:
echo -n password | sha1sum | awk '{print $1}'
5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
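One way to apply that idea line by line could look like the sketch below; the file name names_emails.txt and the TAB separator are assumptions, and column 2 (the e-mail address) is replaced by its uppercased SHA-1 hash. Note that a hash is one-way: the recipient can compare hashes against their own data, but cannot convert them back to the original addresses.
# sketch: hash column 2 of a TAB-separated file with SHA-1, uppercased
# (spawns sha1sum once per line, so it is slow for millions of rows)
while IFS=$'\t' read -r name email; do
    hash=$(printf '%s' "$email" | sha1sum | awk '{print toupper($1)}')
    printf '%s\t%s\n' "$name" "$hash"
done < names_emails.txt > hashed.txt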

To find latest entry for a particular record in the unix file

I have a file which has multiple entries for a single record. For example:
abc~20160120~120
abc~20160125~150
xyz~20160201~100
abc~20160205~200
xyz~20160202~90
pqr~20160102~250
The first column is the record name, the second column is the date, and the third column is the entry for that particular date.
Now what I want to display in my file is the latest entry for each record. This is how my output should look:
abc~20160205~200
xyz~20160202~90
pqr~20160102~250
Can anybody help with a shell script for this? Keep in mind that I have a great many records, which need to be sorted first by record name and then reduced to the latest entry for each record according to the date.
Sort the lines by record name and by date reversed, then use the -u (unique) flag of sort to output only the first entry for each record:
sort -t~ -k1,2r < input-file | sort -t~ -k1,1 -u
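A single-pass awk alternative is also possible; this is only a sketch, assuming the file is called input-file and that the dates are in YYYYMMDD form so a plain numeric comparison is enough (note that the output order of the records is not guaranteed):
# keep the line with the largest date seen so far for each record name
awk -F'~' '$2 > latest[$1] { latest[$1] = $2; line[$1] = $0 }
           END { for (k in line) print line[k] }' input-file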

Parsing/sorting/de-duplicating large matrix of info in UTF-8 form

I have a large file in UTF-8 form (I've converted it from iso-8859-1) that I have opened in Terminal on my Mac.
I've been trying to use a parse.date function to convert the data in one of the column fields to date form.
I also need to filter all of the rows (each row represents a company, each column represents different data field for each company: i.e. founder, location, year created, etc.) on a certain column field.
As a bonus I would like to de-duplicate the data as well.
Then finally, I'd like to run analysis on this data by sorting the data via different column fields and working with survival curves.
I've been scouring the internet for the appropriate terminal commands to approach this with. Could anyone give me direction on how to get started?
The first problem is separating the fields;
I assume the fields are TAB-separated:
cat file.txt | sort -t$'\t' -k 2
If TABs and spaces are mixed up together,
I would assume there are no successive spaces inside a field.
So I would write it this way:
cat file.txt | sed -e 's/\s\+/\t/' | sort -t$'\t' -k 2
This will sort file.txt according to the 2nd column.
If column 2 is numeric, add the -n option.
If you want a stable sort (which keeps the previous ordering whenever possible), add the -s option.
If you want to eliminate duplicates, add the -u option.
cat file.txt | sort -t$'\t' -k 2 -n -s -u
For more details:
man sort
(I don't know about the parse.date function.)
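For the filtering and de-duplication parts of the question, an awk/sort sketch might look like this; the TAB separator, the column number (3) and the value to match ("USA") are only placeholders for whatever the real data uses:
# keep only rows whose 3rd field equals "USA", then drop exact duplicate rows
awk -F'\t' '$3 == "USA"' file.txt | sort -u > filtered.txt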

Pull random unique samples within sorted categories in bash

I have a large unsorted CSV file (>4M records). Each record has a category, which is described in the first three columns. The rest of the record is address data which may or may not be unique.
A, 1, c, address1 # the category for this record is A1c
A, 1, c, address2
C, 3, e, address3 # the category for this record is C3e
B, 2, a, address4
I would like to pull a random sample of unique records within each category (so 5 unique records in category A1c, 5 unique records from C3e, etc.). I put together a partial solution using sort. However, it only pulls one non-random record in each category:
sort -u -t, -k1,3
Is there a way to pull several random sample records within each category?
I think there must be a way to do this by using a combination of pipes, uniq, awk or shuf, but haven't been able to figure it out. I would prefer a command-line solution since I'm interested in knowing if this is possible using only bash.
If I understand right - a simple, not very efficient bash solution
csvfile="./ca.txt"
while read -r cat
do
grep "^$cat," "$csvfile" | sort -uR | head -5
done < <(cut -d, -f1-3 < "$csvfile" |sort -u)
Decomposition:
cut -d, -f1-3 < "$csvfile" - extract all "categories" (the first 3 fields)
sort -u - get the sorted unique categories
for each unique category (the while read ... loop):
grep "^$cat," "$csvfile" - find all lines from this category
sort -uR - sort them randomly by hash (note: duplicates have the same hash, so -u keeps them unique)
head -5 - print the first 5 records (from the randomly sorted list)
Inspired by the use of sort -R in the answer by jm666. This is a GNU extension to sort, so it may not work on non-GNU systems.
Here, we use sort to sort the entire file once, with the non-category fields sorted in a random order. Since the category fields are the primary key, the result is in category order with random order of the following fields.
From there, we need to find the first five entries in each category. There are probably hackier ways to do this, but I went with a simple awk program.
sort -ut, -k1,3 -k4R "$csvfile" | awk -F, 'a!=$1$2$3{a=$1$2$3;n=0}++n<=5'
If your sort doesn't randomise, then the random sample can be extracted with awk:
# Warning! Only slightly tested :)
sort -ut, "$csvfile" | awk -F, '
  function sample() {
      for (; n > 5; --n) v[int(n*rand())+1] = v[n];
      for (; n; --n) print v[n]
  }
  a != $1$2$3 { a = $1$2$3; sample() }
              { v[++n] = $0 }
  END         { sample() }'
It would also be possible to keep all the entries in awk to avoid the sort, but that's likely to be a lot slower and it will use an exorbitant amount of memory.
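Since the question also mentions shuf: on systems with GNU coreutils, the per-category loop from the first answer could be written with shuf instead of sort -R. This is just a sketch under the same assumptions (the file is ./ca.txt and the first three comma-separated fields form the category):
csvfile="./ca.txt"
while read -r cat; do
    # all unique lines of this category, shuffled, first 5 kept
    grep "^$cat," "$csvfile" | sort -u | shuf -n 5
done < <(cut -d, -f1-3 < "$csvfile" | sort -u)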

Bash script - Construct a single line out of many lines having duplicates in a single column

I have an instrumented log file in which each first-column value (ID) appears on 6 lines, as below.
//SC001#1/1/1#1/1,get,ClientStart,1363178707755
//SC001#1/1/1#1/1,get,TalkToSocketStart,1363178707760
//SC001#1/1/1#1/1,get,DecodeRequest,1363178707765
//SC001#1/1/1#1/1,get-reply,EncodeReponse,1363178707767
//SC001#1/1/1#1/2,get,DecodeRequest,1363178708765
//SC001#1/1/1#1/2,get-reply,EncodeReponse,1363178708767
//SC001#1/1/1#1/2,get,TalkToSocketEnd,1363178708770
//SC001#1/1/1#1/2,get,ClientEnd,1363178708775
//SC001#1/1/1#1/1,get,TalkToSocketEnd,1363178707770
//SC001#1/1/1#1/1,get,ClientEnd,1363178707775
//SC001#1/1/1#1/2,get,ClientStart,1363178708755
//SC001#1/1/1#1/2,get,TalkToSocketStart,1363178708760
Note: , (comma) is the delimiter here
Likewise there are many duplicated first-column values (IDs) in the log file (the example above has only two IDs: //SC001#1/1/1#1/1 and //SC001#1/1/1#1/2). I need to consolidate the log records into the format below.
ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd
//SC001#1/1/1#1/1,1363178707755,1363178707760,1363178707765,1363178707767,1363178707770,1363178707775
//SC001#1/1/1#1/2,1363178708755,1363178708760,1363178708765,1363178708767,1363178708770,1363178708775
I suppose a bash script is needed for this exercise, and I would appreciate expert support with it. I hope there may be a sed or awk solution, which would be more efficient.
Thanks much
One way:
sort -t, -k4n,4 file | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}'
The sort command sorts the file on the basis of the last (4th) column. awk takes the sorted input and builds an array where the 1st field is the key and the value is the concatenation of the last-column values in that sorted (timestamp) order.
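If the column order must not depend on the timestamps themselves, a variant that keys on the event name (field 3) and also prints the header might look like this; it is a sketch that assumes the file is called file and that the six event names are exactly the ones in the desired header:
awk -F, '
BEGIN {
    n = split("ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd", ev, ",")
    printf "ID"
    for (i = 1; i <= n; i++) printf ",%s", ev[i]
    print ""
}
{ ts[$1 "," $3] = $4; ids[$1] = 1 }
END {
    # note: the order in which the IDs are printed is not guaranteed
    for (id in ids) {
        printf "%s", id
        for (i = 1; i <= n; i++) printf ",%s", ts[id "," ev[i]]
        print ""
    }
}' file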
