I have a huge file with 100 columns.
I am concerned with one column called 'Location'. I know for a fact that every row has the same value in this column. I need to get that value through Bash.
Any thoughts on how to go about this?
If the column is always in the same position relative to the other columns (say, the 10th), you could use
cut -d" " -f10
This assumes a single space between columns; change the delimiter to whatever actually separates your columns.
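For example, a minimal sketch (assuming a space-delimited file where 'Location' is the 10th column; the file name data.txt and its contents are made up here):

```shell
# Sample data: 'Location' (column 10) holds the same value on every row.
printf 'a b c d e f g h i NYC\nk l m n o p q r s NYC\n' > data.txt

# Since all rows agree, one row is enough; head avoids reading the whole file.
cut -d' ' -f10 data.txt | head -n1    # NYC
```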
I have a file which has multiple entries for a single record. For example:
abc~20160120~120
abc~20160125~150
xyz~20160201~100
abc~20160205~200
xyz~20160202~90
pqr~20160102~250
The first column is record name, second column is date and third column is the entry for that particular date.
Now what I want my file to display is the latest entry for each record. This is how my output should look:
abc~20160205~200
xyz~20160202~90
pqr~20160102~250
Can anybody help with a shell script for this? Keep in mind that I have very many records, which need to be sorted first by record name, with the latest entry for each record then picked out by date.
Sort the lines by record name and date in reverse, then use sort's -u (unique) flag to output only the first entry for each record:
sort -t~ -k1,2r < input-file | sort -t~ -k1,1 -u
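Run on the sample data from the question, this looks as follows (a sketch; it relies on GNU sort's -u keeping the first line of each group of equal keys, which here is the latest-dated one):

```shell
cat > input-file <<'EOF'
abc~20160120~120
abc~20160125~150
xyz~20160201~100
abc~20160205~200
xyz~20160202~90
pqr~20160102~250
EOF

# First sort: newest date first within each record name (key on fields 1-2, reversed).
# Second sort: ascending by record name; -u keeps the first (newest) line per name.
sort -t~ -k1,2r input-file | sort -t~ -k1,1 -u
# abc~20160205~200
# pqr~20160102~250
# xyz~20160202~90
```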
I have a large file in UTF-8 (I converted it from ISO-8859-1) that I have opened in Terminal on a Mac.
I've been trying to use a parse.date function to convert the data in one of the columns to date form.
I also need to filter all of the rows on a certain column (each row represents a company; each column is a data field for that company, e.g. founder, location, year created).
As a bonus I would like to de-duplicate the data as well.
Then finally, I'd like to run analysis on this data by sorting the data via different column fields and working with survival curves.
I've been scouring the internet for the appropriate terminal commands. Could anyone give me direction on how to get started?
The first problem is separating the fields. I assume the fields are tab-separated:
sort -t$'\t' -k 2 file.txt
If tabs and spaces are mixed together, I would assume there are no consecutive spaces inside a field, and squeeze each whitespace run into a single tab first (note \s and \+ are GNU sed syntax):
sed -e 's/\s\+/\t/g' file.txt | sort -t$'\t' -k 2
This sorts file.txt by the 2nd column.
If column 2 is numeric, add the -n option.
If you want a stable sort (which keeps the previous ordering whenever possible), add the -s option.
If you want to eliminate duplicates, add the -u option:
sort -t$'\t' -k 2 -n -s -u file.txt
For more details, see:
man sort
(I don't know about the parse.date function.)
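Putting those options together on a small made-up tab-separated file (the names and years are hypothetical):

```shell
# Two columns: name<TAB>year, with one exact duplicate line.
printf 'acme\t1999\nzen\t1987\nacme\t1999\nbolt\t2005\n' > companies.tsv

# Numeric, stable, duplicate-removing sort on column 2 only (-k2,2):
sort -t$'\t' -k2,2 -n -s -u companies.tsv
# zen	1987
# acme	1999
# bolt	2005
```

One caveat: when a key is given, -u deduplicates on the key alone, so two different companies sharing a year would collapse to a single line; restrict the key (-k2,2 rather than -k2) deliberately.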
I have an instrumented log file in which each first-column value (ID) is duplicated across six lines, as below.
//SC001#1/1/1#1/1,get,ClientStart,1363178707755
//SC001#1/1/1#1/1,get,TalkToSocketStart,1363178707760
//SC001#1/1/1#1/1,get,DecodeRequest,1363178707765
//SC001#1/1/1#1/1,get-reply,EncodeReponse,1363178707767
//SC001#1/1/1#1/2,get,DecodeRequest,1363178708765
//SC001#1/1/1#1/2,get-reply,EncodeReponse,1363178708767
//SC001#1/1/1#1/2,get,TalkToSocketEnd,1363178708770
//SC001#1/1/1#1/2,get,ClientEnd,1363178708775
//SC001#1/1/1#1/1,get,TalkToSocketEnd,1363178707770
//SC001#1/1/1#1/1,get,ClientEnd,1363178707775
//SC001#1/1/1#1/2,get,ClientStart,1363178708755
//SC001#1/1/1#1/2,get,TalkToSocketStart,1363178708760
Note: the delimiter here is a comma (,).
Likewise, there are many duplicated first-column values (IDs) in the log file (the example above has only two IDs: //SC001#1/1/1#1/1 and //SC001#1/1/1#1/2). I need to consolidate the log records into the format below.
ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd
//SC001#1/1/1#1/1,1363178707755,1363178707760,1363178707765,1363178707767,1363178707770,1363178707775
//SC001#1/1/1#1/2,1363178708755,1363178708760,1363178708765,1363178708767,1363178708770,1363178708775
I suppose a bash script could do this, and I would appreciate expert support. I hope there is a sed or awk solution, which would be more efficient.
Thanks much
One way:
sort -t, -k4n,4 file | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}'
The sort command orders the file numerically by the last (4th) column, the timestamp. awk then reads the sorted input and builds an array keyed on the 1st field, appending each line's last field, so every ID accumulates its timestamps in chronological order; the END block prints one consolidated line per ID. Note that awk's for (i in a) iterates in an unspecified order, so pipe the output through a final sort if the IDs must come out in order.
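Here is that pipeline run end-to-end on the sample log (a sketch; the trailing sort only makes the per-ID output order deterministic):

```shell
cat > instr.log <<'EOF'
//SC001#1/1/1#1/1,get,ClientStart,1363178707755
//SC001#1/1/1#1/1,get,TalkToSocketStart,1363178707760
//SC001#1/1/1#1/1,get,DecodeRequest,1363178707765
//SC001#1/1/1#1/1,get-reply,EncodeReponse,1363178707767
//SC001#1/1/1#1/2,get,DecodeRequest,1363178708765
//SC001#1/1/1#1/2,get-reply,EncodeReponse,1363178708767
//SC001#1/1/1#1/2,get,TalkToSocketEnd,1363178708770
//SC001#1/1/1#1/2,get,ClientEnd,1363178708775
//SC001#1/1/1#1/1,get,TalkToSocketEnd,1363178707770
//SC001#1/1/1#1/1,get,ClientEnd,1363178707775
//SC001#1/1/1#1/2,get,ClientStart,1363178708755
//SC001#1/1/1#1/2,get,TalkToSocketStart,1363178708760
EOF

# Sort by timestamp, collect each ID's timestamps, print one line per ID.
sort -t, -k4n,4 instr.log \
  | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF} END{for(i in a) print i","a[i]}' \
  | sort
# //SC001#1/1/1#1/1,1363178707755,1363178707760,1363178707765,1363178707767,1363178707770,1363178707775
# //SC001#1/1/1#1/2,1363178708755,1363178708760,1363178708765,1363178708767,1363178708770,1363178708775
```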
So I have a document "1", which is one column. I have 3 files, each with one column, and I want to append a randomly selected line from each of those files onto each line of document 1.
So like
awk 'NR==10' moves.txt 'NR==1' propp_tasks.txt
prints out
10.Qg3 Bb4+
First function of the donor
when I want it to be:
10 Qg3 Bb4+ First function of the donor
Is there a good way to do this with awk? I had been trying to set up a bash script with a for loop, but I didn't know how to cycle the indices so that on line n of document 1, columns 2, 3 and 4 would be appended. I feel like this should be really, really simple...
paste 1 <(cat 2 3 4 | sort -R)
If the length of the first file and the combined length of the other 3 files differ, then some more work is required.
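If instead you want one random line from each of the three donor files per output line (rather than drawing from their pooled lines), shuffling each file independently before pasting is one sketch. This assumes GNU coreutils shuf, equal-length files, and made-up file contents:

```shell
# Hypothetical one-column files named 1 through 4, three lines each.
printf 'a1\na2\na3\n' > 1
printf 'b1\nb2\nb3\n' > 2
printf 'c1\nc2\nc3\n' > 3
printf 'd1\nd2\nd3\n' > 4

# Each process substitution shuffles one donor file, so line n of the
# output is line n of file 1 plus one random line from each of 2, 3, 4.
paste 1 <(shuf 2) <(shuf 3) <(shuf 4)
```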