How can I output my text file as an uppercase hash? - terminal

I have a large text file consisting of two columns that I'm looking to convert to an uppercase hash. The text file is too large for me to open in Excel (over 1 million rows), and I was trying to do this through the command line if possible.
I'm hoping for just column B to be hashed, but it's fine if the rest of the file is hashed too.
Edit: Essentially, I have a text file with column A as first names and column B as email addresses. I was hoping to use something like this to convert column B into an uppercase hash, so that it is encrypted for transfer and someone else can convert it back.
I saw this code but wasn't sure where I'd specify the file name and the column I wanted to convert to a hash:
echo -n password | sha1sum | awk '{print $1}'
5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
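A minimal sketch of one way to apply that to a file, assuming a comma-separated file named contacts.csv with first names in column 1 and email addresses in column 2 (the file name and delimiter are placeholders):
while IFS=, read -r name email; do
  # hash the email address and uppercase the hex digest
  hash=$(printf '%s' "$email" | sha1sum | awk '{print toupper($1)}')
  printf '%s,%s\n' "$name" "$hash"
done < contacts.csv > hashed.csv
This spawns a couple of processes per line, so it will be slow on a million rows, but it shows where the file name and column fit; on macOS, shasum can stand in for sha1sum.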

Related

How to parse csv file into multiple csv based on row spacing

I'm trying to build an Airflow DAG and need to split out 7 tables contained in one CSV into seven separate CSVs.
dataset1
header_a
header_b
header_c
One
Two
Three
One
Two
Three
<-Always two spaced rows between data sets
dataset N <-part of csv file giving details on data
header_d
header_e
header_f
header_g
One
Two
Three
Four
One
Two
Three
Four
out:
dataset1.csv
datasetn.csv
Based on my research, I think my solution might lie in awk searching for the double spaces?
EDIT: In plain text as requested.
table1 details1,
table1 details2,
table1 details3,
header_a,header_b,header_c,
1,2,3
1,2,3
tableN details1,
tableN details2,
tableN details3,
header_a, header_b,header_c,header_N,
1,2,3,4
1,2,3,4
Always two spaced rows between data sets
If your CSV file contains blank lines, and your goal is to write each chunk of records separated by those blank lines out to an individual file, then you could use awk with its record separator RS set to the empty string, which makes it treat each blank-line-separated "paragraph" as a record. Each of them can then be redirected to a file whose name is based on the record number NR:
awk -vRS= '{print $0 > ("output_" NR ".csv")}' input.csv
This reads from input.csv and writes the chunks to output_1.csv, output_2.csv, output_3.csv and so forth.
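If you want the output files named after the datasets rather than numbered, a possible variant (a sketch, assuming the first word on the first line of each block is the dataset name, as in your plain-text sample):
awk -v RS= '{
    split($0, lines, "\n")        # lines[1] is the first line of the block
    split(lines[1], parts, " ")   # assume its first word is the dataset name
    print $0 > (parts[1] ".csv")  # e.g. "table1 details1," -> table1.csv
}' input.csv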
If my interpretation of your input file's structure (or your problem in general) is wrong, please provide more detail to clarify.

How can I extract any cell value of a column (called, say, "X") from a text file with multiple columns using Bash?

I have a huge file with 100 columns.
I am concerned with one column called 'Location'. I know for a fact that all rows of this column hold the same value. I need to get that value through Bash.
Any thoughts on how to go about this?
If the column is always in the same position relative to the other columns (say, the 10th), you could use
cut -d" " -f10
This assumes there is a single space between columns; you can change the delimiter to whatever separates the columns.
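If you would rather locate the column by its header name instead of its position, a sketch along these lines may help (it assumes a tab-separated file called data.txt with a header row; since every row holds the same value, it prints the value from the first data row and stops):
awk -F'\t' '
NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Location") col = i; next }
NR == 2 { print $col; exit }
' data.txt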

Parsing/sorting/de-duplicating large matrix of info in UTF-8 form

I have a large file in UTF-8 form (I've converted it from ISO-8859-1) that I have opened in the terminal on a Mac.
I've been trying to use the parse.date function to convert data in one of the column fields to date form.
I also need to filter all of the rows (each row represents a company, each column represents different data field for each company: i.e. founder, location, year created, etc.) on a certain column field.
As a bonus I would like to de-duplicate the data as well.
Then finally, I'd like to run analysis on this data by sorting the data via different column fields and working with survival curves.
I've been scouring the internet for the appropriate terminal commands to approach this with. Could anyone give me direction on how to get started?
The first problem is separating the fields.
I assume the fields are TAB-separated:
cat file.txt | sort -t$'\t' -k 2
If there are tabs and spaces mixed together, I would assume there are no successive spaces inside a field, so I would write it this way:
cat file.txt | sed -e 's/\s\+/\t/g' | sort -t$'\t' -k 2
This will sort file.txt according to the 2nd column.
If column 2 is numeric, add the -n option.
If you want a stable sort (which keeps the previous ordering whenever possible), add the -s option.
If you want to eliminate duplicates, add the -u option.
cat file.txt | sort -t$'\t' -k 2 -n -s -u
For more details:
man sort
(I don't know about the parse.date function.)
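For the filtering part of the question, awk can also select only the rows where a given column matches a value; a small sketch (the column number, value, and file names are placeholders):
awk -F'\t' '$3 == "Acquired"' file.txt > filtered.txt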

Add index column to CSV file

I have a large comma-separated file (6 GB) and would like to add an index column to it. I'm looking at Unix-type solutions for efficiency. I'm using a Mac.
I have this:
V1 V2 V3
0.4625 0.9179 0.8384
0.9324 0.2486 0.1114
0.6691 0.7813 0.6705
0.1935 0.3303 0.4336
Would like to get this:
ID V1 V2 V3
1 0.4625 0.9179 0.8384
2 0.9324 0.2486 0.1114
3 0.6691 0.7813 0.6705
4 0.1935 0.3303 0.4336
This will probably work:
awk -F'\t' -v OFS='\t' '
NR == 1 {print "ID", $0; next}
{print (NR-1), $0}
' input.csv > output.csv
In awk, the NR variable is "the total number of input records seen so far", which in general means "the current line number". So the NR == 1 in the first line is how we match the first record and add the "ID" column header, and for the remaining lines we use NR-1 as the index.
The -F'\t' argument sets the input field separator, and -v OFS='\t' sets the output field separator.
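If the file really is comma-separated, as its name suggests, the same approach should work with the separators swapped to commas:
awk -F',' -v OFS=',' 'NR == 1 {print "ID", $0; next} {print (NR-1), $0}' input.csv > output.csv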
Since no technology is specified in the original post, I'd be happy here to keep it simple (all the fancy Vim/Bash solutions are fine if you know what you're doing):
Open the CSV file in your favourite spreadsheet programme (I'm using LibreOffice, but Excel or a native Mac equivalent will do)
Insert a column to the left of column A
Enter a 1 into cell A2, the first cell under the headers
Double-click the fill handle (the small square at the bottom right of the cell)
This last step will fill the index column with 1, 2, 3... and so on.
You can then save the resulting spreadsheet as a CSV file again.
I assume you have a comma-delimited file.
Using vim, open the file. In normal mode, type
:%s/^/\=line('.').','/
:%s/^/\=line('.')/ adds the line number at the beginning of each line. Since you have a comma-delimited file (and are adding a column), you need a comma after the line number, hence the .','
See this answer for a full explanation of :%s/^/\=line('.')/
Open the CSV file in your favorite spreadsheet program, such as Excel
Insert a column to the left of the first column
Type 1 in the first cell of this column
Type the formula '=A2+1' into the following cell
Double-click the fill handle (the small square at the bottom right of the cell)

Bash script - Construct a single line out of many lines having duplicates in a single column

I have an instrumented log file that has 6 lines with a duplicated first column, as below.
//SC001#1/1/1#1/1,get,ClientStart,1363178707755
//SC001#1/1/1#1/1,get,TalkToSocketStart,1363178707760
//SC001#1/1/1#1/1,get,DecodeRequest,1363178707765
//SC001#1/1/1#1/1,get-reply,EncodeReponse,1363178707767
//SC001#1/1/1#1/2,get,DecodeRequest,1363178708765
//SC001#1/1/1#1/2,get-reply,EncodeReponse,1363178708767
//SC001#1/1/1#1/2,get,TalkToSocketEnd,1363178708770
//SC001#1/1/1#1/2,get,ClientEnd,1363178708775
//SC001#1/1/1#1/1,get,TalkToSocketEnd,1363178707770
//SC001#1/1/1#1/1,get,ClientEnd,1363178707775
//SC001#1/1/1#1/2,get,ClientStart,1363178708755
//SC001#1/1/1#1/2,get,TalkToSocketStart,1363178708760
Note: , (comma) is the delimiter here
Likewise, there are many duplicate first-column values (IDs) in the log file (the above example has only two IDs: //SC001#1/1/1#1/1 and //SC001#1/1/1#1/2). I need to consolidate the log records into the format below.
ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd
//SC001#1/1/1#1/1,1363178707755,1363178707760,1363178707765,1363178707767,1363178707770,1363178707775
//SC001#1/1/1#1/2,1363178708755,1363178708760,1363178708765,1363178708767,1363178708770,1363178708775
I suppose I need a bash script for this exercise and would appreciate expert support. I hope there may be a sed or awk solution, which would be more efficient.
Thanks much
One way:
sort -t, -k4n,4 file | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}'
The sort command sorts the file on the basis of the last (4th) column. awk takes the sorted input and builds an array where the 1st field is the key and the value is the concatenation of the last-column values.
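If you also want the header row from your desired output, one option (a sketch, assuming the six event names always occur in that fixed order, and writing to a hypothetical consolidated.csv) is to print it ahead of the pipeline:
{ echo "ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd"
  sort -t, -k4n,4 file | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}'
} > consolidated.csv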
