Unique entry set in the first column of all csv files under directory [duplicate] - bash

This question already has answers here:
Is there a way to 'uniq' by column?
(8 answers)
Closed 7 years ago.
I have a set of comma-separated files under a directory. There are no headers, and unfortunately the rows are not even all the same length.
I want to find the unique entry in the first column across all files.
What's the quickest way of doing it in shell programming?
awk -F "," '{print $1}' *.txt | uniq
seems to only get unique entries within each file; I want them across all files.

The shortest is still awk (this prints the whole matching row):
awk -F, '!a[$1]++' *.txt
To get just the first field:
awk -F, '!a[$1]++ {print $1}' *.txt
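The !a[$1]++ idiom works because awk creates the array entry the first time a key is seen: its value is still 0 (false), so the negation is true and the line is printed; every later occurrence of the same first field is suppressed. Because all files are handed to a single awk process, the array spans every file. A small illustration, using hypothetical files a.txt and b.txt:
$ cat a.txt
x,1,foo
y,2
$ cat b.txt
x,9,bar
z,3
$ awk -F, '!a[$1]++ {print $1}' *.txt
x
y
z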

I need to filter only duplicated lines from many files using bash [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 2 years ago.
I have the following three files:
filea:
a
bc
cde
fileb:
a
bc
cde
frtdff
filec:
a
bc
cddeeer
erer34
I am able to filter out the lines duplicated in all three files. I am using the following command:
ls file* | wc -l
which returns 3. Then, I am launching
sort file* | uniq --count --repeated | awk '{ if ($1 == 3) { print $2} }'
The last command returns precisely what I need, but only as long as no additional files starting with "file" are created. However, since thousands of such files may be created while a script is running, I need to take the exact number of files from this command:
n=`ls file* | wc -l`
sort file* | uniq --count --repeated | awk '{ if ($1 == $n) { print $2} }'
Unfortunately, the variable n is not accepted inside the if condition of the awk command. My issue is that I am not able to use the value of n as the comparison criterion inside an if conditional that is part of the awk program.
You can use:
awk '!line[$0]++' file*
This will print each line only once, even if it appears in several files and/or several times in the same file.
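If you do want to keep your count-based approach, the shell variable can be passed into awk with -v instead of being interpolated into the single-quoted program (a sketch, assuming the only files matching file* are the ones you want counted):
n=$(ls file* | wc -l)
sort file* | uniq --count --repeated | awk -v n="$n" '$1 == n { print $2 }'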

Getting unique values from column in a csv file [duplicate]

This question already has answers here:
awk to remove duplicate rows totally based on a particular column value
(6 answers)
Closed 4 years ago.
I have the following input:
no,zadrar,MENTOR,rossana#xt.com,AGRATE
no,mittalsu,MENTOR,rossana#xt.com,GREATER NOIDA
no,abousamr,CADENCE,selim#xt.com,CROLLES
no,lokinsks,MENTOR,sergey#xt.com,CROLLES
no,billys,MENTOR,billy#xt.com,CROLLES
no,basiles1,CADENCE,stephane#xt.com,CASTELLETTO
no,cesaris1,CADENCE,stephane#xt.com,CROLLES
I want to get only the lines where column 4 is unique:
no,abousamr,CADENCE,selim#xt.com,CROLLES
no,lokinsks,MENTOR,sergey#xt.com,CROLLES
no,billys,MENTOR,billy#xt.com,CROLLES
I tried with:
awk -F"," '{print $4}' $vendor.csv | sort | uniq -u
But I get:
selim#xt.com
sergey#xt.com
billy#xt.com
You can simply use the options provided by the sort command:
sort -u -t, -k4,4 file.csv
As you can see in the man page, the -u option stands for "unique", -t sets the field delimiter, and -k selects the sort key.
Could you please try the following (reading the Input_file twice):
awk -F',' 'FNR==NR{a[$4]++;next} a[$4]==1' Input_file Input_file
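If you prefer to stay close to your original pipeline, another sketch is to feed the unique addresses back into grep; note this assumes an address never shows up in any other column of the file:
awk -F, '{print $4}' "$vendor.csv" | sort | uniq -u | grep -F -f - "$vendor.csv"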

Get the extension of file [duplicate]

This question already has answers here:
Extract filename and extension in Bash
(38 answers)
Closed 7 years ago.
I have files with "multiple" extensions; for easier manipulation I would like to create a new folder for each last extension, but first I need to retrieve that last extension.
For example, let's assume I have a file called info.tar.tbz2. How could I get "tbz2"?
One way that comes to my mind is using cut -d ".", but then I would need to pass the index of the last field to -f, which I don't know how to determine.
What is the fastest way to do it?
You may use awk:
awk -F. '{print $NF}' file
or sed:
$ echo 'info.tar.tbz2' | awk -F. '{print $NF}'
tbz2
$ echo 'info.tar.tbz2' | sed 's/.*\.//'
tbz2
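If the name is already in a shell variable, bash parameter expansion gives the same result without starting another process (a minimal sketch, assuming the name contains at least one dot):
$ f=info.tar.tbz2
$ echo "${f##*.}"
tbz2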

awk delete line if number of rows = some number [duplicate]

This question already has answers here:
Save modifications in place with awk
(7 answers)
Closed 7 years ago.
This is my awk code:
awk -F"," 'NF!= 8' myfile.csv
How can I delete only the lines that have 8 fields?
Here you go (this prints lines with 8 fields, as originally asked):
awk -F, 'NF==8' myfile.csv
The question changed: you want to remove the lines with 8 fields. One way to do this:
awk -F, 'NF!=8' myfile.csv > temp && mv temp myfile.csv
NB. updated as per comments
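If you have GNU awk 4.1 or later, its inplace extension can overwrite the file directly instead of going through a temporary file (a sketch; it rewrites the file in place, so keep a backup if the data matters):
gawk -i inplace -F, 'NF!=8' myfile.csv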

Extract Information From File Name in Bash [duplicate]

This question already has answers here:
How to split a string into an array in Bash?
(24 answers)
Closed 7 years ago.
Suppose I have a file named ABC_DE_FGHI_10_JK_LMN.csv. I want to extract the ID from the file name, i.e. 10, given the ID position and the file-name delimiter. I have the following two inputs:
File-name_ID_Position=4; [since 10 is at fourth position in file-name]
File-name_Delimiter="_";
Here the ID can be numeric or alphanumeric. So how do I extract the 10 from the above file name using the above two inputs? How can I achieve this in bash?
Instead of writing a regex in bash, I would do it with awk:
echo 'ABC_DE_FGHI_10_JK_LMN.csv' | awk -F_ -v pos=4 '{print $pos}'
or if you want the dot to also be a delimiter (requires GNU awk):
echo 'ABC_DE_FGHI_10_JK_LMN.csv' | awk -F'[_.]' -v pos=4 '{print $pos}'
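A pure-bash alternative is to strip the extension and split the name into an array on the delimiter (a sketch; the position from the question is 1-based, so it is converted to a 0-based array index):
$ name=ABC_DE_FGHI_10_JK_LMN.csv
$ pos=4
$ IFS=_ read -r -a parts <<< "${name%.csv}"
$ echo "${parts[pos-1]}"
10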
