Removes duplicate lines from files recursively - bash

I have a directory with a bunch of CSV files. I want to remove the duplicate lines from all of them.
I have tried an awk solution, but it is tedious to run it for each and every file:
awk '!x[$0]++' file.csv
Even if I run
awk '!x[$0]++' *
I will lose the file names. Is there a way to remove duplicates from all the files using just one command or script?
Just to clarify:
If there are 3 files in the directory, then the output should contain 3 files, each processed independently. After running the command or script, the same folder should contain 3 files, each with unique entries.

for f in dir/*; do
awk '!a[$0]++' "$f" > "$f.uniq"
done
To overwrite the existing files, change the command (after testing!) to: awk '!a[$0]++' "$f" > "$f.uniq" && mv "$f.uniq" "$f"
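A minimal sketch of the overwrite variant wrapped in a function, assuming bash and mktemp are available; the function name dedupe_dir is made up for illustration. It writes to a temporary file next to the original and replaces the original only if awk succeeds:

```shell
#!/bin/bash
# dedupe_dir (hypothetical helper): remove duplicate lines from every
# regular file in the given directory, keeping the first occurrence.
dedupe_dir() {
  local dir=$1 f tmp
  for f in "$dir"/*; do
    [ -f "$f" ] || continue                 # skip subdirectories and unmatched globs
    tmp=$(mktemp "$f.XXXXXX") || return 1   # temp file next to the original
    if awk '!a[$0]++' "$f" > "$tmp"; then
      mv -- "$tmp" "$f"                     # replace only after awk succeeded
    else
      rm -f -- "$tmp"
      return 1
    fi
  done
}

# Usage: dedupe_dir /path/to/csv_dir
```

Note that each replaced file takes on the permissions of the temporary file created by mktemp.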

With GNU awk for "inplace" editing and automatic open/close management of output files:
awk -i inplace '!seen[FILENAME,$0]++' *.csv

This will create new files, with suffix .new, that have only unique lines:
gawk '!x[FILENAME,$0]++{print>(FILENAME".new")}' *.csv
How it works
!x[FILENAME,$0]++
This is a condition. It evaluates to true only if the current line, $0, has not been seen before in the current file, FILENAME. (With $0 alone as the key, a line already seen in an earlier file would also be dropped from every later file.)
print >(FILENAME".new")
If the condition evaluates to true, then this print statement is executed. It writes the current line to a file whose name is the name of the current file, FILENAME, followed by the string .new.
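When several files are passed to a single awk process, the array key decides whether duplicates are tracked per file or across all files. The sketch below keys on both FILENAME and $0, so a line already seen in one file is still kept in the next. It also closes each output file when the next input file starts, which awks other than gawk may need when there are many files. The function name unique_per_file is made up for illustration:

```shell
#!/bin/bash
# unique_per_file (hypothetical helper): for each file given as an argument,
# write FILE.new containing the unique lines of that file only.
unique_per_file() {
  awk '
    FNR == 1 && out != "" { close(out) }   # previous file is done: close its output
    FNR == 1 { out = FILENAME ".new" }     # output name for the current input file
    !seen[FILENAME, $0]++ { print > out }  # key on file name AND line content
  ' "$@"
}

# Usage: unique_per_file *.csv
```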

Related

Need command or script to rename a list of files in linux using a pattern match

I have downloaded some 90 FASTA files from NCBI for bacterial genomes. The downloaded files have the default names given by NCBI, and I need to change them to my desired file names. I have therefore created two .txt files:
file1.txt - the default file names provided by NCBI
file2.txt - the names that should replace the defaults
Both files are in the same order, so the 1st entry of file1.txt corresponds to the 1st entry of file2.txt.
All the downloaded files are in one folder, and I need a script which reads file1.txt, matches each entry with a file name in the folder, and replaces it with the corresponding name in file2.txt.
I am not a bioinformatician and am new to this area. I look forward to your help. Can this process be made simpler?
This can be done with a very small awk one-liner. For convenience, let's first combine your file1 and file2 to make processing easier. This can be done with paste file1.txt file2.txt > names.txt.
names.txt will be a text file with the old names in the first column and the new names in the second. Awk lets us conveniently run through a file line-by-line (or record-by-record in its terminology) and access each column/field.
Assuming you are in the directory with all these files, as well as names.txt, you can simply run awk '{system("mv " $1 " " $2)}' names.txt to rename them all. This will run through all the lines in names.txt, take the file name given in the first column, and move it to the name given in the second column. The system function lets you run shell commands from awk, such as moving (mv), copying (cp), or removing (rm) files. Note that this assumes the file names contain no whitespace or shell metacharacters.
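Because the awk command passes the names to the shell unquoted, a plain shell loop over the pasted lists may be a safer alternative. This is only a sketch: it assumes bash, and that the names contain no tabs or newlines. The function name rename_from_lists and the DRY_RUN variable are made up for illustration:

```shell
#!/bin/bash
# rename_from_lists (hypothetical helper)
#   $1: file listing the current names, $2: file listing the new names (same order)
# Set DRY_RUN=1 to print the mv commands instead of running them.
rename_from_lists() {
  paste "$1" "$2" | while IFS=$'\t' read -r old new; do
    [ -e "$old" ] || { echo "skip: $old not found" >&2; continue; }
    if [ -n "${DRY_RUN:-}" ]; then
      echo mv -- "$old" "$new"        # only show what would be done
    else
      mv -- "$old" "$new"             # quoted, so spaces in names are fine
    fi
  done
}

# Usage:
#   DRY_RUN=1 rename_from_lists file1.txt file2.txt   # check first
#   rename_from_lists file1.txt file2.txt
```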
Use paste and xargs like so:
paste file1.txt file2.txt | xargs --verbose -n2 mv
The command is using paste to write lines from 2 files side by side, separated by TABs, to STDOUT. The STDOUT is read by xargs using a pipe (|). Option --verbose prints the command, and option -n2 specifies the max number of arguments for xargs to be 2, so that the resulting commands that are executed are something like mv old_file new_file.
Alternatively, use the Perl one-liners below.
Print the commands to rename the files, without executing the commands ("dry run"):
paste file1.txt file2.txt | perl -lane '$cmd = "mv $F[0] $F[1]"; print $cmd;'
Print the commands to rename the files, then actually execute them:
paste file1.txt file2.txt | perl -lane '$cmd = "mv $F[0] $F[1]"; print $cmd; system $cmd;'
The command uses paste to write lines from the 2 files side by side, separated by TABs, to STDOUT, which is piped (|) to the STDIN of the Perl one-liner.
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
$F[0], $F[1] : first and second elements of the array @F into which the line is split. They are old and new file names, respectively.
system executes the command $cmd, which actually moves the files.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Prepending part of a filename to a .csv file using bash/sed

I have a couple of files in a directory that are named like this:
1_38OE983729JKHKJV.csv
that is, an integer followed by an ID (the integer and ID are both unique).
I need to prepend this ID to every line of the file for each file in the folder to prepare the files for import to a database (and discard the integer part of the filename). The contents of the file look something like this:
BW;20015;11,45;0,49;41;174856;4103399
BA;25340;11,41;0,55;40;222161;4599779
BB;800;7,58;0,33;42;10559;239887
HE;6301;9,11;0,39;40;69191;1614302
.
.
.
Total;112613;9,33;0,43;40;1207387;25897426
The end result should look something like this:
38OE983729JKHKJV;BW;20015;11,45;0,49;41;174856;4103399
38OE983729JKHKJV;BA;25340;11,41;0,55;40;222161;4599779
38OE983729JKHKJV;BB;800;7,58;0,33;42;10559;239887
38OE983729JKHKJV;HE;6301;9,11;0,39;40;69191;1614302
.
.
.
38OE983729JKHKJV;Total;112613;9,33;0,43;40;1207387;25897426
Thanks for the help!
EDIT: Spelling and vocabulary for clarity
Loop over the files with for, use parameter expansion to extract the id.
#!/bin/bash
for csv in *.csv ; do
prefix=${csv%_*}
id=${csv#*_}
id=${id%.csv}
sed -i~ "s/^/$id;/" "$csv"
done
If the ID can contain underscores, you might need to be more careful with the expansion.
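To see what each expansion does, here is a small sketch with made-up file names. `#` removes the shortest matching prefix and `%` the shortest matching suffix, so `${csv#*_}` cuts only up to the first underscore and an ID that itself contains underscores survives. The function name extract_id is an illustration, not part of the answer above:

```shell
#!/bin/bash
# extract_id (hypothetical helper): print the ID part of a name that
# looks like <integer>_<ID>.csv
extract_id() {
  local csv=$1 id
  id=${csv#*_}      # drop everything up to and including the FIRST underscore
  id=${id%.csv}     # drop the .csv extension
  printf '%s\n' "$id"
}

extract_id 1_38OE983729JKHKJV.csv   # prints 38OE983729JKHKJV
extract_id 12_AB_CD.csv             # prints AB_CD
```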
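With awk:
for f in *.csv; do awk '{ fn=FILENAME; i=index(fn,"_"); $0=substr(fn,i+1,length(fn)-i-4)";"$0 }1' "$f" > tmp && mv tmp "$f"; done
fn=FILENAME is the file name; substr() extracts the part between the first underscore and the .csv extension, whatever the length of the integer prefix.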
You can also try the following single awk command. It calls close() on the previous file name whenever a new file starts, with the intent of avoiding the "too many open files" error. Note that it prints to standard output rather than modifying the files.
awk 'FNR==1{close(val);val=FILENAME;split(FILENAME,a,"_");sub(/\..*/,"",a[2])} {print a[2]";"$0}' *.csv
With GNU awk for inplace editing and gensub() all you need is:
awk -i inplace '{print gensub(/.*_(.*)\..*/,"\\1;",1,FILENAME) $0}' *.csv
No shell loops or anything else necessary, just that command.

Remove a header from a file during parsing

My script gets every .csv file in a dir and writes them into a new file together. It also edits the files such that certain information is written into every row for all of a file's entries. For instance, this file called "trap10c_7C000000395C1641_160110.csv":
"",1/10/2016
"Timezone",-6
"Serial No.","7C000000395C1641"
"Location:","LS_trap_10c"
"High temperature limit (°C)",20.04
"Low temperature limit (°C)",-0.02
"Date - Time","Temperature (°C)"
"8/10/2015 16:00",30.0
"8/10/2015 18:00",26.0
"8/10/2015 20:00",24.5
"8/10/2015 22:00",24.0
is converted into this format:
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Location:,LS_trap_10c
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,High,temperature,limit,(°C),20.04
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Low,temperature,limit,(°C),-0.02
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Date,-,Time,Temperature,(°C)
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,16:00,30.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,18:00,26.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,20:00,24.5
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,22:00,24.0
I use this script to do this:
dos2unix *.csv
gawk '{print FILENAME, $0}' *.csv>>all_master.erin
sed -i 's/Serial No./SerialNo./g' all_master.erin
sed -i 's/ /,/g' all_master.erin
gawk -F, '/"SerialNo."/ {sn = $3}
/"Location:"/ {loc = $3}
/"([0-9]{1,2}\/){2}[0-9]{4} [0-9]{2}:[0-9]{2}"/ {lin = $0}
{$0 =loc FS sn FS $0}1' all_master.erin > formatted_log.csv
sed -i 's/\"//g' formatted_log.csv
sed -i '/^,/ d' formatted_log.csv
rm all_master.erin
printf "\nDone\n"
I want to remove the messy header from the formatted_log.csv file. I've tried and failed to use sed, as it seems to remove things that I don't want to remove. Is sed the best way to approach this problem? The current sed fixes some problems with the header, but I want the header gone entirely. The lines that say "serial no." and "location" are important because they supply required information. The other header lines can be removed entirely.
I suppose you edited your script before posting; as it stands, it will not produce the posted output (all_master.erin should be $(<all_master.erin) except in the first occurrence).
You don’t specify many vital details of the format of your input files, so we must guess them. Here are my guesses:
You ignore the first two lines and the subsequent empty third line.
The 4th and 5th lines are useful, since they provide the serial number and location you want to use in all lines of that file
The 6th, 7th and 8th lines are useless.
For each file, you want to discard the first four lines of the posted output.
With these assumptions, this is how I would modify your script:
#!/bin/bash
dos2unix *.csv
awk -vFS=, -vOFS=, \
'{gsub("\"","")}
FNR==4{s=$2}
FNR==5{l=$2}
FNR>8{gsub(" ",OFS);print l,s,FILENAME,$0}' \
*.csv > formatted_log.CSV
printf "\nDone\n"
Explanation of the awk script:
First we delete all double quotes with gsub("\"",""). Then, if the line number is 4, we set the variable s to the second field, which is the serial number. If the line number is 5, we set the variable l to the second field, which is the location. If the line number is greater than 8, we do two things. First, we execute gsub(" ",OFS) to replace all spaces with the value of the output field separator: this is needed because the intended output makes two separate fields of date and time, which were only one field in the input. Second, we print the line preceded by the values of l, s and FILENAME as requested.
Note that I’m using the (questionable) Unix trick of naming the output file with an all-caps extension .CSV to avoid it being wrongly matched by a subsequent *.csv. A better solution would be to put it in another directory, but I don’t know anything about your directory tree so I suggest you modify the output file name yourself.
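If the header length can vary between files, matching on content instead of line number may be more robust. The following is only a sketch under assumptions taken from the sample in the question: the serial number and location lines start with "Serial No." and "Location:", both appear before the data, and data rows start with a date and time such as 8/10/2015 16:00. The function name format_logs is made up:

```shell
#!/bin/bash
# format_logs (hypothetical helper): print "location,serial,filename,date,time,value"
# for every data row of the given logger files; all header lines are dropped.
format_logs() {
  awk -F, -v OFS=, '
    { gsub(/"/, "") }                      # strip all double quotes
    $1 == "Serial No." { sn = $2; next }   # remember the serial number
    $1 == "Location:"  { loc = $2; next }  # remember the location
    # data rows: first field is a date followed by a time
    $1 ~ /^[0-9]+\/[0-9]+\/[0-9]+ [0-9]+:[0-9]+$/ {
      sub(/ /, OFS)                        # split date and time into two fields
      print loc, sn, FILENAME, $0
    }
  ' "$@"
}

# Usage: format_logs *.csv > /some/other/dir/formatted_log.csv
```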
You could use awk to remove anything with fewer than 3 columns from your final file (-F, is needed because the file is comma-separated):
awk -F, 'NF>=3' file

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[@]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
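A sketch of that approach, assuming whitespace-separated input as in the sample and a header on the first line. The function name split_by_sport and the output directory argument are illustrative, and the directory should be new or empty because rows are appended:

```shell
#!/bin/bash
# split_by_sport (hypothetical helper): split INPUT by column 7 into OUTDIR,
# one file per value, each starting with the header line.
split_by_sport() {
  local input=$1 outdir=$2
  mkdir -p "$outdir" || return 1
  awk -v dir="$outdir" '
    NR == 1 { header = $0; next }           # remember the header
    NF < 7  { next }                        # ignore short or blank lines
    {
      out = dir "/" $7 ".csv"
      if (!(out in started)) {              # first row for this value
        print header > out
        close(out)
        started[out] = 1
      }
      print >> out                          # append the row
      close(out)                            # keep the number of open files low
    }
  ' "$input"
}

# Usage:
#   split_by_sport input_file.csv out
#   for f in out/*.csv; do echo "$f"; done
```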
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
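#!/usr/bin/awk -f
# First line: remember the header, do not treat it as data.
FNR==1 {
    header=$0
    next
}
# First time this sport is seen: record its file name and write the header.
!($7 in files) {
    files[$7]=sprintf("sport-%s.csv", $7)
    print header > (files[$7])
}
# Every data line goes to the file for its sport.
{
    print > (files[$7])
}
# Emit a bash array declaration listing the generated file names.
END {
    printf("declare -a sportlist=( ")
    for (sport in files) {
        printf("\"%s\" ", files[sport])
    }
    printf(")\n")
}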
The idea here is that we store sport names in the array files[], and build filenames out of that array. (You can format the filename inside sprintf() as you see fit.) We step through the file, adding a header line whenever we get a new sport with no recorded filename. Then for non-headers, print to the file based on the sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script inside command substitution, and the declare command will effectively be interpreted by your shell:
eval "$(/path/to/awkscript inputfile.csv)"
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the file names might contain spaces, you need a different solution; assuming you're on bash 4, you can print each file name on a separate line and use readarray (-t strips the trailing newline from each element):
readarray -t filesArray < <( awk ... )
If the file names might contain newlines, too, then things get tricky...
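As a sketch of the readarray route, assuming bash 4 or later and file names without newlines: the awk program below is a stand-in that prints one file name per distinct column-7 value, one per line. The function name list_sport_files is made up for illustration:

```shell
#!/bin/bash
# list_sport_files (hypothetical helper): print one file name per distinct
# value in column 7 of the input, skipping the header line.
list_sport_files() {
  awk 'NR > 1 && NF >= 7 && !seen[$7]++ { print $7 ".csv" }' "$1"
}

# Usage (input_file.csv is the file from the question):
#   readarray -t filesArray < <(list_sport_files input_file.csv)
#   for f in "${filesArray[@]}"; do echo "$f"; done
```

Process substitution is used instead of a pipe so that readarray runs in the current shell and the array is still set afterwards.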
If your file is not large, you can run another script to get the unique $7 values. For example,
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values. You can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
This can then be piped to your other process, here for example to wc:
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.

how to write finding output to same file using awk command

awk '/^nameserver/ && !modif { printf("nameserver 127.0.0.1\n"); modif=1 } {print}' testfile.txt
It displays the output, but I want to write the output back to the same file, testfile.txt in my example.
Not possible per se. You need a second temporary file because you can't read and overwrite the same file. Something like:
awk '(PROGRAM)' testfile.txt > testfile.tmp && mv testfile.tmp testfile.txt
The mktemp program is useful for generating unique temporary file names.
There are some hacks for avoiding a temporary file, but they rely mostly on caching and read buffers and quickly get unstable for larger files.
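A sketch of the temporary-file approach, assuming bash and mktemp are available; awk_in_place is a made-up name. The temporary file is created next to the target so that mv is a rename on the same file system, and it is removed if awk fails:

```shell
#!/bin/bash
# awk_in_place (hypothetical helper): run an awk PROGRAM on FILE and replace
# FILE with the result.  Usage: awk_in_place 'PROGRAM' FILE
awk_in_place() {
  local prog=$1 file=$2 tmp
  tmp=$(mktemp "$file.XXXXXX") || return 1   # unique name beside the target
  if awk "$prog" "$file" > "$tmp"; then
    mv -- "$tmp" "$file"                     # replace only on success
  else
    rm -f -- "$tmp"                          # leave the original untouched
    return 1
  fi
}

# Usage with the program from the question:
#   awk_in_place '/^nameserver/ && !modif { print "nameserver 127.0.0.1"; modif=1 } {print}' testfile.txt
```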
Since GNU Awk 4.1.0, there is the "inplace" extension, so you can do:
$ gawk -i inplace '{ gsub(/foo/, "bar") }; { print }' file1 file2 file3
To keep a backup copy of original files, try this:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{ gsub(/foo/, "bar") }
> { print }' file1 file2 file3
This can be used to simulate the GNU sed -i feature.
See: Enabling In-Place File Editing
Although using a temp file is correct, I don't like it because:
you have to be sure not to overwrite another temp file (yes, you can use mktemp; it's a pretty useful tool)
you have to take care of deleting it (or moving it, as thiton said), INCLUDING when your script crashes or stops before the end (so deleting temp files at the end of the script is not that wise)
it generates disk I/O (OK, not that much, but we can make it lighter)
So my method to avoid temp file is simple:
my_output="$(awk '(PROGRAM)' source_file)"
echo "$my_output" > source_file
Note the use of double quotes both when capturing the output of the awk command AND when using echo (if you omit them, you will lose the newlines).
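A slightly more defensive sketch of the same idea, which only overwrites the file when awk exits successfully. The function name is made up, and it assumes the output is non-empty text that fits comfortably in memory. Command substitution strips trailing newlines, so printf adds one back:

```shell
#!/bin/bash
# rewrite_via_variable (hypothetical helper): run an awk PROGRAM on FILE,
# hold the result in a shell variable, then write it back to FILE.
rewrite_via_variable() {
  local prog=$1 file=$2 output
  # Overwrite only if awk succeeded; otherwise keep the original file.
  if output=$(awk "$prog" "$file"); then
    printf '%s\n' "$output" > "$file"
  else
    return 1
  fi
}

# Usage: rewrite_via_variable '{$2=10*$2}1' source_file
```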
Had to make an account when seeing 'awk' and 'not possible' in one sentence. Here is an awk-only solution without creating a temporary file:
awk '{a[b++]=$0} END {for(c=0;c<b;c++)print a[c]>ARGV[1]}' file
You can also use sponge from moreutils.
For example
awk '!a[$0]++' file|sponge file
removes duplicate lines and
awk '{$2=10*$2}1' file|sponge file
multiplies the second column by 10.
Try adding a print statement with a redirection to your awk script, so that the output goes to a new file. Here total is a calculated value.
print $total, total >> "new_file"
This inline writing worked for me. Redirect the output from print back to the original file.
echo "1" > test.txt
awk '{$1++; print> "test.txt"}' test.txt
cat test.txt
#$> 2
