Remove duplicates, but have a problem deleting rows whose column 13 contains "-" - bash

I have this input file.
I need to remove the duplicated rows based on column 13, but I have a problem with the rows whose column 13 contains a "-": why does my code not remove them?
input
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RS|201908|RS|129220198
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|162230484
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|192863252
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RX|201907|RB|192863252
13220610|4|615906412|5|05502216092 |411|8|798|798|RB|201811|RB|13220610-4
13219722|9|644118078|5|05502217789 |310|8|730|730|RS|201811|RS|13219722-9
13219789|K|36062376|5|05202316950 |315|4|493|493|RS|201811|RS|13219789-K
13220015|7|70321801|5|05502623275 |310|1|359|359|RB|201811|RB|13220015-7
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RX|201908|RX|48510787
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RS|201908|RS|129220198
13220610|4|615906412|5|05502216092 |411|8|798|798|RB|201811|RB|13220610-4
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|138290077
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|15568996K
13219789|K|36062376|5|05202316950 |310|4|493|493|RS|201811|RS|13219789-K
I need to remove the rows that have a repeated value in column 13, but my code only removes the rows whose column 13 does not contain a "-".
{seen[$13]++; a[++count]=$0; key[count]=$13}
END {for (i=1;i<=count;i++)
        if (seen[key[i]] == 1) {print a[i] >> (File".ok")}
        else                   {print a[i] >> (File".nok")}}
desired output
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|162230484
13219722|9|644118078|5|05502217789 |310|8|730|730|RS|201811|RS|13219722-9
13220015|7|70321801|5|05502623275 |310|1|359|359|RB|201811|RB|13220015-7
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RX|201908|RX|48510787
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RS|201908|RS|129220198
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|138290077
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|15568996K
appreciate your help

If your sample input is accurate, some of your column 13 values contain trailing whitespace. If you want to treat them as the same value, you can trim it.
For example, before using column 13, you could do:
gsub(/^[[:space:]]+|[[:space:]]+$/,"",$13)
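A minimal sketch of how that trim could slot into your existing script (untested; the File prefix and the input file name "input" are just placeholders):
awk -F'|' -v File="input" '
{
    a[++count] = $0                                 # keep the original line untouched
    k = $13
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", k)      # trim a copy of the key field only
    seen[k]++; key[count] = k
}
END {
    for (i = 1; i <= count; i++)
        if (seen[key[i]] == 1) print a[i] >> (File ".ok")
        else                   print a[i] >> (File ".nok")
}' input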

A dual-pass approach will allow you to eliminate all records that have a duplicated field 13, e.g.
awk -F'|' 'FNR==NR{seen[$13]++; next} seen[$13]>1 {next}1' file file
With the trailing space in field 13, as noted by @jhnc, you will need to trim the trailing whitespace in order to match the duplicates, e.g.
awk -F'|' -v OFS='|' '{sub(/[ ]+$/,"",$13)} FNR==NR{seen[$13]++; next} seen[$13]>1 {next} 1' file file
Output
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|162230484
13219722|9|644118078|5|05502217789 |310|8|730|730|RS|201811|RS|13219722-9
13220015|7|70321801|5|05502623275 |310|1|359|359|RB|201811|RB|13220015-7
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RX|201908|RX|48510787
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|138290077
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|15568996K
(note: the field 13 value 129220198 that you show in your desired output is duplicated in the input, which is why it does not appear here)
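If you still want the .ok/.nok split from your original script rather than just filtering, a rough two-pass sketch along the same lines (untested; the output prefix File and the input name "file" are placeholders):
awk -F'|' -v File="file" '
{ k = $13; sub(/[[:space:]]+$/, "", k) }               # work on a trimmed copy of the key field
FNR == NR { seen[k]++; next }                           # first pass: count every key
{ print > (File (seen[k] > 1 ? ".nok" : ".ok")) }       # second pass: route each row by its count
' file file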

Related

Copy columns of a file to specific location of another pipe delimited file

I have a file, suppose xyz.dat, which has data like below:
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files (m.dat and o.dat) from the original xyz.dat.
m.dat contains columns 2|4|6, like below, after running some logic on it:
b11|d11|f11
b22|d22|f22
b33|d33|f33
o.dat contains all the columns except 2|4|6, like below, without any change to them:
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge the M and O files to recreate the original xyz.dat format:
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note that the column positions can change for another file. I will be given the column positions (in the example above they are 2,4,6), so I need either a generic command I can run in a loop to merge the new M and O files, or a single command to which I can pass the column positions and which will copy the columns from m.dat and paste them into o.dat.
I tried paste, sed and cut but was not able to build a working command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting engine (Python, Awk, Perl or even bash). Tools like paste, sed and cut do not have enough flexibility for such tasks (join may come close, but requires extra work).
Consider the following awk based script:
awk -F'|' -vOFS='|' '
{
    # Read the corresponding line from o.dat into s and split it into array a
    getline s < "o.dat"
    n = split(s, a)
    # Print output; add a[n] or $n, ... as needed based on the actual number of fields.
    print a[1], $1, a[2], $2, a[3], $3, a[4]
}
' m.dat
The print line can be customized to generate whatever column order is needed.
Based on clarification from the OP, it looks like the goal is: given two input files and a list of columns whose data should be taken from the 2nd file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# If file is marked executable (chmod +x mergeCols)
mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, whereas the other columns will contain data from a.dat.
Implementation, using awk (create a file named mergeCols):
#! /usr/bin/awk -f
BEGIN {
    FS=OFS="|"
}
NR==1 {
    # Set the column map
    nc=split(COLS, c, ",")
    for (i=1 ; i<=nc ; i++ ) {
        cmap[c[i]] = i
    }
}
{
    # Read one line from the merged file, split into tokens in 'a'
    getline s < M
    n = split(s, a)
    # Merge columns using the pre-set 'cmap'
    k=0
    for (i=1 ; i<=NF+nc ; i++ ) {
        # Pick up a column
        v = cmap[i] ? a[cmap[i]] : $(++k)
        sep = (i<NF+nc) ? "|" : "\n"
        printf "%s%s", v, sep
    }
}
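Applied to the sample above (m.dat holding columns 2, 4, 6 and o.dat holding the rest), a run might look like this (untested sketch):
$ awk -f mergeCols COLS=2,4,6 M=m.dat o.dat
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3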

Sort numeric values in a string of text

I tried some sort examples but can't find a way to solve this. I think I should find the right separator and then sort numerically, but it doesn't work the way I want.
This is my file:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg3_bla_reg_26_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
And this is my desire result:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
$ sort -t_ -k5,5 -k8,8n file
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
That may or may not produce the output you expect if the regN value in the 5th column can include 2-digit numbers.
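One way to guard against that (a sketch, assuming GNU sort and that the number always starts at the fourth character of the fifth field, i.e. right after "reg"):
sort -t_ -k5.4,5n -k8,8n file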
Using awk
$awk -F"_" 'function print_array(arr,max){ for(i=1; i<=max; i++) if(a[i]){print a[i], a[i]="";} } key==$5{a[$8]=$0; key=$5; max=$8>max?$8:max} key!=$5{print_array(a,max); key=$5; a[$8]=$0; max=$8} END{print_array(a,max)}' file
Output:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
Explanation:
awk -F"_" '
function print_array(arr,max) #Simply prints the hashed array from i=1 to max value array is holding
{
for(i=1; i<=max; i++)
if(a[i])
{print a[i], a[i]="";}
}
key==$5{a[$8]=$0; max=$8>max?$8:max} #Key here denotes the 5th field for eg. reg0 in line one. Initially key is null and it will satisfy the condition mentioned below i.e key!=$5. If the 5th field matches with the key set in previous line then push the record into array where the index in array will be the value at field 8 based on which you want to sort your results.
key!=$5{print_array(a,max); key=$5; a[$8]=$0; max=$8} #If key doesn't matches the 5th line it signifies we have a new record set and before proceeding further print the array we stored for previous record set based on 5th field.
END{print_array(a,max) #To print the last record set
}' file
key==$5{a[$8]=$0; max=$8>max?$8:max} : Key here denotes the 5th field for eg. reg0 in line one. Initially key is null and it will satisfy the condition mentioned below i.e key!=$5. If the 5th field $5 matches with the key set in previous line then push the record into array where the index in array will be the value at field 8 based on which you want to sort your results. This will work irrespective of the number of digits in $8.
key!=$5{print_array(a,max); key=$5; a[$8]=$0; max=$8} If key doesn't matches the 5th line it signifies we have a new record set and before proceeding further print the array we stored for previous record set based on 5th field.
END{print_array(a,max) Just to print the last record set
sort -V file
-V, --version-sort
natural sort of (version) numbers within text

Using awk or sed to print column of CSV file enclosed in double quotes

I'm working on a csv file like the one below, comma delimited; each cell is enclosed in double quotes, but some of them contain a double quote and/or comma inside the double quote enclosure. The actual file contains around 300 columns and 200,000 rows.
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"
I'll need to remove some useless columns and merge the last few columns: instead of having "," in between them, I need </br>. I also need to move the second column to the end. Anything within the cells should stay the same, with double quotes and commas as in the original file. Below is an example of the output that I need.
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"
In this example I want to remove column 3 and merge columns 5, 6 and 7.
Below is the code that I tried to use, but it interprets some of the double quotes and/or commas (and the end of the row) differently from what I expected.
awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
sed is used to remove the beginning and ending double quote of a cell.
In the output file that I'm getting right now, if the previous field contains a double quote, it is treated as the beginning of a cell, so the following values are often pushed up a column.
Other code that I have used treats every comma as the beginning of a cell, so that won't work either.
awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
Any help is greatly appreciated. Thanks!
CSV is a loose format. There may be subtle variations in formatting. Your particular format may or may not be expressible with a regular grammar/regular expression. (See this question for a discussion about this.) Even if your particular formatting can be expressed with regular expressions, it may be easier to just whip out a parser from an existing library.
It is not a bash/awk/sed solution as you may have wanted or needed, but Python has a csv module for parsing CSV files. There are a number of options to tweak the formatting. Try something like this:
#!/usr/bin/python
import csv

with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
Note that in Python, indexes start with 0 (e.g. row[1] is the second field). The first index of a slice is inclusive, the last is exclusive (row[1:3] is row[1] and row[2] only). Your formatting seems to require quotes around every field, hence the quoting=csv.QUOTE_ALL. There are more options at Dialects and Formatting Parameters.
The above code produces the following output:
"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"
There are two issues with this:
It doesn't treat the first row any differently, so the headers of columns 5, 6, and 7 are merged like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around the cde. There is another case of this on line two, but it was removed since it is in column 3. The result contains incorrect quotes.
If these quotes are properly escaped, your sample input CSV file becomes
infile.csv (escaped quotes):
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
Now consider this modified Python script that doesn't merge columns on the first row:
#!/usr/bin/python
import csv

with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    first_row = True
    for row in inreader:
        if first_row:
            first_row = False
        else:
            # Merge fields 5,6,7 (indexes 4,5,6) into one
            row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field (index 1) to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
The output outfile.csv is
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"
This is your sample output, but with properly escaped "some other, ""cde"" here".
This may not be precisely what you wanted, not being a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes a few opportunities to make mistakes.
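As a side note on the awk front: once the quotes are properly escaped as in the corrected infile.csv above, GNU awk's FPAT can at least split such fields correctly. A rough sketch (gawk only, not a complete CSV parser), printing just the fourth field of each row:
gawk -v FPAT='([^,]*)|("([^"]|"")*")' '{ print $4 }' infile.csv
which should produce:
"Column4"
"18"" inch TV"
"some other, ""cde"" here"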
This might be an oversimplification of the problem but this has worked for me with your test data:
cat /tmp/inputfile.csv | sed 's#\"\,\"#|#g' | sed 's#"</br>"#</br>#g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'
Please note that I am on a Mac; that is probably why I had to wrap the commas in the awk script in quotation marks.

Edit fields in csv files using bash

I have a bunch of csv files that need "cleaning".
Specifically, there is a column that contains timestamp values; however, some lines have a value of '1' instead.
What I wish to do is replace those 1's with the last valid (timestamp) value, i.e. replace the value on line i with that of line i-1.
Here is a sample of the file:
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:00,3232236038,3232236034
URL192.168.2.2,420042, 1,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:01,3232236038,3232236034
So in this example, the 1 must be replaced with 20/07/2015 09:40:00. I tried it using awk but couldn't nail it.
Assuming no commas in the other fields, an awk program like this should work:
BEGIN { FS = OFS = "," }
$3!=1 { prev = $3 }
$3==1 { $3 = prev }
{ print }
Warning: this is untested code.
The first line sets the field separator to a comma, for both input and output. The second line saves the timestamp of every row that has a timestamp in the third field. The third line writes the most recently saved timestamp to every row that doesn't have a timestamp in the third field. And the fourth line writes every input line, whether modified or not, to the output.
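For reference, a run might look like this (untested sketch; fill.awk and sample.csv are just placeholder names for the program above and your data):
$ awk -f fill.awk sample.csv
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:00,3232236038,3232236034
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:01,3232236038,3232236034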
Let me know how you get on.

Sorting terms on one line in a tab delimited text file

My file contains a number of lines, all of which contain a tab-delimited sequence of terms. I would like to sort these tab-delimited terms alphabetically within each line using the 'sort' command, but I seem to be unable to do it.
Thanks for your help
Markus
You could use awk to sort the fields of each row:
awk 'BEGIN{FS=OFS="\t"} {split($0,a); asort(a); for(i=1;i<=NF;i++)$i=a[i]; print}' a.txt
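Note that asort is specific to GNU awk. Since you mention wanting to use the 'sort' command, a portable sketch using only standard tools could look like this (assuming the file is named a.txt, as above):
while IFS= read -r line; do
    printf '%s\n' "$line" | tr '\t' '\n' | sort | paste -s -d '\t' -
done < a.txt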
# Sort the tab-delimited terms within each line
per_row = []
with open('Myfile.txt', 'r') as infile:
    for line in infile:
        terms = sorted(line.rstrip('\n').split('\t'))
        per_row.append('\t'.join(terms))
print('\n'.join(per_row))
The above is Python code to sort the terms on each line of the file.
