I have this input file.
I need to remove the rows with duplicated values in column 13, but I have a problem with the values that contain a "-": why are those not removed?
Input
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RS|201908|RS|129220198
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|162230484
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|192863252
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RX|201907|RB|192863252
13220610|4|615906412|5|05502216092 |411|8|798|798|RB|201811|RB|13220610-4
13219722|9|644118078|5|05502217789 |310|8|730|730|RS|201811|RS|13219722-9
13219789|K|36062376|5|05202316950 |315|4|493|493|RS|201811|RS|13219789-K
13220015|7|70321801|5|05502623275 |310|1|359|359|RB|201811|RB|13220015-7
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RX|201908|RX|48510787
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RS|201908|RS|129220198
13220610|4|615906412|5|05502216092 |411|8|798|798|RB|201811|RB|13220610-4
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|138290077
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|15568996K
13219789|K|36062376|5|05202316950 |310|4|493|493|RS|201811|RS|13219789-K
I need to remove the rows whose column 13 value is repeated, but my code only removes the rows whose column 13 does not contain "-":
{seen[$13]++; a[++count]=$0; key[count]=$13}
END {for (i=1; i<=count; i++)
         if (seen[key[i]] == 1) {print a[i] >> (File ".ok")}
         else                   {print a[i] >> (File ".nok")}}
Desired output
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|162230484
13219722|9|644118078|5|05502217789 |310|8|730|730|RS|201811|RS|13219722-9
13220015|7|70321801|5|05502623275 |310|1|359|359|RB|201811|RB|13220015-7
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RX|201908|RX|48510787
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RS|201908|RS|129220198
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|138290077
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|15568996K
I appreciate your help.
If your sample input is accurate, some of your column 13 values contain trailing whitespace. If you want to treat them as the same value, you can trim it.
For example, before using column 13, you could do:
gsub(/^[[:space:]]+|[[:space:]]+$/,"",$13)
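Folded into your original script, that might look like the sketch below (File is assumed to be supplied with -v; note that a copy of the untouched record is saved first, because modifying $13 makes awk rebuild $0 with OFS, which would replace the "|" separators with spaces):
awk -F'|' -v File="input" '{
    rec = $0                                      # keep the original record intact
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", $13)  # trim before using $13 as the key
    seen[$13]++; a[++count] = rec; key[count] = $13
} END {
    for (i = 1; i <= count; i++)
        if (seen[key[i]] == 1) print a[i] >> (File ".ok")
        else                   print a[i] >> (File ".nok")
}' input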
A dual-pass approach will allow you to eliminate all records that have a duplicated field 13, e.g.
awk -F'|' 'FNR==NR{seen[$13]++; next} seen[$13]>1 {next}1' file file
With the trailing space in field 13 noted by @jhnc, in order to match the duplicates regardless of whitespace, you will need to trim the trailing whitespace, and set OFS so that records rebuilt after modifying $13 keep their "|" separators, e.g.
awk -F'|' -v OFS='|' '{sub(/[ ]+$/,"",$13)} FNR==NR{seen[$13]++; next} seen[$13]>1 {next}1' file file
Output
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|162230484
13219722|9|644118078|5|05502217789 |310|8|730|730|RS|201811|RS|13219722-9
13220015|7|70321801|5|05502623275 |310|1|359|359|RB|201811|RB|13220015-7
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RX|201908|RX|48510787
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|138290077
0|0|NULL|NULL|NULL|NULL|NULL|NULL|NULL|RB|201908|RB|15568996K
(Note: the field 13 value 129220198 shown in your desired output is duplicated in the input, so it is removed here.)
I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't choose a column number. I feel I need to search for "id" and "hwVersion". Any help is GREATLY appreciated.
Totally agree with @KamilCuk. More specifically:
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON
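If the exact text from the question is needed, jq's string interpolation with raw output (-r) can produce it; a sketch, assuming one JSON object per line of input:
jq -r '"\"id\":\"\(.id)\",\"hwVersion\":\"\(.hwVersion)\""' file
which prints:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"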
More to the point, your input should probably be processed record by record, and my guess is that a two-column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
Since each line of the data looks like a mapping object in JSON format, something like this should do, if you don't mind using Python (which comes with JSON support):
import json

def get_id_hw(s):
    d = json.loads(s)
    return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input as string s and parse it as JSON into a dictionary d. Then we return a formatted string with the double-quoted id and hwVersion keys, each followed by a colon and the double-quoted value of the corresponding key from the previously obtained dict.
We can try this with these test input strings and prints:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
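For instance, a minimal sketch that feeds every non-empty line of standard input through get_id_hw:
import sys

for line in sys.stdin:
    line = line.strip()
    if line:                  # skip blank lines
        print(get_id_hw(line))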
If you really wanted to use awk (GNU awk here, for gensub), you could, but it's not the most robust and suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
       h = gensub(/.*"hwVersion":"([0-9]+)".*/, "\\1", "g")
       printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n", i, h) }' /your/file
Since you mention the position is not known, and assuming the fields can be in any order, we use one regex to extract id and another to get hwVersion, then print them in the given format. If the values could be something other than the decimal digits in your example, the [0-9]+ bits would need to reflect that.
And for the fun of it (this preserves the order of the entries from the file), in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".
I have a big data file with many columns. I would like to get the mean value of some of the columns if another column has a specific value.
For example, if $19=9.1 then get the mean of $24, $25, $27, $28, $32 and $35 and write these values to a file like
9.1 (mean$24) (mean$25) ..... (mean$32) (mean$35)
and add two more lines for two other values of column $19, for example 11.9 and 13.9, resulting in:
9.1 (mean$24) (mean$25) ..... (mean$32) (mean$35)
11.9 (mean$24) (mean$25) ..... (mean$32) (mean$35)
13.9 (mean$24) (mean$25) ..... (mean$32) (mean$35)
I have seen the post "awk average part of a column if lines (specific field) match", which takes the mean of only one column if the first has some value, but I do not know how to extend that solution to my problem.
This should work, if you fill in the blanks...
$ awk 'BEGIN {n=split("9.1 11.9 13.9",a)}
{k=$19; c[k]++; m24[k]+=$24; m25[k]+=$25; ...}
END {for(i=1;i<=n;i++) print k=a[i], m24[k]/c[k], m25[k]/c[k], ...}' file
Perhaps handle the c[k]==0 condition as well, with something like this:
function mean(sum,count) {return (count==0?"NaN":sum/count)}
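Filled in for two of the requested columns, the whole thing might look like this (a sketch; repeat the m24/m25 pattern for $27, $28, $32 and $35):
awk 'function mean(sum,count) {return (count==0 ? "NaN" : sum/count)}
     BEGIN {n=split("9.1 11.9 13.9",a)}
     {k=$19; c[k]++; m24[k]+=$24; m25[k]+=$25}
     END {for (i=1; i<=n; i++) {k=a[i]; print k, mean(m24[k],c[k]), mean(m25[k],c[k])}}' file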
Transposing from lines to columns is the objective, taking into consideration the first column, which is the date.
Input file
72918,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009
72918,2356,2357,2358,2359,2360,2361,2362,2363,2364
72918,0,0,0,0,0,0,0,0,0
72918,0,0,0,0,0,0,1,0,0
72918,1496,1502,1752,1752,1752,1752,1751,974,972
73018,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009
73018,2349,2350,2351,2352,2353,2354,2355,2356,2357
73018,0,0,0,0,0,0,0,0,0
73018,0,0,0,0,0,0,0,0,0
73018,1524,1526,1752,1752,1752,1752,1752,256,250
Desired output
72918,111000009,2356,0,0,1496
72918,111000009,2357,0,0,1502
72918,111000009,2358,0,0,1752
72918,111000009,2359,0,0,1752
72918,111000009,2360,0,0,1752
72918,111000009,2361,0,0,1752
72918,111000009,2362,0,1,1751
72918,111000009,2363,0,0,974
72918,111000009,2364,0,0,972
73018,111000009,2349,0,0,1524
73018,111000009,2350,0,0,1526
73018,111000009,2351,0,0,1752
73018,111000009,2352,0,0,1752
73018,111000009,2353,0,0,1752
73018,111000009,2354,0,0,1752
73018,111000009,2355,0,0,1752
73018,111000009,2356,0,0,256
73018,111000009,2357,0,0,250
Please advise, thanks in advance.
This code seems to do exactly what you need:
awk -F, '
function init_block() {ts=$1; delete a; cnt=0; nf0=NF}
function dump_block() {for (f=2; f<=nf0; f+=1) {printf("%s",ts); for (r=1; r<=cnt; r+=1) {printf(",%s",a[r,f])}; print ""}}
BEGIN {ts=-1}
ts<0 {init_block()}
ts!=$1 {dump_block(); init_block()}
{cnt+=1; for (f=1; f<=NF; f++) a[cnt,f]=$f}
END {dump_block()}' <input.txt >output.txt
It collects rows until the timestamp in the first field changes, then prints the transpose of the block, keeping the same timestamp on every output row. The number of fields must be the same within each block for this code to behave correctly.
I'm working on a CSV file like the one below: comma delimited, each cell enclosed in double quotes, but some cells contain a double quote and/or a comma inside the double-quote enclosure. The actual file contains around 300 columns and 200,000 rows.
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"
I need to remove some useless columns and merge the last few columns: instead of having "," between them, I need </br>. I also need to move the second column to the end. Everything within the cells should stay the same as in the original file, with double quotes and commas intact. Below is an example of the output that I need.
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"
In this example I want to remove column 3 and merge columns 5, 6, and 7.
Below is the code that I tried to use, but it interprets double quotes and/or commas, and hence the field and row boundaries, differently than I expected.
awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv > outputfile.csv
sed -i 's#"</br>"#</br>#g' outputfile.csv
sed is then run on the awk output (redirected to outputfile.csv here) to remove the double quotes at the boundaries of the merged cells.
In the output file that I'm getting right now, if the previous field contains a double quote, it is treated as the beginning of a cell, so the following values are often pushed over by a column.
Other code that I have used treats every comma as the beginning of a cell, so that won't work either:
awk -F',' 'BEGIN{OFS=","} {print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv > outputfile.csv
sed -i 's#"</br>"#</br>#g' outputfile.csv
Any help is greatly appreciated. Thanks!
CSV is a loose format. There may be subtle variations in formatting. Your particular format may or may not be expressible with a regular grammar/regular expression. (See this question for a discussion about this.) Even if your particular formatting can be expressed with regular expressions, it may be easier to just whip out a parser from an existing library.
It is not a bash/awk/sed solution as you may have wanted or needed, but Python has a csv module for parsing CSV files. There are a number of options to tweak the formatting. Try something like this:
#!/usr/bin/python
import csv
with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
Note that in Python, indexes start with 0 (e.g. row[1] is the second field). The first index of a slice is inclusive, the last is exclusive (row[1:3] is row[1] and row[2] only). Your formatting seems to require quotes around every field, hence the quoting=csv.QUOTE_ALL. There are more options at Dialects and Formatting Parameters.
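A tiny illustration of that indexing and slicing, for reference:
row = ['a', 'b', 'c', 'd']
print(row[1])    # 'b' -- indexes start at 0
print(row[1:3])  # ['b', 'c'] -- slice start is inclusive, end is exclusive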
Run on your sample input, the CSV script above produces the following output:
"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"
There are two issues with this:
It doesn't treat the first row any differently, so the headers of columns 5, 6, and 7 are merged like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around the cde. There is another case of this on line two, but it was removed since it is in column 3. The result contains incorrect quotes.
If these quotes are properly escaped, your sample input CSV file becomes
infile.csv (escaped quotes):
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
Now consider this modified Python script that doesn't merge columns on the first row:
#!/usr/bin/python
import csv
with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    first_row = True
    for row in inreader:
        if first_row:
            first_row = False
        else:
            # Merge fields 5,6,7 (indexes 4,5,6) into one
            row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field (index 1) to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
The output outfile.csv is
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"
This is your sample output, but with properly escaped "some other, ""cde"" here".
This may not be precisely what you wanted, not being a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes a few opportunities to make mistakes.
This might be an oversimplification of the problem, but this has worked for me with your test data:
cat /tmp/inputfile.csv | sed 's#\"\,\"#|#g' | sed 's#"</br>"#</br>#g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'
Please note that I am on a Mac; that's probably why I had to wrap the commas in the awk script in quotation marks.
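For what it's worth, a quoted comma in print is literal concatenated text, while a bare comma inserts OFS; that behavior is the same in BSD and GNU awk, so quoting the commas works everywhere rather than being Mac-specific. A quick illustration:
echo 'a b' | awk '{print $1, $2}'     # prints "a b" -- the comma emits OFS (default: space)
echo 'a b' | awk '{print $1 "," $2}'  # prints "a,b" -- the quoted comma is literal text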