deleting concrete data from csv - bash

I have a long CSV file with 5 columns, but 3 lines have 6 columns. One begins with "tomasluck", another with "peterblack" and the last one with "susanpeeters". In these 3 lines I need to delete the fourth element (column) so that I get only 5 columns.
Here is a short example; my file is long and is created automatically.
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, tl, 10, 10
anaperez, 14, 11-03-2015, 10, 11
and I need
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11
Exactly, I was thinking of code that selects the lines that begin with tomasluck, peterblack and susanpeeters, and then deletes the 4th field or column.

The tricky thing about this is to keep the formatting intact. The simplest way, I think, is to treat the input as plain text and use sed:
sed '/^tomasluck,/ s/,[^,]*//3' file.csv
This removes, in a line that begins with tomasluck,, the third occurrence of a comma followed by a field (non-comma characters). The filter regex can be amended to include other first fields, such as
sed '/^\(tomasluck\|petergreat\|anaperez\),/ s/,[^,]*//3' file.csv
...but in your input data, those lines don't appear to have a sixth field.
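Applied to the sample input above, the first command should produce exactly the requested output (assuming the data is saved as file.csv):
$ sed '/^tomasluck,/ s/,[^,]*//3' file.csv
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11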
Further ideas that may or may not pertain to your use case:
Removing the fourth field on the basis of the number of fields is a little trickier in sed, largely because sed does not have arithmetic functionality and identifying the lines is a bit tedious:
sed 'h; s/[^,]//g; /.\{5\}/ { x; s/,[^,]*//3; x; }; x' file.csv
That is:
h             # copy the line to the hold buffer
s/[^,]//g     # remove all non-comma characters
/.\{5\}/ {    # if five characters remain (i.e., the line has six or more fields)
  x           # exchange pattern space and hold buffer
  s/,[^,]*//3 # remove field
  x           # swap back again
}
x             # finally, swap in the actual data before printing.
The x dance is typical of sed scripts that use the hold buffer; the goal is to make sure that, regardless of whether the substitution takes place, in the end the line (and not the isolated commas) is printed.
Mind you, if you want the selection condition to be that a line has six or more fields, it is worth considering awk, where the condition is easier to formulate but the replacement of the field is more tedious:
awk -F , 'BEGIN { OFS = FS } NF > 5 { for(i = 5; i <= NF; ++i) { $(i - 1) = $i }; --NF; $1 = $1 } 1' file.csv
That is: Split line at commas (-F ,), then
BEGIN { OFS = FS }             # output field separator is input FS
NF > 5 {                       # if there are more than five fields
  for(i = 5; i <= NF; ++i) {   # shift them back one, starting at the fifth
    $(i - 1) = $i
  }
  --NF                         # let awk know that there is one less field
  $1 = $1                      # for BSD awk: force rebuilding of the line
}
1                              # whether or not a transformation happened, print.
This should work for most awks; I have tested it with gawk and mawk. However, because nothing is ever easy to do portably, I am told that there is at least one awk out there (on old Solaris, I believe) that doesn't understand the --NF trick. It would be possible to hack something together with sprintf for that, but it's enough of a corner case that I don't expect it to bite you.
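If you ever run into such an awk, one possible workaround is to rebuild the record by concatenating every field except the fourth instead of decrementing NF; a rough sketch using only plain POSIX awk features:
awk -F , 'BEGIN { OFS = FS }
NF > 5 {
  # rebuild the record without the fourth field instead of using --NF
  line = $1
  for (i = 2; i <= NF; ++i)
    if (i != 4) line = line OFS $i
  $0 = line
}
1' file.csv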

A more generic solution is to check whether we have 5 or 6 fields:
awk -F', ' '{if(NF==6) print $1", "$2", "$3", "$5", "$6; else print $0}' file.csv

You could do this through sed, using a capturing-group-based regex.
$ sed 's/^\(\(tomasluck\|peterblack\|susanpeeters\),[^,]*,[^,]*\),[^,]*/\1/' file
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11
This captures all the characters up to the third column and matches the fourth column. Replacing the matched characters with the characters inside group 1 will give you the desired output.
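The same substitution can also be written with extended regular expressions, which avoids escaping the group and alternation operators (a sketch; both GNU and BSD sed accept -E, but the basic-regex version above is the one shown against the sample data):
sed -E 's/^((tomasluck|peterblack|susanpeeters),[^,]*,[^,]*),[^,]*/\1/' file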

Related

AWK: subset randomly and without replacement a string in every row of a file

So I need to subset 10 characters from all strings in a particular column of a file, randomly and without repetition (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code which samples characters from the string with repetition works fine, so it looks like the issue lies in the part which populates the N array, but I'm not proficient in awk so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Can anyone help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by emptying the N array after each line (so that it resets before processing the next row):
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note however that if the string in $1 is shorter than 10, this will get stuck in an infinite loop, so make sure that all strings are always longer than the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't carry this issue.
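A guarded variant of the same idea, which caps the number of draws at the string length so that short strings cannot trigger the infinite loop, might look like this (a sketch; delete N clears the whole array in gawk, mawk and recent BWK awk, with split("", N) as the traditional portable spelling):
awk 'BEGIN{srand()} {
  len = length($1)
  target = (len < 10 ? len : 10)   # never draw more positions than the string has
  delete N                         # clear the set of used positions for this line
  for (i = 1; i <= target; ) {
    k = int(rand() * len) + 1
    if (!(k in N)) { N[k]; printf "%s", substr($1, k, 1); i++ }
  }
  print ""
}' file.txt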
I would harness GNU AWK for this task in the following way. Let file.txt content be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use a custom array-traversal-control function which randomly decides which element should be considered greater. -0.5 is used because rand() gives values from 0 to 1. For each line the array arr is populated with the characters of that line, then traversed in random order to build the string s holding the shuffled characters; substr is then used to get the first 10 characters. You might elect to add a counter which terminates the for loop early if your lines are very long in comparison to the number of characters to select (a sketch of this follows below).
(tested in GNU Awk 5.0.1)
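For reference, such a counter might look roughly like this (same GNU-awk-only features as above; srand() added in BEGIN so repeated runs differ):
awk 'function comp_func(i1, v1, i2, v2){ return rand()-0.5 }
BEGIN{ srand(); FPAT="."; PROCINFO["sorted_in"]="comp_func" }
{
  s = ""; n = 0
  patsplit($0, arr)        # one array element per character (FPAT=".")
  for (i in arr) {         # traversal order randomized by comp_func
    s = s arr[i]
    if (++n == 10) break   # stop once 10 characters have been collected
  }
  print s
}' file.txt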
Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
In consecutive runs, remember to check whether srand works the way you expect in your version of awk (by default it is typically seeded from the time in seconds, so runs within the same second repeat). If in doubt, seed it from $RANDOM or, better, /dev/urandom.
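For example, with the script above, a seed could be generated in the shell from /dev/urandom and passed in explicitly, so the result no longer depends on the one-second granularity of the default time-based seed (a sketch; $RANDOM would work the same way in bash, ksh or zsh):
seed=$(od -An -N2 -tu2 /dev/urandom)   # two random bytes, formatted as an unsigned integer
awk -v seed="$seed" -v size=10 'BEGIN{srand(seed)} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt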
If you don't need to stay strictly within awk, then jot makes it super easy.
Say you want 20 random characters between "A" (ASCII 65) and "N" (ASCII 78), including repeats of the same characters:
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL

Compare two files and replace a value in one of the files

I have two files: AU.swo and Compare.
AU.swo contains:
7844204020353125700125759G19
7855207010004191300200759119
7898211030001191500193359119
7898211030001212800212959G19
7898211030002212600212759G19
Compare contains:
7844204G1
785520712
7898211G1
789821112
The first seven characters in each file are an ID number, and position 8 in the Compare file corresponds to position 26 in AU.swo. What I want to do is replace the 9 in the last position of each AU.swo line with the last character of the matching Compare line. It should look like:
7844204020353125700125759G11
7855207010004191300200759112
7898211030001191500193359112
7898211030001212800212959G11
7898211030002212600212759G11
Which is better to use, the awk or the sed command? Could you give me a hint on how I can do this?
Thank you
You may use this awk:
awk 'NR == FNR {                                 # first file (Compare): build a lookup table
  k[substr($0, 1, 7), substr($0, 8, 1)] = substr($0, 9, 1)   # key: ID + char at position 8, value: char at position 9
  next
}
(substr($0, 1, 7), substr($0, 26, 1)) in k {     # second file (AU.swo): ID + char at position 26 has an entry
  sub(/9$/, k[substr($0, 1, 7), substr($0, 26, 1)])          # replace the trailing 9 with the stored character
} 1' compare AU.swo
7844204020353125700125759G11
7855207010004191300200759112
7898211030001191500193359112
7898211030001212800212959G11
7898211030002212600212759G11
(Edit: another, nicer solution in python:)
python -c 'for l in zip(open("AU.swo"), open("Compare")): print(l[0][:-2] + l[1][-2:-1])'
This one also doesn't depend on the exact length of the input strings.
My original solution, although quite ugly:
paste -d '' <(sed 's/\(.*\).$/\1/' AU.swo) <(cut -c 9- Compare)
How this works:
sed 's/\(.*\).$/\1/' AU.swo prints every line from AU.swo without the last character. We surround this with <( ... ) to use this as the first input for paste.
cut -c 9- Compare prints the last character only of every line in Compare. Note that this assumes that each line is exactly 9 characters long.
paste -d '' takes each line of both inputs and prints them together on a single line.
Note that I tested this on Linux; on macOS it might not work.

How to select a specific percentage of lines?

Good morning!
I have a file.csv with 140 lines and 26 columns. I need to sort the lines according to the values in column 23. This is an example:
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller2,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-1088,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller3,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-2159,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
To select the lines according to the values of column 23, I do this:
awk -F "%*," '$23 > 4' myfikle.csv
The result :
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
In my example I use the value of 4% for column 23, the goal being to retrieve all the rows whose percentage in column 23 has increased significantly. The problem is that I can't rely on the 4% value because it is only representative of the current table, so I have to find another way to retrieve the rows that have a high value in column 23.
I want to sort the controllers in descending order according to the percentage in column 23, and then process the first 10% of the sorted lines to make sure I get the controllers with a large percentage.
The goal is to be able to vary the percentage according to the number of lines in the table.
Do you have any tips for that ?
Thanks ! :)
I could have sworn that this question was a duplicate, but so far I couldn't find a similar question.
Whether your file is sorted or not does not really matter. From any file you can extract the first NUMBER lines with head -n NUMBER. There is no built-in way to specify the number as a percentage, but you can compute how many lines PERCENT % of your file amounts to.
percentualHead() {
    percent="$1"
    file="$2"
    linesTotal="$(wc -l < "$file")"
    (( lines = linesTotal * percent / 100 ))
    head -n "$lines" "$file"
}
or shorter but less readable
percentualHead() {
    head -n "$(( "$(wc -l < "$2")" * "$1" / 100 ))" "$2"
}
Calling percentualHead 10 yourFile will print the first 10% of lines from yourFile to stdout.
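For the 140-line file from the question, for instance, this comes out to the first 14 lines (assuming the file is named file.csv, as in the question):
percentualHead 10 file.csv    # 140 * 10 / 100 = 14 lines printed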
Note that percentualHead only works with files because the file has to be read twice. It does not work with FIFOs, <(), or pipes.
If you want to use standard tools, you'll need to read the file twice. But if you're content to use perl, you can simply do:
perl -e 'my @sorted = sort <>; print @sorted[0..$#sorted * .10]' input-file
Here is one for GNU awk to get the top p % from the file, but the lines are output in their order of appearance:
$ awk -F, -v p=0.5 '              # 50 % of top $23 records
NR==FNR {                         # first run
  a[NR]=$23                       # hash percentages to a, NR as key
  next
}
FNR==1 {                          # second run, at beginning
  n=asorti(a,a,"@val_num_desc")   # sort percentages to descending order
  for(i=1;i<=n*p;i++)             # get only the top p %
    b[a[i]]                       # hash their NRs to b
}
(FNR in b)                        # top p % BUT not in order
' file file | cut -d, -f 23       # file processed twice, cut 23rd for demo
45.04%
19.18%

Clean tabs from free-text column in tab-delim file

I've been sent a tab-delimited file where free-text notes include unescaped tabs. There are 19 columns in the data and the free text is column #13. Thus I need to find all rows with >18 tabs. For such rows I need to replace any tabs that are >12 from the beginning of the line and >6 from the end of the line. I'd like to replace them with the string '@@' (2 at symbols), as I'll put the tabs back further downstream.
The file is 2405 rows including the header row, and some rows have blank 'cells', i.e. tabs next to each other. I can't get to the doc source, nor does the supplier know how to fix the source. The text is UTF-8 and contains accented characters and such (i.e. not basic ASCII text).
Any simple method to fix this that I can use on a Mac running OS X 10.8.5 (or if necessary 10.9.x)?
It might help later readers if answers indicate whether the splits (here 12 / 6 of 19) are hard-coded or input as variables.
Try the following awk command:
awk -F '\t' -v expectedColCount=19 -v freeFormColNdx=13 '
NF > expectedColCount {
  # Calculate the number of field-embedded tabs that need replacing.
  extraTabs = NF - expectedColCount
  # Print 1st column.
  printf "%s", $1
  # Print columns up to and including the first tab-separated token
  # from the offending column.
  for (i = 2; i <= freeFormColNdx; ++i) { printf "%s%s", FS, $i }
  # Print tokens in the offending column separated with "@@" instead of tabs.
  for (i = freeFormColNdx + 1; i <= freeFormColNdx + extraTabs; ++i) {
    printf "@@%s", $i
  }
  # Print the remaining columns.
  for (i = freeFormColNdx + extraTabs + 1; i <= NF; ++i) {
    printf "%s%s", FS, $i
  }
  # Print terminating newline.
  printf "\n"
  # Done processing this line; move to next.
  next
}
1 # Line has expected # of tabs or fewer (header line)? Print as is.
' file
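Regarding the question of whether the splits are hard-coded: both the expected column count and the free-text column index are passed in with -v, so the 19 / 13 split can be changed on the command line without touching the program body. If you run this repeatedly, it may be convenient to save the program body to a file and invoke it like this (a sketch; the names fixcol.awk and fixed.tsv are only illustrative):
# hypothetical invocation: fixcol.awk holds the awk program body shown above
awk -F '\t' -v expectedColCount=19 -v freeFormColNdx=13 -f fixcol.awk file > fixed.tsv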

Remove linefeed from csv preserving rows

I have a CSV that was exported; some lines have a linefeed (ASCII 012) in the middle of a record. I need to replace this with a space, but preserve the newline at the end of each record so I can load it.
Most of the lines are fine; however, a good few have this:
Input:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :
Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
Output:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
I have been looking into Awk but cannot really make sense of how to preserve the actual row.
Another Example:
Input:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working
Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
Output:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
One way using GNU awk:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
  FS = "[,~]"
}
NF < 21 {
  line = (line ? line OFS : line) $0
  fields = fields + NF
}
fields >= 21 {
  print line
  line = ""
  fields = 0
}
NF == 21 {
  print
}
Alternatively, you can use this one-liner:
awk -F "[,~]" 'NF < 21 { line = (line ? line OFS : line) $0; fields = fields + NF } fields >= 21 { print line; line=""; fields=0 } NF == 21 { print }' file.txt
Explanation:
I made an observation about your expected output: it seems each line should contain exactly 21 fields. Therefore, if a line contains fewer than 21 fields, store the line and the number of fields. When we move on to the next line, it is joined to the stored line with a space, and the field counts are totalled. If the total is greater than or equal to 21 (the fields of a broken record sum to 22), print the stored line. Otherwise, if the line contains exactly 21 fields (NF == 21), print it as is. HTH.
I think sed is your choice. I assume all complete records end with a quote character; thus, if a line ends with anything other than a quote, it is recognized as an exception and the following line should be concatenated to it.
Here is the code:
cat data | sed -e '/[^"]$/N' -e 's/\n//g'
The first expression, -e '/[^"]$/N', matches the abnormal case and appends the next line to the pattern space without flushing it. Then -e 's/\n//g' removes the embedded newline character.
try this one-liner:
awk '{if(t){print;t=0;next;}x=$0;n=gsub(/"/,"",x);if(n%2){printf $0" ";t=1;}else print $0}' file
Idea:
Count the number of " characters in a line. If the count is odd, join the following line; otherwise the current line is considered a complete line.
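Spread out with comments, the same one-liner reads roughly as follows (functionally the same; whitespace and comments added, and printf given an explicit format string, which is safer if a record ever contains a % character):
awk '
t == 1 {                 # previous line had an odd number of quotes:
  print                  # this line completes the record
  t = 0
  next
}
{
  x = $0
  n = gsub(/"/, "", x)   # count the quotes by deleting them from a copy
  if (n % 2) {           # odd count: the record continues on the next line
    printf "%s ", $0     # print without the newline, joined by a space
    t = 1
  } else {
    print                # even count: a complete record, print as is
  }
}' file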
