Clean tabs from free-text column in tab-delim file - macos

I've been sent a tab-delimited file where free-text notes include unescaped tabs. There are 19 columns in the data and the free text is column #13, so I need to find all rows with more than 18 tabs. For such rows I need to replace any tabs that are more than 12 from the beginning of the line and more than 6 from the end of the line. I'd like to replace them with the string '@@' (2 x at symbol), as I'll put the tabs back further downstream.
The file is 2405 rows including the header row, and some rows have blank 'cells', i.e. tabs next to each other. I can't get at the document source, nor does the supplier know how to fix the source. The text is UTF-8 and contains accented characters and such (i.e. not basic ASCII text).
Is there any simple method to fix this that I can use on a Mac running OS X 10.8.5 (or, if necessary, 10.9.x)?
It might help later readers if answers indicate whether the splits (here 12 / 6 of 19) are hard-coded or input as variables.

Try the following awk command:
awk -F '\t' -v expectedColCount=19 -v freeFormColNdx=13 '
NF > expectedColCount {
    # Calculate the number of field-embedded tabs that need replacing.
    extraTabs = NF - expectedColCount
    # Print the 1st column.
    printf "%s", $1
    # Print columns up to and including the first tab-separated token
    # from the offending column.
    for (i = 2; i <= freeFormColNdx; ++i) { printf "%s%s", FS, $i }
    # Print the remaining tokens of the offending column separated with "@@" instead of tabs.
    for (i = freeFormColNdx + 1; i <= freeFormColNdx + extraTabs; ++i) {
        printf "@@%s", $i
    }
    # Print the remaining columns.
    for (i = freeFormColNdx + extraTabs + 1; i <= NF; ++i) {
        printf "%s%s", FS, $i
    }
    # Print the terminating newline.
    printf "\n"
    # Done processing this line; move on to the next.
    next
}
1 # Line has the expected number of tabs or fewer (e.g. the header line)? Print it as is.
' file
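The expected column count and the free-text column index are passed in as awk variables via -v, so a different layout only needs different numbers. To run it on the Mac and keep the original untouched, redirect to a new file; the filenames here are just examples:
# save the program body above as fixtabs.awk (the name is just an example), then:
awk -F '\t' -v expectedColCount=19 -v freeFormColNdx=13 -f fixtabs.awk file > file.fixed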

Related

Sorting a column's values in a large CSV (more than a million records) using awk or bash

I am new to shell scripting.
I have a huge CSV file which contains more than 100k rows. I need to find a column, sort it and write it to another file; later I need to process this new file.
Below is the sample data:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","POU,ABC,MAS,CA.QC.OSC,CA.ON.OSC","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","BVC,AZX,CA.SK.FCAA,CA.NL.DSS","QQQQQQQQQRRCGHDKLKSLS"
Now you can see that field 4 contains commas as well. I need the data with field 4 sorted, as below:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"
To get this I have written the script below, but it does not seem to be efficient: for 100k records it took 20 minutes, so I am trying to find a more efficient solution.
#this command replaces the commas inside "" with | so that I can split the line on ',' (comma)
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv
count=0;
while read line
do
    #break the line on comma ',' and get the array of strings.
    IFS=',' read -ra data <<< "$line" #'data' is the array of fields of the full line.
    #take the reportable jurisdiction field.
    echo "REPORTABLE_JURISDICTION is : " ${data[4]}
    #break the data on pipe '|' and sort it
    IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
    #Sort this array
    IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))
    #printf "[%s]\n" "${sorted[@]}"
    separator="|" # e.g. constructing regex, pray it does not contain %s
    regex="$( printf "${separator}%s" "${sorted[@]}" )"
    regex="${regex:${#separator}}" # remove leading separator
    echo "${regex}"
    data[4]=${regex}
    echo "${data[4]}"
    #here we are building the whole line which will be written to the output file.
    separator="," # e.g. constructing regex, pray it does not contain %s
    regex="$( printf "${separator}%s" "${data[@]}" )"
    regex="${regex:${#separator}}" # remove leading separator
    echo "${regex}" >> temp2.csv
    echo $count
    ((count++))
done < temp.csv
#replace the '|' with commas again
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# remove the trailing comma, if any
sed 's/,$//' temp3.csv > $OUT_FILE
How to make it faster?
You're using the wrong tools for the task. CSV seems so simple that you can easily process it with shell tools, but your code will break for cells that contain newlines. Also, bash isn't very fast when processing lots of data.
Try a tool which understands CSV directly, like http://csvkit.rtfd.org/, or use a programming language like Python. That allows you to do the task without starting external processes, the syntax is much more readable and the result will be much more maintainable. Note: I'm suggesting Python because of the low initial cost.
With python and the csv module, the code above would look like this:
import csv

FEED_FILE = '...'
OUT_FILE = '...'

with open(OUT_FILE, 'w', newline='') as out:
    with open(FEED_FILE, newline='') as fin:   # 'in' is a reserved word in Python, so use another name
        reader = csv.reader(fin, delimiter=',', quotechar='"')
        # QUOTE_ALL keeps every field quoted, matching the sample data
        writer = csv.writer(out, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        for row in reader:
            # sort the comma-separated values inside the 4th field and join them back up
            row[3] = ','.join(sorted(row[3].split(',')))
            writer.writerow(row)
That said, there is nothing obviously wrong with your code. There is not much that you can do to speed up awk and sed and the main bash loop doesn't spawn many external processes as far as I can see.
With a single awk:
awk 'BEGIN{ FS=OFS="\042,\042" }{ n=split($4,a,","); asort(a); sf=a[1];
     for(i=2;i<=n;i++) { sf=sf "," a[i] } $4=sf; print $0 }' file > output.csv
output.csv contents:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS,","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA,","QQQQQQQQQRRCGHDKLKSLS"
FS=OFS="\042,\042" - treat "," (quote, comma, quote) as the field separator
n=split($4,a,",") - split the 4th field into an array on , and remember the element count n
asort(a) - sort the array by value (asort is a GNU awk function)
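Since asort() is a GNU awk extension, a POSIX awk would need to sort the array itself; one possible workaround is a small insertion sort. This is only a sketch along the same lines, not part of the original answer:
awk 'BEGIN{ FS=OFS="\042,\042" }
{
    n = split($4, a, ",")
    # insertion sort, ascending string order (stand-in for gawk asort)
    for (i = 2; i <= n; i++) {
        v = a[i]
        for (j = i - 1; j >= 1 && a[j] > v; j--) a[j+1] = a[j]
        a[j+1] = v
    }
    sf = a[1]
    for (i = 2; i <= n; i++) sf = sf "," a[i]
    $4 = sf
    print
}' file > output.csv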
Try pandas in Python 3. The only limitation: the data needs to fit into memory (and that can be a bit more than the actual data size). I have sorted CSV files with 30,000,000 rows without any problem using this script, which I wrote quickly:
import pandas as pd
import os, datetime, traceback

L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'

for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)
    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))
    s = datetime.datetime.now()
    df.reset_index(inplace=True)
    # This one saves at ~10MB per second to disk. One day is 7.5GB --> 750 seconds or 12.5 minutes
    df.to_csv(fpath_out, index=False)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))

deleting concrete data from csv

I have a long CSV file with 5 columns, but 3 lines have 6 columns: one begins with "tomasluck", another with "peterblack" and the last one with "susanpeeters". In these 3 lines I need to delete the fourth element (column) so that they end up with only 5 columns.
I put a short example; my file is long and is created automatically.
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, tl, 10, 10
anaperez, 14, 11-03-2015, 10, 11
and I need
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11
Exactly; I was thinking of code that selects the lines that begin with tomasluck, peterblack and susanpeeters, and then deletes the 4th field or column.
The tricky thing about this is to keep the formatting intact. The simplest way, I think, is to treat the input as plain text and use sed:
sed '/^tomasluck,/ s/,[^,]*//3' file.csv
This removes, in a line that begins with tomasluck,, the third occurrence of a comma followed by a field (non-comma characters). The filter regex can be amended to include other first fields, such as
sed '/^\(tomasluck\|petergreat\|anaperez\),/ s/,[^,]*//3' file.csv
...but in your input data, those lines don't appear to have a sixth field.
Further ideas that may or may not pertain to your use case:
Removing the fourth field on the basis of the number of fields is a little trickier in sed, largely because sed does not have arithmetic functionality and identifying the lines is a bit tedious:
sed 'h; s/[^,]//g; /.\{5\}/ { x; s/,[^,]*//3; x; }; x' file.csv
That is:
h               # copy the line to the hold buffer
s/[^,]//g       # remove all non-comma characters
/.\{5\}/ {      # if five characters remain (i.e. the line has six or more fields)
    x           # exchange pattern space and hold buffer
    s/,[^,]*//3 # remove the field
    x           # swap back again
}
x               # finally, swap in the actual data before printing.
The x dance is typical of sed scripts that use the hold buffer; the goal is to make sure that, regardless of whether the substitution takes place, in the end the line (and not the isolated commas) is printed.
Mind you, if you want the selection condition to be that a line has six or more fields, it is worth considering awk, where the condition is easier to formulate but the replacement of the field is more tedious:
awk -F , 'BEGIN { OFS = FS } NF > 5 { for(i = 5; i <= NF; ++i) { $(i - 1) = $i }; --NF; $1 = $1 } 1' file.csv
That is: Split line at commas (-F ,), then
BEGIN { OFS = FS }                 # output field separator is input FS
NF > 5 {                           # if there are more than five fields
    for(i = 5; i <= NF; ++i) {     # shift them back one, starting at the fifth
        $(i - 1) = $i
    }
    --NF                           # let awk know that there is one field less
    $1 = $1                        # for BSD awk: force rebuilding of the line
}
1                                  # whether or not a transformation happened, print.
This should work for most awks; I have tested it with gawk and mawk. However, because nothing is ever easy to do portably, I am told that there is at least one awk out there (on old Solaris, I believe) that doesn't understand the --NF trick. It would be possible to hack something together with sprintf for that, but it's enough of a corner case that I don't expect it to bite you.
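For completeness, here is a rough sketch of such a workaround that avoids --NF entirely by rebuilding the record by hand (not part of the original answer; it simply skips the fourth field whenever a line has more than five):
awk -F , 'BEGIN { OFS = FS }
NF > 5 {
    out = $1
    for (i = 2; i <= NF; ++i)
        if (i != 4) out = out OFS $i    # copy every field except the fourth
    $0 = out
}
1' file.csv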
A more generic solution is to check whether we have 5 or 6 fields:
awk -F', ' '{if(NF==6) print $1", "$2", "$3", "$5", "$6; else print $0}' file.csv
You could do this through sed, using a capturing-group-based regex.
$ sed 's/^\(\(tomasluck\|peterblack\|susanpeeters\),[^,]*,[^,]*\),[^,]*/\1/' file
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11
This captures all the characters up to the third column and matches the fourth column. Replacing the matched characters with the characters inside group 1 gives you the desired output.

Print lines indexed by a second file

I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this
while read index
do
    sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands or tens of thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as best I could, but could only find people trying to print line ranges instead of indexing by a second file.
UPDATE
The index is generally not shuffled. It is expected for the lines to appear in the order defined by indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2
If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR {              # while processing the first file
    selected[$1] = 1     # remember if an index was seen
    next                 # and do nothing else
}
selected[FNR]            # after that, select (print) the selected lines.
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR {                  # processing the index:
    ++counter
    idx[$0] = counter        # remember that and at which position you saw
    next                     # the index
}
FNR in idx {                 # when processing the data file:
    lines[idx[FNR]] = $0     # remember selected lines by the position of
}                            # the index
END {                        # and at the end: print them in that order.
    for(i = 1; i <= counter; ++i) {
        print lines[i]
    }
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
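The inlined form the answer alludes to would look roughly like this (same logic, one command line):
awk 'NR == FNR { idx[$0] = ++counter; next }
     FNR in idx { lines[idx[FNR]] = $0 }
     END { for (i = 1; i <= counter; ++i) print lines[i] }' indexfile datafile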
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
    ++counter
    idx[$0] = idx[$0] " " counter    # remember a list here
    next
}
FNR in idx {
    split(idx[FNR], pos)             # split that list
    for(p in pos) {
        lines[pos[p]] = $0           # and remember the line for
    }                                # all positions in them.
}
END {
    for(i = 1; i <= counter; ++i) {
        print lines[i]
    }
}
This, finally, is the functional equivalent of the code in the question. How complicated you have to go for your use case is something you'll have to decide.
This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).
Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
If you want the inverse (show lines that are NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by @glennjackman: if your files are not lexically sorted, then you need to sort them before passing them in:
$ join <(sort index) <(nl strings | sort -b)
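If you only want the strings back, without the line numbers that join prints, one option is to restore numeric order and cut the first column off again. A sketch (note that runs of whitespace inside the strings get squeezed to single spaces by join):
$ join <(sort index) <(nl strings | sort -b) | sort -n | cut -d ' ' -f 2-
a
c
e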
In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
cat << EOF | python
lines = []
with open("$file2") as f:
    for line in f:
        lines.append(int(line))

i = 0
with open("$file1") as f:
    for line in f:
        i += 1
        if i in lines:
            print line,
EOF
The only advantage here is that Python is much easier to understand than awk :).

Remove linefeed from csv preserving rows

I have a CSV that was exported, and some lines have a linefeed (ASCII 012) in the middle of a record. I need to replace it with a space, but preserve the newline at the end of each record so the file still loads row by row.
Most of the lines are fine; however, a good few have this:
Input:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :
Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
Output:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
I have been looking into Awk but cannot really make sense of how to preserve the actual row.
Another Example:
Input:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working
Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
Output:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
One way using GNU awk:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
    FS = "[,~]"
}
NF < 21 {
    line = (line ? line OFS : line) $0
    fields = fields + NF
}
fields >= 21 {
    print line
    line = ""
    fields = 0
}
NF == 21 {
    print
}
Alternatively, you can use this one-liner:
awk -F "[,~]" 'NF < 21 { line = (line ? line OFS : line) $0; fields = fields + NF } fields >= 21 { print line; line=""; fields=0 } NF == 21 { print }' file.txt
Explanation:
I made an observation about your expected output: it seems each line should contain exactly 21 fields. Therefore, if a line contains fewer than 21 fields, store it and keep a running total of its field count. When we move on to the next line, that line is joined to the stored line with a space and the field counts are added up. If the total is greater than or equal to 21 (the field counts of the two halves of a broken line add up to 22), print the stored line. Otherwise, if the line contains exactly 21 fields (NF == 21), print it as is. HTH.
I think sed is your choice. I assume all complete records end with a quote character, so if a line does not end with a quote it is recognized as an exception and the following line should be concatenated to it.
Here is the code:
cat data | sed -e '/[^"]$/N' -e 's/\n//g'
The first expression, -e '/[^"]$/N', matches the abnormal case and reads in the next record without emptying the buffer. Then -e 's/\n//g' removes the newline character.
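If a record can be broken across more than two lines, a looping variant keeps appending until the line ends with a quote again. This is only a sketch built on the same ends-with-a-quote assumption (not part of the original answer), here joining with a space as the question asks:
sed -e ':a' -e '/[^"]$/{N' -e 's/\n/ /' -e 'ba' -e '}' data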
Try this one-liner:
awk '{if(t){print;t=0;next;}x=$0;n=gsub(/"/,"",x);if(n%2){printf $0" ";t=1;}else print $0}' file
Idea:
Count the number of " characters in a line. If the count is odd, join the following line; otherwise the current line is considered a complete line.

Uniq in awk; removing duplicate values in a column using awk

I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
    split($2, valueArray,",");
    j=0;
    for (i in valueArray)
    {
        if (!( valueArray[i] in duplicateArray))
        {
            duplicateArray[j] = valueArray[i];
            j++;
        }
    };
    printf $1 "\t";
    for (j in duplicateArray)
    {
        if (duplicateArray[j]) {
            printf duplicateArray[j] ",";
        }
    }
    printf "\t";
    print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves from having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three so I added an if to keep it from printing a null value which would result in ",WDR78," being printed if the if weren't there.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
    split($2, valueArray,",");
    j=0;
    for (i in valueArray)
    {
        if (!(valueArray[i] in duplicateArray))
        {
            duplicateArray[valueArray[i]] = 1
        }
    };
    printf $1 "\t";
    for (j in duplicateArray)
    {
        if (j)    # prevents printing an extra comma
        {
            printf j ",";
        }
    }
    printf "\t";
    print $3
    delete duplicateArray    # for non-gawk, use split("", duplicateArray)
}'
Perl:
perl -F'\t' -lane'
    $F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
    print join "\t", @F; %_ = ();
' infile
awk:
awk -F'\t' '{
    n = split($2, t, ","); _2 = x
    split(x, _) # use delete _ if supported
    for (i = 0; ++i <= n;)
        _[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
    $2 = _2
}-3' OFS='\t' infile
The loop on lines 4 and 5 of the awk script preserves the original order of the values in the second field while filtering out the duplicate values.
Sorry, I know you asked about awk... but Perl makes this much simpler:
$ perl -n -e ' @t = split(/\t/);
    %t2 = map { $_ => 1 } split(/,/,$t[1]);
    $t[1] = join(",",keys %t2);
    print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part     # parts of a line
declare -a part2    # parts of the 2nd column
declare -A check    # used to remember items in part2

while read line ; do
    part=( $line )                    # split line using whitespace
    IFS=','                           # separator is comma
    part2=( ${part[1]} )              # split 2nd column using comma
    if [ ${#part2[@]} -gt 1 ] ; then  # more than 1 field in 2nd column?
        check=()                      # empty check array
        new2=''                       # empty new 2nd column
        for item in ${part2[@]} ; do
            (( check[$item]++ ))      # remember items in 2nd column
            if [ ${check[$item]} -eq 1 ] ; then  # not yet seen?
                new2=$new2,$item      # add to new 2nd column
            fi
        done
        part[1]=${new2#,}             # remove leading comma
    fi
    IFS=$'\t'                         # separator for the output
    echo "${part[*]}"                 # rebuild line
done < "$infile"
