is there a way to sort very large CSV file using sort?
Simply sort by the first column, however, the data might contain line breaks within a column (standard CSV file rules apply). Would the line breaks break the sort utility?
The sort utility will sort the lines in lexicographical (ASCII) order. To get a more sophisticated effect, you might use the UNIX utility awk.
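For example, to sort on just the first comma-separated column, sort's own field options may be enough. A minimal sketch, assuming the data contains no embedded newlines or quoted commas (input.csv and sorted.csv are placeholder names):
sort -t, -k1,1 input.csv > sorted.csv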
I believe you should try something like this: cat old.csv | sort > new.csv
UPD: To prepare the data first, if needed, we can use an AWK script....
You could do it with a mix of utilities. Hopefully I've understood it correctly ... and if so, this might do the job. If not, point out where I've gone wrong in an assumption :-) This requires that the number of fields per CSV record is fixed. It's also a dirt-simple example that doesn't cover various CSV variations; for example, hello,"world,how",are,you would break, since "world,how" would be split into two fields. Given this input file:
hello,world,how,are,you
one,two,three,four,five
once,I,caught,a
fish,alive
hey,now,hey,now,now
And this awk script:
BEGIN {
    FS = ","
    fields = 0
}
{
    if (line == "") {
        # start of a new record: remember its field count and text
        fields = NF
        line = $0
    } else {
        # continuation line: the previous record's last field was split by a
        # newline, so this line's first field still belongs to it (hence NF - 1)
        fields = fields + (NF - 1)
        line = line "|" $0
    }
}
fields == 5 {
    # a complete 5-field record has been assembled: print it and reset
    print line
    fields = 0
    line = ""
}
Executing this:
awk -f join.awk < infile | sort | tr '|' '\n'
gives this output:
hello,world,how,are,you
hey,now,hey,now,now
once,I,caught,a
fish,alive
one,two,three,four,five
In essence, all we're doing with the awk script is merging the multi-line records into a single line that we can feed to sort, and then breaking them apart again with tr. I'm using a pipe as the replacement for the newline character -- just choose something that you can guarantee will not appear in a CSV record.
Now it might not be perfect for what you want, but hopefully it'll nudge you in the right direction. The main thing with the awk script I've knocked up is that it needs to know how many fields there are per CSV record, and that count needs to be fixed. If it's variable, then all bets are off, as there'd need to be more rules in there to define the semantic nature of the file that you want to sort...
A simpler approach is to temporarily modify your data so that the standard UNIX sort command can interpret it properly.
You can use a program called csvquote which replaces the problematic commas and newlines inside quoted field values with nonprinting characters. Then it restores those characters at the end of your pipeline.
For example,
csvquote inputfile.csv | sort | csvquote -u
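If you specifically want to sort on the first column only (as asked), the same idea should work with sort's field options, e.g.:
csvquote inputfile.csv | sort -t, -k1,1 | csvquote -u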
You can find the code here: https://github.com/dbro/csvquote
I need to sort and remove duplicate entries in my large table (space separated), based on the values in the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split each line on either a space or : as the field separator and group the lines by the word after the colon:
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the individual words on the line, and the part !unique[$2]++ creates a hash-table map keyed on the value of $2. The count is incremented every time a value is seen in $2, so on later occurrences the negation condition ! prevents the line from being printed again.
Defining a regex with the -F flag might not be supported on all awk versions. In a POSIX-compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The part above assumes you want to de-duplicate the file based on the word after the :, but to de-duplicate based entirely on the first column, just do
awk '!unique[$1]++' file
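If the input isn't already ordered, one option is to sort first and then de-duplicate. A minimal sketch, assuming space-separated columns (which line survives from each duplicate group then depends on the sort order):
sort -k1,1 file | awk '!seen[$1]++'
Alternatively, sort -u -k1,1 file by itself should also work here, since with -u and an explicit key, sort de-duplicates on that key alone.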
Since your input data is pretty simple, the command is going to be very easy.
sort file.txt | uniq -w7
This just sorts the file and runs uniq comparing only the first 7 characters. The first 7 characters of this data are digits and a colon; if alphabetic characters turn up and case should be ignored, add -i to the uniq command.
I currently use long piped bash commands to extract data from text files like this, where $f is my file:
result=$(grep "entry t $t " $f | cut -d ' ' -f 5,19 | \
sort -nk2 | tail -n 1 | cut -d ' ' -f 1)
I use a script that might do hundreds of similar searches of $f, sorting selected lines in various ways depending on what I'm pulling out. I like one-line bash strings with a bunch of pipes because it's compact and easy, but it can take forever. Can anyone suggest a faster alternative? Maybe something that loads the whole file into memory first?
Thanks
You might get a boost by doing the whole pipeline with gawk, or another awk that has asorti(), like this:
contents="$(cat "$f")"
result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
This will read "$f" into a variable then we'll use a single awk command (well, gawk anyway) to do all the rest of the work. Here's how that works:
-vpattern="entry t $t": defines an awk variable named pattern that contains the shell variable t
$0 ~ pattern matches the current line against the pattern, if it matches we'll do the part in the braces, otherwise we skip it
matches[$5]=$19 adds an entry to an array (and creates the array if needed) where the key is the 5th field and the value is the 19th
END do the following function after all the input has been processed
asorti(matches, inds, "@val_num_asc") sorts the indices of matches (the $5 values) by their associated element values (the $19 values) in ascending numeric order; inds holds those indices in sorted order and asorti returns their count, n
print inds[n] prints the index in matches (i.e., a $5 from before) associated with the largest 19th field, which is what the original sort -nk2 | tail -n 1 | cut pipeline extracted
<<<"$contents" have awk work on the value in the shell variable contents as though it were a file it was reading
Then you can just update the pattern for each search, without having to read the file from disk each time and without needing so many extra processes for all the pipes.
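For instance, here is a hypothetical sketch of reusing the cached contents across several searches (the values of t in the loop are made up purely for illustration):
contents="$(cat "$f")"
for t in 3 7 42; do
    result="$(awk -vpattern="entry t $t" '
        $0 ~ pattern { matches[$5] = $19 }
        END { n = asorti(matches, inds, "@val_num_asc"); print inds[n] }
    ' <<<"$contents")"
    echo "t=$t -> $result"
done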
You'll have to benchmark to see if it's really faster or not though, and if performance is important you really should think about moving to a "proper" language instead of shell scripting.
Since you haven't provided sample input/output, this is just a guess, and I only post it because there are other answers already posted that you should not use; this may be what you want instead of that one-liner:
result=$(awk -v t="$t" '
    BEGIN { regexp = "entry t " t " " }
    $0 ~ regexp {
        # keep track of the largest 19th field and remember the 5th field on that line
        if ( ($19 > maxKey) || (maxKey == "") ) {
            maxKey = $19
            maxVal = $5
        }
    }
    END { print maxVal }
' "$f")
I suspect your real performance issue, however, isn't that script but that you are running it and maybe others inside a loop that you haven't shown us. If so, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice and post a better example so we can help you.
I have been tasked with pulling certain values out of a very ugly CSV file.
The csv is in the following format:
command1=value1, command2=value2, etc etc.
No problem so far: I was grep-ing for the command I required and then piping through cut -f 2 -d '=' to return just the value.
The issue I have is that one of the fields is text and can have multiple values which are also separated by commas. To add another curveball, if (and only if) one of the values has a space in it, the field will be enclosed in double quotes, so the value I'm looking to pull could be any of:
command=value,..
command=value1,value2,..
command="value 1",..
command="value 1, value 2",..
(where .. is other values in the log file OR the end of the line)
I thought I had cracked it by simply pulling the data between two field names using grep -oP '(?<=command1=).*(?= command2)' and then piping that through rev | cut -c 2- | rev.
But I've now found out that the order in which the fields appear isn't consistent, so the file could be:
command1=value1, command3=value3, command2=value2
How can I get the value of command2 when it may or may not be enclosed in double quotes and may also have commas within it? I'm struggling to see how it's even possible: how will grep know what is a value break and what is the next field?
Any help gratefully accepted.
I would combine grep and sed. Suppose you have this input in example.csv:
command1=value1, command2=value2,
command1=value1, command2="value2, value3"
command1=value1, command3=valu3
Then this command:
grep 'command2=' example.csv |
sed -e 's/.*command2=//g' -e 's/^\([^"][^,]*\),.*$/\1/g' -e 's/^"\([^"]*\)".*$/\1/g'
Will give you this:
value2
value2, value3
Explanation:
grep finds the right lines
the first expression in sed (i.e. the first -e) removes everything before the desired value
the second expression deals with the case without quotation marks
the third expression deals with the case with quotation marks
Please note that CSV is an extremely complicated format. This regex makes some assumptions, e.g. that command2 appears only as a key. If this CSV is not good enough, then I would use a real programming language that has a mature CSV library.
In the worst case (say, if , command2= could occur in the quoted value of another key, for example) the only recourse is probably to write a dedicated parser for this pesky format. (Killing the person who came up with it will unfortunately not solve any problems, and may result in new ones. I understand it could be tempting, but don't.)
For a quick and dirty hack, perhaps this is sufficient, though:
grep -oP '(^|, )command2=\K([^,"]+|"[^"]+")'
This will keep the double quotes if the field value is quoted, but that should be easy to fix if it's undesired. Moving to a better tool than grep could bring better precision as well, though; here's a sed variant with additional anchoring (this one also strips the surrounding quotes):
sed -n 's/^\(.*, \)*command2=\(\([^,"]*\)\|"\([^"]*\)"\)\(, .*\)*$/\3\4/p'
idk if it's what you're looking for or not, but given this input file:
$ cat file
command1=value1.1,command2=value2.1,value2.2,command3="value 3.1",command4="value 4.1, value 4.2"
this GNU awk (for the 4th arg to split()) script might be what you want:
$ cat tst.awk
{
    delete c2v
    # split the record on the "name=" separators: s[] gets the separators (the
    # command names, with "," and "=" still attached) and f[] gets the text
    # between them (the values, offset by one)
    split($0,f,/,?[^=,]+=/,s)
    for (i=1; i in s; i++) {
        gsub(/^,|=$/,"",s[i])
        print "populating command to value:", s[i], "->", f[i+1]
        c2v[s[i]] = f[i+1]
    }
    print c2v["command2"]
    print c2v["command4"]
}
$ awk -f tst.awk file
populating command to value: command1 -> value1.1
populating command to value: command2 -> value2.1,value2.2
populating command to value: command3 -> "value 3.1"
populating command to value: command4 -> "value 4.1, value 4.2"
value2.1,value2.2
"value 4.1, value 4.2"
Modify the print statements to suit; it should be obvious...
I have a requirement where I need to fetch the first four characters from each line of a file and sort them.
I tried the below approach, but it's not sorting the characters within each line:
cut -c1-4 simple_file.txt | sort -n
Output using the above:
appl
bana
uoia
Expected output:
alpp
aabn
aiou
sort isn't the right tool for the job in this case, as it is used to sort lines of input, not the characters within each line.
I know you didn't tag the question with perl but here's one way you could do it:
perl -F'' -lane 'print(join "", sort @F[0..3])' file
This uses the -a switch to auto-split each line of input on the delimiter specified by -F (in this case, an empty string, so each character is its own element in the array @F). It then sorts the first 4 characters of the array using the standard string comparison order. The result is joined together on an empty string.
Try defining two helper functions:
explodeword () {
    # print the word one character per line, recursively
    test -z "$1" && return
    echo ${1:0:1}
    explodeword ${1:1}
}
sortword () {
    # sort the characters and glue them back together
    echo $(explodeword $1 | sort) | tr -d ' '
}
Then
cut -c1-4 simple_file.txt | while read -r word; do sortword $word; done
will do what you want.
The sort command is used to sort files line by line; it's not designed to sort the contents of a line. It's not impossible to make sort do what you want, but it would be a bit messy and probably inefficient.
I'd probably do this in Python, but since you might not have Python, here's a short awk command that does what you want (it uses asort(), so it needs gawk).
awk '{split(substr($0,1,4),a,"");n=asort(a);s="";for(i=1;i<=n;i++)s=s a[i];print s}'
Just put the name of the file (or files) that you want to process at the end of the command line.
Here's some data I used to test the command:
this
is a
simple
test file
a
of
apple
banana
cat
uoiea
bye
And here's the output
hist
ais
imps
estt
a
fo
alpp
aabn
act
eiou
bey
Here's an ugly Python one-liner; it would look a bit nicer as a proper script rather than as a Bash command line:
python -c "import sys;print('\n'.join([''.join(sorted(s[:4])) for s in open(sys.argv[1]).read().splitlines()]))"
In contrast to the awk version, this command can only process a single file, and it reads the whole file into RAM to process it, rather than processing it line by line.
I am trying to parse a CSV containing potentially 100k+ lines. Here are the criteria I have:
The index of the identifier
The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, taking performance into special consideration?
As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:
$ csvtool -t ',' col "$index" - < csvfile | grep "$value"
According to the docs, it handles escaping, quoting, etc.
See this youtube video: BASH scripting lesson 10 working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < $1
IFS=$OLDIFS
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offers robust CSV support instead of relying on using comma as a field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language, so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx
# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue
# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1
for_file line $fileName {
set columns [csv::split $line]
set columnValue [lindex $columns $columnNumber]
if {$columnValue == $expectedValue} {
puts $line
}
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and stores each line in the variable $line, then it splits each line into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If there is a match, it prints out the original line.
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch:
awk -F, -v idx="$INDEX" -v value="$VALUE" '$idx == value {print}' inputfile.csv
(The variable is named idx rather than index because index is a built-in awk function and can't be used as a variable name.)
Jeez...with variables, and everything, awk is almost a real programming language...
For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
csvquote inputfile.csv | awk -F, -v idx="$INDEX" -v value="$VALUE" '$idx == value {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. Turns out this simple python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV)
#!/usr/bin/env python
import sys, csv

with sys.stdin as f:
    reader = csv.reader(f)
    for row in reader:
        print('\t'.join(row))
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
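For example, a rough sketch of that line-by-line read in bash (the esxcli command is the same one as above, and the array index just mirrors the cut -f12 example):
esxcli -h "$VI_HOST" --formatter=csv network vswitch standard list | csv2tsv.py |
while IFS=$'\t' read -r -a fields; do
    # fields is a bash array of the tab-separated columns; note that read
    # treats runs of tabs as a single separator, so empty columns would shift the indices
    echo "Uplinks: ${fields[11]}"
done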
Parsing CSV with primitive text-processing tools will fail on many types of CSV input.
xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:
cat file.csv | xsv search -s 3 foo
A sed or awk solution would probably be shorter, but here's one for Perl:
perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"'
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
Awk (gawk) actually provides extensions, one of which is CSV processing.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}' test.csv
The output is:
"James T. Kirk",123
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, the action { print $0 } runs, i.e. the full line is printed as requested.