GNU sort -u outputs binary characters - bash

I am trying to get all unique values from a column in a very large file (5 columns, 2,044,530,100 lines, ~49 GB). My current approach is to cut the relevant column and put it through sort -u (which sorts and outputs only the unique values). While my INPUT is just text, my output contains binary characters, which makes it unusable.
First lines of INPUT look like this:
1 D12 rs01 T T
1 D12 rs02 G G
1 D12 rs03 G G
1 D15 rs01 C C
Putting it through a tr command does not make it better; it just makes the binary characters visible.
cut -d" " -f3 INPUT | sort -u > OUTPUT
cut -d" " -f3 INPUT | tr -cd '\11\12\15\40-\176' | sort -u > OUTPUT
For example, some sample output from the command above:
yO+{(#6:1fr
EvI0^?E0/':>)zj;<f#V&:oY\RM&mhR!6(qV%|`rJTq4IKqV{]Dzb"~8(X82
F:7nc9gZ#nht^M">vo|F+g"x%r>UdF+Rn^MOu=
While the expected output is a column with all the unique values, e.g.:
rs01
rs02
rs03
rs04
rs05
Unfortunately, I can't replicate this behavior with generated (smaller) data. Does anyone have a suggestion on how to deal with this? All help is greatly appreciated. The sort version is sort (GNU coreutils) 8.4.

Instead of manually splitting the file for inspection, I would try grep-ing the input file for unusual characters, just to make sure your input is not damaged, or to locate the place with the garbage.
grep -b -E -v -e '^[[:alnum:][:space:]]+$' <your file>
If the input is OK, try using a temporary file instead of a pipe, and examine it in the same way. If that is OK too, blame sort.
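For example (a minimal sketch; INPUT and OUTPUT come from the question, while the temporary-file path is just an assumption):
cut -d" " -f3 INPUT > /tmp/column3.txt
grep -b -E -v -e '^[[:alnum:][:space:]]+$' /tmp/column3.txt   # any hits mean the garbage appears before sort is involved
sort -u /tmp/column3.txt > OUTPUT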
(P.S. I would rather have posted this as a comment, not as an answer, but I can't.)

Related

Sed creating duplicates

I have used the command sed in shell to remove everything except for numbers from my string.
Now, my string contains three 0s among other numbers and after running
sed 's/[^0-9]*//g'
Instead of three 0s, I now have 0, 01 and 02.
How can I prevent sed from doing that so that I can have the three 0s?
sample of the string:
0 cat
42 dog
24 fish
0 bird
0 tiger
5 fly
Now that we know that digits in filenames in the output from the du utility caused the problem (tip of the hat to Lars Fischer), simply use cut to extract only the first column (which contains the data of interest: each file's/subdirectory's size in blocks):
du -a "$var" | cut -f1
du outputs tab-separated data, and a tab is also cut's default separator, so all that is needed is to ask for the 1st field (-f1).
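For example, on made-up du-style output (tab-separated size and name), the digits in a name such as file10 never reach the output:
$ printf '4\t./dir/file2\n8\t./file10\n' | cut -f1
4
8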
In hindsight, your problem was unrelated to sed; your sample data simply wasn't representative of your actual data. It's always worth creating an MCVE (Minimal, Complete, and Verifiable Example) when asking a question.
try this:
du -a "$var" | sed -e 's/ .*//' -e 's/[^0-9]*//g'

How to use sort for sorting file in BASH?

I have to sort a file. When I use sort, it only takes the first word of every line. For example, I have these words on one line:
able a abundance around accelerated early acting following ad
I execute
sort file.txt
Output is:
able a abundance around accelerated early acting following ad
If I have just one column, sort works. What is the problem?
You can try something like this:
tr " " "\n" < file.txt | sort | tr "\n" " " > newfile.txt
Output to newfile.txt:
a able abundance accelerated acting ad around early following
A few options for you:
To sort the words in a line, you can use sed to replace spaces with new lines then pipe that to sort:
sed 's/ /\n/g' file.txt | sort
To sort on a specific column use awk to print the column then pipe that to sort:
awk '{print $2}' file.txt | sort
I've used this a lot working with data files, and I have yet to find a way to get the whole line after the sort.
You have to put each word on a separate line:
tr -s '[:blank:]' '\n' < file.txt | sort | paste -d" " -s
a able abundance accelerated acting ad around early following
Just in case you want to sort each line in a multi-line file, the following one-liner can do it for you:
python2 -c"print '\n'.join(' '.join(sorted(l.split())) for l in open('FILE'))"
If the above looks useful, you can augment your ~/.bashrc with
csort(){ python2 -c"print '\n'.join(' '.join(sorted(l.split()))for l in open('$1'))";}
and later use it like in
csort FILE
The machinery of the python command is best explained by expanding the one-liner like this:
with open('FILE') as f:                 # f is a file object
    for line in f:                      # iterating over a file object yields its lines
        words = line.split()            # by default, split() splits on whitespace
        print " ".join(sorted(words))   # sorted() returns a sorted list;
                                        # ' '.join() joins the list elements with spaces

bash: different sort output on files with identical first column

Sorry for the vague title, I couldn't think of a better one...
I have 2 tab-delimited files with identical first columns (different numbers of total columns). I would like to sort both files by their first column.
I think I could do this either with the -t\t option or with the -k1,12 option (since the first column is never longer than 12 characters). Both options produce the same (wrong) output.
Even though both files have the same first column, they are sorted differently. Notice that in file1 I get ...23, ...29, ...2, while in file2 I get ...2, ...23, ...29.
$ head file1 | sort -t\t | cut -f1
rs1000000
rs10000010
rs10000012
rs10000013
rs10000017
rs10000023
rs10000029
rs1000002
rs10000030
$ head file2 | sort -t\t | cut -f1
rs1000000
rs10000010
rs10000012
rs10000013
rs10000017
rs1000002
rs10000023
rs10000029
rs10000030
How can I sort both files such that the first column is in the same order in each?
Thank you!
sort -t $'\t' -k 1,1
Use $'\t' to have the shell interpret \t as a tab since sort doesn't parse escape sequences. Use -k to tell it to only sort on the first field rather than the entire line.
You might also want the -V (version sort) flag if you want rs1000002 to sort between rs1000000 and rs10000010.
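For example, to write both files out in a consistent order (the .sorted names are just an illustration):
sort -t $'\t' -k 1,1 file1 > file1.sorted
sort -t $'\t' -k 1,1 file2 > file2.sorted
diff <(cut -f1 file1.sorted) <(cut -f1 file2.sorted)   # no output means the first columns now match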

Remove duplicate lines without sorting [duplicate]

This question already has answers here:
How to delete duplicate lines in a file without sorting it in Unix
(9 answers)
Closed 4 years ago.
I have a utility script in Python:
#!/usr/bin/env python
import sys

unique_lines = []
duplicate_lines = []

for line in sys.stdin:
    if line in unique_lines:
        duplicate_lines.append(line)
    else:
        unique_lines.append(line)
        sys.stdout.write(line)

# optionally do something with duplicate_lines
This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?
Reason for asking: needing this functionality on a system on which I cannot execute Python from anywhere.
The UNIX Bash Scripting blog suggests:
awk '!x[$0]++'
This one-liner tells awk which lines to print. The variable $0 holds the entire contents of a line, and square brackets are array access. So, for each line of the file, the node of the array x is incremented and the line is printed if the content of that node was not (!) previously set.
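A quick way to see it in action (the input lines here are just an illustration):
$ printf 'b\na\nb\nc\na\n' | awk '!x[$0]++'
b
a
c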
A late answer - I just ran into a duplicate of this - but perhaps worth adding...
The principle behind #1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:
cat -n file_name | sort -uk2 | sort -n | cut -f2-
Use cat -n to prepend line numbers
Use sort -u to remove duplicate data (-k2 says 'start at field 2 for the sort key')
Use sort -n to sort by prepended number
Use cut to remove the line numbering (-f2- says 'select field 2 till end')
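Running a toy input through the pipeline shows the first occurrence of each line surviving in its original position (the input is made up):
$ printf 'b\na\nb\nc\n' | cat -n | sort -uk2 | sort -n | cut -f2-
b
a
c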
To remove duplicates from 2 files:
awk '!a[$0]++' file1.csv file2.csv
Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian transform approach involving the addition of an index field using awk followed by multiple rounds of sort and uniq involves less memory overhead. The following snippet works in bash
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq --skip-fields 1 | sort -k1,1 -t$'\t' | cut -f2 -d$'\t'
Now you can check out this small tool written in Rust: uq.
It performs uniqueness filtering without having to sort the input first, so it can be applied to a continuous stream.
There are two advantages of this tool over the top-voted awk solution and other shell-based solutions:
uq remembers the occurrence of lines using their hash values, so it doesn't use as much memory when the lines are long.
uq can keep the memory usage constant by setting a limit on the number of entries to store (when the limit is reached, a flag controls whether to override old entries or to die), while the awk solution could run into OOM when there are too many lines.
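Presumably it is used as an ordinary stdin/stdout filter (the file names and producing command below are made up; check the project's documentation for the exact flags controlling the entry limit):
tail -f server.log | uq > deduped.log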
Thanks 1_CR! I needed a "uniq -u" (remove duplicates entirely) rather than uniq (leave 1 copy of duplicates). The awk and perl solutions can't really be modified to do this, but yours can! I may have also needed the lower memory use since I will be uniq'ing something like 100,000,000 lines 8-). Just in case anyone else needs it, I just put a "-u" in the uniq portion of the command:
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq -u --skip-fields 1 | sort -k1,1 -t$'\t' | cut -f2 -d$'\t'
I just wanted to remove duplicates on consecutive lines, not everywhere in the file. So I used:
awk '{
    if ($0 != PREVLINE) print $0;
    PREVLINE = $0;
}'
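For example (illustration only), this keeps the first of each run of consecutive duplicates but prints a value again when it reappears later:
$ printf 'a\na\nb\na\n' | awk '{ if ($0 != PREVLINE) print $0; PREVLINE = $0 }'
a
b
a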
Note that the plain uniq command already does this for consecutive duplicates, and it even works in an alias: http://man7.org/linux/man-pages/man1/uniq.1.html

Extract Data from CSV in shell script (Sed, AWK, Grep?)

I need to extract some data from a CSV file. The CSV is a 2 column file with multiple records. The first column is the date, the second column is the data that needs to be extracted. The first row of the CSV file is the column headers, so it can be skipped. And I've already created the column header for the extracted data's csv file, so there's no need for that; I'll simply use >> to append the data to it.
Here is 1 record/line (of many) in the CSV file:
"2009-09-20 00:12:37","a:2:{s:15:""info_buyRequest"";a:5:{s:4:""uenc"";s:116:""aHR0cDovL3N0b3JlLmZvcmdldGhhbmdvdmVycy5jb20vcGF0Y2hlcy9pbmRpdmlkdWFsLXBhdGNoZXMvZnJlZS1zYW1wbGUuaHRtbD9fX19TSUQ9VQ,,"";s:7:""product"";s:1:""1"";s:15:""related_product"";s:0:"""";s:7:""options"";a:13:{i:17;s:2:""59"";i:16;s:2:""50"";i:15;s:2:""49"";i:14;s:2:""47"";i:13;s:2:""41"";i:12;s:2:""34"";i:11;s:2:""25"";i:10;s:2:""23"";i:9;s:2:""19"";i:8;s:2:""17"";i:7;s:2:""12"";i:6;s:1:""9"";i:5;s:1:""5"";}s:3:""qty"";i:1;}s:7:""options"";a:13:{i:0;a:7:{s:5:""label"";s:25:""How did you hear about us"";s:5:""value"";s:22:""Friend / Family Member"";s:11:""print_value"";s:22:""Friend / Family Member"";s:9:""option_id"";s:2:""17"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""59"";s:11:""custom_view"";b:0;}i:1;a:7:{s:5:""label"";s:3:""Age"";s:5:""value"";s:5:""21-24"";s:11:""print_value"";s:5:""21-24"";s:9:""option_id"";s:2:""16"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""50"";s:11:""custom_view"";b:0;}i:2;a:7:{s:5:""label"";s:14:""Marital Status"";s:5:""value"";s:9:""UnMarried"";s:11:""print_value"";s:9:""UnMarried"";s:9:""option_id"";s:2:""15"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""49"";s:11:""custom_view"";b:0;}i:3;a:7:{s:5:""label"";s:3:""Sex"";s:5:""value"";s:6:""Female"";s:11:""print_value"";s:6:""Female"";s:9:""option_id"";s:2:""14"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""47"";s:11:""custom_view"";b:0;}i:4;a:7:{s:5:""label"";s:10:""Occupation"";s:5:""value"";s:7:""Student"";s:11:""print_value"";s:7:""Student"";s:9:""option_id"";s:2:""13"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""41"";s:11:""custom_view"";b:0;}i:5;a:7:{s:5:""label"";s:9:""Education"";s:5:""value"";s:16:""College Graduate"";s:11:""print_value"";s:16:""College Graduate"";s:9:""option_id"";s:2:""12"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""34"";s:11:""custom_view"";b:0;}i:6;a:7:{s:5:""label"";s:16:""Household Income"";s:5:""value"";s:7:""30K-50K"";s:11:""print_value"";s:7:""30K-50K"";s:9:""option_id"";s:2:""11"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""25"";s:11:""custom_view"";b:0;}i:7;a:7:{s:5:""label"";s:23:""Do You Take Supplements"";s:5:""value"";s:2:""No"";s:11:""print_value"";s:2:""No"";s:9:""option_id"";s:2:""10"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""23"";s:11:""custom_view"";b:0;}i:8;a:7:{s:5:""label"";s:40:""How would you rank your typical hangover"";s:5:""value"";s:4:""Mild"";s:11:""print_value"";s:4:""Mild"";s:9:""option_id"";s:1:""9"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""19"";s:11:""custom_view"";b:0;}i:9;a:7:{s:5:""label"";s:51:""What type of establishments do you typically prefer"";s:5:""value"";s:10:""Nightclubs"";s:11:""print_value"";s:10:""Nightclubs"";s:9:""option_id"";s:1:""8"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""17"";s:11:""custom_view"";b:0;}i:10;a:7:{s:5:""label"";s:40:""How often do you usually go out per week"";s:5:""value"";s:3:""1-2"";s:11:""print_value"";s:3:""1-2"";s:9:""option_id"";s:1:""7"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""12"";s:11:""custom_view"";b:0;}i:11;a:7:{s:5:""label"";s:49:""How many drinks do you typically consume per 
week"";s:5:""value"";s:3:""6-8"";s:11:""print_value"";s:3:""6-8"";s:9:""option_id"";s:1:""6"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:1:""9"";s:11:""custom_view"";b:0;}i:12;a:7:{s:5:""label"";s:53:""How would you prefer to buy our Products"";s:5:""value"";s:6:""Online"";s:11:""print_value"";s:6:""Online"";s:9:""option_id"";s:1:""5"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:1:""5"";s:11:""custom_view"";b:0;}}}"
The Output should be the data found here:
""print_value";s:?:""{DATA}""
Where the ? is a number, and {DATA} is the data being extracted.
So the output for example of this 1 record would be:
"2009-09-20 00:12:37","Friend / Family Member","21-24","UnMarried","Female","Student","College Graduate","30K-50K","No","Mild","Nightclubs","1-2","6-8","Online"
I am not proficient in Sed, AWK, or Grep, but I know it can be done using one of these tools, if not all three. Any help or nudges in the right direction would be GREATLY appreciated.
I suggest you use PHP to de-serialize the structure.
However, here's a quick and dirty version of what you want using sed and tr. Certainly you can do this much, much better:
cat file.csv | \
tr ",;" "\n" | \
sed -e 's/[asbi]:[0-9]*[:]*//g' -e '/^[{}]/d' -e 's/""//g' -e '/^"{/d' | \
sed -n -e '/^"/p' -e '/^print_value$/,/^option_id$/p' | \
sed -e '/^option_id/d' -e '/^print_value/d' -e 's/^"\(.*\)"$/\1/' | \
tr "\n" "," | \
sed -e 's/,\([0-9]*-[0-9]*-[0-9]*\)/\n\1/g' -e 's/,$//' | \
sed -e 's/^/"/g' -e 's/$/"/g' -e 's/,/","/g'
The explanation:
split by commas and semicolons
remove the PHP structure syntax (s:X:Y, b:X, ...) and drop lines starting with {, }, or "{
extract the section from print_value to the next option_id, and also keep the date (the lines starting with ")
remove those labels (print_value and option_id), and remove the quotation marks around the date
concatenate all lines with commas
separate the records into lines again (each starting with the date pattern), and remove the extra comma at the end
add quotation marks around all fields
Wow, I know it's embarrassing :)
Here is my answer:
cat TestData \
| grep -o -P "print_value\"\";.*?:\"\".*?\"\";" \
| perl -pe 's|print_value.*:\"\"(.*?)\"\";|\1|'
The first line shows the data (stored in TestData).
The second line asks grep to pull out each match from print_value to the nearest '"";'.
Notice that I use '.*?' for a non-greedy match (which requires the '-P' flag).
The last line uses perl to strip everything that is not needed. See that I use '(.*?)' to capture the needed group and '\1' to output it.
Hope this helps.
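To see the two regexes at work on a single fabricated fragment (the value here is made up):
$ echo 's:11:""print_value"";s:7:""Student"";s:9:""option_id"";' | grep -o -P "print_value\"\";.*?:\"\".*?\"\";" | perl -pe 's|print_value.*:\"\"(.*?)\"\";|\1|'
Student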
Here's a sed oneliner:
sed -nr 's/^([^,]+),(.*)$/\2#%#\1/;:a;s/""print_value"";s:[0-9]+:""([^"]+)""(.*)$/\2,"\1"/;ta;s/^.*#%#//p' <source
Basically extract the data and append it to the end of the line using a unique delimiter '#%#'.
When the loop/substitute construct fails (i.e. no more data), it throws away what is left of the original line, leaving the extracted data nicely formatted.
