How to join files in Unix without sorting

I am trying to join 2 CSV files on a key in Unix.
My files are huge, 5 GB each, and sorting them takes too long.
I want to repeat this procedure for 50 such joins.
Can someone tell me how to join quickly, without sorting?

Unfortunately there is no way around the sorting. But please take a look at some utility scripts I have written here: https://github.com/stefan-schroedl/tabulator. You can use them if you keep a header of column names as the first line in each file. The script 'tbljoin' will take care of the sorting and column counting for you. For example, say you have
Employee.csv:
employee_id|employee_name|department_id
4|John|10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6
Dept.csv:
department_id|department_name
1|HR
2|Manufacturing
3|Engineering
4|Marketing
5|Sales
6|Information technology
7|Security
Then the command tbljoin Employee.csv Dept.csv produces
employee_id|employee_name|department_id|department_name
20|Peter|2|Manufacturing
21|David|3|Engineering
1|Monica|4|Marketing
12|Louis|5|Sales
13|Barbara|6|Information technology
tabulator contains many other useful features, e.g., for simple rearranging of columns.
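If you would rather stay with standard tools, the same inner join can be written with plain sort and join; a minimal sketch, assuming bash and GNU coreutils (it strips the header lines and prints only the joined body):
join -t'|' -1 3 -2 1 \
    <(tail -n +2 Employee.csv | sort -t'|' -k3,3) \
    <(tail -n +2 Dept.csv | sort -t'|' -k1,1)
The sorting still happens, of course; process substitution just hides it inside the pipeline.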

Here is an example with two files whose data is pipe-delimited.
Employee.csv, with employee_id, employee_name and department_id:
4|John|10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6
Dept.csv, with department_id and department_name:
1|HR
2|Manufacturing
3|Engineering
4|Marketing
5|Sales
6|Information technology
7|Security
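join requires both inputs to be sorted on the join key, which is why the command below uses Employee_sort.csv rather than Employee.csv. A sketch of the missing sorting step, assuming GNU sort:
sort -t '|' -k3,3 Employee.csv > Employee_sort.csv
Dept.csv is already sorted on its first field, so it can be used as-is.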
command:
join -t '|' -1 3 -2 1 Employee_sort.csv Dept.csv
-t '|' indicates the files are pipe-delimited
-1 3 means the third column of file 1, i.e. department_id from Employee_sort.csv
-2 1 means the first column of file 2, i.e. department_id from Dept.csv
Using the above command, we get the following output:
2|20|Peter|Manufacturing
3|21|David|Engineering
4|1|Monica|Marketing
5|12|Louis|Sales
6|13|Barbara|Information technology
If you want everything from file 2 together with the corresponding entries from file 1, you can also use the -a and -v options.
Try the following commands:
join -t '|' -1 3 -2 1 -v2 Employee_sort.csv Dept.csv
join -t '|' -1 3 -2 1 -a2 Employee_sort.csv Dept.csv
-v2 prints only the lines of file 2 with no match in file 1, while -a2 prints every line of file 2, matched or not.
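With the sample data above, the -v2 variant should print just the departments that have no employees, roughly:
1|HR
7|Security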

I think you could avoid using join (and thus sorting your files), but this is not a quick solution:
In both files, replace all pipes and all double spaces with single spaces:
sed -i 's/|/ /g;s/  / /g' Employee.csv Dept.csv
run these code lines as a bash script :
while read a b c
do
    while read d e
    do
        if [ "$c" -eq "$d" ] ; then
            echo -e "$a\t$b\t$c\t$e"
        fi
    done < Dept.csv
done < Employee.csv
Note that this nested loop re-reads Dept.csv once for every line of Employee.csv, so it takes a very long time on large files.
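For a genuinely sort-free join, a hash join in awk is worth sketching; this assumes the smaller file (Dept.csv, still pipe-delimited) fits comfortably in memory:
# Load Dept.csv into an array keyed by department_id, then stream
# Employee.csv and append the matching department name to each line.
awk -F'|' 'NR==FNR { dept[$1] = $2; next }
           $3 in dept { print $0 FS dept[$3] }' Dept.csv Employee.csv
Only Dept.csv is held in memory; the 5 GB Employee.csv is streamed once and never sorted.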


Pipe tail output into column

I'm trying to tail a log file and format the output into columns. This gives me what I want without tail:
cat /var/log/test.log | column -t -s "|"
How can I pipe the output of tail -f /var/log/test.log into column?
EDIT: Here's an excerpt from the file. I'm manually adding the first line of the file so it can be used as the column headers, but I could format it differently if necessary.
timestamp|type|uri|referer|user_id|link|message
Feb 5 23:58:29 181d5d6339bd drupal_overlake: 1612569509|geocoder|https://overlake.lando/admin/config/development/configuration/config-split/add|https://overlake.lando/admin/config/development/configuration/config-split/add|0||Could not execute query "https://maps.googleapis.com/maps/api/geocode/json?address=L-054%2C%20US&language=&region=US".
Feb 5 23:58:29 181d5d6339bd drupal_overlake: 1612569509|geocoder|https://overlake.lando/admin/config/development/configuration/config-split/add|https://overlake.lando/admin/config/development/configuration/config-split/add|0||Unable to geocode 'L-054, US'.
You can't do it with the -f option to tail. column can't produce any output until it has received all of its input, since it needs to examine everything to calculate the column widths. tail -f never stops writing, so column never knows it's done.
You can use
tail -n 100 test.log | column -t -s "|"
to format the last 100 lines of the log.
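If fixed column widths are acceptable, a streaming substitute for column does work with -f, since printf formats each line as it arrives; a sketch, assuming the first three pipe-delimited fields are the ones you want to watch:
tail -f /var/log/test.log | awk -F'|' '{ printf "%-16s %-10s %-60s\n", $1, $2, $3 }'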

How does the UNIX sort command handle strings of different lengths?

I am trying to sort and join two files that contain IP addresses; the first file has only IPs, the second contains IPs and an associated number. But sort acts differently on these files. Here are the commands and their outcomes:
cat file | grep '180.76.15.15' | sort
cat file | grep '180.76.15.15' | sort -k 1
cat file | grep '180.76.15.15' | sort -t ' ' -k 1
outcome:
180.76.15.150 987272
180.76.15.152 52219
180.76.15.154 52971
180.76.15.156 65472
180.76.15.158 35475
180.76.15.15 99709
cat file | grep '180.76.15.15' | cut -d ' ' -f 1 | sort
outcome:
180.76.15.15
180.76.15.150
180.76.15.152
180.76.15.154
180.76.15.156
180.76.15.158
As you can see, the first three commands all produce the same outcome, but when the lines contain only the IP address, the sort order changes, which breaks my attempt to join the files.
Explicitly, the IP 180.76.15.15 appears on the bottom row in the first case (even when I sort explicitly on the first field), but on the top row in the second case, and I can't understand why.
Can anyone please explain why this is happening?
P.S. I am connecting over ssh from Windows 10 PowerShell to Ubuntu 20.04 running on VMware.
sort uses your locale settings to determine the collation order. From man sort:
*** WARNING *** The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses native byte values.
This way sort compares raw byte values, i.e. ASCII order. For example:
> cat file
#a
b#
152
153
15 4
15 1
Here everything is sorted in alphabetical order, ignoring the special characters and spaces: first the numbers, then the letters.
thanasis@basis:~/Documents/development/temp> sort file
15 1
152
153
15 4
#a
b#
Here every character counts: first #, then the numbers (the space also counts, so '15 1' sorts before '152'), then the letters.
thanasis@basis:~/Documents/development/temp> LC_ALL=C sort file
#a
15 1
15 4
152
153
b#
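Applied back to the original problem, sorting and joining both files under the C locale gives one consistent, byte-wise order; a sketch, with hypothetical file names:
LC_ALL=C sort -k1,1 ips_only.txt > ips_only.sorted
LC_ALL=C sort -k1,1 ips_counts.txt > ips_counts.sorted
LC_ALL=C join ips_only.sorted ips_counts.sorted
Note that join compares keys under the locale as well, so it should run with LC_ALL=C too.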

Explanation of an assignment

I am NOT looking for an answer to this problem; I am having trouble understanding what I should be trying to accomplish in this assignment. I welcome pseudocode or hints if you would like, but what I really need is an explanation of what I need to make and what the output should be or look like. Please do not write out a lot of code, though; I would like to try that on my own.
(()) = notes from me
The assignment is:
There is a program (prog.exe) ((we are given this program)) that reads 2 integers (m, n) and 1 double (a) from an input data file named input.in. For example, the sample input.in file provided contains the values
5 7 1.23456789012345
When you run ./prog.exe, the output is a long column of floating-point numbers.
In addition to the program, there is a file called ain.in that contains a long column of double-precision values.
Copy prog.exe and ain.in to your working directory.
Write a bash script that does the following:
-Runs ./prog.exe for all combinations of
--m=0,1,...,10
--n=0,1,...,5
--a=every value in the file ain.in
-This is essentially a triple nested loop over m, n and the ain.in values.
-For each combination of m, n and ain.in value above:
--generate the appropriate input file input.in
--run the program and redirect the output to some temporary output file
--extract the 37th and 51st values from this temporary output file and append these to a file called average.in
-When the 3 nested loops terminate, the average.in file should contain a long list of floating-point values.
-Your script should return the average of the values contained in average.in.
HINTS: seq, awk and output redirection will be useful here.
Thank you to whoever took the time to even read through this.
This is my second bash coding assignment and I'm still trying to get a grasp on it; a better explanation would be very helpful. Thanks again!
This is one way of generating all input combinations without explicit loops:
join -j9 <(join -j9 <(seq 0 10) <(seq 0 5)) ain.in | cut -d' ' -f2-
Joining on the nonexistent field 9 makes every key empty, so each join produces a full cross product; the cut then strips the empty join field.
The idea is to write a bash script that will test prog.exe with a variety of input conditions. This means recreating input.in and running prog.exe many times. Each time you run prog.exe, input.in should contain a different three numbers, e.g.,
First run:
0 0 <first line of ain.in>
Second run:
0 0 <second line of ain.in>
... last run:
10 5 <last line of ain.in>
You can use seq and for loops to accomplish this.
Then, you need to systematically save the output of each run, e.g.,
./prog.exe > tmp.out
# extract lines 37 and 51 and append them to average.in
sed -n '37p; 51p; 51q' tmp.out >> average.in
Finally, after testing all the combinations, use awk to compute the average of all the lines in average.in.
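Putting the pieces together, a minimal sketch of the whole script, assuming prog.exe reads "m n a" from input.in exactly as described:
#!/bin/bash
> average.in                            # start with an empty results file
for m in $(seq 0 10); do
  for n in $(seq 0 5); do
    while read -r a; do
      echo "$m $n $a" > input.in        # regenerate the input file
      ./prog.exe > tmp.out              # run and capture the output
      sed -n '37p; 51p; 51q' tmp.out >> average.in
    done < ain.in
  done
done
awk '{ sum += $1; n++ } END { print sum / n }' average.in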
One-liner inspired by @karakfa (the sed turns each "m n a" combination into a tiny shell command that writes input.in, runs prog.exe, and extracts lines 37 and 51; awk then averages everything):
join -j9 <(join -j9 <(seq 0 10) <(seq 0 5)) ain.in | cut -d' ' -f2- |
sed "s/.*/echo & >input.in;./prog.exe>tmp.out; sed -n '37p;51p;51q' tmp.out/" |
sh | awk '{sum+=$1; n++} END {print sum/n}'

Append data to the end of a specific line in text file

I admit to being a novice at bash scripting, but I can't quite figure out how to accomplish a key step in a script, and couldn't quite find what I was looking for in other threads.
I am trying to extract some specific numerical values from multiple .xml files and add them to a space- or tab-delimited text file. The files will be generated over time, so I need a way to append each new dataset to the pre-existing text file.
For instance, I would like to extract values for 3 different categories, one per row or column, with the value for each category taken from multiple XML files. Basically, I want to build a continuous graph of the data from each of the 3 categories over time.
I have the following code, which successfully extracts the 3 numbers from one xml file and trims the unnecessary text:
#!/bin/sh
grep "<observation name=\"meanGhost\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"meanGhost\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
grep "<observation name=\"meanBrightGhost\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"meanBrightGhost\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
grep "<observation name=\"std\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"std\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
This gives the output:
1.12
0.33
134.1
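Since the three pipelines differ only in the observation name, a loop keeps the extraction in one place; a sketch using the same paths as above:
for name in meanGhost meanBrightGhost std; do
  sed -n "s:.*<observation name=\"$name\" type=\"float\">\(.*\)</observation>.*:\1:p" \
      /Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml
done >> "$HOME/Desktop/testxml.txt"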
I would like to then read in another xml file to get:
1.12 1.45
0.33 0.54
134.1 144.1
I would be grateful for any help with doing this! Thanks in advance.
Erik
It's much safer to use proper XML-handling tools. For example, in xsh you can write something like:
$f1 := open /Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml ;
$f2 := open /path/to/the/second/file.xml ;
echo ($f1 | $f2)//observation[@name="meanGhost"] ;
echo ($f1 | $f2)//observation[@name="meanBrightGhost"] ;
echo ($f1 | $f2)//observation[@name="std"] ;
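If xsh is not available, xmlstarlet can do the same extraction, and paste can then glue each new file's values on as an extra column; a sketch in which new_col.txt and the temp-file shuffle are my own naming:
# one value per line, in a fixed category order
xmlstarlet sel -t -v '//observation[@name="meanGhost"]' -n summaryQA.xml > new_col.txt
xmlstarlet sel -t -v '//observation[@name="meanBrightGhost"]' -n summaryQA.xml >> new_col.txt
xmlstarlet sel -t -v '//observation[@name="std"]' -n summaryQA.xml >> new_col.txt
# append the new column to the accumulated table
paste "$HOME/Desktop/testxml.txt" new_col.txt > tmp && mv tmp "$HOME/Desktop/testxml.txt"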

GNU sort -u outputs binary characters

I am trying to get all unique values from a column in a very large file (5 columns, 2,044,530,100 lines, ~49 GB). My current approach is to cut the relevant column and put it through sort -u (which sorts and outputs only the unique values). While my INPUT is plain text, my OUTPUT contains binary characters and is unusable.
First lines of INPUT look like this:
1 D12 rs01 T T
1 D12 rs02 G G
1 D12 rs03 G G
1 D15 rs01 C C
Putting it through a tr command does not make it better; it just makes the binary characters visible.
cut -d" " -f3 INPUT | sort -u > OUTPUT
cut -d" " -f3 INPUT | tr -cd '\11\12\15\40-\176' | sort -u > OUTPUT
For example, some sample-output from the command above:
yO+{(#6:1fr
EvI0^?E0/':>)zj;<f#V&:oY\RM&mhR!6(qV%|`rJTq4IKqV{]Dzb"~8(X82
F:7nc9gZ#nht^M">vo|F+g"x%r>UdF+Rn^MOu=
While the expected output is a column with all the unique values of that column, e.g.:
rs01
rs02
rs03
rs04
rs05
Unfortunately, I can't replicate this behavior with smaller generated data. Does anyone have a suggestion for how to deal with this? All help is greatly appreciated. The sort version is sort (GNU coreutils) 8.4.
Instead of manually splitting the file for inspection, I would try grep-ing the input file for unusual characters, just to make sure your input is not damaged, or to locate the garbage:
grep -b -E -v -e '^[[:alnum:][:space:]]+$' <your file>
If the input is OK, try using a temporary file instead of a pipe, and examine it the same way. If that is OK too, blame sort.
(PS: I would rather post this as a comment, not an answer, but I can't.)
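A byte-oriented variant of the same check, plus a byte-wise sort with an explicit temp directory, may help narrow it down; sort writes large temporary files while sorting 49 GB, and a full or flaky temp filesystem is one plausible source of the garbage (the /scratch path below is just a placeholder):
LC_ALL=C grep -n -m 5 '[^ -~]' INPUT                          # first 5 lines containing non-printable bytes
cut -d' ' -f3 INPUT | LC_ALL=C sort -u -T /scratch > OUTPUT   # byte-wise sort, temp files on /scratch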
