Join in unix when field is numeric in a huge file

So I have two files, File A and File B. File A is huge (>60 GB): it has 16 columns (a mix of numeric and string fields) separated by "|", and over 600,000,000 lines. Field 3 in this file is the ID. It is a numeric field of varying length (e.g., one person's ID can be 1 and someone else's can be 100).
File B just has a list of IDs (~1,000,000), and I want to extract all the rows from File A whose ID appears in File B. I started doing this on Linux with the following code:
sort -k3,3 -t'|' FileA.txt > FileASorted.txt
sort -k1,1 -t'|' FileB.txt > FileBSorted.txt
join -1 3 -2 1 -t'|' FileASorted.txt FileBSorted.txt > merged.txt
The problem I have is that merged.txt is empty (when I know for a fact there are at least 10 matches)... I have googled this and it seems like the issue is that the join field (the ID) is numeric. Some people propose padding the field with zeros but 1) I'm not entirely sure how to do this, and 2) this seems very slow/time inefficient.
Any other ideas out there? Or help on how to add the zero padding only to the relevant field?

I would first sort file b using the unique flag (-u)
sort -u file.b > sortedfile.b
Then loop through sortedfile.b and grep file.a for each ID. In zsh I would do:
foreach C (`cat sortedfile.b`)
    grep $C file.a > /dev/null
    if [ $? -eq 0 ]; then
        echo $C >> res.txt
    fi
end
This redirects the output of grep to /dev/null, tests whether there was a match ($? -eq 0), and if so appends (>>) that ID to res.txt.
A single > would overwrite the file on every iteration. I'm a bit rusty at zsh now, so there might be a typo. If you are using bash, note that it has no foreach; the equivalent loop is written as for C in $(cat sortedfile.b); do ...; done.
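The loop above runs grep over the whole of file.a once per ID. A single-pass alternative can be sketched with awk: load the IDs into an array, then stream the big file once and print the rows whose third field is a known ID. This is only a sketch under assumptions: extract_rows is a made-up helper name, the ID file holds one ID per line, and IDs are compared as exact strings (so 1 and 01 would not match each other).

```shell
# extract_rows (hypothetical helper): $1 = ID file, $2 = '|'-separated data file
extract_rows() {
    awk -F'|' '
        NR == FNR { ids[$1]; next }   # first file: remember every ID
        $3 in ids                     # second file: print rows whose field 3 is a known ID
    ' "$1" "$2"
}

# Tiny demonstration with made-up data
tmp=$(mktemp -d)
printf '%s\n' 'a|b|1|x' 'c|d|100|y' 'e|f|10|z' > "$tmp/FileA.txt"
printf '%s\n' '100' '1' > "$tmp/FileB.txt"
extract_rows "$tmp/FileB.txt" "$tmp/FileA.txt"
rm -r "$tmp"
```

Only the IDs (about a million short strings in the question) are held in memory; the 60 GB file is read once and needs no sorting.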

Related

How to compare multiple extension-less files in Bash

I'm new to bash shell scripting.
How can I compare 8 outputs of extension-less files (with only binary values) - same length of values, 0 or 1.
To clarify things, this is what I've done so far.
for d in */; do
find . -name base -execdir sh -c 'cat {} >> out' \;
done
I've found all the files located in the sub-folders, then read and concatenated all the binary files into an out file.
Now I have 8 out files (one per parent folder) that I need to compare.
I've tried both "diff" and "cmp", but they both work with only 2 files.
In the end, I need to check whether there is a difference between these 8 binary files, and eventually export the results in HEX format. Example: if 2 of the out files are all '1' = F, and if all '0' = 0. Hence the final result should look like, for example, FFFF 0000 (the first 4 files are all '1', the last 4 files are all '0').
What is the best option to do so? - Hope that I've managed to clarify my case.
Thanks a lot for the help.

Let me assume:
We have 8 (presumably binary) files, say: dir1/out.txt, dir2/out.txt, ..
dir8/out.txt.
We want to compare among these files and identify which files are identical
and which are not.
Then how about the steps:
To generate hash values of the files with e.g. sha256sum.
To compare the hash values and divide into groups based on the hash values.
I have created 8 test files. Of those, dir1/out.txt, dir2/out.txt and dir4/out.txt
are identical, dir3/out.txt and dir7/out.txt are identical, and the others
differ.
Then the hash values will look like:
sha256sum dir*/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir1/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir2/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir3/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir4/out.txt
f45151f5253c62de69c95935f083b5649876fdb661412d4f32065a7b018bf68b dir5/out.txt
bdc26931acfb734b142a8d675f205becf27560dc461f501822de13274fe6fc8a dir6/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir7/out.txt
11a77c3d96c06974b53d7f40a577e6813739eb5c811b2a86f59038ea90add772 dir8/out.txt
To summarize the result, let me replace the hash values with a group id, giving
identical files the same number, assigned in order of occurrence.
Here's the script:
sha256sum dir*/out.txt | awk '{if (!gid[$1]) gid[$1] = ++n; print $2 " " gid[$1]}'
The output:
dir1/out.txt 1
dir2/out.txt 1
dir3/out.txt 2
dir4/out.txt 1
dir5/out.txt 3
dir6/out.txt 4
dir7/out.txt 2
dir8/out.txt 5
where the second field shows the group id to indicate which files are identical.
Note that the group id does not represent the content of each file as:
if 2 of the out.txt files are all '1' = F , and if all '0' = 0,
because I have no idea what the files look like. If the OP can provide
example files, I could be of more help.
BTW I'm still in doubt if the files are binary in ordinary sense because
OP is mentioning that "it's simply a file that contains 0 or 1 in its
value when I open it". It sounds to me the files are composed of
ascii "0"s and "1"s. My script above should work for both binary files
and text files anyway.
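As a variation on the group-id listing, the same hashes can be used to print each group of identical files on one line. This is a rough sketch, not part of the original script: group_identical is a made-up name, and it assumes file names without whitespace and a system that provides sha256sum.

```shell
# group_identical (hypothetical helper): print one line per set of identical files
group_identical() {
    sha256sum "$@" | sort | awk '
        $1 != prev { if (NR > 1) printf "\n"; prev = $1; printf "%s", $2; next }  # new hash: start a new line
        { printf " %s", $2 }                                                     # same hash: append the file name
        END { if (NR > 0) printf "\n" }
    '
}

# Usage, with the layout assumed in this answer:
# group_identical dir*/out.txt
```

Files that share a line have the same hash; the lines themselves come out in hash order, not in directory order.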
[Update]
According to the OP's information, here's a solution for the specific case:
#!/bin/bash
for f in dir*/out.txt; do
    if [[ $(uniq "$f" | wc -l) = 1 ]]; then
        echo -n "$(head -1 "$f" | tr 1 F)"
    else
        echo -n "-"
    fi
done
echo
It digests the contents of each file to either of: 0 for all 0's, F for all 1's or - for the mixture case (possible error).
For instance, if dir{1..4}/out.txt are all 0's, dir5/out.txt is a mixture, and dir{6..8}/out.txt are all 1's, then the output will look like:
0000-FFF
I hope it will meet the OP's requirements.

If you are looking for records that are unique in your list of files (uniq only compares adjacent lines, so the input has to be sorted first):
cat $path/$files | sort | uniq -u > /tmp/output.txt
grep -Fx -f /tmp/output.txt $path/$files

Use Bash scripting to select columns and rows with specific name

I'm working with a very large text file (4GB) and I want to make a smaller file with only the data I need in it. It is a tab-delimited file and there are row and column headers. I basically want to select a subset of the data that has a given column and/or row name.
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
I'm planning to have a file with a list of the columns I want.
colname_1 colname_3
I'm a newbie to bash scripting and I really don't know how to do this. I saw other examples, but they all knew in advance which column numbers they wanted, and I don't. Sorry if this is a repeat question; I tried to search.
I would want the result to be
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4

Bash works best as "glue" between standard command-line utilities. You can write loops which read each line in a massive file, but it's painfully slow because bash is not optimized for speed. So let's see how to use a few standard utilities -- grep, tr, cut and paste -- to achieve this goal.
For simplicity, let's put the desired column headings into a file, one per line. (You can always convert a tab-separated line of column headings to this format; we're going to do just that with the data file's column headings. But one thing at a time.)
$ printf '%s\n' colname_{1,3} > columns
$ cat columns
colname_1
colname_3
An important feature of the printf command-line utility is that it repeats its format until it runs out of arguments.
Now, we want to know which column in the data file each of these column headings corresponds to. We could try to write this as a loop in awk or even in bash, but if we convert the header line of the data file into a file with one header per line, we can use grep to tell us, by using the -n option (which prefixes the output with the line number of the match).
Since the column headers are tab-separated, we can turn them into separate lines just by converting tabs to newlines using tr:
$ head -n1 giga.dat | tr '\t' '\n'

colname_1
colname_2
colname_3
colname_4
Note the blank line at the beginning. That's important, because colname_1 actually corresponds to column 2, since the row headers are in column 1.
So let's look up the column names. Here, we will use several grep options:
-F The pattern argument consists of several patterns, one per line, which are interpreted as ordinary strings instead of regexes.
-x The pattern must match the complete line.
-n The output should be prefixed by the line number of the match.
We could also use -f columns to read the patterns from the file named columns (-f is specified by POSIX, so it is not limited to GNU grep). Or if we're using bash, we could use the bashism "$(<columns)" to insert the contents of the file as a single argument to grep. For now, we'll use the portable command substitution "$(cat columns)":
$ head -n1 giga.dat | tr '\t' '\n' | grep -Fxn "$(cat columns)"
2:colname_1
4:colname_3
OK, that's pretty close. We just need to get rid of everything other than the line number; comma-separate the numbers, and put a 1 at the beginning.
$ { echo 1
> grep -Fxn "$(<columns)" < <(head -n1 giga.dat | tr '\t' '\n')
> } | cut -f1 -d: | paste -sd,
1,2,4
cut -f1 Select field 1. The argument could be a comma-separated list, as in cut -f1,2,4.
cut -d: Use : instead of tab as a field separator ("delimiter")
paste -s Concatenate the lines of a single file instead of corresponding lines of several files
paste -d, Use a comma instead of tab as a field separator.
So now we have the argument we need to pass to cut in order to select the desired columns:
$ cut -f"$({ echo 1
> head -n1 giga.dat | tr '\t' '\n' | grep -Fxn -f columns
> } | cut -f1 -d: | paste -sd,)" giga.dat
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4

You can actually do this by keeping track of the array indexes for the columns that match the column names in your file containing the column list. After you have found the array indexes in the data file for the column names in your column list file, you simply read your data file (beginning at the second line) and output the row_label plus the data for the columns at the array index you determined in matching the column list file to the original columns.
There are probably several ways to approach this and the following assumes the data in each column does not contain any whitespace. The use of arrays presumes bash (or other advanced shell supporting arrays) and not POSIX shell.
The script takes two file names as input. The first is your original data file. The second is your column list file. An approach could be:
#!/bin/bash
declare -a cols ## array holding original columns from original data file
declare -a csel ## array holding columns to select (from file 2)
declare -a cpos ## array holding array indexes of matching columns
cols=( $(head -n 1 "$1") ) ## fill cols from 1st line of data file
csel=( $(< "$2") ) ## read select columns from file 2
## fill column position array
for ((i = 0; i < ${#csel[@]}; i++)); do
    for ((j = 0; j < ${#cols[@]}; j++)); do
        [ "${csel[i]}" = "${cols[j]}" ] && cpos+=( $j )
    done
done
printf " "
for ((i = 0; i < ${#csel[@]}; i++)); do ## output header row
    printf " %s" "${csel[i]}"
done
printf "\n" ## output newline
unset cols ## unset cols to reuse in reading lines below
while read -r line; do ## read each data line in data file
    cols=( $line ) ## separate into cols array
    printf "%s" "${cols[0]}" ## output row label
    for ((j = 0; j < ${#cpos[@]}; j++)); do
        [ "$j" -eq "0" ] && { ## handle format for first column
            printf "%5s" "${cols[$((${cpos[j]}+1))]}"
            continue
        } ## output remaining columns
        printf "%13s" "${cols[$((${cpos[j]}+1))]}"
    done
    printf "\n"
done < <( tail -n+2 "$1" )
Using your example data as follows:
Data File
$ cat dat/col+data.txt
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
Column Select File
$ cat dat/col.txt
colname_1 colname_3
Example Use/Output
$ bash colnum.sh dat/col+data.txt dat/col.txt
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Give it a try and let me know if you have any questions. Note, bash isn't known for its blinding speed handling large files, but as long as the column list isn't horrendously long, the script should be reasonably fast.
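Because the bash loop may be slow on a 4GB file, the same index-matching idea can be sketched in awk, which reads the data once. This is only a sketch under assumptions: select_columns is a made-up name, the data file is tab-separated with a header line that starts with an empty field above the row labels, and the column list is whitespace-separated.

```shell
# select_columns (hypothetical helper): $1 = file listing wanted column names, $2 = data file
select_columns() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                                   # first file: collect the wanted names
            m = split($0, parts, " ")
            for (i = 1; i <= m; i++) want[++n] = parts[i]
            next
        }
        FNR == 1 {                                    # header line: map names to column numbers
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (k = 1; k <= n; k++) if (want[k] in pos) keep[++c] = pos[want[k]]
        }
        {                                             # every line: row label plus the kept columns
            line = $1
            for (k = 1; k <= c; k++) line = line OFS $(keep[k])
            print line
        }
    ' "$1" "$2"
}

# Usage with the files from this answer:
# select_columns dat/col.txt dat/col+data.txt
```

Columns come out in the order given in the column list, and names that do not appear in the header are silently skipped.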

Getting specific lines of a file

I have a file with 25 million rows. I want to get 10 million specific lines from this file.
I have the indices of these lines in another file. How can I do it efficiently?

Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while read -r wanted
do
    while ((current < wanted))
    do
        if IFS= read -r -u 3 line
        then ((current++))
        else break 2
        fi
    done
    printf '%s\n' "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file

Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -b a -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
The output.txt will have the lines wanted. I am not sure if this is efficient enough. It seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) under my environment though.
Some explanations:
The 1st command preprocesses the indices file so that the numbers are right-adjusted with leading zeroes and width 8 (since the number of rows in input.txt is known to be 25M).
The 2nd command prints the rows with line numbers in exactly the same format as in the preprocessed index file, then joins them to get the wanted rows (cut removes the line numbers). The -b a option makes nl number empty lines too; by default it skips them, which would shift the numbering. Note that join expects no.1.txt to be sorted, and that it re-joins the fields of each row with single spaces.

Since you said the file with lines you're looking for is sorted, you can loop through the two files in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.
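If the list of line numbers is not sorted, a lookup-table variant avoids sorting it first. A minimal sketch, assuming one line number per line in the index file and enough memory to hold the wanted numbers (extract_lines is a made-up name):

```shell
# extract_lines (hypothetical helper): $1 = file of wanted line numbers, $2 = data file
extract_lines() {
    awk '
        NR == FNR { want[$1 + 0]; next }   # first file: remember each wanted line number
        FNR in want                        # second file: print lines whose number was requested
    ' "$1" "$2"
}

# Usage with the file names from this answer:
# extract_lines line_numbers.txt big_file.txt
```

The selected lines are printed in the order they occur in the data file, whatever the order of the index file.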

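If your index file is index.txt and your data file is data.txt, then you can do it using sed as follows (note that this rereads data.txt from the start for every index, so it is slow for long index lists):
#!/bin/bash
while read -r line_no
do
    sed "${line_no}q;d" data.txt
done < index.txt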

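You could run a loop that reads the 25-million-line file and, when the loop counter reaches a line number that you want, writes that line. For example, for a single wanted line (the class name and the argument order here are only illustrative):
import java.io.*;
public class PrintLine {
    public static void main(String[] args) throws IOException {
        long indice = Long.parseLong(args[1]); // wanted line number, counted from 1
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            long count = 0;
            while ((line = br.readLine()) != null) {
                if (++count == indice) {
                    System.out.println(line); // or write to a file
                }
            }
        }
    }
}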

Bash script that reads from files has garbled output

I am very new to Bash scripting. I am trying to write a script that works with two files. Each line of the files looks like this:
INST <_variablename_> = <_value_>;
The two files share many variables, but they are in a different order, so I can't just diff them. What I want to do is go through the files and find all the variables that have different values, or all the variables that are specified in one file but not the other.
Here is my script so far. Again, I'm very new to Bash so please go easy on me, but also feel free to suggest improvements (I appreciate it).
#!/bin/bash
line_no=1
while read LINE
do
    search_var=`echo $LINE | awk '{print $2}'`
    result_line=`grep -w $search_var file2`
    if [ $? -eq 1 ]
    then
        echo "$line_no: not found [ $search_var ]"
    else
        value=`echo $LINE | awk '{print $4}'`
        result_value=`echo $result_line | awk '{print $4}'`
        if [ "$value" != "$result_value" ]
        then
            echo "$line_no: mismatch [ $search_var , $value , $result_value ]"
        fi
    fi
    line_no=`expr $line_no + 1`
done < file1
Now here's an example of some of the output that I'm getting:
111: mismatch [ TXAREFBIASSEL , TRUE; , "TRUE"; ]
, 4'b1100; ] [ TXTERMTRIM , 4'b1100;
113: not found [ VREFBIASMODE ]
, 2'b00; ]ch [ CYCLE_LIMIT_SEL , 2'b00;
, 3'b100; ]h [ FDET_LCK_CAL , 3'b101;
The first line is what I would expect (I'll deal with the quotes later). On the second, fourth, and fifth line, it looks like the final value is overwriting the "line_no: mismatch" part. And furthermore, on the second and fourth line, the values DO match--it shouldn't print anything at all!
I asked my friend about this, and his suggestion was "Do it in Perl." So I'm learning Perl right now, but I'd still like to know what's going on and why this is happening.
Thank you!
EDIT:
Sigh. I figured out the problem. One of the files had Unix line breaks, and the other had DOS line breaks. I actually thought this might be the case, but I also thought that vi was supposed to display some character if it opened a dos-ended file. Since they looked the same, I assumed that they were the same.
Thanks for your help and suggestions everybody!

Rather than simply replacing the Bash language with Perl, how about a paradigm shift?
diff -w <(sort file1) <(sort file2)
This will sort both files, so that the variables will appear in the same order in each, and will diff the results (ignoring whitespace differences, just for fun).
This may give you more or less what you need, without any "code" per se. Note that you could also sort the files into intermediate files and run diff on those if you find that easier...I happen to like doing it with no temporary files.
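Since the question's edit traced the garbled output to DOS line endings in one of the files, it may help to check for carriage returns and strip them before sorting and diffing. The following is a small sketch, not the asker's script; has_cr and strip_cr are made-up helper names, and temporary files are used so that it also runs in a plain POSIX shell.

```shell
# has_cr (hypothetical helper): succeeds if the file contains a carriage return
has_cr() { grep -q "$(printf '\r')" "$1"; }

# strip_cr (hypothetical helper): print the file with all carriage returns removed
strip_cr() { tr -d '\r' < "$1"; }

# Demonstration with made-up data: same settings, different order and line endings
tmp=$(mktemp -d)
printf 'INST A = 1;\r\nINST B = 2;\r\n' > "$tmp/file1"   # DOS line endings
printf 'INST B = 2;\nINST A = 1;\n' > "$tmp/file2"       # Unix line endings
has_cr "$tmp/file1" && echo "file1 has DOS line endings"
strip_cr "$tmp/file1" | sort > "$tmp/file1.sorted"
strip_cr "$tmp/file2" | sort > "$tmp/file2.sorted"
diff "$tmp/file1.sorted" "$tmp/file2.sorted" && echo "no differences"
rm -r "$tmp"
```

A stray carriage return at the end of a value moves the cursor back to the start of the line when echoed, which is why the tail of each message appeared to overwrite its beginning.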

What about this? A count of 2 means the line is available in both files with the same value. The other lines can be parsed easily.
sort 1.txt 2.txt | uniq -c
2 a = 10
1 b = 20
1 b = 40
1 c = 10
1 c = 30
1 e = 50
Or like this, to extract just the keys and values:
sed 's|INST \(.*\) = \(.*\)|\1 = \2|' 1.txt 2.txt | sort | uniq -c
2 a = 10
1 b = 20
1 b = 40
1 c = 10
1 c = 30
1 e = 50
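To get the mismatches and the one-sided variables reported directly, the comparison can also be sketched in awk. This assumes every line has the form INST name = value; with a value that contains no spaces; compare_settings is a made-up name, and the output wording is only illustrative.

```shell
# compare_settings (hypothetical helper): $1 and $2 are files of "INST name = value;" lines
compare_settings() {
    awk '
        { sub(/\r$/, "") }                   # tolerate DOS line endings
        NF < 4 { next }                      # skip lines that do not have the expected shape
        NR == FNR { val[$2] = $4; next }     # file 1: remember the value of each variable
        {
            seen[$2] = 1
            if (!($2 in val))       print "only in file 2: " $2
            else if (val[$2] != $4) print "mismatch: " $2 " " val[$2] " " $4
        }
        END { for (v in val) if (!(v in seen)) print "only in file 1: " v }
    ' "$1" "$2"
}

# Usage: compare_settings file1 file2
```

Variables that have the same value in both files produce no output; the "only in file 1" lines come out in no particular order.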

Print only values smaller than certain threshold in bash

I have a file with more than 10000 lines like this, mostly numbers and some strings;
-40
-50
stringA
100
20
-200
...
I would like to write a bash (or other) script that reads this file and outputs only the numbers (no strings) that are smaller than zero (or some other predefined number). How can this be done?
In this case the output (sorted) would be
-40
-50
-200
...

cat filename | awk '{if($1==$1+0 && $1<THRESHOLD_VALUE)print $1}' | sort -n
The $1==$1+0 test ensures that the field is a number; the second condition then checks that it is less than THRESHOLD_VALUE (change this to whatever number you wish). A line is printed if it passes both tests, and the output is then sorted.

awk '$1 < NUMBER { print }' FILENAME | sort -n
where NUMBER is the number that you want to use as an upper bound and FILENAME is your file with 10000+ lines of numbers. You can drop the | sort -n if you don't want to sort the numbers.
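edit: One small caveat. A field that does not look like a number is compared with NUMBER as a string rather than being skipped, so some non-numeric lines (for example, ones starting with a character that sorts before the digits) may still be printed. Adding a numeric check such as $1 == $1+0 to the condition avoids this.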
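The threshold can also be passed in from the shell with awk's -v option, combined with an explicit pattern for what counts as a number. A minimal sketch, assuming one value per line and plain decimal notation (no exponents); below_threshold is a made-up name:

```shell
# below_threshold (hypothetical helper): $1 = threshold, $2 = input file
below_threshold() {
    awk -v max="$1" '
        $1 ~ /^[-+]?[0-9]+(\.[0-9]*)?$/ && $1 + 0 < max + 0 { print $1 }   # numeric and below the threshold
    ' "$2" | sort -n
}

# Usage: below_threshold 0 filename
```

Lines such as stringA or 12abc fail the pattern and are dropped before any comparison takes place.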

Another alternative is as follows:
function compare() {
    if test "$1" -lt "$MAX_VALUE"; then
        echo "$1"
    fi
} 2> /dev/null
Have a look at help test and man bash for further help on this. The 2> /dev/null discards the errors that test reports when you try to compare something other than two integers. Call the function like:
compare 1
compare -1
compare string A
With MAX_VALUE set to 0, only the middle call produces output.
