Avoiding parsing when loading a file - hadoop

Suppose that I have the following file (input.txt):
1 2 sometext1
2 3 sometext2
3 4 sometext3
4 5 sometext4
i.e. a tab-delimited file where each line consists of two strings, each representing an integer, and a third string containing arbitrary text.
This file is the input for a PigLatin script:
input = load 'input.txt' as (a:int, b:int, c:chararray);
My assumption is that Pig is going to waste time parsing the text file to produce the corresponding integers. Am I correct?
I would like to store in a binary file the binary representation of the three integers.
How can I make Pig understand such a binary file? Should I simply extend LoadFunc, or do I need to use BinStorage?

How much time are you afraid of wasting here? Assuming the rest of your script does anything meaningful, or that your files are large enough that I/O is significant, the parsing effort will be negligible compared with everything else.

Related

Hex - Search by bytes to get Offset & Search by Offset to get Bytes

I'm currently prototyping a small piece of software and am stuck. I'm trying to create a little program that will edit a .bin file, and for this I will need to do the following:
Get Bytes by Searching for Offset
Get Offset by searching for Bytes
Write/Update .bin file
I usually use the program HxD to do this manually, but want to get a small automated process in place.
Using hex.EncodeToString returns what I want as output (like HxD), but I can't find a way to search for values by bytes or offsets.
Could anyone help or have suggestions?
OK, "searching of an offset" is a misnomer because if you have an offset and a medium which supports random access, you just "seek" the known offset there; for files, see os.File.Seek.
Searching is more complex: it consists of converting the user input into something searchable and, well, the searching itself.
Conversion is the process of translating the human operator's input into a slice of bytes; for instance, you'd need to convert the string "00 87" to the slice []byte{0x00, 0x87}.
Such conversion can be done using, say, encoding/hex.Decode after removing any whitespace, which can be done in any number of ways.
Searching the file given a slice of bytes can be either simple or complex.
If a file is small (a couple megabytes, on today's hardware), you can just slurp it into memory (for instance, using io.ReadAll) and do a simple search using bytes.Index.
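A minimal sketch of that simple path, combining the hex conversion above with a whole-file search. The file name and user input here are placeholders:

package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	// Placeholder inputs: the file to search and the operator's hex string.
	const fileName = "data.bin"
	const userInput = "00 87"

	// Convert the human-readable hex into a byte slice, stripping spaces first.
	pattern, err := hex.DecodeString(strings.ReplaceAll(userInput, " ", ""))
	if err != nil {
		panic(err)
	}

	// Slurp the whole file into memory; fine for small files.
	f, err := os.Open(fileName)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	data, err := io.ReadAll(f)
	if err != nil {
		panic(err)
	}

	// bytes.Index returns the offset of the first match, or -1 if absent.
	fmt.Println("first match at offset:", bytes.Index(data, pattern))
}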
If a file is big, the complexity of the task quickly escalates.
For instance, you could read the file from its beginning to its end using chunks of some sensible size and search for your byte slice in each of them.
But you'd need to watch out for two issues: the slice to search should be smaller than each of such chunks, and two adjacent chunks might contain the sequence to be found positioned right across their "sides" — so that the Nth chunk contains the first part of the pattern at its end and the N+1th chunk contains the rest of it at its beginning.
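Here is a sketch of that chunked approach: it keeps an overlap of len(pattern)-1 bytes between reads so a match straddling two chunks is still found. The file name and chunk size are arbitrary, and the pattern is assumed non-empty:

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// searchFile scans the file in chunks and returns the offset of the first
// occurrence of pattern, or -1 if it is not present. pattern must be non-empty.
func searchFile(name string, pattern []byte, chunkSize int) (int64, error) {
	f, err := os.Open(name)
	if err != nil {
		return -1, err
	}
	defer f.Close()

	overlap := len(pattern) - 1
	buf := make([]byte, chunkSize+overlap)
	var base int64 // file offset corresponding to buf[0]
	filled := 0    // number of valid bytes currently in buf

	for {
		n, err := f.Read(buf[filled:])
		filled += n
		if i := bytes.Index(buf[:filled], pattern); i >= 0 {
			return base + int64(i), nil
		}
		if err == io.EOF {
			return -1, nil // reached the end without a match
		}
		if err != nil {
			return -1, err
		}
		// Slide the window: keep only the last `overlap` bytes so a match
		// crossing the chunk boundary is not lost.
		if filled > overlap {
			copy(buf, buf[filled-overlap:filled])
			base += int64(filled - overlap)
			filled = overlap
		}
	}
}

func main() {
	off, err := searchFile("big.bin", []byte{0x00, 0x87}, 1<<20)
	if err != nil {
		panic(err)
	}
	fmt.Println("first match at offset:", off)
}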
There exist more advanced approaches to such searching, for instance so-called "memory-mapped files", but I'd speculate it's a bit too early to tread those lands, given your question.

How does the smaz compression library work?

I'm currently working on a short-text compression project based on my language. As a beginner, I know some basic compression algorithms like LZW, but I still don't understand how smaz works. I have two questions:
How does smaz work?
How to build the codebook and reversed codebook?
Can anyone explain it to me?
Thank you very much.
Trying to answer your questions:
How does smaz work?
According to [1]:
Smaz has a hard-wired constant built-in codebook of 254 common English
words, word fragments, bigrams, and the lowercase letters (except j,
k, q). The inner loop of the Smaz decoder is very simple:
Fetch the next byte X from the compressed file.
Is X == 254? Single byte literal: fetch the next byte L, and pass it straight through to the decoded text.
Is X == 255? Literal string: fetch the next byte L, then pass the following L+1 bytes straight through to the decoded text.
Any other value of X: lookup the X'th "word" in the codebook (that "word" can be from 1 to 5 letters), and copy that word to the decoded
text.
Repeat until there are no more compressed bytes left in the compressed file.
Because the codebook is constant, the Smaz decoder is unable to
"learn" new words and compress them, no matter how often they appear
in the original text.
This page could be helpful to understand the code.
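For what it's worth, that decoding loop is only a few lines of code. Here is a rough Go sketch of it, using a tiny stand-in for the real 254-entry codebook:

package main

import "fmt"

// A toy stand-in: the real smaz codebook has 254 fixed entries.
var codebook = []string{" ", "the", "e", "t", "a"}

func smazDecode(in []byte) string {
	var out []byte
	for i := 0; i < len(in); {
		switch x := in[i]; x {
		case 254: // single-byte literal follows
			out = append(out, in[i+1])
			i += 2
		case 255: // literal string: next byte L, then L+1 raw bytes
			l := int(in[i+1]) + 1
			out = append(out, in[i+2:i+2+l]...)
			i += 2 + l
		default: // any other value is an index into the codebook
			out = append(out, codebook[x]...)
			i++
		}
	}
	return string(out)
}

func main() {
	// Index 1 ("the" in this toy codebook) followed by the literal byte '!'.
	fmt.Println(smazDecode([]byte{1, 254, '!'})) // prints "the!"
}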
How to build the codebook and reversed codebook?
The TODO file in the repository and the author's comments on Reddit point out that the dictionary was generated by an unreleased Ruby script. Also, the author explains:
btw what the Ruby program does is to consider all the possible substrings, and even all the possible separated words, and build a
table of frequencies, then adjust the weight based on the string
length, and finally hand tuning the table to compress specific things
very well. I added by hand the "http://" and ".com" token for example,
removing the final two entries.
An alternative for your project could be the shoco library, which supports generating a custom compression model based on your language.
The smaz source is only 178 lines, and just 99 lines without comments and codebook tables. You should look at it to see how it works.
Smaz is pretty simple codebook-based compression (like the LZW you already know). The library contains a table of the most popular terms in English (lines 5-51 for the compression table and 56-76 for decompression) and replaces these terms with their indexes in the compressed string, and does the reverse to decompress.
For example, the string "the end" would be compressed to roughly 57% of its original size, because terms like "the" become a one-byte index in the compression table; so the 7-byte string becomes a 4-byte string.
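To make that concrete, here is a toy version of the compression direction in Go: greedily match the longest codebook entry at the current position, emit its one-byte index, and fall back to a 254-prefixed literal byte otherwise. The table below is a tiny stand-in, so the byte counts will not match the real library's output:

package main

import "fmt"

// Toy reversed codebook: entry -> one-byte index. The real table is larger
// and hand-tuned, so the numbers below differ from real smaz output.
var table = map[string]byte{"the ": 1, "end": 2, " ": 3}

func toyCompress(s string) []byte {
	var out []byte
	for i := 0; i < len(s); {
		matched := false
		// Try the longest entry first; real smaz entries are 1 to 5 characters.
		for l := 5; l >= 1; l-- {
			if i+l <= len(s) {
				if code, ok := table[s[i:i+l]]; ok {
					out = append(out, code)
					i += l
					matched = true
					break
				}
			}
		}
		if !matched {
			out = append(out, 254, s[i]) // fall back to a single-byte literal
			i++
		}
	}
	return out
}

func main() {
	c := toyCompress("the end")
	fmt.Printf("%d bytes -> %d bytes: %v\n", len("the end"), len(c), c)
}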

Expressing a grep-like algorithm in MapReduce terms for a very long list of keywords

I am having trouble expressing an algorithm in MapReduce terms.
I have two big input text files: Let's call the first file "R" and the
second one "P". R is typically much bigger than P, but both are big.
In a non-mapreduce approach, the contents of P would be loaded into
memory (hashed) and then we would start iterating over all the lines
in R. The lines in R are just strings, and we want to
check if any of the substrings in R match any string in P.
The problem is very similar to grepping words in a big file; the issue
is that the list of words is very large, so you cannot hardcode it
in your map routine.
The problem I am encountering is that I don't know how to ensure that
all the splits of the P file end up in a map job per each split of the R file.
So, assuming these splits:
R = R1, R2, R3;
P = P1, P2
The 6 map jobs have to contain these splits:
(R1, P1) (R1, P2);
(R2, P1) (R2, P2);
(R3, P1) (R3, P2);
How would you express this problem in mapreduce terms?
Thanks.
I have spent some time working on this and I have come up with a couple of
solutions. The first one is based on hadoop streaming and the second one uses
native java.
For the first solution I use an interesting feature of Ruby. If you add
the keyword __END__ at the end of your code, all the text after that will
be exposed by the interpreter via the global variable DATA. This variable
is a File object. Example:
$ cat /tmp/foo.rb
puts DATA.read
__END__
Hello World!
$ ruby /tmp/foo.rb
Hello World!
We will use the file R as input (it will be distributed across the HDFS filesystem).
We iterate over the P file and after traversing a certain number of lines,
we add those at the end of our mapper script. Then, we submit the job to the
hadoop cluster. We keep iterating over the contents of P until we have
consumed all the lines. Multiple jobs will be sent to the cluster based on
the number of lines per job and the size of P.
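For illustration only, the mapper's inner loop is the same whatever language you stream with. Here is a minimal Go sketch, with the current chunk of P embedded as a slice, standing in for what the Ruby __END__/DATA trick achieves:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Illustrative chunk of P baked into this particular mapper.
var patterns = []string{"foo", "bar"}

func main() {
	in := bufio.NewScanner(os.Stdin)
	in.Buffer(make([]byte, 0, 1<<20), 1<<20) // tolerate long lines of R
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	// Hadoop streaming feeds the lines of R on stdin; emit every line
	// that contains one of the embedded P strings as key\tvalue.
	for in.Scan() {
		line := in.Text()
		for _, p := range patterns {
			if strings.Contains(line, p) {
				fmt.Fprintf(out, "%s\t%s\n", p, line)
				break
			}
		}
	}
}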
That's a fine approach that I have implemented, and it works quite well. I
don't find it particularly elegant, though. We can do better by writing a
native MapReduce app in Java.
When using a native Java app, we have full access to the Hadoop HDFS API.
That means we can read the contents of a file from our code, something
I don't think is available when streaming.
We follow an approach similar to the streaming method, but once we have
traversed a certain number of lines, we send those to the Hadoop cluster
instead of appending them to the code. We can do that within the code that
schedules our jobs.
Then, it is a matter of running as many jobs as the number of splits we
have for P. All the mappers in a particular job will load a certain split
and will use it against the splits of R.
Nice problem.
One quick way I can think of is to split the P file into multiple files and run multiple MR jobs with each split of the P file and the complete R file as input.

How to compare all the lines in a sorted file (file size > 1GB) in a very efficient manner

Lets say the input file is:
Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ
I have to create the following output:
Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1
The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.
How can I get the required output in the minimum possible time? My current implementation uses a C++ program that reads a line from the file and then compares it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.
To improve the running time, the next option is to use mmap. But before implementing it, I just wanted to confirm: is there a faster way to do it, using any other language or scripting?
uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'
The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.
As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.
The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
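A rough equivalent of that fgets/strcmp loop in Go, reading from standard input the way uniq does and printing each distinct line followed by its count (a sketch, not a tuned implementation):

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	in := bufio.NewScanner(os.Stdin)
	in.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow long lines

	var prev string
	count := 0
	for in.Scan() {
		line := in.Text()
		if count > 0 && line != prev {
			fmt.Printf("%s %d\n", prev, count) // flush the previous run
			count = 0
		}
		prev = line
		count++
	}
	if count > 0 {
		fmt.Printf("%s %d\n", prev, count) // flush the final run
	}
	if err := in.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}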
In most cases this operation will be completely I/O bound. (Especially using well-designed C++)
Given that, it's likely the only bottleneck you need to care about is the disk.
I think you will find this to be relevant:
mmap() vs. reading blocks
Ben Collins has a very good answer comparing mmap to standard read/write.
Well, there are two time scales you are comparing which aren't really related to each other. The first is algorithmic complexity, which you are expressing in O notation. This has, however, nothing to do with the complexity of reading from a file.
Say, in the ideal case, you have all your data in memory and you have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.), finding duplicates could take O(n^2), O(n), or even O(1) if you have a perfect hash (just for detecting the item).
Reading from a file or mapping it to memory has no relation to the big-O notation at all, so you don't consider it in your complexity calculations. You will just pick the option that takes less measured time, nothing more.

Which one is a suitable data structure?

Two files, each terabytes in size. A file comparison tool compares the i-th line of file1 with the i-th line of file2; if they are the same, it prints the line. Which data structure is suitable?
B-tree
linked list
hash tables
none of them
You need to be able to buffer up at LEAST a line at a time. Here's one way:
While neither file is at EOF:
Read lines A and B from files one and two (each)
If lines are identical, print one of them
Translate that into a suitable programming language, and the problem is solved.
Note that no fancy data structures are involved.
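For instance, a minimal Go translation of that loop (file names are placeholders):

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	f1, err := os.Open("file1.txt") // placeholder names
	if err != nil {
		panic(err)
	}
	defer f1.Close()
	f2, err := os.Open("file2.txt")
	if err != nil {
		panic(err)
	}
	defer f2.Close()

	s1 := bufio.NewScanner(f1)
	s2 := bufio.NewScanner(f2)
	// Stop as soon as either file reaches EOF.
	for s1.Scan() && s2.Scan() {
		if s1.Text() == s2.Text() {
			fmt.Println(s1.Text())
		}
	}
}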
The simple logic is to read one line at a time from each file and match them.
It's like:
While line1 is not at EOF of file1 and line2 is not at EOF of file2:
Compare line1 and line2
By the way, you have to know the maximum number of characters a line can contain so you can size the buffer accordingly.
Otherwise, try a big-data framework such as Spark to make your work easier.
