I'm using the parquet file. I've found that parquet file has multiple data types, such as int64,int32,boolean,binary,float,double,int96 and fixed_len_byte_array.
I know int64,int32,int96,boolean,binary,float and double. But I can't understand ‘fixed_len_byte_array’.
What does "fixed_len_byte_array" mean?
It's a normal byte array, only it's size is prefixed for better or worse, you cant change it.
You can see example of usage in
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
Related
I am a novice Go lang programmer,trying to learn Go lang features.I wanted to split a large csv file into multiple files in GO lang, each file containing the header.How do i do this? I have searched everywhere but couldnt get the right solution.Any help in this regard will be greatly appreciated.
Also please suggest me a good book for reference.
Thanking You
Depending on your shell fu this problem might be better suited for common shell utilities but you specifically mentioned go.
Let's think through the problem.
How big is this csv file? Are we talking 100 lines or is it 5G ?
If it's smallish I typically use this:
http://golang.org/pkg/io/ioutil/#ReadFile
However, this package also exists:
http://golang.org/pkg/encoding/csv/
Regardless - let's return to the abstraction of the problem. You have a header (which is the first line) and then the rest of the document.
So what we probably want to do (if ignoring csv for the moment) is to read in our file.
Then we want to split the file body by all the newlines in it.
You can use this to do so:
http://golang.org/pkg/strings/#Split
You didn't mention but do you know how many files you want to split by or would you rather split by the line count or byte count? What's the actual limitation here?
Generally it's not going to be file count but if we pretend it is we simply want to divide our line count by our expected file count to give lines/file.
Now we can take slices of the appropriate size and write the file back out via:
http://golang.org/pkg/io/ioutil/#WriteFile
A trick I use sometime to help think me threw these things is to write down our mission statement.
"I want to split a large csv file into multiple files in go"
Then I start breaking that up into pieces but take the divide/conquer approach - don't try to solve the entire problem in one go - just break it up to where you can think about it.
Also - make gratiutious use of pseudo-code until you can comfortably write the real code itself. Sometimes it helps to just write a short comment inline with how you think the code should flow and then get it down to the smallest portion that you can code and work from there.
By the way - many of the golang.org packages have example links where you can literally run in your browser the example code and cut/paste that to your own local environment.
Also, I know I'll catch some haters with this - but as for books - imo - you are going to learn a lot faster just by trying to get things working rather than reading. Action trumps passivity always. Don't be afraid to fail.
Here is a package that might help. You can set a necessary chunk size in bytes and a file will be split on an appropriate amount of chunks.
I am having a performance issue with XDocument.Load("large_file.xml"), where it takes about 25 seconds to load the file.
I read in this question that using a binary format could offer up to a 10x performance increase.
What does a binary format look like? How do you go about converting an XML file to it?
Lets start with the implied question:
Q: What is a Binary format?
A: It is a format in which data is represented in a non-textual form. For example, a Java int might be represented as 4 bytes, rather than a sequence of decimal digits and a sign.
Q: What does it look like?
A: If you view it with a text editor / viewer, it looks like garbage.
Q: How do you go about converting an XML file to a binary form?
A: By hand. Since a binary format is essentially a format (any format) that is not text, there is no magical method of converting it.
Q: How and why is a binary format faster?
A: A binary format isn't automatically faster to load than XML (or JSON). The idea is that you (the programmer) design a specific binary format for your application that will be faster to load. You typically do this by such things as:
avoiding the inclusion of verbose / repetitive structuring information (e.g. XML tag and attribute name),
using data encodings that require less CPU effort to turn into the in-memory representations,
avoiding the inclusion of unnecessary metadata,
avoiding things that require extra in-memory data copying,
and so on.
There is lots of information in an XML format. So it's big and slow. You can create your own format.
For example:
<Data>Value</Data> can be changed to just value at a concrete address in a binary file.
In ruby how to check a string is an actural string or a blob data such as image, from the data type of view they are ruby string, but really their contents are very different since one is literal string, the other is blob data such as image.
Could anyone provide some clue for me? Thank you in advance.
Bytes are bytes. There is no way to declare that something isn't file data. It'd be fairly easy to construct a valid file in many formats consisting only of printable ASCII. Especially when dealing with Unicode, you're in very murky territory. If possible, I'd suggest modifying the method so that it takes two parameters... use one for passing text and the other for binary data.
One thing you might do is look at the length of the string. Most image formats are at least 500-600 bytes even for a tiny image, and while this is by no means an accurate test, if you get passed, say, a 20k string, it's probably an image. If it were text, it would be quite a bit (Like a quarter of a typical novel, or thereabouts)
Files like images or sound files have defined blocks that can be "sniffed". Wotsit.org has a lot of info about the key bytes and ways to determine what the files are. By looking at those byte offsets in your data you could figure it out.
Another way way is to use some "magic", which is code to sniff key-bytes or byte-types in a file to try to figure out what its type is. *nix systems have it built in via the file command. Do a man file or man magic for more info or check Wikipedia's article on Magic numbers in files.
Ruby Filemagic uses the same technique but is based on GNU's libmagic.
What would constitute a string? Are you expecting simple ASCII? UTF-8? Or text encoded some other way?
If you know you're going to get ASCII text or a blob then you can just spin through the first n bytes and see if anything has the eight bit set, that would tell you that you have binary. OTOH, not finding anything wouldn't guarantee that you had text.
If you're going to get UTF-8 Unicode then you'd do the same thing but look for invalid UTF-8 sequences. Of course, the same caveats apply.
You could scan the first n bytes for anything between 0x00 and 0x20. If you find any bytes that low then you probably have a binary blob of some sort. But maybe not.
As Tyler Eaves said: bytes are bytes. You're starting with a bunch of bytes and trying to find an interpretation of them that makes sense.
Your best bet is to make the caller supply the expected interpretation or take Greg's advice and use a magic number library.
I have a really big data structure with a lot of fields in it.
I want to save its fields to later compare with later values; but I can't print the content to console, because it's too much code to write by hand. (I have roughly 1k fields.)
How should I solve my problem?
Use serialization: some tutorial here or here.
write the stricture to a file and use some good diff tools.
Is there a good way to see what format an image is, without having to read the entire file into memory?
Obviously this would vary from format to format (I'm particularly interested in TIFF files) but what sort of procedure would be useful to determine what kind of image format a file is without having to read through the entire file?
BONUS: What if the image is a Base64-encoded string? Any reliable way to infer it before decoding it?
Most image file formats have unique bytes at the start. The unix file command looks at the start of the file to see what type of data it contains. See the Wikipedia article on Magic numbers in files and magicdb.org.
Sure there is. Like the others have mentioned, most images start with some sort of 'Magic', which will always translate to some sort of Base64 data. The following are a couple examples:
A Bitmap will start with Qk3
A Jpeg will start with /9j/
A GIF will start with R0l (That's a zero as the second char).
And so on. It's not hard to take the different image types and figure out what they encode to. Just be careful, as some have more than one piece of magic, so you need to account for them in your B64 'translation code'.
Either file on the *nix command-line or reading the initial bytes of the file. Most files come with a unique header in the first few bytes. For example, TIFF's header looks something like this: 0x00000000: 4949 2a00 0800 0000
For more information on the TIFF file format specifically if you'd like to know what those bytes stand for, go here.
TIFFs will begin with either II or MM (Intel byte ordering or Motorolla).
The TIFF 6 specification can be downloaded here and isn't too hard to follow