Maximum vector size for properties in Windows Search - winapi

I've hit a limit when indexing PDF files in Windows Search, specifically on the array size of the System.Keywords property. Everything works fine up to 20 tags, but any further tags aren't included in the index.
My first instinct was to see what the IFilter was capturing and using filtdump.exe I got the following output.
CHUNK: ---------------------------------------------------------------
Attribute = {F29F85E0-4FF9-1068-AB91-08002B27B3D9}\5 (System.Keywords)
idChunk = 3
BreakType = 0 (No Break)
Flags (chunkstate) = (Value)
Locale = 0 (0x0)
IdChunkSource = 0
cwcStartSource = 0
cwcLenSource = 0
VALUE: ---------------------------------------------------------------
Type = 31 (0x1f), VT_LPWSTR
Value = "TAG1; TAG2; TAG3; TAG4; TAG5; TAG6; TAG7; TAG8; TAG9; TAG10; TAG11; TAG12; TAG13; TAG14; TAG15; TAG16; TAG17; TAG18; TAG19; TAG20; TAG21"
So I could see that the IFilter was retrieving all 21 tags, which means the final tag was being dropped somewhere after the filter stage.
Doing a dump of the property schema for System.Keywords I got the following:
Property Key: {F29F85E0-4FF9-1068-AB91-08002B27B3D9} 5
Canonical Name: System.Keywords
Property Type: VT_VECTOR | VT_LPWSTR
Display Name: Tags
Edit Invitation: Add a tag
Type Flags: PDTF_MULTIPLEVALUES | PDTF_CANGROUPBY | PDTF_CANSTACKBY | PDTF_ISTREEPROPERTY | PDTF_ISVIEWABLE | PDTF_ISSYSTEMPROPERTY
View Flags:
Default Column Width: 11
Display Type: PDDT_STRING
Column State: SHCOLSTATE_TYPE_STR
Grouping Range: PDGR_DISCRETE
Relative Desc. Type: PDRDT_GENERAL
Sort Description: PDSD_A_Z
Sort Desc. Labels: A on top/Z on top
Aggregation Type: PDAT_UNION
Condition Type: PDCOT_STRING
Condition Operation: COP_WORD_EQUAL
Enumerated Types: 0
Search Info Flags: PDSIF_ININVERTEDINDEX | PDSIF_ISCOLUMN | PDSIF_ISCOLUMNSPARSE
Column Index Type: <not specified>
Projection String: System.Keywords
Max Size: 512
Also looking at the documentation for System.Keywords there is no mention of a maximum size or limit of items.
Again looking at documentation there is mention of maxSize attribute:
Optional. Indicates the maximum size allowed for the property value
stored in the Windows search database. This limit applies to the
individual elements of a vector, not the vector as a whole. Values
beyond this size are truncated. The default is "128" (bytes).
Currently, Windows Search does not use the maxSize when calculating
the amount of data it accepts from a file. Instead, the limit Windows
Search uses is the product of the size of the file and the
MaxGrowFactor (file size N * MaxGrowFactor) read from the registry at
HKEY_LOCAL_MACHINE->Software->Microsoft->Windows Search->Gathering
Manager->MaxGrowFactor. The default MaxGrowFactor is four (4).
Consequently, if your file type tends to be small in total size but
have larger properties, Windows Search may not accept all the property
data you want to emit. However, you can increase the MaxGrowFactor to
suit your needs.
However it isn't clear to me if this affects the size of the array. I'm guessing that this truncation is occurring in the Gatherer component of Windows Search so I'm wondering if there are any registry settings involved.
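To make the two documented limits concrete, here is a rough sketch (in Python, purely for illustration) that splits the filtdump value into its elements, measures each one against the Max Size of 512 from the schema dump, and computes the MaxGrowFactor budget described in the quoted documentation. Measuring elements as UTF-16 bytes is my assumption, and the helper names are made up for this example; nothing here calls Windows Search itself.

```python
MAX_SIZE = 512        # "Max Size" from the System.Keywords schema dump
MAX_GROW_FACTOR = 4   # documented default read from the registry


def split_keywords(value):
    """Split a ';'-separated keyword string into individual tags."""
    return [tag.strip() for tag in value.split(";") if tag.strip()]


def utf16_size(tag):
    """Size of one tag in bytes when stored as UTF-16 (assumed encoding)."""
    return len(tag.encode("utf-16-le"))


def gatherer_budget(file_size_bytes, grow_factor=MAX_GROW_FACTOR):
    """Amount of property data accepted: file size N * MaxGrowFactor."""
    return file_size_bytes * grow_factor


# Rebuild the value reported by filtdump: "TAG1; TAG2; ...; TAG21"
value = "; ".join("TAG%d" % i for i in range(1, 22))
tags = split_keywords(value)

print(len(tags))                                    # number of vector elements
print(max(utf16_size(t) for t in tags))             # largest single element
print(all(utf16_size(t) <= MAX_SIZE for t in tags)) # per-element limit respected?
```

If the per-element check passes and the file is large enough for the MaxGrowFactor budget, then neither documented limit explains a cut-off at exactly 20 elements, which is what makes a separate vector-length limit seem likely.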
FWIW I did look at the Windows Search database (Windows.edb) using the ESE Database View utility, and I could see from the schema that the column type is a large binary type, so there shouldn't be a limitation there. Looking at the raw value I could see the bytes for the tag values (separated by NUL characters) and terminated with a # character. But there were only 20 values, not 21, confirming the limit.
I've reached the end of my research, but I'm still no further along. Is it possible to extend the array size for System.Keywords or is it a hard-coded limit in the Gatherer component? Any help would be most appreciated, thanks in advance!

Related

Google Spreadsheet API returning grid limits error

I am trying to update a Google Sheet using the Ruby API (that is just a wrapper around the SheetsV4 API)
I am running into the following error
Google::Apis::ClientError: badRequest: Range ('MySheet'!AA1) exceeds grid limits. Max rows: 1000, max columns: 26
I have found references to this problem on the Google forum, however there did not seem to be a solution to the problem other than to use a different method to write to the spreadsheet.
The thing is, I need to copy an existing spreadsheet template, and enter my raw data in various sheets. So far I have been using this code (where service is a client of the Ruby SheetsV4 API)
def write_table(values, sheet: 'Sheet1', column: 1, row: 1, range: nil, value_input_option: 'RAW')
google_range = begin
if range
"#{sheet}!#{range}"
elsif column && row
"#{sheet}!#{integer_to_A1_notation(column)}#{row}"
end
end
value_range_object = ::Google::Apis::SheetsV4::ValueRange.new(
range: google_range, values: values
)
service.update_spreadsheet_value(spreadsheet_id,
google_range,
value_range_object,
value_input_option: value_input_option
)
end
It was working quite well so far, but after adding more data to my extracts, I went over the 26th column, (columns AA onwards) and now I am getting the error.
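The integer_to_A1_notation helper used above isn't shown in the question. For reference, a conversion of that kind could look like the following sketch (written in Python rather than Ruby; the function name is hypothetical and this is not the asker's actual helper). It shows why column 27 becomes AA, the first column past the 26-column grid named in the error.

```python
def integer_to_a1_column(column):
    """Convert a 1-based column index to A1 letters (1 -> A, 27 -> AA)."""
    if column < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while column > 0:
        # A1 letters are a bijective base-26 system, hence the "- 1".
        column, remainder = divmod(column - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


print(integer_to_a1_column(26))  # last column of a default 26-column grid
print(integer_to_a1_column(27))  # first column beyond it
print(integer_to_a1_column(30))  # last header column in the scenario
```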
Is there some option to pass to update_spreadsheet_value so we can raise this limit ?
Otherwise, what is the other way to write to the spreadsheet using append ?
EDIT - A clear description of my scenario
I have a template Google spreadsheet with 8 sheets(tabs), 4 of which are titled RAW-XX and this is where I try to update my data.
At the beginning, those raw tabs only have headers on 30 columns (A1 --> AD1)
My code needs to be able to fill all the cells A2 --> AD42
(1) for the first time
(2) and my code needs to be able to re-run again to replace those values by fresh ones, without appending
So basically I was thinking of using update_spreadsheet_value rather than append_xx because of requirement (2). But because of this bug/limitation (unclear) in the API, this does not work. Also important to note: I am not actually updating all those 30 columns in one go, but in several calls to the update method (with up to 10 columns each time).
I've thought that
- Maybe I am missing an option to send to the Google API to allow more than 26 columns in one go ?
- Maybe this is actually an undocumented hard limitation of the update API
- Maybe I can resort to deleting existing data + using append
EDIT 2
Suppose I have a template at version 1 with multiple sheets. (Note that I am using =xx to indicate a formula, [empty] to indicate there is nothing in the cell, and 1 to indicate the raw value "1" was supplied.)
Sheet1 - RAW
RAW Number of foos | RAW Number of Bars |
[empty] | [empty] |
Sheet2 - FORMATTED
Number of foos | Number of Bars
='Sheet1 - RAW'!A2 | ='Sheet1 - RAW'!B2
Now I call my app "for the first time", this copies the existing template to a new file "generated_spreadsheet" and injects data in the RAW sheet. It turns out at this moment, my app says there is 1 foo and 0 bar
Sheet1 - RAW
RAW Number of foos | RAW Number of Bars |
1 | 0 |
Sheet2 - FORMATTED
Number of foos | Number of Bars
='Sheet1 - RAW'!A2 | ='Sheet1 - RAW'!B2
If I call my app again later, both the template AND the data may have changed in the meantime, so I want to REPLACE everything in my "generated_spreadsheet".
In the meantime, the template has become
Sheet1 - RAW
RAW Number of foos | RAW Number of Bars |
[empty] | [empty] |
Sheet2 - FORMATTED
Number of foos | Number of Bars | All items
='Sheet1 - RAW'!A2 | ='Sheet1 - RAW'!B2 | =A2 + B2
Suppose now my app says there is still 1 foo and the number of bars went from 0 to 3. I want to update the "generated_spreadsheet" so it looks like
Sheet1 - RAW
RAW Number of foos | RAW Number of Bars |
1 | 3 |
Sheet2 - FORMATTED
Number of foos | Number of Bars | All items
='Sheet1 - RAW'!A2 | ='Sheet1 - RAW'!B2 | =A2 + B2
How about using values.append? In my environment, I ran into the same situation as you. To avoid this issue, I used values.append.
Please modify your code as follows and try again.
From:
service.update_spreadsheet_value(
To:
service.append_spreadsheet_value(
Reference:
Method: spreadsheets.values.append
If this was not the result you want, I'm sorry.
This happens because the range is out of bounds.
AA1 refers to column AA, which is column 27, but the sheet only has 26 columns. The starting point AA1 does not exist, which is why you get this error.
You can try Z1, which should be OK.
worksheet.resize(2000)
will resize your sheet to 2000 rows
It happened to me as well when I didn't have the empty columns (I removed all the empty columns from the spreadsheet). I simply added an empty one next to my last column and it works.
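The answers above all point the same way: the error reports the sheet's current grid size (max columns: 26), so the grid has to contain the target column before update can write to it. One way to grow the grid through the API itself is an appendDimension request sent via spreadsheets.batchUpdate. The sketch below only builds the request body as a plain dictionary (in Python, for illustration); the function name and the sheet_id value are hypothetical, and sending the body with your client library is left out.

```python
def append_columns_request(sheet_id, current_columns, needed_columns):
    """Build a batchUpdate body that adds enough columns to reach needed_columns.

    Returns None when the grid is already wide enough.
    """
    extra = needed_columns - current_columns
    if extra <= 0:
        return None
    return {
        "requests": [
            {
                "appendDimension": {
                    "sheetId": sheet_id,     # numeric id of the tab, not its title
                    "dimension": "COLUMNS",
                    "length": extra,         # how many columns to append
                }
            }
        ]
    }


# Grid has 26 columns (A..Z); the data needs 30 (A..AD).
body = append_columns_request(sheet_id=0, current_columns=26, needed_columns=30)
print(body)
```

Once the grid is wide enough, the existing update call should accept ranges starting at AA1 without switching to append.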

How to Create LMDB for Caffe Using C

I need to create LMDBs dynamically that can be read by Caffe's data layer, and the constraint is that only C is available for doing so. No Python.
Another person examined the byte-level contents of a Caffe-ready LMDB file here: Caffe: Understanding expected lmdb datastructure for blobs
This is a good illustrative example but obviously not comprehensive. Drilling down led me to the Datum message type, defined by caffe.proto, and the ensuing caffe.pb.h file created by protoc from caffe.proto, but this is where I hit a dead end.
The Datum class in the .h file defines a method that appears to be a promising lead:
void SerializeWithCachedSizes(::google::protobuf::io::CodedOutputStream* output) const
I'm guessing this is where the byte-level magic happens for encoding messages before they're sent.
Question: can anyone point me to documentation (or anything) that describes how the encoding works, so I can replicate an abridged version of it? In the illustrative example, the LMDB file contains MNIST data and metadata, and 0x08 seems to signify that the next value is "Number of Channels". And 0x10 and 0x18 designate heights and widths, respectively. 0x28 appears to designate an integer label being next. And so on, and so forth.
I'd like to gain a comprehensive understanding of all possible bytes and their meanings.
Additional digging yielded answers on the following page: https://developers.google.com/protocol-buffers/docs/encoding
Caffe.proto defines Datum by:
optional int32 channels = 1
optional int32 height = 2
optional int32 width = 3
optional bytes data = 4
optional int32 label = 5
repeated float float_data = 6
optional bool encoded = 7
The LMDB record's header in the illustrative example cited above is "08 01 10 1C 18 1C 22 90 06", so with the Google documentation's decoder ring, these hexadecimal values begin to make sense:
08 = Field 1, Type = int32 (since tags are encoded by: (field_number << 3) | wire_type)
01 = Value of Field 1 (i.e., number of channels) is 01
10 = Field 2, Type = int32
1C = Value of Field 2 (i.e., height) is 28
18 = Field 3, Type = int32
1C = Value of Field 3 (i.e., width) is 28
22 = Field 4, Type = length-delimited in bytes
90 06 = Value of Field 4 (i.e., number of bytes) is 784 (= 28 x 28) using the VarInt encoding methodology: 0x90 contributes its low 7 bits (16) and 0x06 contributes 6 << 7 (768)
Given this, efficiently creating LMDB entries directly with C for custom, non-image data sets that are readable by Caffe's data layer becomes straightforward.
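As a cross-check of the byte breakdown above, here is a minimal encoder sketch. It is written in Python for brevity, but it uses only shifts and masks, so it should port to C almost line for line. The function names are made up for this example, only non-negative integers are handled, and float_data and encoded are left out, so treat it as an illustration of the wire format and not as a complete Datum writer.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set continuation bit
        else:
            out.append(byte)
            return bytes(out)


def encode_datum(channels, height, width, data, label):
    """Serialize fields 1-5 of a Datum (non-negative values only)."""
    out = bytearray()
    for field_number, value in ((1, channels), (2, height), (3, width)):
        out += encode_varint((field_number << 3) | 0)  # wire type 0 = varint
        out += encode_varint(value)
    out += encode_varint((4 << 3) | 2)                 # wire type 2 = length-delimited
    out += encode_varint(len(data))
    out += data
    out += encode_varint((5 << 3) | 0)
    out += encode_varint(label)
    return bytes(out)


# A 1 x 28 x 28 image of zero bytes with label 7, like an MNIST record.
record = encode_datum(1, 28, 28, bytes(28 * 28), 7)
print(record[:9].hex())  # header bytes, to compare with the example above
```

The resulting bytes would be the value stored for each key in the LMDB; writing them with the LMDB C API is a separate step not covered here.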

Julia: How to modify a column of a matrix that has been saved as a binary file?

I am working with large matrices of data (Nrow x Ncol) that are too large to be stored in memory. Instead, it is standard in my field of work to save the data into a binary file. Due to the nature of the work, I only need to access 1 column of the matrix at a time. I also need to be able to modify a column and then save the updated column back into the binary file. So far I have managed to figure out how to save a matrix as a binary file and how to read 1 'column' of the matrix from the binary file into memory. However, after I edit the contents of a column I cannot figure out how to save that column back into the binary file.
As an example, suppose the data file is a 32-bit identity matrix that has been saved to disk.
Nrow = 500
Ncol = 325
data = eye(Float32,Nrow,Ncol)
stream_data = open("data","w")
write(stream_data,data[:])
close(stream_data)
Reading the entire file from disk and then reshaping back into the matrix is straightforward:
stream_data = open("data","r")
data_matrix = read(stream_data,Float32,Nrow*Ncol)
data_matrix = reshape(data_matrix,Nrow,Ncol)
close(stream_data)
As I said before, the data-matrices I am working with are too large to read into memory and as a result the code written above would normally not be possible to execute. Instead, I need to work with 1 column at a time. The following is a solution to read 1 column (e.g. the 7th column) of the matrix into memory:
icol = 7
stream_data = open("data","r")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
data_col = read(stream_data,Float32,Nrow)
close(stream_data)
Note that the coefficient '4' in the 'position_data' variable is because I am working with Float32. Also, I don't fully understand what the seek command is doing here, but it seems to be giving me the correct output based on the following tests:
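On what seek is doing: it moves the file's read/write position to an absolute byte offset, so the next read starts there instead of at the beginning. Since the matrix was written column by column, column icol starts 4*Nrow*(icol-1) bytes into the file. The same idea, sketched in Python for illustration (the function names are made up, and little-endian float32 is an assumption):

```python
import os
import struct
import tempfile

ITEMSIZE = 4  # bytes per Float32


def column_offset(icol, nrow, itemsize=ITEMSIZE):
    """Byte offset of 1-based column `icol` in a column-major file."""
    return itemsize * nrow * (icol - 1)


def read_column(path, icol, nrow):
    """Read one column of float32 values without loading the whole file."""
    with open(path, "rb") as f:
        f.seek(column_offset(icol, nrow))  # jump straight to the column
        raw = f.read(ITEMSIZE * nrow)
    return list(struct.unpack("<%df" % nrow, raw))  # "<" = little-endian (assumed)


def demo():
    """Write a 3x4 matrix column by column, then read back column 3."""
    nrow, ncol = 3, 4
    # Column j holds the value float(j) in every row.
    values = [float(j) for j in range(1, ncol + 1) for _ in range(nrow)]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(struct.pack("<%df" % len(values), *values))
        return read_column(path, 3, nrow)
    finally:
        os.remove(path)


print(column_offset(7, 500))  # offset of the 7th column in the 500 x 325 example
print(demo())
```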
data == data_matrix # true
data[:,7] == data_col # true
For the sake of this problem, let's say I have determined that the column I loaded (i.e. the 7th column) needs to be replaced with zeros:
data_col = zeros(Float32,size(data_col))
The problem now, is to figure out how to save this column back into the binary file without affecting any of the other data. Naturally I intend to use 'write' to perform this task. However, I am not entirely sure how to proceed. I know I need to start by opening up a stream to the data; however I am not sure what 'mode' I need to use: "w", "w+", "a", or "a+"? Here is a failed attempt using "w":
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
The original binary file (before my failed attempt to edit the binary file) occupied 650000 bytes on disk. This is consistent with the fact that the matrix is size 500x325 and Float32 numbers occupy 4 bytes (i.e. 4*500*325 = 650000). However, after my attempt to edit the binary file I have observed that the binary file now occupies only 14000 bytes of space. Some quick mental math shows that 14000 bytes corresponds to 7 columns of data (4*500*7 = 14000). A quick check confirms that the binary file has replaced all of the original data with a new matrix with size 500x7, and whose elements are all zeros.
stream_data = open("data","r")
data_new_matrix = read(stream_data,Float32,Nrow*7)
data_new_matrix = reshape(data_new_matrix,Nrow,7)
sum(abs(data_new_matrix)) # 0.0f0
What do I need to do/change in order to modify only the 7th 'column' in the binary file?
Instead of
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
in the OP, write
icol = 7
stream_data = open("data","r+")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
i.e. replace "w" with "r+" and everything works.
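The difference between the two modes is easy to reproduce in other languages too, because "w" truncates the file to zero length when it is opened, while "r+" opens it for reading and writing and leaves the existing contents in place. A small Python sketch of the same experiment (the helper names are made up for the example):

```python
import os
import tempfile


def overwrite_at(path, offset, payload, mode):
    """Seek to `offset`, write `payload`, and return the resulting file size."""
    with open(path, mode) as f:
        f.seek(offset)
        f.write(payload)
    return os.path.getsize(path)


def demo():
    """Patch 10 bytes at offset 40 of a 100-byte file using each mode."""
    results = {}
    for mode in ("wb", "r+b"):
        fd, path = tempfile.mkstemp()
        os.close(fd)
        try:
            with open(path, "wb") as f:
                f.write(b"\x01" * 100)
            results[mode] = overwrite_at(path, 40, b"\x00" * 10, mode)
        finally:
            os.remove(path)
    return results


print(demo())
```

With the truncating mode the file ends right after the written bytes (offset plus payload length), which mirrors the 14000-byte file in the question: 12000 bytes of offset plus 2000 bytes of new column data.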
The reference for open is http://docs.julialang.org/en/release-0.4/stdlib/io-network/#Base.open and it explains the various modes. Preferably, open shouldn't be called with the string mode parameter, which is somewhat confusing and definitely slower than the other forms described there.
You can use SharedArrays for the need you describe:
data=SharedArray("/some/absolute/path/to/a/file", Float32,(Nrow,Ncol))
# do something with data
data[:,1]=a[:,1].+1
exit()
# restart julia
data=SharedArray("/some/absolute/path/to/a/file", Float32,(Nrow,Ncol))
@show data[1,1]
# prints 1
Now, be mindful that you're supposed to handle synchronisation to read/write from/to this file (if you have async workers) and that you're not supposed to change the size of the array (unless you know what you're doing).
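As far as I understand, a file-backed SharedArray works by memory-mapping the file, so writes to the array end up in the file without explicit seek/write calls. The same technique can be sketched with Python's standard mmap module (this is an illustration of memory mapping in general, not of Julia's implementation; the function names are made up):

```python
import mmap
import os
import struct
import tempfile


def zero_column_mmap(path, icol, nrow, itemsize=4):
    """Zero out 1-based column `icol` of a column-major file via a memory map."""
    start = itemsize * nrow * (icol - 1)
    length = itemsize * nrow
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            mm[start:start + length] = bytes(length)
            mm.flush()                        # push the change to disk


def demo():
    """Fill a 3x4 float32 matrix with ones, zero column 2, read it back."""
    nrow, ncol = 3, 4
    count = nrow * ncol
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(struct.pack("<%df" % count, *([1.0] * count)))
        zero_column_mmap(path, 2, nrow)
        with open(path, "rb") as f:
            return list(struct.unpack("<%df" % count, f.read()))
    finally:
        os.remove(path)


print(demo())
```

As with the SharedArray version, the mapping cannot be used to change the size of the file, only its contents.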

RCFile - emitting GZip compressed int columns

For some reason, Hive is not recognizing columns emitted as integers, but does recognize columns emitted as strings.
Is there something about Hive or RCFile or GZ that is preventing proper rendering of int?
My Hive DDL looks like:
create external table if not exists db.table (intField int, strField string) stored as rcfile location '/path/to/my/data';
And the relevant portion of my Java looks like:
BytesRefArrayWritable dataWrite = new BytesRefArrayWritable(2);
byte[] byteArray;
BytesRefWritable bytesRefWritable = new BytesRefWritable();
intWritable.set(myObj.getIntField());
byteArray = WritableUtils.toByteArray(intWritable.get());
bytesRefWritable.set(byteArray, 0, byteArray.length);
dataWrite.set(0, bytesRefWritable); // sets int field as column 0
bytesRefWritable = new BytesRefWritable();
textWritable.set(myObj.getStrField());
bytesRefWritable.set(textWritable.getBytes(), 0, textWritable.getLength());
dataWrite.set(1, bytesRefWritable); // sets str field as column 1
The code runs fine, and through logging I can see the various Writables have bytes within them.
Hive can read the external table as well, but the int field shows up as NULL, indicating some error.
SELECT * from db.table;
OK
NULL my string field
Time taken: 0.647 seconds
Any idea what might be going on here?
So, I'm not sure exactly why this is the case, but I got it working using the following method:
In the code that writes the byte array representing the integer value, instead of using WritableUtils.toByteArray(), I set a Text object with Text.set(Integer.toString(intVal)) and then take its bytes with getBytes().
In other words, I convert the integer to its String representation, and use the Text writable object to get the byte array as if it were a string.
Then, in my Hive DDL, I can call the column an int and it interprets it correctly.
I'm not sure what was initially causing the problem, be it a bug in WritableUtils, some incompatibility with compressed integer byte arrays, or a faulty understanding of how this stuff works on my part. In any event, the solution described above successfully meets the task's needs.
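One way to see why the string form might succeed where the binary form gives NULL is to compare the bytes each approach produces. The sketch below (in Python, for illustration only) assumes the binary form is a 4-byte big-endian integer and that the reader parses column bytes as text; both are assumptions on my part and not something confirmed in the question, and the function names are made up.

```python
import struct


def int_as_binary(value):
    """4-byte big-endian encoding (assumed layout of a serialized int)."""
    return struct.pack(">i", value)


def int_as_text(value):
    """Decimal string encoding, as produced by going through a string."""
    return str(value).encode("utf-8")


def parse_as_text(raw):
    """Parse column bytes as a decimal integer; None stands in for NULL."""
    try:
        return int(raw.decode("utf-8"))
    except ValueError:  # also covers UnicodeDecodeError
        return None


print(int_as_binary(42), parse_as_text(int_as_binary(42)))
print(int_as_text(42), parse_as_text(int_as_text(42)))
```

If the table's SerDe does read columns as text, this would match the observed behaviour: the binary bytes are not a valid decimal literal, so the value comes back as NULL, while the text bytes parse cleanly.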

how bytes are used to store information in protobuf

I am trying to understand protocol buffers. Here is the sample; what I am not able to understand is how bytes are being used in the following messages. I don't know what the numbers 1, 2 and 3 are used for.
message Point {
required int32 x = 1;
required int32 y = 2;
optional string label = 3;
}
message Line {
required Point start = 1;
required Point end = 2;
optional string label = 3;
}
message Polyline {
repeated Point point = 1;
optional string label = 2;
}
I read the following paragraph in the Google protobuf documentation but am not able to understand what is being said here. Can anyone help me understand how bytes are being used to store info?
The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional elements.
The general form of a protobuf message is that it is a sequence of pairs of the form:
field header
payload
For your question, we can largely forget about the payload - that isn't the bit that relates to the 1/2/3 and the 1-15 recommendation - all of that is in the field header. The field header is a "varint" encoded integer; "varint" uses the most-significant-bit of each byte as a continuation bit, so small values (<=127, assuming unsigned and not zig-zag) require one byte to encode - larger values require multiple bytes. Or in other words, you get 7 useful bits to play with before you need to set the continuation bit, requiring at least 2 bytes.
However! The field header itself is composed of two things:
the wire-type
the field-number / "tag"
The wire-type is the lowest 3 bits, and indicates the fundamental format of the payload - "length-delimited", "64-bit", "32-bit", "varint", "start-group", "end-group". That means that of the 7 useful bits we had, only 4 are left; 4 bits is enough to encode numbers <= 15. This is why field-numbers <= 15 are suggested (as an optimisation) for your most common elements.
In your question, the 1 / 2 / 3 is the field-number; at the time of encoding this is left-shifted by 3 and composed with the payload's wire-type; then this composed value is varint-encoded.
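To see the one-byte boundary for yourself, here is a small sketch of the header computation (Python, for illustration; the function names are made up). It shifts the field-number left by 3, combines it with the wire-type, and varint-encodes the result.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits carry data
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def field_header(field_number, wire_type):
    """Header bytes for a field: (field_number << 3) | wire_type, as a varint."""
    return encode_varint((field_number << 3) | wire_type)


for number in (1, 2, 3, 15, 16):
    header = field_header(number, 0)  # wire type 0 = varint payload
    print(number, header.hex(), len(header))
```

Field-number 15 still fits in a single header byte, while 16 needs two, which is exactly the saving the quoted paragraph describes.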
Protobuf stores the messages like a map from an id (the =1, =2 which they call tags) to the actual value. This is to be able to more easily extend it than if it would transfer data more like a struct with fixed offsets. So a message Point for instance would look something like this on a high level:
1 -> 100,
2 -> 500
Which then is interpreted as x=100, y=500 and label=not set. On a lower level, protobuf serializes this tag-value mapping in a highly compact format, which among other things, stores integers with variable-length encoding. The paragraph you quoted just highlights exactly this in the case of tags, which can be stored more compactly if they are < 16, but the same for instance holds for integer values in your protobuf definition.
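The map view can be demonstrated by decoding the bytes of such a Point back into field-number/value pairs. The sketch below (Python, for illustration; function names are made up) handles only varint fields, which is enough for x and y; the input bytes are what x=100, y=500 serialize to under the encoding rules described in the other answer.

```python
def decode_varint(buf, pos):
    """Decode one varint starting at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # continuation bit clear: last byte
            return result, pos
        shift += 7


def decode_varint_fields(buf):
    """Decode a message made only of varint fields into {field_number: value}."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:
            raise ValueError("only varint fields are handled in this sketch")
        value, pos = decode_varint(buf, pos)
        fields[field_number] = value
    return fields


# Bytes for a Point with x=100, y=500 and no label.
print(decode_varint_fields(bytes.fromhex("086410f403")))
```

Note that the label field simply does not appear in the bytes at all, which is how "not set" is represented.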

Resources