How to send a binary object to Oracle NoSQL Database? - oracle-nosql

We are currently using a BitSet in our Java code to represent the status of a series of chunks of work, each of which can be either success or failure. We want to find the most space-efficient type to convert it to in Oracle NoSQL. We think the best solution is to use binary, but we are hitting the following issue when trying to insert data:
Caused by: java.lang.IllegalArgumentException:
PUT: Illegal Argument: Invalid driver type for Binary: ARRAY
In fact, I don't know how to send a binary object to Oracle NoSQL Database. Any ideas?
If you need information about BitSet, please read
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/BitSet.html

If you are serializing/deserializing your Java objects to JSON, here is an example that shows how to convert a BitSet to and from Base64. Binary values in JSON are stored as Base64.
import java.util.Base64;
import java.util.BitSet;

public class HelloWorld {
    public static void main(String[] args) {
        // Create some example BitSets
        BitSet first = new BitSet();
        first.set(2);                    // 100
        BitSet second = new BitSet();
        second.set(4);                   // 10000
        second.set(3);                   // 11000
        BitSet third = new BitSet(2000);
        third.set(5, 49);                // sets bits 5..48

        // From BitSet to Base64
        byte[] encoded = Base64.getEncoder().encode(first.toByteArray());
        System.out.println("first=" + first);
        System.out.println("{'successfulChunkMask':'" + new String(encoded) + "'}");
        encoded = Base64.getEncoder().encode(second.toByteArray());
        System.out.println("second=" + second);
        System.out.println("{'successfulChunkMask':'" + new String(encoded) + "'}");
        encoded = Base64.getEncoder().encode(third.toByteArray());
        System.out.println("third=" + third);
        System.out.println("{'successfulChunkMask':'" + new String(encoded) + "'}");

        // From Base64 back to BitSet
        byte[] decoded = Base64.getDecoder().decode("4P8H");
        BitSet four = BitSet.valueOf(decoded);
        System.out.println("four=" + four);
        decoded = Base64.getDecoder().decode(encoded);   // 'encoded' still holds third's encoding
        BitSet thirdDecoded = BitSet.valueOf(decoded);
        System.out.println("{'successfulChunkMask':'" + new String(encoded) + "'}");
        System.out.println("third=" + thirdDecoded);
    }
}
The output (BitSet to Base64):
first={2}
{'successfulChunkMask':'BA=='}
second={3, 4}
{'successfulChunkMask':'GA=='}
third={5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48}
{'successfulChunkMask':'4P//////AQ=='}
The output (Base64 to BitSet), decoding the literal '4P8H' and then third's encoding:
four={5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}
{'successfulChunkMask':'4P//////AQ=='}
third={5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48}
If you use the NoSQL Java MapValue API (instead of a JSON representation of your object), you need to transform from byte[] to BitSet and vice versa; you don't need to go through Base64. See toByteArray() and BitSet.valueOf(byte[]) in the code provided.
If you are inserting data using the NoSQL API and providing a JSON value, the solution is Base64 encoding for your binary fields (java.util.BitSet). The same applies if you are using the SQL APIs.

You can use BitSet.toByteArray() to create a byte[] and then put that directly into an Oracle NoSQL FieldValue, which you can then write to the NoSQL database using put():
final FieldValue val = new BinaryValue(mybits.toByteArray());
When reading the value back from Oracle NoSQL, you can rebuild the BitSet by calling BitSet.valueOf(byte[]).
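Putting the pieces together, here is a minimal round-trip sketch, assuming the oracle.nosql.driver Java SDK and a hypothetical table workStatus(id INTEGER, chunkMask BINARY):

import java.util.BitSet;

import oracle.nosql.driver.NoSQLHandle;
import oracle.nosql.driver.ops.GetRequest;
import oracle.nosql.driver.ops.GetResult;
import oracle.nosql.driver.ops.PutRequest;
import oracle.nosql.driver.values.MapValue;

public class BitSetRoundTrip {
    // Writes the BitSet as a BINARY column and reads it back.
    static void writeAndRead(NoSQLHandle handle, BitSet bits) {
        MapValue row = new MapValue()
                .put("id", 1)
                .put("chunkMask", bits.toByteArray());   // byte[] stored as BINARY

        handle.put(new PutRequest()
                .setTableName("workStatus")
                .setValue(row));

        GetResult res = handle.get(new GetRequest()
                .setTableName("workStatus")
                .setKey(new MapValue().put("id", 1)));

        // byte[] back to BitSet; no Base64 involved on this path
        BitSet restored = BitSet.valueOf(res.getValue().get("chunkMask").getBinary());
        System.out.println("restored=" + restored);
    }
}

Either way, the BINARY column stores the same bytes; Base64 only appears when the value passes through JSON.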

Related

How to get hidden layer/state outputs from a Bert model?

Based on the documentation provided here, https://github.com/huggingface/transformers/blob/v4.21.3/src/transformers/modeling_outputs.py#L101, how can I read all the outputs: last_hidden_state, pooler_output, and hidden_states? In my sample code below, I get the outputs:
from transformers import BertModel, BertConfig
config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)
outputs = model(inputs)
When I print one of the outputs (sample below), I looked through the documentation to see if I can use some function of this class to just get the last_hidden_state values, but I'm not sure of the type here.
the value for the last_hidden_state =
tensor([[...
Is it some class, a tuple, or an array? How can I get the values as an array of values, such as
[0, 1, 2, 3, ...]
BaseModelOutputWithPoolingAndNoAttention(
    last_hidden_state=tensor([
        [ 0,  1,  2,  3,  4,  5, ..., 27, 28, 29],
        [ 0,  1,  2,  3,  4,  5, ..., 27, 28, 29]
        ...
    hidden_states= ...
The BaseModelOutputWithPoolingAndCrossAttentions you retrieve is a class that inherits from OrderedDict (code) and holds PyTorch tensors. You can access the keys of the OrderedDict like properties of a class, and, in case you do not want to work with tensors, you can convert them to Python lists or NumPy arrays. (hidden_states, when requested with output_hidden_states=True, is a tuple of such tensors: one for the embedding output plus one per layer.) Please have a look at the example below:
from transformers import BertTokenizer, BertModel

t = BertTokenizer.from_pretrained("bert-base-cased")
m = BertModel.from_pretrained("bert-base-cased")
i = t("This is a test", return_tensors="pt")
o = m(**i, output_hidden_states=True)

print(o.keys())                               # the available output fields
print(type(o))                                # the output container class
print(type(o.last_hidden_state))              # a plain torch.Tensor
print(o.last_hidden_state.tolist())           # tensor -> nested Python lists
print(o.last_hidden_state.detach().numpy())   # tensor -> numpy array
Output:
odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])
<class 'transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions'>
<class 'torch.Tensor'>
[[[0.36328405141830444, 0.018902940675616264, 0.1893523931503296, ..., 0.09052444249391556, 1.4617693424224854, 0.0774402841925621]]]
[[[ 0.36328405 0.01890294 0.1893524 ... -0.0259465 0.38701165
0.19099694]
[ 0.30656984 -0.25377586 0.76075834 ... 0.2055152 0.29494798
0.4561815 ]
[ 0.32563183 0.02308523 0.665546 ... 0.34597045 -0.0644953
0.5391255 ]
[ 0.3346715 -0.02526359 0.12209094 ... 0.50101244 0.36993945
0.3237842 ]
[ 0.18683438 0.03102166 0.25582778 ... 0.5166369 -0.1238729
0.4419385 ]
[ 0.81130844 0.4746894 -0.03862225 ... 0.09052444 1.4617693
0.07744028]]]

Tackling the 'Small Data' Problem with Distributed Computing Cluster?

I'm learning about Hadoop + MapReduce and Big Data and from my understanding it seems that the Hadoop ecosystem was mainly designed to analyze large amounts of data that's distributed on many servers. My problem is a bit different.
I have a relatively small amount of data (a file consisting of 1-10 million lines of numbers) which needs to be analyzed in millions of different ways. For example, consider the following dataset:
[1, 6, 7, 8, 10, 17, 19, 23, 27, 28, 28, 29, 29, 29, 29, 30, 32, 35, 36, 38]
[1, 3, 3, 4, 4, 5, 5, 10, 11, 12, 14, 16, 17, 18, 18, 20, 27, 28, 39, 40]
[2, 3, 7, 8, 10, 10, 12, 13, 14, 15, 15, 16, 17, 19, 27, 30, 32, 33, 34, 40]
[1, 9, 11, 13, 14, 15, 17, 17, 18, 18, 18, 19, 19, 23, 25, 26, 27, 31, 37, 39]
[5, 8, 8, 10, 14, 16, 16, 17, 20, 21, 22, 22, 23, 28, 29, 30, 32, 32, 33, 38]
[1, 1, 3, 3, 13, 17, 21, 24, 24, 25, 26, 26, 30, 31, 32, 35, 38, 39, 39, 39]
[1, 2, 4, 4, 5, 5, 10, 13, 14, 14, 14, 14, 15, 17, 28, 29, 29, 35, 37, 40]
[1, 2, 6, 8, 12, 13, 14, 15, 15, 15, 16, 22, 23, 24, 26, 30, 31, 36, 36, 40]
[3, 6, 7, 8, 8, 10, 10, 12, 13, 17, 17, 20, 21, 22, 33, 35, 35, 36, 39, 40]
[1, 3, 8, 8, 11, 11, 13, 18, 19, 19, 19, 23, 24, 25, 27, 33, 35, 37, 38, 40]
I need to analyze how frequently a number in each column (Column N) repeats itself a certain number of rows later (L rows later). For example, if we were analyzing Column A with 1L (1-Row-Later), the result would be as follows:
Note: The position does not need to match - so number can appear anywhere in the next row
Column: A N-Later: 1 Result: YES, NO, NO, NO, NO, YES, YES, NO, YES -> 4/9.
We would repeat the above analysis for each column separately, up to the maximum N-later. In the above dataset, which only consists of 10 lines, that means a maximum of 9L. But in a dataset of 1 million lines, the analysis (for each column) would be repeated 999,999 times.
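To make the check concrete, here is a minimal sketch of one such analysis, assuming the file has already been parsed into an int[][] of rows:

// Fraction of rows whose value in `column` reappears anywhere in the
// row `lag` rows later (the position does not need to match).
static double nLaterFrequency(int[][] rows, int column, int lag) {
    int hits = 0;
    for (int r = 0; r + lag < rows.length; r++) {
        int needle = rows[r][column];
        for (int v : rows[r + lag]) {
            if (v == needle) { hits++; break; }
        }
    }
    return (double) hits / (rows.length - lag);   // e.g. 4.0/9 for Column A at 1L
}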
I looked into the MapReduce framework, but it doesn't seem to cut it: it doesn't look like an efficient fit for this problem, and it would require a great deal of work to convert the core code into a MapReduce-friendly structure.
As you can see in the above example, each analysis is independent of the others. For example, it is possible to analyze Column A separately from Column B. It is also possible to perform the 1L analysis separately from 2L, and so on. However, unlike Hadoop, where the data lives on separate machines, in our scenario each server needs access to all of the data to perform its "share" of the analysis.
I looked into possible solutions for this problem, and it seems there are very few options: Ray, or building a custom application on top of YARN using Apache Twill. Apache Twill was moved to the Attic in 2020, which means that Ray is the only available option.
Is Ray the best way to tackle this problem, or are there other, better options? Ideally, the solution should automatically handle failover and distribute the processing load intelligently. For example, in the above example, if we wanted to distribute the load to 20 machines, one way of doing so would be to divide the 999,999 N-later analyses by 20 and let Machine A analyze 1L-49999L, Machine B 50000L-100000L, and so on. However, when you think about it, the load isn't being distributed equally, since it takes much longer to analyze 1L than 500000L: the latter involves only about half the rows (for 500000L, the first row we analyze is row 500001, so we essentially omit the first 500K rows from the analysis).
It should also not require a great deal of modification to the core code (like MapReduce does).
I'm working with Java.
Thanks
Well, you are right: your scenario and your technological stack are not that well suited to each other. Which raises the question: why not add something more relevant to your current stack? For instance, Redis.
It seems that your common operation is mainly counting values, and you want to avoid re-calculation and make the process more performant (e.g., by properly indexing your data). Given that this is one of the main features of Redis, it sounds logical to use it as a processor.
My suggestion:
Create a Redis hash that uses the numeric value as the key and its count as the value. This way you will be able to run different calculations over those metrics while iterating over your data set only once. Afterwards, you just pull the data from Redis by different criteria (per calculation or metric).
From this point, it's easy to save your calculated data back to your database and make it ready for direct querying. The overall process may be similar to this:
Scan the data from the file
Index it into Redis (using a hash)
Make the desired calculations (over the indexed counts)
Save the results in your DB (as a digested data set)
Flush the Redis DB
Query your DB (as much as you like)
Follow the docs for both populating and retrieving data; a sketch of the indexing step follows below.
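A minimal sketch of that indexing pass, assuming the Jedis client; the hash key valueCounts, host, port, and file name are placeholders:

import java.nio.file.Files;
import java.nio.file.Path;

import redis.clients.jedis.Jedis;

public class RedisIndexer {
    public static void main(String[] args) throws Exception {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // One pass over the file: for each number, increment its count
            // in a Redis hash (field = the number, value = its count so far).
            for (String line : Files.readAllLines(Path.of("data.txt"))) {
                for (String token : line.replaceAll("[\\[\\]\\s]", "").split(",")) {
                    if (!token.isEmpty()) {
                        jedis.hincrBy("valueCounts", token, 1);
                    }
                }
            }
            // Pull the digested counts back for the follow-up calculations.
            System.out.println(jedis.hgetAll("valueCounts"));
        }
    }
}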

How to post a "poster" to VKontakte social network wall

I have a script that can post text or an image to VK.com via the API. But I can't find a way to create a poster:
There is no information about posters at the official documentation page.
You have to add the parameter poster_bkg_id = {background_number_for_poster}. This parameter is not described in the documentation (thanks to the author of https://github.com/VBIralo/vk-posters).
Backgrounds (poster_bkg_id)
Gradients
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Artwork
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 31, 32
Emoji Patterns
21, 22, 23, 24, 25, 26, 27, 28, 29, 30
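Here is a minimal sketch of a wall.post call carrying poster_bkg_id, using Java's built-in HTTP client; TOKEN and OWNER_ID are placeholders, and the API version v=5.131 is an assumption:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class VkPosterDemo {
    public static void main(String[] args) throws Exception {
        String message = URLEncoder.encode("Hello, poster!", StandardCharsets.UTF_8);
        String url = "https://api.vk.com/method/wall.post"
                + "?owner_id=OWNER_ID"           // target wall (placeholder)
                + "&message=" + message
                + "&poster_bkg_id=3"             // a gradient background (1-10)
                + "&access_token=TOKEN"          // token with wall permissions
                + "&v=5.131";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());     // JSON with the new post id, or an error
    }
}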

Strange utf8 decoding error in windows notepad

If you type the following string into a text file encoded as UTF-8 (without BOM) and open it with notepad.exe, you will get some weird characters on screen. But Notepad can actually decode this string correctly without the last 'a'. Very strange behavior. I am using Windows 10 1809.
[19, 16, 12, 14, 15, 15, 12, 17, 18, 15, 14, 15, 19, 13, 20, 18, 16, 19, 14, 16, 20, 16, 18, 12, 13, 14, 15, 20, 19, 17, 14, 17, 18, 16, 13, 12, 17, 14, 16, 13, 13, 12, 15, 20, 19, 15, 19, 13, 18, 19, 17, 14, 17, 18, 12, 15, 18, 12, 19, 15, 12, 19, 18, 12, 17, 20, 14, 16, 17, 18, 15, 12, 13, 19, 18, 17, 18, 14, 19, 18, 16, 15, 18, 17, 15, 15, 19, 16, 15, 14, 19, 13, 19, 15, 17, 16, 12, 12, 18, 12, 14, 12, 16, 19, 12, 19, 12, 17, 19, 20, 19, 17, 19, 20, 16, 19, 16, 19, 16, 12, 12, 18, 19, 17, 18, 16, 12, 17, 13, 18, 20, 19, 18, 20, 14, 16, 13, 12, 12, 14, 13, 19, 17, 20, 18, 15, 12, 15, 20, 14, 16, 15, 16, 19, 20, 20, 12, 17, 13, 20, 16, 20, 13a
I wonder whether this is a Windows bug or there is something I can do to solve it.
Did more research; figured it out.
Seems like a variation of the classic case of "Bush hid the facts".
https://en.wikipedia.org/wiki/Bush_hid_the_facts
It looks like Notepad has a different character encoding default for saving a file than it does for opening a file. Yes, this does seem like a bug.
But there is an actual explanation for what is occurring:
Notepad checks for a BOM byte sequence. If it does not find one, it has 2 options: the encoding is either UTF-16 Little Endian (without BOM) or plain ASCII. It checks for UTF-16 LE first using a function called IsTextUnicode.
IsTextUnicode runs a series of tests to guess whether the given text is Unicode or not. One of these tests is IS_TEXT_UNICODE_STATISTICS, which uses statistical analysis. If the test is true, then the given text is probably Unicode, but absolute certainty is not guaranteed.
https://learn.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-istextunicode
If IsTextUnicode returns true, Notepad encodes the file with UTF-16 LE, producing the strange output you saw.
We can confirm this with this character ㄠ. Its corresponding ASCII characters are ' 1' (space one); the corresponding hex values for those ASCII characters are 0x20 for space and 0x31 for one. Since the byte-ordering is Little Endian, the order for the Unicode code point would be '1 ', or U+3120, which you can confirm if you look up that code point.
https://unicode-table.com/en/3120/
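You can reproduce the reinterpretation described above with a few lines of Java:

import java.nio.charset.StandardCharsets;

public class Utf16LeDemo {
    public static void main(String[] args) {
        // The ASCII bytes for " 1" (space, one), read as UTF-16 Little Endian,
        // combine into the single code unit 0x3120.
        byte[] ascii = {0x20, 0x31};
        String reinterpreted = new String(ascii, StandardCharsets.UTF_16LE);
        System.out.println(reinterpreted);                            // ㄠ
        System.out.printf("U+%04X%n", (int) reinterpreted.charAt(0)); // U+3120
    }
}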
If you want to solve the issue, you need to break the pattern which helps IsTextUnicode determine if the given text is Unicode. You can insert a newline before the text to break the pattern.
Hope that helped!

Algorithm for Iterating thru a list using pagination?

I am trying to come up with an algorithm to iterate thru a list via pagination.
I'm only interested in the initial index and the size of the "page".
For example if my list is 100 items long, and the page length is 10:
1st page: starts at 0, length 10
2nd page: starts at 10, length 10
3rd page: starts at 20, length 10
...
Nth page: starts at 90, length 10
My problem is coming up with an elegant solution that satisfies these cases:
1. list has 9 elements, page length is 10
1st page: starts at 0, length 9
2. list has 84 elements, page length is 10
1st page: starts at 0, length 10
2nd page: starts at 10, length 10
3rd page: starts at 20, length 10
...
Nth page: starts at 80, length 4
I could do this with a bunch of conditionals and the modulo operation, but I was wondering if anyone could offer a better, more elegant approach to this problem.
Thanks!
Here is some code doing it the long way in Python (an approach that could be used in other languages too), followed by how the intermediate Pythoneer would do it in a more maintainable fashion:
>>> from pprint import pprint as pp
>>> n, perpage = 84, 10
>>> mylist = list(range(n))
>>> mylist[:10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mylist[-10:] # last ten items
[74, 75, 76, 77, 78, 79, 80, 81, 82, 83]
>>> sublists = []
>>> for i in range(n):
...     pagenum, offset = divmod(i, perpage)
...     if offset == 0:
...         # first item of a new page, so create another sublist
...         sublists.append([])
...     # add the item to the end of the last sublist
...     sublists[pagenum].append(i)
>>> pp(sublists)
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83]]
>>> # Alternatively
>>> sublists2 = [mylist[i:i+perpage] for i in range(0, n, perpage)]
>>> pp(sublists2)
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83]]
>>>
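If you only need the start index and page length (as the question asks) rather than materialized sublists, the same divmod idea reduces to two lines of arithmetic per page; here is a sketch of the equivalent in Java:

public class Pages {
    // Prints "page k: starts at S, length L" for every page.
    static void printPages(int n, int pageSize) {
        int pages = (n + pageSize - 1) / pageSize;        // ceiling division
        for (int page = 0; page < pages; page++) {
            int start = page * pageSize;                  // first index on this page
            int length = Math.min(pageSize, n - start);   // shorter on the last page
            System.out.printf("page %d: starts at %d, length %d%n",
                    page + 1, start, length);
        }
    }

    public static void main(String[] args) {
        printPages(9, 10);   // page 1: starts at 0, length 9
        printPages(84, 10);  // ...ends with: page 9: starts at 80, length 4
    }
}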
