How to hide Huggingface's logging messages in Longformer? - huggingface-transformers

I'm training a longformer model and getting many of these messages:
Initializing global attention on multiple choice...
Input ids are automatically padded from 412 to 512 to be a multiple of `config.attention_window`: 512
Initializing global attention on multiple choice...
Input ids are automatically padded from 448 to 512 to be a multiple of `config.attention_window`: 512
...
I found this source code for the model:
if global_attention_mask is None and input_ids is not None:
    logger.info("Initializing global attention on multiple choice...")
And also:
if padding_len > 0:
    logger.info(
        f"Input ids are automatically padded from {seq_len} to {seq_len + padding_len} to be a multiple of "
        f"`config.attention_window`: {attention_window}"
    )
Both of these messages appear to go through the transformers logger. I followed this Huggingface link on suppressing logging using:
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()
But this didn't seem to work. I'd add the code I'm using, but it's exceptionally long, so if there's a specific question about it I'd be happy to answer or share particular snippets. I'm mainly following this tutorial, just with the Longformer architecture.
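For context, the suppression call I've tried is sketched below, together with a variant I'm considering that lowers the Longformer module's logger directly via Python's standard logging module; the module path in that variant is a guess on my part and may differ between transformers versions.
import logging
from transformers import logging as hf_logging

# What I'm currently doing: show only ERROR and above from transformers
hf_logging.set_verbosity_error()

# Variant I'm considering: silence the module-level logger that emits the
# padding / global-attention messages (the module path is an assumption).
logging.getLogger("transformers.models.longformer.modeling_longformer").setLevel(logging.ERROR)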

Related

How can I read and write on ISO 14443 cards?

I'm trying my hand at using ISO 14443 cards. I can't find a way to read or write them via an Android app. Does anyone have any solutions?
For now I have downloaded Android apps like NFC Tools, but I'm not very adept at using them.
These are sort of like Type 2 NfcA tags (though not fully Type 2 compliant), and they have a datasheet describing which commands they support and how their memory is organised.
To read and write data on these tags you need to transceive a byte array containing the right command, and you will receive back another byte array with the result of the command.
Here is an example of how to transceive to NfcA on Android.
Your tag does not support the Fast Read (0x3A) command used in that example, but it does support the more standard Read command,
e.g. send the byte array
0x30,0x00 to read the first 4 blocks of data (16 bytes) from the tag (see section 6.2.1 of the datasheet, and note the CRC is calculated for you).
A write command begins with 0xA2,0x05 followed by 4 more bytes of data to write to the first user data area memory block.
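For illustration only (this is not Android code, just the raw command frames described above, with placeholder data bytes in the write):
# Raw NfcA command frames as described above; the CRC is calculated for you.
read_cmd = bytes([0x30, 0x00])                           # READ: returns blocks 0-3 (16 bytes)
write_cmd = bytes([0xA2, 0x05, 0x11, 0x22, 0x33, 0x44])  # WRITE: 4 placeholder bytes to the first user-data block (0x05)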

Train RoBERTa from scratch when the dataset is larger than RAM?

I have a corpus that is 16 GB and my RAM is also around 16 GB. If I load the entire dataset to train the RoBERTa language model from scratch, I am going to have a memory issue. I intend to train my RoBERTa using the script provided in Huggingface's tutorial from their blog post: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
Their blog post suggests using LineByLineTextDataset. However, this loads the dataset eagerly.
class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach
    soon.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path)
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

        batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)
Unexpectedly, my kernel crashed on the part where the lines are read. I wonder if there is a way to make it read lazily. It would be very desirable if the suggested answer required minimal code changes to the posted tutorial, since I'm rather new to Huggingface and afraid I won't be able to debug it on my own.
I would recommend using HuggingFace's own datasets library. The documentation says:
It provides a very efficient way to load and process data from raw files (CSV/JSON/text) or in-memory data (Python dict, pandas DataFrame) with a special focus on memory efficiency and speed. As a matter of example, loading an 18 GB dataset like English Wikipedia allocates 9 MB in RAM and you can iterate over the dataset at 1-2 GBit/s in Python.
The quick tour has good explanations and code snippets for creating a dataset object with your own data and it also explains how to train your own model.
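As a rough sketch of what that might look like for a plain-text corpus (the file name and tokenizer below are placeholders; the exact options are covered in the quick tour):
from datasets import load_dataset
from transformers import RobertaTokenizerFast

# "corpus.txt" is a placeholder for the 16 GB text file. load_dataset builds a
# memory-mapped Arrow dataset on disk, so the corpus never has to fit in RAM.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# map() tokenizes in batches and caches the result on disk as well.
tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
The tokenized dataset can then be passed to the Trainer (with DataCollatorForLanguageModeling for masked-language-modelling batches) in place of LineByLineTextDataset.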

How to use AudioConverterFillComplexBuffer and its callback?

I need a step-by-step walkthrough on how to use AudioConverterFillComplexBuffer and its callback. No, don't tell me to read the Apple docs. I do everything they say and the conversion always fails. No, don't tell me to go look for examples of AudioConverterFillComplexBuffer and its callback in use - I've duplicated about a dozen such examples both line for line and modified, and the conversion always fails. No, there isn't any problem with the input data. No, it isn't an endian issue. No, the problem isn't my version of OS X.
The problem is that I don't understand how AudioConverterFillComplexBuffer works, so I don't know what I'm doing wrong. And nothing out there is helping me understand, because it seems like nobody on Earth really understands how AudioConverterFillComplexBuffer works, either. From the people who actually use it (I spy cargo cult programming in their code) to even the authors of Learning Core Audio and/or Apple itself (http://stackoverflow.com/questions/13604612/core-audio-how-can-one-packet-one-byte-when-clearly-one-packet-4-bytes).
This isn't just a problem for me, it's a problem for anybody who wants to program high-performance audio on the Mac platform. Threadbare documentation that's apparently wrong and examples that don't work are no fun.
Once again, to be clear: I NEED A STEP-BY-STEP WALKTHROUGH ON HOW TO USE AudioConverterFillComplexBuffer plus its callback, and so does the entire Mac developer community.
This is a very old question but I think it is still relevant. I've spent a few days fighting this and have finally achieved a successful conversion. I'm certainly no expert, but I'll outline my understanding of how it works. Note I'm using Swift, which I'm also just learning.
Here are the main function arguments:
inAudioConverter: AudioConverterRef: This one is simple enough, just pass in a previously created AudioConverterRef.
inInputDataProc: AudioConverterComplexInputDataProc: The very complex callback. We'll come back to this.
inInputDataProcUserData: UnsafeMutableRawPointer?: This is a reference to whatever data you may need to be provided to the callback function. Important because even in Swift the callback can't inherit context. E.g. you may need to access an AudioFileID or keep track of the number of packets read so far.
ioOutputDataPacketSize: UnsafeMutablePointer<UInt32>: This one is a little misleading. The name implies it's the packet size but reading the documentation we learn it's the total number of packets expected for the output format. You can calculate this as outPacketCount = frameCount / outStreamDescription.mFramesPerPacket.
outOutputData: UnsafeMutablePointer<AudioBufferList>: This is an audio buffer list which you need to have already initialized with enough space to hold the expected output data. The size can be calculated as byteSize = outPacketCount * outMaxPacketSize.
outPacketDescription: UnsafeMutablePointer<AudioStreamPacketDescription>?: This is optional. If you need packet descriptions, pass in a block of memory the size of outPacketCount * sizeof(AudioStreamPacketDescription).
As the converter runs it will repeatedly call the callback function to request more data to convert. The main job of the callback is simply to read the requested number of packets from the source data. The converter will then convert the packets to the output format and fill the output buffer. Here are the arguments for the callback:
inAudioConverter: AudioConverterRef: The audio converter again. You probably won't need to use this.
ioNumberDataPackets: UnsafeMutablePointer<UInt32>: The number of packets to read. After reading, you must set this to the number of packets actually read (which may be less than the number requested if we reached the end).
ioData: UnsafeMutablePointer<AudioBufferList>: An AudioBufferList which is already configured except for the actual data. You need to initialise ioData.mBuffers.mData with enough capacity to hold the expected number of packets, i.e. ioNumberDataPackets * inMaxPacketSize. Set the value of ioData.mBuffers.mDataByteSize to match.
outDataPacketDescription: UnsafeMutablePointer<UnsafeMutablePointer<AudioStreamPacketDescription>?>?: Depending on the formats used, the converter may need to keep track of packet descriptions. You need to initialise this with enough capacity to hold the expected number of packet descriptions.
inUserData: UnsafeMutableRawPointer?: The user data that you provided to the converter.
So, to start you need to:
Have sufficient information about your input and output data, namely the number of frames and maximum packet sizes.
Initialise an AudioBufferList with sufficient capacity to hold the output data.
Call AudioConverterFillComplexBuffer.
And on each run of the callback you need to:
Initialise ioData with sufficient capacity to store ioNumberDataPackets of source data.
Initialise outDataPacketDescription with sufficient capacity to store ioNumberDataPackets of AudioStreamPacketDescriptions.
Fill the buffer with source packets.
Write the packet descriptions.
Set ioNumberDataPackets to the number of packets actually read.
Return noErr if successful.
Here's an example where I read the data from an AudioFileID:
var converter: AudioConverterRef?
// User data holds an AudioFileID, input max packet size, and a count of packets read
var uData = (fRef, maxPacketSize, UnsafeMutablePointer<Int64>.allocate(capacity: 1))

err = AudioConverterNew(&inStreamDesc, &outStreamDesc, &converter)
err = AudioConverterFillComplexBuffer(converter!, { _, ioNumberDataPackets, ioData, outDataPacketDescription, inUserData in
    let uData = inUserData!.load(as: (AudioFileID, UInt32, UnsafeMutablePointer<Int64>).self)
    ioData.pointee.mBuffers.mDataByteSize = uData.1
    ioData.pointee.mBuffers.mData = UnsafeMutableRawPointer.allocate(byteCount: Int(uData.1), alignment: 1)
    outDataPacketDescription?.pointee = UnsafeMutablePointer<AudioStreamPacketDescription>.allocate(capacity: Int(ioNumberDataPackets.pointee))
    let err = AudioFileReadPacketData(uData.0, false, &ioData.pointee.mBuffers.mDataByteSize, outDataPacketDescription?.pointee, uData.2.pointee, ioNumberDataPackets, ioData.pointee.mBuffers.mData)
    uData.2.pointee += Int64(ioNumberDataPackets.pointee)
    return err
}, &uData, &numPackets, &bufferList, nil)
Again, I'm no expert, this is just what I've learned by trial and error.

How to correctly represent message class in SMPP

I am currently trying to figure out how SMS classes are correctly represented in SMPP. However, I am by now completely confused by the standard and its documentation.
In normal SMS we have:
Class 0: flash SMS, which is shown on the display
Class 1: normal SMS, to be stored on the SIM or internally in the device
Looking at the SMPP spec, I first find the parameter data_coding in the submit_sm operation, which is used to set the DCS sent via MAP. As far as I understand this, if we want to explicitly set the message class we need to set the first four bits of this parameter to ones, then two bits indicating the coding, and then another two bits indicating the message class. So for a Class 1 SMS, we would set 1111xx01. Is this correct so far?
However, if we try to set this DCS, we currently also set the data coding to "8-bit data". It seems several phones are not able to understand this. Is this specified anywhere, and can we just change it, or is a special coding needed when sending other message classes?
More confusion arises when we try to use the SMPP v3.4 recommended way of setting the message class. Since 3.4 there is an optional parameter in the submit_sm operation called dest_addr_subunit. According to the standard this parameter should be set to 0 for unknown, 1 for MS display, 2 for mobile equipment, etc. If I look at this, it seems the values are shifted by one compared to GSM message classes: Class 0 is encoded as 1, Class 1 is encoded as 2, and so on. Is this correct, or is there a more complicated mapping behind this?
Also, if we set dest_addr_subunit, do we still have to set the DCS as well, or can we just leave that parameter at its default value?
I recommend reading the 3GPP TS 23.038 specification, which describes the DCS (Data Coding Scheme) in detail.
When DCS bits 7-4 are 00xx, you should check bit 4 of the DCS:
bit 4 == 0 - no message class for this message (bits 1 and 0 are reserved)
bit 4 == 1 - bits 1 and 0 contain the message class
So you should set the data_coding SMPP parameter in accordance with the 3GPP TS 23.038 specification to handle message_class properly.
By default a GSM SMS message has no message_class, and this is not the same as message_class = 1.
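As a small illustration of that bit layout (a sketch based on my reading of TS 23.038, general data coding group with bits 7-6 = 00):
# Build a TS 23.038 DCS byte in the "general data coding" group (bits 7-6 = 00).
GSM_7BIT, DATA_8BIT, UCS2 = 0b00, 0b01, 0b10   # bits 3-2: character set

def make_dcs(alphabet=GSM_7BIT, message_class=None):
    dcs = alphabet << 2                  # bits 7-5 = 0: general group, no compression
    if message_class is not None:
        dcs |= 0b0001_0000               # bit 4 = 1: bits 1-0 carry the class
        dcs |= message_class             # bits 1-0: class 0..3
    return dcs

print(hex(make_dcs(GSM_7BIT, 1)))   # 0x11 -> GSM 7-bit, message class 1
print(hex(make_dcs(GSM_7BIT, 0)))   # 0x10 -> GSM 7-bit, message class 0 (flash)
print(hex(make_dcs()))              # 0x0  -> GSM 7-bit, no message class (the default)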

Embedding GSM cellids in Short Messages

I'm using the WML function "providelocalinfo" to put location information into Short Messages sent via a WIB menu on a GSM handset.
I'm using the WIG WML v.4 Spec from SmartTrust. The relevant section is "9.4 providelocalinfo Element"
I use the code as in the example, and then transmit the variable via SMS, and use Kannel to retrieve the message from the SMSC.
Here's the code that I'm using, with the exception of [myservicecentre] being my actual service centre:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE wml PUBLIC "-//SmartTrust//DTD WIG-WML 4.0//EN"
  "http://www.smarttrust.com/DTD/WIG-WML4.0.dtd">
<wml wibletenc="UCS2">
  <card id="s">
    <p>
      <providelocalinfo cmdqualifier="location" destvar="LOC"/>
      <setvar name="X" value="loc=" class="binary"/>
      <sendsm>
        <destaddress value="367"/>
        <userdata docudenc="hex-binary" dcs="245">
          $(X)$(LOC)
        </userdata>
        <servicecentreaddress value="[myservicecentre]"/>
      </sendsm>
    </p>
  </card>
</wml>
What I see in my received messages is "loc=" followed by 7 bytes (octets) of binary data. I have tried to find documentation explaining how to decode this data, but found nothing that explains it clearly.
Of the decoded 7 octets:
the first 3 octets are always the same,
the next 2 octets tend to vary between three unique values,
the last 2 octets appear to be the cell ID.
So I have coded the receiver to pull the last two octets and construct a 16-bit GSM cell ID. Most of the time it matches known cell IDs from the network. But quite often, the value does not match.
So I'm trying to find information on the following:
How to properly transmit the location information in a safe manner (encodings, casts, etc)
How to decode the information properly
How to configure Kannel to honor binary location data
I've examined the following documents in my vain searching, but not found the relevant data:
GSM 03.38, GSM 04.07, GSM 04.08, GSM 11.15, as well as the WIG WML Spec v.4
Any insight into what I might be doing wrong would be appreciated!
To decode the location info, you need to look in GSM 11.14 page 48
1.19 LOCATION INFORMATION
Byte(s) Description Length
1 Location Information tag 1
2 Length (X) of bytes following 1
3-5 Mobile Country & Network Codes (MCC & MNC) 3
6-7 Location Area Code (LAC) 2
8-9 Cell Identity Value (Cell ID) 2
The mobile country code (MCC), the mobile network code (MNC), the location area code (LAC) and the cell ID are coded as in TS GSM 04.08 [8].
From personal experience, the first octet mentioned here is usually left off, so your first three unchanging bytes are the length and the country. The next 2 are the network operator code.
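If it helps, here is a sketch of how the last four octets could be turned into the LAC and cell ID, assuming the 7 octets follow the tail of the table above (Location Area Code and Cell Identity, both big-endian as coded in TS GSM 04.08):
# Sketch: extract the LAC and 16-bit cell ID from the 7-octet location blob,
# assuming the last four octets are the LAC and Cell Identity (big-endian).
def decode_lac_and_cell_id(octets: bytes):
    assert len(octets) == 7
    lac = (octets[3] << 8) | octets[4]       # Location Area Code
    cell_id = (octets[5] << 8) | octets[6]   # Cell Identity
    return lac, cell_id

# Example with made-up data: returns (0x1A2B, 0x3C4D)
print(decode_lac_and_cell_id(bytes([0x13, 0x00, 0x62, 0x1A, 0x2B, 0x3C, 0x4D])))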
Not too many bites on this question! I wanted to summarize my findings in case others can find them useful:
Need to send messages with a dcs setting not equal to 0. dcs="0" sends the data packed (honoring the lower 7 bits of each octet; this allows 160-character SMS messages when the maximum message size is actually 140 octets).
Need to parse the data in a binary-safe manner: regular expressions that stop searching when 0x0A is encountered will fail, since the binary data itself can contain that value.
I found no need to change Kannel's default configuration.
Cheers
Disclaimer: safe transmission of 16-bit GSM cell IDs requires dealing with a few settings that I understand only because they weren't configured by default. There are probably other defaults that I've depended on without being aware that they can vary.
