Databricks read Azure blob last modified date - azure-blob-storage

I have an Azure Blob Storage container mounted to my Databricks HDFS.
Is there a way to get the last modified date of the blob in Databricks?
This is how I'm reading the blob content:
val df = spark.read
.option("header", "false")
.option("inferSchema", "false")
.option("delimiter", ",")
.csv("/mnt/test/*")

Generally, there are two ways to read an Azure blob's last modified date, as described below.
1. Read it directly via the Azure Storage REST API or the Azure Storage SDK for Java.
Looking at the Azure Blob Storage REST APIs, there are two operations, Get Blob and Get Blob Properties, which return the Last-Modified property in the response headers. So you can call these APIs from Scala and parse the response headers, or simply use the Azure Storage SDK for Java from Scala to do the same.
Here is my sample Java code for getting the Last-Modified property of a blob.
import java.util.Date;
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlob;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
String StorageConnectionStringTemplate = "DefaultEndpointsProtocol=https;" +
        "AccountName=%s;" +
        "AccountKey=%s";
String accountName = "<your storage account name>";
String accountKey = "<your storage account key>";
String containerName = "<your container name>";
String blobName = "<blob name>";
String storageConnectionString = String.format(StorageConnectionStringTemplate, accountName, accountKey);
CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConnectionString);
CloudBlobClient client = storageAccount.createCloudBlobClient();
CloudBlobContainer container = client.getContainerReference(containerName);
CloudBlob blob = container.getBlobReferenceFromServer(blobName);
Date lastModifiedDate = blob.getProperties().getLastModified();
Since Hadoop Azure is based on Azure Storage SDK for Java 8.0.0, not the newest version 10.0, my sample code above differs from the official Azure Blob Storage for Java tutorial.
If you want to get the Last-Modified property of a container, you can use the Get Container Properties REST API or the Java code Date lastModifiedDate = container.getProperties().getLastModified();.
2. Use the Hadoop FileSystem Java API over the wasb:// protocol.
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileStatus;
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
Path f = new Path("<blob path on HDFS>");
FileStatus fileStatus = hdfs.getFileStatus(f);
long lastModifiedTime = fileStatus.getModificationTime();
Date lastModifiedDate = new Date(lastModifiedTime);
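If you need the last modified date of every file matched by the wildcard read above, a minimal sketch using the same Hadoop FileSystem API (assuming the /mnt/test mount path from the question) could look like this:
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// List every file under the mounted directory (the same files the
// wildcard read above picks up) and print its last modified date.
FileStatus[] statuses = fs.listStatus(new Path("/mnt/test"));
for (FileStatus status : statuses) {
    if (status.isFile()) {
        Date lastModified = new Date(status.getModificationTime());
        System.out.println(status.getPath() + " -> " + lastModified);
    }
}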

Related

Accessing ADLS Gen2 public container from the C# SDK

I am unable to access a public container using the C# SDK, even though I have enabled "Allow Blob public access" in the storage account configuration.
var fileSystemClient = new DataLakeFileSystemClient(new Uri("https://somestorageaccount.dfs.core.windows.net/public"), new DataLakeClientOptions());
var paths = fileSystemClient.GetPaths();
foreach (var path in paths)
{
Console.WriteLine(path);
}
This code throws the following exception:
Azure.RequestFailedException: 'Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.'
Is there anything I can configure to make this work?
I tried this in my environment and got the results below.
Initially, I created an ADLS Gen2 container with its public access level set to Container.
When I tried to access the file through the browser, I got the same error.
When accessing through the file system, files kept in the storage account are not accessible anonymously; it is necessary to authorize access even if the public access level is set. You are getting this error because you are attempting to access the resource without authorization.
If you need to access the files, you need to authorize with a SAS token.
I tried the file URL with a SAS token appended in the browser, and I was able to access the file.
You can get a SAS token by selecting the file in the portal and clicking Generate SAS token.
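If you want to do the same programmatically with a SAS token rather than the account key, a minimal sketch using the equivalent Azure Data Lake SDK for Java (azure-storage-file-datalake) could look like the following; the endpoint and file system name come from the question, and the SAS token is a placeholder:
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClientBuilder;
import com.azure.storage.file.datalake.models.PathItem;

// Build a client that authorizes with a SAS token instead of the account key.
DataLakeFileSystemClient fileSystemClient = new DataLakeFileSystemClientBuilder()
        .endpoint("https://somestorageaccount.dfs.core.windows.net")
        .fileSystemName("public")
        .sasToken("<sas-token>")
        .buildClient();

// List the paths in the file system; each request is signed with the SAS token.
for (PathItem pathItem : fileSystemClient.listPaths()) {
    System.out.println(pathItem.getName());
}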
If you need to access the paths of a Data Lake Gen2 file system in C#, you can use the StorageSharedKeyCredential method, as in the following code:
string storageAccountName = StorageAccountName;
string storageAccountKey = StorageAccountKey;
Uri serviceUri = StorageAccountUri;
StorageSharedKeyCredential sharedKeyCredential = new StorageSharedKeyCredential(storageAccountName, storageAccountKey);
DataLakeServiceClient serviceClient = new DataLakeServiceClient(serviceUri, sharedKeyCredential);
DataLakeFileSystemClient filesystem = serviceClient.GetFileSystemClient("<file-system-name>"); // e.g. "public"
List<string> names = new List<string>();
foreach (PathItem pathItem in filesystem.GetPaths())
{
    names.Add(pathItem.Name);
}
Reference: How to get list of child files/directories having parent DataLakeDirectoryClient class instance (Stack Overflow answer in Java by Jim Xu).

JMeter - How to copy large files from one S3 bucket to another bucket in the same S3

I am trying to run a performance script that has an API which takes its payload as a file from S3 and submits it to the target system. As a JSR223 PreProcessor I have added the code from Jmeter - How to copy files from one AWS S3 bucket to another bucket?.
That copies files up to 5 GB; it does not allow copying larger files, whereas I need to test with a 15 GB file. Please suggest how to handle files larger than 5 GB.
Looking into Q: How much data can I store in Amazon S3?
Upload an object in a single operation using the AWS SDKs, REST API, or AWS CLI—With a single PUT operation, you can upload a single object up to 5 GB in size.
from the same page:
Upload an object in parts using the AWS SDKs, REST API, or AWS CLI—Using the multipart upload API, you can upload a single large object, up to 5 TB in size.
So you need to switch to multipart upload; there is even a piece of example code:
import com.amazonaws.AmazonServiceException;
import com.amazonaws.SdkClientException;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;

public class HighLevelMultipartUpload {

    public static void main(String[] args) throws Exception {
        Regions clientRegion = Regions.DEFAULT_REGION;
        String bucketName = "*** Bucket name ***";
        String keyName = "*** Object key ***";
        String filePath = "*** Path for file to upload ***";

        try {
            AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                    .withRegion(clientRegion)
                    .withCredentials(new ProfileCredentialsProvider())
                    .build();
            TransferManager tm = TransferManagerBuilder.standard()
                    .withS3Client(s3Client)
                    .build();

            // TransferManager processes all transfers asynchronously,
            // so this call returns immediately.
            Upload upload = tm.upload(bucketName, keyName, new File(filePath));
            System.out.println("Object upload started");

            // Optionally, wait for the upload to finish before continuing.
            upload.waitForCompletion();
            System.out.println("Object upload complete");
        } catch (AmazonServiceException e) {
            // The call was transmitted successfully, but Amazon S3 couldn't process
            // it, so it returned an error response.
            e.printStackTrace();
        } catch (SdkClientException e) {
            // Amazon S3 couldn't be contacted for a response, or the client
            // couldn't parse the response from Amazon S3.
            e.printStackTrace();
        }
    }
}
In the case of JMeter's Groovy (JSR223) scripting, you either need to invoke this main() method explicitly or remove the class and main() declarations entirely.
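For example, a minimal JSR223-style sketch with the class and main() removed, assuming the AWS SDK for Java (v1) jars are on JMeter's classpath and using hypothetical bucket and key names, could look like this; TransferManager.copy performs a multipart copy under the hood, so objects larger than 5 GB (such as a 15 GB payload) can be copied between buckets:
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.Copy;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

// No class or main() wrapper, so this can run directly in a JSR223 PreProcessor/Sampler.
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.DEFAULT_REGION)
        .withCredentials(new ProfileCredentialsProvider())
        .build();
TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(s3Client)
        .build();

// Copy the object between buckets; TransferManager uses multipart copy
// for large objects, so files above the 5 GB single-operation limit work.
Copy copy = tm.copy("source-bucket", "path/to/15gb-payload.dat",
        "destination-bucket", "path/to/15gb-payload.dat");
copy.waitForCompletion();
tm.shutdownNow(false);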

Dynamics CRM Attachment Data Import Using SDK

I am trying to import attachments (annotations) into Dynamics CRM, and I am doing this using the SDK.
I am not using the data import wizard.
I am not individually creating Annotation entities; instead I am using the Data Import feature programmatically.
I mostly leveraged the DataImport sample from the SDK sample code (SDK\SampleCode\CS\DataManagement\DataImport).
Import import = new Import()
{
    ModeCode = new OptionSetValue((int)ImportModeCode.Create),
    Name = "Data Import"
};
Guid importId = _serviceProxy.Create(import);

_serviceProxy.Create(
    new ColumnMapping()
    {
        ImportMapId = new EntityReference(ImportMap.EntityLogicalName, importMapId),
        ProcessCode = new OptionSetValue((int)ColumnMappingProcessCode.Process),
        SourceEntityName = sourceEntityName,
        SourceAttributeName = sourceAttributeName,
        TargetEntityName = targetEntityName,
        TargetAttributeName = targetAttributeName
    });
I am getting an error "The reference to the attachment could not be found".
The documentation says the crm async service will find the physical file on disk and upload it, my question is where does the async service look for attachment files?
I tried to map the documentbody field to the full path of the attachment on the disk, but that still didn't work.
Note: the answer below was provided before the question was edited to clarify that the Data Import feature is being used rather than direct SDK creation of records; it is specific to creating Annotation records via the SDK.
When you attach files to an Annotation (Note) record in CRM via the SDK, you do use the documentbody attribute (along with mimetype), but you have to convert the file contents to Base64 first.
Something like this:
var myFile = @"C:\Path\To\My\File.pdf";
// Do checks to make sure file exists...
// Convert to Base64.
var base64Data = Convert.ToBase64String(System.IO.File.ReadAllBytes(myFile));
var newNote = new Entity("annotation");
// Set subject, regarding object, etc.
// Add the data required for a file attachment.
newNote.Attributes.Add("documentbody", base64Data);
newNote.Attributes.Add("mimetype", "text/plain"); // This mime type seems to work for all file types.
orgService.Create(newNote);
I found the solution in an obscure blog post. I think the documentation is misleading or unclear; given how this feature actually works, the idea of having the files available on the server disk for the async service to process is odd.
Following the same principle as the CSV file itself, all content has to be sent as import files linked to the same import.
To solve this, we need to create an individual "Internal" ImportFile for each physical attachment and link it to the import that holds the attachment record details.
As you can see below, by linking the attachment ImportFile through ImportId and setting the two properties ProcessCode and FileTypeCode, it all worked in the end.
Suffice it to say, this method is much more efficient and quicker than individually creating Annotation records.
foreach (var line in File.ReadLines(csvFilesPath + "Attachment.csv").Skip(1))
{
    var fileName = line.Split(',')[0].Replace("\"", null);
    using (FileStream stream = File.OpenRead(attachmentsPath + fileName))
    {
        byte[] byteData = new byte[stream.Length];
        stream.Read(byteData, 0, byteData.Length);
        stream.Close();
        string encodedAttachmentData = System.Convert.ToBase64String(byteData);

        ImportFile importFileAttachment = new ImportFile()
        {
            Content = encodedAttachmentData,
            Name = fileName,
            ImportMapId = new EntityReference(ImportMap.EntityLogicalName, importMapId),
            UseSystemMap = true,
            ImportId = new EntityReference(Import.EntityLogicalName, importId),
            ProcessCode = new OptionSetValue((int)ImportFileProcessCode.Internal),
            FileTypeCode = new OptionSetValue((int)ImportFileFileTypeCode.Attachment),
            RecordsOwnerId = currentUserRef
        };
        _serviceProxy.Create(importFileAttachment);
    }
}

Exception : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=hbase, access=EXECUTE

I am trying to perform a bulk load into HBase. The input to the MapReduce job is an HDFS file (from Hive).
I am using the code below in the Tool (job) class to initiate the bulk loading process:
HFileOutputFormat.configureIncrementalLoad(job, new HTable(config, TABLE_NAME));
In the Mapper, I am using the following as the Mapper output:
context.write(new ImmutableBytesWritable(Bytes.toBytes(hbaseTable)), put);
Once the mapper completes, I perform the actual bulk load using:
LoadIncrementalHFiles loadFfiles = new LoadIncrementalHFiles(configuration);
HTable hTable = new HTable(configuration, tableName);
loadFfiles.doBulkLoad(new Path(pathToHFile), hTable);
The job runs fine, but once the LoadIncrementalHFiles step starts, it hangs forever and I have to kill the job; after a long wait of maybe 30 minutes I finally got the above error. After extensive searching I found that HBase tries to access the HFiles placed in the output folder, and that folder does not have write or execute permission, hence the error above. So the alternative solution is to add file access permissions in the Java code before the bulk load is performed, as below.
FileSystem fileSystem = FileSystem.get(config);
fileSystem.setPermission(new Path(outputPath),FsPermission.valueOf("drwxrwxrwx"));
Is this the correct approach as we move from development to production? Also, once I added the above code, I got a similar error for the folder created inside the output folder, this time for the column family folder, which is created dynamically at runtime.
As a temporary workaround, I did the following and was able to move ahead.
fileSystem.setPermission(new Path(outputPath+"/col_fam_folder"),FsPermission.valueOf("drwxrwxrwx"));
Both of these steps seem to be workarounds, and I need a correct solution to move to production. Thanks in advance.
Try this
System.setProperty("HADOOP_USER_NAME", "hadoop");
Secure bulk load seems to be an appropriate answer. This thread explains a sample implementation. The snippet is copied over as below.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.SecureBulkLoadClient;
import org.apache.hadoop.hbase.security.UserProvider;
import org.apache.hadoop.hbase.security.token.FsDelegationToken;
import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.security.UserGroupInformation;
String keyTab = "pathtokeytabfile";
String tableName = "tb_name";
String pathToHFile = "/tmp/tmpfiles/";
Configuration configuration = new Configuration();
configuration.set("hbase.zookeeper.quorum","ZK_QUORUM");
configuration.set("hbase.zookeeper"+ ".property.clientPort","2181");
configuration.set("hbase.master","MASTER:60000");
configuration.set("hadoop.security.authentication", "Kerberos");
configuration.set("hbase.security.authentication", "kerberos");
//Obtaining kerberos authentication
UserGroupInformation.setConfiguration(configuration);
UserGroupInformation.loginUserFromKeytab("here keytab", path to the key tab);
HBaseAdmin.checkHBaseAvailable(configuration);
System.out.println("HBase is running!");
HBaseConfiguration.addHbaseResources(configuration);
Connection conn = ConnectionFactory.createConnection(configuration);
Table table = conn.getTable(TableName.valueOf(tableName));
HRegionInfo tbInfo = new HRegionInfo(table.getName());
//path to the HFiles that need to be loaded
Path hfofDir = new Path(pathToHFile);
//acquiring user token for authentication
UserProvider up = UserProvider.instantiate(configuration);
FsDelegationToken fsDelegationToken = new FsDelegationToken(up, "name of the key tab user");
fsDelegationToken.acquireDelegationToken(hfofDir.getFileSystem(configuration));
//preparing for the bulk load
SecureBulkLoadClient secureBulkLoadClient = new SecureBulkLoadClient(table);
String bulkToken = secureBulkLoadClient.prepareBulkLoad(table.getName());
System.out.println(bulkToken);
//creating the family list (list of family names and path to the hfile corresponding to the family name)
final List<Pair<byte[], String>> famPaths = new ArrayList<>();
Pair p = new Pair();
//name of the family
p.setFirst("nameofthefamily".getBytes());
//path to the HFile (HFile are organized in folder with the name of the family)
p.setSecond("/tmp/tmpfiles/INTRO/nameofthefilehere");
famPaths.add(p);
//bulk loading ,using the secure bulk load client
secureBulkLoadClient.bulkLoadHFiles(famPaths, fsDelegationToken.getUserToken(), bulkToken, tbInfo.getStartKey());
System.out.println("Bulk Load Completed..");

Windows Azure REST API MediaLink

I'm trying to use the Windows Azure REST API to create a virtual machine deployment. However, I've got a problem when trying to specify an OS image in the following XML file:
<Deployment xmlns="http://schemas.microsoft.com/windowsazure" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>SomeName</Name>
<DeploymentSlot>Production</DeploymentSlot>
<Label></Label>
<RoleList>
<Role i:type="PersistentVMRole">
<RoleName>SomeName</RoleName>
<OsVersion i:nil="true"/>
<RoleType>PersistentVMRole</RoleType>
<ConfigurationSets>
<ConfigurationSet i:type="WindowsProvisioningConfigurationSet">
<ConfigurationSetType>WindowsProvisioningConfiguration</ConfigurationSetType>
<ComputerName>SomeName</ComputerName>
<AdminPassword>XXXXXXXXXX</AdminPassword>
<EnableAutomaticUpdates>true</EnableAutomaticUpdates>
<ResetPasswordOnFirstLogon>false</ResetPasswordOnFirstLogon>
</ConfigurationSet>
<ConfigurationSet i:type="NetworkConfigurationSet">
<ConfigurationSetType>NetworkConfiguration</ConfigurationSetType>
<InputEndpoints>
<InputEndpoint>
<LocalPort>3389</LocalPort>
<Name>RemoteDesktop</Name>
<Protocol>tcp</Protocol>
</InputEndpoint>
</InputEndpoints>
</ConfigurationSet>
</ConfigurationSets>
<DataVirtualHardDisks/>
<Label></Label>
<OSVirtualHardDisk>
<MediaLink>¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿???????????????</MediaLink>
<SourceImageName>¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿??????????????????</SourceImageName>
</OSVirtualHardDisk>
</Role>
</RoleList>
</Deployment>`
I need the MediaLink (URI of the OS image) and the SourceImageName (canonical name of the OS image). My question is: the web portal provides several predefined images, but I cannot determine their URIs and canonical names. Will I be forced to create my own OS image and upload it to one of the storage services under my Windows Azure account?
To get these parameters, you can perform the List OS Images Service Management API operation on your subscription.
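For instance, a minimal Java sketch of that call (the operation itself is language-agnostic), assuming the subscription's management certificate is available as a .pfx file and using placeholder paths and passwords, could look like this:
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.security.KeyStore;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

public class ListOsImages {
    public static void main(String[] args) throws Exception {
        String subscriptionId = "<subscription-id>";
        char[] pfxPassword = "<pfx-password>".toCharArray();

        // Load the management certificate (.pfx) that is uploaded to the subscription.
        KeyStore keyStore = KeyStore.getInstance("PKCS12");
        try (InputStream in = new FileInputStream("<path-to-management-certificate>.pfx")) {
            keyStore.load(in, pfxPassword);
        }
        KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(keyStore, pfxPassword);
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(kmf.getKeyManagers(), null, null);

        // List OS Images: GET /<subscription-id>/services/images
        URL url = new URL("https://management.core.windows.net/" + subscriptionId + "/services/images");
        HttpsURLConnection request = (HttpsURLConnection) url.openConnection();
        request.setSSLSocketFactory(sslContext.getSocketFactory());
        request.setRequestMethod("GET");
        request.setRequestProperty("x-ms-version", "2013-03-01");

        // The response XML contains one <OSImage> element per image; its <Name>
        // element is the canonical name to use in <SourceImageName>.
        try (InputStream response = request.getInputStream()) {
            System.out.println(new String(response.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}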
UPDATE
Please disregard some of my comments below (sorry about those). I was finally able to create a VM using the REST API :). Here are some of the things I found:
<MediaLink> element should specify the URL of the VHD off of which your VM will be created. It has to be a URL in one of your storage account in the same subscription as your virtual machine cloud service. So for this, you could specify a URL like: https://[yourstorageaccount].blob.core.windows.net/[blobcontainer]/[filename].vhd where [blobcontainer] would be the name of the blob container where you would want the API to store the VHD while the [filename] is any name that you want to give to your VHD. What REST API does is that it copies the source image specified in the <SourceImageName> and saves it at the URI specified in the <MediaLink> element.
Make sure that your Service and Storage Account where your VHD will be stored are in the same data center/affinity group. Furthermore that data center should be able to support Virtual Machines. It turns out that not all data centers support Virtual Machines.
Order of XML element is of utmost importance. You move one element up or down would result in 400 error.
Based on my experimentation, here's the code:
private static void CreateVirtualMachineDeployment(string subscriptionId, X509Certificate2 cert, string cloudServiceName)
{
    try
    {
        string uri = string.Format("https://management.core.windows.net/{0}/services/hostedservices/{1}/deployments", subscriptionId, cloudServiceName);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "POST";
        request.ContentType = "application/xml";
        request.Headers.Add("x-ms-version", "2013-03-01");
        request.ClientCertificates.Add(cert);
        string requestPayload = @"<Deployment xmlns=""http://schemas.microsoft.com/windowsazure"" xmlns:i=""http://www.w3.org/2001/XMLSchema-instance"">
  <Name>[SomeName]</Name>
  <DeploymentSlot>Production</DeploymentSlot>
  <Label>[SomeLabel]</Label>
  <RoleList>
    <Role i:type=""PersistentVMRole"">
      <RoleName>MyTestRole</RoleName>
      <OsVersion i:nil=""true""/>
      <RoleType>PersistentVMRole</RoleType>
      <ConfigurationSets>
        <ConfigurationSet i:type=""WindowsProvisioningConfigurationSet"">
          <ConfigurationSetType>WindowsProvisioningConfiguration</ConfigurationSetType>
          <ComputerName>[ComputerName]</ComputerName>
          <AdminPassword>[AdminPassword - Ensure it's strong Password]</AdminPassword>
          <AdminUsername>[Admin Username]</AdminUsername>
          <EnableAutomaticUpdates>true</EnableAutomaticUpdates>
          <ResetPasswordOnFirstLogon>false</ResetPasswordOnFirstLogon>
        </ConfigurationSet>
        <ConfigurationSet i:type=""NetworkConfigurationSet"">
          <ConfigurationSetType>NetworkConfiguration</ConfigurationSetType>
          <InputEndpoints>
            <InputEndpoint>
              <LocalPort>3389</LocalPort>
              <Name>RemoteDesktop</Name>
              <Protocol>tcp</Protocol>
            </InputEndpoint>
          </InputEndpoints>
        </ConfigurationSet>
      </ConfigurationSets>
      <DataVirtualHardDisks/>
      <Label></Label>
      <OSVirtualHardDisk>
        <MediaLink>https://[storageaccount].blob.core.windows.net/vhds/fb83b3509582419d99629ce476bcb5c8__Microsoft-SQL-Server-2012SP1-Web-CY13SU04-SQL11-SP1-CU3-11.0.3350.0.vhd</MediaLink>
        <SourceImageName>fb83b3509582419d99629ce476bcb5c8__Microsoft-SQL-Server-2012SP1-Web-CY13SU04-SQL11-SP1-CU3-11.0.3350.0</SourceImageName>
      </OSVirtualHardDisk>
    </Role>
  </RoleList>
</Deployment>";
        byte[] content = Encoding.UTF8.GetBytes(requestPayload);
        request.ContentLength = content.Length;
        using (var requestStream = request.GetRequestStream())
        {
            requestStream.Write(content, 0, content.Length);
        }
        using (HttpWebResponse resp = (HttpWebResponse)request.GetResponse())
        {
        }
    }
    catch (WebException webEx)
    {
        using (var streamReader = new StreamReader(webEx.Response.GetResponseStream()))
        {
            string result = streamReader.ReadToEnd();
            Console.WriteLine(result);
        }
    }
}
Hope this helps!
