BigQuery UTF-8 problems

I am using google-api-services-bigquery in Java to load data from JSON files stored in Google Cloud Storage into BigQuery.
Everything was OK with this configuration:
Job job = new Job();
JobConfiguration config = new JobConfiguration();
JobConfigurationLoad configLoad = new JobConfigurationLoad();
configLoad.setSchema(schema);
configLoad.setDestinationTable(destTable);
configLoad.setEncoding(StringConstants.UTF_8);
configLoad.setCreateDisposition("CREATE_IF_NEEDED");
configLoad.setWriteDisposition("WRITE_APPEND");
configLoad.setSourceFormat("NEWLINE_DELIMITED_JSON");
configLoad.setAllowQuotedNewlines(false);
configLoad.setSourceUris(gcsPaths);
config.setLoad(configLoad);
job.setConfiguration(config);
But since about 2014-01-30 12:00:00 GMT, Russian characters in the JSON values have been replaced by question marks. The application runs as a daemon and was not even restarted at that moment, so I think the issue is caused by some change on the BigQuery side. Does BigQuery now use Latin-1?
Does anyone know how I can solve this?

Related

Copied the XML file from the old NiFi and uploaded it to the new one, but it doesn't work

I am moving my existing data flow processes to a new server, and I am also upgrading from NiFi 1.16.2 to NiFi 1.19.1. I copied the flow.xml.gz file from my old NiFi to the new NiFi and uploaded it, but I get the error:
2023-02-20 15:13:27,155 WARN [NiFi Web Server-36] o.a.nifi.web.api.ProcessGroupResource An error occurred while importing a template.
org.apache.nifi.xml.processing.ProcessingException: Reader creation failed
Can anyone help me figure out what I did wrong? Any advice would be appreciated.

How do you use s3a with Spark 2.1.0 on AWS us-east-2?

Background
I have been working on a flexible setup for running Spark on AWS with Docker swarm mode. The Docker image I have been using is configured to use the latest Spark, which at the time of writing is 2.1.0 with Hadoop 2.7.3, and is available as jupyter/pyspark-notebook.
This is working, and I have been testing the various connectivity paths I plan to use. The issue I ran into is uncertainty about the correct way to interact with S3. I have followed the trail on how to provide the dependencies Spark needs to read data on AWS S3 using the s3a protocol rather than s3n.
I eventually came across the Hadoop AWS guide and thought I was following how to provide the configuration. However, I was still receiving a 400 Bad Request error, as in the related question that describes fixing it by defining the endpoint, which I had already done.
Being on us-east-2 put me far enough outside the standard configuration that I was unsure whether I had a problem with the JAR files. To eliminate the region as a variable, I set things back up in the regular us-east-1 region and was finally able to connect with s3a. So I have narrowed the problem down to the region, even though I thought I was doing everything required to operate in the other region.
Question
What is the correct way to set the Hadoop configuration variables in Spark so that it works against us-east-2?
Note: This example uses local execution mode to simplify things.
import os
import pyspark
In the notebook console I can see these packages download after the context is created; adding them took me from being completely broken to getting the Bad Request error.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
conf = pyspark.SparkConf().setMaster('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
For the AWS config, I tried both the method below and just using the conf above with the equivalent conf.set(spark.hadoop.fs.<config_string>, <config_value>) calls; the only difference is that this way the values are set on conf before the Spark context is created (a sketch of that variant follows the block below).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
One thing to note is that I also tried the alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.
I then read some parquet data from S3.
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()
Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. It seems like the endpoint config isn't being used for some reason.
us-east-2 is a V4-auth S3 instance, so, as you attempted, the fs.s3a.endpoint value must be set.
If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it's lacking auth details.
Some tactics
Set the value in spark-defaults.
Using the config you've just created, try to load the filesystem explicitly: a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf) will return the filesystem bound to that config (unless one is already cached). I don't know how to make that call in .py, though; a rough py4j sketch follows this list.
Set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get call.
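A rough, untested sketch of that explicit FileSystem.get call from PySpark (going through py4j and reusing the hadoopConfiguration() object from the question) might look like this:
# force a fresh filesystem instance instead of the cached one
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl.disable.cache", "true")
jvm = sc._jvm
uri = jvm.java.net.URI("s3a://bucket-name/parquet-data-name")
fs = jvm.org.apache.hadoop.fs.FileSystem.get(uri, hadoop_conf)  # filesystem bound to this config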
Adding more diagnostics on bad-auth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in.

Big data testing approach

I am working on a big data project.
The basic flow of the project is the following:
- Data comes from the mainframe and is stored in Cornerstone 3.0.
- After that, the data is ingested into Hive using a scheduler.
- Then it is stored in MapR-DB as key-value pairs by a MapReduce job (running Hive queries to get specific aggregated attributes), and surfaced to the application through a REST API.
I want to test this application from Hive through to the REST API, assuming the data in Hive is loaded correctly.
What is the best approach to test this application (objects to be tested: Hive data, Hive queries, MapR-DB performance, MapR-DB data, REST API)? What are the best tools and technologies to use?
Thank you in advance.
What can be tested? This is explained by the requirements/question itself:
Data coming from the mainframe and stored in Cornerstone 3.0: validate that the data is stored as expected (based on the requirements) from the mainframe to Cornerstone.
Data ingested into Hive using a scheduler: verify that the Hive tables have the data, HDFS file locations, etc. as expected (as per the requirements; if any transformation happens during the Hive table load, you will be validating that).
Data stored in MapR-DB as key-value pairs by the MapReduce job (running Hive queries to get specific aggregated attributes) and exposed to the application through the REST API: here you are basically testing the MapReduce job that loads/transforms data into MapR-DB. Run the job first, verify it runs end to end with no errors/warnings (note the execution time to verify the job's performance), then validate MapR-DB, then test the REST API app and verify the expected results based on the requirements.
What are the best tools and technologies to use?
For Hive/HDFS/data validation, I would create a shell script (covering Hive queries, HDFS file location checks, log file validation, running the MapReduce job, validating the MapReduce job, etc.) that tests/verifies each step described above. One should start with manual CLI commands first to get going with testing.
For testing the REST API, there are many tools available, e.g. ReadyAPI or Postman. I would include this step in the shell script too (using curl); a rough sketch of such an end-to-end check follows.
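As a rough illustration, sketched here in Python rather than shell, such a check could compare a Hive aggregate with what the REST API returns; the table name, host, and "count" field below are placeholders, not taken from the question:
import json
import subprocess
import urllib.request

def hive_count(table):
    # run a Hive query from the CLI and return its single-value result
    out = subprocess.check_output(["hive", "-S", "-e", "SELECT COUNT(*) FROM " + table])
    return int(out.decode().strip())

def api_count(url):
    # call the REST API (the curl step above) and read a count field from the JSON body
    with urllib.request.urlopen(url) as resp:
        return int(json.loads(resp.read())["count"])

expected = hive_count("aggregated_attributes")
actual = api_count("http://app-host:8080/api/v1/attributes/count")
assert expected == actual, "mismatch: hive=%d api=%d" % (expected, actual)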

Parse: reset the Cloud Code deploy version number

I have been working on a test app for a few days, using Parse to store data. I have deployed Cloud Code a lot of times since I am a beginner with Parse, but now the version number has gone above 200. Is there a way to reset that version number back to the beginning?
I don't want to start a new app in Parse and import all the data again.

elasticsearch-hadoop 1.3 M3 proxy support not working

I am a beginner with Elasticsearch and Hadoop. I am having a problem moving data from HDFS into an Elasticsearch server using es.net.proxy.http.host with credentials. The server is secured with credentials through an nginx proxy configuration, but when I try to move data using a Pig script, it throws a NullPointerException.
My Pig script is:
REGISTER elasticsearch-hadoop-1.3.0.M3/dist/elasticsearch-hadoop-1.3.0.M3.jar
A = load 'date' using PigStorage() as (date:datetime);
store A into 'doc/id' using org.elasticsearch.hadoop.pig.EsStorage('es.net.proxy.http.host=ipaddress','es.net.proxy.http.port=portnumber','es.net.proxy.http.user=username','es.net.proxy.http.pass=password');
I don't understand where the problem is in my script. Can anyone please help me?
Thanks in advance.
I faced this type of problem. elasticsearch-hadoop-1.3.0.M3.jar does not seem to support the proxy and authentication settings. Try the elasticsearch-hadoop-1.3.0.BUILD-SNAPSHOT.jar file instead. But I still couldn't move object data like Tuple to the production server with authentication.
Thank you.
