Fixing missing data in Hadoop

I have a question which, according to my understanding, is more about theory.
I have a job running in Hadoop; basically, it pulls all customer information from every application into the company's database. The job runs daily, and the result is a master table with over 200 columns.
customer_id   active_status   dayid
1             0               20221230
2             1               20230101
From Jan 01st to Jan 03rd, data in the column active_status was missing. One of my stakeholders needs the data on Jan 02nd for the reports. My boss said there were 2 options:
Copy the data of Jan 02 into a new table, with the values in column A replaced by the data of Jan 04
Fix the master table. She said that in Hadoop, tables cannot be updated the way they can in a SQL database, and that I would need to remove the file in HDFS, add a new file, and load the data into the master table.
This is not the first time I have heard that tables cannot be updated in Hadoop. I have read that Hadoop works on a write-once, read-many model.
However, I also know there is the statement INSERT OVERWRITE, which is used to replace existing data with new rows.
So, how should I understand this? For the second option recommended by my boss, will only the file related to dayid 20230102 be removed from HDFS, or will the whole file (covering all dayid values) be removed?
I'm a BA and quite new to big data, so I hope you can shed more light on this.
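If the master table is partitioned by dayid, Hive's INSERT OVERWRITE can replace just the files under the affected partition and leave every other dayid untouched. Below is a minimal sketch under that assumption; the table name master_customer and the staging table staging_customer_fix are hypothetical, and the statement is driven through the Hive CLI from Python.

import subprocess

# Hedged sketch: assumes the master table is partitioned by dayid. An
# INSERT OVERWRITE on a single static partition rewrites only the HDFS
# files under that partition directory; other dayid partitions stay as-is.
repair_query = """
INSERT OVERWRITE TABLE master_customer PARTITION (dayid=20230102)
SELECT customer_id, active_status   -- plus the remaining ~200 columns
FROM   staging_customer_fix         -- hypothetical staging table holding the corrected rows
WHERE  dayid = 20230102;
"""

# hive -e executes a quoted HiveQL string from the command line.
subprocess.run(["hive", "-e", repair_query], check=True)

If the table is not partitioned by dayid, an INSERT OVERWRITE rewrites the whole table, which matches the "remove the file and reload" description above.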

Related

Approach to upload multiple interconnected csv files to HBase

I am new to HBase and still not sure which component of the Hadoop ecosystem I should use in my case, or how to analyse my data later, so I am just exploring options.
I have an Excel sheet with a summary of all the customers, like this but with ≈ 400 columns:
CustomerID  Country  Age  E-mail
251648      Russia   27   boo#yahoo.com
487985      USA      30   foo#yahoo.com
478945      England  15   lala#yahoo.com
789456      USA      25   nana#yahoo.com
Also, I have .xls files created separately for each customer with information about them (one customer = one .xls file); the number and names of the columns are the same in each file. Each of these files is named with a CustomerID. One looks like this:
'customerID_251648.xls':
feature1  feature2  feature3  feature4
0         33,878    yes       789,598
1         48,457    yes       879,594
1         78,495    yes       487,457
0         94,589    no        787,475
I have converted all these files into .csv format and am now stuck on which component of the Hadoop ecosystem I should use for storing and querying such data.
My eventual goal is to query some customerID and get all the information about that customer from all the files.
I think that HBase fits perfectly for that because I can create a schema like this:
row key   timestamp   Column Family 1         Column Family 2
251648                Country, Age, E-Mail    Feature1, Feature2, Feature3, Feature4
What is the best approach to upload and query such data in HBase? Should I first combine the information about a customer from the different sources and then upload it to HBase? Or can I keep separate .csv files for each customer and, when uploading to HBase, somehow choose which .csv to use for forming the column families?
For querying the data stored in HBase I am going to write MapReduce tasks via the Python API.
Any help would be very appreciated!
You are correct with the schema design. Also remember that HBase reads the whole column family during scans, so if you need all the data at once it may be better to place everything in one column family.
A simple way to load the data is to scan the first file with the customers and fetch the data from the second file on the fly (sketched below); a bulk CSV load could be faster in execution time, but you'll spend more time writing code.
You may also want to think about the row key, because HBase stores rows in lexicographic order. If you have a lot of data, it is better to create the table with explicit split keys rather than let HBase do the splits, because that can end up with unbalanced regions.
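Since the suggestion above is to scan the customer summary file and fetch each detail file on the fly, here is a hedged sketch of that approach in Python using the happybase Thrift client (an assumption; the question only mentions a Python API). The file names, the table name and the single column family are illustrative.

import csv
import happybase  # HBase Thrift client for Python (assumed to be installed)

# Hedged sketch: one column family ("info") holding both the summary columns
# and the per-customer features, as recommended above.
conn = happybase.Connection("localhost")          # host of the HBase Thrift server
conn.create_table("customers", {"info": dict()})  # run once
table = conn.table("customers")

with open("summary.csv", newline="") as f, table.batch(batch_size=1000) as batch:
    for row in csv.DictReader(f):
        customer_id = row["CustomerID"]
        # Summary columns (Country, Age, E-mail, ...) go straight into "info".
        cells = {f"info:{k}".encode(): v.encode() for k, v in row.items()}
        # Fetch the per-customer detail file on the fly, if it exists.
        try:
            with open(f"customerID_{customer_id}.csv", newline="") as detail:
                for i, feat in enumerate(csv.DictReader(detail)):
                    # Detail files hold several rows, so suffix each column with the row index.
                    for k, v in feat.items():
                        cells[f"info:{k}_{i}".encode()] = v.encode()
        except FileNotFoundError:
            pass  # no detail file for this customer
        batch.put(customer_id.encode(), cells)

Because the detail files contain several rows per customer, the sketch flattens them by suffixing each feature column with its row index; whether that layout suits the later queries is a design choice worth checking.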

Apache Sqoop Incremental import

I understand that Sqoop offers a couple of modes to handle incremental imports:
Append mode
lastmodified mode
Questions on Append mode:
Is append mode supported only when the check column is of an integer data type? What if I want to use a date or a timestamp column, but I still want to only append to the data already in HDFS?
Does this mode mean that the new data is appended to the existing HDFS file, or that only the new data is picked from the source DB, or both?
Let's say that the check column is an id column in the source table, and there is already a row in the table where the id is 100. The Sqoop import is run in append mode with last-value 50, so it imports all rows where id > 50. It is then run again with last-value 150, but by this time the row whose id was 100 has been updated to 200. Would this row also be pulled?
Example: let's say there is a table called customers with one of the records as follows (the first column is the id):
100 abc xyz 5000
When the Sqoop job is run in append mode with last-value 50 for the id column, it would pull the above record.
Now the same record is changed and the id also gets changed (a hypothetical example) as follows:
200 abc xyz 6000
The question is: if you run the Sqoop command again, would it pull the above record as well?
Questions on lastmodified mode:
It looks like running Sqoop in this mode merges the existing data with the new data using 2 MR jobs internally. What is the column that Sqoop uses to compare the old and the new data for the merge process?
Can the user specify the column for the merge process?
Can more than one column be provided to be used for the merge process?
Should the target-dir already exist for the merge process to happen, so that Sqoop treats the existing target dir as the old dataset? Otherwise, how would Sqoop know what the old dataset to be merged is?
Answers for append mode:
Yes, it needs to be an integer.
Both.
The question is not clear.
Answers for lastmodified mode:
Incremental load does not merge data with lastmodified; it is primarily used to pull updated and inserted data using a timestamp.
The merge process is completely different. Once you have both the old data and the new data, you can merge the new data onto the old data into a different directory. You can see a detailed explanation here.
The merge process works with only one field.
target-dir should not exist. The video covers the complete merge process.
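For concreteness, here is a hedged sketch of the two incremental modes discussed above, launched from Python; the JDBC URL, table, target directory and column names are illustrative assumptions.

import subprocess

# Append mode: only rows whose check column exceeds --last-value are pulled
# and appended as new files under the target directory.
append_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source database
    "--table", "customers",
    "--target-dir", "/data/customers",
    "--incremental", "append",
    "--check-column", "id",
    "--last-value", "50",
]

# lastmodified mode: rows whose timestamp column is newer than --last-value are
# pulled; --merge-key names the single column used to reconcile updated rows.
lastmodified_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",
    "--table", "customers",
    "--target-dir", "/data/customers",
    "--incremental", "lastmodified",
    "--check-column", "last_upd_ts",            # hypothetical timestamp column
    "--last-value", "2015-09-01 00:00:00",
    "--merge-key", "id",
]

subprocess.run(append_cmd, check=True)
subprocess.run(lastmodified_cmd, check=True)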

hive query performance is bad

I am joining 3 huge tables (billion-row tables) in Hive. All the statistics are collected, but the performance is still very bad (the query takes around 40 minutes).
Is there any parameter I can set at the Hive prompt to get better performance?
When I run the query I see info like:
Sep 4, 2015 7:40:23 AM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 1
Sep 4, 2015 7:40:23 AM INFO: parquet.hadoop.ParquetFileReader: reading another 1 footers
All the tables are created in BigSql with the storage parameter "STORED AS PARQUETFILE".
How can I suppress the job progress details while a Hive query is running?
Regarding the Hive version:
hive> set system:sun.java.command;
system:sun.java.command=org.apache.hadoop.util.RunJar /opt/ibm/biginsights/hive/lib/hive-cli-0.12.0.jar org.apache.hadoop.hive.cli.CliDriver -hiveconf hive.aux.jars.path=file:///opt/ibm/biginsights/hive/lib/hive-hbase-handler-0.12.0.jar,file:///opt/ibm/biginsights/hive/lib/hive-contrib-0.12.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-client-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-common-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-hadoop2-compat-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-prefix-tree-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-protocol-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-server-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/htrace-core-2.01.jar,file:///opt/ibm/biginsights/hive/lib/zookeeper-3.4.5.jar,file:///opt/ibm/biginsights/sheets/libext/piggybank.jar,file:///opt/ibm/biginsights/sheets/libext/pig-0.11.1.jar,file:///opt/ibm/biginsights/sheets/libext/avro-1.7.4.jar,file:///opt/ibm/biginsights/sheets/libext/opencsv-1.8.jar,file:///opt/ibm/biginsights/sheets/libext/json-simple-1.1.jar,file:///opt/ibm/biginsights/sheets/libext/joda-time-1.6.jar,file:///opt/ibm/biginsights/sheets/libext/bigsheets.jar,file:///opt/ibm/biginsights/sheets/libext/bigsheets-serdes-1.0.0.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-column-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-common-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-encoding-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-generator-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-hadoop-bundle-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-hive-bundle-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-thrift-1.3.2.jar,file:///opt/ibm/biginsights/hive/lib/guava-11.0.2.jar
Koushik - this question I asked a month back will give you good insight into the performance of ORC vs Parquet.
Let me ask this: what is the structure of your data? Is it nested or flat? If it is flat data, for example data ingested from an RDBMS, ORC is better since it has lightweight indexes stored alongside the data, which makes data retrieval faster.
Hope this helps
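To make the two practical questions above concrete, here is a hedged sketch: hive -S runs the CLI in silent mode, which suppresses the job-progress chatter, and the SET statements are common join-tuning options available in Hive 0.12. The table and column names are made up for illustration.

import subprocess

# Hedged sketch of running a tuned three-way join through the Hive CLI.
tuned_query = """
SET hive.auto.convert.join=true;           -- convert to map joins where one side is small enough
SET hive.exec.parallel=true;               -- run independent stages in parallel
SET hive.exec.compress.intermediate=true;  -- compress data between MR stages
SELECT a.id, b.col1, c.col2                -- hypothetical tables and columns
FROM   table_a a
JOIN   table_b b ON a.id = b.id
JOIN   table_c c ON a.id = c.id;
"""

# -S (silent mode) suppresses progress output; -e executes the quoted string.
subprocess.run(["hive", "-S", "-e", tuned_query], check=True)

Whether map joins help here depends on whether any of the three tables fits in memory; with three billion-row tables, the storage format and join keys usually matter more, which is what the ORC vs Parquet point above is about.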

Real time database in Hadoop ecosystem

Pardon me if this is a silly question.
I have Cloudera Manager installed on a single node.
I am trying to use HBase and Hadoop for logging requests and responses in my web application.
I am trying to list the latest user activity using the log.
Rows are added using the table structure below:
1 column family, a RowId, and 11 columns. I store every value as a string. Fairly simple and similar to a MySQL table.
RowId
entry:addedTime
entry:value
entry:ip
entry:accessToken
entry:identifier
entry:userId
entry:productId
entry:object
entry:requestHeader
entry:completeDate
entry:tag
Now, in order to get rows from HBase, I use:
SingleColumnValueFilter("entry", "userId", "=", binary:"25", true, true)
Now, I am struggling to order this by
entry:completeDate DESCENDING
and limit it to 25 rows for pagination or infinite scroll.
My questions:
Is HBase the only real-time query database available in the Hadoop ecosystem?
Am I using HBase for the wrong reasons? Is my table structure correct?
I work in a startup and these are our baby steps toward moving to big data. Though big data has created a lot of hype, Hadoop is poorly supported on the latest Linux releases and looks too complicated.
Any help or suggestions would be appreciated.
Many thanks,
Karthik
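As a hedged sketch of the ordering problem above: HBase only keeps rows sorted by row key, so the usual trick is to bake a reversed timestamp into the key so that a plain prefix scan returns the newest entries first, then cap the scan at 25 rows. The example below uses the happybase Python client (an assumption) and a hypothetical activity_log table with the entry column family described above.

import time
import happybase  # HBase Thrift client for Python (assumed)

LONG_MAX = 2**63 - 1

def activity_row_key(user_id: str, complete_millis: int) -> bytes:
    # Reversed timestamp: newer events sort first under the same userId prefix.
    return f"{user_id}#{LONG_MAX - complete_millis:019d}".encode()

conn = happybase.Connection("localhost")
table = conn.table("activity_log")  # hypothetical table with the "entry" family

# Write: the row key carries userId plus the reversed completeDate.
now_ms = int(time.time() * 1000)
table.put(activity_row_key("25", now_ms),
          {b"entry:userId": b"25", b"entry:completeDate": str(now_ms).encode()})

# Read: the latest 25 entries for user 25, newest first, no SingleColumnValueFilter needed.
latest_page = list(table.scan(row_prefix=b"25#", limit=25))

This keeps filters out of the read path entirely; scanning with SingleColumnValueFilter, as above, forces HBase to examine every row in the table, which does not scale well for "latest activity" queries.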

Informatica Workflow shows different number of records in transformation and target

Hi, I am an Informatica newbie trying to train myself in PowerCenter workflows.
When I look at the last session runs of many of the workflows in the Workflow Monitor, I see that the number of records picked up by the transformation is different from the number of records that get updated or inserted in the target table.
In the image below, for example, my SQL transformation picks up 80,742 rows from the source table, but only 29,813 rows get loaded into the target table.
[image of Informatica Workflow Monitor]
On further analysis of the workflow log file, I can see it loaded both inserted and updated records:
WRT_8036 Target: W_SALES_ORDER_LINE_F (Instance Name: [W_SALES_ORDER_LINE_F])
WRT_8038 Inserted rows - Requested: 15284  Applied: 15284  Rejected: 0  Affected: 15284
WRT_8041 Updated rows - Requested: 14529  Applied: 14529  Rejected: 0  Affected: 14529
WRITER_1_*_1> WRT_8035 Load complete time: Wed Mar 19 04:41:24 2014
I am not able to figure out why the workflows load fewer records than the source SQL returns, and I would really appreciate some help in this matter.
Thanks,
Matt
This happens when there is a join in the ETL and the column over which we join has duplicate values.
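For what it is worth, the writer log above already reconciles the target side: inserted plus updated rows add up to the 29,813 shown in the monitor, so the missing rows never reached the writer. A quick check of the arithmetic:

# Row counts taken from the session log and monitor screenshot above.
source_rows = 80_742
inserted = 15_284
updated = 14_529

loaded = inserted + updated       # 29,813: matches the target count in the monitor
dropped = source_rows - loaded    # 50,929 rows lost somewhere between the source qualifier and the writer
print(loaded, dropped)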
