I have a list of IP addresses, and I need to assign a country to each IP.
For example, http://www.ip2nation.com/ provides this service.
I have found some databases for IP2Country, but how do I integrate one with Pig?
Input:
14.59.63.28
145.89.87.211
54.27.253.89
98.201.50.22
116.48.29.143
145.89.87.211
20.109.204.65
20.109.204.65
Expected output:
14.59.63.28 country1
145.89.87.211 country2
54.27.253.89 country3
98.201.50.22 country4
116.48.29.143 country5
145.89.87.211 country2
20.109.204.65 country6
20.109.204.65 country6
You will need to get an extract of the IP and country name data from that database.
Then join that extracted data with the data you are streaming in.
A straightforward join will work. For better performance, you can look at replicated joins in Pig:
http://pig.apache.org/docs/r0.7.0/piglatin_ref1.html#Replicated+Joins
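For example, a minimal sketch in Pig (the file names and schemas are assumptions, and it presumes the extract has already been flattened to one row per exact IP; ip2nation actually stores IP ranges, so a plain equi-join may need to be replaced with range logic or a UDF):
ips    = LOAD 'ips.txt' AS (ip:chararray);
lookup = LOAD 'ip2country.tsv' USING PigStorage('\t') AS (ip:chararray, country:chararray);
-- 'replicated' loads the smaller relation (listed last) into memory on each mapper
joined = JOIN ips BY ip, lookup BY ip USING 'replicated';
result = FOREACH joined GENERATE ips::ip AS ip, lookup::country AS country;
STORE result INTO 'ip_with_country' USING PigStorage('\t');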
I have a csv file like this:
ProductId,CategoryId
1,1
1,2
1,3
2,1
2,2
2,3
...
... nearly 1 million records
From this csv, I want to produce another csv like below:
ProductId,CategoryId,ProductName,CategoryName
1,1,Shirt,Mens
1,2,Shirt,Boys
1,3,Shirt,Uni
2,1,Watch,Mens
2,2,Watch,Boys
2,3,Watch,Uni
To do this, I have set up a NiFi flow as below:
GetFile
-> LookupRecord
-> uses SimpleDatabaseLookupService to query DB table "PRODUCT" using ProductId to get ProductName
-> uses CSVRecordSetWriter to write the ProductName value back to the CSV
-> LookupRecord
-> uses SimpleDatabaseLookupService to query DB table "CATEGORY" using CategoryId to get CategoryName
-> uses CSVRecordSetWriter to write the CategoryName value back to the CSV
This works, at least on files that have about 40K lines. But as soon as I feed it the original file containing over a million records, NiFi just hangs.
So my question is:
Is there a way to optimize my flow so that it can work with such large data sets?
Note: there is a lot of repetition of ProductId and CategoryId in my file, and my current flow performs the lookups for every row. I was wondering if this fact can be leveraged to optimize the flow, but I couldn't figure out how. Any help will be greatly appreciated. Thanks.
I am working on a use case and need help improving scan performance.
Visits by customers to our website are captured as logs, which we process with Apache Pig every morning and insert directly into an HBase table (test) using HBaseStorage. The data consists of the following columns:
Customerid | Name | visitedurl | timestamp | location | companyname
I have only one column family (test_family)
As of now I generate a random number for each row and insert it as the row key for that table. For example, suppose I have the following data to be inserted into the table:
1725|xxx|www.something.com|127987834 | india |zzzz
1726|yyy|www.some.com|128389478 | UK | yyyy
Then I would use 1 as the row key for the first row, 2 for the second, and so on.
Note: the same Customerid can repeat across different days, which is why I chose a random number as the row key.
When I query the table with scan 'test', {FILTER=>"SingleColumnValueFilter('test_family','Customerid',=,'binary:1002')"} it takes more than 2 minutes to return the results.
Please suggest a way to bring this down to 1-2 seconds, since I am using it for real-time analytics.
Thanks
Based on the query you mention, I am assuming you need to look up records by Customer ID. If that is correct, then to improve performance you should use the Customer ID as the row key.
However, there can be multiple entries for a single Customer ID, so it is better to design the row key as CustomerID|unique number. The unique number could be the timestamp; it depends on your requirements.
To scan the data in this case, use a PrefixFilter on the row key. Because all rows for a customer are then stored contiguously, this gives much better performance than filtering the whole table.
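For example, if the new row keys look like 1002|128389478 (CustomerID|timestamp, as described above; the exact layout is up to you), a prefix scan from the HBase shell would be something like:
scan 'test', {FILTER => "PrefixFilter('1002|')"}
This only reads the contiguous block of rows starting with 1002|, instead of filtering every row in the table the way the SingleColumnValueFilter scan does.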
Hope this helps.
I want to perform behavioral analysis / anomaly detection in Splunk by comparing historical data (say, last month's) with today's data to find anomalies.
I am analyzing FTP logs, so, for example, I want a historical baseline/report of all users with their IPs/cities and login times.
An anomaly could be defined as the same user logging in from a different IP range/city or in a different time zone.
The commands anomalies, anomalousvalue, and analyzefields are available in Splunk, but they typically work on the time range of the searched data and do not compare against a user's historical data the way we want.
How can I achieve this in Splunk?
You can do it by running two searches and then joining them together:
Start by getting the current data and putting it in a simple table: search | table username ip city time_zone
Prepare the second search over the historical window and rename the fields (except username) so they have different names: second_search earliest=-2mon@mon latest=-1mon@mon | table username ip city time_zone | rename ip as old_ip | rename city as old_city ...
Join the searches together: search | join username [ search second_search ]
Now you can compare the new and historical fields per user and flag the ones that differ.
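Put together, the whole thing would look roughly like this (the index name ftp and the field names are assumptions based on the description above):
index=ftp | table username ip city time_zone
| join username [ search index=ftp earliest=-2mon@mon latest=-1mon@mon | table username ip city time_zone | rename ip AS old_ip, city AS old_city, time_zone AS old_time_zone ]
| where ip != old_ip OR city != old_city OR time_zone != old_time_zone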
Hope it is helpful.
I am crawling several websites, say site1, site2, ..., site100. I use a list of proxy IPs to crawl them, say ip1, ip2, ..., ip10. Whenever I crawl a page from a site, say site5, I call a function getProxyFor(site5) that gives me the proxy IP I should use to request a page from site5. getProxyFor returns the proxy IP that has been used for site5 the least number of times (I can add extra conditions, like how old the proxy is or how many total successful requests, across all sites, it has sent). So the basic problem is:
From a list of items where each has a few properties, I want to choose
an item by querying on one or more of its properties
I could store all this data in an RDBMS like:
IP | Site | Count
ip1 | s1 | 5
ip1 | s2 | 9
ip2 | s2 | 1
and run a query like select ip from table order by count limit 1 (I could use limit 5 and then check those 5 IPs for age and other criteria). But what if I don't want to use a SQL database? What data structure should I use to query such data efficiently?
I'd use Redis for this type of functionality. Specifically, Redis has sorted sets, which buy you the ability to get the IP used the least number of times (assuming you use the number of times an IP was used as the score of each member).
If you have a more complex set of criteria to use to determine which IP gets used next, you could compute the score of the key by making it a function of the criteria (assuming that the output of the function is a number). E.g. score = f(last_time_used, latency, number_of_times_used)
Also, Redis keeps its data in memory, so it is extremely fast compared to a SQL database.
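A minimal sketch of the sorted-set approach with redis-cli (the key name proxies:site5 is made up; one sorted set per site, with the score being the number of times each IP has been used for that site):
ZINCRBY proxies:site5 1 ip3 (bump ip3's usage count for site5 after using it)
ZRANGE proxies:site5 0 0 (sorted sets are ordered by ascending score, so this returns the least-used IP)
ZRANGE proxies:site5 0 4 WITHSCORES (the bottom five, if you want to apply extra checks such as proxy age or success rate)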
I am working on a Pig script that performs heavy-duty data processing on raw transactions and comes up with various transaction patterns.
Say one of the patterns is: find all accounts that received cross-border transactions in a day (with the total count and amount of those transactions).
My expected output should be two data files
1) Rollup data - like account A1 received 50 transactions from country AU.
2) Raw transactions - all 50 of the above transactions for A1.
My Pig script currently creates the output in the following format:
Account Country TotalTxns RawTransactions
A1 AU 50 [(Txn1), (Txn2), (Txn3)....(Txn50)]
A2 JP 30 [(Txn1), (Txn2)....(Txn30)]
Now the question is: when I get this data out of the Hadoop system (into some DB), I want to establish a link between my rollup record (A1, AU, 50) and all 50 raw transactions (e.g., ID 1 for the rollup record used as a foreign key for all 50 associated Txns).
I understand that Hadoop, being distributed, should not be used for assigning sequential IDs, but are there any options where I can assign IDs (they do not need to be sequential), or some other way to link this data?
EDIT (after using Enumerate from DataFu)
Here is the Pig script:
register /UDF/datafu-0.0.8.jar
define Enumerate datafu.pig.bags.Enumerate('1');
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;
After running this, I am getting:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
....
I often assign random ids in Hadoop jobs. You just need to make sure you generate ids with enough random bits that the probability of collisions is sufficiently small (http://en.wikipedia.org/wiki/Birthday_problem).
As a rule of thumb I use 3*log(n) random bits (log base 2), where n = # of ids that need to be generated; for example, for n = 10^9 ids that is roughly 3 × 30 = 90 bits.
In many cases Java's UUID.randomUUID() will be sufficient.
http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
What is unique in your rows? It appears that account ID and country code are what you have grouped by in your Pig script, so why not make a composite key with those? Something like
CONCAT(CONCAT(account, '-'), country)
Of course, you could write a UDF to make this more elegant. If you need a numeric ID, try writing a UDF which will create the string as above, and then call its hashCode() method. This will not guarantee uniqueness of course, but you said that was all right. You can always construct your own method of translating a string to an integer that is unique.
But that said, why do you need a single ID key? If you want to join the fields of two tables later, you can join on more than one field at a time.
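If you do go with the composite key, a minimal Pig sketch would be something like this (the relation and field names are assumptions matching the rollup layout above):
rollup_with_id = FOREACH rollup GENERATE
                     CONCAT(CONCAT(account, '-'), country) AS rollup_id,
                     account, country, total_txns, raw_txns;
Emit the same CONCAT expression on the flattened raw transactions and the DB can then join the two outputs on rollup_id.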
DataFu had a bug in Enumerate which was fixed in 0.0.9, so use 0.0.9 or later.
This is for the case when your IDs are numbers and you cannot use UUIDs or other string-based IDs.
There is a library of Pig UDFs from LinkedIn (DataFu) with a very useful UDF, Enumerate. What you can do is group all the records into a single bag and pass that bag to Enumerate. Here is the code off the top of my head:
-- register the DataFu jar (0.0.9 or later, see the note above) and define the Enumerate UDF
register /UDF/datafu-0.0.9.jar;
define Enumerate datafu.pig.bags.Enumerate('1');
inpt = load '....' ....;
allGrp = group inpt all;
withIds = foreach allGrp generate flatten(Enumerate(inpt));