gensim tfidf: number of unique tokens vs. number of features

I'm wondering why the number of features isn't the same as the number of unique tokens; in my case, they differ by one (1236 vs. 1235):
2018-06-19 04:54:45,158 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-19 04:54:45,182 : INFO : built Dictionary(1236 unique tokens: ['.', ':', .....]...) from 98 documents (total 10007 corpus positions)
2018-06-19 04:54:45,214 : INFO : collecting document frequencies
2018-06-19 04:54:45,215 : INFO : PROGRESS: processing document #0
2018-06-19 04:54:45,219 : INFO : calculating IDF weights for 98 documents and 1235 features (6993 matrix non-zeros)
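For reference, here is a minimal sketch of the kind of setup that produces logs like these (the documents and tokenizer below are stand-ins, not the original code):

import logging

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

# match the log format shown above
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

# toy stand-in for the 98 tokenized documents
texts = [
    ["human", "machine", "interface", "."],
    ["survey", "of", "user", "opinion", ":"],
    ["the", "eps", "user", "interface", "."],
]

dictionary = Dictionary(texts)                    # "built Dictionary(N unique tokens ...)"
bow = [dictionary.doc2bow(doc) for doc in texts]
model = TfidfModel(bow)                           # "calculating IDF weights for ... features ..."

# note: token ids are 0-based, so N unique tokens carry ids 0 .. N-1
print(len(dictionary), max(dictionary.keys()))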

Related

Log query to get a value from a log message and get the average

I have some log messages in my Loki:
2023-02-13 12:20:08.675 INFO 30937 --- [lettuce-epollEventLoop-5-1] c.g.poc.Filter.AuthenticationFilter : [ requestId : 904c1292-66, AuthFilterTime : 15 ms ]
2023-02-13 12:16:32.100 INFO 30937 --- [lettuce-epollEventLoop-5-1] c.g.poc.Filter.AuthenticationFilter : [ requestId : f84a572f-65, AuthFilterTime : 4 ms ]
2023-02-13 12:16:31.427 INFO 30937 --- [lettuce-epollEventLoop-5-1] c.g.poc.Filter.AuthenticationFilter : [ requestId : 904c1292-64, AuthFilterTime : 10 ms ]
I want to get the average value of AuthFilterTime.
I'm getting the error
"parse error at line 5, col 6: syntax error: unexpected NUMBER"
when I run this query:
sum by (filename)(
avg_over_time(
{filename="/path/to/the/log/file"} |= "AuthFilterTime.*ms" |
regexp `AuthFilterTime\s*:\s*(\d+) ms` |
$1
)[24h]
)
Can somebody tell me what I'm doing wrong? I'm new to Grafana.
try this:
avg_over_time(
  {filename="/path/to/the/log/file"}
    |= "AuthFilterTime"
    | regexp `AuthFilterTime\s*:\s*(?P<auth_filter_time>\d+) ms`
    | unwrap auth_filter_time
    [24h]
)
In LogQL the regexp stage needs a named capture group (there is no $1 back-reference syntax), and the captured value becomes a label on each log line. The unwrap stage then converts that label from a string to a number so that avg_over_time can average it over the 24-hour window.
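If you want to keep the per-file grouping from your original query, unwrapped range aggregations accept a grouping clause directly, e.g. avg_over_time(... [24h]) by (filename).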

Elasticsearch random slow queries troubleshooting

I moved my data from one Elasticsearch cluster to another with more powerful hardware
(4 nodes / 2 CPUs and 8 GB RAM each, with 4 GB of JVM heap per machine; the old cluster had 3 nodes with 1 CPU each and 2 GB of JVM heap per machine),
but I am randomly experiencing some very slow query responses that I didn't have on the old cluster.
Both clusters run the same ES version (6.8.14) and hold the same number of documents and shards
(62 GB of data / 106 million documents).
Launching the queries from Kibana with the profiler on the new cluster shows that most of the time is spent in this phase:
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time_in_nanos" : 5777066437,
"children" : [
{
"name" : "MultiCollector",
"reason" : "search_multi",
"time_in_nanos" : 5756203807,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 747674917
},
{
"name" : "MultiBucketCollector: [[min_price, max_price, in_stock, out_of_stock, category, agg_h, mdata, agg_att, agg_f]]",
"reason" : "aggregation",
"time_in_nanos" : 4966026553
}
]
}
]
}
]
CPU usage per node is pretty low (5-10%) and load average is OK (0.50).
Launching the same query again shortly afterwards lowers the response time from 8 s to 0.4 s (due to ES caching, I guess), and removing the "aggs" part from the query seems to fix the issue, so the poor performance is actually in the aggregation phase.
Still, I do not understand why this slowdown occurs only on the new, "better" cluster. How can I optimize performance, or troubleshoot this further?
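For what it's worth, profiler output like the above comes from running the search with "profile": true. A minimal sketch with the low-level Python client (the index name, query, and aggregation here are placeholders, not the original request):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# "profile": true makes ES return the per-collector timing tree shown above
resp = es.search(
    index="products",  # placeholder index name
    body={
        "profile": True,
        "query": {"match_all": {}},
        "aggs": {"category": {"terms": {"field": "category"}}},
    },
)

# each shard reports a "searches" list whose "collector" entry holds the timings
for shard in resp["profile"]["shards"]:
    for search in shard["searches"]:
        print(search["collector"])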

Maximum value of a column in Apache Pig

I am trying to find the maximum value of the column ratingTime using Pig. I am running the script below:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userid:int,movieID:int,rating:int, ratingTime:int);
maxrating = MAX(ratings.ratingTime);
DUMP maxrating
Sample input data:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
I am getting the error below:
2018-08-05 07:02:05,247 [main] INFO org.apache.pig.backend.hadoop.PigATSClient - Created ATS Hook
2018-08-05 07:02:05,914 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. <file script.pi
You need a preceding GROUP ALL before applying MAX:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' USING PigStorage('\t') AS (userid:int,movieID:int,rating:int, ratingTime:int);
ratings_group = GROUP ratings ALL;
maxrating = FOREACH ratings_group GENERATE MAX(ratings.ratingTime);
DUMP maxrating;
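GROUP ratings ALL collapses the whole relation into a single group (whose key is the literal 'all'), so MAX sees one bag containing every ratingTime value; the DUMP then emits a single tuple holding the maximum.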

Pig Latin distinguishing Map or Reduce queries

I have the following data sample:
AGE,EDU,SEX,SALARY
67,10th,Male,<=50K
17,10th,Female,<=50K
40,Assoc-voc,Male,>50K
35,Assoc-voc,Male,<=50K
57,Assoc-voc,Male,<=50K
49,Assoc-voc,Male,>50K
42,Bachelors,Male,>50K
30,Bachelors,Male,>50K
23,Bachelors,Female,<=50K
==============================================
I created the following Pig Latin/Hadoop script:
sensitive = LOAD '/mdsba' USING PigStorage(',') AS (AGE, EDU, SEX, SALARY);
-- Filter the data by salary
Data_filter1 = FILTER sensitive BY (SALARY matches '<=50K');
Data_filter2 = FILTER sensitive BY (SALARY matches '>50K');
-- Group both filters
B = FOREACH (GROUP Data_filter1 BY (AGE, EDU, SEX)) GENERATE Data_filter1;
C = FOREACH (GROUP Data_filter2 BY (AGE, EDU, SEX)) GENERATE Data_filter2;
DUMP B;
DUMP C;
=============================================================
Is there any way to determine whether the aliases B, C, Data_filter1, or Data_filter2 run in the map or the reduce phase? I ask because the following report is generated at the end of the job:
Elapsed: 35sec
Diagnostics:
Average Map Time: 12sec
Average Shuffle Time: 10sec
Average Merge Time: 0sec
Average Reduce Time: 2sec
With many thanks
Yes. When you launch the job, you'll see a line like
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Alias1[73,14] C: Alias2[20, 9] R: Alias3[90, 78]
M stands for mapper, C for combiner, and R for reducer, with each alias listed under the phase it runs in. In the general case, though, an alias may run in both the mapper and the reducer.
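You can also ask Pig before the job runs: EXPLAIN B; prints the logical, physical, and MapReduce plans for an alias, and the MapReduce plan shows which operators are assigned to the map stage and which to the reduce stage.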

JSON parsing with elephant-bird in Pig

I can't get the following data to parse in Pig. It's what the Twitter API returns after fetching all tweets from a certain user.
Source data (I removed some numbers so as not to accidentally invade anyone's privacy):
[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"#Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}, {another one goes here ....} ]
I have tried a lot of things but this is the current code I have:
REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar';
--Load Json
loadJson = LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []);
describe loadJson;
--dump loadJson;
--PARSING JSON
--txt
--a = FOREACH loadJson GENERATE json#'text' AS ParsedInput;
dump loadJson;
c = FOREACH loadJson GENERATE flatten(json#'text') as (m:map[]);
When I'm not getting errors, I just get no results (as in 0 bytes returned after the script finishes running),
for instance:
success!
Input(s):
Successfully read 0 records (532459 bytes) from: "/user/cloudera/tweetwall"
Output(s):
Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-988640258/tmp-846532109"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
1. You need to give a root name to your input JSON. I added "tweets" as the root name:
{"tweets":[<your input>]}
2. This is nested JSON, so you need to load the file with the '-nestedLoad' option in the loader.
input.json
{"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"#Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}
PigScript:
REGISTER '/tmp/json-simple-1.1.jar';
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
loadJson = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
B = FOREACH loadJson GENERATE flatten(json#'tweets') as (m:map[]);
C = FOREACH B GENERATE FLATTEN(m#'text');
DUMP C;
Output:
(#Beace_ your nan makes me laugh with some of the things she comes out with)
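A note on why the two FLATTENs work: with the '-nestedLoad' option each nested JSON object is itself loaded as a map, so json#'tweets' yields a bag of maps (one per tweet); the first FLATTEN turns that bag into one row per tweet, and m#'text' then pulls the text field out of each tweet's map.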