How to dynamically change the index name in "saveJsonToEs"? - elasticsearch

I am trying to insert logs that I extract from a Kafka server into Elasticsearch 5 with Spark Streaming 2.0.0.
Here is my code. My big problem is with the "saveJsonToEs" line: this function takes a string argument that specifies the index name, but my index name is a JavaDStream. I did it like this in order to generate dynamic index names in another class.
JavaDStream<List<String>> newLines = lines.map(arg0 -> {
    String lineToInsertInES = "";
    String indexName = "";
    List<String> list = new ArrayList<String>();
    //some code to determine strings to add in my list
    list.add(lineToInsertInES);
    list.add(indexName);
    return list;
});
JavaDStream<String> lineToInsertInES = newLines.map(list -> list.get(0));
JavaDStream<String> indexName = newLines.map(list -> list.get(1));
lineToInsertInES.foreachRDD(line -> {
    if (!line.isEmpty())
        JavaEsSpark.saveJsonToEs(line, indexName); //problem at this line
});
Can you tell me how I can solve this?
Thank you in advance
J
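One possible approach (a sketch, not from the original post): elasticsearch-hadoop can resolve the index per document when the resource string contains a {field} pattern, so instead of carrying the index name in a separate DStream you could embed it as a field of each JSON line and write with a pattern. The field name "indexName" and the type "logs" below are hypothetical.
lineToInsertInES.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // elasticsearch-hadoop substitutes {indexName} per document
        // with the value of that field taken from the JSON itself
        JavaEsSpark.saveJsonToEs(rdd, "{indexName}/logs");
    }
});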

Related

Using spring jdbc template to query for list of parameters

I'm new to Spring's JdbcTemplate, and I'm wondering if I can pass a list of parameters and execute the query once for each parameter in the list. In the examples I've seen, the list of parameters is used for a single execution of the query with all the parameters provided. Instead, I'm trying to execute the query multiple times, each time with the next parameter in the list.
For example:
Let's say I have a List of Ids - params (Strings)
List<String> params = new ArrayList<String>();
params.add("1234");
params.add("2345");
trying to do something like:
getJdbcTemplate().query(sql, params, new CustomResultSetExtractor());
which I know, per the documentation, is not allowed; for one thing, it has to be an array. I've seen simple examples where the query is something like "select * from employee where id = ?" and they pass new Object[]{"1234"} into the method. And I'm trying to avoid the IN() condition. In my case each id will return multiple rows, which is why I'm using a ResultSetExtractor.
I know one option would be to iterate over the list and pass each id as a parameter, something like:
for (String id : params) {
    getJdbcTemplate().query(sql, new Object[]{id}, new CustomResultSetExtractor());
}
I just want to know if I can do this some other way. Sorry, I should mention that I am trying to do a SELECT. Originally I was hoping to return a List of custom objects for each result set.
You do need to pass an array of params to the API, but one of those params can itself be a collection. I believe this should work:
String sql = "select * from employee where id in (:ids)"; // or should there be '?'
getJdbcTemplate().query(sql, new Object[]{params}, new CustomResultSetExtractor());
Or you could explicitly specify that the parameter is an array:
getJdbcTemplate().query(sql, new Object[]{params}, new int[]{java.sql.Types.ARRAY}, new CustomResultSetExtractor());
You can use a PreparedStatement and do a batch job, e.g. from http://docs.spring.io/spring/docs/current/spring-framework-reference/html/jdbc.html:
public int[] batchUpdate(final List<Actor> actors) {
    int[] updateCounts = jdbcTemplate.batchUpdate(
            "update t_actor set first_name = ?, " +
            "last_name = ? where id = ?",
            new BatchPreparedStatementSetter() {
                public void setValues(PreparedStatement ps, int i) throws SQLException {
                    ps.setString(1, actors.get(i).getFirstName());
                    ps.setString(2, actors.get(i).getLastName());
                    ps.setLong(3, actors.get(i).getId().longValue());
                }
                public int getBatchSize() {
                    return actors.size();
                }
            });
    return updateCounts;
}
I know you don't want to use the IN clause, but I think it's the best solution for your problem.
If you use a for loop in this way, I think it's not optimal:
for (String id : params) {
    getJdbcTemplate().query(sql, new Object[]{id}, new CustomResultSetExtractor());
}
I think it's a better solution to use the IN clause, and then use a ResultSetExtractor to iterate over the result data. Your extractor can return a Map instead of a List (actually a Map of Lists):
Map<Integer, List<MyObject>>
Here is a simple tutorial explaining its use:
http://pure-essence.net/2011/03/16/how-to-execute-in-sql-in-spring-jdbctemplate/
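For completeness, a rough sketch of that map-of-lists approach, assuming a NamedParameterJdbcTemplate is available and using hypothetical employee/id/name column names; the single IN query runs once and the extractor groups every row under its id:
import java.util.*;
import org.springframework.jdbc.core.ResultSetExtractor;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public Map<String, List<String>> findNamesByIds(NamedParameterJdbcTemplate template, List<String> ids) {
    // the named parameter :ids is expanded to IN (?, ?, ...) for the whole list
    String sql = "select id, name from employee where id in (:ids)";
    Map<String, Object> sqlParams = Collections.singletonMap("ids", ids);
    // group every row under its id, i.e. the Map of Lists described above
    ResultSetExtractor<Map<String, List<String>>> grouping = rs -> {
        Map<String, List<String>> result = new HashMap<>();
        while (rs.next()) {
            result.computeIfAbsent(rs.getString("id"), k -> new ArrayList<>())
                  .add(rs.getString("name"));
        }
        return result;
    };
    return template.query(sql, sqlParams, grouping);
}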
I think this is the best solution:
public List<TestUser> findUserByIds(int[] ids) {
    String[] s = new String[ids.length];
    Arrays.fill(s, "?");
    String sql = StringUtils.join(s, ',');
    return jdbcTemplate.query(String.format("select * from users where id in (%s)", sql),
            ArrayUtils.toObject(ids), new BeanPropertyRowMapper<>(TestUser.class));
}
This may be what you want. BeanPropertyRowMapper is just an example; it will be very slow when there are a lot of records, so you should change it to another, more efficient RowMapper.

Spark RDD to update

I am loading a file from HDFS into a JavaRDD and want to update that RDD. For that I am converting it to an IndexedRDD (https://github.com/amplab/spark-indexedrdd), but I am not able to, as I am getting a ClassCastException.
Basically I will make key-value pairs and update the keys. IndexedRDD supports updates. Is there any way to convert?
JavaPairRDD<String, String> mappedRDD = lines.flatMapToPair(new PairFlatMapFunction<String, String, String>()
{
    @Override
    public Iterable<Tuple2<String, String>> call(String arg0) throws Exception {
        String[] arr = arg0.split(" ", 2);
        System.out.println("length " + arr.length);
        List<Tuple2<String, String>> results = new ArrayList<Tuple2<String, String>>();
        // build the (key, value) pair from the split line
        results.add(new Tuple2<String, String>(arr[0], arr[1]));
        return results;
    }
});
IndexedRDD<String,String> test = (IndexedRDD<String,String>) mappedRDD.collectAsMap();
collectAsMap() returns a java.util.Map containing all the entries from your JavaPairRDD, but nothing related to Spark: that function collects the values on one node so you can work with plain Java. Therefore, you cannot cast it to IndexedRDD or any other RDD type, as it's just a normal Map.
I haven't used IndexedRDD, but from the examples you can see that you need to create it by passing a PairRDD to its constructor:
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
So in your code it should be:
IndexedRDD<String,String> test = new IndexedRDD<String,String>(mappedRDD.rdd());

How to iterate through Elasticsearch source using Apache Spark?

I am trying to build a recommendation system by integrating Elasticsearch with Apache Spark. I am using Java, with the MovieLens dataset as example data, and I have indexed that data into Elasticsearch as well. So far, I have been able to read the input from the Elasticsearch index as follows:
SparkConf conf = new SparkConf().setAppName("Example App").setMaster("local");
conf.set("spark.serializer", org.apache.spark.serializer.KryoSerializer.class.getName());
conf.set("es.nodes", "localhost");
conf.set("es.port", "9200");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc, "movielens/recommendation");
Using the esRDD.collect() function, I can see that I am retrieving the data from Elasticsearch correctly. Now I need to feed the user id, item id and preference from the Elasticsearch result into Spark's recommendation. If I were using a CSV file, I would be able to do it as follows:
String path = "resources/user_data.data";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Rating> ratings = data.map(
    new Function<String, Rating>() {
        public Rating call(String s) {
            String[] sarray = s.split(" ");
            return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]),
                    Double.parseDouble(sarray[2]));
        }
    }
);
What could be an equivalent mapping if I need to iterate through the Elasticsearch output stored in esRDD and create a similar mapping as above? If there is any example code that I could refer to, that would be of great help.
Apologies for not answering the Spark question directly, but in case you missed it, there is a description of doing recommendations on MovieLens data using Elasticsearch here: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_significant_terms_demo.html
You have not specified the format of the data in Elasticsearch, but let's assume it has fields userId, movieId and rating, so an example document looks something like {"userId":1,"movieId":1,"rating":4}.
Then you should be able to do (ignoring null checks etc):
JavaRDD<Rating> ratings = esRDD.values().map(
    new Function<Map<String, Object>, Rating>() {
        public Rating call(Map<String, Object> m) {
            // esRDD pairs each document id with its source map; values() keeps just the maps
            int userId = Integer.parseInt(m.get("userId").toString());
            int movieId = Integer.parseInt(m.get("movieId").toString());
            double rating = Double.parseDouble(m.get("rating").toString());
            return new Rating(userId, movieId, rating);
        }
    }
);
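As a follow-up sketch (not part of the original answer), the resulting ratings RDD can then be fed to MLlib's ALS in the usual way; the rank, iteration count and lambda below are placeholder values:
// org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel}
// train a collaborative-filtering model from the ratings built above
int rank = 10;
int numIterations = 10;
MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01);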

Pig Not Interpreting Int Correctly -- Custom Loader

So this is my first time ever using Pig and I'm having a hard time getting it to interpret my data correctly. I don't want to have to define a schema for my input files until run time, so I wrote a super simple custom loader where the only change I made to PigStorage was to the getSchema method, so that it reads the first two lines of my file and creates a schema from them:
public ResourceSchema getSchema(String location, Job job) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(location.replace("file://", "")));
    String[] line = br.readLine().split(",");
    String[] data = br.readLine().split(",");
    List<FieldSchema> fields = new ArrayList<FieldSchema>();
    for (int f = 0; f < line.length; f++)
    {
        Byte type = GetType(data[f].replace("\"", ""));
        fields.add(new FieldSchema(line[f].replace("\"", ""), type));
    }
    schema = new ResourceSchema(new Schema(fields));
    return schema;
}
private Byte GetType(Object Data)
{
    try {
        int number = Integer.parseInt(Data.toString());
        return org.apache.pig.data.DataType.INTEGER;
    }
    catch (Exception e) {}
    try {
        double dnumber = Double.parseDouble(Data.toString());
        return org.apache.pig.data.DataType.DOUBLE;
    }
    catch (Exception e) {}
    return org.apache.pig.data.DataType.CHARARRAY;
}
When I load a file and run DESCRIBE on it, it looks like what I want, for instance:
{CU_NUMBER: int,CYCLE_DATE: chararray,JOIN_NUMBER: int,RSSD: int,CU_TYPE: int,CU_NAME: chararray}
And the first 10 Rows look like this:
(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
However, when I try to do stuff with the data like:
FOICU = LOAD 'file:///home/biadmin/NCUA/foicu.txt' USING org.apache.pig.builtin.PigStorageInferSchema(',', '-schema');
FirstSixColumns = FOREACH FOICU GENERATE CU_NUMBER, CYCLE_DATE, JOIN_NUMBER, RSSD, CU_TYPE, CU_NAME;
TopTen = LIMIT FirstSixColumns 10;
FOICUFiltered = FILTER TopTen BY CU_NUMBER > 20;
CU_FIVE = FILTER TopTen BY CU_NUMBER == 5;
DUMP FOICUFiltered;
DUMP CU_FIVE;
FOICUFiltered returns all 10 rows even though 7 of them have a CU_NUMBER less than 20:
(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
And CU_FIVE returns no rows at all.
Does anybody know what I've done wrong here and is there a better way to dynamically load the schema at run time without using schema files?

Using a list in a query in Entity Framework

I am trying to find a way to pass an optional string list into a query. What I am trying to do is filter a list of tags by the relationship between them. For example, if c# was selected, my program would suggest only tags that appear in documents with a c# tag; then, on selection of the next tag, say SQL, only the tags linked to docs carrying those two tags together would be shown, whittling the list down so that the user can get closer and closer to his goal.
At the moment all I have is:
List<Tag> _tags = (from t in Tags
                   where t.allocateTagDoc.Count > 0
                   select t).ToList();
This is in a method that would be called repeatedly with the optional args as tags were selected.
I think I have been coming at it arse-backwards. What if I make two (or more) queries, one for each supplied tag, find the docs where they all appear together, and then bring out all the tags that go with them... Or would that be too many hits on the db? Can I do it entirely through an entity context variable and just query the model?
Thanks again for any help!
You can try this.
First, collect the tags to search for in a list of strings:
List<string> tagStrings = new List<string>{"c#", "sql"};
Pass this list into your query and check whether it is empty; if it is empty, the query returns all the tags, otherwise only the tags that match tagStrings:
var _tags = (from t in Tags
             where t.allocateTagDoc.Count > 0
                && (tagStrings.Count == 0 || tagStrings.Contains(t.tagName))
             select t).ToList();
You can also try this; the Dictionary represents the ID of a document together with its tags:
Dictionary<int, string[]> documents = new Dictionary<int, string[]>();
documents.Add(1, new string[] { "C#", "SQL", "EF" });
documents.Add(2, new string[] { "C#", "Interop" });
documents.Add(3, new string[] { "Javascript", "ASP.NET" });
documents.Add(4, new string[] { });

// returns tags belonging to documents with IDs 1, 2
string[] filterTags = new string[] { "C#" };
var relatedTags = GetRelatedTags(documents, filterTags);
Debug.WriteLine(string.Join(",", relatedTags));

// returns tags belonging to document with ID 1
filterTags = new string[] { "C#", "SQL" };
relatedTags = GetRelatedTags(documents, filterTags);
Debug.WriteLine(string.Join(",", relatedTags));

// returns tags belonging to all documents
// since no filtering tags are specified
filterTags = new string[] { };
relatedTags = GetRelatedTags(documents, filterTags);
Debug.WriteLine(string.Join(",", relatedTags));

public static string[] GetRelatedTags(
    Dictionary<int, string[]> documents,
    string[] filterTags)
{
    var documentsWithFilterTags = documents.Where(o =>
        filterTags.Intersect(o.Value).Count() == filterTags.Length);
    string[] relatedTags = new string[0];
    foreach (string[] tags in documentsWithFilterTags.Select(o => o.Value))
        relatedTags = relatedTags
            .Concat(tags)
            .Distinct()
            .ToArray();
    return relatedTags;
}
Thought I would pop back and share my solution, which was completely different from what I first had in mind.
First, I altered the database a little, getting rid of a useless field in the allocateDocumentTag table. That let me use the Entity Framework model much more efficiently, because I could leave that table out and work purely through the relationship between Tag and Document.
When I fill my form for the first time, I just display all the tags that have a relationship with a document. After that, using my search filter, when a Tag is selected in a CheckedListBox the Document ids associated with that Tag (or Tags) are returned and then fed back to fill the used-tag list box.
public static List<Tag> fillUsed(List<int> docIds = null)
{
    List<Tag> used = new List<Tag>();
    if (docIds == null || docIds.Count() < 1)
    {
        used = (from t in frmFocus._context.Tags
                where t.Documents.Count >= 1
                select t).ToList();
    }
    else
    {
        used = (from t in frmFocus._context.Tags
                where t.Documents.Any(d => docIds.Contains(d.id))
                select t).ToList();
    }
    return used;
}
From there the tags feed into the doc search and vice versa. Hope this can help someone else; if the answer is unclear or you need more code, just leave a comment and I'll try and sort it.
