How to upsert into Elasticsearch in Spark? - hadoop

With a plain HTTP POST, the following curl command either inserts a new document with createtime (via the upsert body) or updates lastupdatetime on an existing one:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
    "doc": {
        "lastupdatetime": "2015-09-16T18:00:00"
    },
    "upsert": {
        "createtime": "2015-09-16T18:00:00",
        "lastupdatetime": "2015-09-16T18:00"
    }
}'
But in my Spark script, after setting "es.write.operation": "upsert", I don't see how to set createtime only on insert. The official documentation only covers es.update.script.*... So, can anyone give me an example?
UPDATE: In my case, I want to save information about Android devices from logs into one Elasticsearch type, and set a device's first appearance time as createtime. If the device appears again, I only update lastupdatetime but leave createtime as it was.
So the document id is the Android ID: if the id exists, update lastupdatetime; otherwise insert both createtime and lastupdatetime. The settings here are (in Python):
conf = {
    "es.resource.write": "stats-device/activation",
    "es.nodes": "NODE1:9200",
    "es.write.operation": "upsert",
    "es.mapping.id": "id"
    # ???
}
rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf
)
I just don't know how to insert the createtime field only when the id does not exist.

Without seeing your Spark script, it is hard to give a detailed answer. But in general you will want to use elasticsearch-hadoop (so you'll need to add that dependency to your build.sbt, for example), and then in your script you can:
import org.elasticsearch.spark._
val documents = sc.parallelize(Seq(
  Map(
    "id" -> 1,
    "createtime" -> "2015-09-16T18:00:00",
    "lastupdatetime" -> "2015-09-16T18:00"),
  Map(<next document>), ...))
documents.saveToEs("test/type1", Map("es.mapping.id" -> "id"))
as per the official docs. The second argument to saveToEs specifies which key in your RDD of Maps to use as the Elasticsearch document id.
Of course, if you're doing this with Spark it implies you've got more rows than you'll want to type out by hand, so for your case you'd need to transform your data into an RDD of Maps from key -> value within your script. But without knowing the data sources I can't go into much more detail.

Finally, I ended up with a solution, which is not perfect:
1. Add createtime to every source document.
2. Save to ES with the create operation, ignoring "document already exists" errors.
3. Remove the createtime field.
4. Save to ES again with the update operation.
For now (2015-09-27), step 2 can be implemented by this patch.
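Since I can only sketch this concretely in Java, here is roughly what the two passes look like with elasticsearch-hadoop's Java Spark API (JavaEsSpark); the wrapper class, helper method and timestamp argument are my own assumptions, and actually ignoring the "already exists" errors in step 2 still depends on the patch mentioned above:
import org.apache.spark.api.java.JavaRDD;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

import java.util.HashMap;
import java.util.Map;

public class UpsertWithCreatetime {

    // deviceEvents: one Map per log record, already containing "id" and "lastupdatetime"
    static void save(JavaRDD<Map<String, Object>> deviceEvents, String now) {
        // Pass 1: add createtime to every document and write with the "create"
        // operation; ids that already exist are rejected, and those errors have
        // to be ignored (which is what the patch referenced above enables).
        JavaRDD<Map<String, Object>> withCreatetime = deviceEvents.map(doc -> {
            Map<String, Object> copy = new HashMap<>(doc);
            copy.put("createtime", now);
            return copy;
        });
        JavaEsSpark.saveToEs(withCreatetime, "stats-device/activation", esConf("create"));

        // Pass 2: write the documents again without createtime as a plain update,
        // so lastupdatetime is refreshed while the original createtime is kept.
        JavaEsSpark.saveToEs(deviceEvents, "stats-device/activation", esConf("update"));
    }

    private static Map<String, String> esConf(String operation) {
        Map<String, String> conf = new HashMap<>();
        conf.put("es.mapping.id", "id");
        conf.put("es.write.operation", operation);
        return conf;
    }
}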

Related

Elasticsearch doesn't give me any error when updating a non-existent document

I'm running an updateByQuery operation in Elasticsearch using Spring Data Elasticsearch (Spring Boot parent v2.6.1, Elasticsearch 7.15.2).
In my ES index I have stored 2 documents.
When I target a non-existent document, I don't get any error, so I can't tell whether the update actually ran or not.
Updates for a document that exists work fine. I'd like a way to log it when no documents are modified.
What should I look at? What should I change so that I get some indication of whether there was an update?
Here's my code snippet:
UpdateByQueryRequest request = new UpdateByQueryRequest("index");
Map<String, Object> data = new HashMap<>();
data.put("marks", "30");
data.put("name", "timmy");
data.put("roll_number", "10");
request.setScript(
    new Script(
        ScriptType.INLINE, "painless",
        "if (ctx._source.name == params.name && ctx._source.roll_number == params.roll_number) {ctx._source.marks = params.marks;}",
        data));
BulkByScrollResponse resp = globalClient.updateByQuery(request, RequestOptions.DEFAULT);
log.info("response: {}", resp.getStatus());
I've logged the response status as well. What I find weird is that for both an existing and a non-existent document the updated count is 2, the same as the number of documents in my index.
response in case of non-existent record:
response: BulkIndexByScrollResponse[sliceId=null,updated=2,created=0,deleted=0,batches=1,versionConflicts=0,noops=0,retries=0,throttledUntil=0s]
response in case of existing record:
response: BulkIndexByScrollResponse[sliceId=null,updated=2,created=0,deleted=0,batches=1,versionConflicts=0,noops=0,retries=0,throttledUntil=0s]
This is pure Elasticsearch code, nothing from Spring Data Elasticsearch.
Where do you specify the query? I don't see any in your code. That means that all documents will be updated - 2 in your case.
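For illustration, a hedged sketch of the same call with the check moved into a query, so getUpdated() reflects whether anything matched; it reuses globalClient and log from the snippet above, and the termQuery calls assume name and roll_number are keyword fields:
UpdateByQueryRequest request = new UpdateByQueryRequest("index");
// Select only the documents that should be touched instead of filtering inside
// the script: documents that don't match the query are never counted as updated.
request.setQuery(QueryBuilders.boolQuery()
        .filter(QueryBuilders.termQuery("name", "timmy"))
        .filter(QueryBuilders.termQuery("roll_number", "10")));
Map<String, Object> params = new HashMap<>();
params.put("marks", "30");
request.setScript(new Script(ScriptType.INLINE, "painless",
        "ctx._source.marks = params.marks", params));

BulkByScrollResponse resp = globalClient.updateByQuery(request, RequestOptions.DEFAULT);
if (resp.getUpdated() == 0) {
    // nothing matched the query, i.e. the document does not exist
    log.info("no documents were updated");
}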

How to use java.util.Date as @Id in Mongo documents

OK, I found myself with a simple but annoying problem. My Mongo documents use java.util.Date as the id, and as you might guess the id gets converted (by Spring converters) to an ObjectId. I can't update these documents because every time a new ObjectId(Date) is created I get a completely different id, even though the date is the same...
How do I force Mongo to just use java.util.Date as the id?
Here is the sample code:
public void updateNode(...node..) {
    final MongoTemplate mongoTemplate = ...
    final String collectionName = ...
    final Query query = (new Query()).addCriteria(Criteria.where("time").is(node.getTime()));
    final Update update = Update.update("time", node.getTime()).set("top", node.getTop())
            .set("bottom", node.getBottom()).set("mid", node.getMid())
            .set("startTime", node.getStartTime()).set("potential", node.isPotential());
    mongoTemplate.upsert(query, update, MyClassNode.class, collectionName);
}
When I run this code for the first time the objects are inserted into the database, but with an ObjectId... If node.getTime() is a java.sql.Date then everything is fine.
If node.getTime() is not a java.sql.Date I cannot update the document if it exists. Why? Because every time the document is prepared a new ObjectId is created, so the update and the query end up with two different _id values and the update fails.
On checking the documentation, I found the following details:
In MongoDB, each document stored in a collection requires a unique _id
field that acts as a primary key. If an inserted document omits the
_id field, the MongoDB driver automatically generates an ObjectId for the _id field.
This also applies to documents inserted through update operations with
upsert: true.
The following are common options for storing values for _id:
Use an ObjectId.
Use a natural unique identifier, if available. This saves space and
avoids an additional index.
Generate an auto-incrementing number.
What I understood from the documentation is that, to avoid inserting the same document more than once, you should only use upsert: true if the query field is uniquely indexed. And when the upsert does insert, since no _id is supplied, the driver generates an ObjectId() for the _id field, which is why the id ends up as an ObjectId.
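As a hedged sketch of the "natural unique identifier" option applied to the code above (the document class and field names mirror the question, the mapping details are assumptions): store the time directly as _id and match on _id in the upsert query, so the driver never has to generate an ObjectId:
import java.util.Date;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

public class NodeRepository {

    // Hypothetical document class: the Date itself is the primary key, so it is
    // mapped straight to _id and no ObjectId is ever generated for it.
    public static class MyClassNode {
        @Id
        private Date time;
        private double top, bottom, mid;
        private Date startTime;
        private boolean potential;
        // getters/setters omitted
    }

    private final MongoTemplate mongoTemplate;
    private final String collectionName;

    public NodeRepository(MongoTemplate mongoTemplate, String collectionName) {
        this.mongoTemplate = mongoTemplate;
        this.collectionName = collectionName;
    }

    public void updateNode(MyClassNode node) {
        // Match on _id directly so the upsert finds the existing document by its
        // Date key instead of letting the driver invent a new ObjectId on insert.
        Query query = new Query(Criteria.where("_id").is(node.time));
        Update update = new Update()
                .set("top", node.top)
                .set("bottom", node.bottom)
                .set("mid", node.mid)
                .set("startTime", node.startTime)
                .set("potential", node.potential);
        mongoTemplate.upsert(query, update, MyClassNode.class, collectionName);
    }
}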

Full-text search of a JSON string

I have a question: in my DB I have a table that has a field with a JSON string, like:
field "description"
{
solve_what: "Add project problem",
solve_where: "In project CRUD",
shortname: "Add error"
}
How can I full-text search this string? For example, I need to find all records that have "project" in description.solve_what. In my sphinx.conf I have
sql_attr_json = description
P.S. Maybe I can do this with Elasticsearch?
I've just answered a very similar question here:
http://sphinxsearch.com/forum/view.html?id=13861
Note there is no support for extracting them as FIELDs at this time, so you can't 'full-text' search the text within the JSON elements.
(To do that you would have to use MySQL string manipulation functions to create a new column to index as a normal field. Something like:
SELECT id, SUBSTR(json_column,
LOCATE('"tag":"', json_column)+7,
LOCATE('"', json_column, LOCATE('"tag":"', json_column)+7)-LOCATE('"tag":"',
json_column)-7 ) AS tag, ...
is messy but should work... )
The code is untested.

Multi get returns source as null after bulk update

I am using Elasticsearch multi-get to read documents after a bulk update. It returns some document sources as null.
MultiGetRequestBuilder builder = client.prepareMultiGet();
builder.setRefresh(true);
builder.add(indexName, type, idsList);
MultiGetResponse multiResponse = builder.execute().actionGet();
for (MultiGetItemResponse response : multiResponse.getResponses()) {
    String customerJson = response.getResponse().getSourceAsString();
    System.out.println("customerJson::" + customerJson);
}
Any issues in my code? Thanks in advance.
When you say "some return sources as null", I assume the get response is marking them as not existing...?
If that's the case, then maybe:
some indexing requests in the bulk are failing due to a mapping or random error;
you need to refresh your index between the indexing and the multi-get (i.e. your docs are not available for search yet):
transportClient.admin().indices().prepareRefresh(index).execute();
good luck
EDIT: You answered your own question in the comment, but for readability's sake: when using get or multiget, if a routing key was used when indexing, it must be specified again during the get; otherwise a wrong shard is determined using the default routing and the get fails.
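A hedged sketch of what that looks like with the transport client fragment above, reusing client, indexName, type and idsList from the question; routingKeyFor() is a hypothetical lookup of whatever routing value was used at index time:
MultiGetRequestBuilder builder = client.prepareMultiGet();
for (String id : idsList) {
    // Add each id together with the routing key used when indexing, so the
    // multi-get is sent to the shard that actually holds the document.
    builder.add(new MultiGetRequest.Item(indexName, type, id)
            .routing(routingKeyFor(id)));
}
MultiGetResponse multiResponse = builder.execute().actionGet();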

Does Avro schema evolution require access to both old and new schemas?

If I serialize an object using a schema version 1, and later update the schema to version 2 (say by adding a field) - am I required to use schema version 1 when later deserializing the object? Ideally I would like to just use schema version 2 and have the deserialized object have the default value for the field that was added to the schema after the object was originally serialized.
Maybe some code will explain better...
schema1:
{"type": "record",
"name": "User",
"fields": [
{"name": "firstName", "type": "string"}
]}
schema2:
{"type": "record",
"name": "User",
"fields": [
{"name": "firstName", "type": "string"},
{"name": "lastName", "type": "string", "default": ""}
]}
using the generic non-code-generation approach:
// serialize
ByteArrayOutputStream out = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
GenericDatumWriter writer = new GenericDatumWriter(schema1);
GenericRecord datum = new GenericData.Record(schema1);
datum.put("firstName", "Jack");
writer.write(datum, encoder);
encoder.flush();
out.close();
byte[] bytes = out.toByteArray();
// deserialize
// I would like to not have any reference to schema1 below here
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema2);
Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
GenericRecord result = reader.read(null, decoder);
results in an EOFException. Using the jsonEncoder results in an AvroTypeException.
I know it will work if I pass both schema1 and schema2 to the GenericDatumReader constructor, but I'd like to not have to keep a repository of all previous schemas and also somehow keep track of which schema was used to serialize each particular object.
I also tried the code-gen approach, first serializing to a file using the User class generated from schema1:
User user = new User();
user.setFirstName("Jack");
DatumWriter<User> writer = new SpecificDatumWriter<User>(User.class);
FileOutputStream out = new FileOutputStream("user.avro");
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(user, encoder);
encoder.flush();
out.close();
Then updating the schema to version 2, regenerating the User class, and attempting to read the file:
DatumReader<User> reader = new SpecificDatumReader<User>(User.class);
FileInputStream in = new FileInputStream("user.avro");
Decoder decoder = DecoderFactory.get().binaryDecoder(in, null);
User user = reader.read(null, decoder);
but it also results in an EOFException.
Just for comparison's sake, what I'm trying to do seems to work with protobufs...
format:
option java_outer_classname = "UserProto";
message User {
optional string first_name = 1;
}
serialize:
UserProto.User.Builder user = UserProto.User.newBuilder();
user.setFirstName("Jack");
FileOutputStream out = new FileOutputStream("user.data");
user.build().writeTo(out);
add optional last_name to format, regen UserProto, and deserialize:
FileInputStream in = new FileInputStream("user.data");
UserProto.User user = UserProto.User.parseFrom(in);
as expected, user.getLastName() is the empty string.
Can something like this be done with Avro?
Avro and Protocol Buffers have different approaches to handling versioning, and which approach is better depends on your use case.
In Protocol Buffers you have to explicitly tag every field with a number, and those numbers are stored along with the fields' values in the binary representation. Thus, as long as you never change the meaning of a number in a subsequent schema version, you can still decode a record encoded in a different schema version. If the decoder sees a tag number that it doesn't recognise, it can simply skip it.
Avro takes a different approach: there are no tag numbers, instead the binary layout is completely determined by the program doing the encoding — this is the writer's schema. (A record's fields are simply stored one after another in the binary encoding, without any tagging or separator, and the order is determined by the writer's schema.) This makes the encoding more compact, and saves you from having to manually maintain tags in the schema. But it does mean that for reading, you have to know the exact schema with which the data was written, or you won't be able to make sense of it.
If knowing the writer's schema is essential for decoding Avro, the reader's schema is a layer of niceness on top of it. If you're doing code generation in a program that needs to read Avro data, you can do the codegen off the reader's schema, which saves you from having to regenerate it every time the writer's schema changes (assuming it changes in a way that can be resolved). But it doesn't save you from having to know the writer's schema.
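As a small illustration of that resolution step (mirroring the snippet in the question, where passing both schemas to the GenericDatumReader constructor makes it work): schema1 is the writer's schema the bytes were encoded with, schema2 is the reader's schema, and Avro fills in the defaulted lastName:
// deserialize with schema resolution: schema1 = writer's schema, schema2 = reader's schema
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema1, schema2);
Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
GenericRecord result = reader.read(null, decoder);
System.out.println(result.get("lastName")); // prints the default value: ""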
Pros & Cons
Avro's approach is good in an environment where you have lots of records that are known to have the exact same schema version, because you can just include the schema in the metadata at the beginning of the file, and know that the next million records can all be decoded using that schema. This happens a lot in a MapReduce context, which explains why Avro came out of the Hadoop project.
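For instance, Avro's object container file format does exactly that; a minimal sketch reusing schema1 and schema2 from the question (DataFileWriter stores the writer's schema in the file header, so the reader only needs the new schema):
// write with schema1; the schema is embedded in the file header
File file = new File("users.avro");
DataFileWriter<GenericRecord> fileWriter =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema1));
fileWriter.create(schema1, file);
GenericRecord datum = new GenericData.Record(schema1);
datum.put("firstName", "Jack");
fileWriter.append(datum);
fileWriter.close();

// read with schema2 only; the writer's schema is taken from the file itself
DataFileReader<GenericRecord> fileReader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>(schema2));
GenericRecord result = fileReader.next();
System.out.println(result.get("lastName")); // "" (the default)
fileReader.close();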
Protocol Buffers' approach is probably better for RPC, where individual objects are being sent over the network (as request parameters or return value). If you use Avro here, you may have different clients and different servers all with different schema versions, so you'd have to tag every binary-encoded blob with the Avro schema version it's using, and maintain a registry of schemas. At that point you might as well have used Protocol Buffers' built-in tagging.
To do what you are trying to do you need to make the last_name field optional by allowing null values: the type for last_name should be ["null", "string"] instead of "string".
I have tried to circumvent this problem; I am putting it here:
I also tried using two schemas, where one schema is just an addition of another column to the other, using the reflection API of Avro. I have the following schemas:
Employee (having name, age, ssn)
ExtendedEmployee (extending Employee and adding a gender column)
I am assuming a file which had the Employee objects earlier now also has the ExtendedEmployee objects, and I tried to read that file as:
RecordHandler rh = new RecordHandler();
if (rh.readObject(employeeSchema, dbLocation) instanceof Employee) {
    Employee e = (Employee) rh.readObject(employeeSchema, dbLocation);
    System.out.print(e.toString());
} else if (rh.readObject(schema, dbLocation) instanceof ExtendedEmployee) {
    ExtendedEmployee e = (ExtendedEmployee) rh.readObject(schema, dbLocation);
    System.out.print(e.toString());
}
This solves the problem here. However, I would love to know if there is an API wherein we can give the ExtendedEmployee schema to read the objects of Employee as well.
