I'm using Spring WebClient to fetch HTML. The response contains Polish characters such as ą, ę, ż and so on.
After calling the service I expect the response to look like this: <div>plan zajęć</div>
But the actual response looks like this: <div>plan zaj�ć</div>, where the � sign replaces all Polish characters.
Here's the WebClient bean config:
@Bean
WebClient webClient() {
return WebClient.builder()
.build();
}
And here's how I use it:
Optional<String> resp = webClient.get()
.uri(uri)
.retrieve()
.bodyToMono(String.class)
.blockOptional();
And here's a link to the page that I'm trying to scrape: https://plan.polsl.pl/plan.php?winW=1000&winH=1000&type=0&id=343126158
I have no idea what to change in the WebClient configuration to get the desired effect, so I'm asking for help.
Please show how you use WebClient. I don't know Polish characters, but very likely your problem is related to the encoding of the response.
You can try setting the accepted charset to UTF-8 and see if that helps:
WebClient webClient = WebClient.create();
Mono<String> response = webClient.get()
.uri(uri)
.acceptCharset(StandardCharsets.UTF_8)
.retrieve()
.bodyToMono(String.class);
String responseString = response.block();
== Updated 1/2/2023 ==
Note that a Java String holds Unicode text (UTF-16 internally), so the response bytes have to be decoded with the charset they were actually written in. That is why we first asked the web server to return the document in UTF-8. Unfortunately, the web server that you specified returns ISO-8859-2 even though WebClient requests UTF-8. You will have to decode the response body as ISO-8859-2 yourself. Here is sample code to do that. I tested it with your web server.
WebClient webClient = WebClient.create();
Mono<ByteArrayResource> responseBody = webClient.get()
.uri(uri)
.retrieve()
.bodyToMono(ByteArrayResource.class);
String responseString = new String(responseBody.block().getByteArray(), Charset.forName("ISO-8859-2"));
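To see why the � shows up, here is a minimal JDK-only sketch (no WebClient involved). It simulates the bytes an ISO-8859-2 server would send and decodes them twice. The class name `DecodeDemo` and the sample markup are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decodes raw response bytes with the given charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    // Simulates the bytes an ISO-8859-2 server would send for "<div>plan zajęć</div>".
    // \u0119 is ę and \u0107 is ć; escapes keep the source file encoding-independent.
    static byte[] sampleBody() {
        return "<div>plan zaj\u0119\u0107</div>".getBytes(Charset.forName("ISO-8859-2"));
    }

    public static void main(String[] args) {
        byte[] body = sampleBody();
        // Wrong charset: the single bytes for ę and ć are not valid UTF-8,
        // so the decoder substitutes U+FFFD for them.
        System.out.println(decode(body, StandardCharsets.UTF_8));
        // Right charset: the Polish letters survive.
        System.out.println(decode(body, Charset.forName("ISO-8859-2")));
    }
}
```

The damage happens at decoding time, which is why the body has to be fetched as bytes before choosing the charset.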
If you are building a generic web crawler, do not hardcode ISO-8859-2; read the charset from the Content-Type response header instead. Most web servers report the media type as well as the charset in Content-Type, and you can then pass that charset to the String constructor. Here is sample code to find the charset.
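WebClient webClient = WebClient.create();
Mono<String> charsetName = webClient
    .get()
    .uri("http://example.com")
    // exchange() is deprecated since Spring Framework 5.3; exchangeToMono replaces it
    .exchangeToMono(res -> {
        // contentType() is empty when the server sends no Content-Type header
        String contentType = res.headers().contentType().map(MediaType::toString).orElse("");
        // parse the Content-Type header to extract the charset
        Matcher m = Pattern.compile("charset=([^;]+)").matcher(contentType);
        // a Mono cannot carry null, so a missing charset becomes an empty Mono
        Mono<String> charset = m.find() ? Mono.just(m.group(1)) : Mono.empty();
        // release the unread body before completing
        return res.releaseBody().then(charset);
    });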
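The header parsing itself can be tried without any HTTP client. The following is a sketch under the assumption that you already have the Content-Type value as a plain string; `ContentTypeCharset` and `fromContentType` are hypothetical names, not part of Spring.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {
    // Matches charset=xyz or charset="xyz", case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or empty if absent or unknown.
    static Optional<Charset> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Malformed or unsupported charset name: treat it as "not specified".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-2"));
        System.out.println(fromContentType("text/html"));
    }
}
```

Returning an Optional rather than null makes the "server did not say" case explicit, so the caller can fall back to another detection step.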
Unfortunately, the web server that you specified doesn't report the charset in the Content-Type header either. In this case, you need to look elsewhere in the response to determine the character encoding.
One place you can check is the charset attribute of the <meta> element in the HTML document. Many pages include a <meta> element with a charset attribute that specifies the character encoding of the document. This is how I found out that your document uses the ISO-8859-2 charset.
WebClient doesn't have an easy way to extract the charset from the <meta> tag, but you can use a regular expression to do that. Here is the sample code:
WebClient webClient = WebClient.create();
Mono<String> responseBody = webClient
.get()
.uri("http://example.com")
.retrieve()
.bodyToMono(String.class);
Mono<String> charsetName = responseBody.flatMap(html -> {
    // use a regular expression to extract the charset attribute from the <meta> element
    Matcher m = Pattern.compile("<meta[^>]+charset=[\"']?([^\"'>]+)[\"']?").matcher(html);
    // a Mono cannot carry null, so a missing <meta> charset becomes an empty Mono
    return m.find() ? Mono.just(m.group(1)) : Mono.empty();
});
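Putting the pieces together, the sketch below sniffs the <meta> charset from the raw bytes and then decodes the whole body with it. It uses only the JDK. The names `MetaCharsetSniffer`, `sniff` and `decode`, the 4096-byte scan window and the UTF-8 fallback are all assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]+charset\\s*=\\s*[\"']?([^\"'>\\s;]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in a <meta> tag, or the fallback if none is usable.
    static Charset sniff(byte[] body, Charset fallback) {
        // Scan only the start of the document. ISO-8859-1 maps every byte to a char,
        // so the ASCII markup stays readable whatever the real encoding is.
        int len = Math.min(body.length, 4096);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown charset name in the page: fall through to the fallback.
            }
        }
        return fallback;
    }

    // Decodes the body with the sniffed charset, falling back to UTF-8.
    static String decode(byte[] body) {
        return new String(body, sniff(body, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-2\">"
                + "<div>plan zaj\u0119\u0107</div>";
        byte[] body = html.getBytes(Charset.forName("ISO-8859-2"));
        System.out.println(sniff(body, StandardCharsets.UTF_8));
        System.out.println(decode(body).equals(html));
    }
}
```

With WebClient, the byte array would come from a byte-oriented read such as the ByteArrayResource approach shown earlier, so that no charset is applied before the sniffing step.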
Related
I'm trying to send a byte[] from a client to a server using WebClient, this is what I have:
HttpClient httpClient = HttpClient.create();
// some proxy Settings to httpClient..
ReactorClientHttpConnector connector = new ReactorClientHttpConnector(httpClient);
WebClient client = WebClient.builder().clientConnector(connector).build();
MultipartBodyBuilder formDataBuilder = new MultipartBodyBuilder();
String header = String.format("form-data; pack=%s;", pack); // pack is byte[]
formDataBuilder.part("pack", new ByteArrayInputStream(pack)).header("Content-Disposition", header);
formDataBuilder.part("simpleParam", "testParam");
client.post().uri("myurl.test").accept(MediaType.APPLICATION_XML).contentType(MediaType.MULTIPART_FORM_DATA)
.header("Content-type", MediaType.MULTIPART_FORM_DATA_VALUE)
.body(BodyInserters.fromMultipartData(formDataBuilder.build()))
.retrieve()
.bodyToMono(Response.class)
.block();
Executing this code, though, I get this error:
org.springframework.core.codec.CodecException: No suitable writer found for part: pack
at org.springframework.http.codec.multipart.MultipartHttpMessageWriter.encodePart(MultipartHttpMessageWriter.java:260)
at org.springframework.http.codec.multipart.MultipartHttpMessageWriter.lambda$encodePartValues$4(MultipartHttpMessageWriter.java:213)
....
I don't understand what is missing.
Any help is appreciated, thank you
Your problem is that your Content-Disposition header is invalid. You shouldn't put your byte array into the header. The use of Content-Disposition in multipart/form-data is described in RFC 7578.
Also, in my case it helped to pass a ByteArrayResource instead of a ByteArrayInputStream. I would recommend trying one of these solutions:
Set the Content-Disposition header correctly:
// ...
String header = String.format("form-data; name=%s; filename=%s", "part", "testFilename.txt");
// ...
Use ByteArrayResource instead of ByteArrayInputStream:
formDataBuilder.part("pack", new ByteArrayResource(pack)).filename("testFilename.txt");
I'm currently using Spring's ResponseBodyEmitter to stream a multipart/related response consisting of multiple parts (of MIME type application/json as well as application/octet-stream) to a client. To do so, I manually set the boundary in the Content-Type header and write the encapsulation boundaries between the message parts into the payload myself. I'm pretty sure there is a more convenient way to achieve this. What would be the idiomatic way in Spring to achieve this?
@GetMapping(value = "/data", produces = {MediaType.MULTIPART_FORM_DATA_VALUE})
public ResponseEntity<ResponseBodyEmitter> streamMultipart() {
// omitting actual contents for the sake of brevity
InputStream audio = new ByteArrayInputStream(null); //asynchronously retrieved
String json = "{}";
ResponseBodyEmitter speakEmitter = new ResponseBodyEmitter();
executor.execute(() -> {
try {
speakEmitter.send("\r\n--myBoundary\r\n");
speakEmitter.send("Content-Type: application/json;\r\n\r\n");
speakEmitter.send(json, MediaType.APPLICATION_JSON);
speakEmitter.send("\r\n--myBoundary\r\n");
speakEmitter.send("Content-Type: application/octet-stream;\r\n\r\n");
speakEmitter.send(audio.readAllBytes(), MediaType.APPLICATION_OCTET_STREAM);
speakEmitter.send("\r\n--myBoundary--\r\n");
} catch (IOException ignoredForBrevity) {}
});
HttpHeaders responseHeaders = new HttpHeaders();
responseHeaders.add(HttpHeaders.CONTENT_TYPE, "multipart/related; boundary=myBoundary");
return ResponseEntity.ok().headers(responseHeaders).body(speakEmitter);
}
I have a URL that has ø in it. For example this:
https://server/ø
When I make the GET call in Postman, the URL is converted correctly into
https://server/%C3%B8
But when I use Java code like this:
UriComponentsBuilder builder = UriComponentsBuilder.fromUri(new URI(url));
String out = restTemplate.getForObject(builder.toUriString(), String.class);
It incorrectly turns the URL into
https://server/%25C3%25B8
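The %25 in that result is an encoded percent sign, which suggests the URL was percent-encoded twice. The question does not confirm where the second pass happens, so the sketch below only reproduces the effect with java.net.URI as an analogy, not with Spring's own code. `DoubleEncodingDemo` and `encodePath` are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DoubleEncodingDemo {
    // Builds https://server<path>; the multi-argument URI constructor quotes
    // illegal characters in the path and always quotes '%'.
    static String encodePath(String rawPath) {
        try {
            return new URI("https", "server", rawPath, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Raw input (\u00F8 is ø) is encoded once.
        System.out.println(encodePath("/\u00F8"));   // https://server/%C3%B8
        // Already-encoded input is encoded again: each % becomes %25.
        System.out.println(encodePath("/%C3%B8"));   // https://server/%25C3%25B8
    }
}
```

The general rule is to hand an encoding API either a raw string or an already-built URI, never an encoded string that it will treat as raw.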
The url-string contains a back-slash character that needs to be encoded. The url string is as follows.
String folder = "\\Foo\\Bar\\"; // some folder search path.
String urlString = "http://localhost:8081/certificates/?mypath=%5CFoo%5CBar%5C"; // (after encoding)
Here I use Spring RestTemplate to do a GET request. I set up a mock server to examine the request in detail (the mock server was set up using MuleSoft, if you must know!).
ResponseEntity<String> responseEntity = api.exchange(urlString, HttpMethod.GET, new HttpEntity<>(new HttpHeaders()), String.class);
Here I use plain vanilla Java URLConnection to perform the request. Attached image with detailed request snapshot.
// 2. Plain vanilla java URLConnection. "result.toString()" has certificate match.
StringBuilder result = new StringBuilder();
URL url = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty("X-Venafi-Api-Key", apiKey);
conn.setRequestMethod("GET");
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while ((line = rd.readLine()) != null) {
result.append(line);
}
rd.close();
System.out.println(result.toString());
In the images, you can see that the queryString value is different for these two requests. One of them shows \\ while the other shows %5C, although the parsed parameter value for myPath is still the same.
I am having to deal with an api that seems to work if-and-only-if the queryString looks like the former (i.e. "\\"). Why does the parsed queryString for Spring show "%5C" while this value shows double-backslash for requests originating from plain Java, curl, and even a simple browser?
What baffles me EVEN more, is that just about everything about the two HTTP Requests are IDENTICAL! And yet, why does the queryString/requestUri parse differently for these two requests? Shouldn't it be that a HTTP GET method is completely defined by its header contents and the requestUri? What am I missing to capture in these two GET requests?
Lots of questions. Spent an entire day, but at least I could verify that the way the requestUri/queryString is parsed seems to align with how the remote api-server responds.
Thanks.
Did some digging around the following morning. Turns out, with
ResponseEntity<String> responseEntity = api.exchange(urlString, HttpMethod.GET, new HttpEntity<>(new HttpHeaders()), String.class);
You should NOT have the "urlString" already encoded. The 'exchange' method does that encoding for you under-the-hood.
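If the value reaches your code already encoded, one option is to decode it back to its raw form before handing it to a client that encodes on its own. The following JDK-only sketch shows the round trip; `RawQueryValue`, `toRaw` and `encodeOnce` are hypothetical helper names.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RawQueryValue {
    // Undoes percent-encoding so the value can be passed raw to a client that encodes it itself.
    static String toRaw(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    // Percent-encodes a raw value exactly once (form-style encoding, as URLEncoder does).
    static String encodeOnce(String raw) {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = toRaw("%5CFoo%5CBar%5C");
        System.out.println(raw);                          // \Foo\Bar\
        System.out.println(encodeOnce(raw));              // %5CFoo%5CBar%5C
        // Encoding an already-encoded value turns every % into %25.
        System.out.println(encodeOnce(encodeOnce(raw)));  // %255CFoo%255CBar%255C
    }
}
```

With RestTemplate, the raw value would then typically be supplied as a URI template variable (for example a URL ending in `?mypath={mypath}`), so that the encoding is applied once by the template expansion.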
I have been trying to use accented characters in a URL to call SOLR.
My Url looks like this:
"http://host:8983/solr/principal/select?q=**name:%22Michaël.e%22**"
When I fire the URL from a browser I get the correct result, but not when I call RestTemplate.exchange(URI, HttpMethod.GET, entity, String.class).
The SOLR log shows the accented characters being converted to "?", as shown below:
q=(name:"Micha?.e")
I have set the RestTemplate request charset to "UTF-8", but it still does the same.
My SOLR is running on Jetty.
You can try to URL-encode the special characters with URLEncoder before calling RestTemplate:
String baseUri = "http://host:8983/solr/principal/select?q=**name:%22";
// TODO get name from somewhere
String name = "Michaël.e";
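String encodedName = URLEncoder.encode(name, "UTF-8");
// wrap the already-encoded string in a URI so RestTemplate does not encode the % signs again
restTemplate.exchange(URI.create(baseUri + encodedName + "%22**"), HttpMethod.GET, entity, String.class);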