I am currently working with a team that works on search engines, especially HP IDOL.
The main goal of my work is to find an open-source alternative, so I started working with Elasticsearch, but I still have some problems I could not find solutions for.
I need to index documents into Elasticsearch from the following servers:
SharePoint
Documentum
Alfresco
From my searches on the web, I found:
Talend (we cannot use it because the team does not want to pay)
Apache ManifoldCF (open source, but it gave us a lot of problems)
Given those problems, I am still looking for new solutions.
Can you please tell me whether it would be possible to put all the files from these sources into HDFS and then index them all into Elasticsearch with Apache Spark?
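To make the idea concrete, this is roughly the kind of Spark job I have in mind (only a sketch: it assumes the elasticsearch-hadoop / es-spark connector, and the paths, hosts, field names, and index name are placeholders):

# Sketch only: index JSON documents staged in HDFS into Elasticsearch using the
# elasticsearch-hadoop (es-spark) connector. The connector jar has to be on the
# Spark classpath (e.g. via --jars or --packages).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-elasticsearch").getOrCreate()

# Assume the documents pulled from SharePoint/Documentum/Alfresco were staged in
# HDFS as JSON lines, one (already text-extracted) document per line.
docs = spark.read.json("hdfs:///staging/documents/*.json")

(docs.write
     .format("org.elasticsearch.spark.sql")
     .option("es.nodes", "es-host:9200")      # Elasticsearch endpoint
     .option("es.resource", "documents")      # target index
     .option("es.mapping.id", "doc_id")       # hypothetical document id field
     .mode("append")
     .save())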
I would also appreciate any other techniques that I have not thought of.
Thanks in advance.
I have a big problem after migrating BPM from version 8.5.5 to BAW 22. I have a saved search in Process Portal that shows me the tasks of all users, broken down by process. After the migration this search stopped working: it only shows my own tasks and the tasks of the teams I belong to, and it no longer shows other users' tasks. I have tried to fix this in different ways, but nothing works. Can anyone help?
Can you provide more specifics on what you have tried so far?
Two things I would suggest:
1) If search optimization is enabled, check out this link if you haven't already. You will need to drop the LSW_BPD_INSTANCE_VAR_NAMES and LSW_BPD_INSTANCE_VARS_PIVOT tables and recreate them using the steps described there:
https://www.ibm.com/docs/en/baw/22.x?topic=portal-tuning-process-searches
2) Depending on your Saved Search criteria, it's possible you will need to re-create it. See the red alert section about existing saved searches and migration here:
https://www.ibm.com/docs/en/baw/22.x?topic=windows-restarting-verifying-migration.
It mentions 'earlier versions of Business Automation Workflow', but the same applies to earlier versions of BPM.
I was having trouble bulk-loading records any faster than cursor.executemany allows. I hoped the bulk operations documented for regular MonetDB here might work, so I tried an export as a test, e.g. cursor.execute("COPY SELECT * FROM foo INTO '/file/path.csv'"). This doesn't raise an error unless the file already exists, but the resulting file is always 0 bytes. I tried the same with file STDOUT and it prints nothing.
Are these COPY commands meant to work on the embedded version?
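For reference, this is roughly what I'm running (a minimal sketch assuming the monetdbe Python package's DB-API style interface; the table and file path are just illustrative):

# Minimal reproduction sketch for the embedded case; names and paths are illustrative.
import monetdbe

conn = monetdbe.connect(':memory:')   # in-memory embedded database
cur = conn.cursor()
cur.execute("CREATE TABLE foo (i INT)")
cur.execute("INSERT INTO foo VALUES (1), (2), (3)")
conn.commit()

# On regular MonetDB this writes a CSV file; here it only produces a 0-byte file.
cur.execute("COPY SELECT * FROM foo INTO '/tmp/foo.csv'")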
Note: this is my first use of anything related to MonetDB. As a fan of SQLite and a not-super-impressed user of Amazon Redshift, this seemed like a neat project. I'm not sure whether MonetDB/e is the same as MonetDBLite; the former seems more active lately?
Exporting data through a COPY INTO command should be possible in MonetDB/e, yes.
However, this feature is not working at the moment. I was able to reproduce your problem: the COPY INTO creates the file the data should be exported to, but doesn't write the data. This does not happen with regular MonetDB.
Our team has been notified of this issue, and we're looking into it. Thanks for the heads up!
PS: Regarding your doubt about MonetDB/e vs MonetDBLite: our team no longer develops or maintains MonetDBLite. Both are embedded databases that use MonetDB as the core engine, but MonetDBLite is deprecated. After learning some do's and don'ts with MonetDBLite, our team is now developing our next generation of embedded databases.
So for your embedded database needs, you should follow what's coming out of our MonetDB/e projects.
I've created a test for it at: https://github.com/MonetDBSolutions/monetdbe-examples/blob/CI/C/copy_into.c
Also filed a bug report over on GitHub: https://github.com/MonetDB/MonetDB/issues/7058
We're currently looking into this issue.
I'm working on an EDM migration project, and I would like to know if anyone has information on the document limitations in Liferay 6.2 CE.
We need to import millions of documents to a Liferay instance.
I would like to know whether someone has done that with the Community Edition, and whether I would need a cluster of Liferay instances to keep an acceptable response time for the end users.
Thanks for your advice!
Julien
I found an answer in the Liferay documentation. https://dev.liferay.com/fr/discover/deployment/-/knowledge_base/7-0/document-repository-configuration
The advanced file system store overcomes this limitation by programmatically creating a structure that can expand to millions of files, by alphabetically nesting the files in folders. This not only allows for more files to be stored, but also improves performance as there are fewer files stored per folder.
So it seems that with the advanced file system store, Liferay can handle millions of documents and work around OS limitations.
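As a rough illustration of the nesting idea described above (not Liferay's actual implementation; paths and depth are made up), spreading files into subfolders derived from the leading characters of their identifier keeps the number of entries per directory small:

# Rough illustration of alphabetical nesting: derive a subfolder chain from the
# first characters of a file's identifier so no single directory grows too large.
from pathlib import Path

def nested_path(root, file_id, depth=3):
    # e.g. "a1b2c3d4.pdf" -> <root>/a/1/b/a1b2c3d4.pdf
    return Path(root, *list(file_id[:depth]), file_id)

print(nested_path("/data/document_library", "a1b2c3d4.pdf"))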
Hope this helps others.
Julien
I hope this is a good place to ask this; otherwise, please redirect me to the correct forum.
I have a large amount of data (~400 GB) that I need to distribute to all the nodes in a cluster (~100 nodes). Any help on how to do this will be appreciated; what follows is what I've tried.
I was thinking of doing this using torrents, but I'm running into a bunch of issues. These are the steps I tried:
I downloaded ctorrent to create the torrent, seed it, and download it. I ran into a problem because I didn't have a tracker.
I found that qbittorrent-nox has an embedded tracker so I downloaded that on one of my nodes and set the tracker up.
I then created the torrent using the tracker I had set up and copied it to my nodes.
When I run the torrent with ctorrent on the node that actually holds the data, in order to seed it, I get:
Seed for others 72 hours
- 0/0/1 [1/1/1] 0MB,0MB | 0,0K/s | 0,0K E:0,1 Connecting
When I run it on one of the nodes to download the data, I get:
- 0/0/1 [0/1/0] 0MB,0MB | 0,0K/s | 0,0K E:0,1
So it seems they aren't connecting to the tracker properly, but I don't know why.
I am probably doing something very wrong, but I can't figure it out.
If anyone can help me with what I am doing wrong, or has any other way of distributing the data efficiently, even without torrents, I would be very happy to hear it.
Thanks in advance for any help available.
But the node that's supposed to be seeding thinks it has 0% of the file, and so it doesn't seed.
If you create a metadata file (.torrent) with tool A and then want to seed it with tool B, you need to point B to both the metadata and the data (the content files) themselves.
I know this is a different issue now and might require a different topic, but I'm hoping you might have ideas.
You should create a new question, which will give you more room to provide details.
So this is embarrassing: I might have had it working for a while now, but I did change my implementation since I started. I just re-checked, and the files I was transferring were corrupted on one of my earlier tries, and I have been using them ever since.
So, to sum up, this is what worked for me, in case anybody else needs the same setup (a small automation sketch follows the list below):
I create torrents using "transmission-create /path/to/file/or/directory/to/be/torrented -o /path/to/output/directory/output_file_name.torrent" (this is because qbittorrent-nox doesn't provide a tool, as far as I could find, for creating torrents).
I run the torrent on the computer with the actual files, so it will seed, using "qbittorrent-nox ~/path/to/torrent/file/name_of_file.torrent".
I copy the .torrent file to all the nodes and run "qbittorrent-nox ~/path/to/torrent/file/name_of_file.torrent" to start downloading.
qbittorrent settings I needed to configure:
In "Downloads" change "Save files to location" to the location of the data in the node that is going to be seeding #otherwise that node wont know it has the files specified in the torrent and wont seed them.
To avoid issues with the torrents sometimes starting as queued and requiring a "force resume". This doesn't appear to have fixed the problem 100% though
In "Speed" tab uncheck "Enable bandwidth management (uTP)"
uncheck "Apply rate limit to uTP connections"
In "BitTorrent" tab uncheck "Torrent Queueing"
Thanks for all the help, and I'm sorry I hassled people for no reason at some point.
Does anyone know how to export a document or a set of documents (cockpits, reports, charts, etc.) from one server instance to another?
I haven't found any info on this subject and would really like to know, as it's important for future development and upgrading. We want/need to have two separate environments, one for development and another for use by the end client.
Thank you in advance :)
See my write-up on exporting SpagoBI report documents from one server and importing them into another SpagoBI server. It will give you a largely manual process, but it is repeatable.
Exporting and Importing of SpagoBI Documents
If you're interested in exporting one report document, possibly versioning it as an artifact, and then deploying it to one or more SpagoBI servers in an automated fashion, see my working SpagoBI Export deployer project on GitHub (dbh / SpagoBIInteg) and my blog post SpagoBI report deployment via SDK.
Here is an earlier discussion thread on the Spago World forum, which went without comment.
If you find any of this useful, please let me know. I'm happy to collaborate.