Re: Import large dataset to Postgres instance in CF


Guillaume Berche
 

Hi Siva,

We've been working at Orange on a solution which dumps an existing db to
an S3-compatible endpoint and then reimports from the S3 bucket into a db
instance (see the mailing list announcement in [1] and the specs in [2]). The
implementation at [3] is still at an early stage and currently lacks
documentation beyond the specs. We'd be happy to get feedback from the
community. While this does not directly address your issue, it might
provide ideas:

a) within the corp network, manually export the data set (e.g. a pg dump) and
upload it to S3 using S3 CLIs (e.g. a riakcs service). Then ssh into one of
your CF app instances, download the dump from S3, and stream it
into a pg client to import it into a CF-reachable instance (so as to avoid
reaching the ephemeral FS limit)

b) If this process is recurrent and needs automation, then the
service-db-dumper could potentially help.
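
To make the streaming step in a) more concrete, here is a minimal Python
sketch. It is only an illustration, not part of service-db-dumper. It assumes
the dump is a plain-SQL pg dump reachable over HTTP(S) (for instance through a
pre-signed S3 URL), that Python and psql are available in the container, and
the names read_chunks and stream_to_command are made up for this example.

```python
import subprocess
import urllib.request
from typing import Iterable, Iterator, List

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read, so the dump never lands on disk


def read_chunks(url: str, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the body behind `url` piece by piece."""
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk


def stream_to_command(chunks: Iterable[bytes], argv: List[str]) -> int:
    """Feed `chunks` to the stdin of `argv` and return its exit code."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE)
    try:
        for chunk in chunks:
            proc.stdin.write(chunk)
    except BrokenPipeError:
        pass  # the client exited early; its exit code is returned below
    finally:
        try:
            proc.stdin.close()
        except BrokenPipeError:
            pass
    return proc.wait()
```

Assuming dump_url and db_uri hold the dump location and the postgres URI of
the bound service, the import would be
stream_to_command(read_chunks(dump_url), ["psql", db_uri]). A non-zero return
value means the client reported a failure.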

I'll think about extending the service-db-dumper to accept a remote S3
bucket as the source of a dump (currently it accepts a db URL to perform a
dump from, and soon a service instance name/guid).

If this service-db-dumper improvement were available, then you could
instantiate a service-db-dumper within your private CF instance. Then
instantiate a dump service instance from the S3 bucket where you would have
uploaded the dump.
Then use the service-db-dumper to restore/import this dump into your pg
instance accessible within CF.

Hope this helps,

Guillaume.

[1]
http://cf-dev.70369.x6.nabble.com/cf-dev-Data-services-import-export-tp1717.html
[2]
https://docs.google.com/document/d/1Y5vwWjvaUIwHI76XU63cAS8xEOJvN69-cNoCQRqLPqU/edit
[3] https://github.com/Orange-OpenSource/service-db-dumper

On Thu, Dec 10, 2015 at 6:35 AM, Nicholas Calugar <ncalugar(a)pivotal.io>
wrote:

Hi Siva,

1. If you run the PostgreSQL server, you likely want to temporarily open the
firewall to load data, or get on a jump box of some sort that can access the
database. It's not really a CF issue at this point, it's a general issue of
seeding a database out-of-band from the application server.
2. If the above isn't an option and your CF is running Diego, you
could use SSH to get onto an app container after SCPing the data to that
container.
3. The only other option I can think of is writing a simple app that
you can push to CF to do the import.
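
For option 3, a pushed app first has to find the database credentials. CF
exposes bound services to the app as JSON in the VCAP_SERVICES environment
variable. The sketch below assumes the service broker publishes a "uri" field
in its credentials, which varies between brokers, and find_postgres_uri is a
made-up helper name.

```python
import json
import os
from typing import Optional


def find_postgres_uri(vcap_services: str) -> Optional[str]:
    """Return the first postgres-looking credentials URI, or None."""
    services = json.loads(vcap_services or "{}")
    for instances in services.values():
        for instance in instances:
            # Assumption: the broker exposes a "uri" key in its credentials.
            uri = instance.get("credentials", {}).get("uri")
            if isinstance(uri, str) and uri.startswith(
                ("postgres://", "postgresql://")
            ):
                return uri
    return None


# Inside a pushed app CF sets VCAP_SERVICES; anywhere else this is None.
database_uri = find_postgres_uri(os.environ.get("VCAP_SERVICES", ""))
```

The import app could then hand database_uri to whatever client it uses to
load the data.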

Hope that helps,

Nick

On Wed, Dec 9, 2015 at 3:08 PM Siva Balan <mailsiva(a)gmail.com> wrote:

Hi Nick,
Your Option 1 (using the psql CLI) is not possible since there is a firewall
that only allows connections from CF apps to the postgres DB. Apps like the psql CLI
that are outside of CF have no access to the postgres DB.
I just wanted to get some thoughts from this community since I presume
many would have faced a similar circumstance of importing large sets of
data to their DB which is behind a firewall and accessible only through CF
apps.

Thanks
Siva

On Wed, Dec 9, 2015 at 2:27 PM, Nicholas Calugar <ncalugar(a)pivotal.io>
wrote:

Hi Siva,

You'll have to tell us more about how your PostgreSQL and CF were
deployed, but you might be able to connect to it from your local machine
using the psql CLI and the credentials for one of your bound apps. This
takes CF out of the equation other than the service binding providing the
credentials.
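
As a rough illustration of reusing those credentials, the Python sketch below
turns a postgres URI into the PG* environment variables that psql reads. It
assumes the binding exposes the credentials as a single URI, and libpq_env is
a made-up helper name.

```python
from typing import Dict
from urllib.parse import unquote, urlsplit


def libpq_env(uri: str) -> Dict[str, str]:
    """Map a postgres:// URI onto the PG* variables read by psql."""
    parts = urlsplit(uri)
    env = {
        "PGHOST": parts.hostname or "",
        "PGPORT": str(parts.port or 5432),  # fall back to the default port
        "PGUSER": unquote(parts.username or ""),
        "PGPASSWORD": unquote(parts.password or ""),
        "PGDATABASE": unquote(parts.path.lstrip("/")),
    }
    # Drop empty entries so psql falls back to its own defaults for them.
    return {key: value for key, value in env.items() if value}
```

Running psql with these variables merged into its environment avoids putting
the password on the command line.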

If this doesn't work, there are a number of things that could be in the
way, e.g. a firewall that only allows connections from CF, or the PostgreSQL
server being on a different subnet. You can then try using some machine as a
jump box that will allow access to the PostgreSQL server.

Nick

On Wed, Dec 9, 2015 at 9:40 AM Siva Balan <mailsiva(a)gmail.com> wrote:

Hello,
Below is my requirement:
I have a postgres instance deployed on our corporate CF deployment. I
have created a service instance of this postgres and bound my app to it.
Now I need to import a very large dataset (millions of records) into this
postgres instance.
As a CF user, I do not have access to any ports on CF other than 80 and
443. So I am not able to use any of the native postgresql tools to
import the data. I can view and run simple SQL commands on this postgres
instance using the phppgadmin app that is also bound to my postgres service
instance.
Now, what is the best way for me to import this large dataset to my
postgres service instance?
All thoughts and suggestions welcome.

Thanks
Siva Balan

--
http://www.twitter.com/sivabalans

