bosh-lite self-healing


Cornelia Davis <cdavis@...>
 

I've just been trying this today and the resurrector does not seem to be functioning.

Running a new bosh-lite instance on vagrant, just deployed fresh yesterday with stemcell 2776.

I wsh'd into one of the warden containers and stopped the agent. BOSH does see this, as follows:

+-----------------+--------------------+----------------------+------------+
| Job/index | State | Resource Pool | IPs |
+-----------------+--------------------+----------------------+------------+
| unknown/unknown | unresponsive agent | | |
| mysql/0 | running | common-resource-pool | 10.244.0.2 |
| wordpress/0 | running | common-resource-pool | 10.244.0.6 |
+-----------------+--------------------+----------------------+------------+

But the resurrector never recovers it.


Dmitriy Kalinin
 

I've tried it just now with my deployment and saw that the Director ran a `scan and fix` task after the HM saw the missing agent.

```
Director task 16
Started scanning 1 vms
Started scanning 1 vms > Checking VM states. Done (00:00:10)
Started scanning 1 vms > 0 OK, 1 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
Done scanning 1 vms (00:00:10)

Started applying problem resolutions > unresponsive_agent 2: Recreate VM. Done (00:01:19)

Task 16 done

Started 2015-12-04 05:26:46 UTC
Finished 2015-12-04 05:28:15 UTC
Duration 00:01:29
```
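
To check whether your own director has run such a task, something like the following may help. It is a sketch that assumes the Ruby `bosh` CLI of that era, where `bosh tasks recent` accepts a count and a `--no-filter` flag to include internal task types, and that the task description contains the text `scan and fix`. The function names are made up for illustration.

```shell
# List recent director tasks, including internal ones started by the HM.
# Assumes the v1 `bosh` CLI is already targeted at the bosh-lite director.
recent_tasks() {
  bosh tasks recent "${1:-10}" --no-filter
}

# Succeeds if any of the recent tasks mentions "scan and fix".
has_scan_and_fix() {
  recent_tasks "$1" | grep -q "scan and fix"
}
```

For example, `has_scan_and_fix 20 && echo "resurrector ran"` looks at the last 20 tasks. If nothing shows up a few minutes after the agent was stopped, the HM is probably not reaching the director.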



Cornelia Davis <cdavis@...>
 

No such task from my director. Any suggestions on how I might go about
figuring out why not?

--
Cornelia Davis
(805) 452 8941


Dr Nic Williams
 

Cornelia, did you change/create bosh users in lieu of the default admin/admin user?

If so, then perhaps the HM cannot connect to the director.



Cornelia Davis <cdavis@...>
 

I didn't change any passwords, but Nic, that was the key. The password isn't
set correctly in the health_monitor config file. Thanks! I'm up and running now.
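
One way to inspect the credentials the HM is using is to print the director section of its config file. The sketch below assumes the file is YAML with a top-level `director:` key and that it lives at the path shown. Both are assumptions about the director VM layout, not something stated in this thread, and `show_director_section` is a hypothetical helper name.

```shell
# Assumed location of the health monitor config on the director VM.
HM_CONFIG="${HM_CONFIG:-/var/vcap/jobs/health_monitor/config/health_monitor.yml}"

# Print the top-level `director:` section of the given YAML file.
show_director_section() {
  awk '
    /^director:/         { printing = 1; print; next }  # section starts here
    printing && /^[^ \t#]/ { exit }                      # next top-level key ends it
    printing             { print }
  ' "$1"
}
```

Running `show_director_section "$HM_CONFIG"` on the bosh-lite VM (for example after `vagrant ssh`) shows the user and password the HM will try, which can then be compared with the director's actual users.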



Dmitriy Kalinin
 

Are you sure you are on the latest version of bosh-lite? The HM has been using admin/admin in bosh-lite for some time now.



Casey West
 

I recently ran into an issue where despite having an up-to-date git repo
for bosh-lite, my base vagrant box was a bit outdated and I had to update
that manually using `vagrant box update`.

You can use `vagrant box outdated` to find out if this is a problem for you
as well.

Do these from your bosh-lite checkout.

Best,
— Casey
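
The two Vagrant commands mentioned above can be wrapped up roughly as follows. This is a sketch: the function name is hypothetical, and the argument is wherever your bosh-lite checkout happens to be.

```shell
# Check for, then fetch, a newer base box for the bosh-lite checkout in $1.
refresh_bosh_lite_box() {
  # Run in a subshell so the caller's working directory is left unchanged.
  (
    cd "$1" || exit 1
    vagrant box outdated   # reports whether a newer box version exists
    vagrant box update     # downloads the newer box, if any
  )
}
```

For example, `refresh_bosh_lite_box ~/workspace/bosh-lite` (the path is only an example). Note that updating the box does not change a Vagrant machine that is already running. The new box only takes effect once the VM is destroyed and brought up again, which also discards anything deployed inside it.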



Cornelia Davis <cdavis@...>
 

Bingo Casey. That was it! Thanks all.
