Testing behaviour of a production CF environment


Graham Bleach
 

Hello,

What do you use to test the behaviour of your production environments?

We are currently not live and are running these in each environment:
- cf-smoke-tests, to check that core functionality is working
- cf-acceptance-tests, to test behaviour in more detail
- our own custom acceptance tests, covering code we've written and behaviour
we've configured and care about not breaking
- external monitoring against some deployed apps

We need to stop running cf-acceptance-tests in production, because they
sometimes cause problems if they exit prematurely and, e.g., leave an
unexpected buildpack as the first buildpack in the list. So we could run
those tests only in our CI environment every time we change something.

However we'd like to identify behaviour changes that aren't caused by our
changes and don't occur in our CI environment. For example, we recently
uncovered a problem with an infrastructure product that we only noticed by
running smoke-tests in production - that error didn't happen in other
environments. We're worried about the coverage we'd lose by not running the
tests.

One option that seems appealing to us is to try to work out a way of
running just the "safe" acceptance tests. For our purposes, "safe" tests
probably mean ones that don't need to run as admin and that could run in
their own org - the isolation features of CF probably protect us enough
against impacting other people using CF.

But it doesn't seem obvious how to run such a subset of the acceptance
tests today, or how to do so in a way that's likely to remain stable in the
future - hence this question.

Graham


Amit Kumar Gupta
 

Hi Graham,

Your approach sounds good. What you are doing (and plan to do) in CI sounds
right, as does your plan for production (namely, running everything you have
in CI except cf-acceptance-tests). In the README for cf-acceptance-tests, we
state:

These tests are not intended for use against production systems, they are
intended for acceptance environments for teams developing Cloud Foundry
itself. While these tests attempt to clean up after themselves, there is no
guarantee that they will not mutate state of your system in an undesirable
way.

If you're going to run critical workloads in production, I'd recommend
having a staging environment where you roll out a CF upgrade before you
roll it out to production.

We are actually already tracking an issue related to buildpacks not being
cleaned up:

https://www.pivotaltracker.com/story/show/115199031

But as you'll be able to see, it's not the highest priority at the moment.

The README attempts to give some idea of whether test suites are unsafe to
run in certain contexts:

https://github.com/cloudfoundry/cf-acceptance-tests#explanation-of-test-suites

And the section on Test Execution explains how you can skip test suites,
tests matching a certain regex, etc:

https://github.com/cloudfoundry/cf-acceptance-tests#test-execution

Best,
Amit



Daniel Jones
 

Hi Graham,

Running acceptance tests in production is absolutely what I'd recommend -
in fact I drove that point home in my talk in Santa Clara last week (I can
forward on the link once the YouTube videos are up).

I've worked with customers who didn't use the official CATS, but instead
favoured writing their own in the BDD framework of their choice. We didn't
find them too onerous to develop and maintain, and an example test would be:

1. Push fixture app
2. Start app
3. Hit app, validate response
4. Hit URL on app to write to a given data service
5. Hit URL to read written value, validate
6. Stop app
7. Delete app

This exercised some of the core user-facing behaviour, and also that of
data services (search for Pivotal's apps like cf-redis-example-app
<https://github.com/pivotal-cf/cf-redis-example-app>, which follow the same
pattern). We had additional tests that would log a given unique string
through an app, and then hit the log aggregation system to validate that it
had made its way through. The tests were small, so we had more granular
control over the frequency of each test, and got faster feedback through
parallelisation.
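
To make that concrete, here is a minimal sketch of what such a test might
look like in Go with Ginkgo/Gomega (the framework CATS itself uses),
shelling out to the cf CLI. The app name, route and fixture path are
hypothetical, and the data-service read/write steps would follow the same
pattern against additional endpoints:

package acceptance_test

import (
    "io/ioutil"
    "net/http"
    "os/exec"
    "testing"

    . "github.com/onsi/ginkgo"
    . "github.com/onsi/gomega"
)

func TestAcceptance(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Acceptance Suite")
}

// run shells out to the cf CLI and fails the spec if the command errors.
func run(args ...string) {
    out, err := exec.Command("cf", args...).CombinedOutput()
    Expect(err).NotTo(HaveOccurred(), string(out))
}

var _ = Describe("a pushed fixture app", func() {
    appName := "acceptance-fixture"                    // hypothetical name
    appURL := "https://acceptance-fixture.example.com" // hypothetical route

    BeforeEach(func() {
        run("push", appName, "-p", "fixtures/simple-app", "--no-start")
        run("start", appName)
    })

    AfterEach(func() {
        run("stop", appName)
        run("delete", appName, "-f", "-r")
    })

    It("responds on its route", func() {
        resp, err := http.Get(appURL)
        Expect(err).NotTo(HaveOccurred())
        defer resp.Body.Close()

        body, _ := ioutil.ReadAll(resp.Body)
        Expect(resp.StatusCode).To(Equal(http.StatusOK))
        Expect(string(body)).NotTo(BeEmpty())
    })
})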

Running these sorts of tests against each Cloud Foundry instance on a CI
server with a wallboard view worked really well. Not only do you get volume
testing for free (I've filled a buildpack cache that way
<http://www.engineerbetter.com/update/2015/08/19/overflowing-buildpack_cache.html>),
but you can also publish the wallboard URL to PaaS customers and
stakeholders alike. Tying these tests to alerting/paging systems is also
more sensible than paging people on IaaS-level failures.

It sounds like you're doing the right thing, and I'd encourage you to
continue and expand your efforts in that area. I'm happy to discuss more if
this is an area of interest for you.

Regards,
Daniel Jones - CTO
+44 (0)79 8000 9153
@DanielJonesEB <https://twitter.com/DanielJonesEB>
*EngineerBetter* Ltd <http://www.engineerbetter.com> - UK Cloud Foundry
Specialists



Graham Bleach
 

Hi Amit,

Thanks for your reply.

On 1 June 2016 at 04:06, Amit Gupta <agupta(a)pivotal.io> wrote:

The README attempts to give some idea of whether test suites are unsafe to
run in certain contexts:


https://github.com/cloudfoundry/cf-acceptance-tests#explanation-of-test-suites

And the section on Test Execution explains how you can skip test suites,
tests matching a certain regex, etc:

https://github.com/cloudfoundry/cf-acceptance-tests#test-execution

We've been looking through the test suites and can't see a straightforward
way to run only the "safe" tests that won't affect normal users / don't
require an admin user.

For instance, the apps suite includes both tests we'd like to run (those
covering core user-facing behaviour) and the admin buildpack lifecycle test
with the issue you linked to. I don't think skipping based on regexes over
test names works well, both because the regex would become long quite
quickly and because cf-acceptance-tests is a moving target - each time we
upgraded to a new release we'd need to review which tests had been added
and update our regexes.

I wondered if other people would be interested in having a way to run only
the "non-admin" tests? If so, perhaps re-organising the suites to enable
that would be a welcome change?

Regards,
Graham


Graham Bleach
 

On 1 June 2016 at 09:22, Daniel Jones <daniel.jones(a)engineerbetter.com>
wrote:

Running acceptance tests in production is absolutely what I'd recommend -
in fact I drove that point home in my talk in Santa Clara last week (I can
forward on the link once the YouTube videos are up).

Sounds very relevant; I'll look forward to the video.


I've worked with customers who didn't use the official CATS, but instead
favoured writing their own in the BDD framework of their choice. We didn't
find them too onerous to develop and maintain, and an example test would be:

1. Push fixture app
2. Start app
3. Hit app, validate response
4. Hit URL on app to write to a given data service
5. Hit URL to read written value, validate
6. Stop app
7. Delete app

This exercised some of the core user-facing behaviour, and also that of
data services (search for Pivotal's apps like cf-redis-example-app
<https://github.com/pivotal-cf/cf-redis-example-app>, which follow the
same pattern). We had additional tests that would log a given unique string
through an app, and then hit the log aggregation system to validate that it
had made its way through. The tests were small, so we had more granular
control over the frequency of each test, and got faster feedback through
parallelisation.

We have added tests for things we've built / configured; we borrowed a fair
amount of their style from CATS:
https://github.com/alphagov/paas-cf/tree/master/tests/src/acceptance

In principle I think the conversations / decisions about which behaviour
should be tested are valuable, as is having tests written in a language /
framework that's understood by the team, so I can understand why people
would do this.

I don't think this works for us for things that are already tested in
CATS, though, as it feels like duplication of effort, both to write and to
maintain the tests. That's why I'm interested in the idea of moving tests
around within CATS to enable people to run a subset of tests that we
consider to be production-safe.

Graham


Amit Kumar Gupta
 

Hi Graham,

Something like that would be nice. However, technically every test needs
admin, because in its Before hooks the suite uses the admin user to create
an org, quota, etc. It sounds like what you want is to run tests that don't
require "admin" *in a way that might affect other users*. Even if we could
split up the tests along those lines right now, it would be hard to enforce
moving forward, since CATS is such a moving target and is touched by so
many different teams.
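
Concretely, the suite-level setup looks roughly like the following sketch -
the admin user is needed once, before any spec runs, to carve out an
isolated org, quota, space, and regular user for the tests. The names,
values, and the run helper here are illustrative only, not the literal CATS
code:

package setup_test

import (
    "os/exec"
    "testing"

    . "github.com/onsi/ginkgo"
    . "github.com/onsi/gomega"
)

func TestSetup(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Setup Sketch")
}

// run shells out to the cf CLI and fails the suite if the command errors.
func run(args ...string) {
    out, err := exec.Command("cf", args...).CombinedOutput()
    Expect(err).NotTo(HaveOccurred(), string(out))
}

// Admin credentials are used up front to create an isolated org and quota;
// the specs themselves then run as an ordinary user inside that org.
var _ = BeforeSuite(func() {
    run("auth", "admin", "admin-password") // credentials come from the test config in reality
    run("create-quota", "cats-quota", "-m", "10G", "-r", "1000", "-s", "100")
    run("create-org", "cats-org")
    run("set-quota", "cats-org", "cats-quota")
    run("create-space", "cats-space", "-o", "cats-org")
    run("create-user", "cats-user", "cats-password")
    run("set-space-role", "cats-user", "cats-org", "cats-space", "SpaceDeveloper")
    run("target", "-o", "cats-org", "-s", "cats-space")
})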

In principle, I'd like CATS to be able to guarantee that it will clean up
after itself, do things with minimal effect on other users, and even be
something that could be safe to run in a production environment. For that,
I'd like to have some automated way to ensure CATS adheres to this
contract. Right now nothing like this exists, so we don't make the
guarantee that it's safe to run against prod. That said, it would be nice
to have that option in the future, so I'd like to keep CATS as
self-contained as possible right now. For now, the best thing we can do is
identify tests, setup, or patterns that definitely would be bad for a prod
environment and fix them.
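
One mechanism that could eventually support such a contract - purely
illustrative, nothing like this exists in CATS today - would be to gate the
riskier specs on explicit configuration rather than on test-name regexes,
for example:

package gating_test

import (
    "os"
    "testing"

    . "github.com/onsi/ginkgo"
    . "github.com/onsi/gomega"
)

func TestGating(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Gating Sketch")
}

// includeAdminTests is a stand-in for a field in the suite's parsed config.
var includeAdminTests = os.Getenv("INCLUDE_ADMIN_TESTS") == "true"

var _ = Describe("admin buildpack lifecycle", func() {
    It("creates, updates and deletes a buildpack", func() {
        if !includeAdminTests {
            Skip("admin-only tests are disabled for this environment")
        }
        // ... the admin-only assertions would go here ...
    })
})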

Have you been able to identify a complete list of problematic tests? We
can tackle them one by one and try to figure out which projects need
issues/PRs opened against them. For example, for the buildpacks issue, I
think it might be desirable to limit the scope of a buildpack to a space or
org, though it would require some thought to figure out how to make sense
of buildpack priority.

Best,
Amit
