Ansible Tower 3.1 introduces Clustering as an alternate approach to redundancy, replacing the active-passive redundancy solution that relied on primary and secondary instances. For versions older than 3.1, refer to the older versions of this chapter of the Ansible Tower Administration Guide.
Clustering is sharing load between hosts. Each instance should be able to act as an entry point for UI and API access. This should enable Tower administrators to use load balancers in front of as many instances as they wish and maintain good data visibility.
Note
Load balancing is optional, and it is entirely possible to have ingress on one or all instances as needed.
Each instance should be able to join the Tower cluster and expand its ability to execute jobs. This is a simple system where jobs can and will run anywhere rather than be directed on where to run. Ansible Tower 3.2 introduced the ability for clustered instances to be grouped into different pools/queues.
Instances can be grouped into one or more Instance Groups. Instance groups can be assigned to one or more of the resources listed below.
- Organizations
- Inventories
- Job Templates
When a job associated with one of the resources executes, it will be assigned to the instance group associated with the resource. During the execution process, instance groups associated with Job Templates are checked before those associated with Inventories. Similarly, instance groups associated with Inventories are checked before those associated with Organizations. Thus, Instance Group assignments for the three resources form a hierarchy: Job Template > Inventory > Organization.
Supported Operating Systems
The following operating systems are supported for establishing a clustered environment:
Important considerations to note in the new clustering environment:
- [...] tower group in the inventory AND it needs to be the first host listed in the tower group.
- The inventory file for Tower deployments should be saved/persisted. If new instances are to be provisioned, the passwords and configuration options, as well as host names, must be made available to the installer.
- [...] instance_group_tower.
Provisioning new instances involves updating the inventory file and re-running the setup playbook. It is important that the inventory file contains all passwords and information used when installing the cluster, or other instances may be reconfigured. The current standalone instance configuration does not change for a 3.1 or later deployment. The inventory file does change in some important ways:
- [...] tower.
- [...] instance_group_. Instances are not required to be in the tower group alongside other instance_group_ groups, but one instance must be present in the tower group. Technically, tower is a group like any other instance_group_ group, but it must always be present, and if a specific group is not associated with a specific resource, then job execution will always fall back to the tower group.
Instances in the tower group are responsible for housekeeping tasks like determining where jobs are supposed to be launched and processing playbook events. Moreover, if all tower instance group members fail, then jobs might not be able to run and playbook events might not get written. Therefore, it is important to have enough cluster instances in the tower group not only to handle housekeeping tasks, but also to act as backup in the event of a failure.
[tower]
hostA
hostB
hostC
[instance_group_east]
hostB
hostC
[instance_group_west]
hostC
hostD
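The group layout above can be checked mechanically before running the installer. The following is a minimal sketch, not part of the Tower installer: it assumes the inventory contains only bare host names (no per-host variables), and the helper names `load_groups`, `groups_for`, and `check_groups` are hypothetical.

```python
import configparser

# The inventory from the example above, embedded as a string for illustration.
INVENTORY = """
[tower]
hostA
hostB
hostC

[instance_group_east]
hostB
hostC

[instance_group_west]
hostC
hostD
"""


def load_groups(text):
    """Parse a simple INI-style inventory into {group name: [host, ...]}."""
    parser = configparser.ConfigParser(allow_no_value=True, delimiters=("=",))
    parser.optionxform = str  # keep host names case-sensitive
    parser.read_string(text)
    return {section: list(parser[section]) for section in parser.sections()}


def groups_for(host, groups):
    """Return every group that lists the given host."""
    return [name for name, hosts in groups.items() if host in hosts]


def check_groups(groups):
    """Return a list of problems; the only rule checked is the one stated above."""
    problems = []
    if not groups.get("tower"):
        problems.append("the tower group must contain at least one instance")
    return problems


groups = load_groups(INVENTORY)
print(groups_for("hostC", groups))
print(check_groups(groups))
```

In this layout hostC belongs to all three groups, while hostD belongs only to instance_group_west, which is allowed because instances are not required to be in the tower group.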
Note
If no groups are selected for a resource then the tower group is used, but if any other group is selected, then the tower group will not be used in any way.
The database group remains for specifying an external Postgres. If the database host is provisioned separately, this group should be empty:
[tower]
hostA
hostB
hostC
[database]
hostDB
It is common to provision Tower instances externally, but it is best to reference them by internal addressing. This is most significant for RabbitMQ clustering where the service is not available at all on an external interface. For this purpose, it is necessary to assign the internal address for RabbitMQ links as such:
[tower]
hostA rabbitmq_host=10.1.0.2
hostB rabbitmq_host=10.1.0.3
hostC rabbitmq_host=10.1.0.4
Note
The number of instances in a cluster should always be an odd number and it is strongly recommended that a minimum of three Tower instances be in a cluster.
The redis_password field is removed from [all:vars].
Fields for RabbitMQ are as follows:
- rabbitmq_port=5672: RabbitMQ is installed on each instance and is not optional; it is also not possible to externalize it. This setting configures the port it listens on.
- rabbitmq_vhost=tower: Controls the RabbitMQ virtual host that Tower configures to isolate itself.
- rabbitmq_username=tower and rabbitmq_password=tower: Each instance and each instance’s Tower instance are configured with these values. This is similar to Tower’s other uses of usernames/passwords.
- rabbitmq_cookie=<somevalue>: This value is unused in a standalone deployment but is critical for clustered deployments. It acts as the secret key that allows RabbitMQ cluster members to identify each other.
- rabbitmq_use_long_names: RabbitMQ is sensitive to what each instance is named. Tower is flexible enough to allow FQDNs (host01.example.com), short names (host01), or IP addresses (192.168.5.73). Depending on what is used to identify each host in the inventory file, this value may need to be changed:
  - For FQDNs and IP addresses, this value needs to be true.
  - For short names, set the value to false.
  - If you are using localhost, do not change the default setting of rabbitmq_use_long_name=false to true.
  - If instances are provisioned to where they reference other instances internally and not on external addresses, then the value for the long name should follow the internal addressing format (see rabbitmq_host above).
- rabbitmq_enable_manager: Set this to true to expose the RabbitMQ Management Web Console on each instance.
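The long-name rules above can be expressed as a small decision helper. This is only an illustrative sketch: the function name `needs_long_names` is hypothetical, and treating any non-IP name that contains a dot as an FQDN is a simplifying assumption.

```python
import ipaddress


def needs_long_names(host):
    """Suggest a long-name setting for one inventory host identifier.

    Follows the rules above: true for FQDNs and IP addresses,
    false for short names and for localhost.
    """
    if host == "localhost":
        return False
    try:
        ipaddress.ip_address(host)  # raises ValueError if host is not an IP
        return True
    except ValueError:
        pass
    # Assumption: a dotted name is an FQDN, an undotted name is a short name.
    return "." in host


for name in ("host01.example.com", "host01", "192.168.5.73", "localhost"):
    print(name, needs_long_names(name))
```

Because the setting applies to the whole cluster, all hosts in the inventory should be identified in the same style so that the helper gives the same answer for each of them.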
The following configuration shows the default settings for RabbitMQ:
rabbitmq_port=5672
rabbitmq_vhost=tower
rabbitmq_username=tower
rabbitmq_password=''
rabbitmq_cookie=cookiemonster
Note
rabbitmq_cookie is a sensitive value; it should be treated like the secret key in Tower.
Ports and instances used by Tower are as follows:
Clustering/RabbitMQ ports:
Ansible Tower versions 3.2 and later added the ability to optionally define isolated groups inside security-restricted networking zones from which to run jobs and ad hoc commands. Instances in these groups will not have a full installation of Tower, but will have a minimal set of utilities used to run jobs. Isolated groups must be specified in the inventory file prefixed with isolated_group_. Below is an example of an inventory file for an isolated instance group.
[tower]
towerA
towerB
towerC
[instance_group_security]
towerB
towerC
[isolated_group_govcloud]
isolatedA
isolatedB
[isolated_group_govcloud:vars]
controller=security
In the isolated instance group model, “controller” instances interact with “isolated” instances via a series of Ansible playbooks over SSH. At installation time, by default, a randomized RSA key is generated and distributed as an authorized key to all “isolated” instances. The private half of the key is encrypted and stored within the Tower database, and is used to authenticate from “controller” instances to “isolated” instances when jobs are run.
When a job is scheduled to run on an “isolated” instance:
- [...] ansible/ansible-playbook. As the playbook runs, job artifacts (such as stdout and job events) are written to disk on the “isolated” instance.
Isolated groups (nodes) may be created in a way that allows them to exist inside of a VPC with security rules that only permit the instances in their controller group to access them; only ingress SSH traffic from “controller” instances to “isolated” instances is required. When provisioning isolated nodes, your install machine must have connectivity to the isolated nodes. In cases where an isolated node is not directly accessible but can be reached indirectly through other hosts, you can designate a “jump host” by using ProxyCommand in your SSH configuration to specify the jump host and then run the installer.
The recommended system configurations with isolated groups are as follows:
- [...] isolated_group_tower.
If you wish to manage SSH authentication from “controller” nodes to “isolated” nodes via some system outside of Tower (such as externally-managed passwordless SSH keys), you can disable the automatic key generation by changing the AWX_ISOLATED_KEY_GENERATION Tower API setting:
HTTP PATCH /api/v2/settings/jobs/ {"AWX_ISOLATED_KEY_GENERATION": false}
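As a rough illustration, the PATCH request above could be prepared from Python with the standard library. This is a sketch under assumptions: the host name tower.example.com and the helper name `build_key_generation_patch` are hypothetical, and authentication is left to the caller because it depends on how your Tower is configured. The request is only built here, not sent.

```python
import json
import urllib.request


def build_key_generation_patch(base_url, headers=None):
    """Build (but do not send) the PATCH that disables isolated key generation."""
    body = json.dumps({"AWX_ISOLATED_KEY_GENERATION": False}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v2/settings/jobs/",
        data=body,
        method="PATCH",
        # Caller-supplied headers (for example, credentials) are merged in.
        headers={"Content-Type": "application/json", **(headers or {})},
    )


# Hypothetical Tower address, used for illustration only.
request = build_key_generation_patch("https://tower.example.com")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually apply the change, pass the request to `urllib.request.urlopen` together with whatever credentials your installation requires.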
Tower itself reports as much status as it can via the Browsable API at /api/v2/ping in order to provide validation of the health of the cluster, including:
View more details about Instances and Instance Groups, including running jobs and membership information at /api/v2/instances/ and /api/v2/instance_groups/.
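A health check built on /api/v2/ping might summarize the response along these lines. The sketch works on an already-decoded JSON payload; the field names in the sample (`instances`, `node`, `heartbeat`, `instance_groups`, `name`) and the sample values are assumptions for illustration and should be checked against the response of your own Tower version.

```python
def summarize_ping(payload):
    """Turn a decoded /api/v2/ping payload into human-readable lines."""
    lines = []
    for inst in payload.get("instances", []):
        lines.append(f"{inst.get('node')}: last heartbeat {inst.get('heartbeat')}")
    for group in payload.get("instance_groups", []):
        members = ", ".join(group.get("instances", []))
        lines.append(f"group {group.get('name')}: {members}")
    return lines


# Hypothetical sample payload; field names and values are assumptions.
sample = {
    "instances": [
        {"node": "hostA", "heartbeat": "2018-01-01T00:00:00Z"},
        {"node": "hostB", "heartbeat": "2018-01-01T00:00:05Z"},
    ],
    "instance_groups": [{"name": "tower", "instances": ["hostA", "hostB"]}],
}

for line in summarize_ping(sample):
    print(line)
```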
Each Tower instance is made up of several different services working collaboratively:
Tower is configured in such a way that if any of these services or their components fail, then all services are restarted. If these fail sufficiently often in a short span of time, then the entire instance will be placed offline in an automated fashion in order to allow remediation without causing unexpected behavior.
The way jobs are run and reported to a ‘normal’ user of Tower does not change. On the system side, some differences are worth noting:
If a cluster is divided into separate instance groups, then the behavior is similar to the cluster as a whole. If two instances are assigned to a group then either one is just as likely to receive a job as any other in the same group.
As Tower instances are brought online, they effectively expand the work capacity of the Tower system. If those instances are also placed into instance groups, then they also expand that group’s capacity. If an instance is performing work and it is a member of multiple groups, then capacity will be reduced from all groups of which it is a member. De-provisioning an instance will remove capacity from the cluster wherever that instance was assigned. See Deprovision Instances and Instance Groups in the next section for more details.
Note
Not all instances are required to be provisioned with an equal capacity.
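The effect of shared membership on group capacity can be illustrated with a toy model. This is a simplified sketch, not Tower's actual capacity algorithm: the capacity numbers are invented, and a group's remaining capacity is assumed to be the sum of what its member instances have left.

```python
def group_remaining(groups, capacity, consumed):
    """Remaining capacity per group, given per-instance capacity and usage."""
    return {
        name: sum(capacity[host] - consumed.get(host, 0) for host in hosts)
        for name, hosts in groups.items()
    }


groups = {
    "tower": ["hostA", "hostB", "hostC"],
    "instance_group_east": ["hostB", "hostC"],
    "instance_group_west": ["hostC", "hostD"],
}
# Invented numbers; hostD is deliberately smaller than the others.
capacity = {"hostA": 100, "hostB": 100, "hostC": 100, "hostD": 50}

idle = group_remaining(groups, capacity, consumed={})
busy = group_remaining(groups, capacity, consumed={"hostC": 30})
print(idle)
print(busy)
```

Work running on hostC lowers the remaining capacity of all three groups in this model, because hostC is a member of each of them.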
Project updates behave differently than they did before. Previously, they were ordinary jobs that ran on a single instance. It is now important that they run successfully on any instance that could potentially run a job. Projects will now sync themselves to the correct version on the instance immediately prior to running the job.
If an instance group is configured but all instances in that group are offline or unavailable, any jobs that are launched targeting only that group will be stuck in a waiting state until instances become available. Fallback or backup resources should be provisioned to handle any work that might encounter this scenario.
By default, when a job is submitted to the tower queue, it can be picked up by any of the workers. However, you can control where a particular job runs, such as restricting the instances on which a job runs. If any of the job template, inventory, or organization has instance groups associated with it, a job run from that job template will not be eligible for the default behavior. That means that if all of the instances inside of the instance groups associated with these three resources are out of capacity, the job will remain in the pending state until capacity becomes available.
The order of preference in determining which instance group to submit the job to is as follows:
- job template
- inventory
- organization
If instance groups are associated with the job template, and all of these are at capacity, then the job will be submitted to instance groups specified on inventory, and then organization. Jobs should execute in those groups in preferential order as resources are available.
The global tower group can still be associated with a resource, just like any of the custom instance groups defined in the playbook. This can be used to specify a preferred instance group on the job template or inventory, but still allow the job to be submitted to any instance if those are out of capacity.
As an example, by associating group_a with a Job Template and also associating the tower group with its inventory, you allow the tower group to be used as a fallback in case group_a runs out of capacity.
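The selection rules described in this section can be sketched as a small function. This is a toy model rather than Tower's scheduler: the function name `pick_group`, the capacity numbers, and the idea of comparing a single "needed" value against each group's remaining capacity are all assumptions made for illustration.

```python
def pick_group(job_template, inventory, organization, remaining, needed):
    """Return the first group with enough capacity, or None if the job must wait.

    Groups are tried in the order job template > inventory > organization.
    The tower group is used only when no group is associated at all,
    or when it is explicitly associated with one of the resources.
    """
    preferred = [*job_template, *inventory, *organization]
    if not preferred:
        preferred = ["tower"]
    for name in preferred:
        if remaining.get(name, 0) >= needed:
            return name
    return None  # the job stays pending until capacity becomes available


remaining = {"group_a": 0, "tower": 50}

# group_a on the job template, tower on the inventory: tower acts as fallback.
print(pick_group(["group_a"], ["tower"], [], remaining, needed=10))
# group_a only: tower is not used, so the job waits.
print(pick_group(["group_a"], [], [], remaining, needed=10))
# No groups associated anywhere: the default tower group is used.
print(pick_group([], [], [], remaining, needed=10))
```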
In addition, it is possible to not associate an instance group with one resource but designate another resource as the fallback. For example, you can leave a job template without an instance group and have it fall back to the inventory and/or the organization’s instance group.
This presents two other great use cases:
Likewise, an administrator could assign multiple groups to each organization as desired, as in the following scenario:
- There are three instance groups: A, B, and C. There are two organizations: Org1 and Org2.
- The administrator assigns group A to Org1, group B to Org2, and then assigns group C to both Org1 and Org2 as an overflow for any extra capacity that may be needed.
- The organization administrators are then free to assign inventory or job templates to whichever group they want (or just let them inherit the default order from the organization).
Arranging resources in this way offers a lot of flexibility. Also, you can create instance groups with only one instance, thus allowing you to direct work towards a very specific host in the Tower cluster.
Deprovisioning Tower does not automatically deprovision instances since clusters do not currently distinguish between an instance that was taken offline intentionally and one that failed. Instead, shut down all services on the Tower instance and then run the deprovisioning tool from any other instance:
- Shut down the instance or stop the service with the command ansible-tower-service stop.
- Run the deprovision command awx-manage deprovision_instance --hostname=<name used in inventory file> from another instance to remove it from the Tower cluster registry AND the RabbitMQ cluster registry.
Example:
awx-manage deprovision_instance --hostname=hostB
Similarly, deprovisioning instance groups in Tower does not automatically deprovision or remove instance groups, even though re-provisioning will often cause these to be unused. They may still show up in API endpoints and stats monitoring. These groups can be removed with the following command:
Example:
awx-manage unregister_queue --queuename=<name>
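If deprovisioning is scripted, the commands above can be assembled as argument lists before being handed to a runner. The sketch below only builds and prints the commands; the helper names are hypothetical, and how the commands reach the right machines (for example over SSH) is left open.

```python
import shlex


def deprovision_commands(hostname):
    """Commands for removing one instance, in the order described above."""
    return [
        # Run on the instance that is being removed.
        ["ansible-tower-service", "stop"],
        # Run on any other instance in the cluster.
        ["awx-manage", "deprovision_instance", f"--hostname={hostname}"],
    ]


def unregister_queue_command(queuename):
    """Command for removing an instance group that is no longer used."""
    return ["awx-manage", "unregister_queue", f"--queuename={queuename}"]


for argv in deprovision_commands("hostB"):
    print(shlex.join(argv))
print(shlex.join(unregister_queue_command("instance_group_east")))
```

Keeping the commands as lists avoids shell quoting problems if they are later passed to `subprocess.run`.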