Great question. As is often the case with questions about best practices, there are multiple considerations, and ultimately you should decide what works best for your unique environment.
First of all, remember that it is important for your target agent to be close to the target database so that network communication between the agent and the target does not become a bottleneck. This means your agent should be at least in the same data center as the target database, or, when running in the cloud, in the same availability zone.
A database like Greenplum is configured to use all system resources and should be balanced so that no single resource becomes a bottleneck until the system maxes out compute power, i.e. becomes CPU-bound. With that in mind, it is arguably best to avoid running the HVR agent on one of the active nodes in the Greenplum cluster, because (1) the gpfdist utility – which HVR uses on the agent environment to get data into Greenplum as fast as possible – is a fairly resource-intensive program, and (2) even when running on the master node, access to the database may be compromised when too many resources are in use. We do commonly see Greenplum customers use the standby master node (which is idle except in a failover scenario) to run the HVR agent, which can be a good choice to make the best use of available (and typically powerful) hardware resources.
HVR agents get instructions from a hub to process data or perform queries, and are otherwise stateless. As a result, a single agent can serve multiple hubs. Each time a job starts (irrespective of which hub initiates the communication), a connection is established to the agent through the HVR remote listener.
It is possible to install multiple agents on a single server. Just ensure that each agent uses its own values for HVR_HOME and HVR_CONFIG and listens on a different port. Note that when setting up an environment like this, it is crucial to always set the correct values for HVR_HOME and HVR_CONFIG when making changes to an agent's configuration (e.g. when upgrading the software).
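As a rough sketch of what such a setup might look like, the snippet below starts two agents on one server. The installation paths, port numbers, and the purpose of each agent (production vs. QA) are illustrative assumptions, not defaults; check the `hvrremotelistener` options against the documentation for your HVR version.

```shell
# Hypothetical layout: two independent agent installations on one server.
# Each has its own HVR_HOME/HVR_CONFIG and its own listener port.

# Agent 1 (say, serving production), listening on port 4343
export HVR_HOME=/opt/hvr_prod/hvr_home
export HVR_CONFIG=/opt/hvr_prod/hvr_config
$HVR_HOME/bin/hvrremotelistener -d 4343

# Agent 2 (say, serving QA) - note the environment variables are
# re-pointed before starting the second listener on a different port
export HVR_HOME=/opt/hvr_qa/hvr_home
export HVR_CONFIG=/opt/hvr_qa/hvr_config
$HVR_HOME/bin/hvrremotelistener -d 4344
```

The same discipline applies to maintenance: before upgrading or reconfiguring either agent, export the matching HVR_HOME and HVR_CONFIG pair first, or you risk modifying the wrong installation.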
Now regarding your question, here are some considerations:
- gpfdist can be quite resource-intensive when serving the nodes in the Greenplum cluster directly. Running more than a few gpfdist processes concurrently on a single server may overload that server. Of course, resource consumption may be mitigated by staggering scheduled jobs.
- Consolidating onto a single (bigger) server rather than multiple smaller servers provides more flexibility over time to devote resources to the jobs that need them most, e.g. production jobs.
- Having multiple agents enables flexibility in running different versions of the software. E.g. you may want to ensure a new version works well for your QA system before deploying it into production.
- Managing multiple agents on a single server is slightly more complex than having just one agent.
- Consider the use of a load balancer (e.g. in the cloud, some of our customers use the AWS Elastic Load Balancer (ELB)) in front of the agents, to stay flexible in adjusting resources over time (either in the size/configuration of the servers or in the number of servers). This consideration is independent of whether you (initially) decide to run a single agent or multiple agents per server. Behind a load balancer you may end up with multiple servers that all run identical agent configurations, with HVR remote listeners on different ports per version (e.g. an HVR 5.2 listener on port 4352 on all servers and an HVR 5.3 listener on port 4353 on all servers).
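To make the last point concrete, here is a hypothetical startup script that would be run identically on every server behind the load balancer, using the version-to-port mapping from the example above (4352 for HVR 5.2, 4353 for HVR 5.3). The installation paths are illustrative assumptions.

```shell
# Hypothetical per-version listener startup, identical on every server
# behind the load balancer. The LB forwards port 4352 to the 5.2 agents
# and port 4353 to the 5.3 agents, so a hub picks a version by port.

export HVR_HOME=/opt/hvr52/hvr_home
export HVR_CONFIG=/opt/hvr52/hvr_config
$HVR_HOME/bin/hvrremotelistener -d 4352   # HVR 5.2 agent

export HVR_HOME=/opt/hvr53/hvr_home
export HVR_CONFIG=/opt/hvr53/hvr_config
$HVR_HOME/bin/hvrremotelistener -d 4353   # HVR 5.3 agent
```

Because every server exposes the same ports for the same versions, you can add or remove servers behind the load balancer without touching the hub-side channel definitions.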
Hope this helps.