Elasticsearch EC2 Discovery on AWS

The gist

The Hive and Cortex both use Elasticsearch as their data store. I wanted to make sure that I could deploy a resilient cluster, but the documentation did not say much about how to do this in an AWS environment. After some research and a bit of Google-fu, I got it to work. I’m assuming that if you’re reading this, you’re trying to do the same. Otherwise, you might be lost :O

I plan to write another post about my setup for The Hive/Cortex. Stay tuned: once it is out, I will link it from this post as well! For now, their GitHub has a great documentation section that does a really good job of explaining what it does and how to deploy it!

Set up ES 6.8

Note: This is the version I am using with The Hive. If you are doing this for a different project, then ignore this.

Preparation

  • Set up an IAM role with the EC2 discovery policy attached
  • Make a note of the es-access-user access key ID and secret access key
  • Attach the policy to the instance

The EC2 discovery policy:

{
  "Statement": [
    {
      "Action": [
        "ec2:DescribeInstances"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    }
  ],
  "Version": "2012-10-17"
}

Install Java and Elasticsearch:

sudo apt-get update
sudo apt-get upgrade

sudo apt-get install openjdk-11-jre-headless

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo apt-get install apt-transport-https
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
sudo apt-get update && sudo apt-get install elasticsearch

Just for debugging purposes, I also like to install these 2 packages:

sudo apt-get install curl
sudo apt-get install jq

Set $JAVA_HOME in /etc/default/elasticsearch:

sudo vi /etc/default/elasticsearch
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

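Now set the JVM heap to about ½ of the memory on the machine, with the minimum (-Xms) and maximum (-Xmx) set to the same value. Elastic recommends giving the heap no more than half of the RAM, because the rest is needed for the operating system's filesystem cache, and keeping it below roughly 32 GB. A heap that is much too small will also result in poor performance. (On an 8 GB machine, half is 4 GB.)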

sudo vim /etc/elasticsearch/jvm.options
-Xms4g
-Xmx4g
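
As a rough sketch of that rule of thumb, the helpers below halve a RAM figure given in whole gigabytes and print the matching jvm.options lines. The function names heap_gb and jvm_heap_lines are made up for this example, and the cap of 31 GB is an assumption based on the general advice to keep the heap under about 32 GB.

```shell
# heap_gb: suggest a JVM heap size in whole GB for a machine with $1 GB of RAM.
# Hypothetical helper, not part of Elasticsearch.
heap_gb() {
  half=$(( $1 / 2 ))
  # Assumed ceiling: stay below ~32 GB of heap.
  if [ "$half" -gt 31 ]; then half=31; fi
  # Never suggest less than 1 GB.
  if [ "$half" -lt 1 ]; then half=1; fi
  echo "$half"
}

# jvm_heap_lines: print the two lines to put in /etc/elasticsearch/jvm.options.
jvm_heap_lines() {
  gb=$(heap_gb "$1")
  printf '%s\n' "-Xms${gb}g" "-Xmx${gb}g"
}
```

For the 8 GB machine used here, jvm_heap_lines 8 prints -Xms4g and -Xmx4g, the same values as above.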

Set up EC2-discovery for AWS:

cd /usr/share/elasticsearch/bin
sudo ./elasticsearch-plugin install discovery-ec2

Set up Elasticsearch Keystore (for both Master + Node):

cd /usr/share/elasticsearch/bin
sudo ./elasticsearch-keystore create
sudo ./elasticsearch-keystore list
sudo ./elasticsearch-keystore add discovery.ec2.access_key   # enter the access key when prompted
sudo ./elasticsearch-keystore add discovery.ec2.secret_key   # enter the secret key when prompted
sudo ./elasticsearch-keystore list

Once you’re done, it should look like this:

ubuntu@ip-x-x-x-x:/usr/share/elasticsearch/bin$ sudo ./elasticsearch-keystore list
discovery.ec2.access_key
discovery.ec2.secret_key
keystore.seed
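
If you are setting up several nodes, a small check like the one below can confirm that both entries made it into the keystore. This is only a sketch: missing_ec2_keys is a hypothetical helper name, and it simply reads the output of the list command on stdin.

```shell
# missing_ec2_keys: read `elasticsearch-keystore list` output on stdin and
# print whichever of the two EC2 discovery settings is absent.
missing_ec2_keys() {
  listing=$(cat)
  for key in discovery.ec2.access_key discovery.ec2.secret_key; do
    # -x matches whole lines only, -F treats the key as a literal string.
    printf '%s\n' "$listing" | grep -qxF "$key" || echo "$key"
  done
}
```

Piping sudo ./elasticsearch-keystore list into missing_ec2_keys prints nothing when both settings are present.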

If you have ES running already:

Disable shard allocation

curl -H "Content-Type: application/json" -XPUT 'localhost:9200/_cluster/settings' -d '{ "persistent": { "cluster.routing.allocation.enable": "none" } }'

Stop ES & Kibana:

sudo systemctl stop elasticsearch
sudo systemctl stop kibana

Edit Elasticsearch settings:

sudo vi /etc/elasticsearch/elasticsearch.yml

Master Node Setting:

cluster.name: hive
node.name: hive

node.master: true
node.data: true
node.ingest: true

# path.data: /var/lib/elasticsearch
# path.logs: /var/log/elasticsearch

network.host: [_ec2_,_local_]

discovery.zen.hosts_provider: ec2
discovery.zen.ping.unicast.hosts: ["x.x.x.x", "x.x.x.x"]
discovery.zen.minimum_master_nodes: 1

discovery.ec2.any_group: true
discovery.ec2.host_type: private_ip

cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
discovery.ec2.tag.es_cluster: "hive-prod-elasticsearch"
discovery.ec2.endpoint: ec2.us-west-2.amazonaws.com

thread_pool.index.queue_size: 100000
thread_pool.search.queue_size: 100000
thread_pool.bulk.queue_size: 100000

Data/Ingest/Master-eligible Node Setting: Use the same config as above, and edit these roles as you see fit for your environment:

node.master: true
node.data: true
node.ingest: true
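
A note on discovery.zen.minimum_master_nodes: the config above uses 1, while every node is master-eligible. The guidance for Zen discovery is a quorum of master-eligible nodes, (N / 2) + 1, to avoid split brain. A minimal sketch of that arithmetic (min_master_nodes is a hypothetical helper name):

```shell
# min_master_nodes: quorum of master-eligible nodes, (N / 2) + 1,
# using integer division. $1 is the number of master-eligible nodes.
min_master_nodes() {
  echo $(( $1 / 2 + 1 ))
}
```

With 3 master-eligible nodes this gives 2. With only 2 it also gives 2, which means no master can be elected while either node is down, so a value of 1 on a two-node cluster trades split-brain protection for availability.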

Start ES and Kibana back up:

sudo systemctl start elasticsearch
sudo systemctl start kibana

Check system health:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

It should say there are 2 nodes running (number_of_nodes is 2).
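
To check the node count from a script rather than by eye, something like the following could work. It is a sketch that pulls number_of_nodes out of the health response with grep. The function name node_count is made up for this example.

```shell
# node_count: read a _cluster/health JSON response on stdin and print
# the value of number_of_nodes.
node_count() {
  # First grep isolates the "number_of_nodes": <n> pair (the quotes keep
  # number_of_data_nodes from matching), the second keeps only the digits.
  grep -o '"number_of_nodes"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$'
}
```

Usage: curl -s 'http://localhost:9200/_cluster/health' | node_count. Since jq was installed earlier, piping the same response into jq .number_of_nodes is an alternative.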

If anything is wrong, check the log:

sudo cat /var/log/elasticsearch/hive.log

Re-enable shard allocation:

curl -H "Content-Type: application/json" -XPUT 'localhost:9200/_cluster/settings' -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'
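
The disable and re-enable calls differ only in the value they send, so they can be wrapped in one helper if you restart nodes often. The sketch below assumes an ES_URL environment variable (defaulting to localhost:9200). The names allocation_body and set_allocation are hypothetical.

```shell
# allocation_body: print the _cluster/settings body for the given mode.
#   none  -> disable shard allocation
#   reset -> clear the setting so the default (all) applies again
allocation_body() {
  case "$1" in
    none)  value='"none"' ;;
    reset) value='null' ;;
    *)     echo "usage: allocation_body none|reset" >&2; return 1 ;;
  esac
  printf '{ "persistent": { "cluster.routing.allocation.enable": %s } }\n' "$value"
}

# set_allocation: send that body to the cluster.
# ES_URL is an assumed variable, not something Elasticsearch defines.
set_allocation() {
  body=$(allocation_body "$1") || return 1
  curl -s -H "Content-Type: application/json" -XPUT \
    "${ES_URL:-localhost:9200}/_cluster/settings" -d "$body"
}
```

Run set_allocation none before stopping a node and set_allocation reset once it has rejoined.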

FAQ

  • Data nodes - store data and execute data-related operations such as search and aggregation.
  • Master nodes - in charge of cluster-wide management and configuration actions such as adding and removing nodes.
  • Client nodes - forward cluster requests to the master node and data-related requests to data nodes.
  • Ingest nodes - pre-process documents before indexing.
  • cluster.name - Any name. Must be the same across all nodes.
  • node.name - Any name.
  • node.master - Whether or not this node is master-eligible, meaning it can be elected to coordinate the activity of the data nodes. In our example, we have 1 master and 2 data nodes. The master can also serve as a data node.
  • node.data - Indicates whether this node can store data.
  • node.ingest - Indicates whether this node can pre-process documents before they are indexed.
  • discovery.zen.hosts_provider - Use ec2 for node discovery.
  • discovery.zen.ping.unicast.hosts - List the IP addresses of all master and data nodes in the cluster. Put each address in quotes and separate them with commas.
  • network.host - The private IP address of this machine, i.e., the one shown by the ifconfig -a command (not the public address that you use to SSH in). In the config above, _ec2_ resolves to the instance's private IPv4 address and _local_ to the loopback address.
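
To illustrate the format that discovery.zen.ping.unicast.hosts expects, here is a small sketch that builds the line from a list of addresses. The helper name unicast_hosts_line is made up, and the addresses in the usage note are placeholders.

```shell
# unicast_hosts_line: print the discovery.zen.ping.unicast.hosts setting
# for the IP addresses given as arguments.
unicast_hosts_line() {
  list=""
  for ip in "$@"; do
    # Quote each address and join the entries with ", ".
    list="${list:+$list, }\"$ip\""
  done
  printf 'discovery.zen.ping.unicast.hosts: [%s]\n' "$list"
}
```

Calling unicast_hosts_line 10.0.0.1 10.0.0.2 prints discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2"], ready to paste into elasticsearch.yml.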

More info about IAM role set up: