The gist
The Hive and Cortex both use Elasticsearch as their data store. I wanted to make sure that I could deploy a resilient cluster. However, the documentation did not have much information on how to do this in an AWS environment. After some research and a bit of Google-fu, I got it to work. I'm assuming that if you're reading this, you're trying to do the same. Otherwise, you might be lost :O
I plan to write another blog post about my setup for The Hive/Cortex. Stay tuned: once it is out, I will link it from this post as well! For now, their GitHub has a great documentation section that does a really good job of explaining what it does and how to deploy it!
Set up ES 6.8
Note: This is the version I am using with The Hive. If you are doing this for a different project, then ignore this.
Preparation
- Set up an IAM role with EC2 discovery policy attached
- Make note of es-access-user KeyID and AWS Secret
- Attach policy to instance
{
  "Statement": [
    {
      "Action": [
        "ec2:DescribeInstances"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install openjdk-11-jre-headless
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo apt-get install apt-transport-https
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
sudo apt-get update && sudo apt-get install elasticsearch
Just for debugging purposes, I also like to install these 2 packages:
sudo apt-get install curl
sudo apt-get install jq
Set $JAVA_HOME in /etc/default/elasticsearch:
sudo vi /etc/default/elasticsearch
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
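Now set the JVM heap size to about half of the memory on the machine (on an 8 GB machine, that is 4 GB). Elastic recommends giving the heap no more than 50% of RAM, so the rest is left for the filesystem cache, and keeping it below roughly 32 GB. Set the minimum (-Xms) and maximum (-Xmx) to the same value.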
sudo vim /etc/elasticsearch/jvm.options
-Xms4g
-Xmx4g
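If you would rather compute the heap size than work it out by hand, here is a minimal sketch. The helper name `heap_flags` and the 31 GB cap are my own assumptions (the cap is just a round number below the ~32 GB limit), so adjust to taste.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print jvm.options heap flags for a machine with the
# given amount of RAM in MB. Uses half of RAM, capped at 31 GB (assumed
# ceiling chosen to stay below the ~32 GB compressed-pointer limit).
heap_flags() {
  local total_mb=$1
  local heap_mb=$(( total_mb / 2 ))
  local cap_mb=$(( 31 * 1024 ))
  if [ "$heap_mb" -gt "$cap_mb" ]; then
    heap_mb=$cap_mb
  fi
  # Same value for min and max so the heap never resizes at runtime.
  printf '%s\n' "-Xms${heap_mb}m" "-Xmx${heap_mb}m"
}

# Example (Linux only): derive the flags from the memory this machine reports.
# total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# heap_flags $(( total_kb / 1024 ))
```

Keep in mind that /proc/meminfo usually reports a little less than the nominal RAM, so on an "8 GB" instance the result will come out slightly under 4096m.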
Set up EC2-discovery for AWS:
cd /usr/share/elasticsearch/bin
sudo ./elasticsearch-plugin install discovery-ec2
Set up Elasticsearch Keystore (for both Master + Node):
cd /usr/share/elasticsearch/bin
sudo ./elasticsearch-keystore create
sudo ./elasticsearch-keystore list
sudo ./elasticsearch-keystore add discovery.ec2.access_key   # enter the access key when prompted
sudo ./elasticsearch-keystore add discovery.ec2.secret_key   # enter the secret key when prompted
sudo ./elasticsearch-keystore list
Once you’re done, it should look like this:
ubuntu@ip-x-x-x-x:/usr/share/elasticsearch/bin$ sudo ./elasticsearch-keystore list
discovery.ec2.access_key
discovery.ec2.secret_key
keystore.seed
If you have ES running already:
Disable shard allocation
curl -H "Content-Type: application/json" -XPUT 'localhost:9200/_cluster/settings' -d '{ "persistent": { "cluster.routing.allocation.enable": "none" } }'
Stop ES & Kibana:
sudo systemctl stop elasticsearch
sudo systemctl stop kibana
Edit Elasticsearch settings:
sudo vi /etc/elasticsearch/elasticsearch.yml
Master Node Setting:
cluster.name: hive
node.name: hive
node.master: true
node.data: true
node.ingest: true
# path.data: /var/lib/elasticsearch
# path.logs: /var/log/elasticsearch
network.host: [_ec2_,_local_]
discovery.zen.hosts_provider: ec2
discovery.zen.ping.unicast.hosts: ["x.x.x.x", "x.x.x.x"]
discovery.zen.minimum_master_nodes: 1
discovery.ec2.any_group: true
discovery.ec2.host_type: private_ip
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
discovery.ec2.tag.es_cluster: "hive-prod-elasticsearch"
discovery.ec2.endpoint: ec2.us-west-2.amazonaws.com
thread_pool.index.queue_size: 100000
thread_pool.search.queue_size: 100000
thread_pool.bulk.queue_size: 100000
Data/Ingest/Master-eligible Node Setting: Use the same config as above, then edit these as you see fit for your environment:
node.master: true
node.data: true
node.ingest: true
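A note on discovery.zen.minimum_master_nodes: the Elasticsearch 6.x docs recommend setting it to a quorum of the master-eligible nodes, (N / 2) + 1, to avoid split brain. If both of your nodes have node.master: true, that works out to 2 rather than the 1 shown above. A small sketch of the arithmetic (the function name is just for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical helper: quorum of master-eligible nodes, (N / 2) + 1,
# using integer division.
min_master_nodes() {
  local eligible=$1
  echo $(( eligible / 2 + 1 ))
}

# Example: two master-eligible nodes.
# min_master_nodes 2
```

The trade-off is that with two master-eligible nodes and a quorum of 2, losing either node leaves the cluster unable to elect a master. Three master-eligible nodes (quorum 2) is the usual way around that.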
Start ES and Kibana back up:
sudo systemctl start elasticsearch
sudo systemctl start kibana
Check system health:
curl -XGET http://localhost:9200/_cluster/health?pretty=true
It should say there are 2 nodes running.
If anything is wrong, check the log:
sudo cat /var/log/elasticsearch/hive.log
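If you want to check the node count from a script instead of eyeballing the health output, something like this works as a rough sketch. The helper names are made up for this example. I'm using grep here so it runs without extra packages; with jq installed, `jq .number_of_nodes` does the same job more robustly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a _cluster/health JSON response on stdin and
# print the value of the number_of_nodes field.
node_count() {
  grep -o '"number_of_nodes"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$'
}

# Hypothetical helper: succeed only if the local cluster reports the
# expected number of nodes. Assumes ES is listening on localhost:9200.
expect_nodes() {
  local expected=$1
  local actual
  actual=$(curl -s 'http://localhost:9200/_cluster/health' | node_count)
  [ "$actual" = "$expected" ]
}

# Example:
# expect_nodes 2 && echo "both nodes joined" || echo "check the log"
```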
Re-enable shard allocation:
curl -H "Content-Type: application/json" -XPUT 'localhost:9200/_cluster/settings' -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'
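The disable and re-enable calls differ only in the value they send, so if you do this often you could wrap them in one function. This is just a sketch with hypothetical function names; note that null has to be sent unquoted so that ES resets the setting to its default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the cluster settings payload for shard
# allocation. Pass "null" to reset the setting, anything else is sent
# as a quoted string (e.g. "none", "all").
allocation_payload() {
  local value=$1
  if [ "$value" = "null" ]; then
    printf '%s\n' '{ "persistent": { "cluster.routing.allocation.enable": null } }'
  else
    printf '{ "persistent": { "cluster.routing.allocation.enable": "%s" } }\n' "$value"
  fi
}

# Hypothetical helper: apply the setting on the local node.
set_allocation() {
  curl -s -H 'Content-Type: application/json' \
    -XPUT 'localhost:9200/_cluster/settings' \
    -d "$(allocation_payload "$1")"
}

# Example:
# set_allocation none   # before stopping ES
# set_allocation null   # after the nodes have rejoined
```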
FAQ
Data nodes
- Store data and execute data-related operations such as search and aggregation.
Master nodes
- In charge of cluster-wide management and configuration actions such as adding and removing nodes.
Client nodes
- Forward cluster requests to the master node and data-related requests to data nodes.
Ingest nodes
- Pre-process documents before indexing.
cluster.name
- Any name. Must be the same across all nodes.
node.name
- Any name.
node.master
- Whether or not this is a master node, meaning one that coordinates the activity of data nodes. In our example, we have 1 master and 2 data nodes. The master can also serve as a data node.
node.data
- Indicates whether this node can store data.
node.ingest
- Indicates whether this node can pre-process documents before they are indexed.
discovery.zen.hosts_provider
- Use ec2 for node discovery.
discovery.zen.ping.unicast.hosts
- List the IP addresses of all master and data nodes in the cluster, separated by commas.
network.host
- The private IP address of this machine, i.e., the one shown by the ifconfig -a command (not the public address that you use to ssh in).
More info about IAM role set up: