Elasticsearch


Basic terminology

  • Node is a single server within a cluster. Nodes perform the actual indexing and search work. Each node has a unique id and name.
  • Cluster is a collection of nodes that work together to achieve a shared goal. A cluster is assigned a unique name, which by default is elasticsearch. Nodes use this name to discover and join the cluster.
  • Index is a collection of similar (not identical) documents, and is uniquely identified by name. By default every index is given 5 shards and 1 replica (an example of setting these explicitly follows this list).
  • Type represents an entity with a similar set of characteristics, and in essence is a way of partitioning documents. For example, book reviews and book comments could each be modelled as types.
  • Document is the unit of information to be indexed. Represented as JSON. Every document must have a type and an index it belongs to.
  • Shards are the divisions of an index across nodes. This enables the cluster to parallelise the work of index store and retrieval operations.
  • Replicas clone shards across other nodes one or more times, providing high availability (in the event an individual shard node fails) and increasing search throughput.
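
To dial in the shard and replica counts, set them at index creation time. A minimal sketch (the books index name here is just an example):

$ curl -XPUT 'localhost:9200/books?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}'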

Installation

Make sure an Oracle 8 or 10 JVM is available. Interestingly, the Elastic 6.4.x JVM support matrix lists only four supported JVMs.

$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

Ensure $JAVA_HOME is set.

$ echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle

After downloading and unpacking the tarball distribution of Elasticsearch, spark it up with ./bin/elasticsearch.

In its default configuration it will spawn a single node cluster named elasticsearch, with a randomly named node. For example, in the logs produced when starting elasticsearch above:

$ ./bin/elasticsearch
[2018-11-11T21:17:07,090][INFO ][o.e.n.Node             ] [] initializing ...
[2018-11-11T21:17:07,175][INFO ][o.e.n.Node             ] [Yi6V9UY] node name derived from node ID [Yi6V9UYfS2KwZDIxQniRdQ]; set [node.name] to override
[2018-11-11T21:17:12,131][DEBUG][o.e.a.ActionModule     ] Using REST wrapper from plugin org.elasticsearch.xpack.security.Security
[2018-11-11T21:17:12,321][INFO ][o.e.d.DiscoveryModule  ] [Yi6V9UY] using discovery type [zen]
[2018-11-11T21:17:12,999][INFO ][o.e.n.Node             ] [Yi6V9UY] initialized
[2018-11-11T21:17:13,000][INFO ][o.e.n.Node             ] [Yi6V9UY] starting ...
[2018-11-11T21:17:13,148][INFO ][o.e.t.TransportService   ] [Yi6V9UY] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-11-11T21:17:16,377][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [Yi6V9UY] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2018-11-11T21:17:16,389][INFO ][o.e.n.Node             ] [Yi6V9UY] started

To explicitly set cluster and node names, set the cluster.name and node.name properties, for example:

./bin/elasticsearch -Ecluster.name=bencode-search-cluster -Enode.name=bencode-search-node-one
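
The same settings can be made permanent in config/elasticsearch.yml:

cluster.name: bencode-search-cluster
node.name: bencode-search-node-one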

The REST API

Elasticsearch operations are entirely represented as a REST API. This may include listing out current nodes or indices, populating indices or searching them.

Some common HTTP verb REST conventions:

  • GET is used to query and fetch read-only information.
  • PUT is for creating and updating resources at a known URL, and is idempotent.
  • POST is for creating resources where the server assigns the identifier, and for partial updates; it is NOT idempotent (see the sketch below).
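
A minimal sketch of the difference, using the products index created later in these notes:

# Idempotent: running this twice leaves exactly one document with id 1
# (the second call simply reports "result" : "updated" and bumps _version)
$ curl -XPUT 'localhost:9200/products/laptops/1' -H 'Content-Type: application/json' -d'
{ "name": "Acer Predator Helios 300" }'

# NOT idempotent: running this twice creates two documents, each with its
# own auto-generated id
$ curl -XPOST 'localhost:9200/products/laptops/' -H 'Content-Type: application/json' -d'
{ "name": "Acer Predator Helios 300" }'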

List Indices

$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

List Nodes

$ curl -XGET 'localhost:9200/_cat/nodes?v&pretty'
ip      heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1         10        94  32  0.38  0.54  0.47 mdi    *   bencode-search-node-one

Health

$ curl -XGET 'localhost:9200/_cat/health'
1541932515 21:35:15 bencode-search-cluster green 1 1 0 0 0 0 0 0 - 100.0%

Document Operations

Create Index

$ curl -XPUT 'localhost:9200/products?&pretty'

Results in:

{
    "acknowledged" : true,
    "shards_acknowledged" : true,
    "index" : "products"
}

This index should now appear in the index listing:

$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index    uuid                     pri rep docs.count docs.deleted store.size pri.store.size
yellow open   products eK-kB4Z2R-aoq2ZJz96Yxw   5   1          0            0      1.2kb          1.2kb

Let's create some more.

$ curl -XPUT 'localhost:9200/customers?&pretty'
{
    "acknowledged" : true,
    "shards_acknowledged" : true,
    "index" : "customers"
}

$ curl -XPUT 'localhost:9200/orders?&pretty'
{
    "acknowledged" : true,
    "shards_acknowledged" : true,
    "index" : "orders"
}

Again, listing them.

$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index     uuid                     pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customers 2cXqClWESaaUJnHsWtrNCQ   5   1          0            0      1.1kb          1.1kb
yellow open   orders    v9zIVmpPSuG8CVHmYJQKyw   5   1          0            0       861b           861b
yellow open   products  eK-kB4Z2R-aoq2ZJz96Yxw   5   1          0            0      1.2kb          1.2kb

Populating an Index

$ curl -XPUT 'localhost:9200/products/laptops/1?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "Acer Predator Helios 300",
  "price": 1050,
  "processor": "i7-7700HQ",
  "gpu": "1060 6GB",
  "storage": 128,
  "screen-size": 15.6
}'

Result:

{
  "_index" : "products",
  "_type" : "laptops",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Let's load some more laptop documents:

$ curl -XPUT 'localhost:9200/products/laptops/2?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "HP Pavilion 15T",
  "price": 1200,
  "processor": "i7-8750H",
  "gpu": "1060 3GB",
  "storage": 128,
  "screen-size": 15.6
}'

Result:

{
  "_index" : "products",
  "_type" : "laptops",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Of note is the _version, which is automatically incremented whenever the document is changed. _shards shows how many shards were accessed for the operation. When populating subsequent documents, ensure that the id in the URL (e.g. products/laptops/2) is unique.

As of 6.X Elasticsearch no longer supports multiple types per index. A better convention is to therefore name the index something that represents the specific document type. For example, an index for mechanical keyboards might be named keyboards, which contains a type of keyboard.

Create the keyboards index:

$ curl -XPUT 'localhost:9200/keyboards?&pretty'
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "keyboards"
}

Load in a document:

$ curl -XPUT 'localhost:9200/keyboards/keyboard/1?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "Ducky One 2 RGB Black",
  "price": 195,
  "switch": "Cherry Red"
}'

Result:

{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

And another:

$ curl -XPUT 'localhost:9200/keyboards/keyboard/2?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "Das Keyboard 4",
  "price": 239,
  "switch": "Cherry Brown"
}'

Result:

{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Auto Document Identifiers

To have ES take care of assigning some unique ID, simply omit the id from the request, and do a POST (instead of a PUT). For example:

$ curl -XPOST 'localhost:9200/keyboards/keyboard/?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "Corsair K70 MK2 RGB",
  "price": 215,
  "switch": "Cherry Brown"
}'
{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "eNWXAmcBtICRwxvkuXtb",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Listing the indices statistics will show the number of documents each index contains:

$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index     uuid                     pri rep docs.count docs.deleted store.size pri.store.size
yellow open   keyboards b4Tw2K4cRZeSdaJmzEisPw   5   1          3            0     13.7kb         13.7kb
yellow open   customers 2cXqClWESaaUJnHsWtrNCQ   5   1          0            0      1.2kb          1.2kb
yellow open   orders    v9zIVmpPSuG8CVHmYJQKyw   5   1          0            0      1.2kb          1.2kb
yellow open   orderss   Re5PpYpSR4mdONPXIJ0Cqw   5   1          0            0      1.2kb          1.2kb
yellow open   products  eK-kB4Z2R-aoq2ZJz96Yxw   5   1          3            0       17kb           17kb

Retrieving Documents

Simply throw a GET request with the details:

$ curl -XGET 'localhost:9200/keyboards/keyboard/1?pretty'
{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "Ducky One 2 RGB Black",
    "price" : 195,
    "switch" : "Cherry Red"
  }
}

If the document ID doesn’t exist:

$ curl -XGET 'localhost:9200/keyboards/keyboard/100?pretty'
{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "100",
  "found" : false
}

Hot tip: To avoid pulling back the document content (_source) in the response, and incurring this expense, you can ask ES to leave it out by adding &_source=false to the request.

$ curl -XGET 'localhost:9200/keyboards/keyboard/1?pretty&_source=false'
{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "1",
  "_version" : 1,
  "found" : true
}

You can strip out document properties that are not of interest by specifying the fields you would like returned as a comma delimited list, for example &_source=name,comments,model,price.

$ curl -XGET 'localhost:9200/keyboards/keyboard/1?pretty&_source=name'
{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "Ducky One 2 RGB Black"
  }
}

Existence Checking

Done using the HEAD verb, for example:

$ curl -I -XHEAD 'localhost:9200/keyboards/keyboard/3?pretty'
HTTP/1.1 404 Not Found
content-type: application/json; charset=UTF-8
content-length: 87

Or for a document that does exist:

$ curl -I -XHEAD 'localhost:9200/keyboards/keyboard/2?pretty'
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 231

Updating Documents

Partial and full updates to documents are supported by Elasticsearch. A document to update:

$ curl -XGET 'localhost:9200/keyboards/keyboard/1?pretty&_source=name'
{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "Ducky One 2 RGB Black"
  }
}

A full update is done with the PUT verb. All properties of the document need to be defined, even if the existing value of a property has not changed. If a property is left out, it will be removed from the document in the update.

$ curl -XPUT 'localhost:9200/keyboards/keyboard/2?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "Das Keyboard 4",
  "price": 299,
  "switch": "Cherry Blue"
}'

Results:

{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "2",
  "_version" : 2,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 2
}

Partial updates are available via the Update API, which entails using the POST HTTP verb, with a JSON document that has a doc field. Partial updates are nice, because existing fields are retained in the update. For example, to add a new type property to the keyboard document for Das Keyboard 4:

$ curl -XPOST 'localhost:9200/keyboards/keyboard/2/_update?pretty' -H 'Content-Type: application/json' -d'
{
  "doc": {
    "type": "Tenkeyless"
  }
}'

Results:

{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "2",
  "_version" : 6,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 5,
  "_primary_term" : 2
}

Verify the document:

$ curl -XGET 'localhost:9200/keyboards/keyboard/2?pretty'
{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "2",
  "_version" : 6,
  "found" : true,
  "_source" : {
    "name" : "Das Keyboard 4",
    "price" : 299,
    "switch" : "Cherry Blue",
    "type" : "Tenkeyless"
  }
}

The update API will noop if the request results in no actual changes to the document.
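
For example, re-sending the exact same partial update as above should come back with "result" : "noop" and an unchanged _version:

$ curl -XPOST 'localhost:9200/keyboards/keyboard/2/_update?pretty' -H 'Content-Type: application/json' -d'
{
  "doc": {
    "type": "Tenkeyless"
  }
}'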

The update API includes scripting support, using the Painless scripting language, which is achieved by sending a script property within the JSON document that is POST'ed. For example:

$ curl -XPOST 'localhost:9200/keyboards/keyboard/2/_update?pretty' -H 'Content-Type: application/json' -d'
{
  "script": {
    "source": "ctx._source.price = ctx._source.price / 2",
    "lang": "painless"
  }
}'

Results:

{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "2",
  "_version" : 7,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 6,
  "_primary_term" : 2
}

This should have halved the price, a quick check:

$ curl -XGET 'localhost:9200/keyboards/keyboard/2?pretty'
{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "2",
  "_version" : 7,
  "found" : true,
  "_source" : {
    "name" : "Das Keyboard 4",
    "price" : 149,
    "switch" : "Cherry Blue",
    "type" : "Tenkeyless"
  }
}

Deleting Documents

$ curl -XDELETE 'localhost:9200/keyboards/keyboard/3?pretty'

Results:

{
  "_index" : "keyboards",
  "_type" : "keyboard",
  "_id" : "3",
  "_version" : 2,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 3
}

Deleting an Index

Listing out the indices in the cluster, note the accidentally created orderss index, which is to be deleted:

$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index     uuid                     pri rep docs.count docs.deleted store.size pri.store.size
yellow open   keyboards b4Tw2K4cRZeSdaJmzEisPw   5   1          3            0     14.3kb         14.3kb
yellow open   customers 2cXqClWESaaUJnHsWtrNCQ   5   1          0            0      1.2kb          1.2kb
yellow open   orders    v9zIVmpPSuG8CVHmYJQKyw   5   1          0            0      1.2kb          1.2kb
yellow open   orderss   Re5PpYpSR4mdONPXIJ0Cqw   5   1          0            0      1.2kb          1.2kb
yellow open   products  eK-kB4Z2R-aoq2ZJz96Yxw   5   1          3            0       17kb           17kb

Similar to deleting a document, use the DELETE verb:

$ curl -XDELETE 'localhost:9200/orderss'
{
  "acknowledged": true
}

The Multi Get API

Among the various document APIs available, the Multi Get API allows for the retrieval of multiple documents based on an index, type (optional) and id. The response includes a docs array with all the fetched documents.

$ curl -XGET 'localhost:9200/_mget?pretty' -H 'Content-Type: application/json' -d'
{
  "docs": [
  {
    "_index": "keyboards",
    "_type": "keyboard",
    "_id": "1"
  },
  {
    "_index": "keyboards",
    "_type": "keyboard",
    "_id": "2"
  }
  ]
}'

Results:

{
  "docs" : [
  {
    "_index" : "keyboards",
    "_type" : "keyboard",
    "_id" : "1",
    "_version" : 1,
    "found" : true,
    "_source" : {
      "name" : "Ducky One 2 RGB Black",
      "price" : 195,
      "switch" : "Cherry Red"
    }
  },
  {
    "_index" : "keyboards",
    "_type" : "keyboard",
    "_id" : "2",
    "_version" : 7,
    "found" : true,
    "_source" : {
      "name" : "Das Keyboard 4",
      "price" : 149,
      "switch" : "Cherry Blue",
      "type" : "Tenkeyless"
    }
  }
  ]
}

The Bulk API

The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed.

The possible actions are index, create, delete and update. For the create, index and update actions, it's assumed a document follows on the next line, after a line feed (\n). For example, to index a new keyboard document as id 4:

curl -XPOST 'localhost:9200/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index" : { "_index": "keyboards", "_type": "keyboard", "_id": "4" } }
{ "name": "Cooler Master MK750 RGB", "price": 189, "switch": "Cherry Blue", "type": "Full" }
'

Results:

{
  "took" : 9,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "keyboards",
        "_type" : "keyboard",
        "_id" : "4",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
            "total" : 2,
            "successful" : 1,
            "failed" : 0
        },
        "_seq_no" : 7,
        "_primary_term" : 3,
        "status" : 201
      }
    }
  ]
}

A more realistic example would involve many operations packed together, for example:

curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "keyboards", "_type" : "keyboard", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "keyboards", "_type" : "keyboard", "_id" : "2" } }
{ "create" : { "_index" : "keyboards", "_type" : "keyboard", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "keyboard", "_index" : "keyboards"} }
{ "doc" : {"field2" : "value2"} }
'

Bulk Loading from JSON File

Feed in a bulk file with curl using the --data-binary switch (to preserve newlines).

keyboards.json

{ "index": {} }
{ "name": "Razer BlackWidow Chroma V2", "switch": "Razer Orange", "price": 120, "type": "Full" }
{ "index": {} }
{ "name": "Mad Catz S.T.R.I.K.E. TE", "switch": "Kailh Brown", "price": 190, "type": "Full" }
{ "index": {} }
{ "name": "SteelSeries 6Gv2", "switch": "Cherry MX Black", "price": 280, "type": "Full" }
{ "index": {} }
{ "name": "Logitech G710+", "switch": "Cherry MX Blue", "price": 89, "type": "Full" }

POST them to Elasticsearch with curl:

$ curl -XPOST 'localhost:9200/keyboards/keyboard/_bulk?pretty' -H 'Content-Type: application/json' --data-binary @"keyboards.json"

Results:

{
  "took" : 32,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "keyboards",
        "_type" : "keyboard",
        "_id" : "EkamDGcBjeEQi7qr6n_Y",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 8,
        "_primary_term" : 3,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "keyboards",
        "_type" : "keyboard",
        "_id" : "E0amDGcBjeEQi7qr6n_Y",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 3,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "keyboards",
        "_type" : "keyboard",
        "_id" : "FEamDGcBjeEQi7qr6n_Y",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 9,
        "_primary_term" : 3,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "keyboards",
        "_type" : "keyboard",
        "_id" : "FUamDGcBjeEQi7qr6n_Y",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 10,
        "_primary_term" : 3,
        "status" : 201
      }
    }
  ]
}

Searching

Background

The TF/IDF Algorithm

Term Frequency / Inverse Document Frequency (TF/IDF) is a numeric statistic that reflects how important a word is to a document, and breaks down like this:

  • Term frequency is how often the term appears in the field of interest (e.g. great in the review field)
  • Inverse document frequency is how often the term is used across all the documents; the idea is to water down irrelevant words such as if, a, then, this and so on
  • Field length norm is the length of the field itself, used to gauge importance (e.g. words in a book title are more important than words in the book content)
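
As a rough sketch, Lucene's classic TF/IDF similarity (note that ES 5.0+ actually defaults to the related BM25 model) combines the three like so, where f_{t,d} is the frequency of term t in document d, N is the total number of documents, n_t is the number of documents containing t, and |d| is the number of terms in the field:

\mathrm{tf}(t,d) = \sqrt{f_{t,d}} \qquad \mathrm{idf}(t) = 1 + \log\frac{N}{n_t + 1} \qquad \mathrm{norm}(d) = \frac{1}{\sqrt{|d|}}

\mathrm{score}(q,d) \approx \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{norm}(d)

The real practical scoring function layers on boosts and query normalisation, so treat this as an approximation.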

The Query DSL

Elasticsearch provides a full Query DSL (Domain Specific Language) based on JSON to define queries.

Query Context

A query clause used in query context answers the question “How well does this document match this query clause?”. Besides deciding whether or not the document matches, the query clause also calculates a _score representing how well the document matches, relative to other documents.

Query context is in effect whenever a query clause is passed to a query parameter, such as the query parameter in the search API.

Filter Context

In filter context, a query clause answers the question “Does this document match this query clause?”. The answer is a simple Yes or No. No scores are calculated. Filter context is mostly used for filtering structured data, e.g.

  • Does this timestamp fall into the range 2015 to 2016?
  • Is the status field set to “published”?

An example of query clauses used in query and filter context in the search API:

GET /_search
{
  "query": { 
    "bool": { 
      "must": [
        { "match": { "title":   "Search"        }}, 
        { "match": { "content": "Elasticsearch" }}  
      ],
      "filter": [ 
        { "term":  { "status": "published" }}, 
        { "range": { "publish_date": { "gte": "2015-01-01" }}} 
      ]
    }
  }
}

Another filter context example with range:

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "age": {
            "gte": 20,
            "lte": 30
          }
        }
      }
    }
  }
}'

Stateful vs Stateless

Elasticsearch is stateless for search queries (i.e. no session or cursor is held between requests). This means result pages are not managed server side; pagination is instead expressed on each request with the from and size params.
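
For example, walking through results a page at a time is just a matter of shifting from on each request:

$ curl -XGET 'localhost:9200/people/_search?q=dennis&from=0&size=10&pretty'
$ curl -XGET 'localhost:9200/people/_search?q=dennis&from=10&size=10&pretty'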

Searching Multiple Indices

Searches both the people and programmers indices:

$ curl -XGET 'localhost:9200/people,programmers/_search?q=dennis&pretty'
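
Index names also take wildcards, and omitting the index altogether searches every index in the cluster:

$ curl -XGET 'localhost:9200/p*/_search?q=dennis&pretty'
$ curl -XGET 'localhost:9200/_search?q=dennis&pretty'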

Searching with Query String Params

Searching can be done using query string params, via the Search API. There are dozens of supported parameters available, from sort to explain. First up, the essential q (query) param:

$ curl -XGET 'localhost:9200/people/_search?q=dennis&pretty'

Results:

{
  "took" : 144,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 4.97903,
    "hits" : [
      {
        "_index" : "people",
        "_type" : "person",
        "_id" : "MM-6EWcBCquaYatLNLOg",
        "_score" : 4.97903,
        "_source" : {
          "name" : "Henrietta Dennis",
          "age" : 53,
          "gender" : "female",
          "company" : "TYPHONICA",
          "email" : "henriettadennis@typhonica.com",
          "phone" : "+1 (811) 498-2016",
          "street" : "778 Bond Street",
          "city" : "Tolu",
          "state" : "Missouri, 4768"
        }
      },
      {
        "_index" : "people",
        "_type" : "person",
        "_id" : "ys-6EWcBCquaYatLNLKf",
        "_score" : 4.922411,
        "_source" : {
          "name" : "Dennis Whitley",
          "age" : 29,
          "gender" : "male",
          "company" : "ZAGGLE",
          "email" : "denniswhitley@zaggle.com",
          "phone" : "+1 (850) 544-2230",
          "street" : "521 Liberty Avenue",
          "city" : "Highland",
          "state" : "Minnesota, 1770"
        }
      }
    ]
  }
}

By default a maximum of 10 results are returned.

The sort param takes an attribute and an optional order. When sorting, relevance scores are not computed.

$ curl -XGET 'localhost:9200/people/_search?q=dennis&sort=age:asc&pretty'

The size param controls the number of results returned:

$ curl -XGET 'localhost:9200/people/_search?q=state:california&size=2&pretty'

The from param defines the starting offset of hits to return:

$ curl -XGET 'localhost:9200/people/_search?q=state:california&from=12&size=2&pretty'

Searching using the Request Body

Articulating search criteria can also be done via the JSON request body. The request body method exposes more functionality than what is possible with query params:

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "from": 20,
  "size": 10,
  "sort": { "age": { "order": "desc" } }
}'

The term query traverses the inverted index for an exact term match (e.g. state = 'nsw'):

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "name": "gates" }
  },
  "_source": false
}'

The _source: false above also instructs ES not to drag back the actual _source documents that are hit, resulting in a much leaner response footprint. _source also supports wildcard (glob) filtering, for example to only return document properties that begin with a or end with st:

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "state": "nsw" }
  },
  "_source": [ "a*", "*st" ]
}'

More granular inclusion and exclusion rules can be defined:

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "state": "nsw" }
  },
  "_source": { 
    "includes": [ "a*", "*st" ],
    "excludes": [ "*desc*" ]
  }
}'

Unlike the term query, full text searches need not be exact matches. match queries accept text/numerics/dates, analyse them, and construct a boolean query.

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": "jones" }
  }
}'

Phrases, their words and the individual relationships between them can be defined with the operator keyword (the default is or):

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": { 
        "query": "george jones",
        "operator": "or"
      }
    }
  }
}'

The above or will return hits on the name property that contain either the word george or jones.
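
Switching the operator to and would instead require both words to be present in the name:

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": {
        "query": "george jones",
        "operator": "and"
      }
    }
  }
}'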

The match_phrase query analyses the query text and creates a phrase query out of the result. The analyzer keyword allows a specific analyser to be used. Useful when an exact phrase match (i.e. a sequence of multiple words) is needed.

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": {
      "street": "pleasant place"
    }
  }
}'

The match_phrase_prefix is the same as match_phrase, except that it allows for prefix matches on the last term in the text.

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase_prefix": {
      "street": "pleasant pl"
    }
  }
}'

Boolean Compound Queries

The Query DSL supports a variety of compound queries: queries that match on combinations of other queries. The Bool Query is one such compound query available.

  • must clauses must appear in matching documents, and contribute to the score
  • should clauses are optional, but boost the score of documents that do match them
  • must_not clauses must not appear in matching documents
  • filter clauses must always appear in matching documents, but are not scored

For example, two (full text) match queries compounded together.

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "street": "miami" } },
        { "match": { "street": "court" } }
      ]
    }
  }
}'

Again, but this time with term queries:

curl -XGET 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        {
          "term": { 
            "age": {
              "value": 21,
              "boost": 2.0
            }
          }
        },
        {
          "term": { 
            "age": {
              "value": 28
            }
          }
        }
      ]
    }
  }
}'

The boost param allows a particular (sub) query to have its importance elevated, by multiplying its score. In the above, 21 year olds are twice as important in the query.

Aggregations

Changing gears, it's time to showcase one of ES's analytical features: aggregations. In a nutshell, you can summarise data based on a search query (awesome!). Types of aggregations include:

  • Bucketing documents are washed against each bucket, and "fall in" if they satisfy the bucket criteria (think GROUP BY in SQL)
  • Metric track and compute numeric statistics across documents
  • Matrix operate on multiple fields and produce a matrix result (no scripting support)
  • Pipeline daisy chain other aggregations

A hugely powerful feature is the ability to nest aggregations. Given a bucket essentially defines a document set, aggregations can also be applied at the bucket level.

Metric Aggregations

Metric aggregations deal with numeric statistics such as sum, count, avg, min and max. An example that averages the age of all people documents:

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "avg_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}'

Results in 52.526:

{
  "took" : 77,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_age" : {
      "value" : 52.526
    }
  }
}

Note the size param of 0 instructs ES not to bring any document hits back in the response, just the aggregation results.

Aggregations support compound queries, for example the following boolean compound search figures out the average age of people within the state of Victoria.

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "match": { "state": "victoria" }
      }
    }
  },
  "aggregations": {
    "avg_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}'

The nifty stats aggregation will produce a multi-value result that includes all the bread and butter stats. Example:

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "stats_age": {
      "stats": {
        "field": "age"
      }
    }
  }
}'

You’ll get these back in return:

"aggregations" : {
  "stats_age" : {
    "count" : 1000,
    "min" : 20.0,
    "max" : 85.0,
    "avg" : 52.526,
    "sum" : 52526.0
  }
}

Cardinality Aggregation

Next up, a cardinality aggregation calculates a count of distinct values.

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "distinct_ages": {
      "cardinality": {
        "field": "age"
      }
    }
  }
}'

Results in 66 unique age values. Metric aggregations (like cardinality) by default only work with numeric fields. This is because ES hashes text data into the inverted index (saving on space and comparison operations). The original text data is stored in fielddata, an on-demand in-memory data structure. To apply a cardinality aggregation to, say, the gender field, you can instruct ES to work with the original field values of gender via fielddata. Enabling fielddata is done using the Mapping API.

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Use mappings to define which string fields should be treated as full text fields, which fields contain numbers/dates/geolocations, the format of date values or custom rules to control the mapping for dynamically added fields.

curl -XPUT 'localhost:9200/people/_mapping/person?pretty' -H 'Content-Type: application/json' -d'
{
  "properties": {
    "gender": {
      "type": "text",
      "fielddata": true
    }
  }
}'

If successful, you will get an "acknowledged" : true back. The field can now be used within aggregations, for example:

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "distinct_genders": {
      "cardinality": {
        "field": "gender"
      }
    }
  }
}'

Results in 2 unique genders:

{
  "aggregations" : {
    "distinct_genders" : {
      "value" : 2
    }
  }
}

Bucketing Aggregations

Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations. These sub-aggregations will be aggregated for the buckets created by their parent bucket aggregation.

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "gender_bucket": {
      "terms": {
        "field": "gender"
      }
    }
  }
}'

This will split genders into buckets, resulting in:

{
  "aggregations" : {
    "gender_bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "male",
          "doc_count" : 518
        },
        {
          "key" : "female",
          "doc_count" : 482
        }
      ]
    }
  }
}

More granular control over buckets can be achieved using ranges:

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "age_ranges": {
      "range": {
        "field": "age",
        "ranges": [
          { "to": 30 },
          { "from": 30, "to": 40 },
          { "from": 40, "to": 55 },
          { "from": 55 }
        ]
      }
    }
  }
}'

Results in:

{
  "aggregations" : {
    "age_ranges" : {
      "buckets" : [
        {
          "key" : "*-30.0",
          "to" : 30.0,
          "doc_count" : 164
        },
        {
          "key" : "30.0-40.0",
          "from" : 30.0,
          "to" : 40.0,
          "doc_count" : 144
        },
        {
          "key" : "40.0-55.0",
          "from" : 40.0,
          "to" : 55.0,
          "doc_count" : 217
        },
        {
          "key" : "55.0-*",
          "from" : 55.0,
          "doc_count" : 475
        }
      ]
    }
  }
}

Alternatively buckets can be tagged with more meaningful names, by specifying the key property like so:

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "age_ranges": {
      "range": {
        "field": "age",
        "keyed": true,
        "ranges": [
          { "key": "young", "to": 30 },
          { "key": "denial years", "from": 30, "to": 40 },
          { "key": "midlife crisis", "from": 40, "to": 55 },
          { "key": "old", "from": 55 }
        ]
      }
    }
  }
}'

Nested Aggregations

Finally, we can witness the power. For example, the average age between genders:

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "gender_bucket": {
      "terms": {
        "field": "gender"
      },
      "aggregations": {
        "average_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

Results in:

{
  "aggregations" : {
    "gender_bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "male",
          "doc_count" : 518,
          "average_age" : {
            "value" : 53.61003861003861
          }
        },
        {
          "key" : "female",
          "doc_count" : 482,
          "average_age" : {
            "value" : 51.36099585062241
          }
        }
      ]
    }
  }
}

There is no limit on nesting levels. For example, the following 3 layer aggregation groups by gender and age range, and then takes the average.
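
A minimal sketch of what that could look like (the range boundaries here are arbitrary):

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "gender_bucket": {
      "terms": {
        "field": "gender"
      },
      "aggregations": {
        "age_ranges": {
          "range": {
            "field": "age",
            "ranges": [
              { "to": 40 },
              { "from": 40 }
            ]
          },
          "aggregations": {
            "average_age": {
              "avg": {
                "field": "age"
              }
            }
          }
        }
      }
    }
  }
}'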

Filter Aggregation

Aggregations, like queries, support filters. No surprise here.

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "texans_only": {
      "filter": { "term": { "state": "texas" } },
      "aggregations": {
        "average_age": {
          "avg" : {
            "field": "age"
          }
        }
      }
    }
  }
}'

Results:

{
  "aggregations" : {
    "texans_only" : {
      "doc_count" : 17,
      "average_age" : {
        "value" : 52.470588235294116
      }
    }
  }
}

If needed, multiple filters can be specified:

curl -XPOST 'localhost:9200/people/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggregations": {
    "state_filter": {
      "filters": {
        "filters": {
          "florida": { "match": { "state": "florida" } },
          "oregon": { "match": { "state": "oregon" } },
          "colorado": { "match": { "state": "colorado" } }
        }
      },
      "aggregations": {
        "average_age": {
          "avg" : {
            "field": "age"
          }
        }
      }
    }
  }
}'

Creating Test Data

Check out JSON Generator, which provides a number of random data generation functions such as surname() and street(). Ensure the data is formatted appropriately for POST'ing to ES with curl by:

  • ensuring there is only one record per line
  • including a { "index": {} } bulk API directive before each record, on its own line

For example:

{ "index": {} }
[{"name":"Kelly Page","age":34,"gender":"female","company":"MAGNEATO","email":"kellypage@magneato.com","phone":"+1 (881) 422-3362","street":"933 Ingraham Street","city":"Lafferty","state":"Northern Mariana Islands, 5269"}]
{ "index": {} }
[{"name":"Karina Kennedy","age":42,"gender":"female","company":"URBANSHEE","email":"karinakennedy@urbanshee.com","phone":"+1 (911) 506-2780","street":"240 Lee Avenue","city":"Condon","state":"Ohio, 9719"}]

Stream the file to the Elasticsearch bulk API with curl:

$ curl -XPOST 'localhost:9200/people/person/_bulk?pretty' -H 'Content-Type: application/json' --data-binary @"people.json"