Logstash

Elasticsearch Performance and Tuning

A dedicated performance course run by Matt Gregory from Elastic, an absolute legend with deep Elasticsearch expertise.

Contents: Cool takeaways Tuning for Index Speed Increase the refresh interval Index architecting Bulk Hardware settings to improve performance Disable swapping Indexing Buffer size Best practices and scaling Disable replicas for initial loads Use auto-generated IDs Use Cross Cluster Replication Thread Pools Memory Locking Transforms Tuning for search API settings and data modelling to improve search performance Search as few fields as possible One big copy_to field as opposed to individual text multi field Consider mapping identifiers as keywords Document modeling Consider mapping numeric fields as keyword Hardware settings to improve search Warm Up Global Ordinals Warm up filesystem Cache Use index sorting to speed up search Ways to improve searches must and should clauses filter and must not clauses node query cache shard request cache Aggregation performance Search rounded dates Force merge read only indices Search profiler and Explain API Search profiler Search profiler API ID Query section Timing breakdown Collection section Collectors reasons Rewrite section Explain and Tasks API Explain API Score Field length normalization and coordination Other Query Parameters API Settings to improve indexing performance Hardware settings to improve performance Best Practices and scaling Transforms

Cool takeaways
Increase the refresh_interval from the default 1s to something higher, like 10s. Index mapping (typing) should be set to strict (the default is dynamic). The took param measures raw cluster operation speed; Kibana will also reveal a round-trip time, which includes the HTTP layer. Auto-generated IDs are always faster. One of Matt’s favourite APIs is _cluster/allocation/explain. Ensure the heap is beefed up. A must clause is the first line of defence for scoring; should is then used as the second pass of scoring. Always format queries as a ‘bool’. Use configuration management everywhere (Ansible, etc). Run a dedicated monitoring cluster.

Tuning for Index Speed Cheatsheet: ...
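A few of the takeaways above (a longer refresh interval, strict mappings, always wrapping queries in a bool) can be sketched as plain request bodies. This is a minimal illustration and not taken from the course material: the helper names and the example fields are hypothetical, and the dictionaries are meant to be sent as JSON to the index settings, mapping and search endpoints by whatever client you use.

```python
import json


def refresh_interval_settings(interval="10s"):
    """Body for PUT /<index>/_settings that raises the refresh interval."""
    return {"index": {"refresh_interval": interval}}


def strict_mapping(properties):
    """Body for PUT /<index>/_mapping that rejects documents with unmapped fields."""
    return {"dynamic": "strict", "properties": properties}


def bool_query(must=None, should=None, filter_=None):
    """Wrap clauses in a bool query; clause types that were not supplied are omitted."""
    clauses = {"must": must, "should": should, "filter": filter_}
    return {"query": {"bool": {name: value for name, value in clauses.items() if value}}}


def to_json(body):
    """Serialise a request body the way it would be sent over HTTP."""
    return json.dumps(body, indent=2)
```

For example, `bool_query(must=[{"match": {"title": "tuning"}}], filter_=[{"term": {"status": "published"}}])` produces a query where the must clause drives scoring and the filter clause only narrows the result set (the field names here are made up for illustration).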

Elasticsearch Engineer 8.1

Revised 2024 edition based on Elasticsearch 8.1. Recently I had the opportunity to attend the latest revision of the 4-day Elasticsearch Engineer course, which I first did in person about 5 years ago in Sydney. Elasticsearch has often been an integral part of the data solutions I’ve been involved with and I’m quite fond of it. This time round the course only runs in a virtual classroom format (using strigo.io) with our awesome trainers Krishna Shah and Kiju Kim. ...

Black belt Elasticsearch

Some more advanced Elasticsearch wisdom I gleaned from Jason Wong and Mark Laney from Elastic.

Contents: Environment with Config X-Pack Security (the 1337 way) Roles Built-in Query Web UI (batteries included) Internals Lucene Segments Elasticsearch Indexing Transaction Log and Flushing Doc Values Caching Field Modelling Typing Denormalising Range Types Mapping Parameters Fixing Data Painless Reindexing APIs Picking up Mapping Changes Multi-fields Custom Marker (flag) Field Fixing Fields Advanced Search and Aggregations Patterns Wildcard Query Regexp Query Null Script (painless) Query Script Field Performance Considerations Search Templates Aggregations Percentile Top Hits Scripted (painless) Aggregations Significant Terms Aggregation Pipeline Aggregations Cluster Management Dedicated Nodes Hot Warm Architecture Tags Verify Shard Allocation Forced Awareness Capacity Planning Shard Allocation Litmus Test Primary Shards Scaling with Indices Scaling with Replicas Resources Time Based Data APIs for Managing Indices Document Modelling Nested Objects Nested Aggregations Parent Child Relationships Argh Which Technique is Best? Kibana Considerations Monitoring Task Management API The cat API Performance Issues Thread Pool Queues hot_threads API Indexing Slow Log Search Slow Log The Profile API X-Pack Monitoring Alerting From Dev to Production Disabling Dynamic Indices Production Mode Best Practices Network Best Practices Storage Best Practices Hardware Selection Throttles JVM Poor Query Performance Always Filter Aggregating Too Many Docs Denormalise First Too many shards Unnecessary Scripting Cross Cluster Replication Upgrades Rolling Upgrade

Environment with Config
You can use environment variables within elasticsearch.yml: ...

Elasticsearch Basics

Some Elasticsearch wisdom I gleaned from Jason Wong and Mark Laney from Elastic.

Contents: Use cases Logstash vs Beats? Time Series vs Static Data Logstash Installation Starting and Stopping Elasticsearch Killing Communication Discovery module (networking) Security Read-only Enabling X-Pack (Elasticsearch Security) CRUD Ingestion Reading Search Query and Filter Contexts Mapping Inverted Index Multi Fields (keyword fields) Anatomy of an Analyzer Custom Analyzer The reindex API Node Types Cluster state Shards Anatomy of Search (Shards) Troubleshooting Configuration Responses Cluster and Shard Health Diagnosing Issues Improving Search Results Multi-field Search Boosting Fuzziness Exact Terms Sorting Paging Highlighting Aggregations Best Practices Index Aliases Index Templates Scroll Search Cluster Backup

Use cases
Search
Logging
Metrics - unlike logs, these are typically not in a text format.
Business analytics - the aggregation and analysis of patterns (e.g. bucketing aggregations, ML jobs)
Security analytics -

Logstash vs Beats?
Beats are lightweight data shippers, but are not appropriate for ETL type stashing. Logstash, on the other hand, can handle these concerns, but requires a much heavier runtime (JVM). An official SIEM solution is currently under development. ...

Logstash

A quick walkthrough of Logstash, the ETL engine offered by the Elastic Stack. Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite stash. Logstash gained its initial popularity with log and metric collection, such as log4j logs, Apache web logs and syslog. Its application has since broadened to all kinds of data sources, like large scale event streams, webhooks, and database and message queue integration. Once data is transformed and cleaned up, it is routed to a final destination (i.e. the stash). Elasticsearch is one option, but there are lots of other choices (MongoDB, S3, Nagios, IRC, email). ...
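The ingest, transform and route flow described above can be pictured with a toy sketch. This is only an analogy written in Python, not Logstash configuration or any Logstash API: the regular expression is a simplified stand-in for a real Apache log pattern, and the function names and the `_parse_failure` tag are hypothetical.

```python
import re

# Simplified, illustrative pattern for the start of an Apache access-log line.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def parse_event(line):
    """Transform step: turn a raw log line into a structured event."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # Keep the raw line and tag it so unparsed events are easy to find later.
        return {"message": line, "tags": ["_parse_failure"]}
    event = match.groupdict()
    event["status"] = int(event["status"])
    return event


def run_pipeline(lines, outputs):
    """Ingest each line, transform it, then hand the event to every output (the stash)."""
    for line in lines:
        event = parse_event(line)
        for output in outputs:
            output(event)
```

Here an output is any callable that accepts an event, so `run_pipeline(lines, [stash.append])` collects events into a list, while a real deployment would route them to Elasticsearch or another destination.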