Data | Benny Simmonds

Logstash

A quick walkthrough of Logstash, the ETL engine offered by the Elastic Stack. Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite stash. Logstash gained its initial popularity with log and metric collection, such as log4j logs, Apache web logs and syslog. Its application has since broadened to all kinds of data sources, like large scale event streams, webhooks, databases and message queues. Once data is transformed and cleaned up, it is routed to a final destination (i.e. the stash). Elasticsearch is one option, but there are lots of other choices (MongoDB, S3, Nagios, IRC, email). ...
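To make the ingest, transform, and ship flow concrete, here is a minimal sketch in plain Python. It is not Logstash code and does not use any Logstash API. The stage names (`read_input`, `parse_filter`, `write_output`), the simplified Apache-style log pattern, and the `parse_failure` tag are all assumptions made for illustration.

```python
import json
import re
from typing import Callable, Iterable, Iterator

# Assumed, simplified pattern for an Apache-style access log line.
LINE_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def read_input(lines: Iterable[str]) -> Iterator[dict]:
    """Input stage: wrap each raw line in an event dict."""
    for line in lines:
        yield {"message": line.rstrip("\n")}


def parse_filter(events: Iterable[dict]) -> Iterator[dict]:
    """Filter stage: extract structured fields, tagging lines that do not parse."""
    for event in events:
        match = LINE_PATTERN.match(event["message"])
        if match is None:
            event["tags"] = ["parse_failure"]  # hypothetical tag name
        else:
            event.update(match.groupdict())
            event["status"] = int(event["status"])
        yield event


def write_output(events: Iterable[dict], sink: Callable[[str], None]) -> int:
    """Output stage: pass each event to the sink as JSON and return the count."""
    count = 0
    for event in events:
        sink(json.dumps(event, sort_keys=True))
        count += 1
    return count


def run_pipeline(lines: Iterable[str], sink: Callable[[str], None]) -> int:
    """Chain the three stages: input -> filter -> output."""
    return write_output(parse_filter(read_input(lines)), sink)
```

Calling `run_pipeline(open("access.log"), print)` would stream a file through the three stages and print one JSON document per line. Swapping `print` for a function that posts to a datastore plays the role of choosing a different stash.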

PostgreSQL

PostgreSQL (postgres or pg) is an amazing open source relational database that provides the SQL DSL for interacting with data. Installation is a breeze with any package manager. Packages to grab:

- postgresql and postgresql-common: core server
- postgresql-client-common and postgresql-client: client libs and binaries
- postgresql-contrib: useful bolt-on modules

Once installed, it is managed as a daemon by systemd.

$ sudo systemctl start postgresql
$ sudo systemctl stop postgresql
$ sudo systemctl restart postgresql
$ sudo systemctl reload postgresql
$ sudo systemctl status postgresql

Core Concepts

Configuration

The location depends on the distro, generally somewhere like /etc/postgresql/11/main. ...

Apache Spark

Recently I’ve had the opportunity to dig into Apache Spark, thanks to some training from Brian Bloechle from Cloudera. What is Spark? Fast, flexible, and developer friendly, Apache Spark is the leading platform for large scale SQL, batch processing, stream processing, and machine learning. Java, Scala, Python and R are first class citizens when it comes to consuming the various Spark APIs. I’ll cover PySpark in more detail. Spark is an agnostic processing engine that can target a number of cluster managers, including Spark Standalone, Hadoop’s YARN, Apache Mesos and Kubernetes. In the context of Spark, some useful parts of the surrounding ecosystem to be aware of: ...