Data Engineering the MovieLens Dataset
Exploring the MovieLens Dataset using Jupyter, Python, Spark DataFrames, SparkSQL, HiveQL, MySQL, Kafka, Parquet, Cloudera / CDH, HDFS and more.
The Jupyter Notebook session that this post will refer to is found here …
In a previous post we illustrate the R&D Lab that we use to erect arbitrary technology-stacks using LXC/Linux Containers and Docker Containers. Here we’ll make use of that lab. Underlying the JupyterLab Notebook session that follows this post, is a collection of LXC/Linux containers (vps00, vps01, vps02, vps03, vps04 and vps10) that form a Cloudera/CDH 6.1 Hadoop cluster.
Additionally, nested within the vps00 LXC/Linux Container is a running Docker engine that provides the following technology stack components:
Although Cloudera/CDH offers Kafka, Zookeeper and Spark services, they will always lag their latest releases; and this is particularly problematic when components like Spark and Kafka are being fixed and feature-enhanced quickly. We therefore adopt a best-of-breed approach and leave Cloudera/CDH to do what it does best — being a Hadoop cluster that provides HDFS, Hive, Hbase, Map/Reduce, etc. services — and use Docker to supply the rest of the stack:
(pyvenv.d) nmvega@vps00$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
376e4bdd6975 nmvega/kafka:latest "start-kafka.sh" 25 hours ago Up 25 hours 0.0.0.0:9094->9092/tcp stackd_kafka_3
ced612a40b32 nmvega/kafka:latest "start-kafka.sh" 25 hours ago Up 25 hours 0.0.0.0:9093->9092/tcp stackd_kafka_2
83184134f4cf nmvega/kafka:latest "start-kafka.sh" 25 hours ago Up 25 hours 0.0.0.0:9092->9092/tcp stackd_kafka_1
56603b1aaa6a wurstmeister/zookeeper:latest "/bin/sh -c '/usr/sb…" 25 hours ago Up 25 hours 22/tcp, 2888/tcp, 3888/tcp, 0.0.0.0:22181->2181/tcp analytics-ZooKeeper
Basically, we use the CDH cluster as a remote Hadoop services provider.
Note: Because some CDH services require Zookeeper, the reader may wonder about a port conflict between CDH Zookeeper and Docker Zookeeper. To avoid this, we operate CDH Zookeeper on port 2181 (the well-known port for Zookeeper) and operate Docker Zookeeper on port 22181 (seen in the output above). Our docker-provided Zookeeper is used by our docker-provided-Kafka.
Finally, vps00 is also host to our UNIX home directory (/home/nmvega/agile.ds.d/) and, within that directory, also our Anaconda Python 3.x virtual environment. Among the rich standard and 3rd party-libraries that Python provides, the following feature most prominently in this post:
One of the great features of Spark 2.x is its Catalyst engine, which among other things makes Spark language-agnostic. Whether one uses Python, R, Java or Scala to drive Spark makes little-to-no difference since Spark cross-compiles code into optimized Catalyst code.
With the Dev/Ops talk behind us, we turn our attention to the heart of this post: Using the MovieLens dataset to tour Spark DataFrames, SparkSQL, HiveQL, MySQL, Kafka, Parquet, HDFS, and more. Here are the tables of the MovieLens schema (and although — as of this writing — ERD relationships aren’t included, they’re not too difficult to figure out):
We will ask questions of this dataset via MySQL, Spark DataFrames, SparkSQL, HiveQL (via beeline(1)), along the way learning how to accomplish things using various techniques ■ We show how to programmatically plug functionality into Spark (like Avro as a data-source and Kafka as a data-sink) ■ We sqoop(1) table-data from MySQL into Hive/HDFS, serialized as Parquet files; and later ingest those files into Spark DataFrames ■We push Spark DataFrames (which represent movielens tables) through Kafka topic-partitions and recover them at the other end; along the way showing how to embed DataFrame metadata and ephemeral keys to drive per-table Kafka topic-partitioning, as well as show how to pack (denormalize) data for ingress and unpack (re-normalize) data for egress ■ and so on.
The following JupyterLab Notebook session is version v0.9.0, which means your humble author wants to provide additional markdown-cell comments and in-line annotations for clarity. There were too many moving parts to complete everything in one go. Hopefully, though, there’s enough already there for readers to understand what is taking place. The next update will improve on this, and perhaps a dedicated follow-on post will treat how Avro containers can make life easier when working with Kafka.