Tweets Transformer
This is a simple job that filters tweets according to a SQL expression passed as an argument to the job and writes the matching tweets in Parquet format.
Requirements
To execute this project you need to have the following installed and configured on your local machine:
- Java 1.8
- Scala 2.11.x
- sbt 0.13.16
- Apache Spark >= 2.3
Build
$ make all
This will compile the project from scratch, execute the unit tests, package the application, and create the dist
directory in the root folder, in which you can find the distribution tar package. You can expand this package anywhere you want and execute the job using the bundled bash script.
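For example, assuming the build produced a tar package named tweets-transformer.tar.gz inside dist (the exact file name may differ in your build), expanding it could look like this:

$ tar -xf dist/tweets-transformer.tar.gz -C /opt/jobs

After that, the wrapper script described under Usage is available inside the expanded folder.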
Usage
After expanding the build package, enter the folder and use the tweets-transformer.sh
bash script as follows:
$ cd <path where dist package was expanded>
$ ./tweets-transformer.sh --help
_ _ _ __
| |___ _____ ___| |_ ___ | |_ _ __ __ _ _ __ ___ / _| ___ _ __ _ __ ___ ___ _ __
| __\ \ /\ / / _ \/ _ \ __/ __| | __| '__/ _` | '_ \/ __| |_ / _ \| '__| '_ ` _ \ / _ \ '__|
| |_ \ V V / __/ __/ |_\__ \ | |_| | | (_| | | | \__ \ _| (_) | | | | | | | | __/ |
\__| \_/\_/ \___|\___|\__|___/ \__|_| \__,_|_| |_|___/_| \___/|_| |_| |_| |_|\___|_|
Wrapper script to execute the Apache Spark Job that transforms
tweets from their original JSON into Parquet format, applying some
constraints.
Usage:
tweets-transformer.sh [--options] INPUT_PATH
tweets-transformer.sh -h | --help
Options:
-h --help Display this help information.
Spark Related Options:
-sh --spark-home Spark home path (Default: environment variable SPARK_HOME)
-m --master Spark master (Default: 'local[*]')
-dm --driver-memory Spark driver memory (Default: 16G)
-em --executor-memory Spark executor memory (Default: 7G)
-dc --driver-cores Spark driver cores (Default: 12)
-ec --executor-cores Spark executor cores (Default: 45)
-ne --num-executors Spark number of executors (Default: 12)
-el --event-log Location to save the spark event logs (Default: /tmp/spark-event)
-jj --job-jar Spark Job Jar path (Default: ./tweets-transformer.jar)
Job Related Options:
-o --output Output directory [Required.]
-f --filter Filter expression for the tweets (Default: 'place is not null')
Execute the pipeline
Depending on your configuration, defined variables, and file locations, the exact command may vary; use the --help output above for further guidance. A general job execution can be invoked like this:
./tweets-transformer.sh \
-o "/output/path" \
-f "place is not null and lang = 'en'" \
"/path/to/json_tweets/*/"
Warnings
- By default Spark tries to store the event logs in file:///tmp/spark-event/; this directory must exist before submitting the job (see the command below).
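For a local run with the default location, creating the directory ahead of time is a single command (adjust the path if you override it with --event-log):

$ mkdir -p /tmp/spark-event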
Executing against a cluster
To execute against a cluster, just use the Spark Related options according to your needs.
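As a sketch, a run against a YARN cluster could look like the example below. Every value is illustrative only and must be adapted to your cluster, and the HDFS paths assume the tweet dumps were already uploaded there:

# all values below are examples; adjust resources and paths to your environment
./tweets-transformer.sh \
  -m yarn \
  -ne 12 \
  -ec 4 \
  -em 8G \
  -o "hdfs:///user/someuser/tweets_parquet" \
  -f "place is not null" \
  "hdfs:///user/someuser/json_tweets/2013/*/*/*/*"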
Tips
- When working with the datasets downloaded from archive.org, you’ll need to specify wildcards in the input data paths. For example, the input data has the following folder structure:
2013 (year)
|- 09 (month)
   |- 01 (date)
      |- 00 (hour)
      |- 02 (hour)
      ...
      |- 23 (hour)
   |- 02 (date)
      |- 00
      ...
      |- 23
   ...
   |- 31
You will need to specify the input path as <base_path>/2013/*/*/*/*. This means that Spark needs to traverse all the directories under 2013, then the directories under each month directory, and so on.
- Be aware of your cluster's resources and request them accordingly.
- Spark event logs are enabled by default; use them to monitor your job. By default the logs go to /tmp/spark-event.
- Ensure your local Spark installation is properly configured to submit jobs against a YARN cluster (see the example after this list).
- Remember that when submitting jobs to a cluster, the paths must refer to HDFS or another distributed filesystem, not to your local machine.
- After the job finishes, the Spark UI is shut down; to see the stored events you must start
the Spark history server:
$SPARK_HOME/sbin/start-history-server.sh <path-to-spark-events>
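Once the history server is running, its UI is available by default at http://localhost:18080 on the machine where it was started.

Regarding the YARN tip above: Spark picks up the cluster configuration from the Hadoop client configuration directory, so make sure the corresponding environment variable points at it before submitting the job (the path below is only an example):

$ export HADOOP_CONF_DIR=/etc/hadoop/conf   # example path; use your cluster's Hadoop configuration directory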