Tweets Transformer

This is a simple Apache Spark job that filters tweets according to a SQL expression passed as an argument to the job.


To execute this project, you need to have the following installed and configured on your local machine:


$ make all

This compiles the project from scratch, runs the unit tests, packages it, and creates the dist directory in the root folder, where you can find the distribution tar package. You can extract this package anywhere you want and run the job using the bundled bash script.



After expanding the build package, enter the folder and use the bash script as follows:

$ cd <path where dist package was expanded>
$ ./ --help

 _                     _         _                        __                                
| |___      _____  ___| |_ ___  | |_ _ __ __ _ _ __  ___ / _| ___  _ __ _ __ ___   ___ _ __ 
| __\ \ /\ / / _ \/ _ \ __/ __| | __| '__/ _` | '_ \/ __| |_ / _ \| '__| '_ ` _ \ / _ \ '__|
| |_ \ V  V /  __/  __/ |_\__ \ | |_| | | (_| | | | \__ \  _| (_) | |  | | | | | |  __/ |   
 \__| \_/\_/ \___|\___|\__|___/  \__|_|  \__,_|_| |_|___/_|  \___/|_|  |_| |_| |_|\___|_| 

Wrapper script to execute the Apache Spark Job that transforms
tweets from their original json into parquet format and some

Usage: [--options] INPUT_PATH -h | --help

  -h --help               Display this help information.

Spark Related Options:
  -sh --spark-home        Spark home path (Default: environment variable SPARK_HOME)
  -m  --master            Spark master (Default: 'local[*]')
  -dm --driver-memory     Spark driver memory (Default: 16G)
  -em --executor-memory   Spark executor memory (Default: 7G)
  -dc --driver-cores      Spark driver cores (Default: 12)
  -ec --executor-cores    Spark executor cores (Default: 45)
  -ne --num-executors     Spark number of executors (Default: 12)
  -el --event-log         Location to save the spark event logs (Default: /tmp/spark-event)
  -jj --job-jar           Spark Job Jar path (Default: ./tweets-transformer.jar)

Job Related Options:
  -o  --output            Output directory [Required.]
  -f  --filter            Filter expression for the tweets (Default: 'place is not null')

Execute the pipeline

The exact command depends on your configuration, environment variables, and file locations; use --help for further guidance. A typical invocation looks like this:

./ \
-o "/output/path" \
-f "place is not null and lang = 'en'" \
INPUT_PATH
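
To make the effect of the -f expression concrete, here is a minimal sketch of which tweets such a filter keeps. It does not use Spark: Python's built-in sqlite3 module stands in for Spark SQL, only because the two example expressions are valid in both. The sample tweets and the filter_tweets helper are made up for illustration and are not part of this project.

```python
import json
import sqlite3

# Three made-up tweets; only the fields used by the filter are included.
tweets = [
    {"id": 1, "lang": "en", "place": {"name": "Lisbon"}},
    {"id": 2, "lang": "pt", "place": {"name": "Porto"}},
    {"id": 3, "lang": "en", "place": None},
]


def filter_tweets(rows, expression):
    """Return the ids of the rows kept by a SQL boolean expression.

    Hypothetical helper: SQLite stands in for Spark SQL here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (id INTEGER, lang TEXT, place TEXT)")
    conn.executemany(
        "INSERT INTO tweets VALUES (?, ?, ?)",
        [
            (
                r["id"],
                r["lang"],
                # Store the nested place object as JSON text, or NULL if absent.
                json.dumps(r["place"]) if r["place"] is not None else None,
            )
            for r in rows
        ],
    )
    # The expression is spliced in as the WHERE clause, so it must be trusted input.
    query = f"SELECT id FROM tweets WHERE {expression} ORDER BY id"
    kept = [row[0] for row in conn.execute(query)]
    conn.close()
    return kept


print(filter_tweets(tweets, "place is not null"))                  # [1, 2]
print(filter_tweets(tweets, "place is not null and lang = 'en'"))  # [1]
```

The default expression keeps every tweet that carries a place, while the stricter expression from the command above also requires the language to be English.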


Executing against a cluster

To execute against a cluster, set the Spark Related Options listed in the help (for example --master, --num-executors and --executor-memory) to match your cluster.
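
As a rough illustration of how the Spark Related Options could translate into a cluster submission, the sketch below builds a spark-submit command line in Python. It assumes the wrapper script forwards its options to spark-submit, which this README does not confirm. The flags on the right-hand side are standard spark-submit options, but the mapping, the build_spark_submit helper, and the /opt/spark path are hypothetical. The job's own arguments (input, output, filter) are left out because their format is not documented here.

```python
import shlex

# Assumed mapping from the wrapper's long options to spark-submit flags.
# The flags are standard spark-submit options; the mapping itself is a guess
# at what the wrapper does, not taken from the script.
SPARK_SUBMIT_FLAGS = {
    "master": "--master",
    "driver-memory": "--driver-memory",
    "executor-memory": "--executor-memory",
    "driver-cores": "--driver-cores",
    "executor-cores": "--executor-cores",
    "num-executors": "--num-executors",
}


def build_spark_submit(spark_home, job_jar, options, event_log=None):
    """Return the spark-submit argument list for the given wrapper options."""
    cmd = [f"{spark_home}/bin/spark-submit"]
    for name, value in options.items():
        if name not in SPARK_SUBMIT_FLAGS:
            raise ValueError(f"unknown Spark option: {name}")
        cmd += [SPARK_SUBMIT_FLAGS[name], str(value)]
    if event_log is not None:
        # Event logging is configured through Spark properties.
        cmd += [
            "--conf", "spark.eventLog.enabled=true",
            "--conf", f"spark.eventLog.dir={event_log}",
        ]
    cmd.append(job_jar)
    return cmd


cluster_cmd = build_spark_submit(
    spark_home="/opt/spark",             # hypothetical SPARK_HOME
    job_jar="./tweets-transformer.jar",  # default from the help text
    options={"master": "yarn", "num-executors": 12, "executor-memory": "7G"},
    event_log="/tmp/spark-event",        # default from the help text
)
print(shlex.join(cluster_cmd))
```

Building the command as a list and quoting it only for display avoids shell-escaping problems when a value contains spaces.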
