Scripts
This folder contains simple Python/PySpark scripts that perform specific tasks on datasets.
df-sampling
This script takes stratified samples of larger datasets.
Usage:
$ spark-submit dataprep/scripts/df-sampling.py --help
20/06/07 10:38:55 WARN Utils: Your hostname, pop-os resolves to a loopback address: 127.0.1.1; using 192.168.1.110 instead (on interface wlp0s20f3)
20/06/07 10:38:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/06/07 10:38:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
usage: df-sampling.py [-h] [--fraction FRACTION] [--seed SEED] [--input INPUT]
                      [--output OUTPUT] [--class-col CLASS_COL]
                      [--split-char SPLIT_CHAR]

This program takes a stratified sample from the provided dataset.

optional arguments:
  -h, --help            show this help message and exit
  --fraction FRACTION   Fraction of sample
  --seed SEED           Random seed for reproducibility
  --input INPUT         Input path of the dataset to sample
  --output OUTPUT       Output path to store the result
  --class-col CLASS_COL
                        Category column to keep the strata proportions
  --split-char SPLIT_CHAR
                        Specifying this parameter indicates the class-col
                        contains multiple values separated by this character
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Execution example:
$ spark-submit df-sampling.py --fraction 0.2 --input <path to original dataset> --output <path to output folder> --class-col categories --split-char ";"
The command above takes a 20% sample of the original dataset while preserving the class proportions.
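To make "preserving the class proportions" concrete, here is a minimal sketch of stratified sampling in plain Python. It is not the script's implementation: the function name `stratified_sample`, the toy rows, and the rounding rule are assumptions made for illustration, and the multi-valued `--split-char` case is not covered. In PySpark, `DataFrame.sampleBy` offers comparable per-stratum sampling.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, class_col, fraction, seed=None):
    """Draw roughly `fraction` of the rows from every class in `class_col`.

    Hypothetical helper for illustration only; the real script runs on Spark.
    """
    rng = random.Random(seed)

    # Group the rows by their class value (one stratum per class).
    strata = defaultdict(list)
    for row in rows:
        strata[row[class_col]].append(row)

    # Sample the same fraction from each stratum, without replacement.
    sample = []
    for label in sorted(strata):
        group = strata[label]
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample


# Toy dataset (assumed): 80 rows of class "a" and 20 rows of class "b".
rows = [{"id": i, "categories": "a" if i < 80 else "b"} for i in range(100)]

sample = stratified_sample(rows, "categories", fraction=0.2, seed=42)
counts = Counter(row["categories"] for row in sample)
print(counts["a"], counts["b"])
```

With a fraction of 0.2, the 80/20 split of the toy dataset carries over to the sample as 16 rows of class "a" and 4 of class "b".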
transform-embeddings
This script computes and stores the document embeddings for a dataset of documents using a pre-trained gensim Doc2Vec model. Along with the embeddings, it also fits a scikit-learn MultiLabelBinarizer to encode the categories.
Usage:
$ python ./transform-embeddings.py --help
usage: transform-embeddings.py [-h] --d2v-model D2V_MODEL --output OUTPUT
                               [--train-test-split] [--test-size TEST_SIZE]
                               documents_path

This program will transform the provided documents into embeddings using the
specified Doc2Vec model.

positional arguments:
  documents_path        Path to the raw documents

optional arguments:
  -h, --help            show this help message and exit
  --train-test-split    Perform a train test split before saving the results
  --test-size TEST_SIZE
                        Size of the test split if requested, default to 0.3

required named arguments:
  --d2v-model D2V_MODEL
                        Path to the Gensim Doc2Vec model
  --output OUTPUT       Path to save the transformed embeddings
Execution example:
$ python transform-embeddings.py --d2v-model <path to a 'd2v_model.model'> --output <path to output folder> --train-test-split <path to original dataset>
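The category encoding and the optional split can be sketched as follows. This is an illustration under stated assumptions, not the script's code: the category strings, the ";" separator, and the placeholder vectors in `X` are made up, and in the real script the vectors would come from the Doc2Vec model (for example via gensim's `infer_vector`).

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical raw category strings, one per document.
raw_categories = [
    "cs.AI;cs.LG", "math.ST", "cs.LG", "cs.AI;math.ST", "cs.AI",
    "cs.LG;math.ST", "math.ST", "cs.AI;cs.LG", "cs.LG", "cs.AI",
]

# Split each string into a list of labels, then fit the binarizer.
labels = [value.split(";") for value in raw_categories]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # one row per document, one column per category

# Placeholder 2-d vectors standing in for the Doc2Vec embeddings (assumption).
X = [[float(i), float(i) + 0.5] for i in range(len(labels))]

# Same default test size as the script's --test-size option (0.3).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(list(mlb.classes_))
print(len(X_train), len(X_test))
```

Keeping the fitted binarizer next to the embeddings makes it possible to map predicted label vectors back to category names later with `mlb.inverse_transform`.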