Output Formats

Summary of output types available in DLATK. Most of the following commands need the --output_name flag which specifies the output directory.

In the csv-type outputs the DLATK namespace will be dumped as the first line. This allows you to revisit results and know your exact settings.

CSV

Correlations

--csv:
--sort (optional):

dlatkInterface.py -d dla_tutorial -t msgs -c user_id --correlate --csv --outcome_table blog_outcomes --outcomes age gender is_student is_education is_technology --outcome_with_outcome_only --output ~/correlations

feature,age,p,N,CI_l,CI_u,freq,gender,p,N,CI_l,CI_u,freq,is_education,p,N,CI_l,CI_u,freq,is_student,p,N,CI_l,CI_u,freq,is_technology,p,N,CI_l,CI_u,freq
outcome_age,1.0,0.0,1000,1.0,1.0,1000,0.013089769277,0.679288049345,1000,-0.048943029258,0.0750219732934,1000,0.172756193994,9.62582613937e-08,1000,0.111962191762,0.232261824573,1000,-0.45911352125,6.87199798082e-53,1000,-0.50668539378,-0.408754348795,1000,0.117238226305,0.000337894917181,1000,0.0556496028558,0.177938062241,1000
outcome_gender,0.013089769277,0.679288049345,1000,-0.048943029258,0.0750219732934,1000,1.0,0.0,1000,1.0,1.0,1000,-0.161206890368,4.95925394952e-07,1000,-0.220991447954,-0.100215331046,1000,-0.0221719208099,0.483710711985,1000,-0.0840494768068,0.0398759714239,1000,0.065154589658,0.0492504223304,1000,0.00317432871342,0.126636171693,1000
outcome_is_education,0.172756193994,6.41721742624e-08,1000,0.111962191762,0.232261824573,1000,-0.161206890368,7.43888092428e-07,1000,-0.220991447954,-0.100215331046,1000,1.0,0.0,1000,1.0,1.0,1000,-0.14424303335,7.7633413628e-06,1000,-0.204408282607,-0.0829920718038,1000,-0.0537304778134,0.0894683992084,1000,-0.115339374305,0.00829021870801,1000
outcome_is_student,-0.45911352125,6.87199798082e-53,1000,-0.50668539378,-0.408754348795,1000,-0.0221719208099,0.604638389981,1000,-0.0840494768068,0.0398759714239,1000,-0.14424303335,5.8225060221e-06,1000,-0.204408282607,-0.0829920718038,1000,1.0,0.0,1000,1.0,1.0,1000,-0.141292966419,1.82280081643e-05,1000,-0.20152088984,-0.0800006243256,1000
outcome_is_technology,0.117238226305,0.000253421187886,1000,0.0556496028558,0.177938062241,1000,0.065154589658,0.0656672297739,1000,0.00317432871342,0.126636171693,1000,-0.0537304778134,0.0894683992084,1000,-0.115339374305,0.00829021870801,1000,-0.141292966419,9.11400408216e-06,1000,-0.20152088984,-0.0800006243256,1000,1.0,0.0,1000,1.0,1.0,1000

The --sort flag creates a second table where each column is sort by descending effect size and appends the table to the bottom of the csv:

SORTED:
rank,age,r,p,N,CI_l,CI_u,freq,gender,r,p,N,CI_l,CI_u,freq,is_education,r,p,N,CI_l,CI_u,freq,is_student,r,p,N,CI_l,CI_u,freq,is_technology,r,p,N,CI_l,CI_u,freq
1,outcome_age,1.0,0.0,1000,1.0,1.0,1000,outcome_gender,1.0,0.0,1000,1.0,1.0,1000,outcome_is_education,1.0,0.0,1000,1.0,1.0,1000,outcome_is_student,1.0,0.0,1000,1.0,1.0,1000,outcome_is_technology,1.0,0.0,1000,1.0,1.0,1000
2,outcome_is_education,0.172756193994,6.41721742624e-08,1000,0.111962191762,0.232261824573,1000,outcome_is_technology,0.065154589658,0.0656672297739,1000,0.00317432871342,0.126636171693,1000,outcome_age,0.172756193994,9.62582613937e-08,1000,0.111962191762,0.232261824573,1000,outcome_gender,-0.0221719208099,0.483710711985,1000,-0.0840494768068,0.0398759714239,1000,outcome_age,0.117238226305,0.000337894917181,1000,0.0556496028558,0.177938062241,1000
3,outcome_is_technology,0.117238226305,0.000253421187886,1000,0.0556496028558,0.177938062241,1000,outcome_age,0.013089769277,0.679288049345,1000,-0.048943029258,0.0750219732934,1000,outcome_is_technology,-0.0537304778134,0.0894683992084,1000,-0.115339374305,0.00829021870801,1000,outcome_is_technology,-0.141292966419,9.11400408216e-06,1000,-0.20152088984,-0.0800006243256,1000,outcome_gender,0.065154589658,0.0492504223304,1000,0.00317432871342,0.126636171693,1000
4,outcome_gender,0.013089769277,0.679288049345,1000,-0.048943029258,0.0750219732934,1000,outcome_is_student,-0.0221719208099,0.604638389981,1000,-0.0840494768068,0.0398759714239,1000,outcome_is_student,-0.14424303335,5.8225060221e-06,1000,-0.204408282607,-0.0829920718038,1000,outcome_is_education,-0.14424303335,7.7633413628e-06,1000,-0.204408282607,-0.0829920718038,1000,outcome_is_education,-0.0537304778134,0.0894683992084,1000,-0.115339374305,0.00829021870801,1000
5,outcome_is_student,-0.45911352125,6.87199798082e-53,1000,-0.50668539378,-0.408754348795,1000,outcome_is_education,-0.161206890368,7.43888092428e-07,1000,-0.220991447954,-0.100215331046,1000,outcome_gender,-0.161206890368,4.95925394952e-07,1000,-0.220991447954,-0.100215331046,1000,outcome_age,-0.45911352125,6.87199798082e-53,1000,-0.50668539378,-0.408754348795,1000,outcome_is_student,-0.141292966419,1.82280081643e-05,1000,-0.20152088984,-0.0800006243256,1000

Predictions

Output prediction values to a csv.

--prediction_csv:

dlatkInterface.py -d dla_tutorial -t msgs -c user_id  -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16'  --combo_test_regression  --outcome_table blog_outcomes --outcomes age --output ~/prediction.values --prediction_csv

The above command produces the file prediction.values.predicted_data.csv:

Id,age__withLanguage
2863105,25.678967347
3155970,25.0352946253
671748,27.6127094761
4184069,24.1537464257
4016135,25.6544622564
3485704,22.359848659
3321866,18.403504235
...

Probabilities

Output classification probabilities to a csv.

../fwinterface/fwflag_probability_csv:

dlatkInterface.py -d dla_tutorial -t msgs -c user_id  -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16'  --combo_test_classif  --outcome_table blog_outcomes --outcomes is_student is_education is_technology --output ~/probability.values --probability_csv

The above command produces the file probability.values.prediction_probabilities.csv:

Id,is_education__withLanguage,is_student__withLanguage,is_technology__withLanguage
2863105,-0.995385541164,-0.606670069561,-1.04686722815
3155970,-1.02199555614,-0.364230235902,-0.945148184099
671748,-1.03019054224,-0.605114549867,-0.95579279429
4184069,-1.04002299128,-0.673432284751,-1.26935520215
4016135,-0.972723923762,-0.65093645124,-0.860030650986
3485704,-0.957358049868,-0.334117399851,-1.43404698847
...

Prediction / Classification Metrics

Output prediction or classification performance metrics to a csv delimited by |.

--csv:

dlatkInterface.py -d dla_tutorial -t msgs -c user_id  -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16'  --combo_test_regression  --outcome_table blog_outcomes --outcomes age --output ~/regression.metrics --csv

The above command produces the file regression.metrics.variance_data.csv:

row_id|outcome|model_controls|mae|mae_folds|mse|mse_folds|N|num_features|R|r|R2|R2_folds|r_folds|r_p|r_p_folds|rho|rho_p|se_mae_folds|se_mse_folds|se_R2_folds|se_r_folds|se_r_p_folds|se_train_mean_mae_folds|test_size|train_mean_mae|train_mean_mae_folds|train_size|{model_desc}|{modelFS_desc}|w/ lang.
1|age|()1|4.83561091756|4.83367105025|42.8168508661|42.7795793244|978|2000|0.617025365643|0.631052186337|0.380720301847|0.382388363028|0.634404850072|9.32342564224e-110|7.83089815326e-20|0.681655428515|1.40167196195e-134|0.175281125172|3.03944440616|0.0193760958564|0.0165877776996|6.98930451373e-20|0.123031255372|198|3.33229129841|6.45703007082|780|RidgeCV(alphas=array([ 1.00000e+00,  1.00000e-02,  1.00000e-04,  1.00000e+02,     1.00000e+04,  1.00000e+06]),   cv=None, fit_intercept=True, gcv_mode=None, normalize=False,   scoring=None, store_cv_values=False)|None|1

HTML

Correlation Matrix

--rmatrix:
--sort (optional):

dlatkInterface.py -d dla_tutorial -t msgs -c user_id --correlate --rmatrix --outcome_table blog_outcomes --outcomes age gender is_student is_education is_technology --outcome_with_outcome_only --output ~/correlations

Here is HTML output from the above command.

Word Clouds

1 to 3 Gram Clouds

--tagcloud: Produces data for making Wordle tag clouds, saved in text files
--make_wordclouds: Creates pngs from the text file output

dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$1gram$msgs$user_id$16to16' --tagcloud --make_wordclouds --outcome_table blog_outcomes --outcomes age gender is_student is_education is_technology --output ~/1to3grams

creates the following files:

1to3grams_tagcloud_wordclouds/age_pos.png
1to3grams_tagcloud_wordclouds/age_neg.png
1to3grams_tagcloud_wordclouds/gender_1_pos.png
1to3grams_tagcloud_wordclouds/gender_1_neg.png
1to3grams_tagcloud_wordclouds/is_student_pos.png
1to3grams_tagcloud_wordclouds/is_student_neg.png
1to3grams_tagcloud_wordclouds/is_education_pos.png
1to3grams_tagcloud_wordclouds/is_education_neg.png
1to3grams_tagcloud_wordclouds/is_technology_pos.png

Topic Clouds

--topic_tagcloud: Produces data for making topic Wordles, saved in text files
--make_topic_wordclouds: Creates pngs from the text file output

dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16'  --topic_tagcloud --make_topic_wordclouds  --outcome_table blog_outcomes --outcomes age gender is_student is_education is_technology --output ~/fbtopics --topic_lexicon met_a30_2000_freq_t50ll

creates the following directories where the topic clouds are printed:

fbtopics_topic_tagcloud_wordclouds/age
fbtopics_topic_tagcloud_wordclouds/gender
fbtopics_topic_tagcloud_wordclouds/is_student
fbtopics_topic_tagcloud_wordclouds/is_education
fbtopics_topic_tagcloud_wordclouds/is_technology

Print Word Clouds For A Topic Lexicon

Create a word clound png for every topic in a given topic lexicon.

../fwinterface/fwflag_make_all_topic_wordclouds

dlatkInterface.py --topic_lexicon met_a30_2000_freq_t50ll --output all_met_a30_2000_tagclouds --make_all_topic_wordclouds

Ini files

See the Using INI Files tutorial for more info.

Pickles

Save a predictive model to a pickle file with:

dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$1gram$msgs$user_id$16to16' --outcome_table blog_outcomes --outcomes age --train_regression --save_model --picklefile age.pickle

Load a predictive model to a pickle file with:

# Loads the regression model in age.pickle, and uses the features to predict the ages of the users in
# blog_outcomes, and compares the predicted ages to the actual ages in the table.

dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$1gram$msgs$user_id$16to16' --outcome_table blog_outcomes --outcomes age --train_regression --load_model --picklefile age.pickle