DLATK's Pandas Interface
Importing a FeatureGetter or OutcomeGetter
The initialization patterns below work for both FeatureGetter and OutcomeGetter; FeatureGetter is shown here.
from dlatk.featureGetter import FeatureGetter
fg = FeatureGetter() # use defaults set in dlaConstants.py
fg = FeatureGetter(corpdb="someDB", corptable="someTB", correl_field="someField", ...) # specify values
fg = FeatureGetter.fromFile('/path/to/init/file') # pass values from file
The init file must have the line [constants] at the top. String values are not quoted. For lists (such as lists of outcome variables), separate the values with commas. Sample init file:
[constants]
corpdb = dla_tutorial
corptable = msgs
correl_field = user_id
feattable = feat$1gram$msgs$user_id$16to16$0_01
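To make the file format concrete, here is a minimal sketch that reads a file of this shape with Python's standard `configparser`. It is an illustration only and not DLATK's own loader. The `outcomefields` line is borrowed from the outcomes init file later on this page, and the helper name `split_list` is hypothetical.

```python
import configparser

# Sample init file contents, inlined so the sketch is self-contained.
INIT_TEXT = """\
[constants]
corpdb = dla_tutorial
corptable = msgs
correl_field = user_id
feattable = feat$1gram$msgs$user_id$16to16$0_01
outcomefields = age, is_education
"""


def split_list(raw):
    """Split a comma-separated value into stripped strings (hypothetical helper)."""
    return [item.strip() for item in raw.split(",") if item.strip()]


# interpolation=None keeps characters such as '$' and '%' literal.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(INIT_TEXT)

# Everything under [constants] as a plain dict of unquoted strings.
constants = dict(parser["constants"])
outcome_fields = split_list(constants["outcomefields"])

print(constants["feattable"])
print(outcome_fields)
```

Values come back as plain strings, which matches the note that nothing in the file is quoted; list-valued settings need an explicit split on commas.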
Getting feature tables as dataframes
fg = FeatureGetter()
fg_gns = fg.getGroupNormsAsDF(where='') # group norms as dataframe
fg_vals = fg.getValuesAsDF(where='')
fg_zgns = fg.getGroupNormsWithZerosAsDF(groups=[], where='', pivot=False, sparse=False)
fg_vgns = fg.getValuesAndGroupNormsAsDF(where='')
Getting outcome tables as dataframes
from dlatk.outcomeGetter import OutcomeGetter
og = OutcomeGetter()
# outcome table as dataframe
og_vals = og.getGroupAndOutcomeValuesAsDF(outcomeField=None, where='')
# outcome table as dataframe with the group frequency threshold applied
og_out = og.getGroupsAndOutcomesAsDF(groupThresh=0, lexicon_count_table=None, groupsWhere='', sparse=False)
Examples
In these examples, testInitFile.txt has the same contents as the sample init file above.
Features
from dlatk.featureGetter import FeatureGetter
fg = FeatureGetter.fromFile("testInitFile.txt")
Get group norms:
fg_gns = fg.getGroupNormsAsDF()
fg_gns.head()
                                            group_norm
group_id                         feat
003ae43fae340174a67ffbcf19da1549 neighbors     0.00026
                                 all           0.00390
                                 jason         0.00026
                                 <newline>     0.00130
                                 caused        0.00026
Get values:
fg_vals = fg.getValuesAsDF()
fg_vals.head()
                                            value
group_id                         feat
003ae43fae340174a67ffbcf19da1549 neighbors      1
                                 all           15
                                 jason          1
                                 <newline>      5
                                 caused         1
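Comparing the two outputs above, each group norm is consistent with the raw value divided by a per-group total of roughly 3,846 (1/0.00026 ≈ 15/0.00390 ≈ 5/0.00130). The sketch below reproduces that relationship with plain pandas. Reading the group norm as a relative frequency is an inference from these numbers rather than something stated here, and the toy counts are made up for illustration.

```python
import pandas as pd

# Toy counts in the same long format: one row per (group_id, feat).
vals = pd.DataFrame(
    {
        "group_id": ["g1", "g1", "g1", "g2", "g2"],
        "feat": ["all", "jason", "caused", "all", "neighbors"],
        "value": [15, 1, 4, 3, 1],
    }
).set_index(["group_id", "feat"])

# Total count per group, broadcast back onto every row of that group.
group_totals = vals.groupby(level="group_id")["value"].transform("sum")

# Relative frequency: each feature's count divided by its group's total.
norms = (vals["value"] / group_totals).rename("group_norm").to_frame()
print(norms)
```

Under this reading, the group norms within a single group sum to 1 when every feature of that group is included.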
Get group norms with zeros:
fg_zgns = fg.getGroupNormsWithZerosAsDF()
fg_zgns.head()
                                       group_norm
group_id                         feat
003ae43fae340174a67ffbcf19da1549 !       0.096464
                                 "       0.000780
                                 #       0.000000
                                 #12     0.000000
                                 $       0.000000
                                 %       0.000000
Create a pivot table:
fg_zgns_piv = fg.getGroupNormsWithZerosAsDF(pivot=True)
fg_zgns_piv.head()
                                  group_norm
feat                                ¿    –    —    ‘         ’    “    ”    •
group_id
003ae43fae340174a67ffbcf19da1549  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0
01f6c25f87600f619e05767bf8942a5f  0.0  0.0  0.0  0.0  0.000677  0.0  0.0  0.0
02be98c1005c0e7605385fbc5009de61  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0
0318cc38971845f7470f34704de7339d  0.0  0.0  0.0  0.0  0.001647  0.0  0.0  0.0
040b2b154e4074a72d8a7b9697ec76d2  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0
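For intuition, the long and pivoted layouts hold the same numbers arranged differently. The following is a minimal sketch with plain pandas on toy data. It assumes nothing about how DLATK builds the pivot internally and only shows the reshaping itself.

```python
import pandas as pd

# Toy long-format frame indexed by (group_id, feat), zeros included.
long_df = pd.DataFrame(
    {"group_norm": [0.75, 0.25, 0.0, 0.0, 0.5, 0.5]},
    index=pd.MultiIndex.from_product(
        [["g1", "g2"], ["all", "caused", "jason"]],
        names=["group_id", "feat"],
    ),
)

# Move the 'feat' index level into the columns: one row per group,
# one column per feature, with 'group_norm' as the top column level.
wide_df = long_df.unstack("feat")
print(wide_df)
```

The wide layout is convenient when each group should be a single row, for example as input to a model that expects one feature vector per group.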
Create a sparse dataframe:
fg_sparse = fg.getGroupNormsWithZerosAsDF(sparse=True)
fg_sparse.density
0.07432567922874671
fg_sparse.head()
                                       group_norm
group_id                         feat
003ae43fae340174a67ffbcf19da1549 !       0.096464
                                 "       0.000780
                                 #       0.000000
                                 #12     0.000000
                                 $       0.000000
                                 %       0.000000
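The `density` attribute shown above belongs to the older pandas `SparseDataFrame` class, which was removed in pandas 1.0. On newer pandas versions the same quantity is available through the `.sparse` accessor on frames whose columns are sparse. The sketch below shows that accessor on toy data. Whether a given DLATK release returns such a frame is an assumption to check against your installation.

```python
import pandas as pd

# Toy dense column with mostly zero group norms.
dense = pd.DataFrame({"group_norm": [0.096464, 0.000780, 0.0, 0.0, 0.0, 0.0]})

# Store only the non-zero entries; 0.0 becomes the implicit fill value.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# Fraction of entries that are actually stored (non-fill values).
density = sparse.sparse.density
print(density)
```

Here two of the six entries are non-zero, so the density is one third. A low density, like the roughly 0.07 reported above, means most group norms are zero and the sparse representation saves memory.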
Outcomes
Init file:
[constants]
corpdb = dla_tutorial
corptable = msgs
correl_field = user_id
feattable = feat$1gram$msgs$user_id$16to16$0_01
outcometable = blog_outcomes
outcomefields = age, is_education
outcomecontrols = gender
Initialize:
from dlatk.outcomeGetter import OutcomeGetter
og = OutcomeGetter.fromFile('testInitFile.txt')
Get outcomes and controls:
outAndCont = og.getGroupsAndOutcomesAsDF()
outAndCont.head()
          age  is_education  gender
group_id
28451      27           NaN       0
174357     23           NaN       1
216833     24           NaN       0
317581     26           NaN       0
446275     17           NaN       1
Get group and outcome values:
outcome = og.getGroupAndOutcomeValuesAsDF()
outcome.head()
         age
user_id
3991108   17
3417138   25
3673414   14
3361075   16
4115327   14
Features and Outcomes in one dataframe
Initialize:
from dlatk.featureStar import FeatureStar
fs = FeatureStar.fromFile('testInitFile.txt')
Get a single dataframe with both features and outcomes:
fAndO_df = fs.combineDFs(fg=None, og=None, fillNA=True)
fg can be either a FeatureGetter or a dataframe indexed on group_id. Similarly, og can be either an OutcomeGetter or a dataframe indexed on group_id. Alternatively, you can pass nothing to the method, which returns a dataframe with data from the feature and outcome tables in FeatureStar.
fAndO = fs.combineDFs() # pass nothing
fAndO = fs.combineDFs(someFeatureGetter, someOutcomeGetter) # pass objects
fAndO = fs.combineDFs(someFeatureDF, someOutcomeDF) # pass dataframes
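As a rough mental model, combining two frames that are both indexed on group_id amounts to joining them on that shared index. The sketch below does this with plain pandas on toy data. The outer `join` and the `fillna(0)` step are assumptions about what `combineDFs` and `fillNA=True` might correspond to, not a description of DLATK's implementation.

```python
import pandas as pd

# Toy wide feature frame, indexed on group_id.
features = pd.DataFrame(
    {"all": [0.75, 0.0], "jason": [0.0, 0.5]},
    index=pd.Index(["g1", "g2"], name="group_id"),
)

# Toy outcome frame, also indexed on group_id; 'g3' has no features.
outcomes = pd.DataFrame(
    {"age": [27, 23, 24], "gender": [0, 1, 0]},
    index=pd.Index(["g1", "g2", "g3"], name="group_id"),
)

# Outer join on the shared index keeps every group from either frame.
combined = features.join(outcomes, how="outer")

# Replace the gaps left by the join with zeros (assumed fill behavior).
combined_filled = combined.fillna(0)
print(combined_filled)
```

Because both inputs share the same index name, no key column has to be named in the join; groups missing from one side show up as NaN until they are filled.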