PUMS1\_dplyr
================
Win-Vector LLC
4/24/2018

``` r
library("DBI")
library("dplyr")
```

    ## 
    ## Attaching package: 'dplyr'

    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag

    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union

``` r
library("rquery")
```

    ## Loading required package: wrapr

    ## 
    ## Attaching package: 'wrapr'

    ## The following object is masked from 'package:dplyr':
    ## 
    ##     coalesce

``` r
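# open an in-memory SQLite database and load the person and household tables
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "dpus", readRDS("ss16pus.RDS"))
dbWriteTable(db, "dhus", readRDS("ss16hus.RDS"))

# peek at the first few person records
dbGetQuery(db, "SELECT * FROM dpus LIMIT 5")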
```

    ##   RT  SERIALNO SPORDER  PUMA         ST  ADJINC AGEP              CIT CITWP
    ## 1  P 000000338      03 02701 Alabama/AL 1007588   06 Born in the U.S.  <NA>
    ## 2  P 000000338      05 02701 Alabama/AL 1007588   08 Born in the U.S.  <NA>
    ## 3  P 000000343      03 01400 Alabama/AL 1007588   12 Born in the U.S.  <NA>
    ## 4  P 000000539      04 01400 Alabama/AL 1007588   11 Born in the U.S.  <NA>
    ## 5  P 000002284      02 00600 Alabama/AL 1007588   08 Born in the U.S.  <NA>
    ##    COW DDRS DEAR DEYE DOUT DPHY DRAT DRATX DREM  ENG  FER  GCL  GCM  GCR HINS1
    ## 1 <NA>   No   No   No <NA>   No <NA>  <NA>   No <NA> <NA> <NA> <NA> <NA>    No
    ## 2 <NA>   No   No   No <NA>   No <NA>  <NA>   No <NA> <NA> <NA> <NA> <NA>    No
    ## 3 <NA>   No   No   No <NA>   No <NA>  <NA>  Yes <NA> <NA> <NA> <NA> <NA>    No
    ## 4 <NA>   No   No   No <NA>   No <NA>  <NA>  Yes <NA> <NA> <NA> <NA> <NA>    No
    ## 5 <NA>   No   No   No <NA>   No <NA>  <NA>   No <NA> <NA> <NA> <NA> <NA>   Yes
    ##   HINS2 HINS3 HINS4 HINS5 HINS6 HINS7 INTP JWMNP JWRIP JWTR LANP
    ## 1    No    No   Yes    No    No    No <NA>  <NA>  <NA> <NA> <NA>
    ## 2    No    No   Yes    No    No    No <NA>  <NA>  <NA> <NA> <NA>
    ## 3    No    No   Yes    No    No    No <NA>  <NA>  <NA> <NA> <NA>
    ## 4    No    No   Yes    No    No    No <NA>  <NA>  <NA> <NA> <NA>
    ## 5    No    No    No    No    No    No <NA>  <NA>  <NA> <NA> <NA>
    ##                      LANX                                 MAR MARHD MARHM MARHT
    ## 1 No, speaks only English Never married or under 15 years old  <NA>  <NA>  <NA>
    ## 2 No, speaks only English Never married or under 15 years old  <NA>  <NA>  <NA>
    ## 3 No, speaks only English Never married or under 15 years old  <NA>  <NA>  <NA>
    ## 4 No, speaks only English Never married or under 15 years old  <NA>  <NA>  <NA>
    ## 5 No, speaks only English Never married or under 15 years old  <NA>  <NA>  <NA>
    ##   MARHW MARHYP                         MIG  MIL MLPA MLPB MLPCD MLPE MLPFG MLPH
    ## 1  <NA>   <NA> Yes, same house (nonmovers) <NA> <NA> <NA>  <NA> <NA>  <NA> <NA>
    ## 2  <NA>   <NA> Yes, same house (nonmovers) <NA> <NA> <NA>  <NA> <NA>  <NA> <NA>
    ## 3  <NA>   <NA> Yes, same house (nonmovers) <NA> <NA> <NA>  <NA> <NA>  <NA> <NA>
    ## 4  <NA>   <NA> Yes, same house (nonmovers) <NA> <NA> <NA>  <NA> <NA>  <NA> <NA>
    ## 5  <NA>   <NA> Yes, same house (nonmovers) <NA> <NA> <NA>  <NA> <NA>  <NA> <NA>
    ##   MLPI MLPJ MLPK NWAB NWAV NWLA NWLK NWRE  OIP  PAP                       RELP
    ## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Biological son or daughter
    ## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>    Stepson or stepdaughter
    ## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Biological son or daughter
    ## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Biological son or daughter
    ## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>             Other relative
    ##   RETP                                  SCH    SCHG         SCHL SEMP    SEX
    ## 1 <NA> Yes, public school or public college Grade 1 Kindergarten <NA> Female
    ## 2 <NA> Yes, public school or public college Grade 2      Grade 1 <NA> Female
    ## 3 <NA> Yes, public school or public college Grade 6      Grade 5 <NA> Female
    ## 4 <NA> Yes, public school or public college Grade 4      Grade 4 <NA>   Male
    ## 5 <NA> Yes, public school or public college Grade 1 Kindergarten <NA>   Male
    ##   SSIP  SSP WAGP WKHP  WKL  WKW  WRK YOEP          ANC            ANC1P
    ## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>       Single African American
    ## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>       Single African American
    ## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>     Multiple African American
    ## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Not reported     Not reported
    ## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Not reported     Not reported
    ##          ANC2P DECADE                  DIS DRIVESP                         ESP
    ## 1 Not reported   <NA> Without a disability    <NA> Both parents in labor force
    ## 2 Not reported   <NA> Without a disability    <NA> Both parents in labor force
    ## 3        Irish   <NA>    With a disability    <NA>   Mother in the labor force
    ## 4 Not reported   <NA>    With a disability    <NA> Both parents in labor force
    ## 5 Not reported   <NA> Without a disability    <NA>                        <NA>
    ##    ESR FHICOVP FOD1P FOD2P                          HICOV
    ## 1 <NA>      No  <NA>  <NA> With health insurance coverage
    ## 2 <NA>      No  <NA>  <NA> With health insurance coverage
    ## 3 <NA>      No  <NA>  <NA> With health insurance coverage
    ## 4 <NA>      No  <NA>  <NA> With health insurance coverage
    ## 5 <NA>     Yes  <NA>  <NA> With health insurance coverage
    ##                          HISP INDP JWAP JWDP MIGPUMA MIGSP  MSP NAICSP NATIVITY
    ## 1 Not Spanish/Hispanic/Latino <NA> <NA> <NA>    <NA>  <NA> <NA>   <NA>   Native
    ## 2 Not Spanish/Hispanic/Latino <NA> <NA> <NA>    <NA>  <NA> <NA>   <NA>   Native
    ## 3 Not Spanish/Hispanic/Latino <NA> <NA> <NA>    <NA>  <NA> <NA>   <NA>   Native
    ## 4 Not Spanish/Hispanic/Latino <NA> <NA> <NA>    <NA>  <NA> <NA>   <NA>   Native
    ## 5 Not Spanish/Hispanic/Latino <NA> <NA> <NA>    <NA>  <NA> <NA>   <NA>   Native
    ##                                            NOP               OC OCCP PAOC PERNP
    ## 1 Living with two parents: Both parents NATIVE              Yes <NA> <NA>  <NA>
    ## 2 Living with two parents: Both parents NATIVE              Yes <NA> <NA>  <NA>
    ## 3       Living with mother only: Mother NATIVE              Yes <NA> <NA>  <NA>
    ## 4 Living with two parents: Both parents NATIVE              Yes <NA> <NA>  <NA>
    ## 5                                         <NA> No (includes GQ) <NA> <NA>  <NA>
    ##   PINCP       POBP POVPIP POWPUMA POWSP
    ## 1  <NA> Alabama/AL    158    <NA>  <NA>
    ## 2  <NA> Alabama/AL    158    <NA>  <NA>
    ## 3  <NA> Alabama/AL    072    <NA>  <NA>
    ## 4  <NA> Alabama/AL    003    <NA>  <NA>
    ## 5  <NA> Alabama/AL    079    <NA>  <NA>
    ##                                     PRIVCOV                         PUBCOV
    ## 1 Without private health insurance coverage    With public health coverage
    ## 2 Without private health insurance coverage    With public health coverage
    ## 3 Without private health insurance coverage    With public health coverage
    ## 4 Without private health insurance coverage    With public health coverage
    ## 5    With private health insurance coverage Without public health coverage
    ##                  QTRBIR                           RAC1P
    ## 1    April through June Black or African American alone
    ## 2 January through March Black or African American alone
    ## 3    April through June               Two or More Races
    ## 4 January through March                     White alone
    ## 5    April through June                     White alone
    ##                             RAC2P                            RAC3P RACAIAN
    ## 1 Black or African American alone  Black or African American alone      No
    ## 2 Black or African American alone  Black or African American alone      No
    ## 3               Two or More Races White; Black or African American      No
    ## 4                     White alone                      White alone      No
    ## 5                     White alone                      White alone      No
    ##   RACASN RACBLK RACNH RACNUM RACPI RACSOR RACWHT  RC SCIENGP SCIENGRLP  SFN
    ## 1     No    Yes    No      1    No     No     No Yes    <NA>      <NA> <NA>
    ## 2     No    Yes    No      1    No     No     No Yes    <NA>      <NA> <NA>
    ## 3     No    Yes    No      2    No     No    Yes Yes    <NA>      <NA> <NA>
    ## 4     No     No    No      1    No     No    Yes Yes    <NA>      <NA> <NA>
    ## 5     No     No    No      1    No     No    Yes Yes    <NA>      <NA> <NA>
    ##    SFR SOCP  VPS                     WAOB FAGEP FANCP FCITP FCITWP FCOWP FDDRSP
    ## 1 <NA> <NA> <NA> US state (POB = 001-059)    No    No    No     No    No     No
    ## 2 <NA> <NA> <NA> US state (POB = 001-059)    No    No    No     No    No     No
    ## 3 <NA> <NA> <NA> US state (POB = 001-059)    No    No    No     No    No     No
    ## 4 <NA> <NA> <NA> US state (POB = 001-059)    No    No    No     No    No     No
    ## 5 <NA> <NA> <NA> US state (POB = 001-059)    No    No    No     No    No    Yes
    ##   FDEARP FDEYEP FDISP FDOUTP FDPHYP FDRATP FDRATXP FDREMP FENGP FESRP FFERP
    ## 1     No     No    No     No     No     No      No     No    No    No    No
    ## 2     No     No    No     No     No     No      No     No    No    No    No
    ## 3     No     No    No     No     No     No      No     No    No    No    No
    ## 4     No     No    No     No     No     No      No     No    No    No    No
    ## 5    Yes    Yes   Yes     No    Yes     No      No    Yes    No    No    No
    ##   FFODP FGCLP FGCMP FGCRP FHINS1P FHINS2P FHINS3C FHINS3P FHINS4C FHINS4P
    ## 1    No    No    No    No      No      No    <NA>      No      No      No
    ## 2    No    No    No    No      No      No    <NA>      No      No      No
    ## 3    No    No    No    No      No      No    <NA>      No      No      No
    ## 4    No    No    No    No      No      No    <NA>      No      No      No
    ## 5    No    No    No    No     Yes     Yes    <NA>     Yes    <NA>     Yes
    ##   FHINS5C FHINS5P FHINS6P FHINS7P FHISP FINDP FINTP FJWDP FJWMNP FJWRIP FJWTRP
    ## 1    <NA>      No      No      No    No    No    No    No     No     No     No
    ## 2    <NA>      No      No      No    No    No    No    No     No     No     No
    ## 3    <NA>      No      No      No    No    No    No    No     No     No     No
    ## 4    <NA>      No      No      No    No    No    No    No     No     No     No
    ## 5    <NA>     Yes     Yes     Yes    No    No    No    No     No     No     No
    ##   FLANP FLANXP FMARHDP FMARHMP FMARHTP FMARHWP FMARHYP FMARP FMIGP FMIGSP
    ## 1    No     No      No      No      No      No      No    No    No     No
    ## 2    No     No      No      No      No      No      No    No    No     No
    ## 3    No     No      No      No      No      No      No    No    No     No
    ## 4    No     No      No      No      No      No      No    No    No     No
    ## 5    No     No      No      No      No      No      No    No    No     No
    ##   FMILPP FMILSP FOCCP FOIP FPAP FPERNP FPINCP FPOBP FPOWSP FPRIVCOVP FPUBCOVP
    ## 1     No     No    No   No   No     No     No    No     No        No       No
    ## 2     No     No    No   No   No     No     No    No     No        No       No
    ## 3     No     No    No   No   No     No     No    No     No        No       No
    ## 4     No     No    No   No   No     No     No    No     No        No       No
    ## 5     No     No    No   No   No     No     No    No     No       Yes      Yes
    ##   FRACP FRELP FRETP FSCHGP FSCHLP FSCHP FSEMP FSEXP FSSIP FSSP FWAGP FWKHP
    ## 1    No    No    No     No     No    No    No    No    No   No    No    No
    ## 2    No    No    No     No     No    No    No    No    No   No    No    No
    ## 3    No    No    No     No     No    No    No    No    No   No    No    No
    ## 4    No    No    No     No     No    No    No    No    No   No    No    No
    ## 5    No    No    No     No     No    No    No    No    No   No    No    No
    ##   FWKLP FWKWP FWRKP FYOEP
    ## 1    No    No    No    No
    ## 2    No    No    No    No
    ## 3    No    No    No    No
    ## 4    No    No    No    No
    ## 5    No    No    No    No

``` r
# remote table handles; no data is pulled into R yet
dpus <- tbl(db, "dpus")
dhus <- tbl(db, "dhus")

# print(dpus)

# view(rsummary(db, "dpus"))

# Perform as many of the steps of PUMS1.Rmd
# in dplyr as is practical.
# Note that in these early stages the data stays in the database.

target_emp_levs <- c(
  "Employee of a private for-profit company or busine",
  "Employee of a private not-for-profit, tax-exempt, ",
  "Federal government employee",                    
  "Local government employee (city, county, etc.)",   
  "Self-employed in own incorporated business, profes",
  "Self-employed in own not incorporated business, pr",
  "State government employee")


dpus <- dpus %>%
  select(., AGEP, COW, ESR,  PERNP, 
         PINCP, SCHL, SEX, WKHP) %>%
  mutate_at(., c("AGEP", "PERNP", "PINCP", "WKHP"),
            as.numeric) %>%
  filter_all(., all_vars(!is.na(.))) %>% 
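  # SUBSTR is a SQL function, not an R one; dbplyr passes it through to SQLite.
  # It truncates COW to 50 characters so it matches target_emp_levs.
  mutate(., COW = SUBSTR(COW, 1, 50)) %>%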
  filter(., (PINCP>1000) & 
           (ESR=="Civilian employed, at work") & 
           (PINCP<=250000) & 
           (PERNP>1000) & (PERNP<=250000) & 
           (WKHP>=30) & 
           (AGEP>=18) & (AGEP<=65) & 
           (COW %in% target_emp_levs)) %>%
  mutate(., 
         SCHL = ifelse(is.na(SCHL) |
                         (!(SCHL %in% 
                              c("Associate's degree",
                                "Bachelor's degree",
                                "Doctorate degree",
                                "Master's degree",
                                "Professional degree beyond a bachelor's degree"))),
                       "No Advanced Degree",
                       SCHL))

glimpse(dpus)
```

    ## Rows: ??
    ## Columns: 8
    ## Database: sqlite 3.33.0 [:memory:]
    ## $ AGEP  <dbl> 24, 31, 26, 27, 54, 64, 27, 47, 24, 58, 41, 61, 61, 43, 21, 5...
    ## $ COW   <chr> "Employee of a private for-profit company or busine", "Employ...
    ## $ ESR   <chr> "Civilian employed, at work", "Civilian employed, at work", "...
    ## $ PERNP <dbl> 22000, 21000, 21000, 25000, 31200, 40000, 13000, 36000, 20000...
    ## $ PINCP <dbl> 22000, 21000, 25800, 25000, 31200, 40000, 20200, 36000, 20000...
    ## $ SCHL  <chr> "No Advanced Degree", "No Advanced Degree", "No Advanced Degr...
    ## $ SEX   <chr> "Male", "Female", "Female", "Female", "Male", "Male", "Female...
    ## $ WKHP  <dbl> 40, 40, 40, 40, 40, 40, 40, 50, 40, 40, 40, 30, 40, 40, 40, 6...
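
The `Rows: ??` line above reflects that `dpus` is still a lazy query: nothing has been computed in R yet. A minimal sketch of how one might check this, using a small made-up table (`toy`, its values, and the connection name `con` are illustrative assumptions, not part of the PUMS data):

``` r
library("DBI")
library("dplyr")

# hypothetical toy table standing in for dpus; columns stored as text
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "toy",
             data.frame(AGEP = c("24", "70", "31"),
                        PINCP = c("22000", "500", "41000"),
                        stringsAsFactors = FALSE))

# build a lazy query; this only records the steps
toy <- tbl(con, "toy") %>%
  mutate(AGEP = as.numeric(AGEP), PINCP = as.numeric(PINCP)) %>%
  filter(AGEP >= 18, AGEP <= 65, PINCP > 1000)

# print the SQL that dbplyr would send to the database
show_query(toy)

# only collect() runs the query and returns a local data frame
toy_local <- collect(toy)
dbDisconnect(con)
```

Calling `show_query()` on the real `dpus` pipeline should similarly show the generated SQL, including the pass-through `SUBSTR` call.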

``` r
dpus %>%
  group_by(., SCHL, SEX) %>%
  summarize(., mean_income = mean(PINCP)) %>%
  ungroup(.) %>%
  arrange(., SCHL, SEX)
```

    ## Warning: Missing values are always removed in SQL.
    ## Use `mean(x, na.rm = TRUE)` to silence this warning
    ## This warning is displayed only once per session.

    ## # Source:     lazy query [?? x 3]
    ## # Database:   sqlite 3.33.0 [:memory:]
    ## # Ordered by: SCHL, SEX
    ##    SCHL               SEX    mean_income
    ##    <chr>              <chr>        <dbl>
    ##  1 Associate's degree Female      40990.
    ##  2 Associate's degree Male        56544.
    ##  3 Bachelor's degree  Female      56569.
    ##  4 Bachelor's degree  Male        76132.
    ##  5 Doctorate degree   Female      84251.
    ##  6 Doctorate degree   Male        96944.
    ##  7 Master's degree    Female      69107.
    ##  8 Master's degree    Male        94054.
    ##  9 No Advanced Degree Female      32048.
    ## 10 No Advanced Degree Male        43292.
    ## # ... with more rows

``` r
# bring the data from the database into R
dpus <- collect(dpus)

dpus$SCHL <- relevel(factor(dpus$SCHL), 
                     "No Advanced Degree")
dpus$COW <- relevel(factor(dpus$COW), 
                    target_emp_levs[[1]])
dpus$SEX <- relevel(factor(dpus$SEX), 
                    "Male")

set.seed(2019)
is_train <- runif(nrow(dpus))>=0.2
dpus_train <- dpus[is_train, , drop = FALSE]
dpus_test <- dpus[!is_train, , drop = FALSE]

model <- lm(PINCP ~ AGEP + COW + SCHL + SEX, 
            data = dpus_train)
summary(model)
```

    ## 
    ## Call:
    ## lm(formula = PINCP ~ AGEP + COW + SCHL + SEX, data = dpus_train)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -114164  -19792   -5197   12793  204368 
    ## 
    ## Coefficients:
    ##                                                        Estimate Std. Error
    ## (Intercept)                                            12569.62     709.23
    ## AGEP                                                     809.17      15.79
    ## COWEmployee of a private not-for-profit, tax-exempt,   -6657.41     747.20
    ## COWFederal government employee                         10390.29    1217.75
    ## COWLocal government employee (city, county, etc.)      -6077.28     777.66
    ## COWSelf-employed in own incorporated business, profes   5599.18    1120.97
    ## COWSelf-employed in own not incorporated business, pr -13944.71     953.94
    ## COWState government employee                           -9268.98     937.10
    ## SCHLAssociate's degree                                 10009.64     668.87
    ## SCHLBachelor's degree                                  29608.35     487.56
    ## SCHLDoctorate degree                                   50375.03    1782.52
    ## SCHLMaster's degree                                    43505.87     709.90
    ## SCHLProfessional degree beyond a bachelor's degree     62155.63    1428.10
    ## SEXFemale                                             -13869.17     395.48
    ##                                                       t value Pr(>|t|)    
    ## (Intercept)                                            17.723  < 2e-16 ***
    ## AGEP                                                   51.249  < 2e-16 ***
    ## COWEmployee of a private not-for-profit, tax-exempt,   -8.910  < 2e-16 ***
    ## COWFederal government employee                          8.532  < 2e-16 ***
    ## COWLocal government employee (city, county, etc.)      -7.815 5.68e-15 ***
    ## COWSelf-employed in own incorporated business, profes   4.995 5.92e-07 ***
    ## COWSelf-employed in own not incorporated business, pr -14.618  < 2e-16 ***
    ## COWState government employee                           -9.891  < 2e-16 ***
    ## SCHLAssociate's degree                                 14.965  < 2e-16 ***
    ## SCHLBachelor's degree                                  60.728  < 2e-16 ***
    ## SCHLDoctorate degree                                   28.261  < 2e-16 ***
    ## SCHLMaster's degree                                    61.285  < 2e-16 ***
    ## SCHLProfessional degree beyond a bachelor's degree     43.523  < 2e-16 ***
    ## SEXFemale                                             -35.069  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 33260 on 29522 degrees of freedom
    ## Multiple R-squared:  0.2922, Adjusted R-squared:  0.2919 
    ## F-statistic: 937.6 on 13 and 29522 DF,  p-value: < 2.2e-16
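
The coefficient table above has no rows for "No Advanced Degree", "Male", or the first `COW` level because `relevel()` made them the reference levels, so each remaining coefficient is a difference from that baseline. A small sketch of what `relevel()` does to a factor's level order (the vector `f` is a made-up example):

``` r
# made-up factor; default level order is alphabetical
f <- factor(c("b", "a", "c", "a"))
levels(f)

# move "c" to the front so lm() would treat it as the reference level
f2 <- relevel(f, "c")
levels(f2)
```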

``` r
dpus_test$predicted_income <- predict(model,
                                      newdata = dpus_test)
WVPlots::ScatterHist(dpus_test, "predicted_income", "PINCP",
                     "PINCP as function of predicted income on held-out data",
                     smoothmethod = "identity",
                     contour = TRUE)
```

![](PUMS1_dplyr_files/figure-gfm/unnamed-chunk-1-1.png)<!-- -->
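
Alongside the plot, a numeric summary of held-out fit can be useful. The sketch below computes RMSE and R-squared on a test split using simulated data; the data frame `d`, its columns, and the simulated relationship are assumptions for illustration only and say nothing about the PUMS results.

``` r
set.seed(2019)

# simulated data: y is roughly linear in x
d <- data.frame(x = runif(200))
d$y <- 3 * d$x + rnorm(200, sd = 0.1)

# same style of random split as used for dpus
is_train <- runif(nrow(d)) >= 0.2
d_train <- d[is_train, , drop = FALSE]
d_test <- d[!is_train, , drop = FALSE]

model_sim <- lm(y ~ x, data = d_train)
pred <- predict(model_sim, newdata = d_test)

# root mean squared error and R-squared on the held-out rows
rmse <- sqrt(mean((d_test$y - pred)^2))
rsq <- 1 - sum((d_test$y - pred)^2) /
  sum((d_test$y - mean(d_test$y))^2)
print(c(rmse = rmse, rsq = rsq))
```

The same two expressions could be applied to `dpus_test$PINCP` and `dpus_test$predicted_income`.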

``` r
DBI::dbDisconnect(db)
```
