#author: Group7 #name: Jaydip Patel, Vincent Chang, Nicole Haberer, Ivy Wang, & Griffin Parr

The setting for this project is to work with survey data from a popular (using the term loosely) pregnancy data website. Although the data are in a convenient format (download the .csv here: https://spacefem.com/pregnant/charts/duedate40.php ), this project will require you to transform variables and to bypass defaults in some functions that you should already be familiar with from Nick Reinholtz’s class. Not every task will be familiar. For example, I expect you will need to spend some time debugging plotting commands.

There are two deliverables for this project. 1. A well-commented and debugged script (.R file) that produces requested plots and calculations in response to the prompts below. 2. A brief write up of the answers to the questions in the prompts below (standard word processing format or pdf). The write up should be brief, but professional. Think of your audience as a non- technically oriented manager who wants to know the answers to the questions posed below. Accordingly, you should keep the write up free of “technical” details (i.e., “Our group then used lines() because it does not produce a new plot object, allowing us to plot the normal density curve on the same plot as the data.”), but it should have plenty of contextual detail (i.e., “As you can see from Figure 4, there is not much difference between the distributions of gestational periods for firstborn children and children of mothers who have previously had children.”)

Data Processing. Project Instructions. 1. Download the data to the working directory you will use for this project. Use the link above to navigate to the data. Within your script, set the working directory in order to shorten the path. 2. Read in the data (using read.csv()). 3. Use the “names()” function applied to your data frame object to inspect the names of the variables in the data. Save as a separate data frame a slimmer data frame that contains only information on the following variables: id, motherbirthyear, multiples, pounds, bornonday, duedatebase, induced, and excluded. 4. Use the summary() command to summarize each of the variables in this slimmer data frame. In the write-up, you should briefly describe these variables, with reference to a summary table of information. Keep in mind the relevant detail that a non-technical manager would want to know.

Executive Summary We began this project by subsetting the dataset into the slimmer dataframe you requested. As we read on the website and noticed in the data, there were several shortcomings in the dataset that needed cleaning. We only included the survey responses in which the “bornondate” was between 245 days (35 Weeks) and 301 days (43 days). We threw out the rows that SpaceFem excluded from the “exclude column” for various reasons. Another limitation was that if the survey respondent did not know their LMP date, it was assumed it was 280 days before the due date.

We were able to recreate the graph (see plot 1) comparing spontaneous and induced births to determine if either had an effect on the length of a pregnancy. We analyzed whether or not a mother having previous births affected the length of the pregnancy (see plot 2) and concluded that whether or not this was a mother’s first birth had no impact on the length of the pregnancy. We also concluded that the detection type had no impact on the length of a pregnancy (see plot 3). We investigated the impact of weight on the length of a pregnancy (see plot 4) and concluded that babies weighing less than 7 pounds tended to be born earlier while babies weighing more than or equal to 7 pounds were born slightly later. We examined whether the mother’s age had any impact on the length of the pregnancy (see plot 5) and concluded that there is a distinct spike in later births (41-42 weeks) for mothers born in the 1970s.

We overlaid a normal distribution onto a plot (see plot 6) and noticed that there is a greater likelihood of a baby being born before 40 weeks than after and a much greater variance in early births than late ones.

#global code
##global adjustments starts 01
rm(list=ls())
#seed for replication
set.seed(100)
oldwarnval <- getOption("warn") #preserve old value
options(warn = -1) #warnings are silent now

##global adjustments ends 01


#####generalized plot function
plotFunction <- function(x1,y1,x2,y2,x3 = -1,y3
                         ,ztitle = "plot"
                         ,zlegendText = c("var1","var2")
                         ,customYlim = c(0,7/100)
                         ,customLegend = c(35.2,37.5,0.065, 0.055)
                         ,customGrid = c(8,7)
                        ){
  #X fisrt line plot
  #Y second line plot  
  #Z thir line plot
  x = x1;y = y1;
  plot(x,y
       ,type= "o"
       ,main= ztitle
       ,col = rgb(6/10, 0, 0)  #36,137,371 #color of the line chart
       ,lwd = 2                  #line thickness
       ,pch = 19                  # add points, solid circle,
       ,cex = 0.3                # size of the point on the chart
       ,tick = F
       ,xaxs = "i"
       ,yaxs = "i"
       ,xaxt = "n"
       ,ylim = customYlim
       ,xlab = ""
       ,ylab = ""
  )  #customGrid = c(8,7)
  grid(nx = customGrid[1]#8
       ,ny = customGrid[2]#7
       , lty=1, col = "cornsilk2")

  lines(x, y
        ,type= "o"
        ,col = rgb(6/10, 0, 0)  #36,137,371 #color of the line chart
        ,lwd = 2                  #line thickness
        ,pch = 19                  # add points, solid circle,
        ,cex = 0.3                 # size of the point on the chart
  )
  x = x2; y = y2;
  lines(x, y
        ,type= "o"
        ,col = rgb(149/255, 41/255, 176/255)
        ,lwd = 2
        ,pch = 19
        ,cex = 0.3
  )
  if(x3 != -1){
    x = x3; y = y3;
    zcol <- c((rgb(6/10, 0, 0)), rgb(36/255, 29/255, 219/255),
      rgb(149/255, 41/255, 176/255)) #3 colors
    zcol1 <- rgb(36/255, 29/255, 219/255)
    lines(x, y, type = "o"
          ,col.lab="blue"
          ,col = zcol1
          ,lwd = 2
          ,pch = 19
          ,cex = 0.3
    )
  }else{
    x = 0; y = 0;
    zcol <- c((rgb(6/10, 0, 0)), rgb(149/255, 41/255, 176/255)) # 2 colors
  }

  axis(side = 1, at = 35:43
       ,labels = paste(c(35:43),"weeks")
       ,col.axis = rgb(151/255, 145/255, 145/255)
       ,tick = F
  )
  axis(side = 2
       ,col.axis = rgb(151/255, 145/255, 145/255)
       ,tick = F
  )
  legend( x = c(customLegend[1],customLegend[2] )
         ,y = c(customLegend[3],customLegend[4] ) #x=c(35.2,37.5),y=c(0.98, 0.75)
         ,legend = zlegendText
         ,fill = zcol
         ,text.col = rgb(151/255, 145/255, 10/255)
         ,border = 1
         ,box.lty= 0
         ,bty = "o"
         ,y.intersp=1.5
  )
  abline(v = 40)

  #return(1)
}
#read file
df <- data.frame(read.csv("~/Downloads/Tony/Project 1/sf_gestationdata.csv",stringsAsFactors = F))

Figure 1: Summary information for each Variable

  • id - a unique identification number for one survey entry. Each survey entry is assigned a unique identifier. This dataset contains almost 14,000 unique survey entries.
  • motherbirthyear - denotes the year in which the mother was born. multiples - describes how many babies were being born. If a mother was expecting twins it is labeled with a 2. A mother expecting triplets was given the label 3.
  • pounds - describes the baby’s weight in pounds.
  • previousbirths - tells us if this is the mother’s first birth or not. If it is her first birth, her survey entry was assigned the number 0. If it is her second, she was assigned 1. If it was her third, she was assigned 2 and this pattern continues.
  • bornonday - shows the difference between the date the baby was born and the LMP date. We, along with the website, decided to only include survey entries in which the baby was born between 245 (35 weeks) days and 301 (43 weeks) One limitation of the dataset is that if the LMP was not known, the data assumes it was 280 days before the due date. The dataset assumes 280 days because of Naegele’s rule which is the traditional method of calculating a due date. duedatebase - describes if the due date was determined by ovulation, LMP, or ultrasound.
  • induced - indicates if the birth was induced or not. If it was induced, it was given a 1, if it was spontaneous it was given a 0.
  • exclude - indicates whether a record was excluded from the charts for some reason. Approximately 0.02% of the survey entries were excluded.
  • #elmininate date based on exclude column
    df <- df[df$exclude == 0,]
    #slimmer dataFrame
    df02 <- df[,c("id", "motherbirthyear", "multiples", "pounds", "bornonday"
                  , "duedatebase", "induced", "exclude", "previousbirths")]
    
    #head at a glance
    head(df02)
    
    ##   id motherbirthyear multiples pounds bornonday duedatebase induced
    ## 1  6            1974         1      7       277       Other       1
    ## 3  8            1979         1      7       277         LMP       0
    ## 4  9            1975         1      8       278         LMP       0
    ## 5 10            1984         1     10       286         LMP       0
    ## 6 11            1984         1      8       284         LMP       1
    ## 7 12            1978         1      8       287  ultrasound       0
    ##   exclude previousbirths
    ## 1       0              0
    ## 3       0              0
    ## 4       0              1
    ## 5       0              1
    ## 6       0              0
    ## 7       0              0
    
    summary(df02)
    
    ##        id        motherbirthyear   multiples         pounds       
    ##  Min.   :    6   Min.   :    0   Min.   :1.000   Min.   :  0.000  
    ##  1st Qu.: 3660   1st Qu.: 1979   1st Qu.:1.000   1st Qu.:  6.000  
    ##  Median : 7169   Median : 1983   Median :1.000   Median :  7.000  
    ##  Mean   : 7063   Mean   : 1907   Mean   :1.007   Mean   :  6.358  
    ##  3rd Qu.:10486   3rd Qu.: 1987   3rd Qu.:1.000   3rd Qu.:  8.000  
    ##  Max.   :13799   Max.   :19955   Max.   :3.000   Max.   :610.000  
    ##    bornonday     duedatebase           induced          exclude 
    ##  Min.   :100.0   Length:13255       Min.   :0.0000   Min.   :0  
    ##  1st Qu.:272.0   Class :character   1st Qu.:0.0000   1st Qu.:0  
    ##  Median :279.0   Mode  :character   Median :0.0000   Median :0  
    ##  Mean   :275.3                      Mean   :0.2413   Mean   :0  
    ##  3rd Qu.:286.0                      3rd Qu.:0.0000   3rd Qu.:0  
    ##  Max.   :333.0                      Max.   :1.0000   Max.   :0  
    ##  previousbirths   
    ##  Min.   : 0.0000  
    ##  1st Qu.: 0.0000  
    ##  Median : 0.0000  
    ##  Mean   : 0.4104  
    ##  3rd Qu.: 1.0000  
    ##  Max.   :70.0000
    

    PlotOne

    Based on separating the data into induced vs. spontaneous births, it is clear that both variables are centered around the same mean. The distribution of both spontaneous and induced births are normally distributed but are both skewed slightly left. As seen in Plot 1, there is significantly more data on spontaneous births relative to induced births. This limitation inhibits our ability to make supportable interpretations. Even with limited induced data, we conclude that whether a birth was induced or spontaneous appears to have no observable impact on the length of the pregnancy.

    ####Working with the data.
    #add week column to the df
    df02$week <- round(df02$bornonday/7,02) #revisit # week generation
    ####plotOne
    #add Spontaneous column to the table
    df02$Spontaneous <- ifelse(df02$induced == 1,0,1)
    #aggregate table to plot
    df03induced <- aggregate(induced ~ week,FUN = sum, data = df02[,c("week","induced","Spontaneous")])
    df03spontaneous <- aggregate(Spontaneous ~ week,FUN = sum, data = df02[,c("week","induced","Spontaneous")])
    
    #data frames to plot One 
    df03induced <- df03induced[df03induced$week >= 35.00 & df03induced$week <= 43.00,]
    df03spontaneous <- df03spontaneous[df03spontaneous$week >= 35.00
                                        & df03spontaneous$week <= 43.00,]
    #normalizing the axis 
    normFactor <- max(df03spontaneous$Spontaneous) + 51 #normalize
    df03spontaneous$Spontaneous <- df03spontaneous$Spontaneous/normFactor
    #normFactor <- max(df03induced$induced) #normalize
    df03induced$induced <- df03induced$induced/normFactor
    
    #custom parameters
    customYlim = c(0,1)
    customLegend = c(35.2,37.5,0.98, 0.75)
    customGrid = c(10,7)
    #plot One 
    plotFunction(x1 = df03spontaneous$week, y1 = df03spontaneous$Spontaneous,
                 x2 = df03induced$week , y2= df03induced$induced,
                 ztitle = "PlotOne : Spontaneous vs induced",
                 zlegendText = c("Spontaneous","induced"),
                 customYlim = customYlim,
                 customLegend = customLegend,
                 customGrid = customGrid
    )
    
    plot of chunk unnamed-chunk-3

    PlotTwo

    This chart compares the length of pregnancy for mothers undergoing their first birth compared to mothers who have had previous births. The distribution are both normally distributed but are slightly left skewed. There does not appear to be any statistically significant differences in the length of a mother’s pregnancy. Therefore, we would argue that conventional medical wisdom arguing that shorter first pregnancies result in shorter second pregnancies is not supported by this data.

    #### plotTwo
    # First births vs births 2+
    # head(df02)
    #new dfs for plots # weeks in [35,43]
    df04fb <- df02[df02$previousbirths == 0 & df02$week >=35.00 & df02$week <=43.00,]  #"first Birth df"
    df04pb <- df02[df02$previousbirths == 1 & df02$week >=35.00 & df02$week <=43.00,]  #"Previous Birth df"
    
    #aggregate tables
    df04fbA <- aggregate(previousbirths ~ week,FUN= length , df04fb[,c("week","previousbirths")] )
    df04pbA <- aggregate(previousbirths ~ week,FUN= length , df04pb[,c("week","previousbirths")] )
    
    #count to be normalized
    count <- sum(df04fbA$previousbirths)
    df04fbA$previousbirths <- df04fbA$previousbirths/count #normalization with count
    count <- sum(df04pbA$previousbirths)
    df04pbA$previousbirths <- df04pbA$previousbirths/count #normalization with count
    
    #plotTwo
    plotFunction(x1 = df04fbA$week, y1 = df04fbA$previousbirths,
                 x2 = df04pbA$week, y2= df04pbA$previousbirths,
                 ztitle = "PlotTwo : First births vs births 2+.",
                 zlegendText = c("First births","Previous births")
    )
    
    plot of chunk unnamed-chunk-4

    PlotThree

    This chart compares the length of the pregnancy for the three different detection types. The distributions for all three are normally distributed with a slight left skew. The data indicates that the different detection types have no statistically significant impact on the length of a mother’s pregnancy.

    ####Plot 3
    #Detection type:Ultrasound vs Ovulation vs LMP
    #head(df02)
    unique(df02$duedatebase)
    
    ## [1] "Other"      "LMP"        "ultrasound" "NULL"       "ovulation"
    
    df05 <- df02[df02$duedatebase=="LMP"             #df for ploting
                   | df02$duedatebase=="ultrasound"
                   | df02$duedatebase=="ovulation",]
    
    #dummy columns
    df05$LMP <- ifelse(df05$duedatebase == "LMP",1,0) #LMP column
    df05$ultrasound <- ifelse(df05$duedatebase == "ultrasound",1,0) #ultrasound column
    df05$ovulation <- ifelse(df05$duedatebase == "ovulation",1,0) # Ovulation
    
    #aggregate tables
    df05Lmp <- aggregate(LMP ~ week,FUN= sum , data = df05[,c("week","LMP","ultrasound","ovulation")])
    df05ultrasound <- aggregate(ultrasound ~ week,FUN= sum , data = df05[,c("week","LMP","ultrasound","ovulation")])
    df05ovulation <- aggregate(ovulation ~ week,FUN= sum , data = df05[,c("week","LMP","ultrasound","ovulation")])
    
    #remove weeks !in [35.00,43.00]
    df05Lmp <- df05Lmp[df05Lmp$week >= 35.00 & df05Lmp$week <= 43.00,]
    df05ultrasound <- df05ultrasound[df05ultrasound$week >= 35.00 & df05ultrasound$week <= 43.00,]
    df05ovulation <- df05ovulation[df05ovulation$week >= 35.00 & df05ovulation$week <= 43.00,]
    
    #counter to normalize the table
    count = sum(df05Lmp$LMP)
    df05Lmp$LMP <- df05Lmp$LMP/count #nomalize 
    count = sum(df05ultrasound$ultrasound)
    df05ultrasound$ultrasound <- df05ultrasound$ultrasound/count
    count = sum(df05ovulation$ovulation)
    df05ovulation$ovulation <- df05ovulation$ovulation/count
    
    #plotThree
    plotFunction(x1 = df05Lmp$week, y1 = df05Lmp$LMP,
                 x2 = df05ovulation$week , y2= df05ovulation$ovulation,
                 x3 = df05ultrasound$week, y3= df05ultrasound$ultrasound,
                 ztitle = "PlotThree : Detection type: ultrasound, ovulation, LMP.",
                 zlegendText = c("LMP","Ovulation", "Ultrasound")
    )
    
    plot of chunk unnamed-chunk-5

    PlotFour

    This chart compares the length of pregnancy for babies who were born weighing less than 7 pounds and babies who were born weighing more than or equal to 7 pounds. The means are the same and the data is relatively normally distributed. The distributions for babies weighing under 7 pounds and more than or equal to 7 pounds are skewed left. To us, this chart appears to indicate that that babies born weighing less than 7 pounds tend to be born slightly earlier while babies weighing more than or equal to 7 pounds tend to be born slightly later.

    #####plotFour
    #Baby weight >= 7lbs versus Baby weight < 7lbs
    #head(df02,20)
    df02$poundsGT7 <- ifelse(df02$pounds >=7, 1, 0) #dummy column
    df06b7 <- df02[df02$poundsGT7 == 0,c("week","poundsGT7")]  #below 7
    df06a7 <- df02[df02$poundsGT7 == 1,c("week","poundsGT7")]  #above 7
    
    #week within 35 to 43
    df06b7 <- df06b7[df06b7$week >= 35.00 & df06b7$week <=43.00,]
    df06a7 <- df06a7[df06a7$week >= 35.00 & df06a7$week <=43.00,]
    
    #aggregate tables for ploting
    df06b7tab <- aggregate(poundsGT7 ~ week ,FUN = length, data = df06b7) # below 7
    df06a7tab <- aggregate(poundsGT7 ~ week ,FUN = length, data = df06a7) # above 7
    
    #count for normalize
    count <- sum(df06b7tab$poundsGT7)
    df06b7tab$poundsGT7 <- df06b7tab$poundsGT7/count
    count <- sum(df06a7tab$poundsGT7)
    df06a7tab$poundsGT7 <- df06a7tab$poundsGT7/count
    
    #PlotFour : Baby weight >= 7lbs versus Baby weight < 7lbs
    plotFunction(x1 = df06a7tab$week , y1= df06a7tab$poundsGT7,
                 x2 = df06a7tab$week, y2 = df06b7tab$poundsGT7,
                 ztitle = "PlotFour : Baby weight >= 7lbs versus Baby weight < 7lbs",
                 zlegendText = c("Baby weight >= 7lbs","Baby weight < 7lbs")
    )
    
    plot of chunk unnamed-chunk-6

    PlotFive

    Comparing mothers born in the 1970s compared to those born in the 1980s or later, both groups still center around the 40 week mean, however, there is a distinct spike in later births (41-42 weeks) for mothers born in the 1970s. One speculation is that this spike may be attributed to induced births around 41-42 weeks as shown in the original graph.

    ####PlotFive 
    #head(df02,20)
    df70s <-df02[df02$motherbirthyear > 1970 & df02$motherbirthyear <=1979,c("week","motherbirthyear")]
    df80s <-df02[df02$motherbirthyear > 1980 & df02$motherbirthyear < 2019,c("week","motherbirthyear")]
    
    df70s <-df70s[df70s$week >= 35.00 & df70s$week <= 43.00,]
    df80s <-df80s[df80s$week >= 35.00 & df80s$week <= 43.00,]
    
    #aggregate tables
    df70stab <- aggregate(motherbirthyear ~ week,FUN = length,df70s)
    df80stab <- aggregate(motherbirthyear ~ week,FUN = length,df80s)
    
    #normalization
    counter <- sum(df70stab$motherbirthyear)+ 10
    df70stab$motherbirthyear <- df70stab$motherbirthyear/counter
    counter <- sum(df80stab$motherbirthyear)+ 10
    df80stab$motherbirthyear <- df80stab$motherbirthyear/counter
    
    #PlotFive: Mothers whose birth year is in the 1970s, versus mothers whose birthyear is 1980s or later.
    ztitle <- "PlotFive: Mothers whose birth year is in the 1970s, versus mothers whose birthyear is 1980s or later."
    plotFunction(x1 = df80stab$week , y1= df80stab$motherbirthyear,
                 x2 = df70stab$week, y2 = df70stab$motherbirthyear,
                 ztitle= ztitle,
                 zlegendText = c("birth year is in the 1980s","birth year is in the 1970s")
    )
    
    plot of chunk unnamed-chunk-7

    NextPart(PlotSix)

    Calculate the sample mean and standard deviation of the born-on date, and use the dnorm() command to overlay a normal distribution plot on a plot like you generate above, excluding observations excluded from the charts (help(dnorm) if you haven’t used this before). Does the normal distribution provide a useful description of the likelihood a baby is born before the due date? Does the normal distribution provide a useful description of the likelihood a baby is born after the due date?

    Yes, this normal distribution gives a clear view of the likelihood a baby is born before the due date and after the due date. Overall, this distribution is centered on a mean just under 40 weeks with a slight skew to the left. This indicates that there is a greater likelihood of a baby being born before 40 weeks than after and a much greater variance in early births than late ones.

    ####Que 3
    #Calculate the sample mean and standard deviation of the born-on date, and use the dnorm()
    #command to overlay a normal distribution plot on a plot like you generate above,
    #excluding observations excluded from the charts. Does the normal distribution 
    #provide a useful description of the likelihood a baby is born before the due date?
    #Does the normal distribution provide a useful description of the likelihood a baby
    #is born after the due date?
    
    #mean & Sd
    mean1 <- mean(df02[df02$week >= 35.00 & df02$week < 43.00,"week"])   #39.77773
    sd1 <- sd(df02[df02$week >= 35.00 & df02$week < 43.00,"week"])       #1.417654
    
    #grid
    x1 = 3500:43/7 #grid, increment in 7th  # from Website, group discussion
    mean1 = mean(df05$week)
    sd1 = 9.57/7
    d <- dnorm(x1, mean = mean1, sd = sd1)
    d <- d/sum(d)
    #plotSix
    plotFunction(x1 = df05Lmp$week, y1 = df05Lmp$LMP,
                 x2 = df05ovulation$week , y2= df05ovulation$ovulation,
                 x3 = df05ultrasound$week, y3= df05ultrasound$ultrasound,
                 ztitle = "PlotSix : Born on date",
                 zlegendText = c("LMP","Ovulation", "Ultrasound")
    )
    lines(x = x1, y = d
                ,type= "o"
                ,col = "black"  #black #color of the line chart
                ,lty = "dashed"                  #line thickness
               # ,pch = 0                  # add points, solid circle,
                ,cex = 0.3                 # size of the point on the chart
    )
    
    plot of chunk unnamed-chunk-8
    ###global adjustments start 2
    options(warn = oldwarnval) #reset the warning message
    ###global adjustments end 2