#author: Group7 #name: Jaydip Patel, Vincent Chang, Nicole Haberer, Ivy Wang, & Griffin Parr
The setting for this project is to work with survey data from a popular (using the term loosely) pregnancy data website. Although the data are in a convenient format (download the .csv here: https://spacefem.com/pregnant/charts/duedate40.php ), this project will require you to transform variables and to bypass defaults in some functions that you should already be familiar with from Nick Reinholtz’s class. Not every task will be familiar. For example, I expect you will need to spend some time debugging plotting commands.
There are two deliverables for this project. 1. A well-commented and debugged script (.R file) that produces requested plots and calculations in response to the prompts below. 2. A brief write up of the answers to the questions in the prompts below (standard word processing format or pdf). The write up should be brief, but professional. Think of your audience as a non- technically oriented manager who wants to know the answers to the questions posed below. Accordingly, you should keep the write up free of “technical” details (i.e., “Our group then used lines() because it does not produce a new plot object, allowing us to plot the normal density curve on the same plot as the data.”), but it should have plenty of contextual detail (i.e., “As you can see from Figure 4, there is not much difference between the distributions of gestational periods for firstborn children and children of mothers who have previously had children.”)
Data Processing. Project Instructions. 1. Download the data to the working directory you will use for this project. Use the link above to navigate to the data. Within your script, set the working directory in order to shorten the path. 2. Read in the data (using read.csv()). 3. Use the “names()” function applied to your data frame object to inspect the names of the variables in the data. Save as a separate data frame a slimmer data frame that contains only information on the following variables: id, motherbirthyear, multiples, pounds, bornonday, duedatebase, induced, and excluded. 4. Use the summary() command to summarize each of the variables in this slimmer data frame. In the write-up, you should briefly describe these variables, with reference to a summary table of information. Keep in mind the relevant detail that a non-technical manager would want to know.
Executive Summary We began this project by subsetting the dataset into the slimmer dataframe you requested. As we read on the website and noticed in the data, there were several shortcomings in the dataset that needed cleaning. We only included the survey responses in which the “bornondate” was between 245 days (35 Weeks) and 301 days (43 days). We threw out the rows that SpaceFem excluded from the “exclude column” for various reasons. Another limitation was that if the survey respondent did not know their LMP date, it was assumed it was 280 days before the due date.
We were able to recreate the graph (see plot 1) comparing spontaneous and induced births to determine if either had an effect on the length of a pregnancy. We analyzed whether or not a mother having previous births affected the length of the pregnancy (see plot 2) and concluded that whether or not this was a mother’s first birth had no impact on the length of the pregnancy. We also concluded that the detection type had no impact on the length of a pregnancy (see plot 3). We investigated the impact of weight on the length of a pregnancy (see plot 4) and concluded that babies weighing less than 7 pounds tended to be born earlier while babies weighing more than or equal to 7 pounds were born slightly later. We examined whether the mother’s age had any impact on the length of the pregnancy (see plot 5) and concluded that there is a distinct spike in later births (41-42 weeks) for mothers born in the 1970s.
We overlaid a normal distribution onto a plot (see plot 6) and noticed that there is a greater likelihood of a baby being born before 40 weeks than after and a much greater variance in early births than late ones.
#global code ##global adjustments starts 01 rm(list=ls()) #seed for replication set.seed(100) oldwarnval <- getOption("warn") #preserve old value options(warn = -1) #warnings are silent now ##global adjustments ends 01 #####generalized plot function plotFunction <- function(x1,y1,x2,y2,x3 = -1,y3 ,ztitle = "plot" ,zlegendText = c("var1","var2") ,customYlim = c(0,7/100) ,customLegend = c(35.2,37.5,0.065, 0.055) ,customGrid = c(8,7) ){ #X fisrt line plot #Y second line plot #Z thir line plot x = x1;y = y1; plot(x,y ,type= "o" ,main= ztitle ,col = rgb(6/10, 0, 0) #36,137,371 #color of the line chart ,lwd = 2 #line thickness ,pch = 19 # add points, solid circle, ,cex = 0.3 # size of the point on the chart ,tick = F ,xaxs = "i" ,yaxs = "i" ,xaxt = "n" ,ylim = customYlim ,xlab = "" ,ylab = "" ) #customGrid = c(8,7) grid(nx = customGrid[1]#8 ,ny = customGrid[2]#7 , lty=1, col = "cornsilk2") lines(x, y ,type= "o" ,col = rgb(6/10, 0, 0) #36,137,371 #color of the line chart ,lwd = 2 #line thickness ,pch = 19 # add points, solid circle, ,cex = 0.3 # size of the point on the chart ) x = x2; y = y2; lines(x, y ,type= "o" ,col = rgb(149/255, 41/255, 176/255) ,lwd = 2 ,pch = 19 ,cex = 0.3 ) if(x3 != -1){ x = x3; y = y3; zcol <- c((rgb(6/10, 0, 0)), rgb(36/255, 29/255, 219/255), rgb(149/255, 41/255, 176/255)) #3 colors zcol1 <- rgb(36/255, 29/255, 219/255) lines(x, y, type = "o" ,col.lab="blue" ,col = zcol1 ,lwd = 2 ,pch = 19 ,cex = 0.3 ) }else{ x = 0; y = 0; zcol <- c((rgb(6/10, 0, 0)), rgb(149/255, 41/255, 176/255)) # 2 colors } axis(side = 1, at = 35:43 ,labels = paste(c(35:43),"weeks") ,col.axis = rgb(151/255, 145/255, 145/255) ,tick = F ) axis(side = 2 ,col.axis = rgb(151/255, 145/255, 145/255) ,tick = F ) legend( x = c(customLegend[1],customLegend[2] ) ,y = c(customLegend[3],customLegend[4] ) #x=c(35.2,37.5),y=c(0.98, 0.75) ,legend = zlegendText ,fill = zcol ,text.col = rgb(151/255, 145/255, 10/255) ,border = 1 ,box.lty= 0 ,bty = "o" ,y.intersp=1.5 ) abline(v = 40) #return(1) } #read file df <- data.frame(read.csv("~/Downloads/Tony/Project 1/sf_gestationdata.csv",stringsAsFactors = F))
Figure 1: Summary information for each Variable
#elmininate date based on exclude column df <- df[df$exclude == 0,] #slimmer dataFrame df02 <- df[,c("id", "motherbirthyear", "multiples", "pounds", "bornonday" , "duedatebase", "induced", "exclude", "previousbirths")] #head at a glance head(df02)
## id motherbirthyear multiples pounds bornonday duedatebase induced ## 1 6 1974 1 7 277 Other 1 ## 3 8 1979 1 7 277 LMP 0 ## 4 9 1975 1 8 278 LMP 0 ## 5 10 1984 1 10 286 LMP 0 ## 6 11 1984 1 8 284 LMP 1 ## 7 12 1978 1 8 287 ultrasound 0 ## exclude previousbirths ## 1 0 0 ## 3 0 0 ## 4 0 1 ## 5 0 1 ## 6 0 0 ## 7 0 0
summary(df02)
## id motherbirthyear multiples pounds ## Min. : 6 Min. : 0 Min. :1.000 Min. : 0.000 ## 1st Qu.: 3660 1st Qu.: 1979 1st Qu.:1.000 1st Qu.: 6.000 ## Median : 7169 Median : 1983 Median :1.000 Median : 7.000 ## Mean : 7063 Mean : 1907 Mean :1.007 Mean : 6.358 ## 3rd Qu.:10486 3rd Qu.: 1987 3rd Qu.:1.000 3rd Qu.: 8.000 ## Max. :13799 Max. :19955 Max. :3.000 Max. :610.000 ## bornonday duedatebase induced exclude ## Min. :100.0 Length:13255 Min. :0.0000 Min. :0 ## 1st Qu.:272.0 Class :character 1st Qu.:0.0000 1st Qu.:0 ## Median :279.0 Mode :character Median :0.0000 Median :0 ## Mean :275.3 Mean :0.2413 Mean :0 ## 3rd Qu.:286.0 3rd Qu.:0.0000 3rd Qu.:0 ## Max. :333.0 Max. :1.0000 Max. :0 ## previousbirths ## Min. : 0.0000 ## 1st Qu.: 0.0000 ## Median : 0.0000 ## Mean : 0.4104 ## 3rd Qu.: 1.0000 ## Max. :70.0000
PlotOne
Based on separating the data into induced vs. spontaneous births, it is clear that both variables are centered around the same mean. The distribution of both spontaneous and induced births are normally distributed but are both skewed slightly left. As seen in Plot 1, there is significantly more data on spontaneous births relative to induced births. This limitation inhibits our ability to make supportable interpretations. Even with limited induced data, we conclude that whether a birth was induced or spontaneous appears to have no observable impact on the length of the pregnancy.
####Working with the data. #add week column to the df df02$week <- round(df02$bornonday/7,02) #revisit # week generation ####plotOne #add Spontaneous column to the table df02$Spontaneous <- ifelse(df02$induced == 1,0,1) #aggregate table to plot df03induced <- aggregate(induced ~ week,FUN = sum, data = df02[,c("week","induced","Spontaneous")]) df03spontaneous <- aggregate(Spontaneous ~ week,FUN = sum, data = df02[,c("week","induced","Spontaneous")]) #data frames to plot One df03induced <- df03induced[df03induced$week >= 35.00 & df03induced$week <= 43.00,] df03spontaneous <- df03spontaneous[df03spontaneous$week >= 35.00 & df03spontaneous$week <= 43.00,] #normalizing the axis normFactor <- max(df03spontaneous$Spontaneous) + 51 #normalize df03spontaneous$Spontaneous <- df03spontaneous$Spontaneous/normFactor #normFactor <- max(df03induced$induced) #normalize df03induced$induced <- df03induced$induced/normFactor #custom parameters customYlim = c(0,1) customLegend = c(35.2,37.5,0.98, 0.75) customGrid = c(10,7) #plot One plotFunction(x1 = df03spontaneous$week, y1 = df03spontaneous$Spontaneous, x2 = df03induced$week , y2= df03induced$induced, ztitle = "PlotOne : Spontaneous vs induced", zlegendText = c("Spontaneous","induced"), customYlim = customYlim, customLegend = customLegend, customGrid = customGrid )
PlotTwo
This chart compares the length of pregnancy for mothers undergoing their first birth compared to mothers who have had previous births. The distribution are both normally distributed but are slightly left skewed. There does not appear to be any statistically significant differences in the length of a mother’s pregnancy. Therefore, we would argue that conventional medical wisdom arguing that shorter first pregnancies result in shorter second pregnancies is not supported by this data.
#### plotTwo # First births vs births 2+ # head(df02) #new dfs for plots # weeks in [35,43] df04fb <- df02[df02$previousbirths == 0 & df02$week >=35.00 & df02$week <=43.00,] #"first Birth df" df04pb <- df02[df02$previousbirths == 1 & df02$week >=35.00 & df02$week <=43.00,] #"Previous Birth df" #aggregate tables df04fbA <- aggregate(previousbirths ~ week,FUN= length , df04fb[,c("week","previousbirths")] ) df04pbA <- aggregate(previousbirths ~ week,FUN= length , df04pb[,c("week","previousbirths")] ) #count to be normalized count <- sum(df04fbA$previousbirths) df04fbA$previousbirths <- df04fbA$previousbirths/count #normalization with count count <- sum(df04pbA$previousbirths) df04pbA$previousbirths <- df04pbA$previousbirths/count #normalization with count #plotTwo plotFunction(x1 = df04fbA$week, y1 = df04fbA$previousbirths, x2 = df04pbA$week, y2= df04pbA$previousbirths, ztitle = "PlotTwo : First births vs births 2+.", zlegendText = c("First births","Previous births") )
PlotThree
This chart compares the length of the pregnancy for the three different detection types. The distributions for all three are normally distributed with a slight left skew. The data indicates that the different detection types have no statistically significant impact on the length of a mother’s pregnancy.
####Plot 3 #Detection type:Ultrasound vs Ovulation vs LMP #head(df02) unique(df02$duedatebase)
## [1] "Other" "LMP" "ultrasound" "NULL" "ovulation"
df05 <- df02[df02$duedatebase=="LMP" #df for ploting | df02$duedatebase=="ultrasound" | df02$duedatebase=="ovulation",] #dummy columns df05$LMP <- ifelse(df05$duedatebase == "LMP",1,0) #LMP column df05$ultrasound <- ifelse(df05$duedatebase == "ultrasound",1,0) #ultrasound column df05$ovulation <- ifelse(df05$duedatebase == "ovulation",1,0) # Ovulation #aggregate tables df05Lmp <- aggregate(LMP ~ week,FUN= sum , data = df05[,c("week","LMP","ultrasound","ovulation")]) df05ultrasound <- aggregate(ultrasound ~ week,FUN= sum , data = df05[,c("week","LMP","ultrasound","ovulation")]) df05ovulation <- aggregate(ovulation ~ week,FUN= sum , data = df05[,c("week","LMP","ultrasound","ovulation")]) #remove weeks !in [35.00,43.00] df05Lmp <- df05Lmp[df05Lmp$week >= 35.00 & df05Lmp$week <= 43.00,] df05ultrasound <- df05ultrasound[df05ultrasound$week >= 35.00 & df05ultrasound$week <= 43.00,] df05ovulation <- df05ovulation[df05ovulation$week >= 35.00 & df05ovulation$week <= 43.00,] #counter to normalize the table count = sum(df05Lmp$LMP) df05Lmp$LMP <- df05Lmp$LMP/count #nomalize count = sum(df05ultrasound$ultrasound) df05ultrasound$ultrasound <- df05ultrasound$ultrasound/count count = sum(df05ovulation$ovulation) df05ovulation$ovulation <- df05ovulation$ovulation/count #plotThree plotFunction(x1 = df05Lmp$week, y1 = df05Lmp$LMP, x2 = df05ovulation$week , y2= df05ovulation$ovulation, x3 = df05ultrasound$week, y3= df05ultrasound$ultrasound, ztitle = "PlotThree : Detection type: ultrasound, ovulation, LMP.", zlegendText = c("LMP","Ovulation", "Ultrasound") )
PlotFour
This chart compares the length of pregnancy for babies who were born weighing less than 7 pounds and babies who were born weighing more than or equal to 7 pounds. The means are the same and the data is relatively normally distributed. The distributions for babies weighing under 7 pounds and more than or equal to 7 pounds are skewed left. To us, this chart appears to indicate that that babies born weighing less than 7 pounds tend to be born slightly earlier while babies weighing more than or equal to 7 pounds tend to be born slightly later.
#####plotFour #Baby weight >= 7lbs versus Baby weight < 7lbs #head(df02,20) df02$poundsGT7 <- ifelse(df02$pounds >=7, 1, 0) #dummy column df06b7 <- df02[df02$poundsGT7 == 0,c("week","poundsGT7")] #below 7 df06a7 <- df02[df02$poundsGT7 == 1,c("week","poundsGT7")] #above 7 #week within 35 to 43 df06b7 <- df06b7[df06b7$week >= 35.00 & df06b7$week <=43.00,] df06a7 <- df06a7[df06a7$week >= 35.00 & df06a7$week <=43.00,] #aggregate tables for ploting df06b7tab <- aggregate(poundsGT7 ~ week ,FUN = length, data = df06b7) # below 7 df06a7tab <- aggregate(poundsGT7 ~ week ,FUN = length, data = df06a7) # above 7 #count for normalize count <- sum(df06b7tab$poundsGT7) df06b7tab$poundsGT7 <- df06b7tab$poundsGT7/count count <- sum(df06a7tab$poundsGT7) df06a7tab$poundsGT7 <- df06a7tab$poundsGT7/count #PlotFour : Baby weight >= 7lbs versus Baby weight < 7lbs plotFunction(x1 = df06a7tab$week , y1= df06a7tab$poundsGT7, x2 = df06a7tab$week, y2 = df06b7tab$poundsGT7, ztitle = "PlotFour : Baby weight >= 7lbs versus Baby weight < 7lbs", zlegendText = c("Baby weight >= 7lbs","Baby weight < 7lbs") )
PlotFive
Comparing mothers born in the 1970s compared to those born in the 1980s or later, both groups still center around the 40 week mean, however, there is a distinct spike in later births (41-42 weeks) for mothers born in the 1970s. One speculation is that this spike may be attributed to induced births around 41-42 weeks as shown in the original graph.
####PlotFive #head(df02,20) df70s <-df02[df02$motherbirthyear > 1970 & df02$motherbirthyear <=1979,c("week","motherbirthyear")] df80s <-df02[df02$motherbirthyear > 1980 & df02$motherbirthyear < 2019,c("week","motherbirthyear")] df70s <-df70s[df70s$week >= 35.00 & df70s$week <= 43.00,] df80s <-df80s[df80s$week >= 35.00 & df80s$week <= 43.00,] #aggregate tables df70stab <- aggregate(motherbirthyear ~ week,FUN = length,df70s) df80stab <- aggregate(motherbirthyear ~ week,FUN = length,df80s) #normalization counter <- sum(df70stab$motherbirthyear)+ 10 df70stab$motherbirthyear <- df70stab$motherbirthyear/counter counter <- sum(df80stab$motherbirthyear)+ 10 df80stab$motherbirthyear <- df80stab$motherbirthyear/counter #PlotFive: Mothers whose birth year is in the 1970s, versus mothers whose birthyear is 1980s or later. ztitle <- "PlotFive: Mothers whose birth year is in the 1970s, versus mothers whose birthyear is 1980s or later." plotFunction(x1 = df80stab$week , y1= df80stab$motherbirthyear, x2 = df70stab$week, y2 = df70stab$motherbirthyear, ztitle= ztitle, zlegendText = c("birth year is in the 1980s","birth year is in the 1970s") )
NextPart(PlotSix)
Calculate the sample mean and standard deviation of the born-on date, and use the dnorm() command to overlay a normal distribution plot on a plot like you generate above, excluding observations excluded from the charts (help(dnorm) if you haven’t used this before). Does the normal distribution provide a useful description of the likelihood a baby is born before the due date? Does the normal distribution provide a useful description of the likelihood a baby is born after the due date?
Yes, this normal distribution gives a clear view of the likelihood a baby is born before the due date and after the due date. Overall, this distribution is centered on a mean just under 40 weeks with a slight skew to the left. This indicates that there is a greater likelihood of a baby being born before 40 weeks than after and a much greater variance in early births than late ones.
####Que 3 #Calculate the sample mean and standard deviation of the born-on date, and use the dnorm() #command to overlay a normal distribution plot on a plot like you generate above, #excluding observations excluded from the charts. Does the normal distribution #provide a useful description of the likelihood a baby is born before the due date? #Does the normal distribution provide a useful description of the likelihood a baby #is born after the due date? #mean & Sd mean1 <- mean(df02[df02$week >= 35.00 & df02$week < 43.00,"week"]) #39.77773 sd1 <- sd(df02[df02$week >= 35.00 & df02$week < 43.00,"week"]) #1.417654 #grid x1 = 3500:43/7 #grid, increment in 7th # from Website, group discussion mean1 = mean(df05$week) sd1 = 9.57/7 d <- dnorm(x1, mean = mean1, sd = sd1) d <- d/sum(d) #plotSix plotFunction(x1 = df05Lmp$week, y1 = df05Lmp$LMP, x2 = df05ovulation$week , y2= df05ovulation$ovulation, x3 = df05ultrasound$week, y3= df05ultrasound$ultrasound, ztitle = "PlotSix : Born on date", zlegendText = c("LMP","Ovulation", "Ultrasound") ) lines(x = x1, y = d ,type= "o" ,col = "black" #black #color of the line chart ,lty = "dashed" #line thickness # ,pch = 0 # add points, solid circle, ,cex = 0.3 # size of the point on the chart )
###global adjustments start 2 options(warn = oldwarnval) #reset the warning message ###global adjustments end 2