Web Scraping: The Sequel | Propwall.my

Alright. Time to take another shot at web scraping. My attempt at scraping data off iBilik.my left me a little frustrated because of how long it took, and because I couldn’t get much information thanks to all the duplicated posts.

I think Propwall.my would be a much better choice considering how their listings are arranged. Propwall is:

…Malaysia’s most ADVANCED property search website that provides marketing and research solutions to property agents, developers, and investors.

The emphasis on “advanced” was not mine. They just seem to be pretty sure of themselves.

I figured that since I have some idea of how to do this now, it would be a lot easier to organize my code more neatly. Well…for me, at least. Plus, I could insert more comments this time.

I picked Mont Kiara as my location and started off the same way, by loading the stringr, rvest, ggvis, dplyr, and ggplot2 packages.

#load libraries
library(stringr)
library(rvest)
library(ggvis)
library(dplyr)
library(ggplot2)

#The first half of the URL..
site_first = "http://www.propwall.my/mont_kiara/classifieds?page="
#...and this is the second half of the URL
site_second = "&tab=Most%20Relevance&keywords=Mont%20Kiara%2C%20Kuala%20Lumpur&filter_id=17&filter_type=Location&listing=For%20Rent&view=list"

#concatenate them together, with the coerced digit in between them. This digit is the page number
siteCom = paste(site_first, as.character(1), site_second, sep = "")
siteLocHTML = html(siteCom)

You might be wondering why I keep loading ggvis if I won’t be using its charts. Embarrassingly enough, it’s because I keep forgetting which package allows me to use the piping operator “%>%”, but I always remember that I can use that operator if ggvis is loaded.
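
For what it’s worth, the pipe actually comes from the magrittr package and is re-exported by dplyr (and by ggvis), so loading either of those should be enough; a quick sanity check:

#the pipe is re-exported by dplyr, so ggvis isn't strictly needed for "%>%"
library(dplyr)
c(1, 4, 9) %>% sqrt()
#returns 1 2 3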

Similar to what I did in the iBilik.my post, I extracted the information from the first search page and then rbind-ed the remaining pages, like so:

#Extract the descriptions of the first page...
siteLocHTML %>% html_nodes("h4.media-heading a") %>% 
  html_text() %>% data.frame() -> x

#...and also the links to these postings 
siteLocHTML %>%  html_nodes("#list-content") %>% 
  html_nodes(".media") %>% 
  html_nodes(".media-heading") %>% 
  html_nodes("a") %>%
  html_attr("href") %>% 
  data.frame() -> y

#Since we already have the extractions for the first page, we can now...
#...loop through pages 2 to 250 and rbind them with the page 1 extracts
for(i in 2:250){

  siteCom = paste(site_first, as.character(i), site_second, sep = "")
  siteLocHTML = html(siteCom)
  
siteLocHTML %>% html_nodes("h4.media-heading a") %>% 
  html_text() %>% data.frame() -> x_next

siteLocHTML %>%  html_nodes("#list-content") %>% 
  html_nodes(".media") %>% 
  html_nodes(".media-heading") %>% 
  html_nodes("a") %>%
  html_attr("href") %>% 
  data.frame() -> y_next

x = rbind(x, x_next)
y = rbind(y, y_next)

}

#column bind the description and links
complete = cbind(x,y)
complete[,2] = as.character(complete[,2])

names(complete) = c("Description", "Link")

#file backup
write.csv(complete, "complete_propwall.csv", row.names = FALSE)

#And remove the remaining dataframes from the environment
rm(x_next, y_next, x, y)

Up to this point, the code has sifted through 250 search pages with 20 listings each, which adds up to 5000 listings. However, I only have the links to the postings, and not information such as the rental, layout, etc. To get that information, I would need to navigate to each link and pull the rental, layout, and date…for 5000 posts.

I had no way of determining whether a post was duplicated, so I haven’t removed any of the 5000 posts. The only observations that were removed were based on the unavailability of certain information (no rental given) or the ridiculousness of a rental rate (RM 1 million rent! Probably meant for the sale section and not the rental section).

Anywho, the next step was to get the post information, and I used the code below:

#initialise the vectors that will hold the scraped details
price = c()
dates = c()
furnish = c()
layout = c()

for(i in 1:nrow(complete)){
siteLocHTML = html(complete[i,"Link"])

siteLocHTML %>%  html_nodes(".clearfix") %>%
  html_nodes("#price") %>% html_text() %>% c() -> y

#Rentals come out in quadruples, but i need only one
y = unique(y)
price = c(price, y)


siteLocHTML %>% html_nodes("#content") %>% 
  html_nodes("p") %>% html_text() -> z

dates = c(dates, z)

#the details table: the 4th cell holds the furnishing and the 8th the layout
siteLocHTML %>%  html_nodes(".clearfix") %>%
  html_nodes("td") %>% html_text() %>% c() -> a
b = a[8]
a = a[4]

furnish = c(furnish, a)
layout = c(layout, b)

}


#take only the first 4922 rows
final = complete[1:4922,] 

#add the new columns
final$price = price
final$furnish = furnish[1:4922]
final$layout = layout[1:4922]
final$posted = dates[1:4922]

#remove NAs
final_fil = na.omit(final)

#remove the "-NA-" placeholders used on the website
final_fil = final_fil[!final_fil[,4] == "-NA-",]
final_fil = final_fil[!final_fil[,5] == "-NA-",]

#remove rownames
rownames(final_fil) = NULL

final_fil$furnish = as.factor(final_fil$furnish)
final_fil$layout = as.factor(final_fil$layout)

rm(y,z,a,b,price, furnish, layout, dates)

I should mention that I ran into some issues finishing the data extraction up to this point. The main issue was that after 4922 posts, I got an error. The error had more to do with navigating to so many pages than with the code itself. With that in mind, I had to limit my sample to 4922 observations.

Next up, a little string manipulation to get the rental rates into a format better suited to calculations. I also need to extract just the dates from the posting information. I changed the original formatting for rentals and dates…

> "RM 5,200 (RM 3 psf)"
> "Posted by Champion YEE 016-6012080 on 02/11/2015"

…into this…

> 5200
> 02/11/2015

and the code I used was the following script:

#Extracting dates and rentals
x = c()
for(i in 1:nrow(final_fil)){

  x[i] = substring(final_fil[i, "posted"], nchar(final_fil[i, "posted"])-9, nchar(final_fil[i, "posted"]))

}

rm(complete, final, i)

final_fil$date = x

x = c()

for(i in 1:nrow(final_fil)){

  #position of the opening bracket in, e.g., "RM 5,200 (RM 3 psf)"
  y = gregexpr("\\(", final_fil[i,"price"])[[1]][1]

  #keep only the digits between "RM " and the bracket
  x[i] = substring(final_fil[i,"price"], 4, y-2)

}
x = as.integer(str_replace_all(x, ",", ""))

final_fil$rental = x 
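
As an aside, the same extraction could probably be done without the loops using stringr’s vectorized functions. A rough sketch, assuming the price and posted columns built above (I haven’t run this against the actual scrape):

#extract the first "RM x,xxx" amount and the trailing date in one pass
rental_chr = str_match(final_fil$price, "RM ([0-9,]+)")[, 2]
final_fil$rental = as.integer(str_replace_all(rental_chr, ",", ""))
final_fil$date = str_sub(final_fil$posted, -10)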

And last but not least, the name of the residency. The names come through written as, for instance, “Vista Kiara, Mont Kiara”. What I didn’t want was the “, Mont Kiara” bit, so I added another column, “residency”, that holds only the names:

#Extract only the name of the residency
final_fil$residency = str_replace_all(final_fil[,1], ", Mont Kiara", "")

Anyone who uses R will tell you that dplyr is awesome. It’s just….just sooo nice. Although I’m still trying to remember everything it can do, what follows is perhaps a very good example of dplyr’s awesomeness.

Since I now have the residency name, the rentals, and the layouts, I can produce a single dataframe that shows the average rental for each layout, for each residency…and all that with one line of dplyr magic:

Mkiara_data = data.frame(summarise(group_by(final_fil, residency, layout), Average_Rental = mean(rental)))
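
The same thing can also be written with the pipe, which reads a little more naturally (this is just an equivalent restating of the line above):

Mkiara_data = final_fil %>%
  group_by(residency, layout) %>%
  summarise(Average_Rental = mean(rental)) %>%
  data.frame()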

And so if we take a look at the last 6 rows of this data frame, you’ll see:

tail(Mkiara_data)

residency layout Average_Rental
162 i-Zen Kiara I Studio 9000.000
163 i-Zen Kiara II 1-Bedroom 6165.714
164 i-Zen Kiara II 2-Bedroom 3600.000
165 i-Zen Kiara II 3-Bedroom 5450.000
166 i-Zen Kiara II 4-Bedroom 5943.750
167 i-Zen Kiara II 5-Bedroom 6668.750

Hadley Wickham should win an award just for developing that.

Now it’s time to take a look at the data through some graphs. Using the final_fil dataframe, I can build a table that counts the number of times each residency name appears, and use that to plot a graph.

#Number of posts per residency
x = data.frame(table(final_fil$residency)) 
names(x) = c("Residency", "Number_of_Posts")  

ggplot(x, aes(x=reorder(Residency,Number_of_Posts), y=Number_of_Posts)) + 
  geom_bar(stat="identity", color="white") + 
  xlab("Residency") + ylab("Number of posts") + coord_flip() + 
  ggtitle("Number of posts per residency") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Red"))

The resulting graph will look like this:

Mont Kiara - Number of posts per residency

(You can click on the image to get a better look at the chart)

I would like to think that the higher the number of posts, the less popular the residency is. Obviously, there are arguments against that, but going by that reasoning we can see that there are very few posts for places such as The Residence and Aman Kiara, while places such as Vista Kiara and Ceriaan Kiara have plenty of rentals available. I’m no property guy, but I’m guessing there are plenty more factors at play here, such as when the property was launched, the type of residency (condo, house, etc.), the neighborhood (metro, suburban), and so on and so forth. I’ll let you figure that one out.

Perhaps we need some more information, such as what proportion of these listings falls under each level of furnishing.

#Create table for residency and furnishing type
z = data.frame(table(final_fil[,c("residency", "furnish")]))
z = z[z[,"Freq"] != 0,]
names(z) = c("residency", "furnishing", "Posts")
rownames(z) = NULL

#Plot
ggplot(z, aes(x=reorder(residency,Posts), y=Posts, fill=furnishing)) + 
  geom_bar(stat="identity") + coord_flip() + 
  scale_fill_brewer(palette="Dark2") + 
  xlab("Residency") + ggtitle("Number of posts per residency, with furnishing type") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black"))

Listings by furnishing type

I think it’s safe to say that very few, if any, of the listings have been labeled as unfurnished. I think that’s mainly because many of the owners/agents who post on these sorts of sites classify a unit as semi-furnished if it only has things such as a fan, an air-conditioner, and a kitchen cabinet; but, ironically enough, no actual furniture. From my experience, what is considered unfurnished in Malaysia is, quite literally, an empty unit…with wires still hanging off the ceiling and walls.

I thought it would also be interesting to see how the layouts are distributed in all of these listings.

#Create table for residency and layout columns
y = data.frame(table(final_fil[,c("residency", "layout")]))
y = y[y[,"Freq"] != 0,]
names(y) = c("residency", "layout", "Posts")
rownames(y) = NULL

#Plot
ggplot(y, aes(x=residency, y=Posts, fill=layout)) + 
  geom_bar(stat="identity") + coord_flip() + 
  scale_fill_brewer(palette="Dark2") + 
  xlab("Residency") + ggtitle("Number of posts per residency, with layouts") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black"))

And that gives us…

Number of posts, including layouts

I understand that it would’ve been better if the bars were in descending order, but I just couldn’t figure out how to do that. The usual reordering inside aes() did not work, and I’m still trying to understand why. So for now, I figured ggplot2’s usual alphabetical order will suffice.
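
One approach that might work, if I revisit this, is to reorder the factor levels themselves by the total number of posts before plotting, rather than doing it inside aes(). A sketch, using the y table from above (untested on this data):

#order residencies by their total post count across all layouts
y$residency = reorder(y$residency, y$Posts, FUN = sum)

ggplot(y, aes(x=residency, y=Posts, fill=layout)) + 
  geom_bar(stat="identity") + coord_flip()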

As you can clearly see, a big portion of all posts is categorized under 3-bedrooms, most notably for Vista Kiara which, if you recall, had the highest number of rental listings in our sample.

And this finally brings us to the numbers that matter: the rental rates. All numbers here are shown as rent per month. The following code determines the average rental per layout type in Mont Kiara. After inspecting the dataframes, it looks like there are plenty of outliers in each category, so I think a boxplot would be best. Plus, it will be great to see how much it would cost, on average, to rent a unit of a particular room layout in Mont Kiara.

#y-axis breaks at every RM 1000, up to RM 30000
Breaks = (1:30) * 1000

ggplot(final_fil, aes(x=layout, y=rental)) + geom_boxplot() + 
  stat_summary(fun.y = "mean", geom="point", shape = 22, size = 3, fill = "red") + 
  xlab("Layout") + ggtitle("Distribution of rentals per layout") + ylab("Rentals") + 
  theme(plot.title=element_text(size=16, face = "bold", color = "Black")) + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black")) +
  scale_y_continuous(breaks=Breaks)

Boxplot, by unit layout

There seem to be a whole bunch of outliers in the 1-bedroom category, which explains why the mean (the red dot) is not so close to the median (the horizontal line in the box). The median would suggest that the typical rental for a 1-bedroom sits somewhere around RM 3500. That conclusion seems to be echoed by the Studio category, since many people consider those two groups one and the same.

It’s also worth noting that there doesn’t seem to be much difference between the 1-bedroom and the 2-bedroom categories, while the 8-bedroom category seems to come from a single data point. I’m not exactly certain what the difference between a 4-bedroom and a 4.1-bedroom is, but if the means are anything to go by, it’ll cost you about another RM 1000 in rent. The 3-bedroom, the category that seems most readily available as we’ve seen in the previous charts, goes from 4500 to 5000 ringgit. I would think that’s still pretty steep, even if most of the listings are fully furnished.

I was hoping to end this post with a chart that shows the average rental for each residency and each layout, but I’m worried it would end up as a lot of “visual noise”. Besides, I still need to figure out how to arrange the bars in ggplot2 when I’m using 3 variables, of which 2 are categorical. The best that I could come up with is using the facet_grid() function in ggplot2, roughly sketched below, and so here they are, two at a time:
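
The sketch, using the Mkiara_data frame from earlier; the 1-Bedroom and Studio pairing is just for illustration, and I haven’t verified the exact aesthetics used for the actual charts:

#average rental per residency, two layouts at a time
Mkiara_sub = Mkiara_data[Mkiara_data$layout %in% c("1-Bedroom", "Studio"), ]

ggplot(Mkiara_sub, aes(x=residency, y=Average_Rental)) + 
  geom_bar(stat="identity") + coord_flip() + 
  facet_grid(. ~ layout) + 
  xlab("Residency") + ylab("Average Rental")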

1-Bedroom and Studio

2-Bedroom and 3-Bedroom

4-Bedroom and 5-Bedroom

6-Bedroom

I haven’t plotted the charts for 4.1-bedrooms and 6+ bedrooms because they come from very few observations, so there really was no need to plot them.

As always, below is all the data I worked with.

Propwall_link, Propwall_link_details_edited, Propwall_averages.


An Attempt at Web Scraping: Cyberjaya Rental Rates

I came across this short tutorial on how to use the rvest package to scrape information from websites. It looked pretty straightforward, although it took a while to get the hang of some of the HTML jargon. So I figured I should take a shot at this scraping business myself.

I picked local site iBilik.my as a good example. From the website, iBilik is:

…Malaysia’s largest and No.1 Room / Homestay / Short Term Rental website, with over 100,000 listings posted online all across Malaysia, Kuala Lumpur, Ampang, Bangsar, Cheras, Setapak, Damansara, Petaling Jaya, Subang Jaya, and Penang.

The plan was to somehow retrieve all the rental rates for postings that were put up in Cyberjaya. I noticed that there are 25 postings shown on each search page, so having R go through 250 search pages should give me 6250 postings.

Now, in my naiveté, I thought that this whole scraping of raw data would take less than a minute. Add to that the data cleaning, maybe another 20 minutes.

It took me 3 days.

This was because of a combination of work and…well, my computer crashing (6 times) because of all the memory it had to use; and that was with no other apps open.

I also think it would have taken less time if I had figured out how to get the rental rates from the search pages themselves, rather than having R navigate to each posting page and only then grab the prices.
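
If the prices do appear somewhere in the search-results table, something along these lines inside the page loop would cut out the per-post navigation entirely. The ".price" selector here is purely a guess (I’d need SelectorGadget to confirm the real one), so treat this as illustrative only:

#hypothetical: pull prices straight off a search page, no per-post navigation
siteLocHTML %>% html_nodes("table.room_list") %>%
  html_nodes(".price") %>%      #selector not verified
  html_text() %>%
  data.frame() -> prices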

Anyway, since I’m still trying to remember all the stuff I forgot in R, I decided to loop through the pages and postings. The plan was to first extract the information on the first search page, then extract the information from pages 2 to 250 one by one and rbind them to the extraction from page 1. I originally intended to extract from 4000 pages, but my computer kept dying. Then it went down to 2000, then 1000, 500, and eventually 250.

The information extracted is the posting description put up by the user, the location (in this case, always Cyberjaya), and the link to the page that shows the post’s room details (rental, size, room type, etc).

After that, all of that information is cbind-ed, and the resulting dataframe is used to loop through all the post detail links and extract the price, along with the date it was posted.

Just so you know, having the code finish up to this point took FOREVAAAAAAR! I’m talking HOURS.

This brings us to the issue of duplicated posts on iBilik. The site has no limit on how many times a user can post the same thing over and over again, so duplicates had to be removed. The criterion I used to identify duplicates is the description of the post: if the same description appears more than once, that’s as good a criterion as any…especially for this site.

What pissed me off was that, of the 6250 posts…only 1059 were unique. Over 80% of all posts were duplicated! Then again, maybe I shouldn’t be so surprised, considering there were 6250 postings but the dates only ranged from the 4th of October to the 10th of November…in Cyberjaya alone. In any case, I looked up all postings that mention the words “middle”, “master”, or “small”, and then plotted the histograms.

It was interesting to see that quite a number of the master room postings fell in the range of RM 450 to 500 per month, while the middle rooms ranged mostly from RM 450 to 750. This gap could perhaps be explained by the fact that most master room posts are for sharing between two or more students, and also by the fact that middle rooms are more available here in Cyberjaya. I would like to think that it’s also because the small rooms are always snapped up by students, so you won’t usually find a lot of postings for small rooms, and the master rooms are picked up by people who work in one of the many companies here.

Below is the R script I used. You can also find a copy of the complete file with the links, rentals, and dates of all the postings, along with links to the (poorly constructed) histograms of each room category.

library(stringr)
library(rvest)
library(ggvis)

#base URL, and the search-page URL that the page number gets appended to
site = "http://www.ibilik.my"

siteLoc = "http://www.ibilik.my/rooms/cyberjaya?page="

#load the first search page
siteLocN = paste(siteLoc, as.character(1), sep = "")
siteLocHTML = html(siteLocN)

#locations of the listings on page 1
siteLocHTML %>% html_nodes("table.room_list") %>% 
  html_nodes(".title") %>% 
  html_nodes(".location") %>% 
  html_text() %>% 
  data.frame() -> x

#titles of the listings
siteLocHTML %>% html_nodes("table.room_list") %>% 
  html_nodes(".title") %>%
  html_nodes("a") %>%
  html_text() %>% 
  data.frame() -> y

#links to the individual listing pages
siteLocHTML %>% html_nodes("table.room_list") %>%
  html_nodes("a") %>% 
  html_attr("href") %>%
  data.frame() -> z


#pages 2 to 250: extract the same three pieces and rbind them to the page 1 results
for(i in 2:250){
siteLocN = paste(siteLoc, as.character(i), sep = "")
siteLocHTML = html(siteLocN)

siteLocHTML %>% html_nodes("table.room_list") %>% 
  html_nodes(".title") %>% 
  html_nodes(".location") %>% 
  html_text() %>% 
  data.frame() -> x_Next

x = rbind(x, x_Next)

siteLocHTML %>% html_nodes("table.room_list") %>% 
  html_nodes(".title") %>%
  html_nodes("a") %>%
  html_text() %>% 
  data.frame() -> y_Next

y = rbind(y, y_Next)

siteLocHTML %>% html_nodes("table.room_list") %>%
  html_nodes("a") %>% 
  html_attr("href") %>%
  data.frame() -> z_Next

z = rbind(z, z_Next)

} 


#combine titles, locations, and links into one dataframe, then back it up to file
complete = cbind(y,x,z)
names(complete) = c("Title", "Location", "Link")

write.csv(complete, "complete.csv", row.names = FALSE)

rm(x_Next, y_Next, z_Next, x, y)


#Prices and dates

dummy = c()
dummy_date = c()


#visit each listing page and grab the price and the posting date
for(i in 1:nrow(z)){
link = paste(site, z[i,1], sep = "")

dummy[i] = html(link) %>% 
  html_nodes(".extras_wrapper p:nth-child(2)") %>% 
  html_text()

#grab the "posted on" stamp; fall back to NA if the node is missing
fullString = html(link) %>% html_nodes(".stamp") %>% html_text()

if(length(fullString) > 0){
  dummy_date[i] = substring(fullString, nchar(fullString)-10, nchar(fullString))
} else {
  dummy_date[i] = NA
}

}


complete$price  = dummy
complete$date = dummy_date

rm(dummy, dummy_date)

#Clean up and plots

#remove duplicated posts, using the description as the criterion
filter = duplicated(complete[,1])
complete_fil = complete[!filter,]

complete_fil[,"price"] = gsub("RM ", "", complete_fil[,"price"]) 
complete_fil[,"price"] = gsub(",", "", complete_fil[,"price"])

complete_fil[,"price"] = as.integer(complete_fil[,"price"])

rownames(complete_fil) = NULL

#parse the scraped date strings into proper Date values
new_dates = as.character(strptime(complete_fil[,"date"], format = "%d-%b %Y"))

complete_fil[,"date"] = as.Date(new_dates)

complete_fil = na.omit(complete_fil)

#subset the middle, master, and small room postings by keyword
Middle = complete_fil[grep(c("Middle"), complete_fil[,1]),]
Middle = rbind(Middle, complete_fil[grep(c("middle"), complete_fil[,1]),])

Middle = Middle[-grep("Master", Middle[,1]),]
Middle = Middle[-grep("Small", Middle[,1]),]
Middle = Middle[-grep("master", Middle[,1]),]
rownames(Middle) = NULL
mean(Middle[,"price"])


Master = complete_fil[grep(c("Master"), complete_fil[,1]),]
Master = rbind(Master, complete_fil[grep(c("master"), complete_fil[,1]),])

Master = Master[-grep("Middle", Master[,1]),]
Master = Master[-grep("middle", Master[,1]),]
rownames(Master) = NULL
mean(Master[,"price"])


Small = complete_fil[grep(c("Small"), complete_fil[,1]),]
Small = rbind(Small, complete_fil[grep(c("small"), complete_fil[,1]),])

Small = Small[-grep("Middle", Small[,1]),]
rownames(Small) = NULL
mean(Small[,"price"])

summary(Master[,"price"])
summary(Middle[,"price"])
summary(Small[,"price"])


#plot the three histograms side by side
par(mfrow = c(1,3))

hist(Master[,"price"], breaks = 15, main = "Rental Distribution of Master Rooms", 
     xlab = "Rental per Month", col = "grey")

hist(Middle[,"price"], breaks = 20, main = "Rental Distribution of Middle Rooms", 
     xlab = "Rental per Month", col = "grey")

hist(Small[,"price"], 15, main = "Rental Distribution of Small Rooms", 
     xlab = "Rental per Month", col = "grey") 

scraped_data, histograms


Eroding Commitment

There’s an old saying that goes: “If you dream and want something hard enough, but have no commitment…then you’re probably full of shit”. After finishing the Getting and Cleaning Data course on Coursera, I didn’t type a single line of R code for a long time; and since I was only just starting out on the basics, my foundations in the language are a little shaky again. So I had to start all over again.

Below is a function that loops through an entire dataframe and finds all the column names that contain a given value.

searchCol = function(name, dataframe){
  x = 0 
  newList = c() #This will be the vector that gets returned
  
  #Start loop for all the values to be searched for
  
  for(i in 1:length(name)){
    #Start loop for each row in the dataframe...
    for(j in 1:nrow(dataframe)){
      
      #...same thing for the columns
      for(k in 1:ncol(dataframe)){
        
        #If the matching criterion is met, add the column name to the newList vector
        if(dataframe[j,k] == name[i]){
          
          newList[x+1] = names(dataframe)[k]
          x = x+1
        }       
      }
    }
  }  
  return(newList)
}

Below is a simple demonstration:

mat = data.frame(matrix(1:50, 5, 10))
mat

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 6 11 16 21 26 31 36 41 46
2 2 7 12 17 22 27 32 37 42 47
3 3 8 13 18 23 28 33 38 43 48
4 4 9 14 19 24 29 34 39 44 49
5 5 10 15 20 25 30 35 40 45 50

So to find which columns the figures 28 and 39 fall in…

tst = c(28, 39)
searchCol(tst, mat)

…we would get:

[1] "X6" "X8"
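
For what it’s worth, a more compact version is possible with sapply, although the columns come back in column order rather than in the order of the searched values; a quick sketch:

#return the names of the columns that contain any of the searched values
searchCol2 = function(name, dataframe){
  names(dataframe)[sapply(dataframe, function(col) any(col %in% name))]
}

searchCol2(tst, mat)
#[1] "X6" "X8"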

God, I hate having to start from scratch again.


Summarizing/Exploring Global Wealth Data

Alright. Time to take a crack at another data set and see if I can hone some of those data cleaning skills I’ve just learned on Coursera.

The data sets I’m using are from the World Bank’s Changing Wealth of Nations report, compiled for 1995, 2000, and 2005. The data were all combined in one Excel workbook, so I saved the three years separately in different files and then loaded them into R under the variables data95, data20, and data05 for 1995, 2000, and 2005, respectively.

Given that the data is not exactly arranged in a tidy manner, some cleaning had to be done. I experimented with combining the data frames in a list so that I could loop over all of them at once. I’ve been told that the plyr package provides that functionality, but I’m not used to it yet; besides, I’m still going through the online tutorials.

Below is the code I used to combine the data sets and clean them by removing the variables/columns that are not needed.

library(plyr)
library(dplyr)
library(stringr)

#adding the year column to each data frame
data95$year = "1995"
data20$year = "2000"
data05$year = "2005"

#combine all datasets in a list
dflist = list(data95, data20, data05)

for(i in 1:length(dflist)){
  
  #to remove all rows where the second column shows no data
  for(j in nrow(dflist[[i]]):1){
    if(nchar(dflist[[i]][j,2]) == 0){dflist[[i]] = dflist[[i]][-j,]}
  }
  
  #remove all columns where all values are NA
  #(naAll.omit is my own helper, much like AllNaCol in the "R Functions: Removing NA Columns" post)
  dflist[[i]] = naAll.omit(dflist[[i]])
  #remove all rows where there is an NA value
  dflist[[i]] = na.omit(dflist[[i]])
  
  #remove all commas from all columns
  for(k in ncol(dflist[[i]]):1){
    dflist[[i]][,k] = str_replace_all(dflist[[i]][,k], ",", "")
  }
  
}

#convert from a list to a dataframe
final = ldply(dflist) 

The psych R package has the helpful describeBy() function, which runs summary statistics for every unique value of a particular variable in a given dataframe. And although you can subset the output to extract any particular group’s summary stats…I didn’t find it intuitive enough for manipulation. So I thought making a custom function would be more helpful…at least for me.

Below is the function that would return a data frame of all the groups for which the summary stats are to be run. The parameters are the dataframe and the name of the column, and the resulting data frame was assigned to the variable sumFinal.

library(psych) #for the describe() function used below

deStats = function(data, column){
  
  #index number of the column name
  gIndex = which(colnames(data) == column)
  
  #assign unique column names to a variable
  groups = unique(data[,gIndex])
  
  #create an empty list, to be filled later
  clist = list()
  
  #a loop to create list of data frames containing..
  #..summary data frames
  for(i in 1:length(groups)){
    x = which(colnames(data) == column)
    y = groups[i]
    z = subset(data, data[,x] == y)
     
   a = describe(z)
    
   #add columns that include the group name and..
   #..measurements
   a$group = y
   a$measurements = rownames(a)
   
   #add data frames to empty list
   clist[[i]] = a 
   
  }
  #convert list to dataframe
  b = ldply(clist)
  
  limit1 = ncol(b)-1
  limit2 = ncol(a)-2
  
  cPost = c(limit1:ncol(b))
  
  #re-position location of group and measurement columns
  b = b[,c(cPost,1:limit2)]
  
  #remove row names
  rownames(b) = NULL
  
  return(b)
}
sumFinal = deStats(final, "Region")

Most of the NA values that have been generated are a result of running summary stats on the character variables. You can find the data output at the end of this post.

I thought it might also be a good time to try out the explData() function I mentioned in my first post, and see how it fares with this tidy data. It keeps giving me an error, since I never put character-class filters in the function’s code. So I’d have to run the function on only the numeric columns, like so…

explData(final[,5:21])

…which gives me the following output:

explData() correlation plot

To elaborate on the plot once more, the blue boxes represent a positive relationship and the strength of that relationship is signified by the color’s intensity. The crossed out boxes simply mean that the relationship is not statistically significant, at an alpha level of 0.05.

I was also interested in checking whether I could derive the percentage changes for each variable from 1995 to 2005. The only problem is that not all countries have data for both 1995 and 2005, so the loops would inevitably create quite a number of NA values along the way. I’ve also noticed that I’m using a whole bunch of loops in my data cleaning and summarization; something I’m hoping to stop doing once I get the hang of the plyr and dplyr packages. Anyway, here’s the code for calculating the percentages, working on the final tidy data set that was created earlier.

#list of unique countries
uniCount = unique(final$Economy)

#create empty data frame to fill in later using loop
perDF = data.frame()


for(i in 1:length(uniCount)){
  
  x = final[final$Economy == uniCount[i] & final$year == "1995",]
  y = final[final$Economy == uniCount[i] & final$year == "2005",]
  
    for(j in 1:ncol(y)){       
      if(class(x[1,j]) != "character"){
        
        perDF[i,j] = (y[1,j] - x[1,j])/x[1,j]} else {
        
          if(j <= 4){perDF[i,j] = x[1,j]}else{
            perDF[i,j] = NA
          }      
        }
    }
}

for(i in nrow(perDF):1){
  if(sum(is.na(perDF[i,])) == ncol(perDF)){perDF = perDF[-i,]}
}

colnames(perDF) = make.names(colnames(final))
perDF = naAll.omit(perDF)
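
As an aside, here is roughly what the same percentage-change calculation might look like in dplyr for a single variable. The column names are assumed (and the values would need to be numeric), so treat this as a sketch rather than something I have run against the raw files:

#percentage change in Pasture.Land from 1995 to 2005, per country
perc_change = final %>%
  filter(year %in% c("1995", "2005")) %>%
  group_by(Economy) %>%
  filter(n() == 2) %>%     #keep only countries with both years
  summarise(Pasture.Land = (Pasture.Land[year == "2005"] - Pasture.Land[year == "1995"]) / 
              Pasture.Land[year == "1995"])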

Using the ever so wonderful ggplot2 package (and with the help of this awesome handbook), I’ve plotted the percentage change in each country’s Pasture Land wealth, color coded by the country’s Region. Below is the code and the output:

library(ggplot2)

plot = ggplot(perDF, aes(x = reorder(Economy,Pasture.Land), y = Pasture.Land, fill = Region)) + 
  geom_bar(stat = "identity")

plot + coord_flip() + labs(x = "Country", y = "% Change") + 
  theme(axis.text.y=element_text(face = "bold", color = "black"), 
        axis.text.x=element_text(face = "bold", color = "black")) + 
  ggtitle("Percentage Change in Pasture Land Wealth - 1995 to 2005") + 
  theme(plot.title = element_text(face = "bold", size=rel(1.5)))
 

Percentage Change in Pasture Land Wealth - 1995 to 2005

An interesting note on the above plot: countries in sub-Saharan Africa seem to have experienced the largest reduction in wealth derived from pasture land, while the opposite seems to hold for countries in Latin America and the Caribbean.

There’s actually still a whole lot more room for exploration in this dataset, but this is all I’m going to post, because the code for exploring the rest is pretty much the same.

If you’re interested in the data sets I’ve used and the outputs that resulted from the summarization, I’ve uploaded the files below:

Raw datasets:

data95, data20, data05

Outputs:

final, sumFinal, perDF


Summarizing Development Funds Data

Still going through the Getting and Cleaning Data course on Coursera, while also enrolled in the Data Manipulation Using dplyr course on DataCamp. With that in mind, I figured working on a teeny data exploration task with actual raw data would help me remember all these new functions.

The data I’ve used is the “full version” of the Research Release file compiled by AidData. The zip file was about 180MB in size, and about 600MB when extracted. Dimensions of the csv file were 44,210 rows by 99 columns. Data is from 1946 to 2013.

What I was curious to find out was which country contributed the most to each development/assistance category shown in the file, where Somalia was the recipient. The categories excluded were emergency relief funds and emergency food aid.

I couldn’t find out the units used in the funds commitment variable, so I’ve shown the data as they appear in the file. Having said that, and after skimming through the data a bit, I find it a little difficult to believe that these figures are not in units of 1,000 (if not 1,000,000). What I can say, however, is that the amounts are converted to USD and discounted to present-day dollars as of the day of the file’s publication. You can get more information about the data file from AidData’s website.

I started off by picking only the columns I needed, and then subsetted the data to keep only the rows where Somalia was a recipient. Some funds had no declared purpose, so the purpose of those funds was labeled “UNDISCLOSED” in the purpose column. This was done using a simple loop.

The second task, also done with a loop, was to create a list of data frames, each one containing the funds for a single donor country summed by purpose category.

The last loop row binds all the data frames contained in that list and assigns the result to a variable “e”, which is our final dataset. A few NA rows are naturally generated because of the UNDISCLOSED observations, but those are then removed from the final data frame.

Here’s the code I ran to come up with the final data frame. The code assumes the data has already been imported and named “aid”. You’ll have to excuse me, as the code is pretty sloppy…

library(dplyr)

#Exclude columns that are not needed..
aid_sub = select(aid, year, donor, donor_type, recipient, crs_purpose_name, commitment_amount_usd_constant)
#...filter data where the recipient is Somalia
aid_sub_som = subset(aid_sub, recipient == "Somalia")

#aid where the purpose is not available is labeled as UNDISCLOSED
x = aid_sub_som$crs_purpose_name
for(i in 1:length(x)){
if(x[i] == ""){x[i] = "UNDISCLOSED"}
}
aid_sub_som$crs_purpose_name = x

#(note: all columns are kept here, since the donor column is still needed for the grouping below)

#to create a list of dataframes that hold all the summarised data
uDonor = unique(aid_sub_som$donor)
a = list()
for(i in 1:length(uDonor)){
x = subset(aid_sub_som, donor == uDonor[i])
b = x$donor
y = summarise_each(group_by(x[,5:6], crs_purpose_name), funs(sum))
len = length(y)
b = b[len]
y = cbind(b, y)
y[,1] = as.character(y[,1])
a[[i]] = y
}

#to deconstruct the list of dataframes and bind them into one dataframe
e = data.frame()
for(i in 1:length(a)){
e = rbind(e, data.frame(a[[i]]))
}
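
For completeness, the whole summary could probably be produced with a single dplyr chain instead of the loops above; a sketch, using the column names from the select() call earlier:

#one-chain version: total commitments per donor and purpose
e_alt = aid_sub_som %>%
  group_by(donor, crs_purpose_name) %>%
  summarise(total = sum(commitment_amount_usd_constant)) %>%
  data.frame()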

…and here’s the final output after having plotted it on Tableau.

Somalia Development Funds Data

Notable observations are:

  1. The United States and Italy seem to be in the forefront in the area of agricultural development, food crop production, and livestock related assistance.
  2. With almost $4M in funds, Sweden has provided the most in basic healthcare and, along with the US, also in Civilian peace building and conflict resolution with $2.8M.
  3. Among the donor countries who have refugees situated with them, Finland has contributed the most, spending almost $5.5M over the period.
  4. Japan and the Netherlands seem to have assisted highest in the category of Relief coordination and support services.
  5. It is interesting to see that Norway has focused its assistance mainly on public sector policy and administration ($2.3M) and basic health infrastructure ($1M), while Canada has contributed to basic drinking water supply and sanitation ($1.2M).

There is one very notable exception: the United Arab Emirates, who supplied somewhere in the range of $10M in funds. However, that aid was provided in the 1980s and was not categorized, so it was removed from the table for the sake of the analysis.


R Functions: Removing NA Columns

I realize that this might sound a little nerdy, but making simple and useful R functions can be quite fun.

Here are a couple of simple R functions I occasionally use that remove columns with NAs.

This function would remove all columns from a data frame that contain any NA values:

AnyNaCol = function(x) {
limit = ncol(x)
#walk the columns from right to left so deletions don't shift the index
for(i in limit:1){
if(anyNA(x[,i])) {x[,i] = NULL}
}
return(x)
}

And here’s one that would only remove those columns where all the values in the column are NA:

AllNaCol = function(x) {
limit = ncol(x)
#a column is dropped only if every value in it is NA
for(i in limit:1){
if(sum(is.na(x[,i])) == nrow(x)) {x[,i] = NULL}
}
return(x)
}
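
A quick, made-up example of both in action:

#a toy data frame: X2 has one NA, X3 is entirely NA
df = data.frame(X1 = 1:3, X2 = c(1, NA, 3), X3 = c(NA, NA, NA))

names(AnyNaCol(df)) #returns "X1"
names(AllNaCol(df)) #returns "X1" "X2"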

I usually put all kinds of comments on my code, but I thought these were pretty straightforward.


Scraping IMDB’s Search Results

I’ve been learning how to scrape web pages and, given that I’m still a complete programming newbie, it took some getting used to. Credit to a certain Lee Hawthorn on how to do this.

I used SelectorGadget, as suggested in Hawthorn’s article, and noticed that it has a Chrome extension, which made the whole process even easier! I’m not even certain I did this right, but it seems to give me the results I want, which, when you’re starting out, is all that seems to matter.

Whenever you search for a word or phrase in IMDB’s search bar and refine the search by movie titles, you get a maximum of 200 results. I’m still not sure how I can get more than 200, but for now, I guess this’ll have to do.

I created the function searchIMDB with the following script:

searchIMDB = function(searchTerm){

library(dplyr)
library(rvest)
library(stringr)

fullURL = paste("http://www.imdb.com/find?q=", searchTerm, "&s=tt&ttype=ft&ref_=fn_ft", sep = "")

page = html(fullURL)

movieTitles = page %>% 
  html_nodes(".result_text") %>%
  html_text()

movieTitles.df = na.omit(tbl_df(data.frame(movieTitles)))

#trim stray whitespace from the titles before returning the data frame
mutate(movieTitles.df, movieTitles = str_trim(movieTitles))
}

And so running the script…

head(searchIMDB("crazy"))

gives me the following list of movie titles:

Source: local data frame [6 x 1]

movieTitles
1 C.R.A.Z.Y. (2005) aka “Crazy”
2 Crazy (I) (2000)
3 Crazy (II) (2008)
4 Crazy, Stupid, Love. (2011)
5 The Crazies (2010)
6 Like Crazy (2011)
