An Attempt at Web Scraping: Cyberjaya Rental Rates

I came across this short tutorial on how to use the rvest package to scrape information from websites. It looked pretty straightforward, although it took a while to get the hang of some of the HTML jargon. So I figured I should take a shot at scraping myself.

I picked local site iBilik.my as a good example. From the website, iBilik is:

…Malaysia’s largest and No.1 Room / Homestay / Short Term Rental website, with over 100,000 listings posted online all across Malaysia, Kuala Lumpur, Ampang, Bangsar, Cheras, Setapak, Damansara, Petaling Jaya, Subang Jaya, and Penang.

The plan was to somehow retrieve all the rental rates for postings that were put up in Cyberjaya. I noticed that there are 25 postings shown on each search page, so having R go through 250 search pages should give me 6250 postings.

Now in my naiveté, I thought that the whole scraping of raw data would take less than a minute. Add the data cleaning, and maybe another 20 minutes.

It took me 3 days.

This was partly because of work, and partly because my computer kept crashing (six times) under all the memory the script was using; and that was with no other apps open.

I also think it would have taken less time if I had figured out how to get the rental rates directly from the search pages, rather than having R navigate to each posting page and only then grab the price.
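As a sketch of that idea: if the search results table exposed the price in its own node, something like the snippet below would get all 25 prices in a single request. The `.price` selector here is an assumption for illustration only; I never confirmed that the site's markup actually has one.

```r
library(rvest)

# Hypothetical: read one search page and pull the prices directly.
# NOTE: ".price" is an assumed selector, not verified against ibilik.my.
searchPage = read_html("http://www.ibilik.my/rooms/cyberjaya?page=1")

searchPage %>%
  html_nodes("table.room_list .price") %>%
  html_text() -> prices
```

One request per search page instead of 26 (the search page plus 25 posting pages) would have cut the runtime by an order of magnitude.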

Anyway, since I'm still trying to remember all the stuff I forgot in R, I decided to loop through the pages and postings. The plan was to first extract the information on the first search page, then extract the information from pages 2 to 250 one by one, rbind-ing each to the extraction from page 1. I originally intended to extract from 4000 pages, but my computer kept dying. Then it went down to 2000, then 1000, then 500, and eventually 250.
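A crash-tolerant variant I could have tried (a sketch, not what I actually ran): append each page's rows to a CSV on disk as the loop goes, so a crash only costs the current page instead of everything scraped so far. The file name is made up; the selectors match the script further down.

```r
library(rvest)

siteLoc = "http://www.ibilik.my/rooms/cyberjaya?page="

for(i in 1:250){
  pageHTML = read_html(paste(siteLoc, i, sep = ""))

  titles = pageHTML %>%
    html_nodes("table.room_list .title a") %>%
    html_text()

  # First page writes the header row; later pages just append rows
  write.table(data.frame(Title = titles), "titles.csv", sep = ",",
              row.names = FALSE, col.names = (i == 1), append = (i > 1))
}
```

Growing a data frame with rbind inside a loop also copies the whole thing every iteration, which likely contributed to the memory problems; writing to disk sidesteps that entirely.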

The information extracted is the posting description put up by the user, the location (in this case, always Cyberjaya), and the link to the page showing the post's room details (rental, size, room type, etc.).

After that, all that information is cbind-ed, and the resulting dataframe is used to loop through all the post detail links and extract the price, along with the date the post was put up.

Just so you know, having the code finish up to this point took FOREVAAAAAAR! I’m talking HOURS.

This brings us to the issue of duplicated posts on iBilik. The site has no limit on how many times a user can post the same thing over and over again, so duplicates had to be removed. The criterion I used to identify duplicates is the description of the post. I figured that if the same description appears more than once, that's as good a criterion as any…especially for this site.
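As a toy example of that criterion (the descriptions are made up, just to show the mechanics of `duplicated()`):

```r
posts = data.frame(Title = c("Master Room, Cyberjaya",
                             "Small Room near MMU",
                             "Master Room, Cyberjaya"),
                   stringsAsFactors = FALSE)

# duplicated() flags the second and later occurrences of each description,
# so negating it keeps only the first copy of every post
posts[!duplicated(posts$Title), ]   # keeps rows 1 and 2
```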

What pissed me off was that, of the 6250 posts, only 1059 were unique. Over 80% of all posts were duplicates! Then again, maybe I shouldn't be so surprised, considering the 6250 postings only spanned the 4th of October to the 10th of November…in Cyberjaya alone. In any case, I looked up all postings that mention the words “middle”, “master”, or “small”, and then plotted the histograms.

It was interesting to see that quite a number of the master room postings fell in the range of 450 to 500 per month, while the middle rooms mostly ranged from 450 to 750. Perhaps this gap is explained by the fact that most master room posts are for sharing between two or more students, and by middle rooms being more available here in Cyberjaya. I would like to think it's also because the small rooms are always snapped up by students, so you won't usually find a lot of postings for them; and the master rooms are picked up by people who work at one of the many companies here.

Below is the R script I used. You can also find a copy of the complete file with the links, rentals, and dates of all the postings, as well as links to the (poorly constructed) histograms of each room category.

# rvest does all the scraping; everything else below is base R
library(rvest)

# Base URL, and the search-page URL that the page number gets appended to
site = "http://www.ibilik.my"

siteLoc = "http://www.ibilik.my/rooms/cyberjaya?page="

# Read the first search page (html() is deprecated; read_html() replaces it)
siteLocN = paste(siteLoc, as.character(1), sep = "")
siteLocHTML = read_html(siteLocN)

# Locations of the 25 postings on the page
siteLocHTML %>% html_nodes("table.room_list") %>% 
  html_nodes(".title") %>% 
  html_nodes(".location") %>% 
  html_text() %>% 
  data.frame() -> x

# Posting descriptions
siteLocHTML %>% html_nodes("table.room_list") %>% 
  html_nodes(".title") %>%
  html_nodes("a") %>%
  html_text() %>% 
  data.frame() -> y

# Links to each posting's detail page
siteLocHTML %>% html_nodes("table.room_list") %>%
  html_nodes("a") %>% 
  html_attr("href") %>%
  data.frame() -> z


# Pages 2 to 250: extract the same three fields and append them
for(i in 2:250){
  siteLocN = paste(siteLoc, as.character(i), sep = "")
  siteLocHTML = read_html(siteLocN)

  siteLocHTML %>% html_nodes("table.room_list") %>% 
    html_nodes(".title") %>% 
    html_nodes(".location") %>% 
    html_text() %>% 
    data.frame() -> x_Next

  x = rbind(x, x_Next)

  siteLocHTML %>% html_nodes("table.room_list") %>% 
    html_nodes(".title") %>%
    html_nodes("a") %>%
    html_text() %>% 
    data.frame() -> y_Next

  y = rbind(y, y_Next)

  siteLocHTML %>% html_nodes("table.room_list") %>%
    html_nodes("a") %>% 
    html_attr("href") %>%
    data.frame() -> z_Next

  z = rbind(z, z_Next)
}


# Combine descriptions, locations, and links into one dataframe and save it
complete = cbind(y, x, z)
names(complete) = c("Title", "Location", "Link")

write.csv(complete, "complete.csv", row.names = FALSE)

rm(x_Next, y_Next, z_Next, x, y)


#Prices and dates

dummy = c()
dummy_date = c()


for(i in 1:nrow(z)){
  link = paste(site, z[i,1], sep = "")

  # Read each posting page once, instead of once per field
  postHTML = read_html(link)

  # The price sits in the second <p> of the extras wrapper
  dummy[i] = postHTML %>% 
    html_nodes(".extras_wrapper p:nth-child(2)") %>% 
    html_text()

  # The posted date is the last 11 characters of the .stamp text;
  # if the node is missing, record NA instead of erroring out
  fullString = postHTML %>% html_nodes(".stamp") %>% html_text()

  if(length(fullString) == 0){
    dummy_date[i] = NA
  } else {
    dummy_date[i] = substring(fullString, nchar(fullString)-10, nchar(fullString))
  }
}


complete$price  = dummy
complete$date = dummy_date

rm(dummy, dummy_date)

#Clean up and plots

# Remove duplicated posts, using the description as the criterion
filter = duplicated(complete[,1])
complete_fil = complete[!filter,]

# Strip the "RM " prefix and thousands commas, then convert prices to integers
complete_fil[,"price"] = gsub("RM ", "", complete_fil[,"price"]) 
complete_fil[,"price"] = gsub(",", "", complete_fil[,"price"])

complete_fil[,"price"] = as.integer(complete_fil[,"price"])

rownames(complete_fil) = NULL

# Parse the scraped date strings (e.g. "04-Nov 2015") into Date objects
new_dates = as.character(strptime(complete_fil[,"date"], format = "%d-%b %Y"))

complete_fil[,"date"] = as.Date(new_dates)

# Drop rows where the price or date could not be parsed
complete_fil = na.omit(complete_fil)

# Subset by room type. grepl() with ignore.case covers both capitalisations,
# and negating it avoids the x[-integer(0), ] trap that -grep() falls into
# when there are no matches (which would silently drop every row).
Middle = complete_fil[grepl("middle", complete_fil[,1], ignore.case = TRUE),]
Middle = Middle[!grepl("master", Middle[,1], ignore.case = TRUE),]
Middle = Middle[!grepl("small", Middle[,1], ignore.case = TRUE),]
rownames(Middle) = NULL
mean(Middle[,"price"])


Master = complete_fil[grepl("master", complete_fil[,1], ignore.case = TRUE),]
Master = Master[!grepl("middle", Master[,1], ignore.case = TRUE),]
rownames(Master) = NULL
mean(Master[,"price"])


Small = complete_fil[grepl("small", complete_fil[,1], ignore.case = TRUE),]
Small = Small[!grepl("middle", Small[,1], ignore.case = TRUE),]
rownames(Small) = NULL
mean(Small[,"price"])

summary(Master[,"price"])
summary(Middle[,"price"])
summary(Small[,"price"])


# Plot the three rental distributions side by side
par(mfrow = c(1,3))

hist(Master[,"price"], breaks = 15, main = "Rental Distribution of Master Rooms", 
     xlab = "Rental per Month", col = "grey")

hist(Middle[,"price"], breaks = 20, main = "Rental Distribution of Middle Rooms", 
     xlab = "Rental per Month", col = "grey")

hist(Small[,"price"], breaks = 15, main = "Rental Distribution of Small Rooms", 
     xlab = "Rental per Month", col = "grey")

scraped_data, histograms
